Replies: 38 comments 1 reply
-
Right. Is converting it to FasterTransformer supported?
-
INT4 quantization should make it a bit faster, and shortening the input and output lengths should help too.
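A minimal sketch of the INT4 route, following the usage documented in the ChatGLM-6B repo; passing `max_length` to `chat` is just one way to cap output length:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
# Load the weights in FP16, then quantize them to INT4 on the fly
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().quantize(4).cuda()
model = model.eval()

# Shorter outputs also cut latency: cap the total sequence length
response, history = model.chat(tokenizer, "你好", history=[], max_length=512)
print(response)
```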
-
Same question here. On a single 1080, asking one question about travel destinations takes about 50 s before a result comes back.
-
@dogvane With a 6B-parameter model and such a large context window, it's bound to be slow; managing to run it at all is already something. Think of it this way: 4-5 characters per second is acceptable, roughly typing speed.
-
Curious why ChatGPT can be so fast. Are there any key techniques behind it?
-
Speed bought with tens of billions of dollars.
-
@huangtao36 I'm currently running FP16 on a 4090 and the speed is actually fine; here's a benchmark for comparison. For a 6B-parameter LLM, the 1080 simply has too little VRAM. @dogvane
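For anyone who wants a comparable number, a rough single-query timing sketch using the standard `chat` interface; the tokens/s figure here lumps prompt processing and decoding together, so treat it as a ballpark:

```python
import time
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

query = "推荐几个适合春天的旅游地点"  # sample travel question
start = time.time()
response, _ = model.chat(tokenizer, query, history=[])
elapsed = time.time() - start

# Approximate generated-token count by re-tokenizing the response
n_tokens = len(tokenizer(response)["input_ids"])
print(f"{elapsed:.1f} s, ~{n_tokens / elapsed:.1f} tokens/s")
```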
-
So what really affects the perceived speed is that ChatGPT streams its output...
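The same trick works locally: `stream_chat` is the streaming interface documented in the ChatGLM-6B repo, so characters appear as they are generated instead of after the full reply is done:

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

printed = 0
for response, history in model.stream_chat(tokenizer, "你好", history=[]):
    # response grows incrementally; print only the newly generated suffix
    print(response[printed:], end="", flush=True)
    printed = len(response)
print()
```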
-
Surely nobody in the industry expects a single card running a 6B model to produce hundreds of characters in 0.1 seconds?
-
Looking at the config, the vocab has about 150k entries, which is huge. I once worked on a large-scale classification project with BERT, and even with just a few thousand output classes, computing per-class probabilities plus the final softmax was a real time and compute sink. For a Chinese-only scenario, you could compress the vocabulary down toward a BERT-style one and rewrite the first token-embedding layer (drop the unneeded token embeddings and adjust the vocabulary accordingly). That should shave off some latency.
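A hypothetical sketch of that embedding-pruning idea in generic PyTorch, shown on a BERT checkpoint since that's the setting described above; `needed_token_ids` is a placeholder for whatever subset you decide to keep, and the tokenizer and output head would have to be remapped consistently (the follow-up comment below explains why this may not carry over to ChatGLM):

```python
import torch
from transformers import BertModel

model = BertModel.from_pretrained("bert-base-chinese")
needed_token_ids = {0, 100, 101, 102, 103}   # placeholder: the ids you actually keep
keep_ids = sorted(needed_token_ids)

old_emb = model.get_input_embeddings()        # weight shape [old_vocab, hidden]
new_emb = torch.nn.Embedding(len(keep_ids), old_emb.embedding_dim)
new_emb.weight.data = old_emb.weight.data[keep_ids].clone()
model.set_input_embeddings(new_emb)

# Any output head (LM / classification over the vocab) must be sliced the
# same way, and the tokenizer remapped old_id -> new_id, or indices break.
```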
-
OK, I looked into it today: under the hood it doesn't split into characters like the old BERT/GPT tokenizers, it does genuine word-level segmentation, so that approach won't work.
-
I'm running inference on CPU, an Intel chip on macOS.
-
Curious how long an Intel Mac takes to run one sample.
-
On an Apple M1, even a single "你好" (hello) is this slow.
-
Would stacking more GPUs make it faster? Speed really is the critical thing here.
-
Hi, all. If you need better speed and throughput, you can try https://huggingface.co/TMElyralab/lyraChatGLM . It's an accelerated build of ChatGLM-6B with a packaged Python API, compatible with A100, V100, A10, A30 and similar GPUs.
-
Quick question: how do you increase the inference batch size?
-
This one seems to work only with the original weights; it doesn't support loading your own fine-tuned model.
-
Right, and it only supports NVIDIA cards; I want to deploy on AMD. Could you share your source code and method for increasing the inference batch_size? Thanks.
-
Take a look at the pull request I opened: it uses batched inference and supports high concurrency. In my own tests, 1,000 samples come back in about 30 seconds. #1244
-
May I ask what batch size you use, and roughly how much VRAM it needs?
-
The default in my pull request is 100; in my tests it uses about 40 GB of VRAM. You can tune it for your own data and GPU.
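For illustration only (this is not the code from #1244), a minimal batched-generation sketch using the generic Hugging Face `generate` API; whether the ChatGLM tokenizer pads batches cleanly depends on the checkpoint version, and the feasible batch size depends on your data and GPU, as noted above:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True)
model = AutoModel.from_pretrained("THUDM/chatglm-6b", trust_remote_code=True).half().cuda().eval()

queries = ["你好", "介绍一下北京", "写一首关于春天的诗"]   # one batch of prompts
# Pad all prompts to the same length so they run as a single forward pass
inputs = tokenizer(queries, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256)

for query, out in zip(queries, outputs):
    print(query, "->", tokenizer.decode(out, skip_special_tokens=True))
```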
-
Question for the experts: running chatglm-6b on my 4090, why does GPU utilization stay at only 20%? How can I push it higher?
-
On a 4090 the GPU sits at only 30% during inference, one CPU core is at 100%, and the other 63 cores are just watching... Unbelievable. No idea how to speed it up.
-
We've released the new ChatGLM2-6B, which has a big improvement in inference speed; give it a try.
-
Already tried it. There's an improvement, but the GPU still sits at only 30%; I'd like it to run at 100%.
-
That's probably not possible: for this model the time seems to go mostly to memory access rather than compute, and there isn't much room to parallelize it further.
-
Pick the precision you want to load, then check how your GPU's floating-point throughput looks at that precision.