
Why are LLM sizes these days all designed around a few tiers like 6/7B, 13B, and 130B?


Good question, let me take a shot at answering it.

How the largest size is determined:

According to OpenAI's scaling law[1], model performance is determined by three factors: compute, the number of model parameters, and the amount of training data.

For organizations like Meta and OpenAI, compute is certainly not the bottleneck (they have the money and the GPUs), and parameter count is not the bottleneck either (Google's PaLM was pushed all the way to 540B), so the constraint is most likely the data.

In other words, under the scaling law, the largest model that can be adequately trained on the data available today is roughly 100B parameters.

For more on scaling laws, see this article: nghuyong: 解析大模型中的Scaling Law (an explanation of scaling laws in large models).
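To make the data constraint concrete, here is a back-of-the-envelope sketch. The 20-tokens-per-parameter ratio is the Chinchilla rule of thumb, and the ~2T-token data budget is purely my own illustrative assumption, not a figure from anywhere above:

```python
# Back-of-the-envelope sketch: how much model the data can "feed".
# Assumptions (mine, for illustration): ~20 training tokens per parameter
# (Chinchilla rule of thumb) and ~2T tokens of usable training data.

TOKENS_PER_PARAM = 20
available_tokens = 2e12

compute_optimal_params = available_tokens / TOKENS_PER_PARAM
print(f"adequately trainable model size: ~{compute_optimal_params / 1e9:.0f}B parameters")
# -> ~100B, in line with the "around 100B" estimate above
```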

How the small and medium sizes are determined:

These sizes are driven mainly by the inference scenario rather than by the scaling law. Quoting from the LLaMA[2] paper:

The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

The Chinchilla scaling law[3] does not take the inference scenario into account. In real applications, once inference cost is factored in, a somewhat smaller model has the edge: trained on more data, its performance keeps improving, and it remains cheaper to serve. So rather than training a 10B model on 200B tokens as Chinchilla recommends, it pays to train a smaller model (7B) on far more data (1T tokens), even if that costs more training compute, because at scale the inference savings dominate.
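To put rough numbers on that trade-off, here is a minimal sketch using the common approximations of about 6ND FLOPs for training and 2N FLOPs per generated token at inference (N = parameters, D = training tokens); the figures are only illustrative:

```python
# Rough FLOPs comparison of the two recipes mentioned above.
# Common approximations: training ~ 6*N*D FLOPs, inference ~ 2*N FLOPs per token
# (N = parameter count, D = number of training tokens).

def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def infer_flops_per_token(n_params):
    return 2 * n_params

chinchilla_10b = train_flops(10e9, 200e9)  # Chinchilla-optimal recipe
llama_7b = train_flops(7e9, 1e12)          # LLaMA-style: smaller model, more tokens

print(f"10B @ 200B tokens: {chinchilla_10b:.1e} training FLOPs, "
      f"{infer_flops_per_token(10e9):.1e} FLOPs per generated token")
print(f" 7B @ 1T tokens:   {llama_7b:.1e} training FLOPs, "
      f"{infer_flops_per_token(7e9):.1e} FLOPs per generated token")
# The 7B run costs ~3.5x more to train, but every served token is ~30% cheaper,
# which dominates once the model handles large-scale traffic.
```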

As for why the specific sizes are 7B and 13B, my guess is that it comes down to the VRAM capacity of today's mainstream GPUs.

The T4 (16 GB of VRAM), currently the workhorse inference card, can just fit a 7B model (fp16 weights take about 14 GB), while a V100 (32 GB of VRAM) can fit a 13B model (fp16 weights take about 26 GB).
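The arithmetic behind this is just parameter count × 2 bytes for fp16 weights; a minimal sketch (weights only, ignoring activations, KV cache, and framework overhead):

```python
# Weight memory in fp16: 2 bytes per parameter (weights only; activations,
# KV cache and framework overhead are ignored in this sketch).

def fp16_weight_gb(params_billion):
    return params_billion * 1e9 * 2 / 1e9  # parameters * 2 bytes, in GB

for size_b, card, vram_gb in [(7, "T4", 16), (13, "V100", 32)]:
    need = fp16_weight_gb(size_b)
    fits = "fits" if need < vram_gb else "does not fit"
    print(f"{size_b}B model: ~{need:.0f} GB of fp16 weights vs {vram_gb} GB on a {card} -> {fits}")
# 7B  -> ~14 GB, fits on a 16 GB T4
# 13B -> ~26 GB, fits on a 32 GB V100
```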

References:

[1] Scaling Laws for Autoregressive Generative Modeling

[2] LLaMA: Open and Efficient Foundation Language Models

[3] Training Compute-Optimal Large Language Models