Good question, let me try to answer it.
On how the largest size was decided:
According to OpenAI's scaling laws [1], model performance is determined by three factors: compute, model parameter count, and data size.
For organizations like Meta and OpenAI, compute is clearly not the bottleneck (they have the money and the GPUs), and neither is parameter count (Google's PaLM already went up to 540B), so the constraint is most likely the data.
In other words, per the scaling laws, at today's data scale the largest model that can be adequately trained is somewhere around 100B parameters.
For more background on scaling laws, see this article: nghuyong: 解析大模型中的Scaling Law.
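To make the 100B figure concrete, here is a minimal back-of-the-envelope sketch. The ~20 training-tokens-per-parameter ratio is the compute-optimal rule of thumb from Hoffmann et al. [3]; the 2T-token data budget is an assumed figure for illustration only, not a number from the answer above.

```python
# Back-of-the-envelope sketch, not the exact fit from either paper.
# Assumptions: ~20 training tokens per parameter (Chinchilla-style rule of
# thumb from [3]); available_tokens is an illustrative guess at the usable
# high-quality data, not a reported figure.

TOKENS_PER_PARAM = 20        # compute-optimal ratio, roughly, from [3]
available_tokens = 2e12      # assumed usable high-quality tokens (~2T)

# Largest model this much data can train "adequately"
max_params = available_tokens / TOKENS_PER_PARAM
print(f"Compute-optimal model size for {available_tokens:.0e} tokens: "
      f"{max_params / 1e9:.0f}B parameters")   # -> 100B
```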
On how the small and medium sizes were decided:
These sizes are driven mainly by the inference scenario rather than by the scaling laws. Quoting the LLaMA [2] paper:
The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.
In other words, the Chinchilla scaling law [3] does not account for the inference scenario. In practice, once inference cost is considered, smaller models have the edge: a smaller model keeps improving when trained on more data, and it is cheaper to run at inference time. So rather than training a 10B model on 200B tokens, it is better to train a smaller model (7B) on much more data (1T tokens), even if the total training compute is higher, because serving it is cheaper.
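A rough way to see the trade-off is to compare training cost against per-token inference cost using the usual approximations (training ≈ 6ND FLOPs, inference ≈ 2N FLOPs per token). These constants are rule-of-thumb assumptions, not numbers taken from the LLaMA paper:

```python
# Rough sketch of the training-vs-inference trade-off behind this choice.
# Standard approximations (assumptions, not figures from the LLaMA paper):
#   training compute  ~ 6 * N * D  FLOPs (N params, D training tokens)
#   inference compute ~ 2 * N      FLOPs per generated token

def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def infer_flops_per_token(n_params):
    return 2 * n_params

configs = {
    "10B @ 200B tokens": (10e9, 200e9),  # Hoffmann et al. recipe
    "7B  @ 1T   tokens": (7e9, 1e12),    # LLaMA recipe
}

for name, (n, d) in configs.items():
    print(f"{name}: train {train_flops(n, d):.1e} FLOPs, "
          f"inference {infer_flops_per_token(n):.1e} FLOPs/token")

# The 7B run costs more to train (~4.2e22 vs ~1.2e22 FLOPs) but roughly 30%
# less per generated token, which dominates once the model is served at scale.
```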
As for why specifically 7B and 13B, my guess is that it has to do with the VRAM of the mainstream GPUs.
The current workhorse inference card, the T4 (16 GB of VRAM), can just fit a 7B model (fp16 weights take about 14 GB), while a V100 (32 GB) can fit a 13B model (fp16, about 26 GB).
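A quick sanity check of those numbers, counting fp16 weights only (2 bytes per parameter; KV cache and activations would add more on top, so these are lower bounds):

```python
# Weight-memory check for fp16 inference (weights only; KV cache and
# activations are ignored, so real usage is somewhat higher).

BYTES_PER_PARAM_FP16 = 2

gpus = {"T4": 16, "V100": 32}        # GB of VRAM
models = {"7B": 7e9, "13B": 13e9}    # parameter counts

for name, n_params in models.items():
    weights_gb = n_params * BYTES_PER_PARAM_FP16 / 1e9
    fits_on = [gpu for gpu, vram in gpus.items() if weights_gb < vram]
    print(f"{name}: {weights_gb:.0f} GB of fp16 weights, fits on {fits_on}")
# -> 7B: 14 GB (fits T4 and V100); 13B: 26 GB (fits V100 only)
```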
References:
[1] Scaling Laws for Autoregressive Generative Modeling
[2] LLaMA: Open and Efficient Foundation Language Models
[3] Training Compute-Optimal Large Language Models