Good question, let me try to answer it.
On how the largest size was decided:
According to OpenAI's scaling laws [1], model performance is determined by three factors: compute, model parameter count, and data size.
For organizations like Meta and OpenAI, compute is clearly not the bottleneck (they have the money and the GPUs), and neither is parameter count (Google's PaLM already went up to 540B), so the constraint is most likely the data.
In other words, per the scaling laws, at today's data scale the largest model that can be adequately trained is somewhere around 100B parameters.
For more background on scaling laws, see this article: nghuyong: 解析大模型中的Scaling Law.
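To make the 100B figure concrete, here is a minimal back-of-the-envelope sketch. The ~20 training-tokens-per-parameter ratio is the compute-optimal rule of thumb from Hoffmann et al. [3]; the 2T-token data budget is an assumed figure for illustration only, not a number from the answer above.

```python
# Back-of-the-envelope sketch, not the exact fit from either paper.
# Assumptions: ~20 training tokens per parameter (Chinchilla-style rule of
# thumb from [3]); available_tokens is an illustrative guess at the usable
# high-quality data, not a reported figure.

TOKENS_PER_PARAM = 20        # compute-optimal ratio, roughly, from [3]
available_tokens = 2e12      # assumed usable high-quality tokens (~2T)

# Largest model this much data can train "adequately"
max_params = available_tokens / TOKENS_PER_PARAM
print(f"Compute-optimal model size for {available_tokens:.0e} tokens: "
      f"{max_params / 1e9:.0f}B parameters")   # -> 100B
```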
On how the small and medium sizes were decided:
These sizes are driven mainly by the inference scenario rather than by the scaling laws. Quoting the LLaMA [2] paper:
The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.
In other words, the Chinchilla scaling law [3] does not account for the inference scenario. In practice, once inference cost is considered, smaller models have the edge: a smaller model keeps improving when trained on more data, and it is cheaper to run at inference time. So rather than training a 10B model on 200B tokens, it is better to train a smaller model (7B) on much more data (1T tokens), even if the total training compute is higher, because serving it is cheaper.
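A rough way to see the trade-off is to compare training cost against per-token inference cost using the usual approximations (training ≈ 6ND FLOPs, inference ≈ 2N FLOPs per token). These constants are rule-of-thumb assumptions, not numbers taken from the LLaMA paper:

```python
# Rough sketch of the training-vs-inference trade-off behind this choice.
# Standard approximations (assumptions, not figures from the LLaMA paper):
#   training compute  ~ 6 * N * D  FLOPs (N params, D training tokens)
#   inference compute ~ 2 * N      FLOPs per generated token

def train_flops(n_params, n_tokens):
    return 6 * n_params * n_tokens

def infer_flops_per_token(n_params):
    return 2 * n_params

configs = {
    "10B @ 200B tokens": (10e9, 200e9),  # Hoffmann et al. recipe
    "7B  @ 1T   tokens": (7e9, 1e12),    # LLaMA recipe
}

for name, (n, d) in configs.items():
    print(f"{name}: train {train_flops(n, d):.1e} FLOPs, "
          f"inference {infer_flops_per_token(n):.1e} FLOPs/token")

# The 7B run costs more to train (~4.2e22 vs ~1.2e22 FLOPs) but roughly 30%
# less per generated token, which dominates once the model is served at scale.
```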
As for why specifically 7B and 13B, my guess is that it has to do with the VRAM of the mainstream GPUs.
The current workhorse inference card, the T4 (16 GB of VRAM), can just fit a 7B model (fp16 weights take about 14 GB), while a V100 (32 GB) can fit a 13B model (fp16, about 26 GB).
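A quick sanity check of those numbers, counting fp16 weights only (2 bytes per parameter; KV cache and activations would add more on top, so these are lower bounds):

```python
# Weight-memory check for fp16 inference (weights only; KV cache and
# activations are ignored, so real usage is somewhat higher).

BYTES_PER_PARAM_FP16 = 2

gpus = {"T4": 16, "V100": 32}        # GB of VRAM
models = {"7B": 7e9, "13B": 13e9}    # parameter counts

for name, n_params in models.items():
    weights_gb = n_params * BYTES_PER_PARAM_FP16 / 1e9
    fits_on = [gpu for gpu, vram in gpus.items() if weights_gb < vram]
    print(f"{name}: {weights_gb:.0f} GB of fp16 weights, fits on {fits_on}")
# -> 7B: 14 GB (fits T4 and V100); 13B: 26 GB (fits V100 only)
```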
References:
[1] Scaling Laws for Autoregressive Generative Modeling
[2] LLaMA: Open and Efficient Foundation Language Models
[3] Training Compute-Optimal Large Language Models