
Why are today's LLMs all designed at sizes like 6/7B, 13B, and 130B?

2023-10-22

Good question, let me take a stab at it.

On how the largest size is determined:

According to OpenAI's Scaling Law [1], a model's performance is determined by three things: compute, the number of model parameters, and the amount of data.

For organizations like Meta and OpenAI, compute is certainly not a problem (they have the money and the GPUs), and parameter count is not a problem either (Google's PaLM was pushed to 540B), so the limiting factor is most likely the data.

In other words, by the Scaling Law, with the data currently available, the largest model that can be adequately trained is roughly 100B parameters.

For more background on the Scaling Law, see this article: nghuyong: 解析大模型中的Scaling Law
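As a rough illustration of how data caps model size, here is a minimal sketch using two commonly cited approximations from the Chinchilla line of work: training compute of roughly 6·N·D FLOPs and about 20 training tokens per parameter. The 2T-token data budget below is purely an illustrative assumption, not a figure from the answer above.

```python
# Sketch: how the available data constrains the largest "adequately trained" model.
# Assumptions (rules of thumb, not exact figures from any paper):
#   - compute-optimal training uses ~20 tokens per parameter (Chinchilla-style)
#   - training compute C ~= 6 * N * D FLOPs

def max_adequately_trained_params(available_tokens: float,
                                  tokens_per_param: float = 20.0) -> float:
    """Largest model (in parameters) the data budget can train adequately."""
    return available_tokens / tokens_per_param

def training_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute: C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens

if __name__ == "__main__":
    tokens = 2e12  # hypothetical: ~2T curated tokens available
    n = max_adequately_trained_params(tokens)
    print(f"~{n / 1e9:.0f}B parameters can be trained adequately")       # ~100B
    print(f"at a cost of roughly {training_flops(n, tokens):.2e} FLOPs")
```

Under these assumptions, a data pile on the order of 2T tokens supports a model of roughly 100B parameters, which lines up with the ballpark given above.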

On how the small and medium sizes are determined:

These sizes are determined mainly by the inference scenario rather than by the Scaling Law. Quoting from the LLaMA report [2]:

The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

The Chinchilla Scaling Law [3] does not take the inference scenario into account. In practice, once inference cost is considered, a somewhat smaller model has the advantage: trained on more data, the smaller model's performance keeps improving, and its inference cost stays lower. So for the same target performance, it is better to use more data (1T tokens rather than 200B) to train a smaller model (7B rather than 10B).
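A quick back-of-the-envelope comparison of the two configurations quoted above. The token and parameter counts come from the LLaMA passage; the constants (about 6·N·D FLOPs for training and 2·N FLOPs per generated token for inference) are common rules of thumb, not figures from the paper.

```python
# Compare the Chinchilla recommendation (10B on 200B tokens) with LLaMA's
# choice (7B on 1T tokens) in terms of rough training and inference cost.

def train_flops(n_params: float, n_tokens: float) -> float:
    return 6.0 * n_params * n_tokens          # C_train ~= 6 * N * D

def infer_flops_per_token(n_params: float) -> float:
    return 2.0 * n_params                     # C_infer ~= 2 * N per token

configs = {
    "10B / 200B tokens (Hoffmann et al.)": (10e9, 200e9),
    "7B  / 1T tokens   (LLaMA)":           (7e9, 1e12),
}

for name, (n, d) in configs.items():
    print(f"{name}: train ~{train_flops(n, d):.1e} FLOPs, "
          f"inference ~{infer_flops_per_token(n):.1e} FLOPs/token")
```

The 7B run costs more to train overall (~4.2e22 vs ~1.2e22 FLOPs) but about 30% less per generated token when serving, which is exactly the trade-off the LLaMA report argues for.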

As for why 7B and 13B specifically, my guess is that it comes down to the VRAM capacity of today's mainstream GPUs.

The current workhorse inference card, the T4 (16 GB VRAM), can just fit a 7B model (fp16, about 14 GB of VRAM); and the V100 (32 GB VRAM) can fit a 13B model (fp16, about 26 GB of VRAM).
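A minimal sketch of where those numbers come from: fp16 stores each parameter in 2 bytes, so the weights alone need about 2·N bytes. The estimate ignores the KV cache, activations, and framework overhead, which is why the fit is "just barely".

```python
# Estimate fp16 weight memory: 2 bytes per parameter.
# Decimal GB (1e9 bytes) is used to match the 14 GB / 26 GB figures above;
# KV cache, activations, and runtime overhead are not counted.

def fp16_weight_gb(params_in_billions: float) -> float:
    return params_in_billions * 1e9 * 2 / 1e9  # 2 GB per billion parameters

for size_b, card, vram_gb in [(7, "T4", 16), (13, "V100", 32)]:
    need = fp16_weight_gb(size_b)
    print(f"{size_b}B model: ~{need:.0f} GB of fp16 weights -> fits a {card} ({vram_gb} GB)")
```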

References:

[1] Scaling Laws for Autoregressive Generative Modeling

[2] LLaMA: Open and Efficient Foundation Language Models

[3] Training Compute-Optimal Large Language Models