
Why are LLM sizes today all designed in tiers of 6/7B, 13B, and 130B?

2023-10-22 · News

Good question. Let me try to answer it.

How the largest size is determined:

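According to OpenAI's Scaling Law [1], a model's performance depends on three things: the amount of compute, the number of model parameters, and the size of the dataset.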

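For organizations like Meta and OpenAI, compute is certainly not the bottleneck (they have the money and the GPUs), and parameter count is not the bottleneck either (Google's PaLM was scaled up to 540B). So the binding constraint is most likely data.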

In other words, under the Scaling Law, the largest model that can be adequately trained at today's data scale is roughly 100B parameters.

For more on the Scaling Law, see this article: nghuyong: 解析大模型中的Scaling Law (an analysis of scaling laws in large models)
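To make the data-constraint argument concrete, here is a rough sketch. It assumes the widely quoted rule of thumb of about 20 training tokens per parameter, which matches the Chinchilla configuration (70B parameters, 1.4T tokens) but is not a number stated in this answer. The function name and the token budgets are purely illustrative.

```python
# Rule-of-thumb sketch: roughly 20 training tokens per parameter.
# ASSUMPTION: this ratio comes from the Chinchilla 70B / 1.4T-token setup,
# it is not a value given in the original answer.
TOKENS_PER_PARAM = 20


def data_limited_params(num_tokens: float,
                        tokens_per_param: float = TOKENS_PER_PARAM) -> float:
    """Return the parameter count that a given token budget can train adequately."""
    return num_tokens / tokens_per_param


if __name__ == "__main__":
    # Hypothetical token budgets, chosen only for illustration.
    for tokens in (1.0e12, 1.4e12, 2.0e12):
        params = data_limited_params(tokens)
        print(f"{tokens / 1e12:.1f}T tokens -> ~{params / 1e9:.0f}B parameters")
```

Under this assumption, a corpus of around 2T tokens corresponds to a model of about 100B parameters, which is in line with the estimate above.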

How the small and medium sizes are determined:

Here the model size is decided mainly by the inference scenario rather than by the Scaling Law. To quote the LLaMA [2] report:

The objective of the scaling laws from Hoffmann et al. (2022) is to determine how to best scale the dataset and model sizes for a particular training compute budget. However, this objective disregards the inference budget, which becomes critical when serving a language model at scale. In this context, given a target level of performance, the preferred model is not the fastest to train but the fastest at inference, and although it may be cheaper to train a large model to reach a certain level of performance, a smaller one trained longer will ultimately be cheaper at inference. For instance, although Hoffmann et al. (2022) recommends training a 10B model on 200B tokens, we find that the performance of a 7B model continues to improve even after 1T tokens.

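The Chinchilla Scaling Law [3] does not take inference into account. In real deployments, once inference cost is considered, smaller models have the advantage: a small model keeps improving as it is trained on more data, and it is cheaper to serve. So rather than following the compute-optimal recipe, it is better to train a smaller model (7B rather than 10B) on more data (1T tokens rather than 200B). This costs more training compute, but it buys cheaper inference.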
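The trade-off can be checked with back-of-the-envelope arithmetic. The sketch below assumes the common approximations that training costs about 6·N·D FLOPs and that generating one token costs about 2·N FLOPs, for N parameters and D training tokens. Neither formula appears in this answer, and the function names are illustrative.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute, assuming C ~= 6 * N * D."""
    return 6.0 * n_params * n_tokens


def inference_flops_per_token(n_params: float) -> float:
    """Approximate forward-pass cost per generated token, assuming ~2 * N."""
    return 2.0 * n_params


if __name__ == "__main__":
    # The two configurations mentioned in the LLaMA quote.
    configs = {
        "10B on 200B tokens": (10e9, 200e9),
        "7B on 1T tokens": (7e9, 1e12),
    }
    for name, (n, d) in configs.items():
        print(f"{name}: train ~{train_flops(n, d):.1e} FLOPs, "
              f"inference ~{inference_flops_per_token(n):.1e} FLOPs/token")

    train_ratio = train_flops(7e9, 1e12) / train_flops(10e9, 200e9)
    infer_ratio = inference_flops_per_token(7e9) / inference_flops_per_token(10e9)
    print(f"training cost ratio (7B / 10B):  {train_ratio:.2f}x")
    print(f"inference cost ratio (7B / 10B): {infer_ratio:.2f}x")
```

Under these approximations the 7B run takes several times more training compute than the 10B run, while every served token is about 30% cheaper. The training cost is paid once. The inference saving applies to every request.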

As for why the sizes are specifically 7B and 13B, my guess is that it has to do with the VRAM sizes of today's mainstream GPUs.

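The current workhorse inference card, the T4 (16 GB of VRAM), can just fit a 7B model (fp16, about 14 GB of VRAM for the weights), while the V100 (32 GB of VRAM) can fit a 13B model (fp16, about 26 GB).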
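The VRAM figures follow from simple arithmetic: fp16 stores each parameter in 2 bytes. The sketch below reproduces the numbers under two assumptions: 1 GB is taken as 10^9 bytes, and only the weights are counted (the KV cache, activations, and framework overhead are ignored and need additional headroom). The helper names are hypothetical.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_memory_gb(n_params: float, dtype: str = "fp16") -> float:
    """Memory for the weights alone, in GB (ASSUMPTION: 1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(n_params: float, vram_gb: float, dtype: str = "fp16") -> bool:
    """True if the weights alone fit in the given VRAM (ignores KV cache etc.)."""
    return weight_memory_gb(n_params, dtype) <= vram_gb


if __name__ == "__main__":
    cards = {"T4": 16, "V100": 32}          # VRAM in GB, as quoted above
    models = {"7B": 7e9, "13B": 13e9}
    for model_name, n in models.items():
        print(f"{model_name} in fp16: {weight_memory_gb(n):.0f} GB of weights")
        for card, vram in cards.items():
            verdict = "fits" if fits(n, vram) else "does not fit"
            print(f"  {card} ({vram} GB): {verdict}")
```

On this count a 13B model in fp16 does not fit on a 16 GB card, which is consistent with 13B being paired with the 32 GB tier.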

References:

[1] Scaling Laws for Autoregressive Generative Modeling

[2] LLaMA: Open and Efficient Foundation Language Models

[3] Training Compute-Optimal Large Language Models