AI and compute
We’re releasing an analysis showing that since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000x (a 2-year doubling period would yield only a 7x increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.
Figure: The total amount of compute, in petaflop/s-days, used to train selected results that are relatively well known, used a lot of compute for their time, and gave enough information to estimate the compute used.
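The doubling-time arithmetic in the opening paragraph can be checked with a short sketch. The function and constant names are illustrative and not part of the original analysis, and the elapsed time is derived here from the quoted 300,000x figure rather than taken from the underlying data.

```python
import math

def doublings_for_growth(growth_factor: float) -> float:
    """Number of doublings needed to reach a given growth factor."""
    return math.log2(growth_factor)

def growth_over(months: float, doubling_months: float) -> float:
    """Growth factor after `months` at a fixed doubling time."""
    return 2.0 ** (months / doubling_months)

# Figures quoted in the text.
GROWTH = 300_000        # growth in training compute since 2012
FAST_DOUBLING = 3.4     # months, observed doubling time
SLOW_DOUBLING = 24.0    # months, the 2-year comparison period

doublings = doublings_for_growth(GROWTH)       # about 18.2 doublings
elapsed_months = doublings * FAST_DOUBLING     # about 62 months
slow_growth = growth_over(elapsed_months, SLOW_DOUBLING)

print(f"{doublings:.1f} doublings over {elapsed_months:.0f} months")
print(f"2-year doubling over the same span: {slow_growth:.1f}x")
```

Under these assumptions the slower trend yields only single-digit growth over the same span, the same order as the comparison figure quoted above.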
Appendix: methods
Two methodologies were used to generate these data points. When we had enough information, we directly counted the number of FLOPs (adds and multiplies) in the described architecture per training example and multiplied by the total number of forward and backward passes during training. When we didn’t have enough information to directly count FLOPs, we looked at GPU training time and the total number of GPUs used and assumed a utilization efficiency (usually 0.33). For the majority of the papers we were able to use the first method, but for a significant minority we relied on the second, and we computed both whenever possible as a consistency check. In the majority of cases we also confirmed with the authors. The calculations are not intended to be precise, but we aim to be correct within a factor of 2-3. We provide some example calculations below.
Example of Method 1: Counting operations in the model
This method is particularly easy to use when the authors give the number of operations used in a forward pass, as in the ResNet paper (the ResNet-152 model in particular):
(add-multiplies per forward pass) * (2 FLOPs/add-multiply) * (3 for forward and backward pass) * (number of examples in dataset) * (number of epochs)
= (11.4 * 10^9) * 2 * 3 * (1.2 * 10^6 images) * 128
= 10,000 PF = 0.117 pfs-days
Operations can also be counted programmatically for a known model architecture in some deep learning frameworks, or we can simply count operations manually. If a paper gives enough information to make this calculation, it will be quite accurate, but in some cases papers don’t contain all the necessary information and authors aren’t able to reveal it publicly.
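As a minimal sketch of Method 1, the operation-counting formula can be written as a small function. The function and parameter names are our own for illustration; the inputs are the ResNet figures quoted in the example, and the result lands near the rounded value given there.

```python
# Operations in one petaflop/s-day: 10^15 operations per second for one day.
PFS_DAY = 1e15 * 86_400

def method1_pfs_days(add_multiplies_per_forward, examples, epochs,
                     flops_per_add_multiply=2, passes_factor=3):
    """Estimate training compute by counting operations (Method 1).

    passes_factor=3 covers the forward and backward pass, following
    the convention used in the example calculation.
    """
    total_flops = (add_multiplies_per_forward * flops_per_add_multiply
                   * passes_factor * examples * epochs)
    return total_flops / PFS_DAY

# ResNet figures quoted in the text.
resnet = method1_pfs_days(
    add_multiplies_per_forward=11.4e9,
    examples=1.2e6,   # images in the dataset
    epochs=128,
)
print(f"ResNet estimate: {resnet:.3f} pfs-days")
```

Keeping the two multipliers as named parameters makes the convention explicit, so a paper that already reports FLOPs rather than add-multiplies can be handled by passing `flops_per_add_multiply=1`.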
Example of Method 2: GPU Time
If we can’t count operations directly, we can instead look at how many GPUs were used for training and for how long, and use reasonable guesses at GPU utilization to estimate the number of operations performed. We emphasize that here we are not counting peak theoretical FLOPS, but using an assumed fraction of theoretical FLOPS to estimate actual FLOPS. We typically assume 33% utilization for GPUs and 17% utilization for CPUs, based on our own experience, except where we have more specific information (e.g. we spoke to the author or the work was done at OpenAI).
As an example, in the AlexNet paper it’s stated that “our network takes between five and six days to train on two GTX 580 3GB GPUs”. Under our assumptions this implies a total compute of:
Number of GPUs * (peta-flops/GTX580) * days trained * estimated utilization
= 2 * (1.58 * 10 ^ -3 PF) * 5.5 * 0.33
= 500 PF = 0.0058 pfs-days
This method is more approximate and can easily be off by a factor of 2 or occasionally more; our aim is only to estimate the order of magnitude. In practice when both methods are available they often line up quite well (for AlexNet we can also directly count the operations, which gives us 0.0054 pfs-days vs 0.0058 with the GPU time method).
1.2M images * 90 epochs * 0.75 GFLOPS * (2 add-multiply) * (3 backward pass)
= 470 PF = 0.0054 pfs-days
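The two AlexNet estimates above can be reproduced side by side with a short sketch. The function name and its default utilization argument are illustrative choices; the inputs are the figures quoted in the text, so the outputs should agree with the rounded values there.

```python
# Operations in one petaflop/s-day.
PFS_DAY = 1e15 * 86_400

def method2_pfs_days(num_gpus, peak_pflops_per_gpu, days, utilization=0.33):
    """Estimate training compute from GPU time (Method 2).

    peak_pflops_per_gpu is the theoretical peak in petaflop/s; the
    utilization factor scales it down to an assumed achieved rate.
    """
    return num_gpus * peak_pflops_per_gpu * days * utilization

# AlexNet via GPU time: two GTX 580s for about 5.5 days.
alexnet_gpu_time = method2_pfs_days(2, 1.58e-3, 5.5)

# AlexNet via direct counting, using the figures in the calculation above.
alexnet_counted = (1.2e6 * 90 * 0.75e9 * 2 * 3) / PFS_DAY

ratio = alexnet_gpu_time / alexnet_counted
print(f"GPU time: {alexnet_gpu_time:.4f} pfs-days")
print(f"Counted:  {alexnet_counted:.4f} pfs-days")
print(f"Ratio:    {ratio:.2f}")
```

A ratio close to 1 is the consistency check described in the methods section; a ratio far outside the factor of 2-3 target would suggest revisiting the utilization assumption or the operation count.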
Selected additional calculations
Dropout
Method 2:
1 GPU * 4 days * 1.54 TFLOPS/GTX 580 * 0.33 utilization
= 184 PF = 0.0021 pfs-days
Visualizing and Understanding Conv Nets
Method 2:
1 GPU * 12 days * 1.54 TFLOPS/GTX 580 * 0.33 utilization
= 532 PF = 0.0062 pfs-days
DQN
Method 1:
Network is 84x84x3 input; 16 filters of 8x8, stride 4; 32 filters of 4x4, stride 2; 256 fully connected units
First layer: 20*20*3*16*8*8 = 1.23M add-multiplies
Second layer: 9*9*16*32*4*4 = 0.66M add-multiplies
Third layer: 9*9*32*256 = 0.66M add-multiplies
Total ~ 2.55M add-multiplies
2.5 MFLOPs * 5M updates * 32 batch size * 2 multiply-add * 3 backward pass
= 2.3 PF = 2.7e-5 pfs-days
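The per-layer counts in the DQN calculation can be derived from the layer shapes. The sketch below assumes unpadded convolutions, which is what reproduces the 20x20 and 9x9 feature maps used above; the helper names are illustrative.

```python
def conv_out(size, kernel, stride):
    """Output side length of an unpadded convolution (assumed here)."""
    return (size - kernel) // stride + 1

def conv_add_multiplies(out_side, in_channels, out_channels, kernel):
    """Add-multiplies for one convolutional layer on a square input."""
    return out_side * out_side * in_channels * out_channels * kernel * kernel

# Layer shapes as given in the text: 84x84x3 input,
# 16 filters of 8x8 stride 4, 32 filters of 4x4 stride 2, 256 dense units.
side1 = conv_out(84, 8, 4)       # 20
side2 = conv_out(side1, 4, 2)    # 9

layer1 = conv_add_multiplies(side1, in_channels=3, out_channels=16, kernel=8)
layer2 = conv_add_multiplies(side2, in_channels=16, out_channels=32, kernel=4)
layer3 = side2 * side2 * 32 * 256   # fully connected layer on the 9x9x32 output

total = layer1 + layer2 + layer3
print(f"{layer1:,} + {layer2:,} + {layer3:,} = {total:,} add-multiplies")
```

The total comes to roughly 2.55M add-multiplies per forward pass, matching the figure in the calculation above.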
Seq2Seq
Method 1:
(348M + 304M) words * 0.380 GF * 2 add-multiply * 3 backprop * 7.5 epoch
= 7,300 PF = 0.085 pfs-days
Method 2:
10 days * 8 GPUs * 3.5 TFLOPS/K20 GPU * 0.33 utilization
= 8,100 PF = 0.093 pfs-days
VGG
Method 1:
1.2M images * 74 epochs * 16 GFLOPS * 2 add-multiply * 3 backward pass
= 8,524 PF = 0.098 pfs-days
Method 2:
4 Titan Black GPUs * 15 days * 5.1 TFLOPS/GPU * 0.33 utilization
= 10,000 PF = 0.12 pfs-days
DeepSpeech2
Method 1:
1 timestep = (1280 hidden units)^2 * (7 RNN layers * 4 matrices for bidirectional + 2 DNN layers) * (2 for doubling parameters from 36M to 72M) = 98 MFLOPs
20 epochs * 12,000 hours * 3600 seconds/hour * 50 samples/sec * 98 MFLOPs * 2 add-multiply * 3 backprop
= 26,000 PF = 0.30 pfs-days
Method 2:
16 TitanX GPUs * 5 days * 6 TFLOPS/GPU * 0.50 utilization
= 21,000 PF = 0.25 pfs-days
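The DeepSpeech2 estimate chains a per-timestep operation count with the amount of audio processed, which is easy to get wrong by hand. The following sketch reproduces both estimates from the figures quoted above; variable names are our own, and the combined factor of 6 stands for the add-multiply and forward/backward multipliers used throughout this appendix.

```python
# Operations in one petaflop/s-day.
PFS_DAY = 1e15 * 86_400

# Per-timestep count, following the breakdown in the text.
hidden_units = 1280
matrices = 7 * 4 + 2            # 7 bidirectional RNN layers (4 matrices each) + 2 DNN layers
parameter_doubling = 2          # 36M -> 72M parameters
per_timestep = hidden_units ** 2 * matrices * parameter_doubling

# Total timesteps seen in training.
epochs = 20
audio_hours = 12_000
samples_per_second = 50
timesteps = epochs * audio_hours * 3600 * samples_per_second

# Method 1: operation counting (factor 6 = add-multiply and backprop multipliers).
method1 = timesteps * per_timestep * 6 / PFS_DAY

# Method 2: 16 GPUs for 5 days at 6 TFLOPS peak and 50% utilization.
method2 = 16 * 5 * 6e-3 * 0.50

print(f"per timestep: {per_timestep / 1e6:.0f}M operations")
print(f"Method 1: {method1:.2f} pfs-days, Method 2: {method2:.2f} pfs-days")
```

Both estimates land within about 25% of each other, well inside the factor of 2-3 accuracy the analysis aims for.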
Xception
Method 2:
60 K80 GPUs * 30 days * 8.5 TFLOPS/GPU * 0.33 utilization
= 4.5e5 PF = 5.0 pfs-days
Neural Architecture Search
Method 1:
50 epochs * 50,000 images * 10.0 GFLOPS * 12800 networks * 2 add-multiply * 3 backward pass
= 1.9e6 PF = 22 pfs-days
Method 2 (details given in a later paper):
800 K40s * 28 days * 4.2 TFLOPS/GPU * 0.33 utilization
= 2.8e6 PF = 31 pfs-days
Neural Machine Translation
Method 2:
A factor of sqrt(10 * 100) is added because the production model used 2-3 orders of magnitude more data, but only 1 epoch rather than 10. That is 10x to 100x more compute, of which sqrt(10 * 100) is the geometric mean.
96 K80 GPUs * 9 days * 8.5 TFLOPS * 0.33 utilization * sqrt(10 * 100)
= 6.9e6 PF = 79 pfs-days
Appendix: Recent novel results that used modest amounts of compute
Massive compute is certainly not a requirement to produce important results. Many recent noteworthy results have used only modest amounts of compute. Here are some examples of results using modest compute that gave enough information to estimate their compute. We didn’t use multiple methods to estimate the compute for these models, and for upper bounds we made conservative estimates around any missing information, so they have more overall uncertainty. They aren’t material to our quantitative analysis, but we still think they are interesting and worth sharing:
* Attention is all you need: 0.089 pfs-days (6/2017)
* Adam Optimizer: less than 0.0007 pfs-days (12/2014)
* Learning to Align and Translate: 0.018 pfs-days (9/2014)
* GANs: less than 0.006 pfs-days (6/2014)
* Word2Vec: less than 0.00045 pfs-days (10/2013)
* Variational Auto Encoders: less than 0.0000055 pfs-days (12/2013)
We’ve updated our analysis with data that span 1959 to 2012. Looking at the data as a whole, we clearly see two distinct eras of training AI systems in terms of compute usage: (a) a first era, from 1959 to 2012, which is defined by results that roughly track Moore’s law, and (b) the modern era, from 2012 to now, of results using computational power that substantially outpaces macro trends. The history of investment in AI broadly is usually told as a story of booms and busts, but we don’t see that reflected in the historical trend of compute used by learning systems. It seems that AI winters and periods of excitement had a small effect on compute used to train models over the last half-century.