AI and compute

We’re releasing an analysis showing that since 2012, the amount of compute used in the largest AI training runs has been increasing exponentially with a 3.4-month doubling time (by comparison, Moore’s Law had a 2-year doubling period). Since 2012, this metric has grown by more than 300,000x (a 2-year doubling period would yield only a 7x increase). Improvements in compute have been a key component of AI progress, so as long as this trend continues, it’s worth preparing for the implications of systems far outside today’s capabilities.
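The growth figures above can be sanity-checked with a few lines of arithmetic. The sketch below is only an illustration, not part of the original analysis: the function names are hypothetical, and it takes the growth factor as exactly 300,000x, whereas the text says "more than 300,000x". Under that assumption a 3.4-month doubling time corresponds to roughly 62 months, and a 2-year doubling time over the same period gives about 6x, the same order as the 7x quoted above.

```python
import math

def months_for_growth(growth_factor: float, doubling_months: float) -> float:
    """Months needed to grow by growth_factor at a fixed doubling time."""
    return math.log2(growth_factor) * doubling_months

def growth_after(months: float, doubling_months: float) -> float:
    """Growth factor reached after the given number of months."""
    return 2.0 ** (months / doubling_months)

# Assumption: growth of exactly 300,000x (the text says "more than").
elapsed_months = months_for_growth(300_000, 3.4)

# The same time span under a 2-year (24-month) doubling period.
moore_growth = growth_after(elapsed_months, 24.0)

print(f"elapsed: {elapsed_months:.1f} months")
print(f"growth at a 2-year doubling time: {moore_growth:.1f}x")
```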

The total amount of compute, in petaflop/s-days,[^footnote-petaflops] used to train selected results that are relatively well known, used a lot of compute for their time, and gave enough information to estimate the compute used.

Appendix: methods

Two methodologies were used to generate these data points. When we had enough information, we directly counted the number of FLOPs (adds and multiplies) in the described architecture per training example and multiplied by the total number of forward and backward passes during training. When we didn’t have enough information to directly count FLOPs, we looked at GPU training time and the total number of GPUs used and assumed a utilization efficiency (usually 0.33). For the majority of the papers we were able to use the first method, but for a significant minority we relied on the second, and we computed both whenever possible as a consistency check. In the majority of cases we also confirmed with the authors. The calculations are not intended to be precise, but we aim to be correct within a factor of 2-3. We provide some example calculations below.

Example of Method 1: Counting operations in the model

This method is particularly easy to use when the authors give the number of operations used in a forward pass, as in the ResNet paper (the ResNet-152 model in particular):

(add-multiplies per forward pass) * (2 FLOPs/add-multiply) * (3 for forward and backward pass) * (number of examples in dataset) * (number of epochs)

= (11.4 * 10^9) * 2 * 3 * (1.2 * 10^6 images) * 128

= 10,000 PF = 0.117 pfs-days
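The Method 1 arithmetic above can be written as a small helper. This is a minimal sketch rather than the authors' actual tooling: the function and constant names are hypothetical, and the inputs are the ResNet figures quoted above. Unrounded, the result comes out at about 0.12 pfs-days, in line with the rounded 0.117 given above.

```python
SECONDS_PER_DAY = 86_400
FLOPS_PER_PETAFLOP = 1e15

def method1_total_flops(add_multiplies_per_forward: float,
                        examples: float,
                        epochs: float,
                        flops_per_add_multiply: float = 2.0,
                        forward_backward_factor: float = 3.0) -> float:
    """Total training FLOPs from a per-example forward-pass operation count."""
    return (add_multiplies_per_forward * flops_per_add_multiply
            * forward_backward_factor * examples * epochs)

def flops_to_pfs_days(total_flops: float) -> float:
    """Convert a FLOP count into petaflop/s-days."""
    return total_flops / (FLOPS_PER_PETAFLOP * SECONDS_PER_DAY)

# ResNet figures from the text: 11.4e9 add-multiplies, 1.2e6 images, 128 epochs.
resnet_flops = method1_total_flops(11.4e9, 1.2e6, 128)
resnet_pfs_days = flops_to_pfs_days(resnet_flops)

print(f"{resnet_flops / FLOPS_PER_PETAFLOP:,.0f} PF = {resnet_pfs_days:.3f} pfs-days")
```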

Operations can also be counted programmatically for a known model architecture in some deep learning frameworks, or we can simply count operations manually. If a paper gives enough information to make this calculation, it will be quite accurate, but in some cases papers don’t contain all the necessary information and authors aren’t able to reveal it publicly.

Example of Method 2: GPU Time

If we can’t count operations directly, we can instead look at how many GPUs were used for training and for how long, and use reasonable guesses at GPU utilization to estimate the number of operations performed. We emphasize that here we are not counting peak theoretical FLOPS, but using an assumed fraction of theoretical FLOPS to approximate actual FLOPS. We typically assume a 33% utilization for GPUs and a 17% utilization for CPUs, based on our own experience, except where we have more specific information (e.g. we spoke to the author or the work was done at OpenAI).

As an example, in the AlexNet paper it’s stated that “our network takes between five and six days to train on two GTX 580 3GB GPUs”. Under our assumptions this implies a total compute of:

Number of GPUs * (peta-flops/GTX580) * days trained * estimated utilization

= 2 * (1.58 * 10 ^ -3 PF) * 5.5 * 0.33

= 500 PF = 0.0058 pfs-days
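The same GPU-time estimate can be expressed as a short function. This is a sketch under the assumptions stated above (33% utilization, 1.58e-3 petaflop/s peak per GTX 580, 5.5 days as the midpoint of "five and six days"); the function name is hypothetical.

```python
SECONDS_PER_DAY = 86_400

def method2_pfs_days(num_gpus: int,
                     days: float,
                     peak_pflops_per_gpu: float,
                     utilization: float = 0.33) -> float:
    """Estimate training compute in petaflop/s-days from GPU time.

    peak_pflops_per_gpu is the theoretical peak in petaflop/s; utilization
    is the assumed fraction of that peak actually achieved.
    """
    return num_gpus * days * peak_pflops_per_gpu * utilization

# AlexNet figures from the text: two GTX 580 GPUs for about 5.5 days.
alexnet_pfs_days = method2_pfs_days(2, 5.5, 1.58e-3)

# Total operations, in units of 1e15 FLOPs (the "PF" figure in the text).
alexnet_pf = alexnet_pfs_days * SECONDS_PER_DAY

print(f"{alexnet_pf:.0f} PF = {alexnet_pfs_days:.4f} pfs-days")
```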

This method is more approximate and can easily be off by a factor of 2 or occasionally more; our aim is only to estimate the order of magnitude. In practice when both methods are available they often line up quite well (for AlexNet we can also directly count the operations, which gives us 0.0054 pfs-days vs 0.0058 with the GPU time method).

1.2M images * 90 epochs * 0.75 GFLOPS * (2 add-multiply) * (3 backward pass)

= 470 PF = 0.0054 pfs-days

Selected additional calculations

Dropout

1 GPU * 4 days * 1.54 TFLOPS/GTX 580 * 0.33 utilization

= 184 PF = 0.0021 pfs-days

Method 2

Visualizing and Understanding Conv Nets

1 GPU * 12 days * 1.54 TFLOPS/GTX 580 * 0.33 utilization

= 532 PF = 0.0062 pfs-days

DQN

Network is 84x84x3 input, 16, 8x8, stride 4, 32 4x4 stride 2, 256 fully connected

First layer: 20*20*3*16*8*8 = 1.23M add-multiplies

Second layer: 9*9*16*32*4*4 = 0.66M add-multiplies

Third layer: 9*9*32*256 = 0.66M add-multiplies

Total ~ 2.55M add-multiplies
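The per-layer counts above follow from the layer shapes. The sketch below reproduces them, assuming unpadded convolutions (so a layer's output size is (input - kernel) / stride + 1) and the 3 input channels stated in the text; the helper names are hypothetical.

```python
def conv_output_size(input_size: int, kernel: int, stride: int) -> int:
    """Spatial output size of an unpadded ("valid") convolution."""
    return (input_size - kernel) // stride + 1

def conv_add_multiplies(out_size: int, in_channels: int,
                        out_channels: int, kernel: int) -> int:
    """Add-multiplies for one convolutional layer on a square input."""
    return out_size * out_size * in_channels * out_channels * kernel * kernel

def dense_add_multiplies(inputs: int, outputs: int) -> int:
    """Add-multiplies for one fully connected layer."""
    return inputs * outputs

size1 = conv_output_size(84, kernel=8, stride=4)     # 20
size2 = conv_output_size(size1, kernel=4, stride=2)  # 9

layer1 = conv_add_multiplies(size1, in_channels=3, out_channels=16, kernel=8)
layer2 = conv_add_multiplies(size2, in_channels=16, out_channels=32, kernel=4)
layer3 = dense_add_multiplies(size2 * size2 * 32, 256)

total = layer1 + layer2 + layer3
print(layer1, layer2, layer3, total)
```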

2.5 MFLOPs * 5M updates * 32 batch size * 2 multiply-add * 3 backward pass

= 2.3 PF = 2.7e-5 pfs-days

Method 1

Seq2Seq

(348M + 304M) words * 0.380 GF * 2 add-multiply * 3 backprop * 7.5 epoch

= 7,300 PF = 0.085 pfs-days

10 days * 8 GPUs * 3.5 TFLOPS/K20 GPU * 0.33 utilization

= 8,100 PF = 0.093 pfs-days

VGG

1.2 M images * 74 epochs * 16 GFLOPS * 2 add-multiply * 3 backward pass

= 8524 PF = 0.098 pfs-days

4 Titan Black GPUs * 15 days * 5.1 TFLOPS/GPU * 0.33 utilization

= 10,000 PF = 0.12 pfs-days

DeepSpeech2

1 timestep = (1280 hidden units)^2 * (7 RNN layers * 4 matrices for bidirectional + 2 DNN layers) * (2 for doubling parameters from 36M to 72M) = 98 MFLOPs

20 epochs * 12,000 hours * 3600 seconds/hour * 50 samples/sec * 98 MFLOPs * 2 add-multiply * 3 backprop

= 26,000 PF = 0.30 pfs-days

16 TitanX GPUs * 5 days * 6 TFLOPS/GPU * 0.50 utilization

= 21,000 PF = 0.25 pfs-days
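The two DeepSpeech2 estimates above can be reproduced side by side. This is an illustrative sketch using only the figures quoted in the text; the variable names are hypothetical, and the per-timestep figure is treated as a count of add-multiplies, which is how the calculation above uses it.

```python
SECONDS_PER_DAY = 86_400
FLOPS_PER_PETAFLOP = 1e15

# Method 1: count operations per timestep, then scale by the data seen.
hidden_units = 1280
matrices = 7 * 4 + 2        # 7 bidirectional RNN layers * 4 matrices + 2 DNN layers
parameter_doubling = 2      # "doubling parameters from 36M to 72M"
ops_per_timestep = hidden_units ** 2 * matrices * parameter_doubling

epochs = 20
audio_hours = 12_000
samples_per_second = 50
timesteps = epochs * audio_hours * 3600 * samples_per_second

# 2 FLOPs per add-multiply, times 3 for the forward and backward pass.
method1_flops = timesteps * ops_per_timestep * 2 * 3
method1_pfs_days = method1_flops / (FLOPS_PER_PETAFLOP * SECONDS_PER_DAY)

# Method 2: 16 GPUs for 5 days at 6 TFLOPS peak and 50% assumed utilization.
method2_pfs_days = 16 * 5 * 6e-3 * 0.50

print(f"Method 1: {method1_pfs_days:.2f} pfs-days")
print(f"Method 2: {method2_pfs_days:.2f} pfs-days")
```

Both values land within the factor of 2-3 that the estimates aim for.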

Xception

60 K80 GPUs * 30 days * 8.5 TFLOPS/GPU * 0.33 utilization

= 4.5e5 PF = 5.0 pfs-days

Neural Architecture Search

50 epochs * 50,000 images * 10.0 GFLOPS * 12800 networks * 2 add-multiply * 3 backward pass

= 1.9e6 PF = 22 pfs-days

800 K40s * 28 days * 4.2 TFLOPS/GPU * 0.33 utilization

= 2.8e6 PF = 31 pfs-days

Method 2. Details given in a later paper.

Neural Machine Translation

sqrt(10 * 100) factor added because production model used 2-3 orders of magnitude more data, but only 1 epoch rather than 10.

96 K80 GPUs * 9 days * 8.5 TFLOPS * 0.33 utilization * sqrt(10 * 100)

= 6.9e6 PF = 79 pfs-days

Appendix: Recent novel results that used modest amounts of compute

Massive compute is certainly not a requirement to produce important results. Many recent noteworthy results have used only modest amounts of compute. Here are some examples of results using modest compute that gave enough information to estimate their compute. We didn’t use multiple methods to estimate the compute for these models, and for upper bounds we made conservative estimates around any missing information, so they have more overall uncertainty. They aren’t material to our quantitative analysis, but we still think they are interesting and worth sharing:

* Attention is all you need: 0.089 pfs-days (6/2017)

* Adam Optimizer: less than 0.0007 pfs-days (12/2014)

* Learning to Align and Translate: 0.018 pfs-days (9/2014)

* GANs: less than 0.006 pfs-days (6/2014)

* Word2Vec: less than 0.00045 pfs-days (10/2013)

* Variational Auto Encoders: less than 0.0000055 pfs-days (12/2013)

We’ve updated our analysis with data that span 1959 to 2012. Looking at the data as a whole, we clearly see two distinct eras of training AI systems in terms of compute usage: (a) a first era, from 1959 to 2012, which is defined by results that roughly track Moore’s law, and (b) the modern era, from 2012 to now, of results using computational power that substantially outpaces macro trends. The history of investment in AI broadly is usually told as a story of booms and busts, but we don’t see that reflected in the historical trend of compute used by learning systems. It seems that AI winters and periods of excitement had a small effect on compute used to train models over the last half-century.
