On the empirical distribution of numbers

Updated 2025-05-02 / Created 2025-05-02 / 812 words

At last, data-driven numerology.

What are some good numbers for someone just starting to get into math?

You've probably heard of Benford's Law – the nonuniform empirical distribution of the first digits of "real-world numbers". While thinking about other things, I realized that it would be very easy to investigate this more generally by counting all instances of "numbers" within a mid-sized LLM training dataset.

I did this using a somewhat complex state-machine parser for "numbers", which attempts to correctly interpret numbers containing commas, decimal points and percent signs, or separated by dashes, but which does not handle scientific notation or exponents[1]. You can get the resulting number data here; I have run some analyses myself, described below. The lack of scientific notation parsing likely cuts off lots of the top end, and the URLs in the dataset probably skew it too.
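As a rough illustration, here's a toy, regex-based sketch of the kind of extraction involved (not the actual parser, which is a state machine and normalizes more cases than this):

```python
import re

# Toy extractor: integers and decimals, optional thousands separators,
# optional trailing percent sign. No scientific notation, like the original.
NUMBER_RE = re.compile(r"-?\d{1,3}(?:,\d{3})+(?:\.\d+)?%?|-?\d+(?:\.\d+)?%?")

def extract_numbers(text: str) -> list[str]:
    """Return normalized number tokens (thousands separators stripped)."""
    return [m.group().replace(",", "") for m in NUMBER_RE.finditer(text)]

print(extract_numbers("Prices rose 3.5% in 2020, from $1,200 to $1,242."))
# -> ['3.5%', '2020', '1200', '1242']
```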

Run over the entire Pile training set, my code extracts 60633452 unique numbers (6.1e7)[2], and 12598942216 (1.2e10) total instances. First, here are the 30 most common numbers (a toy sketch of the counting step follows the table). You can see that 1 makes up about 10% of all numbers, with 2 very close behind[3]. Only one negative number makes it into these rarefied heights, and we can see that "round numbers" are more salient than merely small numbers.

| number | count |
| --- | --- |
| 1 | 1285091487 |
| 2 | 1245931860 |
| 0 | 759996917 |
| 3 | 658078393 |
| 4 | 442882667 |
| 5 | 342835723 |
| 6 | 246669458 |
| 8 | 207732046 |
| 10 | 204988418 |
| 7 | 195558391 |
| 9 | 161918480 |
| 12 | 129697054 |
| 11 | 111310452 |
| 20 | 101514897 |
| 15 | 91434782 |
| 16 | 87265477 |
| 13 | 79321181 |
| 14 | 75491719 |
| 30 | 72532266 |
| -1 | 70044908 |
| 18 | 69886735 |
| 17 | 62675170 |
| 32 | 59691749 |
| 24 | 58510498 |
| 19 | 57800365 |
| 25 | 55035188 |
| 21 | 54834586 |
| 100 | 50744706 |
| 22 | 47568552 |
| 50 | 47307684 |
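The aggregation itself is conceptually just a big frequency table; a toy sketch of that step, on made-up extracted tokens rather than the actual Pile stream:

```python
from collections import Counter

# Made-up normalized number tokens standing in for the real extraction output.
extracted = ["1", "2", "1", "0.5", "100", "1", "2", "-1"]

counts = Counter(extracted)
total = sum(counts.values())

# Most common numbers and their share of all instances.
for number, count in counts.most_common(30):
    print(f"{number}\t{count}\t{count / total:.1%}")
```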

We might expect the distribution to be Zipfian, i.e. $\mathrm{frequency} \propto \frac{1}{\mathrm{rank}}$, although this doesn't hold for the top two numbers. I find that it's not: on a log/log plot, the best-fit line has a substantially steeper gradient than Zipf's law predicts. I don't know how to fit a Zipf-Mandelbrot law $\mathrm{frequency} \propto \frac{1}{(\mathrm{rank}+b)^a}$ with the $b$ parameter, so I haven't.
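The log/log fit is just a least-squares line on log-transformed ranks and counts; a toy sketch, with a synthetic Zipf series standing in for the real per-number counts:

```python
import numpy as np

# Synthetic counts following an exact Zipf series; the real data would be the
# per-number occurrence counts, sorted in descending order.
counts = np.array([1e9 / r for r in range(1, 1001)])
ranks = np.arange(1, len(counts) + 1)

# Fit log(frequency) = a * log(rank) + c. Zipf's law predicts a ≈ -1;
# the Pile numbers come out substantially steeper (more negative).
a, c = np.polyfit(np.log(ranks), np.log(counts), 1)
print(f"fitted exponent a = {a:.3f}")
```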

It's closer to Zipfian for the top thousand numbers.

It intuitively feels like the line should be shallower here, but the points are much denser toward the right.

No significant change here.

I also analyzed the number of numbers whose occurrence counts fall into exponentially spaced bins:

The majority of seen numbers are seen very few times. I think the straight trend here implies number frequency is Pareto-distributed.
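A toy sketch of that binning, with synthetic heavy-tailed counts in place of the real per-number occurrence counts:

```python
import numpy as np

# Synthetic heavy-tailed occurrence counts (one count per distinct number).
rng = np.random.default_rng(0)
counts = rng.pareto(1.0, size=100_000).astype(int) + 1

# Exponentially spaced bin edges from 1 up to the maximum count.
edges = np.logspace(0, np.log10(counts.max() + 1), num=30)
hist, _ = np.histogram(counts, bins=edges)

for lo, hi, n in zip(edges[:-1], edges[1:], hist):
    print(f"{lo:12.0f} .. {hi:12.0f}: {n}")
```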

We can also look at properties of the numbers themselves, rather than just their frequencies, since they're less opaque than words. The most obvious one is their size (absolute value). Below $10^0$ (i.e. 1), the results are somewhat unreliable, because percentages are parsed but other units, fractions and scientific notation are not. Regardless:

I am not sure what causes the spikiness – possibly numerical issues.
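The magnitude distribution can be computed straight from the (number, count) pairs; a toy sketch, with a few illustrative pairs (partly from the table above, partly invented) in place of the released data:

```python
import numpy as np

# Illustrative (value, count) pairs standing in for the released number data.
pairs = [(1.0, 1_285_091_487), (2.0, 1_245_931_860), (0.5, 12_345), (1e6, 678)]

values = np.array([v for v, _ in pairs])
weights = np.array([c for _, c in pairs], dtype=float)

# Drop exact zeros before taking logs; zero has no magnitude on a log scale.
nonzero = values != 0
log_mag = np.log10(np.abs(values[nonzero]))

# Histogram of magnitudes, weighted by how often each number occurs.
hist, edges = np.histogram(log_mag, bins=np.arange(-6, 13), weights=weights[nonzero])
for lo, n in zip(edges[:-1], hist):
    print(f"10^{int(lo):+d} .. 10^{int(lo) + 1:+d}: {n:.0f}")
```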

By sorting the numbers, I also determined that the median number is 7, plus or minus some roundoff error (conversion to 64-bit floating point loses some precision on arbitrarily long decimal strings). I also have the frequency of small integers (0 to 100) and some plausible year numbers.
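Taking the median over all instances (rather than distinct numbers) amounts to a weighted median over the (value, count) pairs; a toy sketch of that computation (if it's over distinct numbers instead, drop the weighting):

```python
import numpy as np

# Toy (value, count) pairs; the real computation would use the full dump.
pairs = [(1.0, 5), (2.0, 4), (7.0, 10), (100.0, 3)]

values = np.array([v for v, _ in pairs])
counts = np.array([c for _, c in pairs], dtype=np.int64)

# Sort by value, then walk cumulative counts until half of all instances are covered.
order = np.argsort(values)
cumulative = np.cumsum(counts[order])
median_index = np.searchsorted(cumulative, cumulative[-1] / 2)
print("weighted median:", values[order][median_index])  # 7.0 for this toy data
```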

Spikes mostly correspond to "round numbers" in base 10 or 2 (marked with x-axis ticks).

The dataset was compiled in 2020, and I suppose it contains less forward-looking content than backward-looking.

Finally, first digits. I get results quite close to Benford's law across this dataset, though not a perfect fit. (This discards anything which begins with 0.)

"Noninteger" means I excluded every number without a . indicating fractional part.

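A toy sketch of the first-digit tally against Benford's prediction $\log_{10}(1 + 1/d)$, discarding anything whose leading character is 0:

```python
import math
from collections import Counter

# Toy (value, count) pairs standing in for the released number data.
pairs = [(1.0, 120), (23.0, 45), (0.047, 30), (912.0, 8), (1.5, 60)]

digit_counts = Counter()
for value, count in pairs:
    text = f"{abs(value):.15g}"
    if text[0] == "0":
        continue  # discard anything beginning with 0
    digit_counts[text[0]] += count

total = sum(digit_counts.values())
for d in "123456789":
    observed = digit_counts[d] / total
    benford = math.log10(1 + 1 / int(d))
    print(f"{d}: observed {observed:.3f}, Benford {benford:.3f}")
```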
In the future, it might be "useful" to see how well you can predict number popularity with a linear model based on (some transforms of) magnitude, sign, number of trailing zeroes and first digit.
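A toy sketch of what that might look like (the feature transforms here are guesses, not anything I've actually run), fitting log-popularity by least squares:

```python
import numpy as np

def features(value: float) -> list[float]:
    # Guessed feature transforms: log-magnitude, sign, trailing zeros, first digit.
    text = f"{abs(value):.15g}"
    magnitude = np.log10(abs(value)) if value != 0 else 0.0
    sign = float(np.sign(value))
    trailing_zeros = len(text) - len(text.rstrip("0")) if "." not in text else 0
    first_digit = float(next((c for c in text if c.isdigit() and c != "0"), "0"))
    return [magnitude, sign, trailing_zeros, first_digit, 1.0]  # trailing 1.0 = intercept

# A few (value, count) pairs from the table above; the real fit would use all of them.
pairs = [(1.0, 1_285_091_487), (2.0, 1_245_931_860), (10.0, 204_988_418),
         (-1.0, 70_044_908), (100.0, 50_744_706)]

X = np.array([features(v) for v, _ in pairs])
y = np.log(np.array([c for _, c in pairs], dtype=float))  # predict log popularity

coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print("coefficients:", coef)
```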


  1. These have very nonstandard formats. I don't know how to do this without writing hundreds of separate regexes. Already the parser is convoluted enough due to trying to normalize numbers. ↩︎

  2. 1% is counted as separate from 0.01 for reasons, so this is a slight overestimate. ↩︎

  3. In some sense this is undercounting because "a" and "the" refer to one thing as well. ↩︎
