A new elementary proof of the Prime Number Theorem

  1. A bit of history on the Prime Number Theorem

One of the most fundamental results in mathematics is the Prime Number Theorem, which describes the asymptotic law of the distribution of prime numbers in the integers.

Theorem 1 Let {\pi(N)} denote the number of primes smaller or equal to a positive integer {N}. Then we have
\displaystyle \lim_{N\rightarrow\infty}\frac{\pi(N)}{{N}/{\log N}}=1. \ \ \ \ \ (1)

Roughly speaking, Theorem 1 states that any natural number, taken at random, has probability {1/\log N} of being prime and {1-1/\log N} of being composite.

A stronger form of Theorem 1 is the following.

Theorem 2 There exists a constant {c>0} such that for any {N\geq 2} we have
\displaystyle \pi(N)=\int_{2}^{N}\frac{dt}{\log t}+O\bigg(N\exp\bigg(-c(\log N)^{3/5}(\log\log N)^{1/5}\bigg)\bigg). \ \ \ \ \ (2)

Conditionally on the well-known Riemann Hypothesis we have instead that for any {N\geq 2},

\displaystyle \pi(N)=\int_{2}^{N}\frac{dt}{\log t}+O(N^{1/2}\sqrt{\log N}). \ \ \ \ \ (3)

Roughly speaking, the above conjecture suggests that the prime numbers should behave as randomly as possible, indeed as a simple random walk process would behave. We will justify better this behaviour later on.
The Prime Number Theorem was conjectured independently by Gauss and Legendre towards the end of the 18{^{\text{th}}} century and was proved independently by Hadamard and de la Vallee Poussin in the year 1896. Their proofs are similar and rely on sophisticated analytic machinery involving complex analysis, which was developed throughout the 19{^{\text{th}}} century by the combined effort of many great mathematicians of that era, including Euler, Chebyshev, and Riemann (to name a few). Historically, this method became known as analytic.

Even though it was believed for a long time not to be possible, more than 50 years after the analytic proof of the Prime Number Theorem was discovered, an elementary proof was found by Erdos and Selberg. In this context, elementary refers to methods that avoid using complex analysis (and the study of the distribution of the zeros of meromorphic extensions of some complex functions) and instead rely only on rudimentary facts from calculus and basic arithmetic identities and inequalities.

  1. Reformulation of the Prime Number Theorem

Let us now rephrase the statement of the Prime Number Theorem in a way more feasible to an elementary approach. To this aim, we define the function

\displaystyle M(x)=\sum_{n\leq x}\mu(n)

where \mu(n)=(-1)^{\omega(n)} when n is squarefree and 0 otherwise, where {\mu(n)} is called the Mobius function and we define {\omega(n)} to be the number of prime factors of {n} and finally we say that {n} is squarefree when it is not divided by any square of a positive integer number.
We can now rewrite Theorem 1 and Theorem 2 in the following equivalent forms.

Theorem 3 We have
\displaystyle M(x)=o(x), \ \ \ \ \ (4)

as {x\rightarrow \infty}.
Theorem 4 There exists a constant {c>0} such that for any {x\geq 2} we have
\displaystyle M(x)=O\bigg(x\exp\bigg(-c(\log x)^{3/5}(\log\log x)^{1/5}\bigg)\bigg). \ \ \ \ \ (5)

Also the previous conjecture on {\pi(N)} can be restated in the following (almost) equivalent form: for any {\varepsilon>0} we have

\displaystyle M(x)=O_{\varepsilon}(x^{1/2+\varepsilon}). \ \ \ \ \ (6)

The above reflects better the aforementioned random walk behaviour of the primes. If we consider {\mu(n)} as a simple random walk on the integers, with probability {1/2} to be {1} (when {n} has an even number of prime factors) and probability {1/2} to be {-1} (when {n} has an odd number of prime factors), we have that the discrepancy between the empirical average of {\mu(n)} and its mean value (the limit being {0}) is at most the square root of the variance. Indeed, one can show that
\displaystyle \sum_{n\leq x}\mu(n)^2\sim ax,

for a certain constant {a>0}.
Let us write {[N]} to abbreviate the set {{1,\ldots,N}}. Given a finite set {A\subset {\mathbb N}} and a function {f\colon A\rightarrow{\mathbb C}}, define the Cesaro of {f} over {A} and the logarithmic average of {f} over {A} respectively by

\displaystyle \mathbb{E}_{n\in A}f(n):= \frac{1}{|A|}\sum_{n\in A} f(n)

\displaystyle \mathbb{E}_{n\in A}^{\log}f(n):=\frac{\sum_{n\in A} \frac{f(n)}{{n}}}{\sum_{n\in A}\frac{1}{n}}.

Thus, we would like to show that:

\displaystyle \lim_{N\rightarrow \infty}\mathbb{E}_{n\in [N]}\mu(n)=0.

  1. The distribution law of the number of prime factors

Our final aim is to count the number of prime numbers up to a certain large threshold {s}. Let us first see what we can say about the number of primes up to {s} dividing a fixed integer {n}, i.e.

\displaystyle \sum_{\substack{p\leq s\ p|n}}1,

where we will always indicate a prime number with the letter {p}.
On average over {n\leq N} we have

\displaystyle \sum_{n\leq N}\sum_{\substack{p\leq s\ p|n}}1=\sum_{\substack{p\leq s}}\sum_{\substack{n\leq N\ p|n}}1=\sum_{p\leq s}\bigg\lfloor \frac{N}{s}\bigg \rfloor=N\sum_{p\leq s}\frac{1}{p}+O(s),

which is efficient only when {N} is much larger than {s}. Thus by dividing through by {N} and considering {N} as tending to infinity, we could conclude that, for a large {s} and “almost all” {n\in\mathbb{N}} we have:
\displaystyle \sum_{\substack{p\leq s\ p|n}}1\approx \sum_{p\leq s}\frac{1}{p}.

A more precise and technical statement is the following. For all {\varepsilon>0} there exists {s_0=s_0(\varepsilon)>0} such that if {s>s_0} we have
\displaystyle \limsup_{N\rightarrow \infty}\mathbb{E}_{n\in [N]}\bigg|\mathbb{E}_{p\in \mathbb{P}\cap [s]}^{\log}(1-p\textbf{1}_{p|n})\bigg|\leq \varepsilon, \ \ \ \ \ (7)

where {\mathbb{P}} is the set of prime numbers.

  1. The main idea behind this new proof

The new idea is to compare the average of the Mobius function with dilated averages of the Mobius function itself with primes or almost primes. Those dilated averages are generally defined as the average:

\displaystyle \mathbb{E}_{n\in B}\mu(n),

over a finite non-empty set {B}, which will be mainly taken as the set of all multiples of a fixed prime or a fixed {k-} almost prime, which in turn is defined as the product of {k} distinct primes.
More specifically, we are looking to estimate the following quantity:

\displaystyle \mathbb{E}_{n\in [N]}\mu(n)-\mathbb{E}_{m\in B_i}^{\log}\mathbb{E}_{n\in [N/m]}\mu(mn), \ \ \ \ \ (8)

where {B_i} is a finite set containing either prime numbers or {k}-almost prime numbers. To this aim we can rewrite the last average as
\displaystyle \mathbb{E}_{n\in [N/m]}\mu(mn)=\frac{\sum{l\leq N}\textbf{1}{m|l}\mu(l)}{N/m+O(1)}=\frac{m}{N}\sum{l\leq N}\textbf{1}{m|l}\mu(l)(1+O(m/N))=\mathbb{E}_{l\in [N]}m\textbf{1}_{m|l}\mu(l)+O(1/N),

where we used that {B_i} is finite and that of course
\displaystyle \sum_{n\leq x}\mu(n)=O(x).

Plugging this in (8) gives

\displaystyle \bigg|\mathbb{E}_{n\in [N]}\mu(n)-\mathbb{E}_{m\in B_i}^{\log}\mathbb{E}_{n\in [N/m]}\mu(mn)\bigg|

\displaystyle \leq \bigg|\mathbb{E}_{n\in [N]}\mu(n)-\mathbb{E}_{m\in B_i}^{\log}\mathbb{E}_{l\in [N]}m\textbf{1}_{m|l}\mu(l)\bigg|+O(1/N)

\displaystyle \leq \mathbb{E}_{n\in [N]}\bigg|\mathbb{E}_{m\in B_i}^{\log}(1-m\textbf{1}_{m|n})\bigg|+O(1/N).

To conclude from here we will need to find suitable sets {B_i} for which an opportune variation of (7) holds.

  1. The distribution of the number of divisors of a fixed integer

An important role in this proof of the Prime Number Theorem will be played by a variant of (7), which is provided in the following result.

Theorem 5 Let {B\subset[N]} be finite and non-empty. Then
\displaystyle \limsup_{N\rightarrow\infty}\mathbb{E}_{n\in [N]} \left|\mathbb{E}_{m\in B}^{\log}\big(1- m\textbf{1}{m|n}\big)\right|\leq\left( \mathbb{E}_{m\in B}^{\log}\mathbb{E}_{n\in B}^{\log}\Phi(n,m) \right)^{1/2}, \ \ \ \ \ (9)

where {\Phi: \mathbb{N}\times \mathbb{N}\rightarrow \mathbb{N}\cup{0}} is the function {\Phi(m,n):=\gcd(m,n)-1}.
Proof: Using the Cauchy-Schwarz inequality and expanding the square gives

\displaystyle \left(\mathbb{E}_{n\in [N]}\bigg|\mathbb{E}_{m\in B}^{\log}(1- m \textbf{1}{m|n})\bigg|\right)^2 \leq\mathbb{E}_{n\in [N]} \left|\mathbb{E}_{m\in B}^{\log}(1- m \textbf{1}{m|n})\right|^2

\displaystyle = S_1 -2 S_2+1,\ \ \ \ \ (10)

where
\displaystyle S_1:=\mathbb{E}_{n\in [N]}\mathbb{E}^{\log}_{l,m\in B} (l \textbf{1}{l| n})(m\textbf{1}{m| n})

and
\displaystyle S_2:=\mathbb{E}_{n\in [N]} \mathbb{E}^{\log}_{m\in B} m \textbf{1}_{m| n}.

Since every {m}-th number is divisible by {m}, we have
\displaystyle \mathbb{E}_{n\in [N]} m\textbf{1}{m|n}=1+O(1/N).

This implies
\displaystyle S_2 = 1+ O(1/N).

Similarly, one can verify that
\displaystyle \mathbb{E}_{n\in [N]} lm \textbf{1}{l| n}\textbf{1}_{m|n} = \gcd(l,m) + O(1/ N),

which gives \displaystyle S_1=\mathbb{E}_{l\in B}\mathbb{E}_{m\in B}^{\log} \gcd(l,m)+O(1/N)=1+\mathbb{E}_{l\in B}^{\log}\mathbb{E}_{m\in B}^{\log} \Phi(l,m) +O(1/N).

Substituting {S_2 = 1+ O(1/N)} and {S_1 = 1+ \mathbb{E}^{\log}_{l\in B}\mathbb{E}^{\log}_{m\in B} \Phi(l,m) +O(1/N)} into (10) finishes the proof of (9). \Box
In layman’s terms, the previous result says that {B} is good for a variation of (7) to hold if two integers {n} and {m} chosen at random from {B} have a “high likelihood” of being coprime.

The usefulness of the Theorem 5 is to reduce the task of finding sets for which a variation of (7) holds to the easier task of exhibiting sets for which {\mathbb{E}_{n\in B}^{\log}\mathbb{E}_{m\in B}^{\log}\Phi(m,n)} is small.

For example, the initial segment of the set of {k}-almost primes

\displaystyle \mathbb{P}_k:=\{n\in \mathbb{N}: \omega(n)=k\}

has this property.
Also note that {\mathbb{E}_{n\in \mathbb{P}_k\cap [s]}\mathbb{E}_{m\in \mathbb{P}_k\cap [s]}\Phi(m,n)\rightarrow \infty}, as {s\rightarrow \infty.} Therefore, we really need to use the logarithmic average to hope to realize variations of (7).

As we already said above, another important task is to find suitable choices of sets {B} for which the right hand side of (9) is small. Those are provided in the following result.

Theorem 6 For all {\eta>0} there exist {\rho\in (1,1+\eta]} and {k_0\in{\mathbb N}} such that for all {k\geq k_0} and all {l\geq 1} there exist two finite, non-empty sets {B_1,B_2\subset{\mathbb N}} with the following properties:


-all elements in {B_1} are primes larger than {l} and all elements in {B_2} are a product of {k} distinct primes larger than {l};


-the sets {B_1} and {B_2} have the same cardinality when restricted to {\rho}-adic intervals, by which we mean {|B_1\cap (\rho^j,\rho^{j+1}]|=|B_2\cap (\rho^j,\rho^{j+1}]|} for all {j\in{\mathbb N}\cup{0}};


{\mathbb{E}^{\log}_{m\in B_i}\mathbb{E}^{\log}_{n\in B_i} \Phi(m,n)\leq \eta} for {i=1,2}, where {\Phi} is as in Theorem 5.


We do not insert here the proof of the above result, because it is quite technical. Despite this, it should be remarked that it is a beautiful combinatorial argument, which basically relies on the so called “Chebyshev-type estimates” on the number of prime numbers in long intervals. Those are usually deducted form Stirling’s formula and some double counting-type argument and are employed in the proof of the Theorem 6 to estimate the number of primes in exponentially short intervals, by means of some applications of the pigeon-hole principle.

  1. The main proof

Fix an arbitrary {\eta>0}. Let {\rho\in (1,1+\eta]} and {k_0\in{\mathbb N}} be as guaranteed by Theorem 6. Let {k} be an arbitrary even number bigger or equal to {k_0} and let {l\in{\mathbb N}} be any number which the reader should think of as being much larger than {k}. By Theorem 6, we can find two finite and non-empty sets {B_1,B_2\subset{\mathbb N}} satisfying the properties in Theorem 6. In particular, for {i\in{1,2}}, thanks to the last property in Theorem 6, (4) becomes \displaystyle \left|\mathbb{E}_{n\in [N]}\mu(n)-\mathbb{E}_{m\in B_i}^{\log}\mathbb{E}_{n\in [N/m]}\mu(mn)\right|\leq \sqrt{\eta}+O(1/\sqrt{N}).

This implies

\displaystyle \mathbb{E}_{n\in [N]}\mu(n)=\mathbb{E}_{m\in B_i}^{\log}\mathbb{E}_{n\in [N/m]}\mu(mn)+O(\eta+1/\sqrt{N}), \qquad i=1,2. \ \ \ \ \ (11)

Since any element {m\in B_i} has at most {k} prime factors, each of which is bigger than {l}, the asymptotic density of numbers that are not coprime to {m} is very small; in fact it is smaller than {k/l}. Indeed,
\displaystyle \sum_{\substack{n\leq N\ gcd(n,m)\neq 1}}1\leq \sum_{p|m}\sum_{\substack{n\leq N\ p|n}}1\leq N\sum_{p|m}\frac{1}{p}\leq N\frac{k}{l}.

Combining this observation with the multiplicativity of the Mobius function, i.e. {\mu(mn)=\mu(n)\mu(m)} whenever {m} and {n} are coprime, we deduce that
\displaystyle \mathbb{E}_{n\in[N/m]}\mu(mn)= \mathbb{E}_{n\in[N/m]}\mu(m)\mu(n)+O(k/l).

Note that {\mu(m)=-1} for all {m\in B_1} because {B_1} consists only of primes, whereas {\mu(m)=1} for all {m\in B_2} because any element in {B_2} is a product of {k} distinct primes and we choose {k} to be even. Thus, (11) becomes
\displaystyle \mathbb{E}_{n\in [N]}\mu(n)=-\mathbb{E}_{m\in B_1}^{\log}\mathbb{E}_{n\in [N/m]} \mu(n)+O(\eta+k/l+1/\sqrt{N}) \ \ \ \ \ (12)

in the case {i=1} and
\displaystyle \mathbb{E}_{n\in [N]}\mu(n)=\mathbb{E}_{m\in B_2}^{\log}\mathbb{E}_{n\in [N/m]} \mu(n)+O(\eta+k/l+1/\sqrt{N}) \ \ \ \ \ (13)

in the case {i=2}. Finally, using that {B_1} and {B_2} have the same size within {\rho}-adic intervals and that {\rho} is very close to {1}, we conclude that the expressions {\mathbb{E}_{m\in B_1}\mathbb{E}_{n\in [N/m]} \mu(n)} and {\mathbb{E}_{m\in B_2}\mathbb{E}_{n\in [N/m]} \mu(n)} are approximately the same; in fact,
\displaystyle \mathbb{E}_{m\in B_1}^{\log}\mathbb{E}_{n\in [N/m]}\mu(n)=\mathbb{E}_{m\in B_2}^{\log}\mathbb{E}_{n\in [N/m]}\mu(n)+O(\eta). \ \ \ \ \ (14)

Indeed, observe that if {m_1} and {m_2} lie in the same {\rho}-adic interval, that is {m_1,m_2\in (\rho^{j},\rho^{j+1}]}, then the averages {\mathbb{E}_{n\in [N/m_1]}\mu(n)} and {\mathbb{E}_{n\in [N/m_2]}\mu(n)} differ by a factor of at most {\rho}. This implies that the logarithmic average of {\mathbb{E}_{n\in [N/m]}\mu(n)} as {m} ranges over {B_1} differs from the logarithmic average of {\mathbb{E}_{n\in [N/m]}\mu(n)} as {m} ranges over {B_2} by a factor of at most {\rho}, because {B_1} and {B_2} have the same size within every {\rho}-adic interval. Since {1<\rho\leq 1+\eta}, we deduce (14).
From (12), (13) and (14) it follows that

\displaystyle \mathbb{E}_{n\in [N]}\mu(n)=O(\eta+k/l+1/\sqrt{N}).

Since {\eta} and {k/l} can be made arbitrarily small, this completes the proof of the Theorem 1.

  1. Final considerations

The above proof has been taken from the F.K.Richter’s paper “A new elementary proof of the Prime Number Theorem”.

The proof is inspired by recent developments towards the well-known Chowla’s conjecture, which is a sort of high-dimensional generalization of the Prime Number Theorem. It states that for any {k}-tuple of distinct positive integers {h_1,\dots, h_k} we have

\displaystyle \mathbb{E}_{n\in [N]}\mu(n+h_1)\mu(n+h_2)\cdots\mu(n+h_k)=o(N)

as {N\rightarrow\infty}. Recently, some advances in its direction have been made by proving some cases of the related Chowla’s logarithmic conjecture. In particular, for certain {k\in\mathbb{N}} it was proved that
\displaystyle \mathbb{E}_{n\in [N]}^{\log}\mu(n+h_1)\mu(n+h_2)\cdots\mu(n+h_k)=o(1).

Some of the ideas, techniques and results contained in this new proof of the Prime Number Theorem have been expanded further in other works of Richter and some of his collaborators to analyse and prove other results in number theory via a dynamical and ergodic point of view.

The main aim of their work is indeed to recast Sarnak’s Mobius disjointness conjecture in a new dynamical framework and provide proofs of several special cases to one of its natural extensions, i.e. to the disjointness of additive and multiplicative semigroup actions.

We remind here that Sarnak’s conjecture states that for a dynamical system {(X,T)}, with {X} a compact metric topological space with {0} topological entropy, we need to have

\displaystyle \sum_{n\leq x}\mu(n)F(T^n(y))=o(x)

for any {y\in X}, {F:X\rightarrow \mathbb{C}} continuous function and considering {x} as tending to infinity. The above is sometimes considered as the property of the Mobius function to being orthogonal to any arithmetic function of low complexity.

Sharp bounds for the exponential sum of a polynomial

1. A theorem on the growth of exponential sums evaluated at polynomials

Let {d\geq 2} be a positive integer and fix {\underline{\alpha}=(\alpha_1,\dots,\alpha_d)\in [0,1]^d}. We are looking at bounds for

\displaystyle S_d(\underline{\alpha}, N):=\sum_{n\leq N}e(\alpha_1n+\cdots +\alpha_d n^d)

where {N} is a large number. The main theorem in this regard is the following.

Theorem 1 

Let {\epsilon>0}. Then

\displaystyle |S_d(\underline{\alpha}, N)|\leq N^{1/2+\epsilon},\ \ \ \textrm{as}\ N\longrightarrow +\infty

for all {\alpha\in [0,1]^d} outside a set of measure tending to 0 with N.

2. A probabilistic model

We will prove such theorem by showing that for any {\epsilon>0} the above exponential sum will have size bigger than N^{1/2+\epsilon} only on a set of measure tending to 0 with N.

Indeed, suppose that {\underline{\alpha}} is now being chosen at random in {[0,1]^d}, where the space is equipped with the uniform probability on {\mathbb{R}^d}. Let

\displaystyle X_{d,N}=\frac{|S_d(\underline{\alpha}, N)|^2}{N}.

By expanding the square and using the formula

{\int_{0}^{1}e(\alpha n)d\alpha=1} if {n=0} and {0} otherwise,

we may easily deduce that

\displaystyle \mathbb{E}[X_{d,N}^2]=\frac{\#\mathcal{S}(d,N)}{N^2},

where {S(d, N)} is the set of positive integers solution to the system of equations given by

\displaystyle \left\{ \begin{array}{lll} x_1+x_2=x_3+x_4;\\ \cdots + \cdots = \cdots + \cdots;\\ x_1^d+x_2^d=x_3^d+x_4^d.\end{array} \right.

This number is actually usually called Vinogradov’s mean value and denoted by {J_{2,d}(N)}. Its analytical representation is indeed

\displaystyle J_{2,d}(N)=\int_{[0,1]^d}|\sum_{n\leq N}e(\alpha_1 n+\cdots \alpha_d n^d)|^{4} d\alpha_1\cdots d\alpha_d.

By a well known theorem of Bourgain, Demeter and Guth we have for any {\epsilon>0} and {N\geq 2}

\displaystyle \#S(d, N)\ll_{d,\epsilon}N^{2+\epsilon}+N^{4-\frac{d(d+1)}{2}+\epsilon}.

However, here the situation is much easier and we do not need to appeal to such an important theorem. Indeed, since d\geq 2, the system has only trivial solutions. More precisely, once fixed a solution (x_1, x_2, x_3,x_4) all the other solutions are just a rearrangement of this tuple. One can see this by noticing that the solutions to the system

\displaystyle \left\{ \begin{array}{lll} x_1+x_2=n_1;\\ \cdots + \cdots = \cdots;\\ x_1^d+x_2^d=n_d.\end{array} \right.

with n_1, \dots, n_d positive integers determine via Newton’s identities all the elementary symmetric functions in x_1\dots x_d. Thus any other solution of it will just be a permutation of the roots of the polynomial (X-x_1)(X-x_2). We conclude that in such case \#S(d,N) =N^2(2+o(1)).

Thus, since {d\geq 2} we get {\mathbb{E}[X_{d,N}^2]\ll 1.} Since {\mathbb{E}[X_{d,N}]=0}, which is equivalent to {V(X_{d,N})=\mathbb{E}[X_{d,N}^2]}, by Chebyshev’s inequality we get

\displaystyle P(|X_{d,N}|>\eta)\ll\frac{1}{\eta^2},\ \textrm{for\ any}\ \eta>0,

which leads to

\displaystyle P(|S_d(\underline{\alpha},N)|>\sqrt{N\eta})\ll \frac{1}{\eta^2}.

Taking now {\eta=N^{2\epsilon}} and by the arbitrariness of {\epsilon} we deduce that {|S_d(\underline{\alpha},N)|\leq N^{1/2+\epsilon}} for all {d\geq 2} and all {\alpha\in[0,1]^d}, apart from those in a certain set of measure at most {N^{-4\epsilon}}, tending to {0} with {N}.

3. Some further considerations

Remark 1 Considering higher moments of {X_{d,N}} is it possible to prove for instance that

\displaystyle P(|S_d(\underline{\alpha},N)|>1.01\sqrt{N})>0,

or equivalently, {S_d(\underline{\alpha},N)} deviates a bit from its mean value.

Moreover, as {d,N\longrightarrow \infty} we also have {X_{d,N}\xrightarrow{\mathscr{L}}Exp(1)}, as convergence in law.

Some natural questions now arises:

Can the above procedures be made effective?

Can one do better by considering different scales?

Regarding the last one, we just point out that the random variables {X_{d,N}} behave like a random walk. Therefore, under some further assumptions one could use the iterated law of logarithms to deduce that

\displaystyle \limsup_{N\longrightarrow\infty}\frac{X_{d,N}}{\sqrt{N\log\log N}}=1\ \textrm{almost\ surely}.

Acknowledgement: The above discussion is a slightly enlarged version of what was presented by Sam Chow during one of the usual Number Theory meetings at Warwick.

How far off is Cauchy-Schwarz?

Cauchy-Schwarz is both a very simple and a very powerful bound. We won’t even try to give an extensive list of possible applications, but concentrate instead on a classical one in probability theory.

First notice that Cauchy-Schwarz can simply be interpreted as a fancy way of saying that |\cos(x)|\leq 1 for all real x by the following identity:

|\langle a,b\rangle| = |\cos(\theta)|\|a\|\|b\| \leq \|a\|\|b\|,

where \theta=\measuredangle(a,b) is the angle between a and b.

In several scenarios the dimension of the underlying vector space is very large or infinite; think of random matrix theory, where one typically looks at sequences of vector spaces with dimension going to infinity, or function spaces. We will treat these two cases as prototypical examples of (I) analysis in terms of some diverging parameter and (II) infinite dimensional spaces, where the correlation between values plays an important role.

(I) On \mathbb C^n with n\gg 1, how much off is Cauchy-Schwarz “typically”?

The intuition is that in higher dimensions it becomes increasingly unlikely for two vectors to be approximately parallel (i.e. \theta\approx 0 or $latex\theta\approx \pi$ in which case Cauchy-Schwarz is sharp) since there are “too many directions”. To answer question (I) it is instructive to consider a pair of independent vectors, chosen uniformly at random from the set of unit vectors.

By rotational symmetry we may fix a to be (1,0,0,\dots,0) and choose b uniformly at random from the set of unit vectors. Clearly its distribution is invariant under permuting indices, hence we expect (because of the constraint of having norm 1) that each entry will be of order n^{-\frac{1}{2}} and thus \mathbb E|\langle a,b\rangle|=O(n^{-\frac{1}{2}}).

While this tells us that Cauchy-Schwarz is off by a factor of O(\sqrt n) (note that by assumption \|a\|\|b\|=1 deterministically), it is not very illuminating. It is a good exercise to write out a proof not using the above symmetries — using the central limit theorem and the fact that entries are asymptotically independent as n\to\infty, one then notices that the missing factor of O(\sqrt n) comes from the fact that Cauchy-Schwarz does not account for any cancellations in the inner product.

(II) How much off is (\mathbb Ef)^2\leq \mathbb E[f^2]?

First we establish how this relates to Cauchy-Schwarz. We consider (more general) inner products of the form \langle f,g\rangle_\mu := \int_\mathbb{R} \bar f(x)g(x)d\mu(x), where \mu is some probability measure on \mathbb R and f,g are square-integrable, complex-valued functions. Clearly we recover the usual Euclidean inner product (up to a factor) by choosing \mu to be the normalised counting measure on \{1,2,\dots,n\}\subseteq\mathbb R. Now note that Cauchy-Schwarz gives us that

|\langle \mathbf 1,f\rangle_\mu|^2 = \cos(\theta)^2 \langle f,f\rangle_\mu \leq \langle f,f\rangle_\mu, (*)

where \mathbf 1 is the constant 1 function. But note that the left hand side is just (\mathbb E_\mu f)^2, whereas the right hand side equals \mathbb E_\mu [|f|^2], hence giving the claim in the heading.

It is hard to imagine angles between functions. So we give a different way to “derive” Cauchy-Schwarz here: Recall the formula giving the variance of a function in terms of expectations:

\text{Var}_\mu f = \mathbb E_\mu [f^2]-(\mathbb E_\mu f)^2\geq 0.

Comparing with equation (*) we thus get

|\langle \mathbf 1,f\rangle_\mu|^2 = \langle f,f\rangle_\mu - \text{Var}_\mu f \leq \langle f,f\rangle_\mu.

Side remark (a curious identity): Turning this “additive” bound into a “multiplicative” one (i.e. \langle \mathbf 1,f\rangle_\mu^2 = \langle f,f\rangle_\mu (1-\frac{\text{Var}_\mu f}{\langle f,f\rangle_\mu})) and comparing with the cosine formulation from before, we notice (using \sin^2+\cos^2 = \mathbf 1) that \frac{\text{Var}_\mu f}{\langle f,f\rangle_\mu} = \sin(\theta)^2. (/end of side remark)

We conclude with the heuristic for a fixed(!) function f:\|f\|_\mu=1: “The smaller its variance, the better the bound (\mathbb Ef)^2\leq \mathbb E[f^2]. But it still remains to see how big the variance “typically” is. Now this “typical” will depend on the application one has in mind; the analogy to the previous section would be to choose f randomly as white noise, but this is not (most of the time) not what one has in mind when thinking about a “uniformly random” function. Depending on which field one comes from, the “canonical” choice might be chosen from a smoother family of functions, like Brownian motion on the interval [0,1]; it is not hard to convince oneself that “more smoothness” will typically decrease the (now itself random!) \text{Var}_\mu f. This can be made rigorous using Poincaré-type inequalities.

Positive proportion of short intervals with a given number of primes

Remark 1 (Disclaimer) This article has been written for my own reference in preparation to my talk in occasion of the RNTA5 conference.

1. Introduction

The distribution of the prime numbers is very chaotic and we do not have efficient rules to exaclty find the {n}-th prime {p_n} as a function of {n}, whereas, thanks to the prime number theorem, we can certainly approximate the {n}-th prime with a suitable function of {n} (i.e., {p_n\sim n\log n}). Indeed, most of the cryptography is based on the difficulty to factorize a number into its prime factors.

In particular, the distribution of gaps between primes behave like a random object, especially if we look at small scales, i.e. at gaps between consecutive primes smaller than the average gap (which by the prime number theorem is {\log n} for {p_{n+1}-p_n}).

For this reason a well-studied problem was the following: fix {m\geq 1} a positive integer and {\lambda> 0} a positive real number. How many short intervals of the form {[n,n+\lambda\log n]} contain at least {m} primes?

In his breakthrough paper on bounded gaps between primes, James Maynard proved that the answer is: infinitely many. In particular, if we set {m=2}, we can be more specific and state that there exist infinitely many intervals of length at most {246} containing at least {2} primes.

Afterwards, Maynard improved his result, allowing a certain rate of uniformity, showing a more precise estimate:

\displaystyle |\{n\leq x: |[n,n+\lambda\log n]\cap \mathbb{P}|\geq m\}|\gg_{m}x, \ \ \ \ \ (1)

if {0<\lambda<\epsilon(m)}. The method used is a variation of a well-known sieve method, namely the GPY’s sieve.

These statements do not preclude the possibility that there are choices of {\lambda} and {m} for which the intervals {[n,n+\lambda\log n]} contain exactly {m} primes, for at most finitely many {n}. Anyway, this was proven to be not the case by Freiberg, who showed the following more precise result. For any positive real number {\lambda} and any non-negative integer {m}, we have

\displaystyle |\lbrace n\leq x: |[n,n+\lambda\log n]\cap \mathbb{P}|= m\rbrace|\geq x^{1-\varepsilon(x)}, \ \ \ \ \ (2)

if {x} is sufficiently large in terms of {\lambda} and {m}, and {\varepsilon(x)} is a certain function that tends to zero as {x} tends to infinity. We may take {\varepsilon(x)=(\log\log\log\log x)^{2}/\log\log\log x}.

Since we are working with very short intervals we expect that an application of the well-known Cramér model for the distribution of prime numbers would give the exact answer for the distribution of such intervals. Therefore, heuristically we expect the above cardinality to be

\displaystyle |\lbrace n\leq x: |[n,n+\lambda\log n]\cap \mathbb{P}|= m\rbrace|\backsim \frac{\lambda^{m} e^{-\lambda}}{m!}x,

for every {\lambda} and {m}.

The idea behind Freiberg’s result is that we can make use of the Maynard’s sieve method to find clusters of consecutive primes inside particular sets and then construct short intervals of specific form around them. Indeed, the Maynard’s sieve method allows us to show that any subset of the primes, which is well distributed in arithmetic progressions (e.g., that verifies a variant of the prime number theorem and of the Bombieri-Vinogradov’s theorem), contains many elements that are close together. The work of Freiberg showed that the subset of primes which belongs to the image of certain admissible sets of linear functions, i.e.

\displaystyle \mathbb{P}\cap \{gn+h_1,gn+h_2,\dots,gn+h_k\}

is well distributed in arithmetic progressions and is suitable for the application of Maynard’s results. More specifically, when considering {n\leq x}, we ask for a suitable {g} very large in terms of {x} and an admissible set {\mathcal{H}=\{h_1,h_2,\dots, h_k\}} with {0\leq h_1<h_2<\cdots<h_k\leq \log x}. We recall that a set {\mathcal{H}} is admissible if for any prime {p} the set of all residue classes {h_1,h_2,\dots,h_k} modulo {p} do not cover all the classes mod {p}. Next, the Maynard’s result tells us that the number of integers {n\leq x} for which there are at least {m} primes among {gn+h_1,gn+h_2,\dots,gn+h_k} is at least of order {x/(\log x)^k}. A combinatorial process is hence used to detect exactly {m} primes among them that are contained in our selected set of intervals. The idea indeed, is to start with a suitable interval {I_{j_{0}}} of the form {I_j:=[gn+j,gn+j+\lambda\log(gn+j)]\subset I:=[gn,gn+5\lambda\log x]}, which contains all the primes (at least {m}) inside {I}. Next, we note that there exists a suitable {j_{1}\geq j_{0}} such that {I_{j_{1}}\subset I} and {I_{j_{1}}\cap \mathbb{P}=\emptyset}. Considering the sequence {\{I_{j}\}_{j_0\leq j\leq j_{1}}}, it is easy to prove that for at least one {\tilde{j}\in[j_{0},j_{1}]} we must have {|I_{\tilde{j}}\cap \mathbb{P}|=m}. Collecting the various intervals {I_{\tilde{j}}} found in this way and using the estimates contained in the Maynard’s work, we recover the Freiberg’s result.

The major difference between the work of Freiberg and that one of the author is in the way the needed admissible set of linear forms is generated. In the former case, an Erdhos–Rankin type construction was considered, which allows us to lower bound the density related to each choice of {\lambda} and {m}, choosing a suitable large value {g} and an admissible set {\{h_1,h_2,\dots,h_k\}}. However, this freedom inevitably forces us to lose precision and obtain weak estimates. In the latter case, the set of linear forms was chosen implicitly by means of the Maynard’s sieve, producing in this way better information on the density, allowing to consider {g} of order {\log x}, only for very small values of {\lambda}.

A better exploration of the last aforementioned approach, in which we insert a well-separated admissible set into the uniform Maynard’s estimates, leads us to generate a positive proportion of short intervals containing a prescribed number of primes. The key idea is that at the start of the process we need to select clusters of primes in which the elements are also well-spaced.

From now on, we will indicate with {m} a non-negative integer, with {k} the value {k=C\exp(49m/C')}, for certain suitable constants {C,C'>0}, and with {\lambda} a positive real number smaller than {\varepsilon=\varepsilon(k):=k^{-4}(\log k)^{-2}}. The result is the following.

Theorem 1 (M.) We have

\displaystyle |\lbrace n\leq x: |[n,n+\lambda\log n]\cap \mathbb{P}|= m\rbrace|\gg \lambda^{k+1}e^{-Dk^{4}\log k}x, \ \ \ \ \ (3)

for a certain absolute constant {D>0}, if {x} is sufficiently large in terms of {m} and {\lambda}.

2. Finding many intervals with clusters of well-spaced primes

We define the double sum

\displaystyle S=\sum^{*}_{\mathcal{H}}\sum_{x<n\leq 2x}S(\mathcal{H},n), \ \ \ \ \ (4)

where S(\mathcal{H},n) equals to

\displaystyle\bigg( \sum_{i=1}^{k} \textbf{1}_{\mathbb{P}}(gn+h_{i})-m-k\sum_{i=1}^{k}\sum_{\substack{p|gn+h_{i}\\ p\leq x^{\rho}, p\nmid B}}1 -k\sum_{\substack{ h\leq 5\lambda\log x\\ (h,g)=1\\h\not\in\mathcal{H}}} \textbf{1}_{S(\rho, B)}(gn+h)\bigg)w_{n}(\mathcal{H}). \ \ \ \ \ (5)

Here {\Sigma^{*}_{\mathcal{H}}} means that the sum is over all the admissible sets {\mathcal{H}} such that {0\leq h_{1}<h_{2}<...<h_{k}<\lambda\log x} and {|h_i-h_j|>\frac{\lambda\log x}{C_{0}}}, for any {1\leq i\neq j\leq k}, where {C_0=C_0(k)} a suitable constant. Moreover, we define

\displaystyle S(\rho, B)=\lbrace n\in \mathbb{N}: p|n\Rightarrow (p>x^{\rho}\ \textrm{or}\ p|B)\rbrace.

We consider {\rho} verifying suitable growth conditions to be chosen later, {B} a natural number under certain arithmetic conditions, and take {w_n(\mathcal{H})} as the sieve weights used in the Maynard’s work.

We can prove that

\displaystyle S\ll k(\log x)^{2k} \exp (O(k/\rho))|I(x)|, \ \ \ \ \ (6)

where now the set {I(x)} contains intervals of the form {[gn,gn+5\lambda\log x]}, for {x<n\leq 2x}, with the property that {|[gn,gn+5\lambda\log x]\cap \mathbb{P}|=|\lbrace gn+h_{1},...,gn+h_{k}\rbrace\cap\mathbb{P}|\geq m+1}, for a unique admissible set {\mathcal{H}} such that {0\leq h_{1}<...<h_{k}<\lambda\log x} and {|h_i-h_j|>\frac{\lambda\log x}{C_{0}}}, for any {1\leq i\neq j\leq k}. We recall that the intervals in {I(x)} are pairwise disjoint, if for instance {\lambda<1/5}. The idea is to bound the term in parenthesis with {k} times the indicator function that such term is positive, using the arithmetic definitions of the sums present there to define the type of intervals contained in {I(x)}.

Choosing {k} sufficiently large in terms of {m}, e.g. {k:=C\exp(49m/C')}, {\lambda\leq\varepsilon}, with {\varepsilon=\varepsilon(k):=k^{-4}(\log k)^{-2}}, and {\rho=k^{-3}(\log k)^{-1},} and finally taking {x} and {C} suitably large, we may conclude that

\displaystyle S\gg \lambda^{k} e^{-C_{1} k^{2}}x(\log x)^{2k}, \ \ \ \ \ (7)

for a certain constant {C_{1}>0}.

By combining (6) with (7), we obtain

\displaystyle |I(x)|\gg \lambda^{k}e^{-C_{2}k^{4}\log k}x, \ \ \ \ \ (8)

with an absolute constant {C_{2}>0}.

3. Combinatorial process

Consider an interval {I\in I(x)}. There exist an integer {x<n\leq 2x} and an admissible set {\mathcal{H}}, with {0\leq h_{1}<h_{2}<...<h_{k}< \lambda\log x} and {|h_i-h_j|>\lambda\log x/C_0,} for any {1\leq i\neq j\leq k}, such that {I=[gn, gn+5\lambda\log x]} and

\displaystyle |[gn,gn+5\lambda\log x]\cap \mathbb{P}|=|\lbrace gn+h_{1},...,gn+h_{k}\rbrace\cap\mathbb{P}|\geq m+1.

In order to avoid having a trivial gap between the elements of {\mathcal{H}} we ask for {x} to be sufficiently large with respect to {\lambda} and {k}. Let us define

\displaystyle I_{j}=[N_{j}, N_{j}+\lambda \log N_{j}],\ \ N_{j}=gn+j, \ \ \ \ \ (9)

for {j=0,...,\lfloor \lambda\log N_{0}\rfloor}. We recall here the following properties of the intervals {I_j}:
1) for any such {j} we have {I_{j}\subseteq I};
2) for the choice {j=h_{1}} we find that {I_{j}\cap \lbrace gn+h_{1},...,gn+h_{k}\rbrace=\lbrace gn+h_{1},...,gn+h_{k}\rbrace};
3) for the value {j=\lfloor \lambda\log N_{0}\rfloor} we have {I_{j}\cap\lbrace gn+h_{1},...,gn+h_{k}\rbrace=\emptyset};
4) if {|I_{j}\cap \mathbb{P}|<|I_{j+1}\cap\mathbb{P}|}, for a certain {j}, then {|I_{j+1}\cap\mathbb{P}|=|I_{j}\cap\mathbb{P}|+1.}
Now, let’s define

\displaystyle \tilde{j}:=\max\lbrace 0\leq j\leq \lfloor \lambda\log N_{0}\rfloor: |I_j\cap \mathbb{P}|\geq m+1\rbrace.

Note that we necessarily have {N_{\tilde{j}}=gn+\tilde{j}} being prime. Consequently, this implies {|I_{\tilde{j}+1}\cap \mathbb{P}|=m}, but from our assumption on {\mathcal{H}} it actually derives that

\displaystyle |I_{\tilde{j}+l}\cap \mathbb{P}|=m,\ \textrm{for any}\ 1\leq l\leq \lfloor \lambda\log x/C_0\rfloor. \ \ \ \ \ (10)

4. Final estimate

The information (10) is equivalent to say that we have found {\lfloor \lambda\log x/C_0\rfloor}-different intervals {[N, N+\lambda\log N]} containing exactly {m} primes, with {N<5x\log x}, if {x} is sufficiently large. Together with the lower bound (8), we have obtained that for every {m\geq 1} and for each {0<\lambda\leq\varepsilon},

\displaystyle |\lbrace N\leq 5x\log x: |[N, N+\lambda\log N]\cap\mathbb{P}|=m\rbrace|\gg \frac{\lambda^{k+1}}{k\log k}e^{-C_{2}k^{4}\log k}x\log x, \ \ \ \ \ (11)

which in turn is equivalent to

\displaystyle |\lbrace N\leq X: |[N, N+\lambda\log N]\cap\mathbb{P}|=m\rbrace|\gg \lambda^{k+1}e^{-C_{3}k^{4}\log k}X, \ \ \ \ \ (12)

when {X} is large enough in terms of {\lambda} and {k}, for a certain constant {C_3>0}.

5. Concluding remarks

The strength of the Maynard’s sieve is the flexibility, that makes it applicable to counting primes in sparser subsets as well. In fact, the same proof that leads to Theorem 1 can be overall adapted to study a variety of different situations, in which for instance we restrict the primes to lye on an arithmetic progression or allow for uniformity of the parameters {\lambda} and {m}. The results are as following.

Remark 2 Let {x} be sufficiently large in terms of {m} and {\lambda}. Suppose that {q\leq f(x)} is a positive integer, with {(\log x)/f(x)\rightarrow\infty}, as {x\rightarrow\infty}. Take {0\leq a<q} with {(a,q)=1}. Then, we have

\displaystyle |\lbrace n\leq x: |[n,n+\lambda\log n]\cap \mathbb{P}_{a,q}|= m\rbrace|\gg \frac{\lambda^{k+1}e^{-Dk^{4}\log k}}{q^{k+1}}x, \ \ \ \ \ (13)

for a certain {D>0}, where {\mathbb{P}_{a,q}} is the intersection of {\mathbb{P}} with {a\pmod{q}}.

Remark 3 Fix {\epsilon_1>0} a small parameter and {0<\epsilon_{2}<1}. Let {x\geq x_{0}(\epsilon_1,\epsilon_2)}, {m\leq \epsilon_1\log\log x} and {\lambda\geq (\log x)^{\epsilon_2-1}}, obeying to the relations {k^4(\log k)^2 \lambda\leq 1} and {\lambda>k\log k (\log x)^{-1}}. Then, the estimate (3) continues to hold.

Remark 4 Let {\mathbb{K}/\mathbb{Q}} be a Galois extension of {\mathbb{Q}} with discriminant {\Delta_{\mathbb{K}}}. There exist constants {C_{\mathbb{K}},C_{\mathbb{K}}'>0} depending only on {\mathbb{K}} such that the following holds. Let {\mathcal{C} \subset Gal(\mathbb{K}/\mathbb{Q})} be a conjugacy class in the Galois group of {\mathbb{K}/\mathbb{Q}}, and let

\displaystyle \mathcal{P} = \lbrace p\ \textrm{prime} : p\nmid \Delta_{\mathbb{K}}, \left[\frac{\mathbb{K}/\mathbb{Q}}{p}\right]=\mathcal{C}\rbrace,

where {\left[\frac{\mathbb{K}/\mathbb{Q}}{.}\right]} denotes the Artin symbol. Let {m\in\mathbb{N}}, {k=C_{\mathbb{K}}'\exp(C_{\mathbb{K}} m)} and {\lambda<\varepsilon}. Then, we have

\displaystyle |\lbrace n\leq x: |[n,n+\lambda\log n]\cap \mathcal{P}|= m\rbrace|\gg\lambda^{k+1}e^{-Dk^{4}\log k}x, \ \ \ \ \ (14)

provided {x \geq x_0(\mathbb{K},\lambda,m)}.

If we consider values of {\lambda} slightly bigger than {k^{-4}(\log k)^{-2}}, a little variation of the sieve method used to prove Theorem 1 leads to the following improvement on the Freiberg bound.

Remark 5 For every non-negative integer {m} and positive real number {\lambda} smaller than {k^{-1}(\log k)^{-1}}, with {k} the value {k=C\exp(49m/C')}, for suitable constants {C,C'>0}, we have

\displaystyle |\lbrace n\leq x: |[n,n+\lambda\log n]\cap \mathbb{P}|= m\rbrace|\gg \frac{\lambda e^{-Dk^{4}\log k}x}{(\log x)^{k}}, \ \ \ \ \ (15)

for a certain {D>0}, if {x} is sufficiently large in terms of {m} and {\lambda}.

Sieving multiplicative functions

1. Basics on multiplicative functions

A multiplicative function is a function {f:\mathbb{N}\longrightarrow \mathbb{C}} that preserves multiplication between coprime numbers, i.e. if {(m,n)=1} then {f(mn)=f(m)f(n)}, where {(m,n)} indicates the largest common divisor between {m} and {n}. One of the most important multiplicative functions is the so called Mobius function {\mu}, defined as follows:

\mu(n)=1 if n=1, \mu(n)=(-1)^k if n=p_1p_2\cdots p_k and p_1,p_2,\dots, p_k distinct primes, \mu(n)=0 otherwise. Therefore, the value of the Mobius function tells us information on the parity of the number of prime factors of a squarefree number. An important property that can be derived from its definition is the ability of isolating the number {1}. Indeed, it is easy to prove that
\sum_{d|n}\mu(d)=0 if n>1 and it equals 1 if n=1. This is sometimes called the Mobius inversion formula.

2. Heuristic on mean value of multiplicative functions

Let {f} be a multiplicative function and write {f(n)=\sum_{d|n}g(d)}. Note that we can recover {g} using the Mobius inversion formula and then it is easy to show that {g} is multiplicative, too. Now, using such relation, we can rewrite the mean value of {f} in terms of {g} as

\displaystyle \sum_{n\leq x}f(n)=\sum_{n\leq x}\sum_{d|n}g(d)=\sum_{d\leq x}g(d)\left\lfloor \frac{x}{d}\right\rfloor.

Since {\lfloor z\rfloor=z+O(1),} we deduce

\displaystyle \sum_{n\leq x}f(n)=x\sum_{d\leq x}\frac{g(d)}{d}+O\bigg(\sum_{d\leq x}|g(d)|\bigg).

In several situations, the remainder term may be shown to be small. Omitting this term, and approximating {\sum_{d\leq x}g(d)/d} with {\prod_{p\leq x}(1+g(p)/p+g(p^2)/p^2+\cdots)}, we find

\displaystyle \sum_{n\leq x}f(n)\approx x\mathcal{P}(f,x),

where we define

\displaystyle \mathcal{P}(f,x):=\prod_{p\leq x}\bigg(1+\frac{g(p)}{p}+\frac{g(p^2)}{p^2}+\cdots\bigg)

\displaystyle =\prod_{p\leq x}\bigg(1-\frac{1}{p}\bigg)\bigg(1+\frac{f(p)}{p}+\frac{f(p^2)}{p^2}+\cdots\bigg).

In the special case that {0\leq f(p)\leq f(p^2)\leq \dots} for all primes {p} (so that {g(d)\geq 0}, for all {d}), one easily gets an upper bound of the correct order of magnitude, because

\displaystyle \sum_{d\leq x}g(d)\left\lfloor \frac{x}{d}\right\rfloor\leq x\sum_{d\leq x}\frac{g(d)}{d}\leq x\mathcal{P}(f,x).

The Delange’s theorem tells us that if {f} is sufficiently close to the function {1}, then the above heuristic holds. More precisely, if we introduce the notion of distance between two multiplicative functions (say with values in the unit disc) as

\displaystyle D(f,g,x)^2:=\sum_{p\leq x}\frac{1-\Re(f(p)\overline{g(p)})}{p}, \ \ \ \ \ (1)

we have the following result (see Chapter 3 Section 4 here, for further details and a proof).

Theorem 1 Suppose that {|f|\leq 1} and that

\displaystyle \lim_{x\longrightarrow \infty}D(1,f,x)<\infty. \ \ \ \ \ (2)

Then

\displaystyle \sum_{n\leq x}f(n)\sim x\mathcal{P}(f,x),\ \textrm{as}\ x\longrightarrow \infty. \ \ \ \ \ (3)

Delange’s theorem gives a satisfactory answer in the case of multiplicative functions at a bounded distance from {1}, and we are left to ponder what happens when the limit in (2) is infinite. One would be tempted to think that in this case {\sum_{n\leq x}f(n)=o(x)}. However, we have an important counter-example. Let {\alpha\neq 0} be a fixed real number and consider the multiplicative function {f(n) = n^{i\alpha}}. By partial summation we find that

\displaystyle \sum_{n\leq x}f(n)\sim \frac{x^{i\alpha+1}}{i\alpha+1},\ \textrm{as}\ x\longrightarrow \infty.

The mean value at {x} has magnitude {1/|1+i\alpha|}, but whose argument varies with {x}. Moreover, it can be proved that {D(1,n^{i\alpha},\infty)=\infty}. The example {n^{i\alpha}} takes on complex values, and one might wonder if there is a real valued multiplicative function {f} taking values in {[-1,1]} for which (1) diverges, but for which the mean value does not tend to zero. A lovely theorem of Wirsing shows that this does not happen (see Chapter 3 Section 4 here, for further details and a proof).

Theorem 2 Let {f} be a real valued multiplicative function with {|f(n)|\leq 1} and suppose that

\displaystyle \lim_{x\longrightarrow \infty}D(1,f,x)=\infty. \ \ \ \ \ (4)

Then

\displaystyle \sum_{n\leq x}f(n)=o(x),\ \textrm{as}\ x\longrightarrow \infty. \ \ \ \ \ (5)

3. The problem of sieving multiplicative functions

A problem that frequently appears in Analytic Number Theory is to find information (upper/lower bounds, or even asymptotics) for the average of {f} restricted on integers without a prescribed list of prime factors in their prime factorization. The most well known example of this is when considering {f(n)\equiv 1}. In this situation we ask for information on

\displaystyle \sum_{\substack{n\leq x\\ (n,\mathcal{P})=1}}1=\#\{n\leq x: p|n \Rightarrow p\nmid \mathcal{P}\},

where {\mathcal{P}} is a fixed set of primes. The study of this quantity gave rise to Sieve Theory and the Fundamental Lemma of Sieve Theory consists in the following estimate (see Chapter 1 Section 3 here, for further details and a proof).

Proposition 3 Let {\mathcal{A}} be a finite set of integers and let {\mathcal{P}} be a set of prime numbers. Define

\displaystyle \mathcal{A}_d:=\#\{a\in\mathcal{A}: d|a\},\ P(y):=\prod_{p\in\mathcal{P},\ p\leq y}p,

and

\displaystyle S(\mathcal{A},\mathcal{P},y):=\#\{a\in\mathcal{A}: (a,P(y))=1\}.

Assume that for any {d|P(y)} we have

\displaystyle \mathcal{A}_d=X\frac{\rho(d)}{d}+R(X,d),\ (X\in\mathbb{R}),

with {\rho(d)} a non-negative multiplicative function satisfying

\displaystyle \prod_{\eta<p<\xi}\bigg(1-\frac{\rho(p)}{p}\bigg)^{-1}< \bigg(\frac{\log\xi}{\log\eta}\bigg)^{\kappa}\bigg(1+\frac{A}{\log \eta}\bigg),\ (2\leq \eta<\xi),

for certain constants {\kappa,A>0}. Then we have, uniformly for {\mathcal{A},X,y} and {u\geq 1},

\displaystyle S(\mathcal{A},\mathcal{P},y)=X\prod_{p\leq y}\bigg(1-\frac{\rho(p)}{p}\bigg)(1+O(u^{-u/2}))+O\bigg(\sum_{\substack{d\leq y^u\\ d|P(y)}}|R(X,d)|\bigg).

In some circumstances, it is important to consider a wider class of multiplicative functions in place of {1}. For instance, when working on the distribution of multiplicative functions in arithmetic progressions, we easily come across the problem to find estimates for the mean value of {f} on integers coprime with the module of the arithmetic progressions in question.

This coprimality condition is considered as a mild restriction on {f} and sometimes can be removed in a very easy way. For example, let us look at what happens if {f} is allowed to be a completely multiplicative function, i.e. the relation {f(mn)=f(m)f(n)} now holds for any couple {n,m} of positive integers. We would like to estimate

\displaystyle \sum_{\substack{n\leq x\\ (n,q)=1}} f(n), \ \ \ \ \ (6)

for a certain {q\geq 1}. Therefore we deduce that (6) is

\displaystyle \sum_{\substack{n\leq x}} f(n)\sum_{\substack{d|n\\ d|q}}\mu(d)=\sum_{\substack{d|q}}\mu(d)\sum_{\substack{k\leq x/d}} f(dk)= \sum_{\substack{d|q}}\mu(d)f(d)\sum_{\substack{k\leq x/d}} f(k). \ \ \ \ \ (7)

In conclusion, if we know the behaviour of {f} on average over the positive integers {n\leq x} for any {x\geq 1}, we are able to plug this into the final sum in (7) and compute (6). Note that in the case when {f\equiv 1}(7) leads to a special case of Proposition 3. Furthermore, we remark that the idea behind such proposition (and Brun’s sieve theory as a whole) is exactly to refine in a clever way the simple computation (7), for {f\equiv 1}, in the general situation in which the set {\mathcal{P}} is no longer the set of prime factors of an integer {q}, and the variable {n} varies on a subset of the integers smaller than {x}. If {f} is only multiplicative, we can still perform all the computations in (7), but this time we end up with

\displaystyle \sum_{\substack{d|q}}\mu(d)\sum_{\substack{k\leq x/d}} f(dk)

and we are not able to proceed any further (under specific assumptions on {f} though, it could still be possible to deduce an asymptotic for it, looking at the innermost sum as the mean value of a non-multiplicative function, which exhibits a sort of Euler product, and apply general theorems, like the Selberg–Delange’s theorem, to conclude).

In general the situation is more complicated and the estimates of (6) is a case by case problem. As a rule of thumbs, we could think of (6) as the whole average of {f} up to {x} against the inverse of the Euler product over the prime factors of {q}, which encodes the information coming from the coprimality restriction. The problem reduces then to understand the behaviour of the error term that naturally pops up after performing this approximation.

4. Halasz’s theorem comes into play

The Hálasz’s theorem is a refinement and an extension of Theorem 1 and 2. Indeed, it describes the average of all the multiplicative functions in the unit disc, regardless their distance from a specific phase function {n^{it}}. Moreover, it gives a stronger and explicit information on the mean value of {f} when it tends to {0}. The result is the following (see here, for its proof).

Theorem 4 Let {f} be a multiplicative function with {|f|\leq 1}. If there exists {\alpha\in\mathbb{R}} for which {D(f,n^{i\alpha},\infty)<\infty}, then

\displaystyle \sum_{n\leq x}f(n)\sim\frac{x^{1+i\alpha}}{1+i\alpha}\mathcal{P}(g,x),\ \textrm{as}\ x\longrightarrow \infty,

where {g(n)=f(n)n^{i\alpha}.} If instead, {D(f,n^{i\alpha},\infty)=\infty}, for every {\alpha\in\mathbb{R}}, define

\displaystyle M(x,T):=\min_{|t|\leq T}D(f,n^{it},x)^2. \ \ \ \ \ (8)

Then, for any {T\geq 1}, we have

\displaystyle \frac{1}{x}\bigg|\sum_{n\leq x}f(n)\bigg|\ll (1+M(x,T))e^{-M(x,T)}+\frac{1}{\sqrt{T}}.

Hálasz’s result shows that if the mean value of {f} is large, then {f(p)} will be very close to {p^{it}}, for some small value of {t}, with the quantity {D(f,n^{it})} giving an appropriate measure of the distance between them. Now, let us come back to the distribution of multiplicative functions in arithmetic progressions. Fix a module {q} and define the slight variation of the distance that takes into account the coprimality condition

\displaystyle D_q(f(n),n^{it},x)^2:=\sum_{\substack{p\leq x\\ p\nmid q}} \frac{1-\Re(f(p)p^{-it})}{p}.

Suppose that {t^{*}} is where {D_q} achieves its minimum over {|t|\leq T}. A beautiful consequence of Theorem 4, and of some other fascinating results in the theory of Lipschitz estimates for multiplicative functions, is the following. For any {d\leq \sqrt{x}} we have

\displaystyle \sum_{n\leq x/d}f(n)=\frac{1}{d^{1+it^{*}}}\sum_{n\leq x}f(n)+O\bigg(\frac{x}{d}\frac{\log\log x}{(\log x)^{1/4}}\bigg). \ \ \ \ \ (9)

Plugging (9) inside Proposition 3, we arrive to the following conclusion (see here, for further details).

Proposition 5 For {1\leq T\leq (\log x)^{1/2}} and {t^{*}} selected as above, we have for any {q\leq \sqrt{x}} that

\displaystyle \sum_{\substack{n\leq x\\ (n,q)=1}}f(n)=\frac{\phi(q)}{q}\prod_{p|q}\bigg(1-\frac{f(p)}{p^{1+it^{*}}}\bigg)\bigg(1-\frac{1}{p}\bigg)^{-1}\sum_{n\leq x}f(n)+O\bigg(\frac{\phi(q)}{q}x\frac{\log\log x}{(\log x)^{1/4}}\bigg). \ \ \ \ \ (10)

Note that (10) gives back the result of Proposition 3 when {f\equiv 1}, but with a weak error term if {q} has several small prime factors. On the other hand, it is clear that Proposition 5 works in a very general set up, allowing uniformity in the set of multiplicative functions, hence explaining the presence of a large error term. Moreover, it must be noticed that for some specific {f} the big-O term in (10)may exceed the main term there. A particular instance of this comes with the choice {f=\mu(n)}.

5. The Wolke’s Theorem

From what we have discussed so far, we realize the need of having a quite general result on the average of multiplicative functions over comprimality condition without losing too much on the error term arising from the approximation (10). In other words, we would like to show for a wide class of multiplicative functions {f} that

\displaystyle \sum_{\substack{n\leq x\\ (n,q)=1}}f(n)\asymp \frac{\phi(q)}{q}\prod_{p|q}\bigg(1-\frac{f(p)}{p^{1+it^{*}}}\bigg)\bigg(1-\frac{1}{p}\bigg)^{-1}\sum_{n\leq x}f(n).

A wonderful theorem of Wolke fills in this gap (see here, for a proof).

Theorem 6 Let {f} to be a multiplicative function such that

\displaystyle 0\leq f(p^{l})\ll l^{O(1)}, \ \ \ \ \ (11)

for any {p} and {l}. Consider any sequence {a_n\ll n^{O(1)}}, such that

\displaystyle A(x,d):=\sum_{\substack{n\leq x\\ d|a_n}}1=x\frac{\rho(d)}{d}+R(x,d),

with {\rho(d)} a positive multiplicative function behaving like (11) on the primes and such that {\rho(p)<p}, for any {p}. Moreover, suppose that the remainder term satisfies

\displaystyle |R(x,d)|\ll x^{1-\epsilon}\ (\forall d\leq x^{C}),\ \textrm{and}\ \sum_{d\leq x^{D}}|R(x,d)|\ll \frac{x}{(\log x)^{E}},

for suitable constants {\epsilon,C,D,E}. Then, if we define

\displaystyle D^{*}(f,x):=\sum_{p\leq x}\frac{\rho(p)}{p}(f(p)-1), \ \ \ \ \ (12)

we have

\displaystyle \sum_{n\leq x}f(a_n)\ll x\exp(D^{*}(f,x)).

If moreover, there exists a constant {c>0} such that {f(p^{l})\geq c^{l}}, for any {l} and {p\gg 1}, then

\displaystyle \sum_{n\leq x}f(a_n)\gg x\exp(D^{*}(f,x)).

Finally, if {f(p^l)\longrightarrow 1}, if {p\longrightarrow \infty} uniformly in {l}, there exists a constant {\gamma} such that

\displaystyle \sum_{n\leq x}f(a_n)=(\gamma+o(1))x\exp(D^{*}(f,x)).

The idea in proving Wolke’s theorem is that, under reasonable sieve assumptions on the sequence {a_n} and assuming {f} does not grow too fast, we are able to use the techniques of Proposition 3 to sieve {f} on {a_n}.

6. A special example

Now, consider {f} to be the {\alpha}–fold divisor function, denoted by {d_\alpha(n)}, i.e. the multiplicative function defined on the prime powers by the generalized binomial coefficient

\displaystyle d_\alpha(p^k):=\binom{\alpha+k-1}{k}.

Suppose {\alpha>0} so that {d_\alpha(n)\geq 0}, for any {n\geq 1}. The heuristic prediction in Section 2 assumes now the form

\displaystyle \sum_{n\leq x}d_\alpha(n)\approx x\mathcal{P}(d_\alpha,x)=x\prod_{p\leq x}\bigg(1-\frac{1}{p}\bigg)\bigg(1-\frac{1}{p}\bigg)^{-\alpha}\asymp x(\log x)^{\alpha-1},

which turns out to be quite accurate, if we only look at the order of magnitude of the above average. When {0\leq \alpha< 1}, we have also {|d_{\alpha}(n)|\leq 1}, and by Wirsing’s theorem we find {\sum_{n\leq x}d_\alpha(n)=o(x)}, since {D(1,d_\alpha,\infty)=\infty.} Therefore, we can see that (10) gives a non-trivial information about the average of {d_\alpha} on integers coprime with {q} only when {\alpha} is not too close to {0} ({\alpha>3/4}, say).

Let us now apply Wolke’s theorem to {d_\alpha(n)\textbf{1}_{(n,q)=1}(n)} when {\alpha>0}. Fix the sequence {a_n} to simply be {a_n=n}. Thus, {A(x,d)=x/d+O(1).} Clearly, {\rho(d)\equiv 1} and {R(x,d)\ll 1} satisfy the hypotheses in Theorem 6. Moreover, we have {0\leq d_\alpha(p^{l})\ll l^{O(1)},} for any {p} and {l}. Indeed,

\displaystyle 0\leq d_\alpha(p^{l})=\frac{\alpha(\alpha+1)\cdots (\alpha+l-1)}{l!}=\exp\bigg(\sum_{i=1}^{l}\log\bigg(1+\frac{\alpha-1}{i}\bigg)\bigg)

\displaystyle =\exp\bigg(\sum_{i=1}^{l}\frac{\alpha-1}{i}+O(1)\bigg)\ll l^{\alpha-1}.

Consequently, for {q\leq x} we deduce

\displaystyle \sum_{\substack{n\leq x\\ (n,q)=1}}d_\alpha(n)\ll x\exp\bigg(\sum_{p\leq x,\ p\nmid q} \frac{(\alpha-1)}{p}-\sum_{p|q}\frac{1}{p}\bigg)

\displaystyle =x\exp\bigg(\sum_{p\leq x} \frac{(\alpha-1)}{p}-\sum_{p|q}\frac{\alpha}{p}\bigg).

Since

\displaystyle \prod_{\substack{p|q\\p>\alpha}}\bigg(1-\frac{\alpha}{p}\bigg)\asymp \exp\bigg(-\sum_{p|q}\frac{\alpha}{p}\bigg)

and by Merten’s theorem we have

\displaystyle \sum_{p\leq x}\frac{1}{p}=\log\log x+O(1),

we find

\displaystyle \sum_{\substack{n\leq x\\ (n,q)=1}}d_\alpha(n)\ll\prod_{\substack{p|q\\ p>\alpha}}\bigg(1-\frac{\alpha}{p}\bigg)x(\log x)^{\alpha-1},

which matches the right order of magnitude in (10). Moreover, we can prove also the lower bound estimate applying the second part of Theorem 6. Indeed, if {\alpha\geq 1} we can take {c=1} and find {d_\alpha(p^l)\geq 1}. If {\alpha<1} we can take {c=1/2} and still get {d_\alpha(p^{l})\gg 2^{-l}}. Hence, we deduce also

\displaystyle \sum_{\substack{n\leq x\\ (n,q)=1}}d_\alpha(n)\gg\prod_{\substack{p|q\\p>\alpha}}\bigg(1-\frac{\alpha}{p}\bigg)x(\log x)^{\alpha-1}.

An Ergodic Proof of the Strong Law of Large Numbers

1. What is a probability space?

A probability space is a triple {(B ,\mathcal {B},\nu)} consisting of the sample space {B}, an arbitrary non-empty set, the {\sigma}–algebra {\mathcal{B}\subseteq 2^{B},} (which is a set of subsets of {B}, called events, such that {\mathcal{B}} contains the sample space, {\mathcal{B}} is closed under complements and countable unions) and the probability measure {\nu:\mathcal{B}\rightarrow [0,1]}, with the property that {\nu} is countably additive (also called {\sigma}–additive, which means that if {\{A_{i}\}_{i=1}^{\infty }\subseteq \mathcal{B}} is a countable collection of pairwise disjoint sets, then {\nu(\bigcup _{i=1}^{\infty }A_{i})=\sum _{i=1}^{\infty }\nu(A_{i})}) and the measure of the entire sample space is equal to one. In short, a probability space is a measure space such that the measure of the whole space is equal to one. In this notation, a measurable function {X:B\longrightarrow \mathbb{R}} is called a real random variable.

2. The weak and strong law of large numbers

Consider now {\{X_n\}_{n\geq 1}} be a sequence of real independently and identically distributed (i.i.d.) random variables over {(B,\mathcal{B},\nu)}. Basically, this means that they have the same probability distribution, i.e. {\forall x\in\mathbb{R}} we have {P(X_i\leq x)=P(X_j\leq x)}, for any {i\neq j}, and that they behave independently one another, i.e. {P(X_1\leq x_1,\dots, X_n\leq x_n)=\prod_{i=1}^{n}P(X_i\leq x_i)}, for any {n\geq 1} and {x_1,\dots, x_n\in\mathbb{R}}. For any such sequence we define the empirical average as {\bar{X}_{n}=\sum_{i=1}^{n}X_i}.

Proposition 1 (The weak law of large numbers) 

If {\{X_n\}_n} is a sequence of real i.i.d. random variables over {(B,\mathcal{B},\nu)}, such that {E[X_1]<\infty}. Then, the empirical mean value converges in probability to the expected mean value {E[X_1]}. In other words, for any {\epsilon>0}

\displaystyle \nu(|\bar{X}_{n}-E[X_1]|> \epsilon)\longrightarrow 0,\ \textrm{as}\ n\longrightarrow \infty.

Here, we define {E[X_1]=\int_{B}X_1 d\nu}, and clearly we have {E[X_1]=\cdots= E[X_n]}, for any {n\geq 1}. The weak law is the statement that for any nonzero margin specified, no matter how small, with a sufficiently large sample there will be a very high probability that the average of the observations will be close to the expected value at most within that margin.

Proposition 2 (The strong law of large numbers) 

If {\{X_n\}_n} is a sequence of real i.i.d. random variables over {(B,\mathcal{B},\nu)}, such that {E[X_1]<\infty}. Then, the empirical mean value converges almost surely to the expected mean value {E[X_1]}. In formulae,

\displaystyle \bar{X}_n(\omega)\longrightarrow E[X_1],\ \textrm{as}\ n\longrightarrow \infty,

over a subset {\Omega\subset B} with measure {\nu(\Omega)=1}.

What this means is that as the number of trials goes to infinity, the probability that the average of the observations is equal to the expected value will be equal to one. This version is called the strong law because random variables which converge strongly (almost surely) are guaranteed to converge weakly (in probability). The strong law implies the weak law but not vice versa. More specifically, it implies that with probability {1}, we have that for any {\epsilon > 0} the inequality {|\bar{X}_{n}-E[X_1]|<\varepsilon} holds for all large enough {n}. However, the weak law is known to hold in certain conditions where the strong law does not hold and the convergence is only weak.

3. The Kolmogorov 0-1 law

The Kolmogorov’s zero-one law, specifies that a certain type of event, called a tail event, will either almost surely happen or almost surely not happen. Tail events are defined in terms of infinite sequences of random variables. Suppose {\{X_n\}_n} is an infinite sequence of independent random variables (not necessarily identically distributed). Let {\mathcal{F}} be the {\sigma}–algebra generated by the {X_n}. Then, a tail event {F\in\mathcal{F}} is an event which is probabilistically independent of each finite subset of these random variables. A more precise statement is contained in the following proposition.

Proposition 3 Let {\mathcal{F}_i} be the {\sigma}–algebra generated by {X_i}, i.e. the {\sigma}–algebra {\{X_i^{-1}(S)| S\in\mathcal{L}\}}, with {\mathcal{L}} a {\sigma}–algebra in {\mathbb{R}}. Consider

\displaystyle \mathcal{G}_k=\sigma\bigg(\bigcup_{i=k}^{\infty} \mathcal{F}_i\bigg)

be the smallest {\sigma}–algebra containing {\mathcal{F}_k,\mathcal{F}_{k+1},\dots} . Then the Kolmogorov’s zero-one law asserts that for any event

\displaystyle F\in\bigcap_{k=1}^{\infty}\mathcal{G}_k

we either have {\nu(F)=0} or {\nu(F)=1}.

4. A taste of Ergodic Theory

Given a probability space {(B,\mathcal{B},\nu)} and a measurable map {T:B\longrightarrow B} (i.e. {T^{-1}(A)\in\mathcal{B}}, for any {A\in\mathcal{B}}), we say that {T} is a measure preserving transformation (mpt) or equivalently {\nu} is a {T}–invariant measure if

\displaystyle \nu(T^{-1}(A))=\nu(A),\ \textrm{for\ any}\ A\in \mathcal{B}.

Proposition 4 The following assertions for a map {T} as above are equivalent:
{1)\ T} is mpt;
{2) \int_{B}f\circ T d\nu=\int_{B}f d\nu,} for any {f\in L^1(B)=\{f:B\longrightarrow \mathbb{R}: \int_B |f|d\nu <\infty\}.}

Using the above result it is straightforward to show that on the circle {S^1} (with the Lebesgue measure) the rotations of a fixed angle {\alpha} are mpt, as well as the doubling map defined by {T(x)=2x\pmod{1}}.

Remark 1 On a compact metric space {B}, endowed with its Borel {\sigma}–algebra {\mathcal{B}} (i.e. the one generated by the open sets of {B}), for a continuous transformation {T} we can test the measure preserving property by looking only at continuous functions {f} on {B}.

We say that a mpt {T:B\longrightarrow B} is ergodic if {A\in\mathcal{B}} and {T^{-1}A=A} implies either {\nu(A)=0} or {\nu(A)=1}.

Remark 2 For the map {T} to be ergodic means that {T} is indecomposable into {2} non-trivial mpts. Indeed, suppose there exists {A\in\mathcal{B}} such that {T^{-1}A=A} but {0<\nu(A)<1}. Then, consider {B_1=A} and {B_2=B\setminus A}. Accordingly, define {T_1,T_2} as the restrictions of {T} on {B_1,B_2} respectively. We may see {B_1} and {B_2} as probability spaces with the measures

\displaystyle \nu_1=\frac{\nu|_{B_1}}{\nu(B_1)},\ \nu_2=\frac{\nu|_{B_2}}{\nu(B_2)}.

Therefore, it is clear that {T} decomposes into the {2} mpts {T_1,T_2}.

Let {f:B\longrightarrow\mathbb{R}} measurable. We say that {f} is invariant if {f\circ T=f} almost everywhere (short: a.e.), i.e. {f(T(\omega))=f(\omega)}, for any {\omega\in \Omega\subset B} with {\nu(\Omega)=1}. We say that {f} is constant if {f=c} a.e., for a certain {c\in\mathbb{R}}.

Proposition 5 For a mpt {T} and for any {1\leq p\leq \infty}, the following assertions are equivalent:
{1)\ T} is ergodic;
{2)\ } for any {f} measurable, {f} is invariant if and only if {f} is constant;
{3)\ } for any {f\in L^{p}(B)}, {f} is invariant if and only if {f} is constant.

We now come to the main result in this section and one of the most important ones in Ergodic Theory.

Proposition 6 (The (pointwise) Birkhoff’s Ergodic Theorem) 

Let {(B,\mathcal{B},\nu)} be a probability space, {T:B\longrightarrow B} a mpt and suppose that {f\in L^1(B)}. Then, there exists {f^{*}\in L^1(B)} a {T}–invariant function such that {\int_{B}f^{*}=\int_{B}f} and

\displaystyle \lim_{n\longrightarrow\infty}\frac{1}{n}\sum_{j=0}^{n-1}f\circ T^j =f^{*},\ \textrm{a.e.}\ .

If moreover {T} is ergodic, we have {f^{*}=\int_{B}f}.

In other words, if we define the time average

\displaystyle \hat{f}(x)=\lim_{n\longrightarrow\infty}\frac{1}{n}\sum_{j=0}^{n-1}f\circ T^j(x)

and the space average

\displaystyle \bar{f}=\int_{B}fd\nu,

the above result tells us that under strong assumptions on {T} they coincide almost everywhere.

5. Proof of the strong law of large numbers

Consider a sequence {\{X_n\}_n} of real i.i.d. random variables over the probability space {(B;\mathcal{B}, \nu)}. We can view such sequence as a unique random variable {\bar{X}} over the product space {\prod_{n=1}^{\infty} B}, endowed with the product {\sigma}–algebra {\bigotimes_{n=1}^{\infty}\mathcal{B}} and the product measure {\mu}.

Remark 3 The {\sigma}–algebra {\bigotimes_{n=1}^{\infty}\mathcal{B}} is the sigma algebra generated by all the rectangles {\prod_{n=1}^{\infty}A_n}, with {A_n\in\mathcal{B}} for any {n}, and {A_n=B} except finitely many times.

Since the random variables {X_n} are independent, it can be proven that {\mu} factorize as the product of the measures associated to them. Since they are also identically distributed, we can write {\mu=\nu^{\mathbb{N}}}.

Consider now the shift map {f:\mathbb{R}^{\mathbb{N}}\longrightarrow \mathbb{R}^{\mathbb{N}}}, defined on sequences by {f(x_1,x_2,x_3, \dots)=(x_2,x_3,\dots),} which is measurable with respect to the product {\sigma}–algebra of the Borel {\sigma}–algebra in {\mathbb{R}}. The composition {f\circ \bar{X}} defines a new random variable on {(\prod_{n=1}^{\infty} B,\bigotimes_{n=1}^{\infty}\mathcal{B}, \mu).} Moreover, we can see {\mathbb{R}^{\mathbb{N}}} as a probability space with measure {\mu^{*}} defined by {\mu^{*}(A)=\mu(\bar{X}^{-1}(A))}. It is clear that {\mu^{*}} is a {f}–invariant measure. Indeed, for any {A} measurable in {\mathbb{R}^{\mathbb{N}}} we have

\displaystyle \mu^{*}(f^{-1}(A))= \mu(\bar{X}^{-1}(f^{-1}(A)))=\mu(\bar{X}^{-1}(\mathbb{R}\times A))

\displaystyle =\nu(X_1^{-1}(\mathbb{R}))\prod_{n=2}^{\infty}\nu(X_n^{-1}(A))=\prod_{n=2}^{\infty}\nu(X_n^{-1}(A))=\prod_{n=1}^{\infty}\nu(X_n^{-1}(A))=\mu^{*}(A),

where in the second line of equalities we use the i.i.d assumption.

Moreover, {f} is ergodic. Indeed, consider {A} measurable with {f^{-1}(A)=A}. Then, we can show that {\bar{X}^{-1}(A)} is a tail event inside {\bigotimes_{n=1}^{\infty}\mathcal{B}}. In fact, {\bar{X}^{-1}(A)=\bar{X}^{-1}(f^{-1}(A))=(f\circ \bar{X})^{-1}(A)}, which takes into account only the information coming from {X_2,X_3,\dots} but not from {X_1}. In other words, {\bar{X}^{-1}(A)} is probabilistically independent on the {\sigma}–algebra generated by {X_1}. Moreover, we clearly have {f^{-2}(A)=f^{-1}(A)=A}. Thus, arguing as above we deduce {\bar{X}^{-1}(A)} is probabilistically independent on the {\sigma}–algebra generated by {X_1} and {X_2}. By induction we conclude that {\bar{X}^{-1}(A)} is probabilistically independent on the {\sigma}–algebra generated by any finite subset of the {\{X_n\}_n}, proving the assertion. By the Kolmogorov’s zero-one law, we either have {\mu^{*}(A)=0} or {\mu^{*}(A)=1}, showing the ergodicity of {f} with respect to {\mu^{*}}.

Consider now {\phi:\mathbb{R}^{\mathbb{N}}\longrightarrow \mathbb{R}} defined on sequences by {\phi(x_1,x_2,\dots)=x_1}. Clearly, {\phi\in L^{1}(\mathbb{R}^{\mathbb{N}},\mu^{*}).} By the Birkhoff’s ergodic theorem we find

\displaystyle \lim_{n\longrightarrow \infty} \frac{1}{n}\sum_{j=0}^{n-1}\phi(f^{j}(\underline{x}))=\int_{\mathbb{R}^{\mathbb{N}}}\phi d\mu^{*},

for almost all sequences {\underline{x}\in \mathbb{R}^{\mathbb{N}}}. This is equivalent to say that for almost all {\omega\in B} we have

\displaystyle \lim_{n\longrightarrow \infty} \frac{1}{n}\sum_{j=1}^{n} X_j(\omega)=\int_{\mathbb{R}}X_1 d\nu=E[X_1],

which concludes the proof of the strong law of large numbers.

Remark 4 We conclude by observing that the above proof of the strong law of large numbers is not the only one known. In fact, it is possible to give a pure probabilistic proof, which is rather long and intricate. Moreover, it is important to notice that in the presented proof we constructed a probability space and a mpt suitable for the application of the powerful ergodic theorem, but the meaningful step was to prove the ergodicity of the transformation itself, which followed by the beautiful Kolmogorov’s zero-one law.

Odd class number implies odd discriminant.

During the usual Number Theory meetings the following question was raised: let {K=Q(\sqrt d)} with {d<0} a squarefree integer. If the class group {C\ell_{K}} of {K} has odd cardinality different from {1}, is it true that the discriminant {D_{K}} of {K} must be odd? Soon after a simple short argument was given in favour of an affirmative answer. Before going through the details of the problem, let us carefully define the objects introduced above.

1. Quadratic imaginary number fields and their discriminants

A quadratic imaginary number field is an extension of {\mathbb{Q}} obtained by adding the squareroot of a negative integer {d} and the notation is {K=Q(\sqrt d)}. Clearly, we can assume {d} squarefree.

Proposition 1 Let {d} be a nonzero squarefree integer. Then, the discriminant {D_{K}} of the quadratic field {K=Q(\sqrt d)} has the following form

\displaystyle D_K= d\ \textrm{if}\ d\equiv 1\pmod{4},\ D_K=4d\ \textrm{otherwise}.

For example, if {d= -1}, then {K} is the field of Gaussian rationals and the discriminant is {-4}. The reason for such a distinction is that the ring of integers of {K} is generated by {(1+\sqrt{d})/2} in the first case, but by {\sqrt{d}} in the second case. We remember that the ring of integers of {K}, denoted by {\mathcal{O}_K}, is the ring of all integral elements contained in {K}, i.e. of all elements root of a monic polynomial with integer coefficients. The following proposition will be useful later.

Proposition 2 The function {N:\mathcal{O}_K\longrightarrow\mathbb{N}} given by {N(a+b\sqrt{d})=a^2-db^2} defines a norm on {\mathcal{O}_K}. The norm is multiplicative and can be extended to principal ideals in the following way:

\displaystyle N((a))=N(a),\ \textrm{for any}\ a\in \mathcal{O}_K. More in general, for any ideal {0\neq I\subset \mathcal{O}_K} we can set {N(I)=(\mathcal{O}_K : I).}

2. The Dedekin-Kummer factorization theorem

For every prime number {p\in\mathbb{Z}} we can consider {p\mathcal{O}_K}, which will be an ideal in {\mathcal{O}_K}. Now, since {\mathcal{O}_K} is a Dedekind domain, every ideal can be split as a product of prime ideals of {\mathcal{O}_K}. This means we have in particular the following factorization

\displaystyle p\mathcal{O}_K=\mathfrak{P}_1^{e_1}\cdots\mathfrak{P}_l ^{e_{l}}, for some {l\geq 1} and prime ideals {\mathfrak{P}_1, \dots, \mathfrak{P}_l} in {\mathcal{O}_K}. The Dedekind-Kummer factorization theorem allows us to find what are the primes above every prime number, also for more general number fields {K} with {\mathcal{O}_K=\mathbb{Z}[\alpha]}, for a certain {\alpha\in\mathbb{C}}.

Proposition 3 Let {f\in \mathbb{Z}[x]} be monic and irreducible. Let {\alpha\in \bar{\mathbb{Q}}} be such that {f(\alpha) = 0}. Let {p\in \mathbb{Z}} be prime. Choose {g_i(x)\in \mathbb{Z}[x] } monic and {e_i\in\mathbb{Z}_{>0}} such that {f\equiv \prod_{i=1}^{l} g_i(x)^{e_i}\pmod{p}} is the factorization of {f} modulo {p} into irreducible. Then, the prime ideals of {\mathbb{Z}[\alpha]} containing {p} are precisely the ideals {(p, g_i(\alpha))=:\mathfrak{P}_i}. Furthermore, if all {\mathfrak{P}_i} are invertible then {\prod_{i=1}^{l} \mathfrak{P}_i^{e_i}=(p)} and {N(\mathfrak{P}_i) = p^{f_i}}, where {f_i = \deg g_i}.

Note that by the multiplicativity of the norm, we immediately deduce that {p^{\deg f}=\sum_{i=1}^{l}e_if_i}.

3. Class groups of number fields

The ideal class group {C\ell_{K}} (or class group) of an algebraic number field {K} is the quotient group {J_K/P_K}, where {J_K} is the group of fractional ideals of the ring of integers of {K}, and {P_K} is its subgroup of principal ideals. A fractional ideal of {\mathcal{O}_K} is an {\mathcal{O}_K}–submodule {I} of {K} such that there exists a nonzero {r\in \mathcal{O}_K} with {rI \subset \mathcal{O}_K}. The element {r} can be thought of as clearing out the denominators in {I}. The class group is a measure of the extent to which unique factorization fails in the ring of integers of {K}. The order of the group, which is finite, is called the class number of {K}.

In the language of Dedekind-Kummer theorem, we say that {p} ramifies in {\mathcal{O}_K} if at least one the {e_i} is bigger than {1}. Otherwise, we say it is unramified. It can be proven that every prime dividing the discriminant {D_K} ramifies in {K} (and viceversa). More important for us, all the prime ideals in {\mathcal{O}_K} over the ramified primes in {\mathbb{Z}} (only one prime ideal for each of them) generate the {2}–torsion of {C\ell_{K}}.

Proposition 4 If the number of primes dividing {D_K}, that we will indicate with {\omega(D_K)}, equals {r}, then the {2}–torsion subgroup of {C\ell_{K}} will have rank {r-1} and more specifically will be isomorphic to a copy of {(\mathbb{Z}/2\mathbb{Z})^{r-1}}, generated by the classes of the primes ideals {\mathfrak{P}_1=(p_1,\sqrt{d}), \dots, \mathfrak{P}_r=(p_r,\sqrt{d}),} where {p_1,\dots,p_r} are all the distinct prime factors of {D_K}.

This statement is part of the so called Genus Theory, a subfield of Algebraic Number Theory whose study was started by Gauss. We remark here that the analogous situation for real quadratic extensions becomes more subtle, because of the difference between the class group and the strict class group (and the presence of non-trivial units).

4. The solution to our question

Coming back to our problems, let describe the proof pointed out during our meeting. Assume for contradiction that {d} is even. Since for those quadratic fields we have {\mathcal{O}_K=\mathbb{Z}[\sqrt{d}]}, by the Dedekind-Kummer factorization theorem we have that the ideal in {\mathcal{O}_K} generated by {2} is the square of an ideal of norm {2}. Indeed, {X^2-d\equiv X^2} if { d} is even or {X^2-d\equiv (X-1)^2} if { d} is odd. However, if {-d>2} there are no elements of norm {2}. Indeed, if {\beta\in\mathbb{Z}[\sqrt{d}]} was such that {N(\beta)=2}, then we would have {a,b\in\mathbb{Z}} with {\beta=a+b\sqrt{d}} and {a^2-db^2=2}, which is not possible. Using again the Dedekind-Kummer theorem we see that {(2)=(2,\sqrt{d})^2} and {(2,\sqrt{d})} cannot be principal, deducing that {\#Cl_K} must be even.

Let us now see a different solution which leads to a deeper understanding of the situation. Since we are supposing {\#C\ell_K>1} odd, we immediately deduce that the {2}–torsion in {C\ell_K} must be trivial, i.e. {\omega(D_K)=1.} By the characterization of the discriminant we must have {D_K=d}. This shows not only that {D_K} must be odd, but also gives the stronger conclusion that {d} must be an odd prime number congruent to {1\pmod{4}}. Indeed, we can rule out the possibility of {d=2}, because otherwise we know {\#C\ell_{\mathbb{Q}(i)}=1}.

5. Variation on the subject

A slightly but interesting variation is the following. Suppose now that the odd part of the class number, i.e. the largest odd divisor of {\#C\ell_K}, is bigger than {1}. Is the discriminant still odd? 
The answer this time is negative. Indeed, by the Davenport-Heilbronn theorem we have that when quadratic fields are ordered by their absolute discriminants, the average number of {3}–torsion elements in the class groups of imaginary quadratic fields is {2} (whereas the average number of {3}–torsion elements in the class groups of real quadratic fields is {4/3}). Actually, the techniques introduced by Davenport and Heilbronn can be adapted to prove the same statement but averaging only over class fields of the form {K=Q(\sqrt{2d})}, with {d} negative odd integer. We deduce from this that for infinitely many {d} the {\#C\ell_K>1} (indeed, divisible by {3}). However, {D_K} is always even.

6. What can we say for general arithmetic progressions?

Coming back again to our original problem, we may ask for information in the case in which {\#C\ell_K} is congruent to {a\pmod{q}}, with {0\leq a<q}, {(a,q)=1} and {q} an odd prime number. Is it true that {D_K\equiv a\pmod{q}}? It seems there is no general answer to this question and the feeling is that the couple {a=1,q=2} is somehow special. Indeed, when {q=2} we saw the question can be rephrased in terms of the {2}-torsion of {\#C\ell_K}, because basically we asked {2\nmid \#C\ell_K}. We can do the same also for more general {q} but now we lose the deep understanding we had before giving space to conjectural assumptions.

The Cohen–Lenstra heuristics for class groups of quadratic fields predict that the only deterministic part in class groups of quadratic fields comes from genus theory and that the rest can be considered as being random, in a suitable sense. More precisely, if {Q^{-}} indicates the set of all the quadratic imaginary extensions, then we expect

\displaystyle \lim_{x\longrightarrow \infty}\frac{\#\{K\in Q^{-}, -x<D_{K}<0:q\nmid \#C\ell_K\}}{\#\{K\in Q^{-}, -x<D_{K}<0\}}=\prod_{i\geq 1}\bigg(1-\frac{1}{q^i}\bigg).

This means that we should have a positive probability that {q\nmid\#C\ell_K} (and a positive one that {q\mid\#C\ell_K}). Similar considerations can be obtained for general arithmetic progressions but using a different version of the above conjectured asymptotic. Therefore, in such situations we do not see any more a specific pattern like before.

7. Concluding remarks and acknowledgements

In the paper “On the distribution of {C\ell(K)[\ell^{\infty}]} for degree {\ell} cyclic fields”, there is a good summary of what described above and some insights about the study of the same question but for family of larger number fields.

I would like to thank Martin Orr for pointing out this interesting curiosity in the weekly Warwick Number Theory Meeting and Carlo Pagano for a very fruitful and illuminating discussion on such topic.