How far off is Cauchy-Schwarz?

Cauchy-Schwarz is both a very simple and a very powerful bound. We won’t even try to give an extensive list of possible applications, but concentrate instead on a classical one in probability theory.

First notice that Cauchy-Schwarz can simply be interpreted as a fancy way of saying that |\cos(x)|\leq 1 for all real x by the following identity:

|\langle a,b\rangle| = |\cos(\theta)|\|a\|\|b\| \leq \|a\|\|b\|,

where \theta=\measuredangle(a,b) is the angle between a and b.

In several scenarios the dimension of the underlying vector space is very large or infinite; think of random matrix theory, where one typically looks at sequences of vector spaces with dimension going to infinity, or function spaces. We will treat these two cases as prototypical examples of (I) analysis in terms of some diverging parameter and (II) infinite dimensional spaces, where the correlation between values plays an important role.

(I) On \mathbb C^n with n\gg 1, how much off is Cauchy-Schwarz “typically”?

The intuition is that in higher dimensions it becomes increasingly unlikely for two vectors to be approximately parallel (i.e. \theta\approx 0 or $latex\theta\approx \pi$ in which case Cauchy-Schwarz is sharp) since there are “too many directions”. To answer question (I) it is instructive to consider a pair of independent vectors, chosen uniformly at random from the set of unit vectors.

By rotational symmetry we may fix a to be (1,0,0,\dots,0) and choose b uniformly at random from the set of unit vectors. Clearly its distribution is invariant under permuting indices, hence we expect (because of the constraint of having norm 1) that each entry will be of order n^{-\frac{1}{2}} and thus \mathbb E|\langle a,b\rangle|=O(n^{-\frac{1}{2}}).

While this tells us that Cauchy-Schwarz is off by a factor of O(\sqrt n) (note that by assumption \|a\|\|b\|=1 deterministically), it is not very illuminating. It is a good exercise to write out a proof not using the above symmetries — using the central limit theorem and the fact that entries are asymptotically independent as n\to\infty, one then notices that the missing factor of O(\sqrt n) comes from the fact that Cauchy-Schwarz does not account for any cancellations in the inner product.

(II) How much off is (\mathbb Ef)^2\leq \mathbb E[f^2]?

First we establish how this relates to Cauchy-Schwarz. We consider (more general) inner products of the form \langle f,g\rangle_\mu := \int_\mathbb{R} \bar f(x)g(x)d\mu(x), where \mu is some probability measure on \mathbb R and f,g are square-integrable, complex-valued functions. Clearly we recover the usual Euclidean inner product (up to a factor) by choosing \mu to be the normalised counting measure on \{1,2,\dots,n\}\subseteq\mathbb R. Now note that Cauchy-Schwarz gives us that

|\langle \mathbf 1,f\rangle_\mu|^2 = \cos(\theta)^2 \langle f,f\rangle_\mu \leq \langle f,f\rangle_\mu, (*)

where \mathbf 1 is the constant 1 function. But note that the left hand side is just (\mathbb E_\mu f)^2, whereas the right hand side equals \mathbb E_\mu [|f|^2], hence giving the claim in the heading.

It is hard to imagine angles between functions. So we give a different way to “derive” Cauchy-Schwarz here: Recall the formula giving the variance of a function in terms of expectations:

\text{Var}_\mu f = \mathbb E_\mu [f^2]-(\mathbb E_\mu f)^2\geq 0.

Comparing with equation (*) we thus get

|\langle \mathbf 1,f\rangle_\mu|^2 = \langle f,f\rangle_\mu - \text{Var}_\mu f \leq \langle f,f\rangle_\mu.

Side remark (a curious identity): Turning this “additive” bound into a “multiplicative” one (i.e. \langle \mathbf 1,f\rangle_\mu^2 = \langle f,f\rangle_\mu (1-\frac{\text{Var}_\mu f}{\langle f,f\rangle_\mu})) and comparing with the cosine formulation from before, we notice (using \sin^2+\cos^2 = \mathbf 1) that \frac{\text{Var}_\mu f}{\langle f,f\rangle_\mu} = \sin(\theta)^2. (/end of side remark)

We conclude with the heuristic for a fixed(!) function f:\|f\|_\mu=1: “The smaller its variance, the better the bound (\mathbb Ef)^2\leq \mathbb E[f^2]. But it still remains to see how big the variance “typically” is. Now this “typical” will depend on the application one has in mind; the analogy to the previous section would be to choose f randomly as white noise, but this is not (most of the time) not what one has in mind when thinking about a “uniformly random” function. Depending on which field one comes from, the “canonical” choice might be chosen from a smoother family of functions, like Brownian motion on the interval [0,1]; it is not hard to convince oneself that “more smoothness” will typically decrease the (now itself random!) \text{Var}_\mu f. This can be made rigorous using Poincaré-type inequalities.