# Expected value of spurious correlation

by P Sellaz   Last Updated October 16, 2018 19:19

We draw $N$ samples, each of size $n$, independently from a Normal $(\mu,\sigma^2)$ distribution.

From the $N$ samples we then choose the 2 samples which have the highest (absolute) Pearson correlation with each other.

What is the expected value of this correlation?

Thanks [P.S. This is not homework]

I found the following article, which addresses this problem: Jiang, Tiefeng (2004). The Asymptotic Distributions of the Largest Entries of Sample Correlation Matrices. The Annals of Applied Probability, 14(2), 865-880.

Jiang shows that the asymptotic distribution of the statistic $L_n = \max_{1\leq i<j\leq N} |\rho_{ij}|$, where $\rho_{ij}$ is the correlation between the $i$th and $j$th random vectors of length $n$, is

$$\lim_{n \to \infty} \Pr[ nL_n^2 - 4\log n + \log(\log(n)) \leq y] = \exp\left(-\frac{1}{a^2\sqrt{8\pi}}\exp(-y/2)\right) \,,$$ where $a = \lim_{n\to\infty} n/N$ is assumed to exist in the paper and $N$ is a function of $n$.

Apparently this result holds for any distribution with a sufficient number of finite moments (Edit: See @cardinal's comment below). Jiang points out that this is a Type I extreme value (Gumbel) distribution. The location and scale are

$$\sigma=2,\quad\mu = 2\log\left( \frac{1}{a^2\sqrt{8\pi}} \right).$$
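To see where these come from, match the limiting cdf above to the standard Gumbel form $\exp\left(-e^{-(y-\mu)/\sigma}\right)$. Equating the exponents,

$$\frac{1}{a^2\sqrt{8\pi}}\,e^{-y/2} = e^{-(y-\mu)/\sigma} \iff \frac{y-\mu}{\sigma} = \frac{y}{2} + \log\left(a^2\sqrt{8\pi}\right)\,,$$

which forces $\sigma = 2$ and $\mu = -2\log\left(a^2\sqrt{8\pi}\right) = 2\log\left(\frac{1}{a^2\sqrt{8\pi}}\right)$.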

The expected value of the Type-I EV distribution is $\mu + \sigma \gamma$, where $\gamma$ denotes Euler's constant. However, as noted in the comments, convergence in distribution does not, in and of itself, guarantee convergence of the means to that of the limiting distribution.

If we could show such a result in this case, then the asymptotic expected value of $n L_n^2 -4\log n + \log(\log(n))$ would be

$$\lim_{n\to\infty} \mathbb E\left[ nL_n^2 - 4\log n + \log(\log(n)) \right] = -2\log\left(a^2\sqrt{8\pi} \right) + 2\gamma \,.$$

Note that this would give the asymptotic expected value of the largest squared correlation, whereas the question asked for the expected value of the largest absolute correlation. So not 100% there, but close.
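If one is willing to assume convergence of the expectation, a crude bridge from the squared to the absolute correlation is Jensen's inequality: since $x \mapsto \sqrt{x}$ is concave,

$$\mathbb E[L_n] \leq \sqrt{\mathbb E[L_n^2]}\,,$$

with approximate equality when $L_n^2$ is concentrated around its mean. Heuristically that should be the case here: the Gumbel limit has constant variance $\sigma^2\pi^2/6$ while the centering term grows like $4\log n$, so the relative spread of $L_n^2$ shrinks as $n$ grows. This is a heuristic, not a proof.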

I did a few brief simulations that lead me to think either 1) there's a problem with my simulation (likely), 2) there's a problem with my transcription / algebra (also likely), or 3) the approximation isn't valid for the values of $n$ and $N$ I used. Perhaps the OP can weigh in with some simulation results using this approximation?

jmtroos
April 05, 2012 02:12 AM

Further to the answer provided by @jmtroos, below are the details of my simulation, and a comparison with @jmtroos's derivation of the expectation from Jiang (2004), that is:

$$E\left[L_n^2 \right]= \frac{1}{n} \left \{ 2\log\left( \frac{N^2}{n^2\sqrt{8\pi}} \right) + 2\gamma+ 4\log n - \log(\log(n))\right \}$$

The values of this expectation seem to be above the simulated values for small $N$ and below for large $N$, and they appear to diverge slightly as $N$ increases. However, the differences diminish for increasing $n$, as we would expect, since the result is asymptotic. I have tried various $n \in [100,500]$. The simulation below uses $n=200$. I'm pretty new to R, so any hints or suggestions to make my code better would be warmly welcomed.

set.seed(1)

ns <- 500      # number of simulations for each N
n  <- 200      # length of each vector
mu <- 0        # parameters of the normal
sigma <- 1     # distribution we simulate from

par(mfrow = c(5, 5))

x <- trunc(seq(from = 5, to = n, length.out = 20))  # vector of Ns
y <- numeric(length(x))  # vector to store the mean largest correlations
k <- 1                   # index into y

for (N in x) {
  # for each N, simulate ns times
  J <- numeric(ns)  # largest absolute correlations for this N

  for (j in 1:ns) {
    # draw N independent N(mu, sigma^2) vectors of length n
    dt <- matrix(rnorm(n * N, mu, sigma), nrow = n, ncol = N)

    # obtain the largest absolute off-diagonal correlation
    m <- cor(dt)
    diag(m) <- NA
    J[j] <- max(abs(m), na.rm = TRUE)
  }

  hist(J, main = paste0("N=", N, " n=", n, " N(0,1)",
                        "\nmean=", round(mean(J), 4)))
  y[k] <- mean(J)
  k <- k + 1
}

lm1 <- lm(y ~ log(x))
summary(lm1)

logx_sq <- log(x)^2
lm2 <- lm(y ~ log(x) + logx_sq)
summary(lm2)
# linear models for these simulations

# expectation computed from Jiang (2004):
euler_gamma <- 0.5772156649  # Euler-Mascheroni constant
yy <- sqrt((2 * log(x^2 / (sqrt(8 * pi) * n^2)) + 2 * euler_gamma
            + 4 * log(n) - log(log(n))) / n)

plot(x, yy)                # expected values from the asymptotic result
points(x, y, col = "red")  # add the simulated mean correlations

P Sellaz
April 05, 2012 17:30 PM