by P Sellaz
Last updated October 16, 2018, 19:19

We draw $N$ samples, each of size $n$, independently from a Normal $(\mu,\sigma^2)$ distribution.

From the $N$ samples we then choose the 2 samples which have the highest (absolute) Pearson correlation with each other.

What is the expected value of this correlation?

Thanks [P.S. This is not homework]

I found the following article, which addresses this problem: Jiang, Tiefeng (2004). The Asymptotic Distributions of the Largest Entries of Sample Correlation Matrices. *The Annals of Applied Probability,* 14(2), 865-880

Jiang shows that the statistic $L_n = \max_{1\leq i<j\leq N} |\rho_{ij}|$, where $\rho_{ij}$ is the Pearson correlation between the $i$th and $j$th random vectors of length $n$, has the asymptotic distribution

$$ \lim_{n \to \infty} \Pr[ nL_n^2 - 4\log n + \log(\log(n)) \leq y] = \exp\left(-\frac{1}{a^2\sqrt{8\pi}}\exp(-y/2)\right) \,, $$ where $a = \lim_{n\to\infty} n/N$ is assumed to exist in the paper and $N$ is a function of $n$.

Apparently this result holds for ~~any distribution~~ distributions with a sufficient number of finite moments (**Edit:** See @cardinal's comment below). Jiang points out that this is a Type I extreme value distribution. The location and scale are

$$ \sigma=2,\quad\mu = 2\log\left( \frac{1}{a^2\sqrt{8\pi}} \right). $$
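To see where these come from, match the limiting CDF above to the Type-I EV (Gumbel) form $\exp\left(-\exp\left(-\frac{y-\mu}{\sigma}\right)\right)$:

$$ \exp\left(-\frac{1}{a^2\sqrt{8\pi}}\exp(-y/2)\right) = \exp\left(-\exp\left(-\frac{y}{2}+\log\frac{1}{a^2\sqrt{8\pi}}\right)\right), $$

so equating exponents, $-\frac{y-\mu}{\sigma} = -\frac{y}{2}+\log\frac{1}{a^2\sqrt{8\pi}}$, which gives $\sigma = 2$ and $\mu = 2\log\left(\frac{1}{a^2\sqrt{8\pi}}\right)$.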

The expected value of the Type-I EV distribution is $\mu + \sigma \gamma$, where $\gamma$ denotes Euler's constant. However, as noted in the comments, convergence in distribution does not, in and of itself, guarantee convergence of the means to that of the limiting distribution.
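The mean formula $\mu + \sigma\gamma$ is easy to sanity-check numerically. Here is a small Python Monte Carlo sketch (the values of $\mu$ and $\sigma$ are illustrative, not Jiang's), which samples the Type-I EV distribution via its inverse CDF:

```python
import math
import random

# Sanity check (illustrative, not from Jiang): the Type-I EV (Gumbel)
# distribution with location mu and scale sigma has mean mu + sigma*gamma,
# where gamma is Euler's constant.
# Inverse-CDF sampling: F(x) = exp(-exp(-(x - mu)/sigma))
#                       =>  x = mu - sigma*log(-log(u)),  u ~ Uniform(0,1).
random.seed(0)
mu, sigma = -2.0, 2.0        # illustrative location and scale
gamma = 0.5772156649         # Euler's constant
draws = [mu - sigma * math.log(-math.log(random.random()))
         for _ in range(200_000)]
mc_mean = sum(draws) / len(draws)
print(mc_mean, mu + sigma * gamma)  # the two should agree to ~1e-2
```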

*If* we could show such a result in this case, then the asymptotic expected value of $n L_n^2 -4\log n + \log(\log(n))$ would be

$$ \lim_{n\to\infty} \mathbb E\left[ nL_n^2 - 4\log n + \log(\log(n)) \right] = -2\log\left(a^2\sqrt{8\pi} \right) + 2\gamma \,. $$

Note that this would give the asymptotic expected value of the largest squared correlation, whereas the question asked for the expected value of the largest absolute correlation. So not 100% there, but close.
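If the means do converge as hoped, rearranging the display above for $\mathbb E[L_n^2]$ and substituting $a = n/N$ gives an explicit finite-$n$ approximation:

$$ \mathbb E\left[L_n^2\right] \approx \frac{1}{n}\left\{-2\log\left(a^2\sqrt{8\pi}\right) + 2\gamma + 4\log n - \log(\log(n))\right\} = \frac{1}{n}\left\{2\log\left(\frac{N^2}{n^2\sqrt{8\pi}}\right) + 2\gamma + 4\log n - \log(\log(n))\right\}. $$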

I did a few brief simulations that led me to think either 1) there's a problem with my simulation (likely), 2) there's a problem with my transcription / algebra (also likely), or 3) the approximation isn't valid for the values of $n$ and $N$ I used. Perhaps the OP can weigh in with some simulation results using this approximation?

Further to the answer provided by @jmtroos, below are the details of my simulation, and a comparison with @jmtroos's derivation of the expectation from Jiang (2004), that is:

$$E\left[L_n^2 \right]= \frac{1}{n} \left \{ 2\log\left( \frac{N^2}{n^2\sqrt{8\pi}} \right) + 2\gamma+ 4\log n - \log(\log(n))\right \}$$

The values of this expectation seem to lie above the simulated values for small $N$ and below them for large $N$, and the two appear to diverge slightly as $N$ increases. However, the differences diminish as $n$ increases, which is what we would expect, since the result in the paper is asymptotic. I have tried various $n \in [100,500]$; the simulation below uses $n=200$. I'm pretty new to R, so any hints or suggestions to make my code better would be warmly welcomed.

```
set.seed(1)
ns <- 500        # number of simulations for each N
n  <- 200        # length of each vector
mu <- 0
sigma <- 1       # parameters of the distribution we simulate from

par(mfrow = c(5, 5))
x <- trunc(seq(from = 5, to = n, length.out = 20))  # vector of Ns
y <- vector(mode = "numeric")  # mean largest correlation for each N
k <- 1                         # index into y

for (N in x) {                   # loop over a range of N
  J <- vector(mode = "numeric")  # largest absolute correlation, per replicate
  for (j in 1:ns) {
    # simulate N vectors of length n and take their correlation matrix
    dt <- matrix(rnorm(n * N, mu, sigma), nrow = n, ncol = N)
    m <- cor(dt)                 # N x N sample correlation matrix
    diag(m) <- NA                # ignore the unit diagonal
    J[j] <- max(abs(m), na.rm = TRUE)  # largest off-diagonal |correlation|
  }
  hist(J, main = paste("N=", N, " n=", n, " N(0,1)",
                       "\nmean=", round(mean(J), 4)))
  y[k] <- mean(J)
  k <- k + 1
}

# linear models for these simulations
lm1 <- lm(y ~ log(x))
summary(lm1)
logx_sq <- log(x)^2
lm2 <- lm(y ~ log(x) + logx_sq)
summary(lm2)

# Jiang (2004) approximation: sqrt of the expectation of L_n^2
gamma <- 0.5772
yy <- sqrt((2 * log((x^2) / (sqrt(8 * pi) * n^2)) + 2 * gamma
            - (-4 * log(n) + log(log(n)))) / n)
plot(x, yy)                # theoretical approximation
points(x, y, col = "red")  # add the simulated mean correlations
```
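For readers who want to check the theoretical curve without R, here is an equivalent Python sketch (the grid of $N$ values and $n = 200$ are assumptions mirroring the code above; note also that $\sqrt{\mathbb E[L_n^2]}$ is only a proxy for $\mathbb E[L_n]$, by Jensen's inequality):

```python
import math

def expected_max_abs_corr(n, N, gamma=0.5772156649):
    """sqrt of the approximate E[L_n^2] from the rearranged Jiang (2004) result."""
    e_Ln2 = (2 * math.log(N**2 / (math.sqrt(8 * math.pi) * n**2))
             + 2 * gamma + 4 * math.log(n) - math.log(math.log(n))) / n
    return math.sqrt(e_Ln2)

n = 200                         # length of each vector, as in the R code
Ns = [5, 25, 50, 100, 200]      # assumed sample of N values from the grid
curve = [expected_max_abs_corr(n, N) for N in Ns]
print(curve)  # grows with N: more vectors => larger expected max correlation
```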
