Cluster analysis of variables or observations?

by bobmcpop   Last Updated October 10, 2018 19:19 PM

I'm very new to cluster analysis. In papers such as Richette et al. [1], which looks at which concomitant diseases cluster together, the authors first cluster the variables and then the observations (i.e., patients). Bevis et al. [2] did the same thing. They used SAS's PROC VARCLUS and factor analysis (others have used PCA) to cluster the variables, and cluster analysis for the patients. I don't understand why they would (need to) do both. In the first paper, all of the discussion centered on the latter.

  1. Richette P, Clerson P, Périssin L, et al. Revisiting comorbidities in gout: a cluster analysis. Annals of the Rheumatic Diseases 2015;74:142-147.
  2. Bevis, et al. Comorbidity clusters in people with gout: an observational cohort study with linked medical record review. Rheumatology (Oxford) 2018;57(8):1358-1363.

Answers 1

From a mathematical point of view, a standard dataset is just a matrix of numbers organized into rows and columns. We attach meanings to these, and think of the rows as pertaining to patients and the columns as representing variables, but they're just numbers and you can perform mathematical operations on them. The question is whether any given operation is meaningful.
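To make that concrete, here is a minimal sketch (the numbers and the helper names `transpose` and `closest_pair` are made up for illustration). The same "find the two most similar rows" routine is applied once to the data matrix and once to its transpose; mathematically nothing changes, only what the answer means.

```python
# Toy data: 4 patients (rows) x 3 variables (columns); values are invented.
data = [
    [1.0, 1.1, 9.0],
    [2.0, 2.1, 7.0],
    [8.0, 8.2, 2.0],
    [9.0, 9.1, 1.0],
]

def transpose(matrix):
    """Swap rows and columns."""
    return [list(col) for col in zip(*matrix)]

def closest_pair(rows):
    """Return the indices of the two rows with the smallest Euclidean distance."""
    best = None
    for i in range(len(rows)):
        for j in range(i + 1, len(rows)):
            d = sum((a - b) ** 2 for a, b in zip(rows[i], rows[j])) ** 0.5
            if best is None or d < best[0]:
                best = (d, i, j)
    return best[1], best[2]

closest_patients = closest_pair(data)              # compares rows (patients)
closest_variables = closest_pair(transpose(data))  # compares columns (variables)
print(closest_patients, closest_variables)
```

Whether the second call is meaningful depends on the data: comparing columns by raw distance only makes sense if the variables are on comparable scales, which is why correlation-based measures are more common when grouping variables.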

Variables can be understood as manifestations of some underlying truth that we can't observe directly. These unobserved quantities are called latent variables, and people often combine the observed variables to get a better picture of them. The standard way to estimate latent variables is factor analysis, but PCA will typically yield almost the same results, and clustering algorithms can be applied to the columns (variables) to do the same thing. Clustering the variables guarantees that the result will have simple structure (each variable belongs to exactly one group), at the cost of a worse empirical fit. That's presumably what the authors were after. This step is done first because there's no point in clustering patients on the wrong variables; that would bias the results.
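The two-stage idea can be sketched as follows. This is only an illustration under stated assumptions: the data and column names are invented, and average-linkage clustering on a 1 - r correlation distance is a simple stand-in for grouping variables, not a reimplementation of PROC VARCLUS or of what the cited papers did. Each variable group is collapsed to a mean score, and the patients are then clustered on those scores.

```python
from statistics import mean, pstdev

# Hypothetical toy data: 6 patients x 4 binary comorbidity indicators.
# Names and values are illustrative only, not taken from the cited papers.
names = ["hypertension", "obesity", "renal_failure", "heart_failure"]
data = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
]

def correlation(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = mean((a - mx) * (b - my) for a, b in zip(x, y))
    return cov / (pstdev(x) * pstdev(y))

def agglomerate(dist, k):
    """Average-linkage agglomerative clustering down to k clusters.

    dist is a symmetric matrix of pairwise distances between items.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = mean(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)

# Stage 1: cluster the variables (columns) using 1 - r as the distance.
cols = list(zip(*data))
var_dist = [[1 - correlation(x, y) for y in cols] for x in cols]
var_clusters = agglomerate(var_dist, 2)
var_groups = [[names[j] for j in group] for group in var_clusters]

# Each variable sits in exactly one group, so every patient gets one
# composite score per group (here simply the mean of the group's variables).
scores = [[mean(row[j] for j in group) for group in var_clusters] for row in data]

# Stage 2: cluster the patients (rows) on the composite scores.
pat_dist = [[sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5 for q in scores]
            for p in scores]
patient_clusters = agglomerate(pat_dist, 2)

print(var_groups)
print(patient_clusters)
```

The number of clusters (2 at both stages) is fixed by hand here; in a real analysis it would have to be chosen and justified, and the composite scores would more likely be factor or component scores than plain means.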

October 10, 2018 18:55 PM
