# What is the assumption on the distribution of data in Gaussian mixture models?

by Olórin   Last Updated March 14, 2019 20:19 PM

https://www.ics.uci.edu/~smyth/courses/cs274/notes/EMnotes.pdf

However, I am super confused at the very first line.

It says:

We have a dataset of some data $$x_i$$

Each data point is assumed to be generated i.i.d. from an underlying distribution. We assume that the underlying distribution is a mixture of Gaussian distributions.
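For reference, the "mixture of Gaussians" assumption is usually written as the density below. This is a sketch in assumed notation: the symbols $$K$$ (number of components), $$\alpha_k$$ (mixing weights), $$\mu_k$$ and $$\Sigma_k$$ (component means and covariances) are chosen here for illustration and may differ from the linked notes.

$$p(x_i \mid \Theta) = \sum_{k=1}^{K} \alpha_k \, \mathcal{N}(x_i \mid \mu_k, \Sigma_k), \qquad \alpha_k \ge 0, \qquad \sum_{k=1}^{K} \alpha_k = 1$$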

I do not understand why we assume that the underlying distribution of the data is a mixture of Gaussians.

This seems to me to be completely false.

The data distribution could be anything. We are only fitting a Gaussian mixture model to whatever that underlying distribution is. We are maximizing the log-likelihood using EM to approximate that distribution with the GMM.

Why do people assume that the data themselves are generated through Gaussians?

Is my interpretation correct?

When we model a (true) distribution with a mixture of Gaussians (MG), we can say that we have assumed the distribution is an MG. Similarly, in linear regression we say we assume the relationship between Y and X is linear, although it is unlikely to be exactly linear. We should not interpret "assuming" as "believing": we don't believe it, we just assume it, and the assumption may be an obvious, unrealistic simplification. This is why we speak of "simplifying assumptions"; we admit our ignorance right at the beginning.
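One way to make this concrete (a sketch; the symbols $$p^*$$ for the true density and $$q_\theta$$ for the mixture model are assumed notation, not taken from the question): with many i.i.d. samples, maximizing the average log-likelihood is the same as minimizing the KL divergence from the true distribution to the model,

$$\arg\max_\theta \; \mathbb{E}_{x \sim p^*}\left[\log q_\theta(x)\right] = \arg\min_\theta \; \mathrm{KL}(p^* \,\|\, q_\theta), \qquad \mathrm{KL}(p^* \,\|\, q_\theta) = \mathbb{E}_{x \sim p^*}\left[\log p^*(x) - \log q_\theta(x)\right]$$

because the term $$\mathbb{E}_{x \sim p^*}[\log p^*(x)]$$ does not depend on $$\theta$$. So the maximum-likelihood fit is well defined even when $$p^*$$ is not a mixture of Gaussians: it targets the member of the model family that is closest to $$p^*$$ in this sense.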

Esmailian
March 14, 2019 20:03 PM

Actually, the GMM assumes the underlying data is generated from Gaussians. By accepting and using the model, you automatically put yourself in the position of assuming Gaussianity of the data. What you actually believe is that the GMM will be able to represent your data approximately, and well enough. Almost every algorithm comes with assumptions that you accept, e.g. Naive Bayes assumes that the features are conditionally independent given the class. Remember that almost all models are wrong.
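To illustrate "wrong but good enough", here is a minimal sketch, assuming 1-D data drawn from an exponential distribution (clearly not a mixture of Gaussians). It fits a GMM with a hand-written EM loop using only the standard library. The function names (`fit_gmm_1d`, `avg_log_likelihood`), the sample size, the seeds and the variance floor are all choices made for this illustration, not anything from the linked notes.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of the univariate Gaussian N(mean, var) at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def mixture_pdf(x, weights, means, variances):
    """Density of a 1-D Gaussian mixture at x."""
    return sum(w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances))


def avg_log_likelihood(data, weights, means, variances):
    """Average log-density of the data under the mixture."""
    tiny = 1e-300  # guards against log(0) if every component density underflows
    return sum(math.log(mixture_pdf(x, weights, means, variances) + tiny)
               for x in data) / len(data)


def fit_gmm_1d(data, k, n_iter=100, seed=0):
    """Fit a k-component 1-D Gaussian mixture with plain EM (illustrative only)."""
    rng = random.Random(seed)
    n = len(data)
    grand_mean = sum(data) / n
    grand_var = sum((x - grand_mean) ** 2 for x in data) / n
    weights = [1.0 / k] * k
    means = rng.sample(data, k)      # start the means at random data points
    variances = [grand_var] * k      # start every component at the overall variance

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each point
        resp = []
        for x in data:
            p = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(p) + 1e-300
            resp.append([pj / total for pj in p])
        # M-step: re-estimate weight, mean and variance of each component
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor avoids a collapsed component
    return weights, means, variances


# Data that is NOT Gaussian: exponential with rate 1, true density exp(-x) for x >= 0
data_rng = random.Random(42)
data = [data_rng.expovariate(1.0) for _ in range(1000)]

params_one = fit_gmm_1d(data, k=1)    # a single Gaussian
params_three = fit_gmm_1d(data, k=3)  # a three-component mixture

ll_one = avg_log_likelihood(data, *params_one)
ll_three = avg_log_likelihood(data, *params_three)
ll_true = sum(-x for x in data) / len(data)  # average of log(exp(-x)) under the true density

print("avg log-likelihood, 1 Gaussian  :", ll_one)
print("avg log-likelihood, 3 Gaussians :", ll_three)
print("avg log-likelihood, true density:", ll_true)
```

Neither fit is "true", since the data are exponential, but comparing the printed values shows how much closer the mixture gets than a single Gaussian. Note that the log-likelihoods here are computed on the training data, so they flatter models with more components; a held-out sample would be a fairer comparison.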

gunes
March 14, 2019 20:09 PM