{"id":234,"date":"2020-02-23T10:26:00","date_gmt":"2020-02-23T10:26:00","guid":{"rendered":"http:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/hamish-thorburn\/?p=234"},"modified":"2021-11-08T10:27:21","modified_gmt":"2021-11-08T10:27:21","slug":"model-based-clustering","status":"publish","type":"post","link":"https:\/\/www.lancaster.ac.uk\/stor-i-student-sites\/hamish-thorburn\/2020\/02\/23\/model-based-clustering\/","title":{"rendered":"Model-based clustering"},"content":{"rendered":"\n
Today’s post is based on a Masterclass given to the STOR-i cohort by Brendan Murphy from University College Dublin.<\/p>\n\n\n\n
In data science, clustering<\/strong> is the process of partitioning objects into groups, or clusters<\/strong>, such that members of the same cluster are more ‘similar’ to each other than they are to members of different clusters.<\/p>\n\n\n\n Many common clustering methods (e.g. k-means<\/a> or hierarchical clustering<\/a>) are based on a measure known as the distance<\/strong> or dissimilarity<\/strong> between the points (for example, the straight-line Euclidean distance between them). This is then used in a number of different ways to assign points to clusters – for example, in k-means clustering, each point is assigned to the cluster with the closest mean.<\/p>\n\n\n\n While these methods are very popular, they do suffer from drawbacks. Without assuming a model that generates these points, it is hard to claim with certainty that future observations will fall into the same clusters. In addition, some of these algorithms can’t properly deal with many frequently repeated observations.<\/p>\n\n\n\n Professor Murphy’s Masterclass instead presented a framework for clustering continuous data known as a Gaussian Mixture Model<\/strong>. This is a form of clustering which assumes that the data come from a particular probability model.<\/p>\n\n\n\n The model is based on 3 general assumptions:<\/p>\n\n\n\n These assumptions leave us with two problems to solve when fitting the model:<\/p>\n\n\n\n It is clear that these two problems are related. The mean and variance of a cluster will depend on which observations are assigned to it, and the cluster an observation should be assigned to will depend on that cluster’s mean and variance. 
Fortunately, there is a way to solve both problems simultaneously using the Expectation-Maximisation (EM) Algorithm.<\/strong> The algorithm works by repeatedly performing an E-step, which works out how likely each observation is to belong to each cluster given the current means and variances, and an M-step, which then updates the cluster means and variances based on those assignments.<\/p>\n\n\n\n For a simple example, we will look at clustering different species of iris flowers, based on the length and width of their petals and sepals.<\/p>\n\n\n\n
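To make the E-step and M-step concrete, here is a minimal sketch of EM for a one-dimensional Gaussian mixture, written with the Python standard library only. It is an illustration rather than the method used in the Masterclass. The function names (`normal_pdf`, `fit_gmm_1d`), the quantile-based starting values and the synthetic two-cluster data are all assumptions made for this example, and the iris measurements themselves are not used. The E-step here computes a probability of membership for every cluster (the usual form of the step), and the single most likely cluster is only read off at the end.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a univariate Gaussian with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(data, k=2, n_iter=100):
    """Fit a k-component 1-D Gaussian mixture by EM (illustrative sketch only)."""
    n = len(data)
    ordered = sorted(data)
    # Starting values (an assumption of this sketch): evenly spaced quantiles
    # as means, the overall variance for every cluster, equal mixing weights.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var] * k
    weights = [1.0 / k] * k

    for _ in range(n_iter):
        # E-step: probability that each observation belongs to each cluster,
        # given the current weights, means and variances.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])

        # M-step: re-estimate weights, means and variances, with every
        # observation weighted by its membership probability.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, data)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, data)) / nj
            variances[j] = max(var_j, 1e-6)  # floor to avoid a collapsed cluster

    # Hard assignment: the most probable cluster for each observation.
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


# Synthetic data (not the iris data): two clusters centred at 0 and 5.
rng = random.Random(1)
data = ([rng.gauss(0.0, 1.0) for _ in range(150)]
        + [rng.gauss(5.0, 1.0) for _ in range(150)])
weights, means, variances, labels = fit_gmm_1d(data, k=2)
```

The sketch makes no attempt at numerical robustness (for instance, it does not work in log space), and a real analysis of the four iris measurements would need the multivariate version with a covariance matrix per cluster.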
<\/figure><\/li>
<\/figure><\/li><\/ul>Model-based clustering<\/h2>\n\n\n\n
An example – Iris dataset<\/h2>\n\n\n\n