9 Unsupervised methods

This chapter covers

  • Using R’s clustering functions to explore data and look for similarities
  • Choosing the right number of clusters
  • Evaluating a cluster
  • Using R’s association rules functions to find patterns of co-occurrence in data
  • Evaluating a set of association rules

In the previous chapter, we covered using the vtreat package to prepare messy real-world data for modeling. In this chapter, we’ll look at unsupervised methods: methods for discovering unknown relationships in data. With unsupervised methods, there’s no outcome that you’re trying to predict; instead, you want to discover patterns in the data that perhaps you hadn’t previously suspected. For example, you may want to find groups of customers with similar purchase patterns, or correlations between population movement and socioeconomic factors. We still consider this pattern discovery to be “modeling,” so the outcomes of the algorithms can still be evaluated, as shown in the mental model for this chapter (figure 9.1).

Figure 9.1. Mental model

Unsupervised analyses are often not ends in themselves; rather, they’re ways of finding relationships and patterns that can be used to build predictive models. In fact, we encourage you to think of unsupervised methods as exploratory—procedures that help you get your hands in the data—rather than as black-box approaches that mysteriously and automatically give you “the right answer.”

9.1. Cluster analysis

9.1.1. Distances

9.1.2. Preparing the data

9.1.3. Hierarchical clustering with hclust

9.1.4. The k-means algorithm

9.1.5. Assigning new points to clusters

9.1.6. Clustering takeaways

9.2. Association rules

9.2.1. Overview of association rules

9.2.2. The example problem

9.2.3. Mining association rules with the arules package

9.2.4. Association rule takeaways