Chapter 8. Representing data
This chapter covers
- Representing data as a Vector
- Converting text documents into Vector form
- Normalizing data representations
To get good clustering, you need to understand the techniques of vectorization: the process of representing objects as Vectors. A Vector is a simplified, numeric representation of an object that gives a clustering algorithm something it can work with and lets it compute the object's similarity to other objects. This chapter explores various ways of converting different kinds of objects into Vectors.
In the last chapter, you got a taste of clustering. Books were clustered together based on the similarity of their words, and points in a two-dimensional plane were clustered together based on the distances between them. In reality, clustering can be applied to any kind of object, provided you can distinguish similar items from dissimilar ones. Images could be clustered based on their colors, the shapes in the images, or both. You could cluster photographs to try to distinguish photos of animals from photos of humans. You could even cluster species of animals by their average sizes, weights, and number of legs to discover groupings automatically.
As humans, we can cluster these objects because we understand them, and we “just know” what is similar and what isn’t. Computers, unfortunately, have no such intuition, so the clustering of anything by algorithms starts with representing the objects in a way that can be read by computers.
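To make the idea concrete, here is a minimal sketch in Python of the animal example: each species becomes a list of three numbers, and similarity becomes a distance between those lists. The feature order, the species, and every value below are made up for illustration, and the helper names euclidean_distance and nearest are hypothetical. This isn't a clustering algorithm, just the representation step that one would build on.

```python
import math

# Assumed feature order: [average size in meters, average weight in kg, number of legs].
# All values are illustrative placeholders, not measured data.
animals = {
    "cat": [0.46, 4.5, 4],
    "fox": [0.70, 6.0, 4],
    "sparrow": [0.16, 0.03, 2],
    "pigeon": [0.32, 0.35, 2],
}


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(name):
    """Return the name of the other animal whose vector is closest to `name`."""
    others = (other for other in animals if other != name)
    return min(others, key=lambda other: euclidean_distance(animals[name], animals[other]))


if __name__ == "__main__":
    for name in animals:
        print(name, "is closest to", nearest(name))
```

With these numbers, the four-legged animals end up closest to each other and so do the two birds, which is the kind of grouping a person would pick by intuition. Notice that the three features use very different scales, so whichever feature has the largest numeric range tends to dominate the distance.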