Chapter 7. Data mining: process, toolkits, and standards
This chapter covers
- A brief overview of the data mining process
- Introduction to key mining algorithms
- WEKA, the open source data mining software
- JDM, the Java Data Mining standard
The data mining process enables us to find gems of information by analyzing data. In this chapter, you’ll be introduced to the field of data mining. The range of data mining algorithms, tools, and jargon can be overwhelming, so this chapter provides a brief overview and walks you through the process of building useful models.
Implementing algorithms takes time and expertise. Fortunately, there are free open source data mining frameworks that we can leverage. We use WEKA (Waikato Environment for Knowledge Analysis), a Java-based open source toolkit that’s widely used in the data mining community. We look at the core packages of WEKA and work through a simple example to show how WEKA can be used for learning.
We don’t want our implementation to be tied to WEKA, however. Fortunately, two initiatives through the Java Community Process, JSR 73 and JSR 247, provide a standard API for data mining, known as Java Data Mining (JDM). We discuss JDM in the last section of this chapter and review its core components. We take a deeper look at JDM in chapters 9 and 10, when we discuss clustering and predictive models.