This chapter covers
In many of my use cases, I had to get data from nontraditional data sources into Apache Spark. Imagine that your data is in an enterprise resource planning (ERP) package, and you want to ingest it via the ERP’s REST API. Of course, you could create a standalone application that dumps all the data into a CSV or JSON file and then ingest the file or files, but you don’t want to deal with the life cycle of each file. When can you delete it? Who has access to it? Could the disk fill up at some point? Do you need all the data at once?
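To make the alternative concrete, here is a minimal sketch of ingesting REST data directly into memory, with no intermediate file to manage. The endpoint, field names, and paging scheme are all hypothetical (the HTTP call is mocked so the sketch runs standalone); a real pipeline would issue an HTTP GET against the ERP’s API instead.

```python
import json
from typing import Iterator, Optional

# Hypothetical paged responses from an ERP REST API. In a real pipeline,
# fetch_page() would perform an HTTP GET (e.g., with urllib.request);
# the payloads are mocked here so the sketch is self-contained.
PAGES = [
    '{"items": [{"id": 1, "status": "OPEN"}, {"id": 2, "status": "CLOSED"}], "next": 2}',
    '{"items": [{"id": 3, "status": "OPEN"}], "next": null}',
]

def fetch_page(page: int) -> dict:
    """Stand-in for an HTTP GET against the ERP's REST endpoint."""
    return json.loads(PAGES[page - 1])

def ingest_records() -> Iterator[dict]:
    """Stream records page by page. No CSV or JSON file ever lands on
    disk, so there is no file life cycle (deletion, access rights,
    disk space) to manage."""
    page: Optional[int] = 1
    while page is not None:
        payload = fetch_page(page)
        yield from payload["items"]
        page = payload["next"]  # JSON null becomes None: last page

records = list(ingest_records())
# In Spark, these in-memory records could then feed
# spark.createDataFrame(records) directly, with no staging files.
```

The design point is that the data flows straight from the API into the processing engine, and paging means you never need all the data in memory at once either.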
Imagine this simple scenario: you saw a computer numerical control (CNC) router in the Hillsborough workshop, and it outputs status reports in weird formats. More recently, you saw the digital imaging and communications in medicine (DICOM) files from the X-ray machine you just installed at Duke. Once more, you may be able to extract the data you need from those files and stage it as CSV or JSON, but then you face the same file life-cycle issues.