This chapter covers
In many of my use cases, I had to get data from nontraditional data sources into Apache Spark. Imagine that your data is in an enterprise resource planning (ERP) package, and you want to ingest it via the ERP’s REST API. Of course, you could create a standalone application that dumps all the data into a CSV or JSON file and then ingest the file or files, but you don’t want to deal with the life cycle of each file. When can you delete it? Who has access to it? Could the disk fill up at some point? Do you need all the data at once?
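To make the alternative concrete, here is a minimal sketch of ingesting REST data directly into memory, with no intermediate file to manage. The endpoint, field names, and paging scheme are all hypothetical (the HTTP call is mocked so the sketch runs standalone); a real pipeline would issue an HTTP GET against the ERP’s API instead.

```python
import json
from typing import Iterator, Optional

# Hypothetical paged responses from an ERP REST API. In a real pipeline,
# fetch_page() would perform an HTTP GET (e.g., with urllib.request);
# the payloads are mocked here so the sketch is self-contained.
PAGES = [
    '{"items": [{"id": 1, "status": "OPEN"}, {"id": 2, "status": "CLOSED"}], "next": 2}',
    '{"items": [{"id": 3, "status": "OPEN"}], "next": null}',
]

def fetch_page(page: int) -> dict:
    """Stand-in for an HTTP GET against the ERP's REST endpoint."""
    return json.loads(PAGES[page - 1])

def ingest_records() -> Iterator[dict]:
    """Stream records page by page. No CSV or JSON file ever lands on
    disk, so there is no file life cycle (deletion, access rights,
    disk space) to manage."""
    page: Optional[int] = 1
    while page is not None:
        payload = fetch_page(page)
        yield from payload["items"]
        page = payload["next"]  # JSON null becomes None: last page

records = list(ingest_records())
# In Spark, these in-memory records could then feed
# spark.createDataFrame(records) directly, with no staging files.
```

The design point is that the data flows straight from the API into the processing engine, and paging means you never need all the data in memory at once either.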
Imagine this simple scenario: you saw a computer numerical control (CNC) router in the Hillsborough workshop, and it outputs status reports in weird formats. More recently, you saw the digital imaging and communications in medicine (DICOM) files from the X-ray machine you just installed at Duke. Once more, you may be able to extract the data you need from those files and stage it as CSV or JSON, but then you face the same file life-cycle issues.