Preface

 

Data is everywhere, and it’s used in practically every industry in one way or another. One of the most common ways to interact with data, whether numbers or text, is with spreadsheet software. This approach offers several useful features: presenting data in a tabular view, allowing calculations to be performed using those values, and producing summaries of data. What spreadsheets don’t tend to provide is a way to do this repeatedly, reproducibly, or programmatically (without clicking or copying and pasting). Spreadsheets can be great for displaying data (including limited data summaries), but when you want to do something truly powerful with data, you need to go beyond them to a programming language.

Data munging, the manipulation of raw data, is a cornerstone of data science. Munging techniques include cleaning, sorting, parsing, filtering, and pretty much anything else you need to do to make data truly useful. They say 90% of data science is preparing the data, and the other 90% is actually doing something with it. Don’t underestimate how important it is to carefully prepare data; the interpretation of any analysis hinges on getting this step right.

Using a programming language to perform data munging means the things you do to your data are recorded, can be reproduced from the raw source, and can be inspected later, or even changed if necessary. Trying to do this from a spreadsheet means either writing down which button to press and when, or accepting a broken link between output and input.