4 Data sources and biases

The mission of the (fictional) non-profit organization Unconditionally is charitable giving. It collects donations and distributes unconditional cash transfers (funds with no strings attached) to poor households in East Africa. The recipients are free to do whatever they like with the money. Unconditionally is undertaking a new machine learning project to identify the poorest of these households and select them for the cash transfers. The faster the team completes the project, the faster and more efficiently the organization can move much-needed money to the recipients, some of whom need to replace their thatched roofs before the rainy season begins.

The team is in the data understanding phase of the machine learning lifecycle. Imagine that you are a data scientist on the team, pondering which data sources to use as features and labels for estimating the wealth of households. You examine many kinds of data, including daytime satellite imagery, nighttime illumination satellite imagery, national census data, household survey data, call detail records from mobile phones, mobile money transactions, and social media posts, among others. Which will you choose, and why? Will your choices lead to unintended consequences or to a trustworthy system?

4.1 Modalities

4.2 Data sources

4.2.1 Purposefully collected data

4.2.2 Administrative data

4.2.3 Social data

4.2.4 Crowdsourcing

4.2.5 Data augmentation

4.2.6 Conclusion

4.3 Kinds of biases

4.3.1 Social bias

4.3.2 Representation bias

4.3.3 Temporal bias

4.3.4 Data preparation bias

4.3.5 Data poisoning

4.3.6 Conclusion

4.4 Summary