Appendix C. Dataset and feature dictionaries

Reproducing the AI architectures discussed in this book requires a clear understanding of the underlying data structures. Financial data spans highly regulated personal histories, real-time market movements, and complex behavioral interactions, so each part of the book draws on a different type of source: public Kaggle datasets for risk and fraud, live market APIs for investment, and a custom-generated synthetic dataset for personalization.

This appendix provides the data dictionaries and schemas for the core features and engineered variables used across these modeling pipelines.

C.1 Credit risk dataset (Part 2)

To build the credit scoring pipelines in Chapters 4 through 6, we used the AMEX Default Prediction dataset from Kaggle. This dataset contains high-dimensional, anonymized credit data. To keep memory use manageable during our modeling exercises, we sampled 100,000 rows from the original 50GB+ dataset.
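The text does not say how the 100,000 rows were drawn, so the following is only a minimal sketch of one way to take a uniform random sample from a file too large to load into memory. It uses reservoir sampling (Algorithm R) over a streamed CSV. The function names, the file path, and the seed are illustrative assumptions, not the pipeline used in the chapters.

```python
import csv
import random
from typing import Iterable, List, Optional, Tuple


def reservoir_sample(rows: Iterable[list], k: int,
                     seed: Optional[int] = None) -> List[list]:
    """Return up to k rows drawn uniformly at random from a stream.

    Only k rows are held in memory at any time (Algorithm R).
    """
    rng = random.Random(seed)
    sample: List[list] = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample


def sample_csv(path: str, k: int,
               seed: Optional[int] = None) -> Tuple[list, List[list]]:
    """Stream a CSV file and return (header, sampled_rows)."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)  # keep the header out of the sample
        return header, reservoir_sample(reader, k, seed)


# Hypothetical usage; the path is a placeholder for the downloaded file:
# header, rows = sample_csv("path/to/amex_train.csv", k=100_000, seed=42)
```

Reservoir sampling reads the file once and never needs the total row count in advance, which suits a 50GB+ source. Fixing the seed makes the sample reproducible across runs.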
