Index
Symbols
.coef method 170
.describe() method 49
.explore() method 50
.head(n) method 46
.isna method 48
.save() method 344
.score method 95
.transform method 97
.unique() method 327
A
Advanced Data Analytics API (ChatGPT4) 11
Airbnb example, exploring target 265
algorithms
choosing 131
linear regression 111
logistic regression 111
alpha hyperparameter 459
alpha regularizer 172
argparser list 420
AUC (area under the curve) 42
averaging 142
B
BaggingClassifier 144
BaggingRegressor 144
baseline model, building 273
batches 126
Bernoulli distribution 16
best first 183
best practices, deep learning
contrasting custom layer and Keras preprocessing layer approaches 341
bias detection and mitigation 287
boosting 142
bootstrap hyperparameter 460
bootstrapping 17
boxplot 102
C
categorical columns 4
categoricals 22
category_encoders package 271
CategoryEncoding layer 349
ccp_alpha parameter 163
ChatGPT4 (Advanced Data Analytics API) 11
classical algorithms 91
generalized linear methods 123
linear regression 111
machine learning 108
regularized methods 115
SGD 126
classical machine learning 7. See also ML (machine learning)
Cloud Storage bucket, Google Cloud 375
columns
compliance and regulations 287
concatenate() function 349
conda 180
constant or quasi-constant columns 25
containers, machine learning (ML) pipelines
adapting code to run in 419
benefits of 418
overview of 418
CrossEntropyLoss() function 309
cross-validation
strategy 268
D
dart booster 170
Dask 35
data
exploring and fixing 260
data imputation 257
data pipeline 399
Data Portals 37
data representation 7
decision_function method 164
DecisionTreeRegressor class 153
decision trees 134
bagging and sampling 141
extremely randomized trees 149
tree growth 183
XGBoost imitating 188
tree-based methods 135
deep learning 326
selecting solution 448
dependent variable 304
depth-first 183
depthwise 188
dicing 24
dimension tables 33
DNNs (deep neural networks) 160
domain-specific knowledge 7
DWH (data warehouse) 32
E
EFB (Exclusive Feature Bundling) 184
embeddings 199
encoding 199
engineering features, more complex 252
entropy 136
errors in data 30
ERTs (extremely randomized trees) 149
eta parameter 172
Euclidean distance 256
evaluation metrics 109
event tables 32
Exclusive Feature Bundles 184
F
fastai
comparing fastai solution with Keras solution 311
feature engineering 114
feature_fraction parameter 182
feature processing methods 192
handling missing data with GBDTs 198
multivariate missing data imputation 195
target encoding 199
transforming numerical data 204
features parameter 209
FeatureUnion command 98
first_metric_only 187
fit class method 94
fit, predict/transform interface 94
benefits of deploying model to endpoint 390
float64 data type 444
G
gamma parameter 172
GANs (Generative Adversarial Networks) 39
GBDTs
handling missing data with 198
gbtree booster 281
GDPR (General Data Protection Regulation) 294
generalized linear methods 123
generative AI (artificial intelligence)
using to help prepare data 243
geocoding 257
Gini impurity 136
Goodfellow, Ian 40
Google Cloud
accessing for first time 372
creating Cloud Storage bucket 375
creating project 374
Google Dataset Search 37
GOSS (Gradient-Based One-Side Sampling) 184
GPT (generative pretrained transformer) 243
gradient boosting
deciding between XGBoost and LightGBM 231
deep learning vs. 457
GradientBoostingRegression class 157
GradientBoostingRegressor class 161
gradient descent 128
grow_policy parameter 188
H
HalvingGridSearchCV 223
HalvingRandomSearchCV 223
handle_unknown parameter 255
Harvard University Dataverse 37
help() command 94
categorical features 100
high-level API 303
HistGradientBoostingRegressor 188
human-AI collaboration 287
hyperparameters 96
I
ICE (individual conditional expectation) plots 207
idxmin() function 256
IID (independent and identically distributed) principle 16
information gain 136
init function 163
interactions 117
IQR (interquartile range) 102
irrelevant features 28
IterativeImputer 196
J
JSON (JavaScript Object Notation) 294
JSON Lines (JSONL) dataset 433
K
Kaggle Datasets 38
KDTree (k-dimensional tree) 258
Keras
comparing fastai solution with 311
comparing TabNet solution with 315
comparing with Lightning Flash solution 319
deep learning solution using 66
k-fold cross-validation 109
k-nearest neighbors (KNNs)
L
L1 norm 256
l1_ratio hyperparameter 459
L2 norm 256
l2_regularization hyperparameter 460
LabelEncoder 199
lambda_l1 parameter 181
lambda_l2 parameter 181
lambda regularizer 172
LassoCV 119
Lasso regression 116
leakage features 30
LightGBM
Scikit-learn 188
tree growth 183
XGBoost imitating 188
Lightning Flash
comparing with Keras solution 319
linear regression 111
linear_tree parameter 182
link function 120
listwise deletion 257
LLM (large language model) 242
local implementation, machine learning (ML) pipelines vs. 415
log_evaluation callback function 187
lossguide 188
low cardinality 22
LOWESS (LOcally WEighted Scatterplot Smoothing) 453
low-level framework 302
M
main function 432
make_scorer command 110
Manhattan distance 256
MAR (missing at random) 195
max_bins argument 189
max_delta_step hyperparameter 461
max_features hyperparameter 459
max_leaf_nodes parameter 163
MCAR (missing completely at random) 195
mean encoding 200
MediTab 11
merged_data.head() function 306
min_data_in_leaf parameter 182
min_delta 187
min_distances DataFrame 259
min_gain_to_split hyperparameter 462
min_impurity_decrease parameter 163
min_sample_leaf parameter 163
min_split_loss parameter 172
min_weight_fraction_leaf parameter 163
MissForest 197
missing data 29
ML (machine learning)
classical 108
ML (machine learning) pipelines
local implementation vs. 415
testing model trained in 425
types of 398
Vertex AI ML pipelines 400
MLOps (machine learning operations) 371
MLPs (multilayer perceptrons) 160
MNAR (missing not at random) 195
model building and optimizing 267
model deployment 359
public clouds and machine learning operations 371
Model object 95
model optimization 281
model training script 416
Modin 35
multi_class parameter 122
multivariate imputation 195
N
NewtonianGradientBoosting class 174
nonlinearity 7
normalization 352
normalized schema 32
np.hstack command 98
n_trials parameter 225
numerical data, transforming 204
numeric features 21
numeric standardization 107
NumPy 35
O
object data type 444
Occam’s razor principle 116
one-hot encoding 100
Open Data Monitor 37
OpenML 36
OrdinalEncoder 199
ordinal features 22
P
pandas DataFrame 23
parameter eval_set 179
PartialDependenceDisplay function 209
partial_dependence function 208
partial_fit class method 94
partial_fit method 130
pasting 142
patience threshold 315
PDP (partial dependence plot) 205
Pipeline command 98
pipelines
preparation 107
types of 398
Vertex AI ML pipelines 400
plot_model function 350
Polars 36
PolynomialFeatures function 117
predict class method 94
predicting, with random forests 146
Predictor object 95
prefetch() function 346
procs parameter 308
procs preprocessing steps 451
public clouds 371
PyTorch
comparing TabNet solution with Keras solution 315
with TabNet 303
R
random forests, predicting with 146
random patches 143
random_state setting 96
random subspaces 143
RAPIDS 35
rare categories 30
Ray 35
read_csv command 245
regex (regular expressions) 247
regularization 352
regularized methods 115
relevant information extraction 7
representation learning 7
reverse geocoding 257
RidgeCV 119
Ridge regression 117
rows
S
sample_posterior parameter 196
sampling 141
Scikit-learn
common features of packages 93
pipelines 97
script_path training script 423
sdv package 40
SequentialFeatureSelector function 216
service accounts 400
creating for ML pipelines 401
creating keys for ML pipelines 403
granting access to Compute Engine default service accounts 404
uploading keys 408
snake_case naming convention 344
Spark 35
sparse_threshold parameter 140
StringLookup layer 349
structured data 5
subsample rows 163
subsampling 17
successive halving 223
summary_listing dataset 251
SVD (singular value decomposition) 57
T
table 4
TableGPT 11
TABLET benchmark 11
TabNet
comparing with Keras solution 315
TabularClassificationData object 318
tabular data 3
defined 4
importance of 6
tabular data library 303
TabularDataLoaders definition 308
TabularDataset object 423
TargetEncoder 201
TensorFlow 301
testing, model trained in pipeline 425
tf.data.Dataset objects 346
training/inference pipeline 398
transfer learning 10
transform class method 94
Transformer object 95
transforming numerical data 204
conclusions 82
explainability 77
feature importance 80
tree-based methods 135
TreeExplainer function 289
tree_method parameter 178
trees
extremely randomized 149
LightGBM 183
TreeSHAP algorithm 287
t-SNE (t-distributed stochastic neighbor embedding) 57
TweedieRegressor 124
U
UCI Machine Learning Repository 36
UMAP (uniform manifold approximation and projection) 58
unknown_value parameter 271
V
Vaex 35
validation_fraction parameter 166
verbosity parameter 182
Vertex AI
creating managed datasets 412
Vertex AI ML pipelines 400
Vertex AI SDK, setting up 387
VIF (variance inflation factor) 28
W
exercising 370
overview of 361
show-prediction.html page 369
winsorize function 262
X
XGBFIR (XGBoost Feature Interactions Reshaped) 209
XGBoost
building and optimizing model 267
building baseline model 273
building first tentative model 279
engineering more complex features 252
example using 242
exploring and fixing data 260
exploring target 265
finalizing data 259
LightGBM imitating 188
optimizing model 281
preparing and exploring data 243
preparing cross-validation strategy 268
preparing data 244
preparing pipeline 270
training final model 285
using generative AI to help prepare data 243
XGBRegression 170
XGBRegressor class 168
XGBRFClassifier 171
XGBRFRegressor 171