Index

Symbols

.coef method 170

.describe() method 49

.explore() method 50

.fit method 95, 96, 97

.head(n) method 46

.isna method 48

.predict method 95, 97

.save() method 344

.score method 95

.transform method 97

.unique() method 327

A

Advanced Data Analytics API (ChatGPT4) 11

Airbnb example, exploring target 265

algorithms

choosing 131

linear regression 111

logistic regression 111

alpha hyperparameter 459

alpha regularizer 172

argparser list 420

AUC (area under the curve) 42

Auto MPG example dataset 44–45

averaging 142

B

backward selection 216–218

bagging 141, 143, 182

BaggingClassifier 144

BaggingRegressor 144

baseline model, building 273

batches 126

Bayesian methods 225–229

Bernoulli distribution 16

best first 183

best_iteration attribute 180, 187

best practices, deep learning

contrasting custom layer and Keras preprocessing layer approaches 341

defining deep learning model 341–350

examining code for model definition using Keras preprocessing layers 344–350

processing dataset 331–340

bias detection and mitigation 287

boosting 142

bootstrap hyperparameter 460

bootstrapping 17

Boruta 214–216

boxplot 102

C

categorical columns 4

categorical one-hot encoding 107, 140

categoricals 22

category_encoders package 271

CategoryEncoding layer 349

ccp_alpha parameter 163

ChatGPT 243, 244

ChatGPT4 (Advanced Data Analytics API) 11

classical algorithms 91

for tabular data 99–108

generalized linear methods 123

linear regression 111

logistic regression 111, 120

machine learning 108

regularized methods 115

Scikit-learn 92–98

SGD 126

classical machine learning 7. See also ML (machine learning)

hyperparameters for models 459–462, 472

Cloud Storage bucket, Google Cloud 375

collinear features 26–28

colsample_bytree parameter 172, 173, 182, 237

columns

characteristics of 15–24

representing 23–24

compliance and regulations 287

concatenate() function 349

conda 180

constant or quasi-constant columns 25

containers, machine learning (ML) pipelines

adapting code to run in 419

benefits of 418

overview of 418

updating training code to work in 420–422

CrossEntropyLoss() function 309

cross-validation 17, 109, 110, 112, 352

strategy 268

D

dart booster 170

Dask 35

data

exploring and fixing 260

preparing and exploring 243244

data imputation 257

data pipeline 399

Data Portals 37

data representation 7

decision_function method 164

DecisionTreeRegressor class 153

decision trees 134

bagging and sampling 141

extremely randomized trees 149

gradient boosting 150–161

accelerating with histogram splitting 175–178

applying early stopping to 178–180

applying early stopping to avoid overfitting 164–167

explaining effectiveness of 160–161

extrapolating with 155–159

how it works 152–155, 173–175

in Scikit-learn 161–167

key parameters 169–173

XGBoost 167–180

LightGBM 180–189

early stopping 186–187

Scikit-learn 188–189

speed 184–186

tree growth 183

XGBoost imitating 188

tree-based methods 135

deep learning 326

best practices for 327–350

blending with gradient boosting 442, 455–457

efficacy of 82–83

exercising model 353–357

gradient boosting vs. 443–448, 457

predicting Airbnb prices in New York City 62–76

PyTorch with TabNet 312–316

research on 84–87

selecting solution 448

solution to Tokyo Airbnb problem 450–452

training model 350–353

with tabular data 8–10, 299, 300–311, 316–324

dependent variable 304

depth-first 183

depthwise 188

df_to_dataset function 345, 346, 347

dicing 24

dimension tables 33

DMatrix format 173, 288

DNNs (deep neural networks) 160

domain-specific knowledge 7

duplicated features 26–28

DWH (data warehouse) 32

E

early stopping 130, 178–180

EDA (exploratory data analysis) 42–60, 99, 265

examining labels, values, distributions 46–54

exploring bivariate and multivariate relationships 55–60

loading Auto MPG example dataset 44–45

EFB (Exclusive Feature Bundling) 184

embeddings 199

encoding 199

engineering features, more complex 252

entropy 136

errors in data 30

ERTs (extremely randomized trees) 149

eta parameter 172

Euclidean distance 256

eval() function 335, 336, 338, 339

evaluation metrics 109

event tables 32

Exclusive Feature Bundles 184

external data 31–42

internet data 36–38

synthetic data 39–42

using pandas to access data stores 32–36

F

fastai 304311

comparing fastai solution with Keras solution 311

comparing XGBoost and fastai solutions 452–454

reviewing key code aspects of fastai solution 304–311

feature engineering 114

feature_fraction parameter 182

feature processing 211–218

forward and backward selection 216–218

shadow features and Boruta 214–216

stability selection for linear models 212–214

feature processing methods 192

handling missing data with GBDTs 198

multivariate missing data imputation 195

optimizing hyperparameters 219–230

target encoding 199

transforming numerical data 204

features parameter 209

FeatureUnion command 98

first_metric_only 187

fit class method 94

fit_one_cycle procedure 451, 452

fit, predict/transform interface 94

Flask, Vertex AI deployment with 386391

benefits of deploying model to endpoint 390

updating Flask server module to call endpoint 387–390

Flask server module 362–365

float64 data type 444

forward selection 216–218

G

gamma parameter 172

GANs (Generative Adversarial Networks) 39

GBDTs (gradient-boosted decision trees) 150, 193

handling missing data with 198

speeding up by GBDTs and compiling 237–240

gblinear booster 170, 281

gbtree booster 281

GDPR (General Data Protection Regulation) 294

generalized linear methods 123

generative AI (artificial intelligence)

using to help create ML pipelines 427–440

using to help prepare data 243

tabular data and 10–12

geocoding 257

get_category_encoding_layer() function 346, 347, 348

get_model() function 301, 302

get_normalization_layer() function 346, 348

getOption() function 367, 368

get_pipeline_config function 429, 431

Gini impurity 136

Goodfellow, Ian 40

Google Cloud

accessing for first time 372

creating Cloud Storage bucket 375

creating project 374

Google Dataset Search 37

GOSS (Gradient-Based One-Side Sampling) 184

GPT (generative pretrained transformer) 243

GPUs (graphics processing units), for machine learning 470–472

gradient boosting 150–161

avoiding overfitting with early stopping 164–167

blending with deep learning 442, 455–457

comparing deep learning solutions with 68–75

deciding between XGBoost and LightGBM 231

deep learning and 443–448

deep learning vs. 457

explaining effectiveness of 160–161

exploring tree structures 232–236

extrapolating with 155–159

how it works 152–155

in Scikit-learn 161–167

speeding up by GBDTs and compiling 237–240

GradientBoosting class 152, 153, 154, 157

GradientBoostingClassifier class 161, 163

GradientBoostingRegression class 157

GradientBoostingRegressor class 161

gradient descent 128

grow_policy parameter 188

H

HalvingGridSearchCV 223

HalvingRandomSearchCV 223

handle_unknown parameter 255

Harvard University Dataverse 37

help() command 94

high cardinality 22, 200

categorical features 100

high-level API 303

HistGradientBoostingClassifier 188, 189

HistGradientBoostingRegressor 188

hist method 101, 178

histogram splitting 175–178

human-AI collaboration 287

hyperparameters 96

for classical machine learning models 459–462, 472

optimizing hyperparameters 219–230

I

ICE (individual conditional expectation) plots 207

idxmin() function 256

IID (independent and identically distributed) principle 16

information gain 136

init function 163

int64 data type 443, 444

interactions 117

internal data 31–42

internet data 36–38

synthetic data 39–42

using pandas to access data stores 32–36

IQR (interquartile range) 102

irrelevant features 28

IterativeImputer 196

J

JSON (JavaScript Object Notation) 294

JSON Lines (JSONL) dataset 433

K

Kaggle Datasets 38

KDTree (k-dimensional tree) 258

Keras

comparing fastai solution with 311

comparing TabNet solution with 315

comparing with Lightning Flash solution 319

deep learning solution using 66

k-fold cross-validation 109

k-nearest neighbors (KNNs)

algorithm 464–468

GPUs for machine learning 470–472

SVMs (support vector machines) 468–470

Kuala Lumpur real estate dataset 327–331

L

L1 norm 256

l1_ratio hyperparameter 459

L1 regularization 116, 175

L2 norm 256

L2 regularization 116, 175

l2_regularization hyperparameter 460

LabelEncoder 199

lambda_l1 parameter 181

lambda_l2 parameter 181

lambda regularizer 172

LassoCV 119

Lasso regression 116

leakage features 30

learning_rate parameter 172, 173, 181

LightGBM (Light Gradient Boosted Machines) 135, 180–189, 231

early stopping 186–187

Scikit-learn 188

speed 184–186

tree growth 183

XGBoost imitating 188

Lightning Flash 304, 316

comparing with Keras solution 319

key code aspects of solution 316–320

linear models, stability selection 212–214

linear regression 111

linear_tree parameter 182

link function 120

listwise deletion 257

LLM (large language model) 242

local implementation, machine learning (ML) pipelines vs. 415

log_evaluation callback function 187

logistic regression 111, 120

lossguide 188

low cardinality 22

LOWESS (LOcally WEighted Scatterplot Smoothing) 453

low-level framework 302

lr_find 451–452

M

MAE (mean absolute error) 196, 275, 448

main function 432

make_scorer command 110

Manhattan distance 256

MAR (missing at random) 195

max_bins argument 189

max_delta_step hyperparameter 461

max_depth parameter 141, 153, 163, 172, 173, 181, 188

max_features hyperparameter 459

max_iter parameter 123, 189

max_leaf_nodes parameter 163

MCAR (missing completely at random) 195

mean encoding 200

MediTab 11

merged_data.head() function 306

min_child_weight parameter 171, 173

min_data_in_leaf parameter 182

min_delta 187

min_distances DataFrame 259

min_gain_to_split hyperparameter 462

min_impurity_decrease parameter 163

min_sample_leaf parameter 163

min_samples_leaf parameter 153, 171

min_samples_split parameter 153, 163

min_split_loss parameter 172

min_weight_fraction_leaf parameter 163

MissForest 197

missing data 29

ML (machine learning) 7, 61

classical 108

deep learning vs. 68

efficacy of 82–83

GPUs for 470–472

transparency 77–82

ML (machine learning) pipelines 397, 399

containers 418422

defining 415–427

local implementation vs. 415

overview of 398–400

pipeline script 422–425

preparation steps 401–415

testing model trained in 425

types of 398

using generative AI to help create 427–440

Vertex AI ML pipelines 400

MLOps (machine learning operations) 371

MLPs (multilayer perceptrons) 160

MNAR (missing not at random) 195

model building and optimizing 267

model deployment 359

Gemini for Google Cloud 391–395

Google Cloud 372–376

in Vertex AI 376–386

public clouds and machine learning operations 371

Vertex AI deployment with Flask 386–391

web deployment 360–371

Model object 95

model optimization 281

model training script 416

Modin 35

multi_class parameter 122

multivariate imputation 195

N

n_estimators parameter 169, 181

NewtonianGradientBoosting class 174

nonlinearity 7

normalization 352

normalized schema 32

np.hstack command 98

n_trials parameter 225

number_of_reviews 99, 246

numerical data, transforming 204

numeric features 21

numeric pass-through 107, 140

numeric standardization 107

NumPy 35

O

object data type 444

objective parameter 171, 173

Occam’s razor principle 116

one-hot encoding 100

Open Data Monitor 37

OpenML 36

OrdinalEncoder 199

ordinal features 22

overfitting, avoiding with early stopping 164–167

P

pandas, accessing data stores 32–36

pandas DataFrame 23

parameter eval_set 179

PartialDependenceDisplay function 209

partial_dependence function 208

partial_fit class method 94

partial_fit method 130

pasting 142

patience threshold 315

PCA (principal component analysis) 57, 465

PDP (partial dependence plot) 205

pip 180, 200, 271, 281

Pipeline command 98

pipelines

preparation 107

types of 398

Vertex AI ML pipelines 400

pipeline script 422–425

plot_model function 350

plot_tree function 234, 236

Polars 36

PolynomialFeatures function 117

predict class method 94

predicting, with random forests 146

Predictor object 95

prefetch() function 346

procs parameter 308

procs preprocessing steps 451

public clouds 371

PyTorch 312–316

code aspects of TabNet solution 312–315

comparing TabNet solution with Keras solution 315

with fastai 303–311

with Lightning Flash 316–320

with TabNet 303

R

random forests, predicting with 146

random patches 143

random_state setting 96

random subspaces 143

RAPIDS 35

rare categories 30

Ray 35

read_csv command 245

reg_alpha hyperparameter 461, 462

regex (regular expressions) 247

reg_lambda hyperparameter 461, 462

regularization 352

regularized methods 115

relevant information extraction 7

representation learning 7

reverse geocoding 257

RidgeCV 119

Ridge regression 117

RMSE (root mean squared error) 108, 178, 448

ROC-AUC (Receiver Operating Characteristic Area Under the Curve) 42, 106

rows

characteristics of 15–24

representing 23–24

S

sample_posterior parameter 196

sampling 141

scale_pos_weight hyperparameter 461, 462

Scikit-learn 92–98, 135, 161–167, 188–189

common features of packages 93

common interface 94–97

pipelines 97

script_path training script 423

sdv package 40

SequentialFeatureSelector function 216

service accounts 400

creating for ML pipelines 401

creating keys for ML pipelines 403

granting access to Compute Engine default service accounts 404

uploading keys 408

SGD (stochastic gradient descent) 126, 128

shadow features 214216

SHAP (SHapley Additive exPlanations) 287–296

SHAP values 288, 292, 294

snake_case naming convention 344

Spark 35

sparse_threshold parameter 140

speed, LightGBM 184–186

stability selection 212–214

stopping, early 164–167, 186–187

StringLookup layer 349

structured data 5

subsample parameter 172, 173, 237

subsample rows 163

subsampling 17

successive halving 223

summary_listing dataset 251

summary() statement 310, 315

SVD (singular value decomposition) 57

SVMs (support vector machines) 468470

GPUs for machine learning 470472

synthetic data 3942

T

table 4

TableGPT 11

TABLET benchmark 11

TabNet 312–316

code aspects of solution 312–315

comparing with Keras solution 315

TabularClassificationData object 318

tabular data 3

classical algorithms for 99108

deep learning and 8–10, 300–311, 316–320

defined 4

generative AI and 1012

importance of 6

machine learning vs. deep learning 68

tabular data library 303

TabularDataLoaders definition 308

TabularDataset object 423

tabular datasets 4, 14

exploratory data analysis 42–60

external and internal data 31–42

pathologies and remedies 24–31

rows and columns characteristics 15–24

TabularPandas 450–451

target analysis 101, 106

TargetEncoder 201

target encoding 199, 200, 292

TensorFlow 301

testing, model trained in pipeline 425

tf.data.Dataset objects 346

Tokyo Airbnb problem, deep learning solution to 450–452

training/inference pipeline 398

transfer learning 10

transform class method 94

Transformer object 95

transforming numerical data 204

transparency 77–82

conclusions 82

explainability 77

feature importance 80

tree-based methods 135

TreeExplainer function 289

tree_method parameter 178

trees

extremely randomized 149

LightGBM 183

TreeSHAP algorithm 287

tree structures 232–236

t-SNE (t-distributed stochastic neighbor embedding) 57

TweedieRegressor 124

U

UCI Machine Learning Repository 36

UMAP (uniform manifold approximation and projection) 58

unknown_value parameter 271

V

VAEs (Variational Autoencoders) 39, 40

Vaex 35

validation_fraction parameter 166

verbosity parameter 182

Vertex AI

creating managed datasets 412

deploying model in 376386

deployment with Flask 387391

tuning foundation model in 438–440

Vertex AI ML pipelines 400

Vertex AI SDK, setting up 387

VIF (variance inflation factor) 28

W

web deployment 360–371

exercising 370

Flask server module 362–365

home.html page 365–369

overview of 361

show-prediction.html page 369

winsorize function 262

X

XGBClassifier 168, 170

XGBFIR (XGBoost Feature Interactions Reshaped) 209

XGBoost (eXtreme Gradient Boosting) 135, 167–180, 202, 231

accelerating with histogram splitting 175–178

applying early stopping to 178–180

building and optimizing model 267

building baseline model 273

building first tentative model 279

comparing fastai and XGBoost solutions 452–454

engineering more complex features 252

example using 242

exploring and fixing data 260

exploring target 265

finalizing data 259

how it works 173175

key parameters 169173

LightGBM imitating 188

optimizing model 281

preparing and exploring data 243

preparing cross-validation strategy 268

preparing data 244

preparing pipeline 270

SHAP 287–295

training final model 285

using generative AI to help prepare data 243

XGBRegression 170

XGBRegressor class 168

XGBRFClassifier 171

XGBRFRegressor 171
