Part I: Conducting Exploratory Data Analysis (EDA) for the Kaggle Home Credit Default Competition

Follow along as the team competes to win the Kaggle Home Credit Default Competition — this is the first of a series of posts on our modeling process!

For applicants with sparse credit history, obtaining a loan can be frustrating

In this first post, we are going to conduct some preliminary exploratory data analysis (EDA) on the datasets provided by Home Credit for their credit default risk Kaggle competition (with a 1st place prize of $35,000!).

Home Credit is a loan provider for people with little or no credit history. They use a variety of alternative data sources such as transactional or telco information to evaluate a client’s repayment abilities.

We’re going to break down our analysis into two posts:

  1. Exploratory data analysis (EDA), feature pre-processing, and initial modeling with LightGBM and Random Forest (this post!)
  2. Creating hand-engineered features from a master dataset of all available Kaggle datasets.

Whether you’re an experienced Kaggler or someone who is just starting out in Kaggle competitions, this series is for you!

All the code for this post can be found here, and model results, figures, and notes can be found in this public project.

Setting up the environment

First, let’s set up our experiment in Comet.ml and grab our API key.

At Comet.ml, we help data scientists and machine learning engineers automatically track their datasets, code, experiments, and results, creating efficiency, visibility, and reproducibility.

# Import and create an experiment with your API key
from comet_ml import Experiment
import pandas as pd

experiment = Experiment(api_key="YOUR API KEY", project_name="home-credit")

# Read in the data and log a hash of the dataset
df = pd.read_csv('./application_train.csv', sep=',')
experiment.log_dataset_hash(df)

Getting an initial view of the data

Let’s start by looking at features in the application_train.csv file. This file contains 121 features and 307,511 examples. Before we begin any sort of modeling, it’s important to get a sense of the distribution of the data and the correlations between individual features.

First, let’s check the distribution of our target variable, and log that visualization to our Comet project.

import matplotlib.pyplot as plt

feature = "TARGET"

ax = df[feature].value_counts().plot(kind='bar', title=feature)
experiment.log_figure(figure_name=feature, figure=plt)

We can see that our figure has been uploaded to the Graphics page of our experiment on Comet.ml. Having the figure ready at hand will be useful for reference as we progress through the competition, and for collaboration!

It’s clear that our target distribution is highly imbalanced, with the large majority of clients repaying their loans on time. This is great for Home Credit, but it will definitely inform how we evaluate our classifier: when the target variable’s distribution is imbalanced, accuracy is not a good metric for model performance.
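To see why accuracy misleads here, consider a classifier that always predicts the majority class: it scores high accuracy while carrying no signal, whereas ROC AUC correctly reports chance-level performance. A minimal illustration on synthetic labels (not the competition data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic imbalanced labels: ~90% of clients repay (label 0)
rng = np.random.default_rng(0)
y_true = (rng.random(1000) < 0.1).astype(int)

# A "classifier" that always predicts the majority class
always_zero = np.zeros_like(y_true)
print(accuracy_score(y_true, always_zero))  # high, but meaningless

# AUC operates on scores; constant scores yield chance-level 0.5
print(roc_auc_score(y_true, np.zeros(len(y_true), dtype=float)))
```

This is why the competition itself is scored on AUC rather than accuracy.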

Next let’s check for the number of categorical and numerical features to see the split.

Categorical: 16
Numerical: 105 (these can be further divided into integer and float types)
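One way to derive this split is pandas’ `select_dtypes`. A minimal sketch, using a tiny synthetic frame as a stand-in for application_train.csv (the column names mirror the real file; the values are made up):

```python
import pandas as pd

# Tiny stand-in for application_train.csv
df = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans"],
    "AMT_INCOME_TOTAL": [202500.0, 270000.0],
    "FLAG_EMAIL": [0, 1],  # numeric dtype, but encodes a category
})

categorical = df.select_dtypes(include="object").columns
numerical = df.select_dtypes(include="number").columns
print(f"Categorical: {len(categorical)}")
print(f"Numerical: {len(numerical)}")
```

Note that dtype alone is not the whole story: as discussed below, many numeric columns are really flags, which is why a manual pass over the features is still needed.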

We seem to have a larger presence of numerical features in our dataset. These numerical features can be divided into integer type and float type. On closer examination, we see that a majority of our numerical features, such as FLAG_DOCUMENT, FLAG_EMAIL, and REG_CITY_NOT_LIVE_CITY, actually encode categorical information, so we will include them in our categorical feature set.

We’ll take a look at these three types of features in order: (1) float valued, (2) categorical, and (3) integer valued.

Let’s first take a look at the correlation matrix for our float valued features and our target.

Correlation matrix for float valued features
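A heatmap like this can be produced with pandas’ `corr` and matplotlib. A sketch on synthetic stand-in data (in the real notebook, the frame would hold the float-valued columns plus TARGET):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # headless backend for scripted runs
import matplotlib.pyplot as plt

# Synthetic stand-in for the float-valued features + TARGET
rng = np.random.default_rng(0)
float_df = pd.DataFrame(rng.normal(size=(100, 5)),
                        columns=[f"F{i}" for i in range(5)])
float_df["TARGET"] = rng.integers(0, 2, size=100)

# Pairwise Pearson correlations, rendered as a matrix plot
corr = float_df.corr()
plt.matshow(corr)
plt.colorbar()
# experiment.log_figure(figure_name="float_corr", figure=plt)
print(corr.shape)
```

The commented-out `log_figure` call is how the figure would be pushed to the Comet experiment, as with the target distribution plot earlier.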

This figure illustrates that there seems to be little correlation between our target label (feature no. 65) and our float valued features. However, we do see that features 11 to 53 are highly correlated, and on further inspection, we find that these are all features related to the client’s home (interesting 🧐). We can make a note of this in the Notes tab of our experiment page.

# highly correlated float valued features (features 11 - 53)
# Conduct PCA on these features to reduce down to 10

These features are good candidates for dimensionality reduction, since they add redundant information to our model. We’ll run a Principal Component Analysis (PCA) transformation over these features, and use the top 10 principal components in our classifier. These 10 components are able to explain about 77% of the variance in our dataset.
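The PCA step might look like the following sketch. The synthetic array stands in for the correlated home-related columns (the real features should be scaled first, as done here); the 77% figure above comes from the actual data, not this toy example:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: 40 highly correlated "home" columns
rng = np.random.default_rng(0)
base = rng.normal(size=(500, 3))
home_cols = base @ rng.normal(size=(3, 40)) + 0.01 * rng.normal(size=(500, 40))

# Scale, then keep the top 10 principal components
scaled = StandardScaler().fit_transform(home_cols)
pca = PCA(n_components=10)
top10 = pca.fit_transform(scaled)

print(top10.shape)                          # (500, 10)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```

The `explained_variance_ratio_` sum is where the "explained variance" number for the chosen components comes from.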

Next we’ll take a look at categorical features. However, in order to use them in our model, we will have to One Hot Encode them into binary vectors (basically create dummy variables). After encoding these variables, we can run Random Forest and LightGBM models with similar parameters over the data to extract an estimate of feature importance.
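The encoding itself is a one-liner with pandas’ `get_dummies`. A sketch on two stand-in categorical columns:

```python
import pandas as pd

# Stand-in categorical columns from application_train.csv
cat_df = pd.DataFrame({
    "NAME_CONTRACT_TYPE": ["Cash loans", "Revolving loans", "Cash loans"],
    "CODE_GENDER": ["M", "F", "F"],
})

# One-hot encode each category into a binary dummy column
encoded = pd.get_dummies(cat_df)
print(list(encoded.columns))
```

Each original column expands into one binary column per category level, which is the representation the tree models below consume.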

Each model consists of 100 trees with 31 leaves, and produces the following feature rankings.
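A Random Forest sketch of this importance extraction on synthetic data is below; LightGBM’s `LGBMClassifier(n_estimators=100, num_leaves=31)` exposes the same `feature_importances_` attribute and slots in identically (it is omitted here to keep the sketch dependency-light):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic encoded features + target as a stand-in
rng = np.random.default_rng(0)
X = rng.random((500, 8))
y = (X[:, 0] + 0.1 * rng.random(500) > 0.5).astype(int)  # feature 0 carries the signal

# 100 trees, then rank features by impurity-based importance
rf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranking = np.argsort(rf.feature_importances_)[::-1]
print(ranking[0])  # index of the most important feature
```

Comparing the two models’ rankings, as done here, is a quick sanity check that the top features are not artifacts of one algorithm.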

Some takeaways: LightGBM and Random Forest both rank the type of income, type of education, family status, and car ownership in the top 15 categorical features.

LightGBM Feature Importance Rankings
Random Forest Feature Importance Rankings

From these categorical features, I’ve included some of the more informative plots of features that showed up in both the LightGBM and Random Forest feature rankings below. The full list can be found here.

  • Gender distribution
  • Loan type distribution
  • Family status distribution
  • Occupation distribution
Gender Distribution
Loan Type Distribution
Family Status Distribution
Occupation Distribution

Finally, let’s take a look at our integer type features. After filtering out integer features that represent categories, we are left with only 7 integer valued features.


These features are on different scales, so we will stick with LightGBM and Random Forest on this data (tree-based models are insensitive to feature scaling) and plot the most important features.

LightGBM Integer Features Importance Ranking
Random Forest Integer Features Importance Ranking

Both algorithms rank the integer features in a similar way. We will now combine these filtered features into a new dataset, and train three models: (1) Logistic Regression, (2) Random Forest, and (3) LightGBM on this data. These models will serve as our baselines.
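A baseline comparison like this can be sketched with cross-validated AUC. Synthetic data stands in for the combined, filtered dataset; `lgb.LGBMClassifier` would slot into the same loop alongside the two scikit-learn models:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the combined, filtered dataset
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=600) > 0).astype(int)

baselines = {
    "logreg": LogisticRegression(max_iter=1000),
    "rf": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score each baseline with 3-fold cross-validated ROC AUC
results = {}
for name, model in baselines.items():
    results[name] = cross_val_score(model, X, y, cv=3, scoring="roc_auc").mean()
    print(f"{name}: AUC = {results[name]:.3f}")
```

AUC is used throughout because of the class imbalance noted earlier; it is also the competition’s official metric.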

We’ve logged all three models in Comet.ml and also used the Hyperparameter Optimizer feature. Our initial run of LightGBM results in an AUC score of 0.745, which is significantly higher than both Logistic Regression and Random Forest.

Try out this public project in Comet.ml!

Not bad for a baseline model, but we can definitely do better! Next post we’ll explore some automatic feature engineering using a Neural Network.

👉🏼 Subscribe to our blog to stay tuned for our next two posts for our Kaggle Home Credit Default Risk competition submission! 👈🏼

All the code for this post can be found here, and model results, figures, and notes can be found in this public project.

Dhruv Nair is a Data Scientist on the Comet.ml team. 🧠🧠🧠 Before joining Comet.ml, he worked as a Research Engineer on the Physical Analytics team at the IBM T.J. Watson Lab.

About Comet.ml — Comet.ml is doing for ML what GitHub did for code. Our lightweight SDK enables data science teams to automatically track their datasets, code changes, and experimentation history. This way, data scientists can easily reproduce their models and collaborate on model iteration within their team!

It’s easy to get started

And it's free. Two things everyone loves.