Part II: Manual Feature Engineering techniques for the Kaggle Home Credit Default Competition

Our second post in this series, where the team competes to win the Kaggle Home Credit Default Competition!

This post is the second in our series as we work through our submission for the Home Credit Default Risk Kaggle competition (with a 1st place prize of $35,000!). Don’t miss our first post here, where we conducted some exploratory data analysis on application_train.csv and created some baseline models.

All the code for this post can be found here, and the model results, figures, and notes can be found in this public project.

The Loan Process in a diagram — via Moody’s Analytics and Finagraph

In this post, we will cover:

  • Creating a master dataset for our models by incorporating features found in the other csv files provided for the competition (7 files in total). In our first post, we only used features from application_train.csv.
  • Method A: Manual feature engineering after consulting with subject matter experts (SMEs).

In our next post, we will walk through:

  • Method B: Automatic feature engineering using a regularized Autoencoder
  • Comparing the performance of our LightGBM models that use the two different sets of features (Method A: manual vs. Method B: automated)

Creating a master dataset

In our first post, we extracted a subset of candidate features from a single csv file, application_train.csv, and trained some baseline models with this data (our LightGBM model had an AUC score of 0.745).

However, Home Credit provides six more datasets that we haven’t even touched so far, which include information from other financial institutions (bureau.csv), the applicant’s credit card account (credit_card_balance.csv), and even their previous loan applications (previous_application.csv). As our next step, we will incorporate these six datasets and aggregate them into one master dataset. For this post, we are only going to use the first 100,000 rows from each dataset.
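The aggregation step can be sketched with pandas. This is a minimal illustration, not our exact pipeline: the toy frames below stand in for the real CSVs (which we would load with pd.read_csv(..., nrows=100_000)), and the aggregated column names are our own; only SK_ID_CURR and AMT_CREDIT_SUM_DEBT come from the competition files.

```python
import pandas as pd

# Toy stand-ins for application_train.csv and bureau.csv
app = pd.DataFrame({
    "SK_ID_CURR": [1, 2, 3],
    "AMT_INCOME_TOTAL": [50_000, 80_000, 30_000],
})
bureau = pd.DataFrame({
    "SK_ID_CURR": [1, 1, 2],                    # many bureau rows per applicant
    "AMT_CREDIT_SUM_DEBT": [10_000, 5_000, 20_000],
})

# Collapse the one-to-many bureau table down to one row per applicant
bureau_agg = (
    bureau.groupby("SK_ID_CURR")
          .agg(bureau_loans=("AMT_CREDIT_SUM_DEBT", "count"),
               bureau_debt_mean=("AMT_CREDIT_SUM_DEBT", "mean"))
          .reset_index()
)

# Left-join onto the main table; applicants with no bureau history get NaNs
master = app.merge(bureau_agg, on="SK_ID_CURR", how="left")
```

Repeating this groupby-then-merge pattern for each of the six secondary files is what produces the single master dataset.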

Diagram of dataset (“table”) relationships provided by Kaggle

There are drawbacks to having a large number of features, i.e. the curse of dimensionality.

Our final training dataset consists of 281 features, most of which are categorical.

Feature Extraction Method A: Manual feature engineering

Tapping into domain expertise

Since Cecelia Shao and I don’t have any domain expertise in lending, we thought it would be both helpful and interesting to connect with mortgage professionals with experience reviewing applications.

Thanks to Clement Kao, a product manager at Blend (a mortgage technology provider), we were able to connect with two mortgage professionals at Preferred Financial Group, Lead Processor Cody Dadiw and Senior Loan Officer Philippa Stewart-Donnelly!

Preferred Financial Group is an independent financial services brokerage in San Ramon, CA. They provide mortgage loan assistance, insurance coverage, and real estate services. On the lending side, they specialize in true no-cost loans, where there are no hidden costs or fees rolled into the loan and the target is always what is in the client’s best interest. They have access to 30+ wholesale lenders (Caliber Home Loans, Quicken, Flagstar Bank, New Penn Financial, UWM, and more) and also broker a wide range of loan products, not only in California but in many other states.

Manual feature selection with our SMEs (subject matter experts)

We asked Cody and Philippa to look through the different feature options in the column_description.csv (across all seven datasets) and select the features they thought would be indicative of an applicant’s default risk.

They came back to us with 42 selected columns (in yellow).

Here’s a snippet of the datasets’ columns as Cody and Philippa reviewed them. Yellow indicates their selections, while blue marks features that our initial LightGBM model indicated as important. Green cells denote features that both the SMEs and the model selected.

Some of the most important features our SMEs selected were:

  • The applicant’s debt-to-income (DTI) ratio, where you stack up all of their liabilities against their monthly income. We approximated this with our credit_to_income feature.
  • The applicant’s loan-to-value (LTV) ratio, which compares the loan amount to the property’s appraised value. We approximated this with our credit_to_goods_ratio feature.
  • Number of other inquiries into their credit history
  • Previous loan applications and the status of those applications
  • Previous payments, defaults or overdue notices
  • The applicant’s employment history
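The two ratio features above are simple column arithmetic. A hedged sketch, using the competition’s column names (AMT_CREDIT, AMT_INCOME_TOTAL, AMT_GOODS_PRICE) on toy data:

```python
import pandas as pd

df = pd.DataFrame({
    "AMT_CREDIT":       [400_000, 150_000],   # loan amount requested
    "AMT_INCOME_TOTAL": [100_000,  60_000],   # applicant's annual income
    "AMT_GOODS_PRICE":  [450_000, 140_000],   # price of the goods being financed
})

# DTI proxy: loan amount relative to income
df["credit_to_income"] = df["AMT_CREDIT"] / df["AMT_INCOME_TOTAL"]
# LTV proxy: loan amount relative to the value of the goods
df["credit_to_goods_ratio"] = df["AMT_CREDIT"] / df["AMT_GOODS_PRICE"]
```

Note that both are only proxies: application_train.csv gives us the loan and goods amounts, not the applicant’s full liabilities or a property appraisal.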

Question: Why all the emphasis on feature engineering? Where’s the modeling?!

Answer: For most machine learning problems, well-engineered features and simple models tend to produce the best results. Feature engineering is a way to use domain knowledge to create predictive indicators that better represent the underlying problem for your model.

Another great resource for feature engineering is William Koehrsen’s post on Kaggle.

It’s interesting to see different approaches to determining an applicant’s risk level, as the industry shifts in what kinds of attributes (features) it focuses on.

See the code for the manual feature engineering here. The full model results, figures, and notes can be found in this public project.

LightGBM Model Results with manually engineered features

Our LightGBM model results:

Our manual features give us a slight boost in the AUC score on our validation data, to around 0.7519. In our previous model, with just one dataset and no manual features, our AUC was 0.745. In fact, multiple versions of the LightGBM model place our handcrafted features within the top 15 positions of the feature importance ranking.

Feature Importance Ranking for LightGBM model with 66 trees and 41 leaves per tree

Keep in mind that our model has only used a portion of the available data due to compute constraints. As a result, the final dataset has a large number of columns with mostly NaN values. This is because multiple IDs in the other datasets correspond to a single ID in the application_train.csv file. By only reading in the first 100,000 rows of the other datasets, we are leaving out data related to many of the applicant IDs in application_train.csv.

Multiple IDs in the other datasets correspond to a single ID in the main dataset

In our next post, we will try to use all available data to generate our features, and check how our model predictions stack up on the Kaggle leaderboard.


Beyond the model — reflections from our SMEs

It was great to have the opportunity to work with experts from the lending space to create handcrafted features, but we wanted to take the conversation a step further than just which features to include in our model.

We asked Cody and Philippa what they thought about machine learning, and to reflect on how human experts and algorithms can potentially work together.

What did our mortgage professionals think about machine learning?

When we asked Cody and Philippa what they think about machine learning and its impact on their industry, they were undeniably excited about how AI and machine learning were sparking true innovation in their industry.

They also raised an important point about compliance. For financial services companies in the US or Europe, it is actually illegal to make evaluations based on certain applicant attributes such as race, gender, marital status, etc.

Yet the Home Credit competition’s data actually includes most of these personal attributes, features that it would be illegal to base an evaluation on in the US.

Guidance from the Consumer Finance Protection Bureau on how to identify and protect against credit discrimination. See the full article here.

In addition to protections against credit discrimination, the financial services industry also faces intense regulation over which models can be used in production. Financial services firms are required to audit the decision process and ensure it’s not discriminatory, something which can only be done with more explicit and interpretable model frameworks like XGBoost (and definitely not neural networks). The US Fair Credit Reporting Act requires that agencies disclose “all of the key factors that adversely affected the credit score of the consumer in the model used, the total number of which shall not exceed four”.

What’s missing with a model?

Our conversations with Philippa and Cody also shed light on the value of the human aspect of the lending process. For mortgage professionals like Philippa and Cody, the applicants are more than just Approvals or Rejections — they’re clients with aspirations for their future. Cody noted that she would often have a discussion with an applicant about his/her financial goals for the future, and dig deeper into specific data points, like late payments, to find out if there was some unavoidable reason behind the overdue payment.

Can a model take that into account?

It was a viewpoint and question that resonated with us. As consumers of financial products like credit cards or student loans, there are certainly many times when we would prefer to talk to a human agent, not because we wanted to change a decision, but because we wanted clarity around why a certain decision was made (e.g., if I was rejected for a credit card, I want to know whether the decision came from a low credit score, a lack of credit history, or some other reason).

Models embedded within software programs have a real impact on people’s lives (just take Wells Fargo’s recent glitch, which caused 625 people to lose their homes to foreclosure by miscalculating their eligibility for mortgage modifications). By creating frameworks for hybrid ML infrastructures that wrap in human considerations, ethics, and experts, it’s clear that we can make much more progress (in a Kaggle competition or in real life).


Team:

Dhruv Nair is a Data Scientist on the team. Before joining, he worked as a Research Engineer on the Physical Analytics team at the IBM T.J. Watson Lab.

About — is doing for ML what GitHub did for code. Our lightweight SDK enables data science teams to automatically track their datasets, code changes, and experimentation history. This way, data scientists can easily reproduce their models and collaborate on model iteration within their team!

Preferred Financial Team:

Cody Dadiw has been working in wholesale mortgage lending since 2012 and is currently the direct assistant and loan processor for the President. Cody has experience with a variety of different loan scenarios beyond conventional products and, as someone who is people-driven, values the opportunity to make a meaningful impact on a client’s life.

Philippa Stewart-Donnelly has been working as a mortgage broker since 2010. She got started in wholesale mortgage lending after following in her late father’s footsteps, and has since enjoyed growing the business and cultivating close relationships with her clients. Philippa values transparency in the mortgage process, always striving to educate her clients and act as an advocate on their behalf. Philippa and Cody teamed up in 2012, with the shared goal of streamlining the mortgage process to make it as painless as possible for people.
