DCG Insights

XGBoost Fundamentals for Bankers: The Mechanics Behind the Algorithm

 

XGBoost is a popular supervised machine learning algorithm among data scientists and may be the new wave of modeling credit risk at banks and credit unions. This article explains the basics of the mechanics behind the algorithm.

For a higher-level discussion of XGBoost and the key takeaways for bank and credit union management, click to read Credit Risk Meets Machine Learning: 6 Cautionary Takeaways for Banking Executives.
 

What Is XGBoost?

XGBoost stands for Extreme Gradient Boosting. It is a supervised machine learning algorithm well-known for modeling complex datasets with high performance, flexibility, and scale. At a high level, XGBoost creates an ensemble of classification and regression trees through gradient boosting (a calculus-based optimization method) and then uses the ensemble to make a prediction.

Let's break this down.

What Is a Supervised Machine Learning Algorithm? 

Supervised machine learning algorithms learn how to make predictions based on data with the benefit of already knowing the answer in advance. An algorithm's learning (or training) process tests how accurate its predictions are against the known outcome and can therefore adjust to reach an optimal state. Because the training process is directed to a known target, it is classified as “supervised.”

The alternative would be an unsupervised algorithm. When an outcome is not already known during training, the algorithm explores data to find patterns. An example of an unsupervised algorithm would be discovering groups within a dataset, called “clustering.”   
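To make the distinction concrete, here is a minimal Python sketch (using scikit-learn and an entirely synthetic, invented dataset) that fits a supervised classifier, which trains against known default labels, and an unsupervised clustering algorithm, which only sees the features:

```python
# Minimal sketch contrasting supervised and unsupervised learning.
# The "loan" data below is synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = np.column_stack([
    rng.normal(700, 50, 500),    # stand-in for a credit score
    rng.uniform(0.1, 0.6, 500),  # stand-in for debt-to-income
])
y = (rng.random(500) < 0.10).astype(int)  # known default labels (the "answer")

# Supervised: the algorithm trains against the known outcome y.
clf = LogisticRegression(max_iter=1000).fit(X, y)
print("Predicted default probabilities:", clf.predict_proba(X[:3])[:, 1])

# Unsupervised: no labels are given; the algorithm only looks for structure.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print("Cluster assignments:", clusters[:10])
```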

In general, practitioners use supervised algorithms to do two things.

The first is to observe characteristics and predict the class of a new observation. For example, will this loan application default 90 days after origination, yes or no? The algorithm can learn from previous applications and determine the characteristics, like FICO or debt-to-income, that may help determine future default status.  

The second is to observe characteristics and estimate a continuous numeric value. In the above example, it may be better to know more than just yes or no. It may be more helpful to know the probability of default instead (a continuous value between 0 and 1). You may recognize the similarity to logistic regression. 

XGBoost performs both classification and regression. For classification problems such as “will the loan default or not,” XGBoost can use the estimated probability of default and a threshold to determine a final classification of a new applicant. While this article focuses on XGBoost classification models, know also that XGBoost can use regression to solve for other prediction problems, like forecasting prices.
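As a rough illustration of that classification workflow, the hypothetical Python sketch below trains an XGBoost classifier on synthetic data (the features, labels, and the 0.5 cutoff are invented for illustration, not a recommended credit model), reads off the estimated probability of default, and applies a threshold to reach a yes/no decision:

```python
# Hypothetical sketch: XGBoost returns probabilities, then a threshold makes the call.
# The data and the 0.5 threshold are illustrative only.
import numpy as np
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n = 1000
X = np.column_stack([
    rng.normal(700, 50, n),      # stand-in for FICO
    rng.uniform(0.1, 0.6, n),    # stand-in for debt-to-income
])
# Synthetic default flag loosely tied to the two features
y = ((X[:, 0] < 660) & (X[:, 1] > 0.4)).astype(int)

model = XGBClassifier(objective="binary:logistic", n_estimators=100, max_depth=3)
model.fit(X, y)

prob_default = model.predict_proba(X[:5])[:, 1]  # continuous value between 0 and 1
decision = (prob_default >= 0.5).astype(int)     # threshold turns it into yes/no
print(prob_default, decision)
```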

About Decision Trees

ree

 A decision tree is a simple modeling technique that is easy to interpret and visually represent. Decision trees will sequentially “branch” or “split” data into smaller and smaller groups to create homogeneous sub-groups. Decision trees can be used for classification (predicting categorical outcomes) or regression problems (predicting continuous outcomes). 

Classification and Regression Tree (CART) is the decision tree algorithm that XGBoost uses for both classification and regression. The CART algorithm scans the dataset, evaluates each feature and candidate split value, and determines the best way to divide the data: the feature and split value that produce the most homogeneous resulting subgroups. This exhaustive search is known as a "greedy" approach because it considers every possible split and takes the one that looks best at that step.

XGBoost also allows other approaches, such as a histogram-based or approximate algorithm, to find the split feature and value. The flexibility in split finding is one of the reasons that XGBoost is highly efficient with big datasets. A greedy approach may be cumbersome and time-consuming for high-dimensional datasets. However, by having the option to pivot to a histogram-based or approximate algorithm for split finding instead, XGBoost can decrease training time while preserving predictive performance.

Once the algorithm determines the best feature and split value, it divides the data into two groups: datapoints with that feature's value above the split value and datapoints at or below it. From there, the algorithm recursively splits each subgroup by the best feature and split value until a stopping condition is met or further splitting offers no improvement in the performance metric.
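To make the greedy split search concrete, the toy Python function below (a simplified stand-in, not XGBoost's actual implementation) scans every feature and candidate split value and scores each candidate by the weighted Gini impurity of the two resulting groups, keeping the split that produces the most homogeneous subgroups:

```python
# Toy illustration of greedy split finding; XGBoost's real implementation is far
# more optimized and scores splits with gradient/hessian statistics, not Gini.
import numpy as np

def gini(labels):
    """Gini impurity of a group of 0/1 labels (0 = perfectly homogeneous)."""
    if len(labels) == 0:
        return 0.0
    p = labels.mean()
    return 1.0 - p**2 - (1.0 - p)**2

def best_split(X, y):
    """Scan every feature and split value; keep the split with lowest weighted impurity."""
    best = (None, None, np.inf)  # (feature index, split value, weighted impurity)
    n = len(y)
    for j in range(X.shape[1]):
        for value in np.unique(X[:, j]):
            left, right = y[X[:, j] <= value], y[X[:, j] > value]
            score = len(left) / n * gini(left) + len(right) / n * gini(right)
            if score < best[2]:
                best = (j, value, score)
    return best

X = np.array([[620, 0.45], [700, 0.30], [580, 0.50], [720, 0.20]])
y = np.array([1, 0, 1, 0])  # 1 = defaulted
print(best_split(X, y))     # expected: split on feature 0 (the first column)
```

XGBoost scores splits differently (see the Extra Credit section below), but the exhaustive scan over features and split values is the same idea.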

XGBoost’s next step is to create an ensemble of these decision tree models through a process known as gradient boosting.  

About Gradient Boosting 

Gradient boosting is an ensemble method, which means it combines a large collection of "weak" models and bases its overall prediction on the predictions of those individual weak models. As the name implies, a weak model uses only a few features and provides a less reliable prediction on its own. For example, a weak model for predicting snow today could be: Is the temperature below freezing? A cold day is a prerequisite for snow, but snow does not fall on every cold day.

A gradient boosting algorithm iteratively adds new trees as it learns from the prior tree's mistakes. Continuing the previous example, consider the predictive power of adding another weak model, Is it cloudy? This corrects the earlier model's mistake of predicting snow on cold but clear days. With the addition of more weak models, the overall prediction gets stronger.


This learning process (adding more trees) needs guidance to know when to stop to avoid “overfitting.” Overfitting describes when the model learns the training data too well and starts including outliers and random fluctuations that don’t generalize to new data. So, an “overfit” model may perform very well on the data it trained on, but not as well on fresh data. Imagine a student who memorized every tiny detail of a practice exam, including random typos and the color of the ink, who goes on to fail the real exam because the questions are slightly different.

XGBoost modifies traditional gradient boosting for flexibility and performance. It does this by adding a regularization term to the objective function (i.e., a kind of penalty for models that try to get too fancy), which helps it to balance making accurate predictions with keeping the model simple.
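In the XGBoost Python package (assuming the scikit-learn style wrapper in a recent release), these regularization and overfitting controls appear as ordinary training parameters; the values below are illustrative placeholders, not recommendations:

```python
# Illustrative parameter sketch; values are placeholders, not recommendations.
from xgboost import XGBClassifier

model = XGBClassifier(
    objective="binary:logistic",
    n_estimators=500,           # maximum number of boosted trees
    learning_rate=0.05,         # shrinks each new tree's contribution
    max_depth=3,                # caps how complex each tree can grow
    reg_lambda=1.0,             # L2 penalty (the lambda term in the objective)
    gamma=0.1,                  # minimum loss reduction required to keep a split
    early_stopping_rounds=25,   # stop adding trees when validation performance stalls
    eval_metric="auc",
)
# Hypothetical training call with a held-out validation set:
# model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)])
```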

Mathematically, the gradient of the loss function guides the model toward the optimal result, which in turn minimizes the overall error of the model. In layperson's terms, the loss function is like a judge that observes each attempt the model makes to grow and picks the attempt that yields the best improvement in predictive power.

Practical Limitations and Considerations

XGBoost's effectiveness depends heavily on the data and the training parameters. The model learns from the data, so the more representative the data it learns from, the better it will perform when facing new information.

During development, management must consider both the model's objective and what data are available. The balance between the two influences data preparation and model training. For example, when modeling a sequence such as a time series, model developers should prepare the training data using sequential sampling. If, instead, the model solves a problem where the order of observations is not essential, then stratified or randomized data splitting may be more appropriate.
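The short sketch below (scikit-learn utilities on synthetic data) contrasts the two approaches: a time-ordered split that always validates on observations that come after the training window, versus a stratified random split that preserves the class mix when order does not matter:

```python
# Sketch of two data-splitting approaches; the data is synthetic and illustrative.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit, train_test_split

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))            # pretend the rows are in time order
y = (rng.random(100) < 0.2).astype(int)

# Sequential sampling: each fold trains on the past and validates on the future.
for train_idx, valid_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train through row", train_idx.max(),
          "-> validate rows", valid_idx.min(), "to", valid_idx.max())

# Order-independent problems: a stratified random split preserves the class mix.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)
```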

After determining the objective and available data, model developers must “tune” the model, i.e., adjust its settings (parameters) so it will perform better on new data.

Modelers can tune many parameters in XGBoost, and the model's performance is sensitive to the tuning process. Developers and management should review various performance metrics to assess the quality of the model as it is tuned. Because XGBoost is so flexible, model developers can use many different metrics, but the selected metrics ultimately depend on the model's purpose and objective function (the function the algorithm solves to minimize loss and improve predictive performance). For classification models, helpful diagnostics include learning curves, confusion matrices, and ROC (receiver operating characteristic) and PR (precision-recall) curves.
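For a classification model, those diagnostics are straightforward to produce with standard tooling; the sketch below uses scikit-learn on invented labels and scores purely to show the mechanics:

```python
# Sketch of common classification diagnostics; the inputs are synthetic.
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score, precision_recall_curve

rng = np.random.default_rng(7)
y_true = (rng.random(200) < 0.15).astype(int)                  # actual default flags
y_prob = np.clip(y_true * 0.5 + rng.random(200) * 0.6, 0, 1)   # made-up model scores

y_pred = (y_prob >= 0.5).astype(int)               # apply a threshold
print(confusion_matrix(y_true, y_pred))            # counts of right and wrong calls
print("ROC AUC:", roc_auc_score(y_true, y_prob))   # rank-ordering power
precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
```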

Extra Credit: Algorithm Details for the Math-Inclined

How does gradient boosting work to solve problems in XGBoost?

  1. Define a loss function (l). The model owner chooses this function depending on the problem they are trying to model. For example, to classify a potential borrower as likely to default, a model developer may use an XGBoost classification model with the objective parameter set to "binary:logistic." The loss function is then:

    $l(y_i, p_i) = -\left[\, y_i \ln(p_i) + (1 - y_i)\ln(1 - p_i) \,\right]$

    where y_i is the actual outcome for observation i (1 = default, 0 = no default) and p_i is the model's predicted probability of default.

    Behind the scenes, the loss function is combined with a regularization term to construct XGBoost's objective function. The loss function measures how well the model is performing and encourages fitting the training data. The regularization term prevents overfitting the training data by penalizing tree complexity. XGBoost optimizes the objective function in an additive manner, adding a new tree that improves the overall model at each iteration. The combined objective function is:

    $\text{Obj} = \sum_{i=1}^{n} l(y_i, p_i) + \sum_{k=1}^{K} \Omega(f_k)$

    where the first sum runs over the n training observations and the second over the K trees in the ensemble, with $\Omega$ measuring the complexity of each tree.

    Later steps will show how the algorithm optimizes this. For now, the important point is that the structure of this objective function makes XGBoost very flexible.

  2. Next, XGBoost generates an initial tree from the factors (X) and known label (Y). It begins with a very naive prediction of 0.5.

    $p_i^{(0)} = 0.5 \quad \text{for every observation } i$

    At this phase, the ensemble has only a single tree, and the predicted value of the ensemble is the predicted value based on the tree.

  3. The XGBoost algorithm then calculates the pseudo residuals. This is simply the difference between the actual and predicted values for each datapoint in the training dataset. Recall that the predicted value is determined using the ensemble, not a single decision tree.

  4. Once it has the pseudo residuals, the algorithm can train the next tree in the ensemble using the same factors (X) and the pseudo residuals. The new tree is then fit so that it minimizes the objective function. How?

    Some calculus takes place during the optimization process. Without going through the full derivations, the following presents the final equations, interprets them, and shows their specific form for the classification example.

    When XGBoost builds a tree, it starts at the base (the root node). The root node contains all of the pseudo residual values. From there, the algorithm makes its first split. It must find the factor, and a split value for that factor, that reveals the most information from the data. To put it another way, it needs to find the factor and split value that best separate the y labels. To do this, the algorithm generates a score to compare different factors and split values with each other and find the best one. For a simple classification decision tree, the score might be an information gain, Gini impurity, or entropy value. However, XGBoost solves various problems, so the score must generalize across them. The generalized score, the Optimal Value, is:

    $\text{Optimal Value} = \frac{\left( \sum_{i \in I} g_i \right)^2}{\sum_{i \in I} h_i + \lambda}$

    where I denotes the set of observations (pseudo residuals) in the node.

    In the equation, g stands for the gradients (first-order derivatives of the loss function with respect to the prediction) and h for the hessians (second-order derivatives). These terms may seem advanced and some calculus is involved, but the important thing to know is that this form allows the algorithm to handle many loss functions; g and h simply vary with the chosen loss function. For the binary logistic loss, each observation's gradient is its predicted probability minus its actual label (the negative of its pseudo residual), and its hessian is $p_i(1 - p_i)$. The optimal value function therefore translates to:

    $\text{Optimal Value} = \frac{\left( \sum_{i \in I} (y_i - p_i) \right)^2}{\sum_{i \in I} p_i\,(1 - p_i) + \lambda}$

    So, how can this evaluate potential splits? XGBoost considers a node, and then tests a factor and split value by calculating the optimal value score for the resulting left node, right node, and starting node, and generates a split score as follows:

    $\text{Gain} = \text{Optimal Value}_{\text{left}} + \text{Optimal Value}_{\text{right}} - \text{Optimal Value}_{\text{root}}$

    Iterating through each factor and potential split value for that factor, XGBoost calculates a score to determine how well this split improves the model's performance. It chooses the factor and value that provides the best improvement in performance, and estimates this improvement by the gain.

    The algorithm continues the splitting process until certain conditions are met. The user can set the maximum depth of a tree; if this parameter is set, the algorithm stops growing a tree once that depth is reached. The algorithm also stops growing a tree if a leaf reaches the minimum number of residuals, which the model developer can set with the minimum child weight parameter. If this parameter is not set explicitly, it defaults to 1, and the algorithm will not keep a leaf whose cover is less than 1. The cover is the sum of the hessians in the leaf. For our classification example, it is:

    $\text{Cover} = \sum_{i \in I} p_i\,(1 - p_i)$

    So far, this discussion has considered only the known y values and predicted y values in minimizing the loss function, but hasn’t covered lambda or gamma. Recall that XGBoost uses a regularization term:

    $\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda \sum_{j=1}^{T} w_j^2$

    where T is the number of leaves in the tree and $w_j$ is the output value of leaf j.

    This term helps avoid overfitting the dataset to provide a better model. Lambda appears in the denominator of the score functions above; it is used to reduce the model's sensitivity to individual data points. When lambda is greater than 0, it reduces the resulting score by increasing the denominator's value.

    Gamma is used to help prune the tree. The algorithm will consider the gain value of a node and subtract gamma from that gain value. If this difference results in a negative value, that node is removed from the tree. The resulting loss reduction after a split is:

    $\text{Loss Reduction} = \text{Gain} - \gamma$

    Once a tree is grown, the algorithm needs to determine the output value (w) of each leaf:

    $w = -\,\frac{\sum_{i \in I} g_i}{\sum_{i \in I} h_i + \lambda}$

    For the classification example, this translates to:

    $w = \frac{\sum_{i \in I} (y_i - p_i)}{\sum_{i \in I} p_i\,(1 - p_i) + \lambda}$

    The algorithm calculates this for all terminal nodes of a tree. For every observation, it goes through each decision point in the tree until it reaches a terminal node. This terminal node provides the predicted results for that tree.

  5. Finally, the algorithm adds the new tree to the ensemble and updates the model. The ensemble's prediction is the initial tree's prediction plus the output of each subsequent tree scaled by a learning rate. If the model developer sets the number of boosting rounds to one thousand, then one thousand trees are generated; the initial tree is not scaled, but the output of every tree after it (trees 2 through 1,000) is multiplied by the learning rate. The learning rate is an important setting in many machine learning models that controls how much the model incorporates the new information from each training round into its overall prediction.

  6. Repeat steps 3 through 5, using the updated probabilities, for the chosen number of boosting rounds. The sketch below walks through one such round on a toy example.
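The following Python sketch hand-calculates one boosting round for the classification example using the formulas above on a tiny, invented dataset (five observations and one arbitrarily chosen candidate split). It mirrors the math rather than XGBoost's actual implementation:

```python
# Hand-rolled, illustrative pass through one boosting round for binary
# classification; it follows the formulas above, not XGBoost's internal code.
import numpy as np

y = np.array([1, 1, 0, 0, 0])          # actual default flags (toy data)
p = np.full_like(y, 0.5, dtype=float)  # step 2: initial prediction of 0.5
lam, gamma_, lr = 1.0, 0.0, 0.3        # lambda, gamma, learning rate (illustrative)

# Step 3: pseudo residuals (actual minus predicted probability).
residuals = y - p

def optimal_value(res, prob):
    """Optimal value (score) for a node under the binary logistic loss."""
    return res.sum() ** 2 / (np.sum(prob * (1 - prob)) + lam)

def output_value(res, prob):
    """Leaf output value w for the same loss."""
    return res.sum() / (np.sum(prob * (1 - prob)) + lam)

# Step 4: evaluate one candidate split (first two observations vs the rest).
left, right = np.array([0, 1]), np.array([2, 3, 4])
gain = (optimal_value(residuals[left], p[left])
        + optimal_value(residuals[right], p[right])
        - optimal_value(residuals, p)) - gamma_
print("gain:", gain)   # a positive gain means the split is kept

# Step 5: update the model. Convert probabilities to log-odds, add the scaled
# leaf outputs, and convert back to probabilities for the next round.
log_odds = np.log(p / (1 - p))
log_odds[left] += lr * output_value(residuals[left], p[left])
log_odds[right] += lr * output_value(residuals[right], p[right])
p = 1 / (1 + np.exp(-log_odds))
print("updated probabilities:", p)
```

Running the sketch shows the predicted probabilities moving toward the actual labels after a single round, which is exactly what each additional boosting round is meant to accomplish.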


Contact DCG if you have questions about credit modeling, XGBoost, or machine learning in banking.


ABOUT THE AUTHORS

Hannah Glasrud is a Quantitative Consultant at Darling Consulting Group. In her role, she performs model validations and runs DCG’s CECL model. Throughout her career she has worked with a variety of models including CECL, DFAST, capital aggregators, credit, and default rate models. Her work includes validating machine learning models.

Hannah began her career at DCG in the Data Analytics Group as a business systems analyst where she performed prepayment and deposit studies for community banks. From there, she was promoted to Quantitative Analyst where she assisted in model validations and model risk management engagements.

Using her technical knowledge and coding experience, Hannah produces automated reporting tools for various areas within the company and performs code reviews for models written in SAS, R, and Python. She also developed and maintains DCG’s CECL model.

Hannah earned a BA in Economics from Boston University and an MA in Analytics from Georgia Tech.

J. Chase Ogden is a Quantitative Consultant with Darling Consulting Group's Quantitative Risk Analysis and Strategy team. As a practitioner at large and mid-sized financial institutions, Chase has experience in a wide array of modeling approaches, applications, and techniques, including asset/liability models, pricing and profitability, capital models, credit risk and reserve models, operational risk models, deposit studies, prepayment models, branch site analytics, associate goals and incentives, customer attrition models, householding algorithms, and next-most-likely product association. 

Chase is a graduate of the University of Mississippi and holds master’s degrees in international commerce policy and applied statistics from George Mason University and the University of Alabama, respectively. A teacher at heart, Chase frequents as an adjunct instructor of mathematics and statistics.