Gradient Boosting with Intel® Optimization for XGBoost


The xgboost library provides scalable, portable, distributed gradient-boosting algorithms for Python*. The key features of the XGBoost algorithm are sparse awareness with automatic handling of missing data, a block structure to support parallelization, and support for continued training. This article refers to the algorithm as XGBoost and the Python library as xgboost to distinguish between the two.

The XGBoost algorithm’s rapid rise in popularity motivated companies to develop products to support its growth. Intel has made significant contributions in this regard, introducing optimizations to every open source xgboost release starting with 0.81. The Intel® AI Analytics Toolkit (AI Kit) includes Intel® Optimization for XGBoost* and many other optimized libraries for machine learning, such as an optimized version of Python, Scikit-learn* (sklearn), and Modin* to enhance data preprocessing and analytics.

This article focuses on the XGBoost algorithm and compares its performance to related tree-based models. We can access the XGBoost algorithm as a Python package (xgboost) using Anaconda*, Python pip, or other package managers. We install the relevant dependencies step-by-step as we progress through this tutorial.
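For example, a typical pip installation of the library and this tutorial's supporting packages might look like the line below (these are the standard PyPI package names; conda users can install equivalents from their preferred channel):

pip install xgboost scikit-learn pandas matplotlib seaborn category_encoders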

What Is Gradient Boosting?

Gradient boosting, also known as stochastic gradient boosting or a gradient boosting machine, combines the ideas of gradient descent and ensemble boosting to create an algorithm that reduces errors as new decision trees are added to the sequence. It minimizes a convex loss function by iteratively stepping toward its minimum: each new tree is fitted to the negative gradient of the loss with respect to the current predictions. In this tutorial, we train and test a boosted tree evaluated on log loss, sklearn's default loss function for gradient boosting classifiers.
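To make the mechanism concrete, here is a minimal sketch of the boosting loop for squared-error regression, where the negative gradient reduces to the residual. This is an illustration of the idea only, not the tuned implementation used later in this tutorial:

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost(X, y, n_trees=100, learning_rate=0.1):
    """Fit a sequence of shallow trees, each to the current residuals."""
    pred = np.full(len(y), y.mean())  # start from a constant prediction
    trees = []
    for _ in range(n_trees):
        residual = y - pred  # negative gradient of the squared-error loss
        tree = DecisionTreeRegressor(max_depth=3).fit(X, residual)
        pred += learning_rate * tree.predict(X)  # one gradient descent step
        trees.append(tree)
    return trees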

Decision tree models are particularly susceptible to underfitting because their simple design favors interpretability. We can use ensemble learning to reduce that risk. An ensemble combines multiple models into a single architecture, using the base learners' predictions to train additional models until convergence. The two main types of ensembles are bagging and boosting. Bagging selects data points randomly, with replacement and equal probability, thereby reducing variance. Boosting weights data points according to the ensemble's performance so far, thereby reducing bias. The XGBoost algorithm combines these concepts to achieve both low bias and low variance.
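The distinction is easy to see in sklearn, where both ensemble styles wrap decision trees by default. The estimator choices below are illustrative, not this tutorial's benchmark models:

from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier

# Bagging: each tree trains on a bootstrap sample drawn with equal
# probability and replacement, which mainly reduces variance.
bagging = BaggingClassifier(n_estimators=50, random_state=42)

# Boosting: each tree upweights the examples earlier trees misclassified,
# which mainly reduces bias.
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)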

How to Perform Gradient Boosting

In this project, we implement, evaluate, and compare a regular decision tree model, a gradient-boosting decision tree, and the XGBoost algorithm using Intel Optimization for XGBoost. The task is to take a set of attributes that describe a car and classify its quality as unacceptable, acceptable, good, or very good.

Let’s begin. First, download the dataset from the University of California at Irvine (UCI) machine learning repository’s website or Kaggle*.
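If you prefer to skip the manual download, pandas can also read the file directly from the UCI repository. The URL below is the standard UCI mirror for this dataset; verify it is still live before relying on it:

import pandas as pd

url = "https://archive.ics.uci.edu/ml/machine-learning-databases/car/car.data"
df = pd.read_csv(url)  # the raw file has no header row, so the first record
                       # becomes the header, matching the generic column
                       # names shown in the df.info() output below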

The dataset contains the following six features used to classify a car’s quality:

- Buying price
- Maintenance cost
- Number of doors
- Number of passengers
- Luggage boot
- Estimated safety level

Second, import the necessary libraries and load the entire dataset. Run the code snippets in your Anaconda environment using your preferred integrated development environment (IDE). We run all the code in this article in a Jupyter* Notebook; following along in a notebook will minimize the risk of errors.

import time
import warnings

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import category_encoders as ce
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("pathToData\\car_evaluation.csv")

Third, we conduct some exploratory data analysis to understand the data better. The pandas command df.info() prints relevant information about the characteristics of our DataFrame, such as the presence of null values and our features' data types.

Execute df.info(). This is the result:

RangeIndex: 1727 entries, 0 to 1726
Data columns (total 7 columns):
 #   Column   Non-Null Count  Dtype
---  ------   --------------  -----
 0   vhigh    1727 non-null   object
 1   vhigh.1  1727 non-null   object
 2   2        1727 non-null   object
 3   2.1      1727 non-null   object
 4   small    1727 non-null   object
 5   low      1727 non-null   object
 6   unacc    1727 non-null   object
dtypes: object(7)
memory usage: 94.6+ KB

Although we can infer what each column represents based on the dataset’s description, the columns currently have generic names. So, let’s pass a list of names to give each column a descriptive label.

df.columns = ['buying', 'maint', 'doors', 'persons', 'lug_boot', 'safety', 'class']

Before instantiating our models, we need to encode all our categorical variables and split our dataset into training and testing sets. We can easily split the data with sklearn's train_test_split function:

# Separate class column for target variable
X = df.drop(['class'], axis=1)
y = df['class']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

Note that the categorical variables are ordinal in nature. We can use the OrdinalEncoder from the category_encoders package to encode our data accordingly:

# Encode training features with ordinal encoding
encoder = ce.OrdinalEncoder(cols=X_train.columns)
X_train = encoder.fit_transform(X_train)
X_test = encoder.transform(X_test)

In the following three sections, we implement three approaches for comparison: a simple decision tree, a gradient boosting machine, and the XGBoost algorithm with Intel Optimization for XGBoost.
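All three models share sklearn's fit/predict interface. As a preview, here is a minimal sketch of the latter two classifiers. The hyperparameters below are defaults, not tuned values, and recent xgboost releases expect integer-encoded class labels, so we encode the string targets first:

from sklearn.ensemble import GradientBoostingClassifier
from sklearn.preprocessing import LabelEncoder
from xgboost import XGBClassifier

# Gradient boosting machine from sklearn
gbm = GradientBoostingClassifier(random_state=42)
gbm.fit(X_train, y_train)

# XGBoost classifier; recent releases require integer class labels
le = LabelEncoder()
y_train_enc = le.fit_transform(y_train)
xgb_clf = XGBClassifier(random_state=42)
xgb_clf.fit(X_train, y_train_enc)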

Creating a Simple Decision Tree

The scikit-learn Python package provides many functionalities for accessing, training, and evaluating machine learning models. The DecisionTreeClassifier class implements a simple decision tree model for classification tasks.

Let’s establish our benchmark as the accuracy score of the default implementation of this model after training.

We run the following code in our chosen IDE to instantiate and train a decision tree classifier with sklearn.

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(random_state=42)

# Fit the model
tree.fit(X_train, y_train)

Next, we evaluate our model's predictions by computing the accuracy score, the percentage of correctly classified predictions, using the code below.

# Predict and test on test data
y_hat = tree.predict(X_test)
accuracy_score(y_test, y_hat)
> 0.9441233140655106

Therefore, our benchmark score is 94.4 percent prediction accuracy.


