Data Analysis for Student Grade Prediction with Machine Learning (Beginner-Friendly, with Runnable Source Code)


2024-07-12 14:15 | Source: compiled from the web

The dataset is available in the comments section.


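Disclaimer

The code in this article is adapted from the work of the Kaggle author DIPAMVASANI. The aim is to keep the essential parts and analyze them, without letting peripheral detail obscure the core idea. For that reason the visualization module has been removed and explanatory comments have been added to the core code.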

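Approach

Checking the data

Inspecting the data matters. I skipped a careful check in the past and missed a lot of useful information as a result, so start by examining the dataset and make sure nothing essential is overlooked. In particular, keep the y value (the value to predict) clearly separate from the x values (the attributes).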
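As a minimal sketch of that first check, the snippet below inspects a small table and then splits it into target and attributes. The toy DataFrame is a hypothetical miniature made up for illustration; only the column name `G3` (the final grade, used as the target throughout this article) comes from the actual code.

```python
import pandas as pd

# Hypothetical miniature of the student table; the real file has many more columns.
toy = pd.DataFrame({
    "sex": ["F", "M", "F", "M"],
    "studytime": [2, 1, 3, 2],
    "failures": [0, 2, 0, 1],
    "G3": [15, 6, 18, 10],
})

# Inspect shape, column types, and missing values before doing anything else.
print(toy.shape)
print(toy.dtypes)
print(toy.isna().sum())

# Separate the target (y) from the attributes (X) explicitly.
target_name = "G3"
y = toy[target_name]
X = toy.drop(columns=[target_name])

print(list(X.columns))
```

Keeping the target name in a single variable makes it harder to leave the target among the attributes by accident.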

Visualization

This part of the code is not actually needed for the final analysis, so it has been removed.

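Analysis

Correlation analysis

First, one-hot encode the non-numeric variables. Then compute the correlation between the target and every attribute, and keep the top K attributes for the final model.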
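The two steps above can be sketched on a tiny table. The data here is invented for illustration and `k = 2` is an arbitrary choice; the article's own code keeps the 8 attributes most correlated with `G3`.

```python
import pandas as pd

# Invented toy data: one categorical column, two numeric columns, and the target G3.
toy = pd.DataFrame({
    "higher": ["yes", "yes", "no", "yes", "no", "yes"],
    "failures": [0, 0, 3, 1, 2, 0],
    "absences": [2, 4, 10, 6, 3, 0],
    "G3": [16, 14, 5, 11, 8, 17],
})

# Step 1: one-hot encode the non-numeric columns ("higher" becomes higher_no / higher_yes).
encoded = pd.get_dummies(toy, dtype=float)

# Step 2: absolute correlation of every column with the target, strongest first.
corr_with_target = encoded.corr()["G3"].abs().sort_values(ascending=False)

# Step 3: drop the target itself (it correlates perfectly with itself) and keep the top k.
k = 2
top_k = corr_with_target.drop("G3").head(k)
selected = encoded[top_k.index]

print(corr_with_target)
print(selected.head())
```

Note that the two dummy columns of a binary variable are mirror images of each other, so they always share the same absolute correlation with the target.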

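Building the model

First, try predicting with a single constant taken from the training set (the code below uses the median). If a later model does even worse than that baseline, well, it is time to say goodbye to that model. Then split the dataset into training and test sets, paying attention to the random seed so the split is reproducible. Once it is split, run each model, compare the errors, and pick the best one. That completes the process.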
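A minimal sketch of this workflow, assuming synthetic data rather than the student dataset: the features and the linear relationship below are made up purely so the example is self-contained. It shows that a fixed `random_state` reproduces the same split, and how a model's error is compared against the constant baseline.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y depends linearly on two of three features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 10 + 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=1.0, size=200)

# A fixed integer random_state gives the same split on every run.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)
X_train_again, _, _, _ = train_test_split(X, y, test_size=0.25, random_state=42)
same_split = np.array_equal(X_train, X_train_again)

# Baseline: predict the training-set median for every test sample.
baseline_pred = np.median(y_train)
baseline_mae = np.mean(np.abs(baseline_pred - y_test))

# Candidate model: ordinary linear regression.
model = LinearRegression().fit(X_train, y_train)
model_mae = np.mean(np.abs(model.predict(X_test) - y_test))

print("Same split on both calls:", same_split)
print("Baseline MAE: {:.4f}".format(baseline_mae))
print("Model MAE:    {:.4f}".format(model_mae))
```

A model is only worth keeping if its error is clearly below the baseline's; on real data the gap is usually much smaller than on this deliberately easy synthetic example.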

If this part is not clear yet, don't worry: it is explained in detail below.

Code implementation

First, read in the data and use the describe function to look at its basic statistics.

    # Some basic analysis
    import pandas as pd

    student = pd.read_csv('./input/student-mat.csv')
    print(student.head())
    print('Total number of students:', len(student))
    print(student['G3'].describe())

Then one-hot encode the whole dataset.

    # Encoding categorical variables
    # Select only categorical variables
    category_df = student.select_dtypes(include=['object'])  # pick the non-numeric columns

    # One hot encode the variables
    dummy_df = pd.get_dummies(category_df)  # get_dummies performs the one-hot encoding

    # Put the grade back in the dataframe
    dummy_df['G3'] = student['G3']

    # Find correlations with grade
    print(dummy_df.corr()['G3'].sort_values())

    # Applying one hot encoding to our data and finding correlation again
    # selecting the most correlated values and dropping the others
    labels = student['G3']  # values of the G3 column (the target)

    # drop the school and grade columns
    student = student.drop(['school', 'G1', 'G2'], axis='columns')  # drop three columns

    # One-Hot Encoding of Categorical Variables
    student = pd.get_dummies(student)  # one-hot encode the whole dataset

Next comes the correlation analysis.

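    # Find correlations with the Grade: correlations with G3, sorted in descending order
    # student.corr() builds the correlation matrix, .abs() takes absolute values,
    # ['G3'] picks the G3 column; sort_values is ascending by default, hence ascending=False
    most_correlated = student.corr().abs()['G3'].sort_values(ascending=False)

    # Maintain the top 8 most correlation features with Grade
    # [:9] is the same as [0:9] (index 0 included, 9 excluded):
    # the 9 most correlated variables, i.e. G3 itself plus 8 others
    most_correlated = most_correlated[:9]
    print(most_correlated)

    # .loc selects rows and columns by label (rows before the comma, columns after);
    # here it keeps every student's values for the nine most correlated columns
    student = student.loc[:, most_correlated.index]
    print(student.head())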

Then use the simplest possible predictor, the training-set median, as our baseline.

    import numpy as np
    from sklearn.model_selection import train_test_split

    # splitting the data into training and testing data (75% and 25%)
    # we mention the random state to achieve the same split everytime we run the code
    # train_test_split(features, labels, test-set fraction, random_state):
    # random_state=None (the default) gives a different split each run, an integer gives the same one
    X_train, X_test, y_train, y_test = train_test_split(student, labels, test_size=0.25, random_state=42)
    print(X_train.head())

    # Calculate mae and rmse
    def evaluate_predictions(predictions, true):
        mae = np.mean(abs(predictions - true))  # mean absolute error
        rmse = np.sqrt(np.mean((predictions - true) ** 2))  # root mean squared error
        return mae, rmse

    # find the median
    median_pred = X_train['G3'].median()  # median of the training targets

    # create a list with all values as median
    median_preds = [median_pred for _ in range(len(X_test))]  # one copy of the median per test sample

    # store the true G3 values for passing into the function
    true = X_test['G3']  # the actual results

    # Display the naive baseline metrics
    mb_mae, mb_rmse = evaluate_predictions(median_preds, true)
    print('Median Baseline MAE: {:.4f}'.format(mb_mae))
    print('Median Baseline RMSE: {:.4f}'.format(mb_rmse))
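To make the two metrics concrete, here is a small worked example with numbers chosen by hand (they are not taken from the student dataset). MAE averages the absolute errors, while RMSE averages the squared errors and then takes the square root, so it penalizes large misses more heavily.

```python
import numpy as np

def mae_rmse(predictions, true):
    """Return (mean absolute error, root mean squared error)."""
    predictions = np.asarray(predictions, dtype=float)
    true = np.asarray(true, dtype=float)
    errors = predictions - true
    mae = np.mean(np.abs(errors))
    rmse = np.sqrt(np.mean(errors ** 2))
    return mae, rmse

# Hand-picked grades; their median is (10 + 12) / 2 = 11.
true = [10, 12, 8, 15]
preds = [np.median(true)] * len(true)   # constant prediction of 11 for every sample

# Errors are 1, -1, 3, -4:
#   MAE  = (1 + 1 + 3 + 4) / 4        = 2.25
#   RMSE = sqrt((1 + 1 + 9 + 16) / 4) = sqrt(6.75)
mae, rmse = mae_rmse(preds, true)
print("MAE: {:.4f}, RMSE: {:.4f}".format(mae, rmse))  # MAE: 2.2500, RMSE: 2.5981
```

RMSE is never smaller than MAE on the same errors, and the gap widens when a few errors are much larger than the rest.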

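Finally, run the models and produce the results. (The last part is the only one that generates a figure; it is used to compare the models and is worth understanding.)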

    # evaluate() is defined in the full listing under "Runnable code"
    results = evaluate(X_train, X_test, y_train, y_test)
    print(results)

    plt.figure(figsize=(12, 8))

    # Mean absolute error
    ax = plt.subplot(1, 2, 1)
    results.sort_values('mae', ascending=True).plot.bar(y='mae', color='b', ax=ax, fontsize=20)
    plt.title('Model Mean Absolute Error', fontsize=20)
    plt.ylabel('MAE', fontsize=20)

    # Root mean squared error
    ax = plt.subplot(1, 2, 2)
    results.sort_values('rmse', ascending=True).plot.bar(y='rmse', color='r', ax=ax, fontsize=20)
    plt.title('Model Root Mean Squared Error', fontsize=20)
    plt.ylabel('RMSE', fontsize=20)

    plt.show()

Runnable code

    import pandas as pd
    from matplotlib import pyplot as plt
    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.linear_model import ElasticNet
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.ensemble import ExtraTreesRegressor
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.svm import SVR
    # Splitting data into training/testing
    from sklearn.model_selection import train_test_split

    # Some basic analysis
    student = pd.read_csv('./input/student-mat.csv')
    print(student.head())
    print('Total number of students:', len(student))
    print(student['G3'].describe())

    # Correlation (numeric columns only; the text columns are not encoded yet)
    print(student.corr(numeric_only=True)['G3'].sort_values())

    # Encoding categorical variables
    # Select only categorical variables
    category_df = student.select_dtypes(include=['object'])  # pick the non-numeric columns

    # One hot encode the variables
    dummy_df = pd.get_dummies(category_df)  # get_dummies performs the one-hot encoding

    # Put the grade back in the dataframe
    dummy_df['G3'] = student['G3']

    # Find correlations with grade
    print(dummy_df.corr()['G3'].sort_values())

    # Applying one hot encoding to our data and finding correlation again
    # selecting the most correlated values and dropping the others
    labels = student['G3']  # values of the G3 column (the target)

    # drop the school and grade columns
    student = student.drop(['school', 'G1', 'G2'], axis='columns')  # drop three columns

    # One-Hot Encoding of Categorical Variables
    student = pd.get_dummies(student)  # one-hot encode the whole dataset

    # Find correlations with the Grade: correlations with G3, sorted in descending order
    # student.corr() builds the correlation matrix, .abs() takes absolute values,
    # ['G3'] picks the G3 column; sort_values is ascending by default, hence ascending=False
    most_correlated = student.corr().abs()['G3'].sort_values(ascending=False)

    # Maintain the top 8 most correlation features with Grade
    # [:9] is the same as [0:9] (index 0 included, 9 excluded):
    # the 9 most correlated variables, i.e. G3 itself plus 8 others
    most_correlated = most_correlated[:9]
    print(most_correlated)

    # .loc selects rows and columns by label (rows before the comma, columns after);
    # here it keeps every student's values for the nine most correlated columns
    student = student.loc[:, most_correlated.index]
    print(student.head())

    # splitting the data into training and testing data (75% and 25%)
    # we mention the random state to achieve the same split everytime we run the code
    # train_test_split(features, labels, test-set fraction, random_state):
    # random_state=None (the default) gives a different split each run, an integer gives the same one
    X_train, X_test, y_train, y_test = train_test_split(student, labels, test_size=0.25, random_state=42)
    print(X_train.head())

    # Calculate mae and rmse
    def evaluate_predictions(predictions, true):
        mae = np.mean(abs(predictions - true))  # mean absolute error
        rmse = np.sqrt(np.mean((predictions - true) ** 2))  # root mean squared error
        return mae, rmse

    # find the median
    median_pred = X_train['G3'].median()  # median of the training targets

    # create a list with all values as median
    median_preds = [median_pred for _ in range(len(X_test))]  # one copy of the median per test sample

    # store the true G3 values for passing into the function
    true = X_test['G3']  # the actual results

    # Display the naive baseline metrics
    mb_mae, mb_rmse = evaluate_predictions(median_preds, true)
    print('Median Baseline MAE: {:.4f}'.format(mb_mae))
    print('Median Baseline RMSE: {:.4f}'.format(mb_rmse))

    # Evaluate several ml models by training on training set and testing on testing set
    def evaluate(X_train, X_test, y_train, y_test):
        # Names of models
        model_name_list = ['Linear Regression', 'ElasticNet Regression',
                           'Random Forest', 'Extra Trees', 'SVM',
                           'Gradient Boosted', 'Baseline']

        # G3 is still among the selected columns, so remove it from the features
        X_train = X_train.drop('G3', axis='columns')
        X_test = X_test.drop('G3', axis='columns')

        # Instantiate the models
        model1 = LinearRegression()
        model2 = ElasticNet()  # combines the ridge and Lasso penalties to curb overfitting
        model3 = RandomForestRegressor()  # an ensemble of decision trees
        model4 = ExtraTreesRegressor()  # similar to a random forest (extremely randomized trees)
        model5 = SVR()  # support vector regression, the regression counterpart of an SVM classifier
        model6 = GradientBoostingRegressor(n_estimators=50)  # gradient boosting regression

        # Dataframe for results (columns = metrics, index = model names)
        results = pd.DataFrame(columns=['mae', 'rmse'], index=model_name_list)

        # Train and predict with each model
        for i, model in enumerate([model1, model2, model3, model4, model5, model6]):
            model.fit(X_train, y_train)
            predictions = model.predict(X_test)

            # Metrics
            mae = np.mean(abs(predictions - y_test))
            rmse = np.sqrt(np.mean((predictions - y_test) ** 2))

            # Insert results into the dataframe
            model_name = model_name_list[i]
            results.loc[model_name, :] = [mae, rmse]

        # Median Value Baseline Metrics
        baseline = np.median(y_train)
        baseline_mae = np.mean(abs(baseline - y_test))
        baseline_rmse = np.sqrt(np.mean((baseline - y_test) ** 2))
        results.loc['Baseline', :] = [baseline_mae, baseline_rmse]

        # cast from object dtype so the values sort and plot as numbers
        return results.astype(float)

    results = evaluate(X_train, X_test, y_train, y_test)
    print(results)

    plt.figure(figsize=(12, 8))

    # Mean absolute error
    ax = plt.subplot(1, 2, 1)
    results.sort_values('mae', ascending=True).plot.bar(y='mae', color='b', ax=ax, fontsize=20)
    plt.title('Model Mean Absolute Error', fontsize=20)
    plt.ylabel('MAE', fontsize=20)

    # Root mean squared error
    ax = plt.subplot(1, 2, 2)
    results.sort_values('rmse', ascending=True).plot.bar(y='rmse', color='r', ax=ax, fontsize=20)
    plt.title('Model Root Mean Squared Error', fontsize=20)
    plt.ylabel('RMSE', fontsize=20)

    plt.show()
