
Red Wine Quality Prediction#

by Nicole Bidwell, Ruocong Sun, Alysen Townsley, Hongyang Zhang

Summary#

In this project, our group uses machine learning algorithms to predict wine quality (on a scale of 0 to 10) from the physicochemical properties of the liquid. We use a train-test split and cross-validation to simulate the model encountering unseen data. We tune the hyperparameters of several classification models (logistic regression, decision tree, kNN, and RBF SVM) to see which one achieves the highest validation accuracy, and then deploy the winner on the test set. The final test set accuracy is around 62.3 percent, which may be decent or poor depending on the standard applied. More importantly, the model fails to correctly identify many of the wines with extreme quality scores (below 5 or above 6), suggesting that it is not very robust to outliers. We include a final discussion section on potential causes of this performance, along with proposed solutions for future analysis.

Introduction#

Red wines have a long history that can be traced all the way back to the ancient Greeks. Today they are more accessible to the average person than ever, and the industry is estimated to be worth around 109.5 billion USD [Company, 2023]. Despite this ubiquity, most people can barely tell the difference between a good and a bad wine, to the point where trained professionals (sommeliers) are needed to articulate the difference. In this project, we use machine learning algorithms to predict the quality of a wine from the physicochemical properties of the liquid. An effective model would give manufacturers and suppliers a more robust understanding of wine quality based on measurable properties.

Methods & Results#

EDA#

Dataset Description#

The dataset is the “winequality-red.csv” file from the UC Irvine Machine Learning Repository [Cortez et al., 2009], originally published in Decision Support Systems, Elsevier [Cortez et al., 2009]. It contains physicochemical properties (features) of red vinho verde wine samples from the north of Portugal, along with an associated wine quality score from 0 (worst) to 10 (best).

|   | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | quality |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |
| 1 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25.0 | 67.0 | 0.9968 | 3.20 | 0.68 | 9.8 | 5 |
| 2 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15.0 | 54.0 | 0.9970 | 3.26 | 0.65 | 9.8 | 5 |
| 3 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17.0 | 60.0 | 0.9980 | 3.16 | 0.58 | 9.8 | 6 |
| 4 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11.0 | 34.0 | 0.9978 | 3.51 | 0.56 | 9.4 | 5 |

Fig. 1 First five rows of the red wine dataframe.#

There are 11 feature columns representing physicochemical characteristics of the wines, such as fixed acidity, residual sugar, chlorides, and density. There are 1599 rows (observations) in the dataset, with no missing values. The target is the quality column, which takes ordinal values from 3 to 8 in this dataset, although scores could in principle range from 0 to 10 (the dataset does not contain observations across the entire range). Most observations have an “average” quality of 5 or 6, with far fewer scoring below 5 or above 6.
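As a minimal loading sketch (the local file path is an assumption; the UCI distribution of this file is semicolon-delimited), these properties can be verified directly:

```python
import pandas as pd

# Path is an assumption; the UCI file uses ";" as its separator.
wine = pd.read_csv("data/winequality-red.csv", sep=";")

print(wine.shape)                # (1599, 12): 11 feature columns plus quality
print(wine.isna().sum().sum())   # 0 -- no missing values
print(wine["quality"].value_counts().sort_index())  # scores 3 through 8, mostly 5 and 6
```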

Columns#

fixed acidity: grams of tartaric acid per cubic decimeter.

volatile acidity: grams of acetic acid per cubic decimeter.

citric acid: grams of citric acid per cubic decimeter.

residual sugar: grams of residual sugar per cubic decimeter.

chlorides: grams of sodium chloride per cubic decimeter.

free sulfur dioxide: milligrams of free (unreacted) sulfur dioxide per cubic decimeter.

total sulfur dioxide: milligrams of total sulfur dioxide per cubic decimeter.

density: density of the wine in grams per cubic centimeter.

pH: pH value of the wine.

sulphates: grams of potassium sulphate per cubic decimeter.

alcohol: percentage of alcohol content by volume.

quality: integer score from 0 (low quality) to 10 (high quality).

Visualization#

We first observe the distribution of the features using their statistical summaries and histograms. The majority of features have skewed distributions, and many contain outliers; volatile acidity, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, and sulphates all have very extreme outliers.

Fig. 2 Histograms showing the distribution of each feature in the red wine dataframe.#
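A figure like Fig. 2 can be reproduced with pandas’ built-in histogram helper (a sketch; the bin count and figure size are arbitrary choices):

```python
import matplotlib.pyplot as plt

# One histogram per feature column; bins and figsize are arbitrary.
wine.drop(columns="quality").hist(bins=30, figsize=(12, 10))
plt.tight_layout()
plt.show()
```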

Model Training#

Model Selection and Hyperparameter Tuning#

Our model selection method uses 5-fold cross-validation and hyperparameter tuning on several models: logistic regression, decision tree, kNN, and RBF SVM, with validation accuracy as our metric. Below we first use a dummy classifier to establish the baseline.
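A sketch of this baseline step, assuming a stratified train-test split (the split ratio, random seed, and dummy strategy are assumptions, not necessarily the exact settings we used):

```python
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_validate, train_test_split

X = wine.drop(columns="quality")
y = wine["quality"]

# Split settings here are illustrative assumptions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123, stratify=y
)

# Predicting the most frequent class reproduces the ~44 percent baseline.
dummy = DummyClassifier(strategy="most_frequent")
cv_results = cross_validate(dummy, X_train, y_train, cv=5, return_train_score=True)
print(pd.DataFrame(cv_results).round(3))
```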

|   | fit_time | score_time | test_score | train_score |
|---|---|---|---|---|
| 0 | 0.002 | 0.001 | 0.442 | 0.440 |
| 1 | 0.001 | 0.003 | 0.438 | 0.441 |
| 2 | 0.001 | 0.002 | 0.438 | 0.441 |
| 3 | 0.001 | 0.001 | 0.442 | 0.440 |
| 4 | 0.001 | 0.001 | 0.444 | 0.440 |

Fig. 3 Cross-validation results for the Dummy Classifier baseline model.#

As we can see, the baseline obtains an accuracy of around 44.08 percent. We now pair cross-validation with hyperparameter tuning to identify the model that performs best.
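Each model’s tuning can be sketched with scikit-learn’s `GridSearchCV`; the RBF SVM case is shown below (the parameter grid is illustrative, not the exact one we searched):

```python
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Feature scaling matters for the RBF kernel; grid values are illustrative.
svc_pipe = make_pipeline(StandardScaler(), SVC())
param_grid = {
    "svc__C": [0.1, 1, 10, 100, 1000],
    "svc__gamma": [0.001, 0.01, 0.1, 1],
}
grid = GridSearchCV(svc_pipe, param_grid, cv=5, n_jobs=-1, return_train_score=True)
grid.fit(X_train, y_train)
print(grid.best_params_, round(grid.best_score_, 3))
```

The same pattern applies to the other three models with their respective hyperparameter grids.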

| model_name | mean_test_score | mean_train_score | mean_fit_time | C | class_weight | criterion | max_depth | n_neighbors | gamma |
|---|---|---|---|---|---|---|---|---|---|
| svc | 0.613 | 0.768 | 0.318 | 1000.0 | No Class Weight | NaN | NaN | NaN | 0.01 |
| knn | 0.598 | 1.000 | 0.011 | NaN | NaN | NaN | NaN | 1.0 | NaN |
| decision_tree | 0.593 | 0.994 | 0.050 | NaN | No Class Weight | gini | 16.0 | NaN | NaN |
| logistic | 0.586 | 0.592 | 0.039 | 0.1 | No Class Weight | NaN | NaN | NaN | NaN |

Fig. 4 Grid search results for the four models: Logistic Regression, Decision Tree, kNN, and SVC.#

We see that logistic regression has a best validation score of 58.6 percent, the decision tree 59.3 percent, kNN 59.8 percent, and RBF SVM 61.3 percent. As a result, we use the tuned RBF SVM as our model on the test set.

Test Set Deployment#

The best model’s accuracy on the test set is around 62.3 percent, a slight improvement over the validation score. We probe its performance further by looking at the confusion matrix.

Fig. 5 Confusion matrix of the SVC model performance on the test data.#
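A matrix like Fig. 5 can be generated directly from the fitted grid search object (a sketch continuing the earlier snippets; display styling is scikit-learn’s default):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# Test accuracy of the tuned RBF SVM; roughly 0.623 on our split.
print(round(grid.score(X_test, y_test), 3))

ConfusionMatrixDisplay.from_estimator(grid.best_estimator_, X_test, y_test)
plt.show()
```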

For the average-quality wines (scores of 5 and 6), the model predicts most observations correctly, but it fails to predict a large proportion of the extreme ones, suggesting that it is not very robust to outliers.

Discussion#

In this project, we built several machine learning classification models seeking to predict wine quality from the physicochemical properties of the liquid. By trying out different models with different hyperparameters, we found that the best-performing model for our dataset is the RBF SVM. However, despite being the best, its accuracy is only around 62.3 percent, which can be poor or decent depending on the situation. More importantly, the algorithm cannot identify the extreme-quality wines reliably, so in settings where people want to find really good or really bad wines, this model’s performance would not meet expectations. Our group concluded that several factors might contribute to this:

High correlations:#

Fig. 6 Correlation matrix for all red wine physicochemical features in the dataframe.#

Several variables in the dataset have a substantial amount of correlation (in the range of 0.6), and this collinearity could have caused problems for some of our models. Given this and the high dimensionality, we could have implemented a dimensionality reduction algorithm (such as PCA) to reduce the number of features and thereby eliminate some of the collinearity, as sketched below.
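A sketch of how PCA could slot into the pipeline (the number of retained components is an assumption that would need tuning; the SVC hyperparameters reuse the tuned values from Fig. 4):

```python
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# PCA after scaling decorrelates the inputs; n_components=8 is an assumption.
pca_pipe = make_pipeline(StandardScaler(), PCA(n_components=8), SVC(C=1000, gamma=0.01))
pca_pipe.fit(X_train, y_train)
print(round(pca_pipe.score(X_test, y_test), 3))
```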

Potential Interactions:#

In our logistic regression model we did not take any potential interactions into account. With this many features, it is possible that some of them modulate the effect of others [Pramoditha, 2021] [UCLA: Advanced Research Computing and Data Analytics, 2021]. One way to add interaction terms is sketched below.
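A sketch using scikit-learn’s `PolynomialFeatures` with `interaction_only=True`, which adds pairwise feature products without squared terms (the regularization strength reuses the tuned value from Fig. 4; the raised `max_iter` is an assumption to help convergence on the expanded feature set):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Pairwise products let each feature's effect depend on the others.
interact_pipe = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegression(C=0.1, max_iter=2000),
)
interact_pipe.fit(X_train, y_train)
print(round(interact_pipe.score(X_test, y_test), 3))
```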

Problem Formulation:#

The response variable could instead be treated as a number, and a regression formulation might have better captured the ordinal nature of our problem and produced a better model. Additionally, due to the limited scope of our dataset (no observations below 3 or above 8), a classification model trained on it can never correctly identify an observation outside that range, whereas a regression algorithm is less susceptible to this problem. A minimal regression sketch follows.
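As one possible reformulation (a sketch using Ridge regression, chosen here purely for illustration; any regressor would do), continuous predictions can be rounded back onto the 0 to 10 quality scale:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

reg = make_pipeline(StandardScaler(), Ridge(alpha=1.0))
reg.fit(X_train, y_train)

# Round and clip continuous predictions back onto the 0-10 quality scale,
# then measure exact-match accuracy for comparison with the classifiers.
pred = np.clip(np.round(reg.predict(X_test)), 0, 10).astype(int)
print(round((pred == y_test).mean(), 3))
```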

Infeasibility of the Problem#

Despite the potential improvements we have identified, there remains a possibility that, even with all of them, the accuracy would not improve much. That would not be due to an incorrect setup of the analysis, but to underlying, uncontrollable factors in the winemaking process that simply make it impossible to detect patterns for really good or bad wines; their quality may only be determinable by actually tasting them rather than by prediction from numerical representations of their properties. Among all the possible problems we have identified, this is the only one for which we have no proposed solution.

Software Attributions#

To complete this analysis, visualize data, and build the machine learning model, Python [Van Rossum and Drake, 2009] and associated libraries, including Pandas [McKinney and others, 2010], NumPy [Harris et al., 2020], scikit-learn [Pedregosa et al., 2011], Altair [VanderPlas et al., 2018], Seaborn [Waskom, 2021], and Matplotlib [Hunter, 2007] were used.

We acknowledge the contributions of the open-source community and developers behind these tools, which significantly facilitated our analysis.

References#

Com23

The Business Research Company. Wine Quality. 2023. URL: https://www.thebusinessresearchcompany.com/report/red-wine-global-market-report#:~:text=The%20global%20red%20wine%20market%20size%20grew%20from%20%24102.97%20billion,least%20in%20the%20short%20term.

CCA+09a

Paulo Cortez, A. Cerdeira, F. Almeida, T. Matos, and J. Reis. Wine Quality. UCI Machine Learning Repository, 2009. doi:https://doi.org/10.24432/C56S3T.

CCA+09b

Paulo Cortez, António Cerdeira, Fernando Almeida, Telmo Matos, and José Reis. Modeling wine preferences by data mining from physicochemical properties. Decision support systems, 47(4):547–553, 2009.

HMvdW+20

Charles R Harris, K Jarrod Millman, Stéfan J van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke, and Travis E Oliphant. Array programming with NumPy. Nature, 585(7825):357–362, 2020. URL: https://doi.org/10.1038/s41586-020-2649-2, doi:10.1038/s41586-020-2649-2.

Hun07

J. D. Hunter. Matplotlib: a 2d graphics environment. Computing in Science & Engineering, 9(3):90–95, 2007. doi:10.1109/MCSE.2007.55.

M+10

Wes McKinney and others. Data structures for statistical computing in python. In Proceedings of the 9th Python in Science Conference, volume 445, 51–56. Austin, TX, 2010.

PVG+11

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

Pra21

R. Pramoditha. How do you apply PCA to Logistic Regression to remove Multicollinearity? PCA in action to remove multicollinearity. 2021. URL: https://towardsdatascience.com/how-do-you-apply-pca-to-logistic-regression-to-remove-multicollinearity-10b7f8e89f9b#:~:text=PCA%20(Principal%20Component%20Analysis)%20takes,effectively%20eliminate%20multicollinearity%20between%20features.

UCLAARCA21

UCLA: Advanced Research Computing, Statistical Methods and Data Analytics. Deciphering Interactions in Logistic Regression. 2021. URL: https://stats.oarc.ucla.edu/stata/seminars/deciphering-interactions-in-logistic-regression/#:~:text=Logistic%20interactions%20are%20a%20complex%20concept&text=But%20in%20logistic%20regression%20interaction,can%20make%20a%20big%20difference.

VRD09

Guido Van Rossum and Fred L. Drake. Python 3 Reference Manual. CreateSpace, Scotts Valley, CA, 2009. ISBN 1441412697.

VGH+18

Jacob VanderPlas, Brian Granger, Jeffrey Heer, Dominik Moritz, Kanit Wongsuphasawat, Arvind Satyanarayan, Eitan Lees, Ilia Timofeev, Ben Welsh, and Scott Sievert. Altair: interactive statistical visualizations for python. Journal of open source software, 3(32):1057, 2018.

Was21

Michael L. Waskom. Seaborn: statistical data visualization. Journal of Open Source Software, 6(60):3021, 2021. URL: https://doi.org/10.21105/joss.03021, doi:10.21105/joss.03021.


