Hello everyone and welcome to my first blog post! My name is Arieh Levy and I am starting my journey toward becoming a Data Scientist.

For my first post, I wanted to analyze the simple Red Wine Quality dataset, which you can find on Kaggle using the following link: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

I will start with three basic questions that I found interesting to explore in this dataset:

  • Is there a correlation between alcohol and quality?
  • Can we predict pH based on fixed acidity, volatile acidity, and citric acid?
  • Can we predict quality based on all the given parameters of the dataset?

Let’s take a look at the first few records of the dataset.

The first thing we notice is that ‘quality’ is the only categorical variable: it is an integer score, while every other column is a continuous measurement. Then, we check for missing values in the dataset and find that there are none. Therefore, we can dive into our first question about the correlation between alcohol and quality.
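As a rough sketch of this first inspection step, the snippet below reads a CSV and counts empty cells per column using only Python's standard library. The file name `winequality-red.csv` and the helper names `load_rows` and `count_missing` are assumptions made for illustration, and the inline sample is made up rather than taken from the dataset. With pandas, `df.head()` and `df.isnull().sum()` would do the same job.

```python
import csv

def load_rows(source):
    """Read CSV text (an open file or any iterable of lines) into a list of dicts."""
    return list(csv.DictReader(source))

def count_missing(rows):
    """Count empty or absent cells per column."""
    counts = {}
    for row in rows:
        for col, val in row.items():
            is_missing = val is None or str(val).strip() == ""
            counts[col] = counts.get(col, 0) + int(is_missing)
    return counts

# Made-up sample with one empty cell, just to show the behaviour.
sample = ["alcohol,quality", "9.4,5", "9.8,", "10.1,6"]
rows = load_rows(sample)
print(rows[:2])             # the first few records
print(count_missing(rows))  # {'alcohol': 0, 'quality': 1}

# On the real file (assumed name), the same helpers would be used like this:
# with open("winequality-red.csv", newline="") as f:
#     rows = load_rows(f)
```

For the wine data, every count would come out as zero, since the dataset has no missing values.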

Is there a correlation between alcohol and quality?

For this, we will create a correlation heatmap between the variables of the dataset. We expect the diagonal to have the maximum correlation value of 1, since that is where each variable is compared with itself. Everywhere else, the values lie between -1 and 1. Positive values mean positive correlation (the two variables move in the same direction), while negative values mean negative correlation (an increase in one variable is associated with a decrease in the other).

If we look at the last column, we can see that alcohol is the variable with the highest positive correlation coefficient with quality. It shows up in orange, which indicates a correlation of around 0.45.

The exact value is 0.4761663240011359.
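For reference, here is a minimal sketch of how a coefficient like this can be computed, assuming it is the Pearson correlation (the default of pandas' `DataFrame.corr()`). The function name `pearson` and the toy lists are illustrative and are not wine data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of the spreads."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Toy examples covering the ends and the middle of the possible range.
print(pearson([1, 2, 3], [2, 4, 6]))         # about 1: the two lists move together
print(pearson([1, 2, 3], [6, 4, 2]))         # about -1: they move in opposite directions
print(pearson([1, 2, 3, 4], [1, -1, -1, 1])) # 0: no linear relationship
```

Passing the ‘alcohol’ and ‘quality’ columns as two lists of floats would give the value for the wine data.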

Now, correlation values between 0.3 and 0.5 are generally considered to indicate a low correlation between variables. Therefore, the correlation between alcohol and quality is too weak for us to state that the more alcohol a wine has, the better its quality. We cannot say the opposite either: nothing here suggests that lower alcohol results in better quality.

Can we predict pH based on fixed acidity, volatile acidity, and citric acid?

To perform this task, we will only take the columns ‘fixed acidity’, ‘volatile acidity’, and ‘citric acid’ as the Input of our model and the column ‘pH’ as its Output.

The pH was predicted based on these parameters with 43% accuracy.
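To make this kind of prediction concrete, here is a minimal sketch assuming a plain linear regression scored with R² (the coefficient of determination, which is also what scikit-learn regressors return from `score`). Both the model and the metric are assumptions, since they are not stated above, and the helper names `fit_linear`, `predict`, and `r2_score` are illustrative. In practice a library such as scikit-learn would be the natural choice; this dependency-free version only shows what happens under the hood.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X^T X) w = X^T y.

    A column of ones is prepended, so w[0] is the intercept.
    Assumes the inputs are not perfectly collinear.
    """
    rows = [[1.0] + list(r) for r in X]
    n = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)]
         + [sum(r[i] * t for r, t in zip(rows, y))]
         for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(A[k][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for k in range(n):
            if k != col:
                factor = A[k][col] / A[col][col]
                A[k] = [a - factor * b for a, b in zip(A[k], A[col])]
    return [A[i][n] / A[i][i] for i in range(n)]

def predict(w, X):
    """Apply the fitted intercept and coefficients to each input row."""
    return [w[0] + sum(wi * xi for wi, xi in zip(w[1:], r)) for r in X]

def r2_score(y_true, y_pred):
    """1 means perfect predictions; 0 means no better than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Toy numbers for illustration only (not wine data): y = 2 + 3*x1 - x2.
X_toy = [[1, 2], [2, 1], [3, 5], [4, 3], [5, 8]]
y_toy = [2 + 3 * a - b for a, b in X_toy]
w = fit_linear(X_toy, y_toy)
print(w)                                     # close to [2, 3, -1]
print(r2_score(y_toy, predict(w, X_toy)))    # close to 1 for this exactly linear toy
```

Under this reading, a score of 0.43 would mean the three acidity columns explain a bit less than half of the variation in pH. A fair evaluation would also fit on one part of the data and score on a held-out part.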

This suggests that it is not possible to predict pH accurately from fixed acidity, volatile acidity, and citric acid alone. Intuitively, we might need other parameters such as residual sugar or chlorides to make a more accurate prediction.

Can we predict quality based on all the given parameters of the dataset?

To answer this question, we follow the same procedure as in the previous question, but our Input to the model will be all parameters of the dataset except for ‘quality’, which will be the Output.

The quality was predicted with 30% accuracy. This indicates that we cannot predict quality accurately with the given parameters of the dataset.

This is a very interesting result because, if we were wine manufacturers, we would like to be able to predict the quality of a wine based on certain parameters. That way, we could look for the optimal wine with the best price-quality ratio without even manufacturing it. However, this analysis suggests that this is not possible with the parameters in this dataset.

The next step from here would be to look for a more extensive dataset and see if it is possible to perform this prediction when taking many more parameters into account.

I hope that you found this post helpful and I am looking forward to seeing you soon!