Deakin University MIS772 Predictive Analytics: Wine Rating Prediction

Added on 2022/11/30

AI Summary

This report presents an analysis of a wine dataset using RapidMiner, focusing on predicting wine ratings based on various attributes. The executive summary outlines the problem: Australian Wine Importers (AWI) aims to leverage customer data to determine wine preferences. The assignment involves data exploration, preparation, and the application of clustering techniques, specifically K-means, to predict wine ratings. The data preparation phase includes handling missing values, removing duplicates, and normalizing the data. The report details the use of RapidMiner for data cleansing, transformation, and the creation of visualizations to understand relationships between attributes such as country, price, and points. Clustering with K=5 is applied to group similar wines, and the performance of each cluster is evaluated to identify the most optimal cluster for rating prediction. The analysis aims to provide AWI with insights to improve wine quality and align with customer preferences. References to relevant sources, such as Garbade (2018) and Reifer (2015), are included.

Executive summary:
Expectation:
Australian Wine Importers (AWI) is a leading wine importer in Australia. The company purchases a
variety of wines for its customers across Australia and, with the growing use of data analysis and
prediction, AWI aspires to use its data productively to find its customers' most preferred wines.
The company has provided a dataset, collected from customers on social media, whose attributes
include the name of the wine, the country where it was sold, the variety of wine, the designation
and the price of the wine in US$. In this document we develop a model to estimate wine ratings
from the given attributes.
The problem is to take the dataset of wine reviews, transform it, and pass the transformed dataset
on to the next step, rating estimation. A model that predicts a wine's rating from its features will
help AWI improve the quality of the wine it offers.
Extension:
We used RapidMiner for this task. It is a popular tool for data analytics, prediction, text mining
and data preparation. RapidMiner is based on an open-core model and supports all the steps needed
in machine learning, including data preparation, results visualization, model validation and
optimization, so it is well suited to our analysis of the AWI dataset. It makes tasks such as
visualization and cleansing quick, and its drag-and-drop interface is friendly to users with no
coding experience (Reifer, 2015).
The business problem to be addressed is to analyze the provided wine dataset, apply the k-means
clustering model to it, and make predictions based on the resulting clusters. AWI is known for its
wine imports and is particular about variety.

Data exploration and preparation in RapidMiner:
Expectation:
The problem in this task is to clean the dataset of anomalies so that it can be used for wine rating
prediction; AWI can then use these predictions to improve wine quality. Data exploration and
preparation is an important and integral step in performing analytics on any kind of dataset. If the
data is prepared correctly, accurate results can be obtained; otherwise the analysis will reach wrong
conclusions (Garbade, 2018).
The purpose of this task is to turn the given dataset into data that is free of anomalies such as
missing values and outliers. RapidMiner is used because it has operators for cleansing the data.
After cleansing, nominal data must be converted to numerical data, because the predictions are made
over numerical values, so the operator RapidMiner provides for this conversion is used. This step
completes the data preparation process.
Extension:
The steps performed for data exploration are to read the data from the file, check the values in
each column, their types and their value ranges, and look for any missing values. After exploration
we prepare the data with cleansing operations: replace missing values, remove duplicate values and
normalize the data. We applied each of these data preparation operations here.
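The three cleansing operations can be sketched outside RapidMiner as follows. This is only an illustration in plain Python: the sample rows, the column names and the helper function names are hypothetical, and z-score normalization is an assumption, since the report does not say which normalization method was chosen.

```python
from statistics import mean, pstdev

# Hypothetical sample rows (not taken from the AWI dataset); None marks a missing price.
rows = [
    {"title": "Wine A", "points": 87, "price": 15.0},
    {"title": "Wine B", "points": 90, "price": None},
    {"title": "Wine A", "points": 87, "price": 15.0},  # exact duplicate of the first row
    {"title": "Wine C", "points": 93, "price": 45.0},
]

def replace_missing_with_mean(rows, column):
    """Fill None entries in `column` with the mean of the observed values."""
    observed = [r[column] for r in rows if r[column] is not None]
    fill = mean(observed)
    return [{**r, column: fill if r[column] is None else r[column]} for r in rows]

def remove_duplicates(rows):
    """Keep only the first occurrence of each identical row."""
    seen, unique = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

def z_normalize(rows, column):
    """Rescale `column` to zero mean and unit (population) standard deviation."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    return [{**r, column: (r[column] - mu) / sigma} for r in rows]

# Same order as in the text: replace missing values, remove duplicates, normalize.
cleaned = replace_missing_with_mean(rows, "price")
cleaned = remove_duplicates(cleaned)
cleaned = z_normalize(cleaned, "price")
```

Note that the fill value is computed before duplicates are removed, so duplicated rows still influence the average in this sketch.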
The price attribute of the wine is used as the label, and the predictors are points, variety, winery
and province. Missing values are replaced with average values. Relationships between the attributes
are derived by creating a correlation matrix, and the attributes with high correlation values are
used as predictors for the label.
RapidMiner provides different operators for reading different kinds of files; here the Read CSV
operator is used to read the wine review file. Through this operator the file's fields can be
explored and data types can be changed as required. Once the file is loaded into the tool, the next
step is to check the columns for missing values, because handling missing values is important for
the model to be successful. RapidMiner has an operator for handling missing values, so it is dragged
into the process and set to fill each missing value with the average value of its column. The next
step is to normalize the dataset, which puts the attributes on a common scale and so limits the
effect of extreme values on the analysis. Ratings can be estimated more reliably when the influence
of outliers is small. Once the data is cleaned, the correlation between the attributes is checked
and attributes are selected based on the correlation values. For example, taster name is not
strongly correlated with wine ratings, so this attribute is removed from the analysis.
Selecting correlated attributes is an important step because the prediction depends on the features
selected; if a wrong feature is selected, the accuracy of the model goes down. In this assignment the
wine review dataset is cleaned and explored with the help of RapidMiner, and the dataset is then
ready for the next step of clustering and prediction of wine rating.
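Correlation-based attribute selection can be illustrated with a short plain-Python sketch. The column values, the encoded taster_id column and the 0.5 cut-off are all assumptions made for the example; the report does not state a threshold, and the rating (points) is treated as the target here only for illustration.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical numeric columns; the values are illustrative only.
columns = {
    "points": [85, 88, 90, 92, 95],
    "price": [12.0, 18.0, 25.0, 40.0, 70.0],
    "taster_id": [3, 1, 3, 2, 2],  # a taster name encoded as a number
}
label = "points"
threshold = 0.5  # assumed cut-off, not a value from the report

# Keep every attribute whose absolute correlation with the label reaches the cut-off.
selected = [
    name for name, values in columns.items()
    if name != label and abs(pearson(values, columns[label])) >= threshold
]
```

With these sample values price is kept and taster_id is dropped, which mirrors the removal of the taster name described above.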

Attributes are parsed and their data types are changed: for example, points must be integer, price
should be real, and the other text fields are changed to polynominal.
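The same type assignments can be sketched with Python's csv module. The CSV excerpt, the column names and the parse_row helper are hypothetical; empty fields are mapped to None so that they can be treated as missing values later.

```python
import csv
import io

def parse_row(raw):
    """Convert raw CSV strings to typed values; empty strings become None."""
    def to_int(text):
        return int(text) if text.strip() else None

    def to_float(text):
        return float(text) if text.strip() else None

    return {
        "points": to_int(raw["points"]),             # integer
        "price": to_float(raw["price"]),             # real
        "country": raw["country"].strip() or None,   # nominal text
        "variety": raw["variety"].strip() or None,   # nominal text
    }

# Hypothetical CSV excerpt with assumed column names; the first row has no price.
SAMPLE = (
    "country,points,price,variety\n"
    "Italy,87,,White Blend\n"
    "Portugal,87,15.0,Portuguese Red\n"
)

rows = [parse_row(r) for r in csv.DictReader(io.StringIO(SAMPLE))]
```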

The columns are explored for the values they contain:
The statistics show that the designation and price columns have a large number of missing values.
We can also look at the minimum and maximum values and identify outliers from them:
The process below reads the CSV file, replaces missing attribute values with average values and
then normalizes the dataset:
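A per-column summary of this kind (missing count, minimum, maximum) can be reproduced with a few lines of plain Python. The rows and the column_summary helper are hypothetical and only show the idea.

```python
def column_summary(rows, column):
    """Count missing values and report the min and max of the observed ones."""
    values = [r[column] for r in rows]
    observed = [v for v in values if v is not None]
    return {
        "missing": len(values) - len(observed),
        "min": min(observed) if observed else None,
        "max": max(observed) if observed else None,
    }

# Hypothetical rows; None marks a missing value.
rows = [
    {"points": 87, "price": 15.0},
    {"points": 90, "price": None},
    {"points": 93, "price": 450.0},  # an unusually high price stands out in the max
    {"points": 85, "price": None},
]

price_stats = column_summary(rows, "price")
points_stats = column_summary(rows, "points")
```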

To explore the data, several graphs are plotted from the wine review data:
The first graph plots country against price:
This graph shows that the most expensive wine is from France, priced above $80, and that wines from
European countries are more expensive than those from other countries.
The second graph plots country against points: similarly, points are high for wines from the
European region. This suggests that the most liked wines are those from France, Spain, Italy and
the US.

The third graph is a box plot of the points given by the tasters: it shows that Matt gave the
highest points and Alexander the lowest.
The fourth graph plots average price against wine variety:

The fifth graph shows the trend of prices across wineries: the winery Domaine Des has the highest
average price, which means that wines from this winery are expensive:
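The averages behind graphs like these come from grouping rows by a nominal attribute and averaging a numeric one. A minimal sketch in plain Python follows; the rows and the average_by helper are hypothetical, and rows with a missing price are simply skipped.

```python
from collections import defaultdict

def average_by(rows, group_col, value_col):
    """Average `value_col` within each distinct value of `group_col`."""
    sums, counts = defaultdict(float), defaultdict(int)
    for r in rows:
        if r[value_col] is None:
            continue  # ignore rows with a missing value
        sums[r[group_col]] += r[value_col]
        counts[r[group_col]] += 1
    return {group: sums[group] / counts[group] for group in sums}

# Hypothetical rows; the prices are illustrative only.
rows = [
    {"country": "France", "price": 90.0},
    {"country": "France", "price": 70.0},
    {"country": "Chile", "price": 20.0},
    {"country": "Chile", "price": None},
]

averages = average_by(rows, "country", "price")
# Sort groups from the highest to the lowest average price.
ranked = sorted(averages.items(), key=lambda item: item[1], reverse=True)
```

The same helper works for winery or variety by changing the grouping column.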

In the next step we used the Select Attributes operator to select the attributes that are relevant
and correlated, and converted the attributes from nominal to numerical:
Conversion to numerical is important because the model predicts better with numerical values:
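One common way to convert a nominal attribute to numerical values is dummy (one-hot) coding, sketched below in plain Python. The report does not say which coding type was selected in RapidMiner, so dummy coding is an assumption here, and the rows and the dummy_encode helper are hypothetical.

```python
def dummy_encode(rows, column):
    """Replace a nominal column with one 0/1 column per distinct category."""
    categories = sorted({r[column] for r in rows})
    encoded = []
    for r in rows:
        out = {k: v for k, v in r.items() if k != column}
        for category in categories:
            out[f"{column} = {category}"] = 1 if r[column] == category else 0
        encoded.append(out)
    return encoded

# Hypothetical rows with one nominal attribute.
rows = [
    {"points": 87, "country": "Italy"},
    {"points": 90, "country": "France"},
    {"points": 88, "country": "Italy"},
]

numeric_rows = dummy_encode(rows, "country")
```

Dummy coding avoids implying an order between categories, which matters for distance-based methods such as k-means.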
In the next step we applied clustering to the dataset:
Clustering groups similar things together, so we create different clusters, check which cluster
gives good performance, and use that cluster in the model for better results.
For clustering we used K=5, which means five clusters are created, and we can now visualize each
cluster:
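For readers who want to see what k-means does with K=5, here is a minimal sketch of the standard assign-and-update loop in plain Python. It is not RapidMiner's implementation; the two-dimensional points, the seed and the k_means function are hypothetical, and the initial centroids are simply sampled from the data.

```python
import random
from math import dist

def k_means(points, k, max_iterations=100, seed=0):
    """Cluster `points` into k groups; returns (centroids, assignments)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids drawn from the data
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: dist(p, centroids[c])) for p in points
        ]
        if new_assignments == assignments:
            break  # nothing changed, so the clustering has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments

# Hypothetical normalized attribute pairs, for illustration only.
points = [
    (1.0, 1.0), (1.2, 0.9), (5.0, 5.0), (5.1, 4.8), (9.0, 1.0),
    (9.2, 1.1), (1.0, 9.0), (0.8, 9.1), (9.0, 9.0), (9.1, 8.8),
]

centroids, assignments = k_means(points, k=5)
```

Because the starting centroids are random, different seeds can give different clusterings, which is one reason to compare cluster quality afterwards.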

Visualizing the clusters:
Looking at the centroid values of the Attributes:

Now the Distance Performance validator is applied so that we can check the performance of each
cluster we have created:
Of all the clusters, cluster 1 is the most optimal, with a distance performance value of
-98.314
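A distance-based cluster score can be sketched as the average distance from each point to its cluster centroid. This is an assumption about what the reported figure measures: the report does not give the formula, so the example below is only an illustration, using Euclidean distance, hypothetical points and a hypothetical avg_within_centroid_distance helper. Smaller averages mean tighter clusters; a tool may show the value with a negative sign so that a larger number is better.

```python
from math import dist

def avg_within_centroid_distance(points, assignments, centroids):
    """Mean Euclidean distance from each point to its own centroid, per cluster."""
    totals, counts = {}, {}
    for p, a in zip(points, assignments):
        totals[a] = totals.get(a, 0.0) + dist(p, centroids[a])
        counts[a] = counts.get(a, 0) + 1
    return {a: totals[a] / counts[a] for a in totals}

# Hypothetical two-cluster example with known centroids.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 16.0)]
assignments = [0, 0, 1, 1]
centroids = [(0.0, 1.0), (10.0, 13.0)]

scores = avg_within_centroid_distance(points, assignments, centroids)
tightest = min(scores, key=scores.get)  # cluster with the smallest average distance
```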

References:
Garbade, M. (2018). Understanding K-means Clustering in Machine Learning. Retrieved from:
https://towardsdatascience.com/understanding-k-means-clustering-in-machine-learning-6a6e67336aa1
Reifer, A. (2015). Inside RapidMiner's data science platform. Retrieved from:
https://searchbusinessanalytics.techtarget.com/feature/RapidMiner-predictive-analytics-platform-overview