JCU Data Mining: Practice and Analysis Report on Car Sales Dataset

Added on 2023/06/04

AI Summary

This report presents an analysis of consumer behavior in the car sales industry using data mining techniques. The study, conducted at James Cook University, uses a dataset from a car dealership to examine the relationships between car specifications and sales. The report covers data preprocessing and exploration, including data importation, cleaning, and specification. It employs regression modelling (multiple linear and logistic regression) and clustering analysis (hierarchical and k-means clustering). The findings highlight the best-fit model (Combined Categorical – Non-Categorical Logistic Model) and compare the interpretability of the clustering methods. The analysis also addresses challenges such as limited data and data reduction, and concludes with recommendations for future work, including the use of larger datasets and of training, validation, and testing subsets for more accurate model building. The study aims to familiarize readers with common data mining techniques and to apply them to a domain-specific dataset.

DATA MINING
(PRACTICE AND ANALYSIS)
Shreyas Patel (jc490992)
Kinley Choki (jc485465)
James Cook University

INTRODUCTION
This paper applies statistical analysis techniques to data available to car
dealerships in order to evaluate consumer behavior in the car sales
industry. The data analysed comes from a car dealership and records the
make and model of each car together with its specifications. The paper
presents findings on the nature of the data variables and the interactions
between them.

LITERATURE REVIEW
Much research has been done on Regression Modelling and Clustering
algorithms, with the aim of understanding data mining and deep machine
learning. Researchers have typically used sample data sets as case studies
to understand how Regression Modelling and Clustering algorithms are
applied in these fields.

OBJECTIVES
General objectives
a. To become familiar with some of the well-known techniques in data mining.
b. To apply predictive and clustering data mining techniques to a
domain-specific data set, in our case the Car Sales data set.
c. To review predictive and clustering data mining techniques in order to
gain a good overview and understanding of current data mining technology.

DATA PREPROCESSING AND EXPLORATION
The data preprocessing and exploration of the Car Sales Dataset involve
three stages:
a. Importation of the data.
b. Cleaning of the data.
c. Specifying of the data.
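
The three stages can be sketched as follows. This is only an illustration under stated assumptions: the column names and the five sample rows are hypothetical stand-ins for the Car Sales file, and "specifying" is read here as assigning each variable a type. Cleaning follows the rule described under Issues and Challenges, that is, dropping any observation with an empty entry.

```python
import csv
import io

# Hypothetical sample standing in for the Car Sales file; the column names
# and values are illustrative assumptions, not the report's actual data.
RAW_CSV = """Manufacturer,Model,Sales,Price,Horsepower
Acura,Integra,16.9,21.5,140
Acura,TL,39.4,28.4,225
Audi,A4,20.4,24.0,150
Audi,A8,1.4,,310
BMW,323i,19.7,,170
"""

NUMERIC_COLUMNS = ("Sales", "Price", "Horsepower")


def import_data(text):
    """Stage 1: read the CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def clean_data(rows):
    """Stage 2: drop every observation with an empty entry in any variable."""
    return [
        row for row in rows
        if all((value or "").strip() for value in row.values())
    ]


def specify_data(rows):
    """Stage 3: give each variable a type (numeric columns become floats)."""
    typed = []
    for row in rows:
        typed_row = dict(row)
        for name in NUMERIC_COLUMNS:
            typed_row[name] = float(row[name])
        typed.append(typed_row)
    return typed


raw = import_data(RAW_CSV)
cars = specify_data(clean_data(raw))
print(len(raw), "rows imported;", len(cars), "rows kept after cleaning")
```

In this sample the two rows with a missing Price are removed, which mirrors on a small scale the data reduction the report describes.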

REGRESSION MODELLING
Multiple linear regression modelling
This regression modelling technique describes the relationship between a
dependent variable and the non-categorical (numeric) data variables in a
dataset.
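
A minimal sketch of such a model, assuming Sales is the dependent variable and price and horsepower are the numeric predictors. The numbers are made up for illustration, and ordinary least squares via NumPy is one common way to fit the model, not necessarily the software used in the report.

```python
import numpy as np

# Illustrative values only (assumed, not taken from the report's dataset).
# Columns of X: price, horsepower. y: sales.
X = np.array([
    [21.5, 140.0],
    [28.4, 225.0],
    [24.0, 150.0],
    [42.0, 210.0],
    [33.9, 200.0],
    [18.9, 120.0],
])
y = np.array([16.9, 39.4, 20.4, 8.6, 18.8, 45.0])


def fit_linear(X, y):
    """Ordinary least squares; returns [intercept, b1, b2, ...]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Fitted values: intercept plus the weighted sum of the predictors."""
    return coef[0] + X @ coef[1:]


def r_squared(y, fitted):
    """Share of the variance in y explained by the fitted values."""
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


coef = fit_linear(X, y)
fitted = predict_linear(coef, X)
print("coefficients:", coef)
print("R^2:", r_squared(y, fitted))
```

Each slope estimates the change in the dependent variable for a one-unit change in that predictor with the other predictor held fixed.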
Logistic regression modelling
Purely categorical logistic regression model
This model uses only the categorical data variables, which represent
qualitative observations.
Combined categorical - non-categorical logistic regression model
This model uses both the categorical and the non-categorical data variables.
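
The combined model can be sketched as below. Several things here are assumptions rather than details from the report: the binary response `high_sales`, the categorical variable `vehicle_type`, the numeric variable price, the sample values, and the fitting method (plain gradient ascent on the log-likelihood). The categorical variable enters through a dummy (0/1) column.

```python
import numpy as np

# Illustrative records only (assumed): (vehicle_type, price, high_sales).
records = [
    ("Passenger", 21.5, 1),
    ("Passenger", 28.4, 1),
    ("Passenger", 42.0, 0),
    ("Passenger", 18.9, 1),
    ("Car", 26.9, 1),
    ("Car", 46.3, 0),
    ("Car", 38.9, 0),
    ("Car", 22.5, 0),
]


def build_design(records, baseline="Passenger"):
    """Columns: intercept, dummy for the category, standardized price."""
    types = np.array([r[0] for r in records])
    price = np.array([r[1] for r in records], dtype=float)
    dummy = (types != baseline).astype(float)
    price_z = (price - price.mean()) / price.std()
    return np.column_stack([np.ones(len(records)), dummy, price_z])


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def fit_logistic(X, y, learning_rate=0.1, steps=5000):
    """Maximize the log-likelihood by plain gradient ascent."""
    beta = np.zeros(X.shape[1])
    for _ in range(steps):
        gradient = X.T @ (y - sigmoid(X @ beta)) / len(y)
        beta += learning_rate * gradient
    return beta


X = build_design(records)
y = np.array([r[2] for r in records], dtype=float)
beta = fit_logistic(X, y)
probabilities = sigmoid(X @ beta)
predicted = (probabilities >= 0.5).astype(int)
print("coefficients:", beta)
print("training accuracy:", (predicted == y).mean())
```

Dropping the price column from `build_design` would give the purely categorical variant, so the two models can be compared on the same data.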

CLUSTERING ANALYSIS
Clustering analysis is a statistical technique that groups the categories of
a data variable in a dataset according to their similarity on attributes
represented by other variables in the dataset.
• Hierarchical clustering
This clustering technique builds a nested hierarchy of groups, usually
displayed as a tree (dendrogram), by repeatedly merging the most similar
groups until a single group remains.
• K-means clustering
K-means clustering partitions the categories of the variable to be grouped
into a chosen number of groups, k, placing each in the group whose mean
is nearest.
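
Both techniques can be sketched on a toy example. The six car models and their two standardized attributes are hypothetical, and single linkage for the hierarchy and Lloyd's algorithm for k-means are assumed choices, since the report does not state which variants were used.

```python
import numpy as np

# Hypothetical, already standardized attributes [price, horsepower] for six
# car models (assumed values, not the report's data).
labels = ["A", "B", "C", "D", "E", "F"]
points = np.array([
    [-1.0, -1.1],
    [-0.8, -0.7],
    [-1.3, -0.9],
    [1.0, 1.2],
    [0.8, 0.7],
    [1.3, 0.9],
])


def hierarchical_merges(points):
    """Agglomerative clustering with single linkage.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]
    history = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the closest members.
                d = min(
                    np.linalg.norm(points[i] - points[j])
                    for i in clusters[a]
                    for j in clusters[b]
                )
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((clusters[a], clusters[b], float(d)))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return history


def k_means(points, k, steps=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(steps):
        distances = np.linalg.norm(
            points[:, None, :] - centroids[None, :, :], axis=2
        )
        assignment = distances.argmin(axis=1)
        updated = np.array([
            points[assignment == c].mean(axis=0)
            if np.any(assignment == c) else centroids[c]
            for c in range(k)
        ])
        if np.allclose(updated, centroids):
            break
        centroids = updated
    return assignment, centroids


history = hierarchical_merges(points)
assignment, centroids = k_means(points, k=2)
for left, right, distance in history:
    print([labels[i] for i in left], "+", [labels[i] for i in right], distance)
print("k-means groups:", dict(zip(labels, assignment.tolist())))
```

The merge history records every level of grouping, while k-means returns a single flat assignment into k groups, which is one way to read the trade-off between information and interpretability.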

ISSUES AND CHALLENGES
• Limited Data:
The dataset is too small to divide into subsets: the subsets would be
too small to produce inferences and models that can be relied on for
prediction.
• Data Reduction:
The data cleaning process removed up to 40 observations that had an
empty entry in at least one of the data variables.

CONCLUSION
• Model C, the Combined Categorical – Non-Categorical Logistic Model,
is the model that best explains the relationship between Sales and the
other data variables in the Car Sales Data.
• While Hierarchical Clustering provides more information, K-Means
Clustering is more interpretable.
• More data is required for the analysis to provide more accurate and
precise findings.

POSSIBLE FUTURE WORK
If data with more observations and complete entries becomes available, future work
would include a regression analysis that uses training, validation and testing
subsets. Models could then be checked on data not used to fit them, which should
make them more accurate.
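
A minimal sketch of such a split is given below. The 60/20/20 proportions, the fixed seed and the 100 stand-in observations are assumptions for illustration, not figures from the report.

```python
import random


def split_dataset(rows, train=0.6, validation=0.2, seed=42):
    """Shuffle rows and split them into training, validation and testing subsets.

    The testing subset receives whatever remains after the first two.
    """
    if not 0 < train + validation < 1:
        raise ValueError("train + validation must lie strictly between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_validation = int(len(shuffled) * validation)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_validation],
        shuffled[n_train + n_validation:],
    )


observations = list(range(100))  # stand-in for 100 cleaned observations
training, validating, testing = split_dataset(observations)
print(len(training), len(validating), len(testing))
```

The model would be fitted on the training subset, compared against alternatives on the validation subset, and assessed once on the testing subset.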