A Comprehensive Report on Statistical Analysis in Data Mining

Added on 2022/12/27

AI Summary

This report provides an overview of statistical methods applicable to data mining. It explores various techniques such as regression analysis, including linear and logistic regression, used to predict outcomes and analyze relationships between variables. The report also delves into classification methods, which categorize data for improved accuracy and analysis, along with correlation analysis, which examines the relationships between variables. Practical examples are provided to illustrate how these methods are applied in real-world scenarios, such as predicting revenue based on advertising spending and determining the probability of fraudulent transactions. The report emphasizes the importance of these statistical approaches in extracting meaningful insights from datasets and making informed decisions, making it a valuable resource for anyone studying data science and big data.

Statistical approaches applicable to DATA MINING
Data mining takes two broad approaches: statistical analysis and
non-statistical analysis. A variety of statistical methods can be used to
extract knowledge from databases containing different types of
observations. They include logistic regression, classification,
correlation analysis, discriminant analysis, clustering, linear
discriminant analysis (LDA), outlier detection and factor analysis. A few
are discussed below:
 Regression analysis – Regression predicts a continuous value from a
set of numerical data. In practice, it can be used to predict the
cost of goods and services from other variables. Regression models
are used across many industries for forecasting financial data,
modelling environmental conditions and analyzing trends.
 Logistic Regression – used when the dependent variable is binary
or multinomial. It estimates the probability of each outcome of
the dependent variable as a function of the independent variables.
 Linear Regression – uses the best-fitting linear relationship
between the independent and dependent variables to predict the
target variable. To achieve the best fit, the distances between
the fitted line and the actual observations should be as small as
possible in total (typically measured as the sum of their
squares). The fit is good when no other position of the line
would produce a smaller total error. There are two types: simple
and multiple linear regression. Simple linear regression predicts
the dependent variable by fitting a linear relationship to a
single independent variable, whereas multiple linear regression
fits the best linear relationship between several independent
variables and the dependent variable.

 Correlation analysis – captures the relationships between pairs of
variables. The values of each variable are usually stored in a
column or row of a database table and represent a property of an
object.
 Classification – data is sorted into categories so that it can be
analyzed and predicted with a greater degree of accuracy. This
improves the quality of the analysis process.
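
The best-fit idea behind simple linear regression and the pairwise
relationship measured by correlation analysis can be illustrated with a
short Python sketch. It is a minimal illustration, not a reference
implementation: the data values and the function names
(fit_simple_linear, sse, pearson_r) are hypothetical and chosen only for
this example, and the fit assumes the usual least-squares criterion.

```python
from statistics import mean

def fit_simple_linear(x, y):
    """Least-squares fit of y = b0 + b1 * x (simple linear regression)."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

def sse(x, y, b0, b1):
    """Sum of squared vertical distances between the line and the data."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

def pearson_r(x, y):
    """Pearson correlation coefficient between two variables."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    return sxy / (sxx * syy) ** 0.5

# Hypothetical observations, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_simple_linear(x, y)
r = pearson_r(x, y)
print("intercept:", b0, "slope:", b1, "correlation:", r)

# Moving the line away from the fitted position increases the total error.
print(sse(x, y, b0, b1) < sse(x, y, b0 + 0.1, b1))
```

The last line checks the "no other position would produce fewer errors"
property for one shifted line; it does not prove it for every line.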
PRACTICAL EXAMPLES
 Linear Regression – Businesses often use linear regression to
understand the relationship between advertising spending and
revenue. They might fit a simple linear regression model using
advertising spending as the predictor variable and the revenue as
the response variable.
Revenue = β0 + β1(ad spending)
The coefficient β0 represents the total expected revenue when
ad spending is zero, while β1 represents the average change in total
revenue when ad spending is increased by one unit. If β1 is
negative, it would mean that more ad spending is associated with
less revenue. If β1 is close to zero, it would mean that ad spending
has little effect on revenue. If β1 is positive, it would mean that
more ad spending is associated with more revenue.
Depending on the value of β1, a company may decide to either
decrease or increase their ad spending.
 Logistic regression – A credit card company wants to check
whether transaction amount and credit score affect the
probability of a given transaction being fraudulent, and fits a
logistic regression model to find out. The response variable in
the model is “fraudulent” and has two potential outcomes: the
transaction is fraudulent or the transaction is not fraudulent.
The results of the model indicate how changes in transaction
amount and credit score are associated with the probability of a
given transaction being fraudulent. The company can also use the
fitted logistic regression model to predict the probability of a
given transaction being fraudulent, based on the amount and the
credit score of the individual who made the transaction.
 Classification – given a collection of records (the training set),
each record contains a set of attributes, one of which is the
class. The goal is to find a model for the class attribute as a
function of the values of the other attributes. A test set is used
to determine the accuracy of the model. Usually, the given data
set is divided into a training set, for building the model, and a
test set, for validating it.
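
The fraud example and the training/test procedure described above can
be sketched together in Python. This is a minimal sketch under stated
assumptions: the twelve records are invented for illustration, the
helper names (features, predict_proba, fit, log_loss) are hypothetical,
the model is fitted with plain gradient descent rather than any
particular library, and the split is fixed by position so the result is
reproducible (in practice the split is usually random).

```python
import math

# Hypothetical records: (transaction amount, credit score, is_fraud).
records = [
    (120, 720, 0), (2500, 580, 1), (60, 690, 0), (1800, 610, 1),
    (300, 750, 0), (2200, 540, 1), (90, 640, 0), (1500, 600, 1),
    (450, 700, 0), (2700, 560, 1), (200, 780, 0), (1900, 590, 1),
]
train, test = records[:8], records[8:]  # training set / test set

def column_stats(rows, j):
    """Mean and standard deviation of column j."""
    vals = [r[j] for r in rows]
    m = sum(vals) / len(vals)
    s = (sum((v - m) ** 2 for v in vals) / len(vals)) ** 0.5
    return m, s

# Scaling constants come from the training set only.
stats = [column_stats(train, j) for j in (0, 1)]

def features(row):
    """Intercept term plus standardized amount and credit score."""
    return [1.0] + [(row[j] - stats[j][0]) / stats[j][1] for j in (0, 1)]

def predict_proba(w, row):
    """Estimated probability that the transaction is fraudulent."""
    z = sum(wi * xi for wi, xi in zip(w, features(row)))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(w, rows):
    """Average negative log-likelihood (lower is better)."""
    eps = 1e-12
    total = 0.0
    for r in rows:
        p = min(max(predict_proba(w, r), eps), 1 - eps)
        total -= r[2] * math.log(p) + (1 - r[2]) * math.log(1 - p)
    return total / len(rows)

def fit(rows, lr=0.1, steps=2000):
    """Fit logistic regression weights by batch gradient descent."""
    w = [0.0, 0.0, 0.0]
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for r in rows:
            err = predict_proba(w, r) - r[2]
            for k, xk in enumerate(features(r)):
                grad[k] += err * xk / len(rows)
        w = [wk - lr * gk for wk, gk in zip(w, grad)]
    return w

w = fit(train)
accuracy = sum((predict_proba(w, r) >= 0.5) == bool(r[2]) for r in test) / len(test)
print("weights (intercept, amount, credit score):", w)
print("test accuracy:", accuracy)
```

In this sketch a positive weight on amount and a negative weight on
credit score would mean that larger transactions and lower scores are
associated with a higher estimated probability of fraud. The accuracy is
measured only on the held-out test records, which were not used for
fitting.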