Comparing Linear Regression and Random Forest Algorithms for Big Data Analysis

Added on  2022/11/24

AI Summary
This project compares the performance of two machine learning algorithms, linear regression and random forest, in Weka for big data analysis. It includes experimental setups, results, and discussions on dataset 1 and dataset 2. The analysis focuses on the classification problems and the use of feature selection algorithms. The results show the performance metrics such as correlation coefficient, mean absolute error, root absolute error, and root relative squared error for each setup. The significance test and AUC curve analysis are also discussed. The project concludes with a critical understanding of the challenges in big data analysis.
University ****
Semester ****
Big Data
Student ID *****
Student Name *****
Submission Date *****
Table of Contents
1. Introduction
2. System Description
3. Experimental Setup and Results
3.1 Setup 1: Linear Regression with D1 vs. Random Forest with D1
3.2 Setup 2: Linear Regression with D1 vs. Linear Regression with the Selection of Feature algorithm 1 -D1
3.3 Setup 3: Linear Regression with D1 vs. Linear Regression with the Selection of Feature algorithm 2 -D1
3.4 Setup 4: Random Forest with D1 vs. Random Forest with the Selection of Feature algorithm 1 -D1
3.5 Setup 5: Random Forest with D1 vs. Random Forest with the Selection of Feature algorithm 2 -D1
3.6 Setup 6: Linear Regression with D2 vs. Random Forest with D2
3.7 Setup 7: Linear Regression with D2 vs. Linear Regression with the Selection of Feature algorithm 1 -D2
3.8 Setup 8: Linear Regression with D2 vs. Linear Regression with the Selection of Feature algorithm 2 -D2
3.9 Setup 9: Random Forest with D2 vs. Random Forest with the Selection of Feature algorithm 1 -D2
3.10 Setup 10: Random Forest with D2 vs. Random Forest with the Selection of Feature algorithm 2 -D2
4. Discussion
Significance Test on Dataset 1
AUC curve
5. Critical understanding on Big Data analysis challenges
6. Awareness of Implication and issues about big data
7. Knowledge of most significant computing techniques for dealing with Big Data
8. Conclusion
9. Reference
1. Introduction
This project compares the performance of two machine learning algorithms in Weka on the
provided datasets. Two classification problems are considered, and each is addressed by
combining a feature selection algorithm with a machine learning algorithm.
2. System Description
Dataset Description
Two datasets are used: the Epileptic Seizure Recognition dataset (dataset 1) and the
Train and test dataset (dataset 2).
Classification Problem
In dataset 1, Classification problem:
The analysis predicts epileptic seizures from the recorded signal features and uses
this task to evaluate classification in Weka. The original dataset concerns epileptic
seizure detection from EEG (electroencephalography) signals, a task that is
time-consuming and sensitive when done by hand. The explanatory variables X1,
X2, ..., X178 are the measured points of the signal (Challenges with Big Data Analytics,
2015). The epileptic signals are measured in order to separate useful information
from irrelevant information such as signal noise. Analyzing and predicting the class
variable relieves the burden on expert clinicians, who otherwise process a large
amount of data by visual observation, and speeds up epilepsy diagnosis
(Classification and interpretation in quantitative structure-activity
relationships, 2017). Machine learning is therefore applied together with the feature
selection algorithms to separate relevant from irrelevant information in the
measured points of the EEG signals.
In dataset 2, Classification problem:
The train and test dataset consists of 5 different processing folders, which are used
to recognize processes through their calls. For classification, the dataset is
segmented into 60 records and three different families. The real data provides the
train and test sets for the condition variables. The train and test sets are examined
first, and the feature selection algorithm is then applied to the problem identified
on them (Comparação de métodos no estudo da estabilidade fenotípica, 2010).
Here, the selected system models are:
Dataset 1 - Epileptic Seizure Recognition Data Set
Dataset 2 - Train and test Dataset
Algorithm 1 - Linear Regression
Algorithm 2 - Random Forest
Selection Feature Algorithm 1 - Normalize
Selection Feature Algorithm 2 - String to Nominal
Class Variable in Data set 1 - Y
Class Variable in Data set 2 - Record ID
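The two filters named in the list above can be illustrated outside Weka. The sketch below is only an approximation under stated assumptions: it assumes that "Normalize" rescales each numeric attribute to the range [0, 1] (min-max scaling, the default behaviour of Weka's unsupervised Normalize filter) and that "String to Nominal" maps each distinct string value to a category index. The function names and sample values are hypothetical and are not taken from the project.

```python
def min_max_normalize(values):
    """Rescale a numeric column to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def string_to_nominal(values):
    """Map each distinct string to an index, in order of first appearance."""
    labels = []
    codes = []
    for v in values:
        if v not in labels:
            labels.append(v)
        codes.append(labels.index(v))
    return labels, codes


# Hypothetical sample values, not taken from either dataset.
eeg_column = [-40.0, 10.0, 60.0]
family_column = ["a", "b", "a", "c"]

scaled = min_max_normalize(eeg_column)
labels, codes = string_to_nominal(family_column)
```

Both functions only transform attribute values. Neither drops attributes, so the instance and attribute counts are unchanged.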
3. Experimental Setup and Results
The experimental setups are as follows:
Setup 1: Linear Regression with D1 vs. Random Forest with D1.
Setup 2: Linear Regression with D1 vs. Linear Regression with the Selection of Feature algorithm 1 -D1.
Setup 3: Linear Regression with D1 vs. Linear Regression with the Selection of Feature algorithm 2 -D1.
Setup 4: Random Forest with D1 vs. Random Forest with the Selection of Feature algorithm 1 -D1.
Setup 5: Random Forest with D1 vs. Random Forest with the Selection of Feature algorithm 2 -D1.
Setup 6: Linear Regression with D2 vs. Random Forest with D2.
Setup 7: Linear Regression with D2 vs. Linear Regression with the Selection of Feature algorithm 1 -D2.
Setup 8: Linear Regression with D2 vs. Linear Regression with the Selection of Feature algorithm 2 -D2.
Setup 9: Random Forest with D2 vs. Random Forest with the Selection of Feature algorithm 1 -D2.
Setup 10: Random Forest with D2 vs. Random Forest with the Selection of Feature algorithm 2 -D2.
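The ten setups follow a regular pattern: for each dataset, one comparison between the two algorithms, then each algorithm against itself with one of the two feature selection algorithms applied. A minimal sketch of that enumeration is given here. The label strings and the `build_setups` helper are illustrative assumptions and are not part of the Weka workflow.

```python
from itertools import product

ALGORITHMS = ["Linear Regression", "Random Forest"]
DATASETS = ["D1", "D2"]
FILTERS = ["feature selection algorithm 1", "feature selection algorithm 2"]


def build_setups():
    """Return the ten comparisons as (baseline, challenger) label pairs."""
    setups = []
    for dataset in DATASETS:
        # First comparison per dataset: the two algorithms against each other.
        setups.append((f"{ALGORITHMS[0]} with {dataset}",
                       f"{ALGORITHMS[1]} with {dataset}"))
        # Then each algorithm against itself with one filter applied.
        for algorithm, flt in product(ALGORITHMS, FILTERS):
            setups.append((f"{algorithm} with {dataset}",
                           f"{algorithm} with {flt} on {dataset}"))
    return setups


setups = build_setups()
for number, (baseline, challenger) in enumerate(setups, start=1):
    print(f"Setup {number}: {baseline} vs. {challenger}")
```

Generating the list this way keeps the dataset label of every comparison consistent with its baseline.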
3.1 Setup 1: Linear Regression with D1 vs. Random Forest with D1
This experimental setup compares linear regression with Random Forest on dataset 1.
The results are shown in the figures below.
Linear Regression with D1
Total Number of Instances 499
Random Forest with D1
3.2 Setup 2: Linear Regression with D1 vs. Linear Regression with the Selection of
Feature algorithm 1 -D1
This experimental setup compares linear regression with linear regression combined
with feature selection algorithm 1 on dataset 1 (Das et al., 2010). The results are
shown in the figures below.
Linear Regression with D1
Time taken to build a model: 32.01 seconds
Linear Regression with the Selection of Feature algorithm 1 -D1
Time taken to build a model: 25.78 seconds
3.3 Setup 3: Linear Regression with D1 vs. Linear Regression with the Selection of
Feature algorithm 2 -D1
This experimental setup compares linear regression with linear regression combined
with feature selection algorithm 2 on dataset 1. The results are shown in the
figures below (Edwards and Gaber, 2014).
Linear Regression with D1
Linear Regression with the Selection of Feature algorithm 2 -D1
3.4 Setup 4: Random Forest with D1 vs. Random Forest with the Selection of
Feature algorithm 1 -D1
This experimental setup compares Random Forest with Random Forest combined with
feature selection algorithm 1 on dataset 1. The results are shown in the figures
below.
Random Forest with D1
Random Forest with the Selection of Feature algorithm 1 -D1
3.5 Setup 5: Random Forest with D1 vs. Random Forest with the Selection of
Feature algorithm 2 -D1
This experimental setup compares Random Forest with Random Forest combined with
feature selection algorithm 2 on dataset 1. The results are shown in the figures
below.
Random Forest with D1
Random Forest with the Selection of Feature algorithm 2 -D1
3.6 Setup 6: Linear Regression with D2 vs. Random Forest with D2
This experimental setup compares linear regression with Random Forest on dataset 2.
The results are shown in the figures below.
Linear Regression with D2
Random Forest with D2
3.7 Setup 7: Linear Regression with D2 vs. Linear Regression with the Selection
of Feature algorithm 1 -D2
This experimental setup compares linear regression with linear regression combined
with feature selection algorithm 1 on dataset 2 (Maimon and Rokach, 2010). The results
are shown in the figures below.
Linear Regression with D2
Linear Regression with Selection of Feature algorithm 1 -D2
3.8 Setup 8: Linear Regression with D2 vs. Linear Regression with the Selection of
Feature algorithm 2 -D2
This experimental setup compares linear regression with linear regression combined
with feature selection algorithm 2 on dataset 2. The results are shown in the
figures below.
Linear Regression with D2
Linear Regression with the Selection of Feature algorithm 2 -D2
3.9 Setup 9: Random Forest with D2 vs. Random Forest with the Selection of Feature
algorithm 1 -D2
This experimental setup compares Random Forest with Random Forest combined with
feature selection algorithm 1 on dataset 2. The results are shown in the figures
below.
Random Forest D2
Tree Visualization
Random Forest with the selection of feature Algorithm 1 (Nominal to Binary)
3.10 Setup 10: Random Forest with D2 vs. Random Forest with the Selection of Feature
algorithm 2 -D2
This experimental setup compares Random Forest with Random Forest combined with
feature selection algorithm 2 on dataset 2. The results are shown in the figures
below.
Random Forest D2
Tree visualization
Random Forest with the selection of feature Algorithm 2 (string to nominal)
4. Discussion
Based on the experimental setup on dataset 1,
In this setup, each record in the table is identified by its record ID, and the class
variable is one column of the dataset (Rao and Kamra, 2018). For the linear regression
model with the feature selection algorithm, the total number of instances is 499, the
correlation coefficient is -0.0671, the mean absolute error is 1.324, the root
absolute error is 1.5513, and the root relative squared error is 110.4205%. Thus, the
analysis shows that the dataset is correctly structured.
Based on the experimental setup on dataset 2,
This experiment is based on the records of the class variable, which defines the
columns of the dataset (Searle and Gruber, n.d.). For linear regression with the feature
selection algorithm, the total number of instances is 1492, the correlation coefficient
is 0.9967, the mean absolute error is 0.0613, the root absolute error is 5.6659%, and
the root relative squared error is 81129%.
For Random Forest with the feature selection algorithm, the total number of instances
is 1492, the correlation coefficient is 0.975, the mean absolute error is 0.0582, the
root absolute error is 0.0904, and the root relative squared error is 25.7091%.
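To make the reported figures easier to interpret, the sketch below computes the kind of summary metrics that Weka's evaluation output typically lists for numeric predictions: correlation coefficient, mean absolute error, root mean squared error, relative absolute error and root relative squared error. It is a minimal illustration under assumptions, not Weka's own code. The baseline for the relative errors is taken here as the mean of the evaluated actual values, which may differ slightly from the baseline Weka uses, and the sample values are hypothetical.

```python
import math


def regression_metrics(actual, predicted):
    """Compute Weka-style summary metrics for numeric predictions."""
    n = len(actual)
    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n

    abs_err = sum(abs(p - a) for a, p in zip(actual, predicted))
    sq_err = sum((p - a) ** 2 for a, p in zip(actual, predicted))
    # Baseline: always predict the mean of the actual values (assumption).
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)

    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)

    return {
        "correlation": cov / math.sqrt(base_sq * var_p),
        "mae": abs_err / n,
        "rmse": math.sqrt(sq_err / n),
        "rae_percent": 100.0 * abs_err / base_abs,
        "rrse_percent": 100.0 * math.sqrt(sq_err / base_sq),
    }


# Hypothetical values for illustration only.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [1.5, 2.0, 2.5, 4.0, 5.5]
metrics = regression_metrics(actual, predicted)
```

Under this definition, a root relative squared error above 100%, such as the 110.4205% reported for dataset 1, means the squared error is larger than that of always predicting the mean.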
The two algorithms are used to predict the outcome of the classification problems in
the data analysis models. The result of linear regression is compared with that of
Random Forest. The linear regression classification was implemented successfully and
provides the absolute and root squared error values for comparison with those of the
Random Forest. The dataset is correctly structured for the analysis.
Significance Test on Dataset 1 - ROC curve
Significance Test on Dataset 2 - ROC curve
Comparison of the significance tests on both datasets
The significance comparison examines the cut points on the ROC (receiver operating
characteristic) curve, which plots the true positive rate (sensitivity) against the
false positive rate as the decision threshold varies. For dataset 1, the linear
regression classification algorithm gives an area under the ROC curve of 0.972. For
dataset 2, the Random Forest classification algorithm gives an area of 0.817. Both
linear regression and Random Forest therefore reach high ROC area values. On the
provided data, the analysis indicates a high level of classification performance for
the Random Forest algorithm.
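The construction of an ROC curve from classifier scores can be sketched as follows. This is an illustrative approximation rather than Weka's implementation: each distinct score is used as a decision threshold, the false and true positive rates are recorded at each threshold, and the area is obtained with the trapezoid rule. The labels and scores are hypothetical.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    # Lower the threshold step by step over the distinct scores.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as ordered (x, y) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical labels (1 = positive class) and classifier scores.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_points(labels, scores)
area = trapezoid_area(points)
```

The curve always starts at (0, 0) and ends at (1, 1). The closer it passes to the upper left corner, the larger the area.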
AUC curve
The AUC (area under the curve) summarizes the performance of a classification
algorithm over all decision thresholds on the given dataset. It is computed from the
ROC curve. The classifiers in Weka were run on dataset 1 and dataset 2, and the
prediction results are displayed as ROC curves. A high-quality classifier has a curve
that approaches the upper left corner. The result for dataset 1 is shown below.
The AUC curve result for dataset 2 is shown below.
For the linear regression classification algorithm, the plotted AUC value is 0.6982.
For the Random Forest classification algorithm, the AUC value is 0.9763. Comparing
the two on the provided data, linear regression gives the lower AUC and Random Forest
the higher one.
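One way to read an AUC value is as the probability that a randomly chosen positive instance receives a higher score than a randomly chosen negative one. The sketch below computes the AUC directly from that definition by counting correctly ordered pairs. It is a minimal illustration under that interpretation, with a hypothetical function name and hypothetical data, and it is not how Weka computes the value internally.

```python
def auc_by_ranking(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = positive class) and classifier scores.
example_auc = auc_by_ranking([1, 1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.6, 0.4, 0.2])
perfect_auc = auc_by_ranking([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
chance_auc = auc_by_ranking([1, 0, 1, 0], [0.5, 0.5, 0.5, 0.5])
```

On this scale 0.5 corresponds to chance-level ranking and 1.0 to perfect ranking, so the reported 0.9763 is close to perfect while 0.6982 is moderately above chance.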
5. Critical understanding on Big Data analysis challenges
Analyzing and handling big data is difficult to perform and to understand. The
challenges include maintaining data integrity during processing, data management,
the talent gap, getting data into the analysis by syncing data sources, and the
structure of the data. They are characterized by the volume of big data, the
availability of analytical skills, and the cost of the solution. The main challenges
of big data analysis are data processing, data integration, data storage,
classification and prediction, searching the dataset, sharing information, analyzing
the data, and visualizing the results.
6. Awareness of Implication and issues about big data
Real-world case: implications of big data analytics on social media data.
Identification of the privacy and security issues that arise when open big data sets
are analyzed across vertical sectors.
Legal framework for big data analysis applications.
Identification and analysis of the minimum risk and maximum benefit of real-world
data analytics problems and results.
7. Knowledge of most significant computing techniques for dealing with
Big Data
The most significant techniques for big data analysis include statistical learning
that is robust for high-risk private data, such as personal and health care data, and
verification and processing of the dataset so that black box models can be applied
for classification analysis. Further techniques for dealing with big data are data
streaming and dynamic data integration, data compression, and the prediction and
removal of unwanted data. Online big data mining techniques are used to increase
robustness against attacks, and learning and measuring algorithms are used to reduce
the energy consumption of artificial intelligence in big data analysis.
8. Conclusion
This project used the Weka tool to compare the performance of two machine learning
algorithms. The report also analyzed the classification datasets to understand the
problems and applied the machine learning algorithms together with the feature
selection process to resolve them.
9. Reference
Challenges with Big Data Analytics. (2015). International Journal of Science and Research
(IJSR), 4(12), pp.778-780.
Classification and interpretation in quantitative structure-activity relationships. (2017).
Comparação de métodos no estudo da estabilidade fenotípica. (2010). Biblioteca Digital de
Teses e Dissertações da USP.
Das, V., Debnath, N., Gaol, F., Meghanathan, N., Sankaranarayanan, S., Stephen, J.,
Thankachan, N., Thankachan, P. and Vijayakumar, R. (2010). Information Processing and
Management. Berlin, Heidelberg: Springer-Verlag Berlin Heidelberg.
Edwards, K. and Gaber, M. (2014). Astronomy and Big Data. Dordrecht: Springer.
Maimon, O. and Rokach, L. (2010). Data mining and knowledge discovery handbook. New
York: Springer.
Rao, S. and Kamra, R. (2018). A hybrid parallel algorithm for large sparse linear
systems. Numerical Linear Algebra with Applications, 25(6), p.e2210.
Searle, S. and Gruber, M. (n.d.). Linear models.