University **** Big Data Analysis Project: Algorithm Performance

Added on 2022/11/24

AI Summary

This project evaluates the performance of two machine learning algorithms, Linear Regression and Random Forest, using the Weka data mining software. It uses two datasets: an Epileptic Seizure Recognition dataset and a Train and test dataset. Each algorithm is compared with and without two feature selection algorithms across ten experimental setups. The results are reported with metrics such as correlation coefficient, mean absolute error, relative absolute error, and root relative squared error. The project also includes a discussion of the results, a significance test on Dataset 1, AUC curve analysis, and a critical discussion of big data analysis challenges and implications. It concludes with a comparison of the significance tests on both datasets and an analysis of the AUC curve for the classification algorithm.

University ****
Semester ****
Big Data
Student ID *****
Student Name *****
Submission Date *****


Table of Contents
1. Introduction
2. System Description
3. Experimental Setup and Results
3.1 Setup 1: Linear Regression with D1 vs. Random Forest with D1
3.2 Setup 2: Linear Regression with D1 vs. Linear Regression with Selection of Feature algorithm F1 -D1
3.3 Setup 3: Linear Regression with D1 vs. Linear Regression with the Selection of Feature algorithm 2 -D1
3.4 Setup 4: Random Forest with D1 vs. Random Forest with the Selection of Feature algorithm 1 -D1
3.5 Setup 5: Random Forest with D1 vs. Random Forest with Selection of Feature algorithm 2 -D1
3.6 Setup 6: Linear Regression with D2 vs. Random Forest with D2
3.7 Setup 7: Linear Regression with D2 vs. Linear Regression with the Selection of Feature algorithm 1 -D2
3.8 Setup 8: Linear Regression with D2 vs. Linear Regression with the Selection of Feature algorithm 2 -D2
3.9 Setup 9: Random Forest with D2 vs. Random Forest with Selection of Feature algorithm 1 -D2
3.10 Setup 10: Random Forest with D2 vs. Random Forest with the Selection of Feature algorithm 2 -D2
4. Discussion
Significance Test on Dataset 1
AUC curve
5. Critical understanding on Big Data analysis challenges
6. Awareness of Implication and issues about big data
7. Knowledge of most significance computing techniques for dealing with Big Data
8. Conclusion
9. Reference

1. Introduction
This project compares the performance of two machine learning algorithms in Weka on the
provided datasets. Two classification problems are considered, one for each dataset, and
each is addressed using a feature selection algorithm together with a machine learning
algorithm.
2. System Description
Dataset Description
Epileptic Seizure Recognition dataset,
Train and test dataset

Classification Problem
• Dataset 1, classification problem:
The analysis predicts epileptic seizure detection using Weka classification. The original
dataset records EEG (electroencephalography) signals, with the explanatory variables X1,
X2, ..., X178 as the measuring points (Challenges with Big Data Analytics, 2015).
Detecting seizures from EEG signals is time-consuming and sensitive, because the useful
information has to be separated from irrelevant information such as signal noise.
Predicting the class variable automatically relieves expert clinicians of the burden of
processing large amounts of data by visual observation and speeds up epilepsy diagnosis
(Classification and interpretation in quantitative structure-activity relationships,
2017). Machine learning is therefore applied together with the feature selection
algorithms to separate the relevant from the irrelevant information in the EEG
measuring points.
• Dataset 2, classification problem:
The train and test dataset consists of 5 folders, and processes are recognised through
their calls. The data is segmented into 60 records from three different families for
classification. The real data is split into train and test sets on the condition
variables. The feature selection algorithms are then applied to address the
classification problem on the train and test data (Comparação de métodos no estudo da
estabilidade fenotípica, 2010).
The selected datasets, algorithms, and class variables are:
• Dataset 1 - Epileptic Seizure Recognition Data Set
• Dataset 2 - Train and test Dataset
• Algorithm 1 - Linear Regression
• Algorithm 2 - Random Forest
• Feature Selection Algorithm 1 - Normalize
• Feature Selection Algorithm 2 - String to Nominal
• Class Variable in Dataset 1 - Y
• Class Variable in Dataset 2 - Record ID
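The two preprocessing steps named in the list above can be illustrated with a small sketch. It assumes that Normalize rescales each numeric attribute to the range [0, 1] (the default behaviour of Weka's unsupervised Normalize filter) and that String to Nominal maps each distinct string value to one nominal label. The function names `normalize_column` and `string_to_nominal` are hypothetical and written in plain Python for illustration; this is not Weka's implementation. Both steps transform attribute values rather than remove attributes.

```python
from typing import Dict, List, Sequence, Tuple


def normalize_column(values: Sequence[float]) -> List[float]:
    """Min-max scale one numeric attribute to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption for this sketch: a constant attribute maps to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def string_to_nominal(values: Sequence[str]) -> Tuple[List[int], Dict[str, int]]:
    """Replace each distinct string with the index of its nominal label."""
    labels: Dict[str, int] = {}
    encoded: List[int] = []
    for v in values:
        if v not in labels:
            # Labels are numbered in order of first appearance.
            labels[v] = len(labels)
        encoded.append(labels[v])
    return encoded, labels
```

For example, `normalize_column` could be applied to each of the columns X1 to X178 of Dataset 1, and `string_to_nominal` to any text-valued column before classification.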

3. Experimental Setup and Results
The experimental setups are as follows:
• Setup 1: Linear Regression with D1 vs. Random Forest with D1.
• Setup 2: Linear Regression with D1 vs. Linear Regression with the Selection of Feature algorithm 1 -D1.
• Setup 3: Linear Regression with D1 vs. Linear Regression with the Selection of Feature algorithm 2 -D1.
• Setup 4: Random Forest with D1 vs. Random Forest with the Selection of Feature algorithm 1 -D1.
• Setup 5: Random Forest with D1 vs. Random Forest with the Selection of Feature algorithm 2 -D1.
• Setup 6: Linear Regression with D2 vs. Random Forest with D2.
• Setup 7: Linear Regression with D2 vs. Linear Regression with the Selection of Feature algorithm 1 -D2.
• Setup 8: Linear Regression with D2 vs. Linear Regression with the Selection of Feature algorithm 2 -D2.
• Setup 9: Random Forest with D2 vs. Random Forest with the Selection of Feature algorithm 1 -D2.
• Setup 10: Random Forest with D2 vs. Random Forest with the Selection of Feature algorithm 2 -D2.
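The ten setups follow a regular pattern: for each dataset, one head-to-head comparison of the two algorithms, then each algorithm compared against itself with one of the two feature selection algorithms. The sketch below enumerates that pattern in Python. The constant names and the `build_setups` helper are hypothetical and serve only to make the design explicit; they are not part of the Weka workflow used in the project.

```python
from itertools import product

DATASETS = ["D1", "D2"]
ALGORITHMS = ["Linear Regression", "Random Forest"]
FEATURE_SELECTORS = ["feature selection algorithm 1", "feature selection algorithm 2"]


def build_setups():
    """Return (baseline, alternative) pairs in the order Setup 1 to Setup 10."""
    setups = []
    for dataset in DATASETS:
        # First setup per dataset: the two algorithms head to head.
        setups.append((f"{ALGORITHMS[0]} with {dataset}",
                       f"{ALGORITHMS[1]} with {dataset}"))
        # Next four: each algorithm against itself plus one feature selector.
        for algorithm, selector in product(ALGORITHMS, FEATURE_SELECTORS):
            setups.append((f"{algorithm} with {dataset}",
                           f"{algorithm} with {selector} - {dataset}"))
    return setups


for number, (baseline, alternative) in enumerate(build_setups(), start=1):
    print(f"Setup {number}: {baseline} vs. {alternative}")
```

Each pair shares the same baseline model on the same dataset, so any difference in the reported metrics can be attributed to the algorithm (Setups 1 and 6) or to the feature selection step (all other setups).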
3.1 Setup 1: Linear Regression with D1 vs. Random Forest with D1
This setup compares the Linear Regression algorithm with the Random Forest algorithm on
Dataset 1. The results are shown in the figures below.
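Comparisons of this kind rest on the error measures Weka reports for numeric prediction: correlation coefficient, mean absolute error, root mean squared error, relative absolute error, and root relative squared error. The following is a minimal Python sketch of the standard definitions, not Weka's own code. The function name `regression_metrics` is hypothetical, and the sketch assumes the relative measures use the mean of the actual values as the baseline predictor, which may differ slightly from how Weka derives its baseline.

```python
import math
from typing import Dict, Sequence


def regression_metrics(actual: Sequence[float],
                       predicted: Sequence[float]) -> Dict[str, float]:
    """Compute common numeric-prediction metrics from paired values."""
    n = len(actual)
    if n == 0 or n != len(predicted):
        raise ValueError("actual and predicted must be non-empty and equal length")

    mean_a = sum(actual) / n
    mean_p = sum(predicted) / n
    errors = [p - a for a, p in zip(actual, predicted)]

    abs_err = sum(abs(e) for e in errors)
    sq_err = sum(e * e for e in errors)
    # Baseline (assumption): always predict the mean of the actual values.
    base_abs = sum(abs(a - mean_a) for a in actual)
    base_sq = sum((a - mean_a) ** 2 for a in actual)

    # Pearson correlation between actual and predicted values.
    cov = sum((a - mean_a) * (p - mean_p) for a, p in zip(actual, predicted))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    denom = math.sqrt(base_sq * var_p)

    nan = float("nan")
    return {
        "correlation_coefficient": cov / denom if denom else 0.0,
        "mean_absolute_error": abs_err / n,
        "root_mean_squared_error": math.sqrt(sq_err / n),
        "relative_absolute_error_pct": 100.0 * abs_err / base_abs if base_abs else nan,
        "root_relative_squared_error_pct":
            100.0 * math.sqrt(sq_err / base_sq) if base_sq else nan,
    }
```

A higher correlation coefficient and lower error values indicate the better model. A relative error above 100% means the model does worse than simply predicting the mean.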
Linear Regression with D1

Total Number of Instances 499
Random Forest with D1

3.2 Setup 2: Linear Regression with D1 vs. Linear Regression with Selection of Feature algorithm F1 -D1
This setup compares Linear Regression with Linear Regression combined with feature
selection algorithm 1 on Dataset 1 (Das et al., 2010). The results are shown as
follows.
Linear Regression with D1