FIT 3152: Data Analytics Assignment


Added on 2021-09-27


FIT 3152: DATA ANALYTICS
Assignment 2 – Australian Rainfall Data

Name: Giridhar Gopal Sharma
Student ID: 29223709
Email: gsha0009@student.monash.edu
Semester 1 2020

Table of Contents
S No. Title
1. Overview
2. Task 1 and 2: Exploring Data/ Pre-Processing of Data
3. Task 3: Dividing data into Train and Test sets
4. Task 4: Classification Models
5. Task 5: Confusion Matrix for each Model
6. Task 6: ROC curve and Area under curve (AUC)
7. Task 7: Combining and Commenting on results in Tasks 5 and 6
8. Task 8: Examining each of the Model
9. Task 9: Best Tree Classifier using Cross-Validation
10. Task 10: Artificial Neural Network (ANN) Classifier
11. R-Codes

Overview
This assignment uses the dataset “WAUS.csv”, a modified version of the Kaggle competition data
on Australian rainfall. The goal is to analyse the data and predict whether it is going to rain
tomorrow in Australia or not, which is the classification (class) attribute of our dataset. The
analysis covers the classification models (Decision Tree, Naïve Bayes, Bagging, Boosting and
Random Forest), their trees and confusion matrices, ROC curves, accuracy scores and Area Under
the Curve (AUC), the best and worst classifiers, the variables that affect performance,
cross-validation using an alternative tree-based learning algorithm, and an Artificial Neural
Network (ANN). Each analysis is presented with graphs and an explanation, and the corresponding
code can be found in the R file with comments for ease of understanding. A random sample of 2000
rows covering 10 locations in Australia is generated using the code snippet provided in the
assignment question PDF.

The csv file contains 100,000 observations and 25 attributes (20 numerical, 5 categorical). We
then work with the randomly sampled data, whose summary and str output can be observed in the
images below:

Figure 1: Summary of the original WAUS dataset after reading data

Figure 2: str(WAUS)

Task 1 and 2: Exploring Data/ Pre-Processing of Data
For this task, multiple subtasks were coded and the following outputs were observed.

Response Variable: The ‘RainTomorrow’ attribute is the response (dependent) variable, which
makes it a very important variable for our analysis. To calculate the proportion of rainy to fine
days, the frequencies of No and Yes were calculated using xtabs for both the RainToday and
RainTomorrow attributes. The percentage of each was also calculated, and tables were created
as follows. (Fine days = No rain, Rainy days = Yes)

From the values below, the ratio of rainy to fine days is 21.3 : 78.8, or roughly 1 : 3.7. On
average, then, about one day in every four to five is rainy and the remaining days are fine.
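
The frequency and percentage tables described above can be sketched as follows. This is only a minimal illustration on a small made-up data frame (the values are hypothetical and are not the WAUS figures); it assumes the two columns hold the strings "Yes" and "No".

```r
# Toy data: hypothetical values, NOT taken from WAUS.csv.
rain <- data.frame(
  RainToday    = c("No", "No", "Yes", "No", "No", "Yes", "No", "No", "No", "No"),
  RainTomorrow = c("No", "Yes", "No", "No", "Yes", "No", "No", "No", "No", "No")
)

# Frequency of No / Yes for each attribute.
freq_today    <- xtabs(~ RainToday, data = rain)
freq_tomorrow <- xtabs(~ RainTomorrow, data = rain)

# Convert the counts to percentages, rounded to one decimal place.
pct_today    <- round(100 * prop.table(freq_today), 1)
pct_tomorrow <- round(100 * prop.table(freq_tomorrow), 1)

print(freq_tomorrow)
print(pct_tomorrow)
```

prop.table divides each count by the table total, so the percentages in each table sum to 100 up to rounding.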

The variable L can also be seen below. It contains the 10 randomly generated locations in
Australia for which we will predict whether it is going to rain tomorrow or not.

Figure 4: Summary and str of data frame L
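
The sampling of 10 locations and 2000 rows might look roughly like the sketch below. The actual snippet comes from the assignment question PDF and is not reproduced here, so the seed, the stand-in data frame `full` and its 49 location codes are all assumptions made for illustration.

```r
# Hypothetical stand-in for the full data set (NOT the real WAUS.csv).
set.seed(1)  # placeholder seed; the assignment snippet defines the real one
full <- data.frame(
  Location = sample(1:49, 20000, replace = TRUE),
  Rainfall = round(rexp(20000, rate = 0.5), 1)
)

# Pick 10 distinct locations at random and store them in L.
L <- data.frame(Location = sample(unique(full$Location), 10, replace = FALSE))

# Keep only the rows for those locations, then draw 2000 of them at random.
candidates <- full[full$Location %in% L$Location, ]
WAUS <- candidates[sample(nrow(candidates), 2000, replace = FALSE), ]

str(L)
```

Sampling without replacement guarantees that no location and no row is picked twice.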

Upon further analysis, many of the columns had observations with NA values. The NA values in
each column were therefore counted, and the affected observations were later removed. Columns
with a high number of NA values (e.g. Cloud3pm, Cloud9am) were also removed from the analysis.
The total number of NA values in the entire table was found to be 5053. It can also be observed
that the column Location has no NA values, as a location cannot be represented by NA.

Figure 5: All columns and the Frequency of NA values

Figure 3: Tables showing frequency and percentage for rainy and fine days for RainToday and RainTomorrow
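
The NA handling described above can be sketched as below. The data frame is made up, and the 50% cut-off for dropping a column is an assumption chosen for illustration; the report does not state the threshold that was actually used.

```r
# Toy data with missing values (hypothetical, not the WAUS sample).
df <- data.frame(
  Location = c(1, 2, 3, 4, 5),
  MinTemp  = c(12.1, NA, 9.8, 15.0, NA),
  Cloud9am = c(NA, NA, NA, 7, NA),
  Rainfall = c(0, 1.2, NA, 0, 0.4)
)

# Number of NA values per column and in the whole table.
na_per_column <- colSums(is.na(df))
total_na <- sum(is.na(df))

# Drop columns where more than half the values are missing
# (assumed threshold), then drop the rows that still contain an NA.
keep <- na_per_column / nrow(df) <= 0.5
df_reduced  <- df[, keep, drop = FALSE]
df_complete <- na.omit(df_reduced)

print(na_per_column)
```

Removing the sparse columns first means fewer rows are lost when na.omit is applied afterwards.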

The mean (average) values of all the columns can also be observed as follows. All the values
fall within a normal and practically possible range. For example, the temperature columns do not
exceed 50, which would indicate a false entry, and Day and Month do not exceed 31 and 12
respectively.

Figure 6: All columns and their Mean values

The standard deviation values for all the columns can also be observed as follows.

Figure 7: All columns and their Standard Deviation values
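
A minimal sketch of how the per-column means and standard deviations could be computed is given below. The data frame is hypothetical, and the sketch assumes that NA values are ignored (na.rm = TRUE) and that only numeric columns are summarised.

```r
# Toy data (hypothetical values).
df <- data.frame(
  MinTemp     = c(10, 12, NA, 14),
  Humidity3pm = c(40, 60, 50, NA),
  Location    = factor(c("a", "b", "a", "b"))
)

# Select the numeric columns only; factors such as Location are skipped.
numeric_cols <- sapply(df, is.numeric)

# Mean and standard deviation per column, ignoring NA values.
col_means <- sapply(df[, numeric_cols, drop = FALSE], mean, na.rm = TRUE)
col_sds   <- sapply(df[, numeric_cols, drop = FALSE], sd, na.rm = TRUE)

print(col_means)
print(col_sds)
```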

Removal of Columns : Data cleaning involved removing three groups of unnecessary columns: the
date columns (Day, Month, Year), which were not needed in our analysis; the columns with too many
NA values, identified from the table in Figure 5; and unwanted columns such as WindDir3pm and
Pressure3pm, which had very few data points and would make the data inconsistent and the analysis
inaccurate. After all these columns were removed, the data set was reduced to 13 columns. The new
table can be seen below.

Figure 8: Summary of new table after removing unwanted columns
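
Dropping columns by name can be sketched as follows. The toy data frame and the exact list in `drop_cols` are assumptions for illustration; only the column names themselves are taken from the text above.

```r
# Toy data frame with a few of the column names mentioned in the report
# (values are hypothetical).
df <- data.frame(
  Day = c(1, 2, 3, 4), Month = c(5, 5, 6, 6), Year = c(2010, 2010, 2011, 2011),
  Location = c(3, 7, 3, 12), MinTemp = c(11.2, 9.5, 14.0, 8.1),
  WindDir3pm = c("N", NA, NA, "SW"), Pressure3pm = c(1012, NA, NA, NA),
  RainTomorrow = c("No", "Yes", "No", "No")
)

# Columns to remove (assumed list for this sketch).
drop_cols <- c("Day", "Month", "Year", "WindDir3pm", "Pressure3pm")

# Keep every column whose name is not in drop_cols.
df_clean <- df[, !(names(df) %in% drop_cols), drop = FALSE]

names(df_clean)
```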

Categorical Attributes : The next step was to handle the categorical attributes. After running a
test, it was observed that RainTomorrow is dependent on almost all categorical variables. For
the RainToday and RainTomorrow attributes, Yes was recoded as 1 and No as 0, making them
numeric. Later on, all the categorical attributes (RainTomorrow, RainToday and Location) were
converted using as.factor.

Figure 9: RainToday and RainTomorrow updated to binary values
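
The recoding described above might be written as in the sketch below. The helper `to_binary` is a hypothetical name introduced here, and the data values are made up.

```r
# Toy data (hypothetical values).
df <- data.frame(
  Location     = c(3, 7, 3, 12),
  RainToday    = c("No", "Yes", "No", "No"),
  RainTomorrow = c("Yes", "No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: map "Yes" to 1 and anything else to 0 (NA stays NA).
to_binary <- function(x) ifelse(x == "Yes", 1, 0)

df$RainToday    <- to_binary(df$RainToday)
df$RainTomorrow <- to_binary(df$RainTomorrow)

# Convert the categorical attributes to factors.
df$RainToday    <- as.factor(df$RainToday)
df$RainTomorrow <- as.factor(df$RainTomorrow)
df$Location     <- as.factor(df$Location)

str(df)
```

Converting to factors after the 0/1 recoding keeps the binary labels while still telling the classifiers that these attributes are categorical.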

T-Test (Significance testing) : Hypothesis tests (t.test) were also performed for the mean
values of all columns against the RainTomorrow column, as it is the response variable and the
most important one in the entire table. The null hypothesis of each test is that the attribute
has no relationship with the target. It was observed that attributes with a high standard
deviation from their mean, e.g. Humidity3pm, tended to have the lowest p-values; Humidity3pm had
the lowest p-value of all, indicating it could be the best predictor. Temp9am had a p-value
higher than the significance level alpha (0.05), so this column was also removed from our table.

Figure 10: T-Test of all columns against RainTomorrow
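
One way to run a t-test for each predictor against RainTomorrow is sketched below. The data is simulated (Humidity3pm is deliberately given different group means, Temp9am is not), so the resulting p-values only illustrate the procedure and say nothing about the real WAUS data. Using a two-sample Welch test (the t.test default) is also an assumption.

```r
set.seed(42)  # placeholder seed for the simulated data
n <- 200

# Simulated data: the first half has no rain tomorrow (0), the second half has rain (1).
df <- data.frame(
  RainTomorrow = factor(rep(c(0, 1), each = n / 2)),
  Humidity3pm  = c(rnorm(n / 2, mean = 45, sd = 10),
                   rnorm(n / 2, mean = 70, sd = 10)),
  Temp9am      = rnorm(n, mean = 17, sd = 5)
)

predictors <- setdiff(names(df), "RainTomorrow")

# For each predictor, compare its mean between the two RainTomorrow groups.
p_values <- sapply(predictors, function(col) {
  groups <- split(df[[col]], df$RainTomorrow)
  t.test(groups[["0"]], groups[["1"]])$p.value
})

alpha <- 0.05
not_significant <- names(p_values)[p_values > alpha]  # candidates for removal

print(p_values)
```

A p-value below alpha rejects the null hypothesis of equal group means; columns listed in `not_significant` would be the candidates for removal.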

Correlated Attributes : Some of the models performed slightly worse when all the highly
correlated attributes were included. Thus, columns like Pressure3am, Pressure9am, Temp3pm,
WindDir3pm, Pressure3pm, WindGustDir and WindDir9am were removed due to their high correlations
with other attributes in the table, which allows a better analysis of the data.
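
A possible way to find highly correlated numeric attributes is sketched below. The data is simulated, and the threshold of 0.8 is an assumption; the report does not state which cut-off or method was actually used.

```r
set.seed(7)  # placeholder seed for the simulated data
n <- 300
temp9am <- rnorm(n, mean = 17, sd = 5)

# Simulated data: Temp3pm is built from Temp9am, Humidity3pm is independent.
df <- data.frame(
  Temp9am     = temp9am,
  Temp3pm     = temp9am + 5 + rnorm(n, mean = 0, sd = 1),
  Humidity3pm = rnorm(n, mean = 50, sd = 15)
)

# Pairwise Pearson correlations between the numeric columns.
cor_mat <- cor(df, use = "pairwise.complete.obs")

# Report each pair once (upper triangle) when |r| exceeds the assumed threshold.
threshold <- 0.8
high <- which(abs(cor_mat) > threshold & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cor_mat)[high[, "row"]],
  var2 = colnames(cor_mat)[high[, "col"]],
  r    = cor_mat[high],
  stringsAsFactors = FALSE
)

print(high_pairs)
```

For each reported pair, one of the two columns can be dropped so that the models do not receive nearly duplicate information.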

Task 3: Dividing data into Train and Test sets
For this task, the data was divided into a Train set (70% of the original data) and a Test set
(30% of the original data) using the code provided in the assignment question. We can see below
that the final data after pre-processing and cleaning, with a total of 1808 observations, was
divided into a Train set of 1265 rows and 13 columns and a Test set of 543 rows and 13 columns.

The Train set was used to fit the classification models (Decision Tree, Naïve Bayes, Bagging,
Boosting and Random Forest), while the Test set was used to make predictions and report the
Accuracy and AUC for each model.

Figure 11: Final Data Divided into Train and Test sets
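
The 70/30 split can be sketched as below. The seed and the stand-in data frame `data_clean` are assumptions, and the assignment's own snippet may differ in detail. With 1808 rows, floor(0.7 * 1808) gives 1265 training rows and leaves 543 test rows, which matches the sizes reported above.

```r
set.seed(123)  # placeholder seed; the assignment snippet defines the real one
n <- 1808      # number of observations after cleaning, as reported above

# Hypothetical stand-in for the cleaned data.
data_clean <- data.frame(
  id = seq_len(n),
  RainTomorrow = factor(sample(c(0, 1), n, replace = TRUE, prob = c(0.8, 0.2)))
)

# Draw 70% of the row indices for training; the remaining rows form the test set.
train_rows <- sample(seq_len(n), floor(0.7 * n))
train <- data_clean[train_rows, ]
test  <- data_clean[-train_rows, ]

dim(train)
dim(test)
```

Indexing with -train_rows guarantees that no observation appears in both sets.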
