School of Business Big Data Analysis Project, Semester 1, 2019

BIG DATA ANALYSIS

Table of Contents
Data Understanding
Data pre-processing
Modelling NoSQL Database
Cassandra
Cluster
Visualization
References

Data Understanding
The task is to predict the type of forest cover for a given site using cartographic data only, without remotely sensed information. The actual forest cover type for each 30 x 30 metre cell was determined from United States Forest Service (USFS) Region 2 Resource Information System (RIS) data. The independent variables were derived from data obtained from the United States Geological Survey (USGS) and the USFS. The data are in raw form and include binary (0 or 1) columns for the qualitative independent variables, namely the wilderness areas and soil types.
The area under study comprises four wilderness areas within the Roosevelt National Forest in northern Colorado. These are areas with minimal human disturbance, so the existing forest cover types reflect natural ecological processes rather than forest management practices (Bacardit & Llorà, 2013) ("Data Mining from Heterogeneous Data Sources", 2017).
Information about the four areas: Neota (wilderness area 2) has the highest mean elevation of the four. Rawah (area 1) and Comanche Peak (area 3) have lower mean elevations, and Cache la Poudre (area 4) has the lowest elevation of all.
The primary tree species at Neota are spruce and fir (cover type 1). Rawah and Comanche Peak have lodgepole pine (type 2) as their dominant vegetation, followed by spruce/fir and then aspen (type 5). Cache la Poudre tends to have ponderosa pine (type 3) and Douglas-fir (type 6); willow and cottonwood also grow there (Dong, 2014) (Yildirim, Birant & Alpyildiz, 2017).
Rawah and Comanche Peak contribute far more observations to the dataset than Neota and Cache la Poudre. Cache la Poudre is the most distinct of the four areas because of its low elevation range and its different species composition.
Data pre-processing
Data pre-processing is the data mining step in which raw data are transformed into meaningful, analysis-ready information. Raw data are rarely complete or consistent and often lack important attribute values.
In short, raw data contain many errors, and pre-processing is the way to resolve these issues before modelling.
First, start the Weka application, open the Explorer window and select the Preprocess tab. Open the Iris data file and note the information provided about the dataset, such as its class, attributes and number of instances. Identify whether each attribute is numeric or nominal and identify the classes in the dataset (Elo & Schwikowski, 2011) (Gaber, 2011) (Wegman, 2012). Find the attribute with the highest standard deviation; what can you learn about it? From the Filter panel, choose the Standardize filter, apply it to the attributes and examine the results, noting how the attribute statistics change. Press Undo. To gain deeper insight into the data, apply the Normalize filter to the attributes and note what happens: how do the attribute statistics change, and how does this differ from standardization? Undo again to return to the original data. A graph of the dataset appears at the bottom right; from the drop-down box, choose Visualize All. How would you interpret the resulting graphs? Which attributes discriminate best among the classes in this dataset, and what happens to the graphs when the Standardize or Normalize filter is applied? Finally, go to the filter list and choose an attribute selection filter.
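For reference, the two filters rescale the data differently. Standardization transforms each numeric attribute to zero mean and unit variance, z = (x − mean) / standard deviation, while normalization rescales each attribute into the range [0, 1], x' = (x − min) / (max − min). Standardized values can therefore be negative and unbounded, whereas normalized values always lie between 0 and 1, which explains the different attribute statistics observed after applying each filter.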
The figure above shows the Weka GUI Chooser. It offers five applications: Explorer, Experimenter, KnowledgeFlow, Workbench and Simple CLI. Click the Explorer application for the big data analysis.
First, perform the pre-processing: click Open file and load the covtype.csv dataset.
Modelling NoSQL Database
NoSQL is a comparatively new technology that provides rapid, large-scale data access at low cost. Its architecture leverages distributed processing and systems built from less expensive servers.
So far, NoSQL has given rise to many technological innovations that address the shortcomings of the relational database (Guha Neogi, 2013) (Leung, 2011). One such solution is the flexibility of its data models. Modern applications need to capture and process large
volumes of data of many different types. In a relational database, data modelling is quite static and cannot adapt to rapid changes in modern business requirements (Wang, Qi, Sebe & Aggarwal, 2015) (Wang, Qi, Sebe & Aggarwal, 2016). Several benefits come with this flexibility of data models (a small sketch follows the list):
The data model can evolve as the business environment changes.
There is no need for disruptive updates to data that has already been captured.
Merging data from a variety of sources becomes simpler.
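As a concrete illustration of this flexibility in Cassandra, the NoSQL database used in this project, a new column can be added to an existing table with a single CQL statement, without rebuilding the table or migrating the rows already stored. The table and column names below are hypothetical and only illustrate the idea.

    -- hypothetical table: adding a new attribute does not touch existing rows;
    -- older rows simply return null for the new column
    ALTER TABLE sensor_readings ADD battery_level int;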
Cassandra
An example of a NoSQL database is Cassandra. Its main features include the following (a short example follows this list):
Data is stored in rows within tables
Tables are also referred to as column families
Each row has a unique value, called the primary key, that identifies it
Data is stored and partitioned by the primary key
Data can be accessed efficiently through the primary key
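To make these features concrete, here is a small illustrative CQL sketch; the table name and values are hypothetical and are not part of the project data:

    -- a column family whose rows are identified by a unique primary key
    CREATE COLUMNFAMILY example_readings (
        id int PRIMARY KEY,
        elevation int,
        slope int
    );

    -- write a row, then read it back through its primary key
    INSERT INTO example_readings (id, elevation, slope) VALUES (1, 2596, 3);
    SELECT elevation, slope FROM example_readings WHERE id = 1;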
Cassandra makes large-scale data handling possible and is a column-oriented NoSQL database. Weka 3.7.5 provides additional links to Cassandra through "savers" and "loaders", functionality adapted from the Cassandra output and input steps of the Kettle ETL toolkit. The "CassandraConverters" package provides this connectivity to the Weka software and can be installed easily from the built-in package manager. The basic graphical user interface of the CassandraLoader is also present in Weka 3.7.5; the enhanced graphical user interface shown below requires some alterations to the core for query editing and CassandraLoader configuration. Text streaming and mining are also possible (Mahbubul Majumder, 2013) (Mikut & Reischl, 2011): the Reuters texts can be loaded easily, and SGDText is a classifier that acts directly on raw string attributes.
The Cassandra database is started from the command prompt: type the cassandra command in cmd. This starts the server and creates the default Cassandra superuser. Next, open a new command prompt for creating the Cassandra keyspace.
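The installation directory is not stated in the text, so the path below is only an assumption; the command itself is the one described above:

    REM start the Cassandra server from its bin directory (path is an assumption)
    cd C:\apache-cassandra\bin
    cassandra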
Before working with Cassandra, check the Java and Python versions: the Java JDK version is 1.8 and the Python version is 2.7. Then type the cqlsh (Cassandra Query Language shell) command, which is used to execute Cassandra Query Language (CQL) statements.
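A brief sketch of these checks at the command prompt (the commands are standard; the versions reported should match those stated above):

    REM confirm the Java and Python versions required by Cassandra and cqlsh
    java -version
    python --version

    REM open the Cassandra Query Language shell
    cqlsh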
Create covtype keyspace

Create the new keyspace using the CREATE KEYSPACE command; here covtype is the keyspace name. The DESCRIBE command is then used to view the existing keyspaces.
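The replication settings are not stated in the text, so the sketch below assumes a single-node cluster using SimpleStrategy with a replication factor of 1:

    -- create the keyspace for this project (replication settings are assumed)
    CREATE KEYSPACE covtype
        WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};

    -- list existing keyspaces to confirm it was created
    DESCRIBE KEYSPACES;

    -- switch to the new keyspace before creating tables
    USE covtype;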
Create covtypetable
The screenshot above shows the table being created. The table name is covtypetable, and CREATE COLUMNFAMILY is the keyword used to create it. The table has many columns: id, Elevation, Aspect, Slope, Horizontal_Distance_To_Hydrology, Vertical_Distance_To_Hydrology, Horizontal_Distance_To_Roadways, Hillshade_9am, Hillshade_Noon, Hillshade_3pm, Horizontal_Distance_To_Fire_Points, Rawah_Wilderness_Area, Neota_Wilderness_Area, Comanche_Peak_Wilderness_Area,
Cachela_Poudre_Wilderness_Area, Soil_Type_2702, Soil_Type_2703, Soil_Type_2704, Soil_Type_2705, Soil_Type_2706, Soil_Type_2717, Soil_Type_3501, Soil_Type_3502, Soil_Type_4201, Soil_Type_4703, Soil_Type_4704, Soil_Type_4744, Soil_Type_4758, Soil_Type_5101, Soil_Type_5151, Soil_Type_6101, Soil_Type_6102, Soil_Type_6731, Soil_Type_7101, Soil_Type_7102, Soil_Type_7103, Soil_Type_7201, Soil_Type_7202, Soil_Type_7700, Soil_Type_7701, Soil_Type_7702, Soil_Type_7709, Soil_Type_7710, Soil_Type_7745, Soil_Type_7746, Soil_Type_7755, Soil_Type_7756, Soil_Type_7757, Soil_Type_7790, Soil_Type_8703, Soil_Type_8707, Soil_Type_8708, Soil_Type_8771, Soil_Type_8772, Soil_Type_8776 and Cover_Type.
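The column data types are not stated in the text, so the sketch below assumes integer columns throughout and uses id as the primary key; only the first few soil-type columns are written out, and the remaining Soil_Type_* columns up to Soil_Type_8776 follow the same pattern.

    -- create the covtypetable column family (types and primary key are assumptions)
    CREATE COLUMNFAMILY covtypetable (
        id int PRIMARY KEY,
        Elevation int,
        Aspect int,
        Slope int,
        Horizontal_Distance_To_Hydrology int,
        Vertical_Distance_To_Hydrology int,
        Horizontal_Distance_To_Roadways int,
        Hillshade_9am int,
        Hillshade_Noon int,
        Hillshade_3pm int,
        Horizontal_Distance_To_Fire_Points int,
        Rawah_Wilderness_Area int,
        Neota_Wilderness_Area int,
        Comanche_Peak_Wilderness_Area int,
        Cachela_Poudre_Wilderness_Area int,
        Soil_Type_2702 int,
        Soil_Type_2703 int,
        Soil_Type_2704 int,
        -- remaining Soil_Type_* columns omitted here for brevity
        Cover_Type int
    );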
Import covtype dataset
The COPY command is used to import the dataset: covtypetable is the table name and the dataset file is covtype.csv.

The screenshot above shows that the dataset was imported successfully, with 580339 rows loaded.
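The exact options used are not stated in the text; a minimal sketch of the import command, assuming the CSV file contains a header row and is located in the directory from which cqlsh was started:

    -- bulk-load the CSV file into the table (file location and HEADER option are assumptions)
    COPY covtypetable FROM 'covtype.csv' WITH HEADER = TRUE;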
Select query
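The query itself appears only in a screenshot; the statements below are illustrative examples of querying the imported table, assuming id is the primary key:

    -- look up a single observation through its primary key
    SELECT elevation, aspect, slope, cover_type FROM covtypetable WHERE id = 1;

    -- inspect the first few imported rows
    SELECT * FROM covtypetable LIMIT 10;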