
Machine Learning Research Paper 2022

Contents
1.0 Introduction
2.0 Machine learning implementation
2.1 Collaborative filtering
2.3 Logistic regression
2.4 K-Means
3.0 Conclusion
References

1.0 Introduction
Machine learning is a field related to computational statistics that enables
computers to learn on their own rather than through explicit programming. From
the estimation of insurance risk and home loans to self-driving cars, machine
learning has gained significant importance in data analytics (Aziz, Zaidouni &
Bellafkih, 2019). Various tools are available in the market for processing huge
amounts of data, and Spark is one of the fastest and most reliable tools for
computing streams of data. This report shows the utilization of PySpark, the
Python library for using Spark, to analyse data and gain meaningful insights.
The dataset used in this assignment is the Chocolate Bar Ratings dataset.
PySpark can solve parallel data processing problems and handle multiprocessing
complexities, for example distributing code and data and collecting output on a
cluster of machines (Asri, Mousannif & Moatassime, 2019). PySpark provides the
data scientist with an interface to Resilient Distributed Datasets (RDDs) and
many functions for performing machine learning computations in Python. Thus,
the complete analysis of the dataset is done using PySpark, along with Jupyter
Notebook as the tool.
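As a minimal sketch, a Spark session can be started inside the notebook as
follows; the application name used here is an illustrative assumption, not
taken from the notebook itself.

from pyspark.sql import SparkSession

# Start (or reuse) a local Spark session for the notebook.
# The app name "ChocolateAnalysis" is assumed for illustration.
spark = SparkSession.builder.appName("ChocolateAnalysis").getOrCreate()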
The chosen dataset has 9 columns in total, on which the entire analysis is
performed. The dataset has been attached with the report for reference. The
libraries used in the notebook are imported in its first section and used in
the rest of the code. Along with Spark, the seaborn and matplotlib libraries
are used in the notebook to visualize the dataset. KMeans,
LinearRegressionWithSGD and ALS are the machine learning classes imported in
the notebook, as sketched below.
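A sketch of these imports follows; the module paths assume the RDD-based MLlib
API, which is where KMeans, LinearRegressionWithSGD and ALS are provided.

# Plotting libraries used alongside Spark for visualization.
import seaborn as sns
import matplotlib.pyplot as plt

# Machine learning classes from the RDD-based MLlib API.
from pyspark.mllib.clustering import KMeans
from pyspark.mllib.regression import LinearRegressionWithSGD
from pyspark.mllib.recommendation import ALS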

Jupyter Notebooks are incredibly versatile, powerful and shareable, and their
main advantage is that data visualization can be performed in the same
environment without going anywhere else. In short, the environment allows
coding, running the code, looking at its outcome and analysing the results all
in one place. It has independent cells for writing different parts of the code,
running them and viewing the results independently.

As shown in the image below, the analysis of the dataset is performed in the
succeeding cells. Within the first cell (as shown in the image), the structure
of the CSV file is created. The structure is then passed to the read.csv
method. The .cache() method is used to store the DataFrame in cache, from which
it can later be retrieved. To treat the first row as the header, the parameter
"header" is set to "true". The definition of the structure also specifies the
data types according to which the fields of the DataFrame are imported.
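A minimal sketch of such a cell is given below. The file name and the column
names and types in the schema are assumptions for illustration; the exact
values appear only in the attached notebook.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

# Hypothetical structure of the CSV file; names and types are assumed.
schema = StructType([
    StructField("Company", StringType(), True),
    StructField("BarName", StringType(), True),
    StructField("REF", IntegerType(), True),
    StructField("ReviewDate", IntegerType(), True),
    StructField("CocoaPercent", StringType(), True),
    StructField("CompanyLocation", StringType(), True),
    StructField("Rating", DoubleType(), True),
    StructField("BeanType", StringType(), True),
    StructField("BeanOrigin", StringType(), True),
])

# Read the file with the first row as header, then cache the DataFrame
# so that later actions can retrieve it from memory.
df = spark.read.csv("chocolate_bar_ratings.csv", schema=schema, header=True)
df.cache()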
In the succeeding cell, the schema of the file that has been imported into the
PySpark DataFrame is displayed with the printSchema() method. In the image
below, the output shows the columns of the file along with their data types and
nullability details.
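Continuing from the DataFrame df created above, the call is simply:

# Print each column with its data type and whether it is nullable.
df.printSchema()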

The next cell, shown in the snapshot below, gives the total number of rows and
columns in the PySpark DataFrame. It can be seen that there are 1795 rows and
9 columns in the DataFrame.
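This can be reproduced with the count() action and the columns list,
continuing from the same DataFrame:

# count() is an action returning the number of rows; len(df.columns)
# gives the number of columns. The report finds (1795, 9).
print((df.count(), len(df.columns)))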
In the next cell, the data is prepared for further analysis by cleaning it.
Data cleaning is important to avoid unrealistic values in the analysis of the
data in graphs and models. As part of the cleaning operations, the missing
values are checked first, including the number of null values in specific
columns. The result of the data cleaning operation can be seen in the snapshot
below:
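A sketch of such a per-column null check, assuming the DataFrame df from the
earlier cells, might look like this:

from pyspark.sql.functions import col, count, when

# Count the null entries in every column of the DataFrame.
df.select([count(when(col(c).isNull(), c)).alias(c) for c in df.columns]).show()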
First, the rows in which the company column is null are dropped (because the
company name is important for a chocolate record, rows without a company name
are removed).
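Assuming the company column is named "Company", this step can be sketched as:

# Drop rows where the company name is missing, since a chocolate bar
# record without its maker is not useful for the analysis.
df = df.na.drop(subset=["Company"])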

