Machine Learning Application: Data Science Practice

Added on 2022-12-30


Title: Data Science Practice
Task: Machine Learning Application Using PySpark
Lecturer:
School:
Course Name:
Unit Code:
Due Date:

Machine Learning Application
1. Introduction
Several factors have been posited to affect the quality of manufactured
goods. However, the effect of specifications on the pricing of
technological products goes a long way towards defining every aspect of the
manufacturing of items such as phones, software, and personal computers.
In practice, factors such as the quality of a phone’s camera, the bandwidth of
its connectivity, and battery life alongside other factors have been theorized
to influence the price category that a phone falls in. Nevertheless, it is
prudent that both the customer and the manufacturer get value for the
product in focus. In line with this view, to achieve optimal value for both
sides, methods such as data mining and machine learning can be used to
enable the determination of the best price category that a phone falls in.
1.1 Objective
The objective of this study is to conduct an evaluation of the principles
underlying several machine learning methods.
2. Key System Concepts
2.1 Machine learning pipelines
To understand the concept of pipelines in machine learning (ML) practice, it
is important that we begin with the objectives of ML systems according to
Koen (2018), i.e.:
i. Reduction of latency
ii. Fairly integrated with sufficient coupling to other ML system parts such
as the GUI, data storage etc.
iii. Ability to be scaled
iv. Message driven
Ideally, pipelines involve collection of data, “...sending it through an
enterprise message bus and processing it to provide pre-calculated results
and guidance for next day’s operations” (Koen, 2018). Pipelines therefore
comprise two straightforward components: Online Model Analytics and
Offline Data Discovery.
2.1.1 Application of Pipelines to the original ML project
Data ingestion was carried out using PySpark, which ensured that the
dataset was processed separately from the rest of the code so as to utilize
the multiple server cores and processors, as provided for in the PySpark
documentation. Our application of pipelines included data extraction,
feature transformation, and feature selection.
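The three stages named above can be sketched as a chain of functions. This is
a minimal illustration in plain Python rather than PySpark, so that it runs
without a cluster; the column names, the sample rows, and the choice of
min-max scaling and a variance filter are all assumptions made for the
example, not details taken from the study.

```python
from statistics import pvariance

# Hypothetical rows: the column names and values are illustrative only.
RAW_ROWS = [
    {"battery_power": "842", "ram": "2549", "has_logo": "1", "price_range": "1"},
    {"battery_power": "1021", "ram": "2631", "has_logo": "1", "price_range": "2"},
    {"battery_power": "563", "ram": "2603", "has_logo": "1", "price_range": "2"},
    {"battery_power": "1821", "ram": "1411", "has_logo": "1", "price_range": "1"},
]
FEATURES = ["battery_power", "ram", "has_logo"]


def extract(rows):
    """Stage 1 (data extraction): parse raw strings into numbers."""
    return [{k: float(v) for k, v in row.items()} for row in rows]


def scale(rows):
    """Stage 2 (feature transformation): min-max scale each feature to [0, 1]."""
    out = [dict(row) for row in rows]
    for name in FEATURES:
        values = [row[name] for row in rows]
        lo, hi = min(values), max(values)
        span = hi - lo
        for row in out:
            # A constant column has no spread, so map it to 0.0.
            row[name] = (row[name] - lo) / span if span else 0.0
    return out


def select(rows, min_variance=1e-9):
    """Stage 3 (feature selection): drop features that never vary."""
    keep = [n for n in FEATURES if pvariance([r[n] for r in rows]) > min_variance]
    prepared = [
        {"features": [r[n] for n in keep], "label": int(r["price_range"])}
        for r in rows
    ]
    return prepared, keep


def run_pipeline(rows):
    """Apply the stages in order, each consuming the previous stage's output."""
    return select(scale(extract(rows)))


prepared, kept_features = run_pipeline(RAW_ROWS)
```

Here the constant `has_logo` column is dropped by the selection stage, and
each remaining row ends up as a feature vector plus a label, which is the
shape the later models expect.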
2.2 Collaborative filtering
In machine learning, collaborative filtering (CF) is among the most applied
techniques when it comes to recommender systems (Ricci, et al., 2011). It
generally has two senses, i.e., a narrow one and a more general one. In the
narrower sense, CF is used in defining processes that enable automatic
predictions (filtering). In this view, the assumption that underlies CF is
that, in the event that “...a person A has the same opinion as a person B on
an issue, A is more likely to have B's opinion on a different issue than that of
a randomly chosen person” (Wayback Machine, 2012).
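The assumption quoted above can be illustrated with a small user-based
sketch: A's unknown opinion is estimated from other people's opinions,
weighted by how closely they have agreed with A so far. The ratings, the
issue names, and the use of cosine similarity are assumptions chosen for the
example.

```python
from math import sqrt

# Hypothetical opinions (1 = dislikes ... 5 = likes) on hypothetical issues.
RATINGS = {
    "A": {"issue1": 5, "issue2": 1, "issue3": 4},
    "B": {"issue1": 5, "issue2": 1, "issue3": 4, "issue4": 5},
    "C": {"issue1": 1, "issue2": 5, "issue3": 2, "issue4": 1},
}


def cosine_similarity(u, v):
    """Cosine similarity over the issues both people have rated."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(u[i] ** 2 for i in shared))
    norm_v = sqrt(sum(v[i] ** 2 for i in shared))
    return dot / (norm_u * norm_v)


def predict(user, item, ratings):
    """Similarity-weighted average of other people's ratings for `item`."""
    num = den = 0.0
    for other, their in ratings.items():
        if other == user or item not in their:
            continue
        weight = cosine_similarity(ratings[user], their)
        num += weight * their[item]
        den += weight
    return num / den if den else None


sim_ab = cosine_similarity(RATINGS["A"], RATINGS["B"])
sim_ac = cosine_similarity(RATINGS["A"], RATINGS["C"])
predicted = predict("A", "issue4", RATINGS)
```

Because A has agreed with B on every shared issue, B's opinion on `issue4`
carries more weight than C's, and the prediction for A lands closer to B's
rating.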
2.2.1 Application of CF
The application of CF in this study involved the adoption of an Alternating
Least Squares (ALS) algorithm, which is traditionally used to decompose a
large matrix into lower-dimensional user factors and item factors. The
advantage of ALS lies in its ability to provide an alternative approach to
optimizing the loss function (Koehler, 2017): it alternates between fixing
the user factors and solving for the item factors, and vice versa. After
training the model on the data, we used the ALS model to predict the
price category in which a mobile phone can be classified.
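To make the decomposition concrete, the sketch below factorizes a tiny
ratings matrix with one factor per user and per item, solving a closed-form
least-squares problem for one side while the other is held fixed. It is a
simplified illustration, not the PySpark implementation used in the study;
the ratings, the rank of 1, and the regularization value `LAM` are
assumptions.

```python
# Hypothetical observed ratings: (user index, item index, rating).
OBSERVED = [(0, 0, 5.0), (0, 1, 3.0), (1, 0, 4.0),
            (1, 2, 1.0), (2, 1, 2.0), (2, 2, 1.0)]
N_USERS, N_ITEMS, LAM = 3, 3, 0.1


def loss(u, v):
    """Regularized squared error over the observed entries only."""
    err = sum((r - u[i] * v[j]) ** 2 for i, j, r in OBSERVED)
    return err + LAM * (sum(x * x for x in u) + sum(x * x for x in v))


def solve_side(fixed, n, side):
    """Least-squares update for one side while the other side is fixed.

    side=0 updates the user factors, side=1 updates the item factors.
    """
    updated = []
    for k in range(n):
        num = den = 0.0
        for entry in OBSERVED:
            if entry[side] == k:
                f = fixed[entry[1 - side]]
                num += entry[2] * f
                den += f * f
        updated.append(num / (LAM + den))
    return updated


def als_rank1(iterations=10):
    u, v = [1.0] * N_USERS, [1.0] * N_ITEMS
    history = [loss(u, v)]
    for _ in range(iterations):
        u = solve_side(v, N_USERS, side=0)  # fix items, solve for users
        v = solve_side(u, N_ITEMS, side=1)  # fix users, solve for items
        history.append(loss(u, v))
    return u, v, history


user_factors, item_factors, loss_history = als_rank1()
# Estimate for an entry that was never observed (user 0, item 2).
estimate = user_factors[0] * item_factors[2]
```

Each half-step minimizes the loss exactly for the side being updated, so
`loss_history` never increases from one iteration to the next.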
2.3 Logistic Regression (LR)
Unlike linear regression models, whose dependent variable is continuous,
logistic regression models are used in the prediction of nominal
response variables that have two or more categories. In modern parlance,
LR “...is viewed as a generalized linear model. The parameters for
the best fit model are estimated using maximum likelihood rather than least
squares” (Dransfield & Phil, 2013). The implementation of LR models
often takes one of two paths, i.e. the unconditional approach and the
conditional approach. The unconditional approach suffices in cases where the
number of degrees of freedom of a given model is relatively small compared
to the number of observations.
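The idea of estimating parameters by maximum likelihood can be sketched for
the simplest case of two categories and one feature. The data and the use of
plain gradient ascent with a fixed step size are assumptions for
illustration; library implementations typically use more sophisticated
optimizers.

```python
from math import exp, log

# Hypothetical one-feature data: x is a standardized feature,
# y is 1 for the higher price category and 0 otherwise.
X = [-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0]
Y = [0, 0, 0, 1, 0, 1, 1, 1]


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def log_likelihood(w, b):
    """Bernoulli log-likelihood of the labels under the logistic model."""
    total = 0.0
    for x, y in zip(X, Y):
        p = sigmoid(w * x + b)
        total += y * log(p) + (1 - y) * log(1.0 - p)
    return total


def fit(steps=500, lr=0.1):
    """Maximize the log-likelihood by gradient ascent."""
    w = b = 0.0
    for _ in range(steps):
        grad_w = sum((y - sigmoid(w * x + b)) * x for x, y in zip(X, Y))
        grad_b = sum(y - sigmoid(w * x + b) for x, y in zip(X, Y))
        w += lr * grad_w
        b += lr * grad_b
    return w, b


w_hat, b_hat = fit()
ll_start = log_likelihood(0.0, 0.0)
ll_end = log_likelihood(w_hat, b_hat)
```

The fitted parameters give a higher log-likelihood than the starting point,
which is the sense in which they are a "best fit" under maximum likelihood
rather than least squares.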
2.3.1 Application of LR
However, in cases where there are larger degrees of freedom, the
conditional approach must be used, as in our case, where the number of
observations was approximately 3000.
In application, we fitted a logistic regression model using a PySpark
library, with 2129 observations as the training set and 871 observations as
the test set. Moreover, to test the performance of the LR model, we examined
the area under the curve (AUC) alongside the model’s prediction accuracy.
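The two evaluation measures can be computed by hand as follows. This is a
sketch for the two-category case with made-up labels and scores; the 0.5
decision threshold and the pairwise-ranking formulation of the area under
the curve are assumptions for the example, not the study's reported
procedure.

```python
def accuracy(labels, scores, threshold=0.5):
    """Share of observations whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def auc(labels, scores):
    """Area under the ROC curve, computed as the probability that a random
    positive is scored above a random negative (ties count as half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical test-set labels and predicted probabilities.
y_true = [1, 0, 1, 1, 0, 0]
y_score = [0.9, 0.4, 0.6, 0.3, 0.2, 0.7]

test_accuracy = accuracy(y_true, y_score)
test_auc = auc(y_true, y_score)
```

Accuracy depends on the chosen threshold, whereas the area under the curve
depends only on how the scores rank positives against negatives, which is
why the two are usually reported together.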
2.4 K-Means
Categorized under unsupervised learning algorithms, K-Means is
implemented as a clustering model which follows a simple approach: assuming
k clusters fixed a priori, the algorithm defines k centers, one for each
cluster. To improve the performance of the model, the centers are placed as
far away from each other as possible, after which each data point is
associated with
the nearest center. Generally, the objective of the K-Means model is to
minimize an objective function known as the squared error function,
such that:
\[ J(V) = \sum_{j=1}^{c} \sum_{i=1}^{c_j} \lVert x_i - v_j \rVert^2 \]
where ||x_i - v_j|| is the Euclidean distance between data point x_i and
cluster center v_j, c_j is the number of data points in the jth cluster (the
inner sum runs over those points), and c is the number of cluster centers.
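The procedure described above (spread-out initial centers, assignment of
each point to its nearest center, and minimization of the squared error) can
be sketched in plain Python. The sample points and the farthest-point
initialization rule are assumptions for illustration; PySpark's K-Means uses
its own initialization scheme.

```python
# Hypothetical 2-D points forming three loose groups.
POINTS = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1),
          (9.0, 1.0), (9.1, 1.2), (8.9, 0.9)]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def init_far_apart(points, k):
    """Start from the first point, then repeatedly add the point that is
    farthest from the centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    return centers


def assign(points, centers):
    """Index of the nearest center for every point."""
    return [min(range(len(centers)), key=lambda j: sq_dist(p, centers[j]))
            for p in points]


def objective(points, centers, labels):
    """Sum of squared distances from each point to its assigned center."""
    return sum(sq_dist(p, centers[j]) for p, j in zip(points, labels))


def kmeans(points, k, iterations=10):
    centers = init_far_apart(points, k)
    for _ in range(iterations):
        labels = assign(points, centers)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    labels = assign(points, centers)
    return centers, labels, objective(points, centers, labels)


centers, labels, sse = kmeans(POINTS, k=3)
```

On this toy data the three groups are recovered and `sse` is the value of
the squared error function at the final centers.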
2.4.1 Application of K-Means
In our study, we applied the K-Means model with 3 initial clusters on the
“features” data which we had defined earlier during pipeline development and
data transformation. The model’s performance was evaluated using the squared
errors and the squared Euclidean distance.
3. Conclusion
Machine learning is a relatively extensive concept in the modern era of data
integration into real life. From artificial intelligence and business
intelligence to medical applications and many other fields, the application
of ML is crucial in enabling both organizations and individuals to reap the
vast benefits offered by data. Despite all the potential in ML, determining
the best model to apply to a given set of data is still a concern, a fact
that often prompts the application of a number of models from which the
optimum is examined and chosen, as in our case.
In conclusion, it is therefore crucial that the right preliminary steps are
adopted and the correct methods for the application of ML algorithms are
used if an organization is to benefit from the concept of machine learning.