Machine Learning With Python
Name of Student
Name of Class
Name of Professor
Name of School
State and City
Date
Abstract
The aim of this assignment is to analyze a suitable dataset using machine learning in Python. Computing has moved from an era in which computers were large mainframes to one in which they are offered as cloud services (Kuppusamy, Gopal and Prabhu, 2015). This, though, is not the most interesting part: the world of data science has produced remarkable capabilities for computers, giving data science specialists, data scientists and data analysts, work to do in different ways wherever data is involved. The growth of the data science world has brought into play several algorithms, which are run in different data software to help explain what datasets actually contain and what they portray. The analysis of datasets using algorithms is called machine learning. Machine learning is seen as a subset, or an application, of artificial intelligence (Steels and Brooks, 2018). Machine learning is the use of algorithms and statistical models by computers, which the computer then applies to execute specific computational tasks without needing explicit instructions (Parmar, Grossmann, Bussink, Lambin and Aerts, 2015). Artificial intelligence, on the other hand, is the ability of computers to perform tasks with intelligence comparable to that of humans (Russell and Norvig, 2016).
Machine learning comes in three main categories. In supervised learning, an algorithm is fitted with a dependent (response) variable that is predicted from a set of independent (predictor) variables in order to see the outcome. In unsupervised learning, there is no target variable to be determined; instead, the observations are clustered into different groups. In reinforcement learning, the computer is trained to act in a specific way within an environment and is forced to reason about and identify the activities found in that environment. Unsupervised learning is used by many tech firms, such as Google, to help deliver what people search for with ease, meaning the machine has been trained and knows what the user's device needs (Baydin, Pearlmutter, Radul and Siskind, 2018).
For this assignment, the machine learning algorithms I will use are one supervised learning algorithm, logistic regression, and one classification algorithm, a decision tree.
Introduction
In this section we give a thorough description of the dataset to be used. The requirement of the assignment was to use the movies dataset that was provided, or to choose any reliable, publicly available dataset. In this case I chose a dataset from a bank, the Universal Bank. The dataset has 5,000 observations; that is, the details of 5,000 of the bank's customers were to be studied. The variables are as follows. ID is the customer number given by the bank. Age is the age of each customer; from the dataset, all of them are adults. Experience is the number of years a person has been a customer of the bank; the younger the customer, the fewer the years, since younger customers naturally have less experience. The next variables are Income and ZIP Code. The Family variable is the number of people in the household that a customer comes from. CCAvg is the rate at which a customer's risks are going to be covered by the bank. Education is the highest level of institution that a customer has attended since elementary school. The Mortgage variable records whether a customer holds a mortgage loan, expressed as the value of the mortgage in hundreds of dollars. Personal Loan is a loan taken for personal use and upkeep. Securities are the stock investments that a person has with the bank. The last three variables, CD Account, Online and CreditCard, are the modes that customers use while banking.
Descriptive Statistics
A verbal analysis of the dataset only lets us understand what the variables mean. We therefore load the data into a Jupyter notebook to run both the group analysis and the individual analysis. Since the dataset is in Excel and saved in xlsx format, we first convert it to a CSV (comma-delimited) file; this is the format we use to load the dataset into Python's Jupyter Notebook. From the notebook we then import the pandas and os libraries (Howse, 2013). The dataset is loaded with a command of the form variable = pd.read_csv(r"C:\Users\Name of Device User\Desktop\Python Machine Learning\UniversalBank.csv"). Here, variable is the name under which the dataset is saved in the working directory, and the part that starts with C: is where the dataset is stored on my computer (VanderPlas, 2016). After this we are able to see the dataset, although only the first few rows are displayed.
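The following is a minimal sketch of the loading step described above; the file path is the placeholder path from the text, and the DataFrame name bank stands in for variable:

import os
import pandas as pd

# Placeholder path; substitute the location of UniversalBank.csv on your own machine
csv_path = r"C:\Users\Name of Device User\Desktop\Python Machine Learning\UniversalBank.csv"
bank = pd.read_csv(csv_path)   # read the comma-delimited file into a DataFrame

print(bank.head())             # shows the first five rows of the dataset
print(bank.shape)              # 5000 rows, one per customer, as described above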
Cleaning a dataset is always required in machine learning and other data science work. The reason is to make the data more workable: imagine using a dataset full of meaningless variables or missing entries; that would be worth avoiding. Looking at the cleaning code, the call used is variable.isnull(), run in Python after importing the numpy library. When run, it returns False for each and every entry in the rows and columns, which means there are no missing values in the dataset (McKinney, 2012). The dataset also contains only numerical values. Numerical values are the easiest to work with here, because they are what the machine learning routines in Python accept directly. Another way of checking for missing values is variable.isnull().any(), which reports missing values by column rather than for the entire dataset (Donnelly, 2014).
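A short sketch of the missing-value checks just described, assuming the DataFrame from the previous snippet is named bank:

import numpy as np   # imported as noted in the text, though pandas alone handles these checks

print(bank.isnull())         # entry-by-entry booleans: False where a value is present
print(bank.isnull().any())   # True for any column that contains a missing value
print(bank.isnull().sum())   # count of missing values per column (a handy extra summary)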
Individual descriptive statistics can be obtained by running the following codes. variable.median() gives the median of each column. variable.mean() gives the mean of each column. variable.max() gives the maximum value in each column. variable.std() gives the standard deviation of each column about its mean. variable.var() gives the variance of each column (Schutt and O'Neil, 2014). Descriptive statistics can also be produced for the entire dataset using variable.describe(), which returns the same values obtained in the individual analyses.
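Under the same assumption that the DataFrame is named bank, the column-wise statistics above correspond to the following calls:

print(bank.median())     # median of each column
print(bank.mean())       # mean of each column
print(bank.max())        # maximum value in each column
print(bank.std())        # standard deviation of each column
print(bank.var())        # variance of each column

print(bank.describe())   # summary statistics (count, mean, std, quartiles, max) for every column at once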
Machine Learning Models
Set of questions to be answered:
There is a set of two questions that I developed for my machine learning project.
1. How do Experience and Income affect a customer's ability to get a Personal Loan?
2. How do Experience, Education and Income affect a customer's decision to hold Securities with the bank?
The first question is answered with a logistic regression analysis in Python. In addition to the libraries already imported, we add the following: from scipy.stats import spearmanr, from pylab import rcParams, import seaborn as sb, import matplotlib.pyplot as plt, import sklearn, from sklearn.preprocessing import scale, from sklearn.linear_model import LogisticRegression, from sklearn.model_selection import train_test_split, from sklearn import metrics, and from sklearn import preprocessing (Raschka, 2015).
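Collected into a single cell, those imports would look roughly as follows (note that the Spearman correlation function in scipy.stats is spelled spearmanr):

from scipy.stats import spearmanr            # Spearman rank correlation
from pylab import rcParams                   # plot-size configuration
import seaborn as sb
import matplotlib.pyplot as plt

import sklearn
from sklearn.preprocessing import scale
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn import metrics
from sklearn import preprocessing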
We answer the first question using logistic regression. Here we use a binary dependent variable against a set of independent variables and predict the probability of occurrence by fitting the data to a logistic model. Before developing the actual algorithm code, we check that the predictor variables are independent of one another. Note that a dataset with at least 50 observations is needed for reliable results. To check the independence of the variables we visualize them on a plot; the code that produces it is of the form sb.regplot(x='Experience', y='Income', data = variable, scatter = True). The resulting plot lets us inspect the independence of the variables, as sketched below.
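The plot itself is not reproduced here, but a hedged sketch of the independence check and one possible model fit for question 1 follows; the column names Experience, Income and Personal Loan come from the dataset description, the DataFrame name bank is assumed from the earlier snippets, and the train/test split proportions are illustrative choices rather than the assignment's own:

# Visual check of how the two predictors relate to each other
sb.regplot(x='Experience', y='Income', data=bank, scatter=True)
plt.show()

# Spearman rank correlation gives a numeric version of the same check
rho, p_value = spearmanr(bank['Experience'], bank['Income'])
print(rho, p_value)

# One possible logistic regression fit for question 1 (column names assumed)
X = bank[['Experience', 'Income']]
y = bank['Personal Loan']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

model = LogisticRegression()
model.fit(X_train, y_train)
print(metrics.accuracy_score(y_test, model.predict(X_test)))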