Modeling and Predicting NYC Real Estate Sales Prices: Project Report

Running head: MODELING & COMPUTING TECHNIQUES
Modeling & Computing Techniques
Student's Name:
Student ID:
University Name:
Paper code:
Executive Summary
Machine learning and artificial intelligence are considered among the leading and most powerful
technologies in use today. Most importantly, their full potential has not yet been seen, because
such systems have the ability to learn automatically from historical data and past experience.
Machine learning technology is generally used to transform information into knowledge.
Machine learning models gather useful information, uncover the hidden patterns inside the data,
and make decisions based on the data with minimal human involvement.
The dataset used in the analysis contains information on every building (home,
apartment, etc.) sold on the New York City property market over a period of 12 months. The
dataset covers five different boroughs, and a total of 84,548 records are present in the data file.
The model used for predicting the sale price is an artificial neural network built with deep
learning. Specifically, the KerasRegressor algorithm is trained on the training dataset and then
evaluated on the test dataset to see how well it predicts the values of the target variable. It can
be said that deep learning and neural network models provide better prediction rates than
other models.
In the analysis, thorough data exploration, visualization and, at the end, prediction have
been performed to gain in-depth knowledge of the dataset. A proper machine learning model with
Keras layers and TensorFlow as the backend has been developed using neural network
techniques. At the end, conclusions are drawn on how well the model predicts the sale price,
and the hidden patterns and information found in the data are summarized.
Table of Contents
Executive Summary
Introduction
Discussion
Introduction and observation of the dataset
The proposed model for price prediction
Conclusion
References
Appendix
Introduction
Machine learning and artificial intelligence are considered among the leading and most
powerful technologies in use today (Alpaydin, 2020). Most importantly, their full potential has
not yet been seen, because such systems have the ability to learn automatically from historical
data and past experience. Machine learning technology is generally used to transform
information into knowledge (Bishop, 2006). Machine learning models find the hidden patterns
inside the data and make decisions based on the data with minimal human involvement
(Moolayil, 2019). Machine learning algorithms fall into two main categories: supervised and
unsupervised learning (Brownlee, 2016).
In supervised learning the inputs are known and the dataset contains labelled data with
known outputs, whereas in unsupervised learning the inputs are known but the dataset contains
unlabelled data with unknown outputs (Campesato, 2020). In this analysis the target variable is
the sale price attribute, and the goal is to predict the sale price with an artificial neural network
using deep learning methods (Chernick, 1998).
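As a minimal illustration of this supervised setup (the column names below are hypothetical stand-ins for the NYC attributes, not the real file), the labelled data is separated into known inputs and the known target:

```python
# Toy supervised-learning setup: every record pairs known inputs
# with a known output (the label), here the sale price.
rows = [
    {"gross_sqft": 1200, "total_units": 2, "sale_price": 550_000},
    {"gross_sqft": 800,  "total_units": 1, "sale_price": 410_000},
    {"gross_sqft": 2500, "total_units": 4, "sale_price": 1_250_000},
]

X = [[r["gross_sqft"], r["total_units"]] for r in rows]  # inputs
y = [r["sale_price"] for r in rows]  # labelled target to be predicted
```

In unsupervised learning only X would exist; there would be no labelled y column to learn from.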
Deep learning is a subfield of machine learning that consists of network-based layers and
is capable of learning from unsupervised data, which is generally unstructured and unlabelled
(Daniel, 2013). Different kinds of layers can be used to build a neural network model; for this
analysis only dense layers have been used to build the artificial neural network model
(Dietterich, 1997).
The accuracy and performance of a model also depend on the data. If the data contains
many missing or null values, the model will not be able to predict properly, as the data is not a
good fit for the model (Géron, 2019). The cleaner the data, the more accurately the model will
predict the target variable. It has been observed that deep learning gives more accurate results
than older learning algorithms (Mitchell, 1997).
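The point about data quality can be checked directly: count the nulls per attribute and keep only complete records before training. A minimal sketch in plain Python, on made-up records rather than the real sales file:

```python
# Toy records with the kind of gaps found in raw sales data
# (None stands in for a blank or null cell).
records = [
    {"SALE PRICE": 550_000,   "YEAR BUILT": 1920},
    {"SALE PRICE": None,      "YEAR BUILT": 1955},
    {"SALE PRICE": 1_250_000, "YEAR BUILT": None},
]

# Nulls per attribute, then only the complete rows survive.
nulls = {k: sum(r[k] is None for r in records) for k in records[0]}
clean = [r for r in records if all(v is not None for v in r.values())]
```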
Discussion
Introduction and observation of the dataset
Exploring the attributes of the dataset:
1. BOROUGH: This attribute consists of 5 classes, the five boroughs in which the
properties were sold: 1 for 'Manhattan', 2 for 'Bronx', 3 for 'Brooklyn', 4 for 'Queens'
and 5 for 'Staten Island'. These values should be treated as categorical.
2. NEIGHBORHOOD: This attribute gives the neighborhood name for each property, as
designated by the Department of Finance assessors. There may be small differences
between these names and common usage, and a few sub-neighborhoods may not be
included. The values of this attribute are categorical.
3. BUILDING CLASS CATEGORY: This attribute identifies similar properties in the
Rolling Sales files without having to look at individual building classes (Norris, 2020).
The data files are stored by Neighborhood, Block, Borough, Building Class Category
and Lot. The values of this attribute are categorical.
4. TAX CLASS AT PRESENT: There are 4 tax classes (1, 2, 3 and 4), assigned to each
property in the city based entirely on the use of the property. This attribute consists of
categorical values.
Class 1: Includes most one-, two- and three-family houses, possibly with a small store
or office, and vacant land zoned for residential use; buildings in this class must not
exceed three stories.
Class 2: Covers properties that are primarily residential, mainly the condominiums and
the cooperatives.
Class 3: Includes properties equipped and owned by telephone, gas and electric
companies.
Class 4: Includes everything not covered by classes 1, 2 and 3, mainly factories,
garages, offices, warehouses and many more.
5. BLOCK and LOT: The tax block is a subdivision of the borough. Together, block and
lot distinguish one unit of real property from another, such as the different
condominiums in a single building (Yao, 1999). The tax lot represents the unique
location of a property and is a subdivision of a tax block. Treating block as categorical
does not make sense, as there are around 11k unique blocks in the dataset; hence both
block and lot are used as numerical attributes for the analysis.
6. BUILDING CLASS AT PRESENT: This attribute describes the constructive use of a
property. The first letter gives the broad class of the property; for example, "A"
signifies one-family homes, "O" signifies office buildings and "R" signifies
condominiums (Michie, Spiegelhalter & Taylor, 1994). A number in the second
position refines the class: "A0" is a Cape Cod style one-family home, "O4" is a
tower-type office building and "R5" is a commercial condominium unit. The attribute
is categorical, as each property is given a unique code.
7. ADDRESS: The street address of the property as listed in the sales file. The apartment
number is included in the address field for co-op sales.
8. ZIP CODE: The postal code of each property. This variable should be categorical.
9. RESIDENTIAL UNITS: The total number of residential units listed for each property.
This variable should be numeric.
10. COMMERCIAL UNITS: The total number of commercial units listed for each
property. This variable should be numeric.
11. TOTAL UNITS: The total number of units listed for each property. This variable
should be numeric.
12. LAND SQUARE FEET: The total land area of the property measured in square feet.
This attribute should be numeric.
13. GROSS SQUARE FEET: The total measured area of the building, including the
exterior surface of the outside walls; outside space is also taken into consideration.
This attribute will be numeric.
14. YEAR BUILT: The year in which the property was built. The values of this attribute
will be categorical.
15. TAX CLASS AT TIME OF SALE and BUILDING CLASS AT TIME OF SALE:
Both of these attributes will be categorical.
16. SALE PRICE: This variable should be numeric.
17. SALE DATE: This variable should be datetime. However, the "year" or "month" part
can be saved as a new categorical variable.
18. EASEMENT: This attribute indicates a right attached to the property: it identifies an
entity that has limited rights to use another's property.
The dataset contains many blank spaces and null values, which are not good for any
model to process. Thus data cleaning and pre-processing need to be performed in order to
obtain a cleaner dataset to work on.
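One concrete pre-processing step from the list above is item 17: parsing SALE DATE and saving the year and month as new categorical variables. A minimal pandas sketch with made-up dates, not the real file:

```python
import pandas as pd

# Made-up sale dates; errors='coerce' turns unparseable cells into NaT.
dates = pd.to_datetime(pd.Series(["2017-01-05", "2017-06-30"]), errors="coerce")

sale_year = dates.dt.year.astype("category")
sale_month = dates.dt.month.astype("category")
```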
Figure 1: Distribution of sales over the year
Figure 1 represents the trend of sale price over the specific time period.
Figure 2: Average SALE PRICE on each BOROUGH
Figure 2 depicts the average sale price for the 5 different boroughs. It can be seen that
Manhattan has the highest average sale price, whereas Staten Island has the lowest average
sale price throughout the year (Zirilli, 1996).
Figure 3: Sales per months
Different kinds of analysis can be used to find patterns and useful information in a
dataset. Figure 3 shows the total sales count per month across all the boroughs.
The dataset contains null values, missing values and duplicate values, which need to be
removed to get a closer look at the data (Zurada, 1992). Also, using a correlation matrix, less
important attributes that carry little weight in the dataset were deleted (Gulli & Pal, 2017).
After the analysis and visualization, the pre-processed data is tweaked and split into
training and testing sets, which are used to feed the model (Jain, Mao & Mohiuddin, 1996).
The proposed model for price prediction
One of the most popular libraries for deep learning is Keras; it is widely used to build
neural networks due to its simplicity and ease of use (Hassoun, 1995). Keras is a high-level
Python neural network library that runs on top of TensorFlow (Ketkar, 2017). TensorFlow has
been used as the backend of the neural network during the model build-up (Limsombunchai,
2004).
Various layers can be used to build an artificial neural network, but for this particular
analysis only dense layers have been used, with the model wrapped in the KerasRegressor
estimator (Liu, Yang & El Gamal, 2017). The summary of the model, with the total number of
parameters, is shown below:
Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 18)                342
_________________________________________________________________
dense_2 (Dense)              (None, 18)                342
_________________________________________________________________
dense_3 (Dense)              (None, 18)                342
_________________________________________________________________
dense_4 (Dense)              (None, 1)                 19
=================================================================
Total params: 1,045
Trainable params: 1,045
Non-trainable params: 0
The above is the summary of the model used for training and testing on the dataset. The
total number of parameters is 1,045; keeping the parameter count low keeps the model simple
and quick to train.
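The parameter counts in the summary can be reproduced by hand: a dense layer with i inputs and u units holds u × i weights plus u biases. A quick check:

```python
def dense_params(inputs, units):
    """Weights (units * inputs) plus one bias per unit."""
    return units * inputs + units

# (inputs, units) for each dense layer in the summary above:
# 18 features in, three hidden layers of 18 units, one output unit.
layers = [(18, 18), (18, 18), (18, 18), (18, 1)]
per_layer = [dense_params(i, u) for i, u in layers]
total = sum(per_layer)
```

Here per_layer comes out as [342, 342, 342, 19] and total as 1,045, matching model.summary().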
The dense layer is a fully connected layer, which means that every neuron in the layer is
connected to every neuron in the next layer (Marsland, 2015). It should also be taken into
consideration that for a regression task, accuracy is not the best way to judge the performance
of the model. An error function can be used to judge the model instead: the lower the error
rate, the better the performance of the model.
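Two common error functions for regression are the mean absolute error and the root mean squared error; a minimal sketch on made-up prices (not the model's actual outputs):

```python
import math

y_true = [500_000, 750_000, 1_200_000]  # made-up actual prices
y_pred = [520_000, 700_000, 1_150_000]  # made-up predictions

n = len(y_true)
# Mean absolute error: average size of the miss, in dollars.
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
# Root mean squared error: penalizes large misses more heavily.
rmse = math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)
```

The lower both values are, the better the model fits the data.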
Figure 4: Actual price vs. predicted price
Figure 4 shows the actual vs. predicted prices, which are stored in separate variables in
the form of lists (Mehrotra, Mohan & Ranka, 1997). Thus, the plotting has been performed
using different index positions for each graph. It can also be seen that the actual prices are
much higher than the predicted prices, as the graph shows huge spikes for the actual sale
prices.
Figure 5: Scatter plot of predicted price against the actual price
Conclusion
From the above analysis and results it can be concluded that the given dataset is not
clean, which is why a great deal of data cleaning and pre-processing has been performed.
Various findings have also been shown using different visualization functions. Although the
data was not ideal to feed into any machine learning model, a KerasRegressor with dense
layers has nevertheless been built to predict the sale price.
Accuracy is not a good measurement for a regression algorithm; looking at the error
rate instead tells how well the model has performed. In the discussion portion of the report,
various conclusions have been drawn with respect to the different graphs, and various
analyses have been performed to gain in-depth knowledge of the dataset.
A major improvement would be to use cross-validation after the model has been built
and the estimator fitted. Error rates also need to be calculated in order to see how much the
process reduced the error compared with previous runs. Finally, different layer configurations
need to be tried to test how well a newly designed model works with the dataset.
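The cross-validation suggested above amounts to splitting the row indices into k folds, holding one fold out for testing in each round, and averaging the fold errors. A minimal, library-free index-splitting sketch (in practice sklearn's KFold would do this):

```python
def kfold_indices(n_samples, k):
    """Yield (train, test) index lists for k contiguous folds."""
    fold = n_samples // k
    indices = list(range(n_samples))
    for i in range(k):
        # the last fold absorbs any leftover rows
        start = i * fold
        stop = (i + 1) * fold if i < k - 1 else n_samples
        test = indices[start:stop]
        train = indices[:start] + indices[stop:]
        yield train, test
```

Each (train, test) pair would drive one fit/predict round of the estimator, and the per-fold error rates would then be averaged.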
References
Alpaydin, E. (2020). Introduction to machine learning. MIT press.
Bishop, C. M. (2006). Pattern recognition and machine learning. Springer.
Brownlee, J. (2016). Deep learning with Python: develop deep learning models on Theano and
TensorFlow using Keras. Machine Learning Mastery.
Campesato, O. (2020). Artificial Intelligence, Machine Learning, and Deep Learning. Stylus
Publishing, LLC.
Chernick, H. (1998). Fiscal capacity in New York: The city versus the region. National Tax
Journal, 531-540.
Daniel, G. (2013). Principles of artificial neural networks (Vol. 7). World Scientific.
Dietterich, T. G. (1997). Machine-learning research. AI magazine, 18(4), 97-97.
Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow:
Concepts, Tools, and Techniques to Build Intelligent Systems. O'Reilly Media.
Gulli, A., & Pal, S. (2017). Deep learning with Keras. Packt Publishing Ltd.
Hassoun, M. H. (1995). Fundamentals of artificial neural networks. MIT press.
Jain, A. K., Mao, J., & Mohiuddin, K. M. (1996). Artificial neural networks: A tutorial.
Computer, 29(3), 31-44.
Ketkar, N. (2017). Introduction to keras. In Deep learning with Python (pp. 97-111). Apress,
Berkeley, CA.
Limsombunchai, V. (2004, June). House price prediction: hedonic price model vs. artificial
neural network. In New Zealand agricultural and resource economics society conference
(pp. 25-26).
Liu, X., Yang, D., & El Gamal, A. (2017, October). Deep neural network architectures for
modulation classification. In 2017 51st Asilomar Conference on Signals, Systems, and
Computers (pp. 915-919). IEEE.
Marsland, S. (2015). Machine learning: an algorithmic perspective. CRC press.
Mehrotra, K., Mohan, C. K., & Ranka, S. (1997). Elements of artificial neural networks. MIT
press.
Michie, D., Spiegelhalter, D. J., & Taylor, C. C. (1994). Machine learning. Neural and Statistical
Classification, 13(1994), 1-298.
Mitchell, T. M. (1997). Machine learning.
Moolayil, J. (2019). Learn Keras for Deep Neural Networks. Apress.
Norris, D. J. (2020). Predictions using ANNs and CNNs. In Machine Learning with the
Raspberry Pi (pp. 387-451). Apress, Berkeley, CA.
Yao, X. (1999). Evolving artificial neural networks. Proceedings of the IEEE, 87(9), 1423-1447.
Yegnanarayana, B. (2009). Artificial neural networks. PHI Learning Pvt. Ltd.
Zirilli, J. S. (1996). Financial prediction using neural networks. International Thomson Computer
Press.
Zurada, J. M. (1992). Introduction to artificial neural systems (Vol. 8). St. Paul: West.
Appendix
# importing all the necessary libraries
import numpy as np
import pandas as pd
from keras.models import Sequential
from keras.layers import Dense
from keras.wrappers.scikit_learn import KerasRegressor
from sklearn.preprocessing import StandardScaler
from matplotlib import pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.metrics import accuracy_score
# Loading the dataset
df=pd.read_csv('dataset.csv')
# Showing the top 10 data of the dataset
df.head(10)
df.info() # Data information and type
df.describe() # Statistical information of the data
df1=df.copy()
# First let's remove irrelevant columns:
df.drop(["Unnamed: 0"], axis=1, inplace=True)
df.head()
# constructing the date time variable
df['SALE DATE'] = pd.to_datetime(df['SALE DATE'], errors='coerce')
df['sale_year'] = pd.DatetimeIndex(df['SALE DATE']).year.astype("category")
df['sale_month'] = pd.DatetimeIndex(df['SALE DATE']).month.astype("category")
pd.crosstab(df['sale_month'], df['sale_year'])
# constructing the numerical variables:
numeric = ["RESIDENTIAL UNITS", "COMMERCIAL UNITS", "TOTAL UNITS",
           "LAND SQUARE FEET", "GROSS SQUARE FEET", "SALE PRICE"]
for col in numeric:
    df[col] = pd.to_numeric(df[col], errors='coerce')  # coercing errors to NAs
# constructing the categorical variables:
categorical = ["BOROUGH", "NEIGHBORHOOD", "BUILDING CLASS CATEGORY",
               "TAX CLASS AT PRESENT", "BUILDING CLASS AT PRESENT",
               "ZIP CODE", "TAX CLASS AT TIME OF SALE"]
for col in categorical:
    df[col] = df[col].astype("category")
# getting the sum of null values for each attribute
df.isna().sum()
df.replace(' ', np.nan, inplace=True)  # Replacing blank spaces with NaN
# Visualizing the missing values using a heatmap
plt.figure(figsize=(10,7))
sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='OrRd_r')
# Dropping useless attributes
df.drop(["EASE-MENT", "APARTMENT NUMBER"], axis=1, inplace=True)
df = df.dropna()
# finally check if there are any duplicated rows:
sum(df.duplicated())
# Dropping the duplicate rows
df.drop_duplicates(inplace=True)
#Capture necessary columns
variables = df.columns
count = []
for variable in variables:
    length = df[variable].count()
    count.append(length)
# Plot number of available data per variable
plt.figure(figsize=(30,6))
sns.barplot(x=variables, y=count)
plt.title('Available data in percent', fontsize=15)
plt.show()
# Sale price according to sale date
df.groupby('SALE DATE').agg({'SALE PRICE': ['sum']}).plot(figsize=(28,12))
df2= df[(df['SALE PRICE']>10000) & (df['SALE PRICE']<10000000)].copy()
plt.figure(figsize=(12,6))
sns.distplot(df2['SALE PRICE'], kde=True, bins=50, rug=True,color='#D0DB24')
plt.show()
df2= df2[(df2['SALE PRICE']<4000000)]
plt.figure(figsize=(12,6))
sns.distplot(df2['SALE PRICE'], kde=True, bins=50, rug=True,color='g')
plt.show()
# Plotting according to YEAR BUILT
df3=df2[df2['YEAR BUILT']!=0].copy()
plt.figure(figsize=(12,6))
sns.distplot(df3['YEAR BUILT'], bins=50, rug=True,color="r")
plt.show()
# Plotting according to TOTAL UNITS
df4=df3[df3['TOTAL UNITS']!=0].copy()
plt.figure(figsize=(12,6))
sns.distplot(df4['TOTAL UNITS'], bins=50, rug=True,color='#BE19EE')
plt.show()
# Converting the numeric codes to the proper borough names
# 1:'Manhattan', 2:'Bronx', 3:'Brooklyn', 4:'Queens', 5:'Staten Island'
df4['BOROUGH'] = df4['BOROUGH'].map({1:'Manhattan', 2:'Bronx', 3:'Brooklyn',
                                     4:'Queens', 5:'Staten Island'})
df4.head()
plt.figure(figsize=(12,5))
#Plot the data and configure the settings
#CountPlot --> histogram over a categorical, rather than quantitative, variable.
plt.title('Counting number of BOROUGH')
sns.countplot(x='BOROUGH',data=df4)
# Plotting Average SALE PRICE on each BOROUGH
df_bar = df4[['BOROUGH', 'SALE PRICE']].groupby(by='BOROUGH').mean().sort_values(
    by='SALE PRICE', ascending=True).reset_index()
plt.figure(figsize=(10,8))
sns.barplot(x='BOROUGH', y='SALE PRICE', data=df_bar)
plt.title('Average SALE PRICE on each BOROUGH')
plt.show()
# Plotting box plot for SALE PRICE on each BOROUGH to find if outliers are present or not
plt.figure(figsize=(12,6))
sns.boxplot(x = 'BOROUGH', y = 'SALE PRICE', data = df4 )
plt.title('Box plots for SALE PRICE on each BOROUGH')
plt.show()
# Plotting Count Sales by each month
df5 = df4[['sale_month', 'SALE PRICE']].groupby(by='sale_month').count().sort_values(
    by='sale_month', ascending=True).reset_index()
df5.columns.values[1] = 'Sales_count'
plt.figure(figsize=(12,6))
sns.barplot(x='sale_month', y='Sales_count', data=df5)
plt.title('Count Sales by each month')
plt.show()
# Plotting Commercial Units vs Sale Price
dataset = df4[(df4['COMMERCIAL UNITS'] < 20) & (df4['TOTAL UNITS'] < 50) &
              (df4['SALE PRICE'] < 5000000) & (df4['SALE PRICE'] > 100000) &
              (df4['GROSS SQUARE FEET'] > 0)]
plt.figure(figsize=(10,6))
sns.boxplot(x='COMMERCIAL UNITS', y="SALE PRICE", data=dataset)
plt.title('Commercial Units vs Sale Price')
# Plotting Residential Units vs Sale Price
plt.figure(figsize=(10,6))
sns.boxplot(x='RESIDENTIAL UNITS', y='SALE PRICE', data=dataset)
plt.title('Residential Units vs Sale Price')
plt.show()
# Plotting Quantity of properties sold by year built
plt.figure(figsize=(10,6))
plotd=sns.countplot(x=dataset[dataset['YEAR BUILT']>1900]['YEAR BUILT'])
#plotd.set_xlim([1900, 2020])
plt.tick_params(labelbottom=False)
plt.xticks(rotation=30)
plt.title("Quantity of properties sold by year built")
plt.show()
# Generate a column season
def get_season(x):
    # (x % 12 + 3) // 3 maps Dec-Feb to 1, Mar-May to 2,
    # Jun-Aug to 3 and Sep-Nov to 4
    if x == 1:
        return 'Winter'
    elif x == 2:
        return 'Spring'
    elif x == 3:
        return 'Summer'
    elif x == 4:
        return 'Fall'
    else:
        return ''
dataset['seasons'] = dataset['SALE DATE'].apply(lambda x: x.month)
dataset['seasons'] = dataset['seasons'].apply(lambda x: (x % 12 + 3) // 3)
dataset['seasons'] = dataset['seasons'].apply(get_season)
plt.figure(figsize=(20,25))
df_wo = dataset
sns.relplot(x="BOROUGH", y="SALE PRICE", hue='seasons', kind="line",
            data=df_wo, legend='full')
df4['SALE DATE'] = df1['SALE DATE'].apply(lambda x: int(x[:4]+x[5:7]+x[8:10]))
df4['SALE DATE'] = df4['SALE DATE'].astype(int)
df4 = df4[df4['SALE PRICE'] != 0]
# Taking the important attributes of the dataset
X = df4[['BOROUGH', 'NEIGHBORHOOD', 'BUILDING CLASS CATEGORY',
         'TAX CLASS AT PRESENT', 'BLOCK', 'LOT', 'BUILDING CLASS AT PRESENT',
         'ADDRESS', 'ZIP CODE', 'RESIDENTIAL UNITS', 'COMMERCIAL UNITS',
         'TOTAL UNITS', 'LAND SQUARE FEET', 'GROSS SQUARE FEET', 'YEAR BUILT',
         'TAX CLASS AT TIME OF SALE', 'BUILDING CLASS AT TIME OF SALE',
         'SALE DATE']].values
y = df4['SALE PRICE'].values
# Labeling all the string values of the specific attributes
labelencoder_X_0 = LabelEncoder()
X[:, 0] = labelencoder_X_0.fit_transform(X[:, 0])
labelencoder_X_1 = LabelEncoder()
X[:, 1] = labelencoder_X_1.fit_transform(X[:, 1])
labelencoder_X_2 = LabelEncoder()
X[:, 2] = labelencoder_X_2.fit_transform(X[:, 2])
labelencoder_X_3 = LabelEncoder()
X[:, 3] = labelencoder_X_3.fit_transform(X[:, 3])
labelencoder_X_6 = LabelEncoder()
X[:, 6] = labelencoder_X_6.fit_transform(X[:, 6])
labelencoder_X_7 = LabelEncoder()
X[:, 7] = labelencoder_X_7.fit_transform(X[:, 7])
labelencoder_X_16 = LabelEncoder()
X[:, 16] = labelencoder_X_16.fit_transform(X[:, 16])
# Splitting the training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2)
# Feature Scaling
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(18, input_dim=18, kernel_initializer='normal', activation='relu'))
    model.add(Dense(18, kernel_initializer='uniform', activation='relu'))
    model.add(Dense(18, kernel_initializer='uniform', activation='relu'))
    model.add(Dense(1, kernel_initializer='normal'))
    # Compile model
    model.compile(loss='mean_squared_error', optimizer='adam')
    return model

x = baseline_model()
x.summary()
# Fitting to the training set
estimator = KerasRegressor(build_fn=baseline_model, epochs=100, batch_size=10,
                           verbose=False)
estimator.fit(X_train, y_train)
prediction = estimator.predict(X_test)
# Visualizing the results and evaluation
n, length = 5, len(prediction)
sns.set_style('darkgrid', {'axes.facecolor': 'black'})
f, axes = plt.subplots(n, 1, figsize=(20,50))
for i in range(n):
    # slice the i-th of n equal chunks; the last chunk takes any remainder
    start = round(length / n * i)
    stop = length if i == n - 1 else round(length / n * (i + 1))
    plt.sca(axes[i])
    plt.plot(y_test[start:stop], color='#19E3EE', label='Real Price')
    plt.plot(prediction[start:stop], color='#EE1966', label='Predicted Price')
    if i == 0:
        plt.title('NYC Property Price Prediction', fontsize=30)
    plt.ylabel('Price', fontsize=20)
    plt.legend(loc=1, prop={'size': 10})
plt.show()
df_n = pd.DataFrame(list(zip(y_test.astype(int), prediction.astype(int))),
                    columns=['Actual Price', 'Predicted Price'])
df_n.head(10)