ICT707 Big Data Project: Analysis of Bike Sharing Regression Models

Added on 2023/06/03

AI Summary

This project focuses on analyzing a bike sharing dataset using various regression models to predict bike usage patterns. The assignment involves data preprocessing, including handling categorical variables and scaling target variables. Three regression models are implemented: Decision Tree, Gradient Boosted Tree, and Linear Regression. The project includes the implementation of each algorithm, the application of the algorithms to the provided bike-sharing dataset, and the analysis of the results. The student utilizes Python and relevant libraries like Pandas, Scikit-learn, and Matplotlib to build, train, and evaluate the models. The project aims to address the demand imbalance problem in bike-sharing systems by predicting bike rental demand.

DATA SCIENCE PRACTICE
Student name
Professor’s name
Affiliation
Date

Table of Contents
Introduction
Meeting the Demand Imbalance Problem
Dataset Description
Scaling target variables
Building of Regression Models
PART 1
PART TWO
PART 3
Building a linear regression model
Conclusion

Introduction
Bike sharing systems improve on traditional bike rental: members register once to obtain a
membership, after which they can hire bikes and return them.
In modern bike sharing systems this process has been digitized. Such systems are being adopted
widely in big cities throughout the world because bikes provide cheap transport over short
distances between neighborhoods. However, managing a bike sharing system poses challenges. The
major constraint is the rebalancing of bicycles. An imbalance arises when customers create an
asymmetrical demand pattern, so for the system to function well the bicycles have to be
rebalanced at each bicycle center. Machine learning algorithms can help solve the resulting
routing challenges, especially during rush hour.
For a bike sharing system to perform reliably, responsive clustering frameworks need to be
deployed to predict the over-demand pattern in bike requests.
Meeting the Demand Imbalance Problem
To rebalance bikes effectively, the target inventory levels have to be predicted accurately. In
this assignment, three regression models are applied to the bike sharing dataset from Kaggle, as
provided in the assignment paper. The algorithms are as follows:
i. Decision tree algorithm
ii. Linear regression algorithm
iii. Gradient boosting algorithm
Dataset Description
The dataset is supplemented with seasonal and weather-related data. It originates from scholarly
work done at the University of Porto, in a paper written by Hadi Fanaee-T and João Gama.
The dataset was obtained from the UCI data repository at the following web link:
http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset . It contains hourly data and daily
data, with column names as headers. The headers are shown below in the output from the coding
section of this assignment:

data_path = 'C:/Users/ROSANA/Desktop/bike/hour.csv'
rides = pd.read_csv(data_path)
rides.head()
   instant      dteday  season  yr  mnth  hr  holiday  weekday  workingday  weathersit  temp   atemp   hum  windspeed  casual  registered  cnt
0        1  2011-01-01       1   0     1   0        0        6           0           1  0.24  0.2879  0.81        0.0       3          13   16
1        2  2011-01-01       1   0     1   1        0        6           0           1  0.22  0.2727  0.80        0.0       8          32   40
2        3  2011-01-01       1   0     1   2        0        6           0           1  0.22  0.2727  0.80        0.0       5          27   32
3        4  2011-01-01       1   0     1   3        0        6           0           1  0.24  0.2879  0.75        0.0       3          10   13
4        5  2011-01-01       1   0     1   4        0        6           0           1  0.24  0.2879  0.75        0.0       0           1    1
dummy_fields = ['season', 'weathersit', 'mnth', 'hr', 'weekday']
for field in dummy_fields:
    dummies = pd.get_dummies(rides[field], prefix=field)
    rides = pd.concat([rides, dummies], axis=1)
fields_to_drop = ['instant', 'dteday', 'season', 'weathersit', 'mnth', 'hr', 'weekday', 'atemp',
                  'workingday']
data = rides.drop(fields_to_drop, axis=1)

data.head()
   yr  holiday  temp   hum  windspeed  casual  registered  cnt  season_1  season_2  ...  hr_21  hr_22  hr_23  weekday_0  weekday_1  weekday_2  weekday_3  weekday_4  weekday_5  weekday_6
0   0        0  0.24  0.81        0.0       3          13   16         1         0  ...      0      0      0          0          0          0          0          0          0          1
1   0        0  0.22  0.80        0.0       8          32   40         1         0  ...      0      0      0          0          0          0          0          0          0          1
2   0        0  0.22  0.80        0.0       5          27   32         1         0  ...      0      0      0          0          0          0          0          0          0          1
3   0        0  0.24  0.75        0.0       3          10   13         1         0  ...      0      0      0          0          0          0          0          0          0          1
4   0        0  0.24  0.75        0.0       0           1    1         1         0  ...      0      0      0          0          0          0          0          0          0          1
5 rows × 59 columns
Scaling target variables
After the continuous features (temp, hum, windspeed) and the target variables (casual,
registered, cnt) were scaled, the output was as follows:
   yr  holiday       temp       hum  windspeed     casual  registered        cnt  season_1  season_2  ...  hr_21  hr_22  hr_23  weekday_0  weekday_1  weekday_2  weekday_3  weekday_4  weekday_5  weekday_6
0   0        0  -1.334609  0.947345  -1.553844  -0.662736   -0.930162  -0.956312         1         0  ...      0      0      0          0          0          0          0          0          0          1
1   0        0  -1.438475  0.895513  -1.553844  -0.561326   -0.804632  -0.823998         1         0  ...      0      0      0          0          0          0          0          0          0          1
2   0        0  -1.438475  0.895513  -1.553844  -0.622172   -0.837666  -0.868103         1         0  ...      0      0      0          0          0          0          0          0          0          1
3   0        0  -1.334609  0.636351  -1.553844  -0.662736   -0.949983  -0.972851         1         0  ...      0      0      0          0          0          0          0          0          0          1
4   0        0  -1.334609  0.636351  -1.553844  -0.723582   -1.009445  -1.039008         1         0  ...      0      0      0          0          0          0          0          0          0          1
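The report shows the scaled output but not the scaling step itself. The following is a minimal sketch of one common approach, assuming each continuous column is standardized to zero mean and unit standard deviation. The toy DataFrame, the `quant_features` list and the `scaled_features` dictionary are illustrative assumptions, not the author's code.

```python
import pandas as pd

# Toy stand-in for a few rows of the hourly data (illustrative values only).
data = pd.DataFrame({
    "temp": [0.24, 0.22, 0.22, 0.24, 0.24],
    "hum": [0.81, 0.80, 0.80, 0.75, 0.75],
    "cnt": [16, 40, 32, 13, 1],
})

quant_features = ["temp", "hum", "cnt"]  # assumed list of columns to scale
scaled_features = {}  # column -> (mean, std), kept so the scaling can be undone

scaled = data.copy()
for col in quant_features:
    mean, std = data[col].mean(), data[col].std()
    scaled_features[col] = (mean, std)
    scaled[col] = (data[col] - mean) / std

# Undo the scaling on the target to get ride counts back.
mean, std = scaled_features["cnt"]
recovered_cnt = scaled["cnt"] * std + mean
print(scaled.round(3))
```

Keeping the mean and standard deviation of each column matters because model predictions made on the scaled target have to be converted back to ride counts before they can be interpreted.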
Since this assignment involves plotting the given dataset, a Python notebook has been used.
The hourly dataset contains a total of 17379 records.
Building of Regression Models
PART 1
DATA PREPROCESSING
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
I. Decision Trees
The following code snippets implement the decision tree algorithm:
import pandas as pd
import numpy as np
# sklearn.cross_validation was removed in scikit-learn 0.20; use model_selection instead
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_graphviz
#READ DATA
data_path = 'C:/Users/ROSANA/Desktop/bike-sharing/hour.csv'
rides = pd.read_csv(data_path)
rides.head()

rides.shape
#modeling utilities
from sklearn import metrics
from sklearn import preprocessing
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import (GridSearchCV, cross_val_score, cross_val_predict,
                                     train_test_split)
# Plotting parameters tuning
import seaborn as sns
sns.set_style('whitegrid')
sns.set_context('talk')
params = {'legend.fontsize': 'x-large',
          'figure.figsize': (30, 10),
          'axes.labelsize': 'x-large',
          'axes.titlesize': 'x-large',
          'xtick.labelsize': 'x-large',
          'ytick.labelsize': 'x-large'}
# apply the settings; without this call the dictionary has no effect
plt.rcParams.update(params)
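# categorical variables
# hour_df is assumed to be a copy of rides with the columns renamed as below;
# the renaming step is not shown in the source.
hour_df = rides.rename(columns={'holiday': 'is_holiday', 'weathersit': 'weather_condition',
                                'workingday': 'is_workingday', 'mnth': 'month',
                                'yr': 'year', 'hr': 'hour'})
hour_df['season'] = hour_df.season.astype('category')
hour_df['is_holiday'] = hour_df.is_holiday.astype('category')
hour_df['weekday'] = hour_df.weekday.astype('category')
hour_df['weather_condition'] = hour_df.weather_condition.astype('category')
hour_df['is_workingday'] = hour_df.is_workingday.astype('category')
hour_df['month'] = hour_df.month.astype('category')
hour_df['year'] = hour_df.year.astype('category')
hour_df['hour'] = hour_df.hour.astype('category')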
# Defining categorical variables encoder method
def fit_transform_ohe(df,col_name):
# label encode the column

    le = preprocessing.LabelEncoder()
    le_labels = le.fit_transform(df[col_name])
    df[col_name+'_label'] = le_labels
    # one hot encoding
    ohe = preprocessing.OneHotEncoder()
    feature_arr = ohe.fit_transform(df[[col_name+'_label']]).toarray()
    feature_labels = [col_name+'_'+str(cls_label) for cls_label in le.classes_]
    # (the listing is cut off at this point in the source)
    features_df = pd.DataFrame(feature_
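The listing above stops before a tree is actually fitted. As a rough sketch of that step, the block below fits a `DecisionTreeRegressor` with 5-fold cross-validation. The synthetic `hour` and `temp` features, the target formula and the hyperparameter values are assumptions made for illustration; they stand in for the encoded bike-sharing features and are not the author's settings.

```python
import numpy as np
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the bike data: hour of day and temperature drive demand.
rng = np.random.default_rng(0)
hour = rng.integers(0, 24, size=500)
temp = rng.uniform(0.0, 1.0, size=500)
X = np.column_stack([hour, temp])
y = 50 + 100 * np.sin(np.pi * hour / 24) + 80 * temp + rng.normal(0, 5, size=500)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

# max_depth and min_samples_leaf are illustrative values, not tuned settings.
tree = DecisionTreeRegressor(max_depth=6, min_samples_leaf=5, random_state=42)

# 5-fold cross-validated R^2 on the training split.
cv_scores = cross_val_score(tree, X_train, y_train, cv=5, scoring="r2")

# Fit on the full training split and score on the held-out split.
tree.fit(X_train, y_train)
test_r2 = tree.score(X_test, y_test)
print("CV R^2: %.3f, test R^2: %.3f" % (cv_scores.mean(), test_r2))
```

Limiting the depth and the minimum leaf size is one way to keep a single tree from memorizing the training hours; `GridSearchCV`, which the listing imports, could be used to choose these values instead of fixing them by hand.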
PART TWO
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import math
from sklearn import ensemble
# sklearn.cross_validation and sklearn.grid_search were removed in scikit-learn 0.20
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import GridSearchCV
from datetime import datetime
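# The next three lines are incomplete in the source (the assigned values are missing),
# so they are commented out; the data is loaded with read_csv further down.
# data_path =
# train_data =
# train_data.head(3)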
#NOTE: Dataset imported from https://www.kaggle.com/c/bike-sharing-demand/data
#Load Data with pandas, and parse the
#first column into datetime
train = pd.read_csv('C:/Users/ROSANA/Desktop/bike/train.csv', parse_dates=[0])
test = pd.read_csv('C:/Users/ROSANA/Desktop/bike/test.csv', parse_dates=[0])

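#Implementing the Gradient Boosting model
# `features` (the list of predictor columns) is not shown in the source.
# The 'log-count' column is assumed to be log(1 + count), since the predictions
# are converted back with np.expm1 below.
train['log-count'] = np.log1p(train['count'])
clf = ensemble.GradientBoostingRegressor(
    n_estimators=200, max_depth=3)
clf.fit(train[features], train['log-count'])
result = clf.predict(test[features])
result = np.expm1(result)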
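To make the log-transform round trip concrete, here is a self-contained sketch that trains a `GradientBoostingRegressor` on `log1p` of a synthetic count target and converts the predictions back with `expm1`. The synthetic features and counts are assumptions standing in for the Kaggle data, and mean absolute error is used only because the listing imports it; the report does not state which metric was evaluated.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic, strictly positive rental counts driven by hour and temperature.
rng = np.random.default_rng(1)
hour = rng.integers(0, 24, size=600)
temp = rng.uniform(0.0, 1.0, size=600)
X = np.column_stack([hour, temp])
counts = np.round(np.exp(
    2 + 2 * np.sin(np.pi * hour / 24) + temp + rng.normal(0, 0.1, size=600)))

X_train, X_test, y_train, y_test = train_test_split(
    X, counts, test_size=0.25, random_state=0)

# Same hyperparameters as the listing above.
model = GradientBoostingRegressor(n_estimators=200, max_depth=3, random_state=0)
model.fit(X_train, np.log1p(y_train))   # train on log(1 + count)

pred = np.expm1(model.predict(X_test))  # back to the count scale
mae = mean_absolute_error(y_test, pred)
print("MAE on held-out data: %.2f rentals" % mae)
```

Training on the logarithm damps the influence of the busiest hours, so errors are weighted in relative rather than absolute terms; `expm1` is the exact inverse of `log1p`, which is why the two are used as a pair.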
PART 3
import pandas as pd
from datetime import date
data_path = pd.read_csv('C:/Users/ROSANA/Desktop/bike/train.csv')
data_path.head()
def calculate_period(timestamp):
    # number of months elapsed since January 2011
    initial_date = date(2011, 1, 1)
    current_date = timestamp.date()
    return (current_date.year - initial_date.year) * 12 + (current_date.month - initial_date.month)
possible_features = [
    'season', 'holiday', 'workingday', 'weather',
    'temp', 'atemp', 'windspeed', 'month',
    'hour', 'year', 'week_day']
target = 'count'
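The feature list above refers to `month`, `hour`, `year` and `week_day`, which are not columns of the raw file and have to be derived from its `datetime` column. The sketch below shows one way this could be done together with `calculate_period`; the three timestamps and the `period` column name are hypothetical, and the derivation is an assumption about a step the report does not show.

```python
from datetime import date

import pandas as pd


def calculate_period(timestamp):
    """Return the number of months elapsed since January 2011."""
    initial_date = date(2011, 1, 1)
    current_date = timestamp.date()
    return (current_date.year - initial_date.year) * 12 + (
        current_date.month - initial_date.month)


# Hypothetical timestamps in the same format as the `datetime` column.
frame = pd.DataFrame({"datetime": pd.to_datetime(
    ["2011-01-01 00:00:00", "2011-12-19 23:00:00", "2012-03-15 08:00:00"])})

frame["period"] = frame["datetime"].apply(calculate_period)
frame["month"] = frame["datetime"].dt.month
frame["hour"] = frame["datetime"].dt.hour
frame["year"] = frame["datetime"].dt.year
frame["week_day"] = frame["datetime"].dt.dayofweek  # Monday = 0

print(frame["period"].tolist())  # → [0, 11, 14]
```

A month counter of this kind gives a model a simple way to capture the overall growth in ridership over the two years, which the cyclical `month` and `hour` features cannot express on their own.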
Building a linear regression model
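# import, instantiate, fit
from sklearn.linear_model import LinearRegression
# X (feature matrix) and y (target) are not defined in the source; they are
# presumably built from possible_features and target above.
linreg = LinearRegression()
linreg.fit(X, y)
# print the intercept and the coefficients
print(linreg.intercept_)
print(linreg.coef_)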
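Because `X` and `y` are not defined in the listing, the following sketch builds them explicitly and adds a held-out evaluation. The synthetic columns (`temp`, `humidity`, `workingday`), the coefficients used to generate the target and the RMSE check are all assumptions for illustration, chosen so that the fitted coefficients can be compared with known values.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
frame = pd.DataFrame({
    "temp": rng.uniform(0.0, 1.0, size=400),
    "humidity": rng.uniform(0.0, 1.0, size=400),
    "workingday": rng.integers(0, 2, size=400),
})
# Synthetic target with known coefficients so the fit can be checked.
frame["count"] = (20 + 150 * frame["temp"] - 40 * frame["humidity"]
                  + 10 * frame["workingday"] + rng.normal(0, 3, size=400))

feature_cols = ["temp", "humidity", "workingday"]
X = frame[feature_cols]
y = frame["count"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

linreg = LinearRegression()
linreg.fit(X_train, y_train)

# Pair each coefficient with its feature name for readability.
print(linreg.intercept_)
print(dict(zip(feature_cols, linreg.coef_)))

# Root mean squared error on the held-out rows.
rmse = np.sqrt(np.mean((linreg.predict(X_test) - y_test) ** 2))
print("test RMSE: %.2f" % rmse)
```

Printing the coefficients next to their feature names makes the signs easy to read: in this synthetic example a positive `temp` coefficient and a negative `humidity` coefficient are recovered because the data was generated that way.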
Conclusion
This work is one of many analyses of the bike sharing dataset. It shows that machine learning
algorithms can be used to address the main problem facing bike sharing systems in urban centers
around the world, namely demand prediction. User behavior and bicycle usage patterns can be
observed in the regression models that were implemented. The experiments carried out on this
real-life dataset illustrate how effectively and efficiently regression models can address the
bike sharing problem.
