ICT707 Big Data: Regression Models Implementation on Bike Sharing Data

DATA SCIENCE AND PRACTICE
Student Name
Professor’s Name
Affiliation
Introduction
A bicycle sharing system is an evolution of the traditional bicycle rental scheme, in which the
process involved registering for membership, after which members could rent and return
bicycles (Kiefer & Behrendt 2016 pp.79-88).
This process has now been automated in the novel bicycle sharing systems. Present-day
bicycle sharing systems are becoming increasingly useful in urban centres throughout the
world (Kumar et al. 2016 p.21597), because bicycles provide cheap and affordable transport
over short distances. Nonetheless, the management of bicycle sharing systems presents
problems. The major issue is rebalancing of the bikes (Rivers & Koedinger 2017 pp.37-64). An
imbalance is created in the system when users generate an asymmetric demand pattern. For
effective operation of the system, bikes must be rebalanced at each bike station. To handle the
routing problems, particularly during the rush hour, machine learning algorithms come in handy
(Jian et al. 2016 pp.602-613).
For continuous operation of the bicycle sharing system, dynamic clustering techniques should
be implemented for anticipating the excess-demand pattern of the bicycles (Carpenter et al.
2017).
Handling the Demand Imbalance Problem
For bicycle rebalancing to be effective, the stock target levels must be predicted accurately. In
this task, three regression models have been implemented on a bicycle sharing dataset from
Kaggle, as provided in the assignment dataset (bike sharing dataset) (Orfanakis & Papadakis
2016). The algorithms are as follows:
i. Decision tree algorithm
ii. Gradient boosting algorithm
iii. Linear regression algorithm
Dataset Description
The dataset was retrieved from the UCI Machine Learning Repository at the following url:
http://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset . The dataset has been enriched
with seasonal and weather-related information. This work was done at the University of Porto.
The dataset contains both hourly and daily records, with column names as headers. The
headers are as shown below in the output from the coding section of this task:
import pandas as pd

data_path = 'C:/Users/ROSANA/Desktop/bike/hour.csv'
rides = pd.read_csv(data_path)
rides.head()
   instant      dteday  season  yr  mnth  hr  holiday  weekday  workingday  weathersit  temp   atemp   hum  windspeed  casual  registered  cnt
0        1  2011-01-01       1   0     1   0        0        6           0           1  0.24  0.2879  0.81        0.0       3          13   16
1        2  2011-01-01       1   0     1   1        0        6           0           1  0.22  0.2727  0.80        0.0       8          32   40
2        3  2011-01-01       1   0     1   2        0        6           0           1  0.22  0.2727  0.80        0.0       5          27   32
3        4  2011-01-01       1   0     1   3        0        6           0           1  0.24  0.2879  0.75        0.0       3          10   13
4        5  2011-01-01       1   0     1   4        0        6           0           1  0.24  0.2879  0.75        0.0       0           1    1
dummy_fields = ['season', 'weathersit', 'mnth', 'hr', 'weekday']
for field in dummy_fields:
    dummies = pd.get_dummies(rides[field], prefix=field)
    rides = pd.concat([rides, dummies], axis=1)

fields_to_drop = ['instant', 'dteday', 'season', 'weathersit', 'mnth', 'hr',
                  'weekday', 'atemp', 'workingday']
data = rides.drop(fields_to_drop, axis=1)
data.head()
   yr  holiday  temp   hum  windspeed  casual  registered  cnt  season_1  season_2  ...  hr_21  hr_22  hr_23  weekday_0  weekday_1  weekday_2  weekday_3  weekday_4  weekday_5  weekday_6
0   0        0  0.24  0.81        0.0       3          13   16         1         0  ...      0      0      0          0          0          0          0          0          0          1
1   0        0  0.22  0.80        0.0       8          32   40         1         0  ...      0      0      0          0          0          0          0          0          0          1
2   0        0  0.22  0.80        0.0       5          27   32         1         0  ...      0      0      0          0          0          0          0          0          0          1
3   0        0  0.24  0.75        0.0       3          10   13         1         0  ...      0      0      0          0          0          0          0          0          0          1
4   0        0  0.24  0.75        0.0       0           1    1         1         0  ...      0      0      0          0          0          0          0          0          0          1
5 rows × 59 columns
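To make the dummy-variable step above concrete, here is a toy illustration of what pd.get_dummies produces for a single categorical column (the values below are made up for illustration, not taken from the dataset):

```python
import pandas as pd

# Toy frame with one categorical column, analogous to 'season' above.
toy = pd.DataFrame({'season': [1, 2, 1, 4]})

# One 0/1 indicator column is created per distinct value of 'season'.
dummies = pd.get_dummies(toy['season'], prefix='season')
```

Concatenating these indicator columns back onto the frame and dropping the original column, as the loop above does for all five fields, is what inflates the table to 59 columns.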
Scaling target variables
After the target variables were scaled, the following was the output:
   yr  holiday      temp       hum  windspeed    casual  registered       cnt  season_1  season_2  ...  weekday_6
0   0        0 -1.334609  0.947345  -1.553844 -0.662736   -0.930162 -0.956312         1         0  ...          1
1   0        0 -1.438475  0.895513  -1.553844 -0.561326   -0.804632 -0.823998         1         0  ...          1
2   0        0 -1.438475  0.895513  -1.553844 -0.622172   -0.837666 -0.868103         1         0  ...          1
3   0        0 -1.334609  0.636351  -1.553844 -0.662736   -0.949983 -0.972851         1         0  ...          1
4   0        0 -1.334609  0.636351  -1.553844 -0.723582   -1.009445 -1.039008         1         0  ...          1
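The exact scaling code is not shown in the document; a minimal sketch consistent with the standardised values above (zero mean, unit variance per quantitative column, keeping the scalings so predictions can later be converted back to counts) could look like this:

```python
import pandas as pd

def scale_features(data, columns):
    """Standardise the given columns to zero mean and unit std.

    Returns the scaled frame and a dict of (mean, std) per column so
    the scaling can be undone on model predictions.
    """
    scaled = data.copy()
    scalings = {}
    for col in columns:
        mean, std = data[col].mean(), data[col].std()
        scalings[col] = (mean, std)
        scaled[col] = (data[col] - mean) / std
    return scaled, scalings
```

Applied to columns such as ['temp', 'hum', 'windspeed', 'casual', 'registered', 'cnt'], this produces values of the kind shown in the table above.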
Since this assignment involves plotting the given dataset, the Python notebook has been used.
There is a total of 17379 records on an hourly basis in the dataset.
Building of Regression Models
PART 1
I. Decision Trees
The following code snippets implement the decision tree algorithm:
import pandas as pd
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor, export_graphviz
# read the data and set "datetime" as the index
url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True)
bikes.rename(columns={'count':'total'}, inplace=True)
bikes['hour'] = bikes.index.hour
bikes.head()
bikes.tail()
                     season  holiday  workingday  weather   temp   atemp  humidity  windspeed  casual  registered  total  hour
datetime
2012-12-19 19:00:00       4        0           1        1  15.58  19.695        50    26.0027       7         329    336    19
2012-12-19 20:00:00       4        0           1        1  14.76  17.425        57    15.0013      10         231    241    20
2012-12-19 21:00:00       4        0           1        1  13.94  15.910        61    15.0013       4         164    168    21
2012-12-19 22:00:00       4        0           1        1  13.94  17.425        61     6.0032      12         117    129    22
2012-12-19 23:00:00       4        0           1        1  13.12  16.665        66     8.9981       4          84     88    23
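The cross-validation call below refers to a feature matrix X and target vector y that are not defined in the snippets shown. A minimal sketch of how they could be built from the bikes frame loaded earlier (the default feature list here is an assumption, not necessarily the assignment's exact choice):

```python
import pandas as pd

def make_xy(bikes, feature_cols=('temp', 'season', 'weather', 'humidity')):
    """Split the bikeshare frame into a feature matrix X and target y.

    `bikes` is the frame loaded earlier; 'total' is the hourly rental
    count (renamed from 'count' above).
    """
    X = bikes[list(feature_cols)]
    y = bikes['total']
    return X, y

# X, y = make_xy(bikes)
```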
treereg = DecisionTreeRegressor(max_depth=7, random_state=1)
scores = cross_val_score(treereg, X, y, cv=10, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))
OUTPUT: 107.64196789476493
treereg = DecisionTreeRegressor(max_depth=3, random_state=1)
treereg.fit(X, y)
DecisionTreeRegressor(criterion='mse', max_depth=3, max_features=None,
                      max_leaf_nodes=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      presort=False, random_state=1, splitter='best')
PART 2
II. Gradient Boosting
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

rfr = RandomForestRegressor().fit(train_x, train_y)
prediction_rfr = rfr.predict(train_x)
train_data = pd.read_csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv")
test_data = pd.read_csv("https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv")
train_data.head(3)
              datetime  season  holiday  workingday  weather  temp   atemp  humidity  windspeed  casual  registered  count
0  2011-01-01 00:00:00       1        0           0        1  9.84  14.395        81        0.0       3          13     16
1  2011-01-01 01:00:00       1        0           0        1  9.02  13.635        80        0.0       8          32     40
2  2011-01-01 02:00:00       1        0           0        1  9.02  13.635        80        0.0       5          27     32
import matplotlib.pyplot as plt

prediction_rfr = rfr.predict(train_x)
plt.figure(figsize=(5, 5))
plt.scatter(prediction_rfr, train_y)
plt.plot( [0,1000],[0,1000], color='red')
plt.xlim(-100, 1000)
plt.ylim(-100, 1000)
plt.xlabel('prediction')
plt.ylabel('train_y')
plt.title('Random Forest Regressor Model')
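This part's title refers to gradient boosting, but the snippet above fits a RandomForestRegressor. A minimal gradient boosting sketch on synthetic stand-in data (the dataset, split, and hyperparameters here are illustrative assumptions, not the assignment's):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the bikeshare features and rental counts.
X, y = make_regression(n_samples=500, n_features=6, noise=10.0, random_state=1)
train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=1)

# Fit an ensemble of shallow trees, each correcting the previous ones' errors.
gbr = GradientBoostingRegressor(n_estimators=100, max_depth=3, random_state=1)
gbr.fit(train_x, train_y)

# Held-out root-mean-squared error, the metric used elsewhere in this task.
rmse = np.sqrt(mean_squared_error(test_y, gbr.predict(test_x)))
```

The same fit/predict/RMSE pattern applies unchanged when X and y come from the bikeshare frame instead of make_regression.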
PART 3
III. Linear Regression
import pandas as pd
from datetime import date

url = 'https://raw.githubusercontent.com/justmarkham/DAT8/master/data/bikeshare.csv'
bikes = pd.read_csv(url, index_col='datetime', parse_dates=True)

def calculate_period(timestamp):
    initial_date = date(2011, 1, 1)
    current_date = timestamp.date()
    return (current_date.year - initial_date.year) * 12 + (current_date.month - initial_date.month)
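A quick check of calculate_period, which counts whole months elapsed since the start of the dataset (the definition is repeated here so the snippet runs on its own):

```python
from datetime import date

import pandas as pd

def calculate_period(timestamp):
    # Whole months elapsed since the dataset's start, 2011-01-01.
    initial_date = date(2011, 1, 1)
    current_date = timestamp.date()
    return (current_date.year - initial_date.year) * 12 + \
           (current_date.month - initial_date.month)

# December 2012, the last month of the dataset, is 23 months in.
period = calculate_period(pd.Timestamp('2012-12-19 19:00:00'))
```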
possible_features = [
    'season', 'holiday', 'workingday', 'weather',
    'temp', 'atemp', 'windspeed', 'month',
    'hour', 'year', 'week_day']
target = 'count'
Building a linear regression model
feature_cols = ['temp']
X = bikes[feature_cols]
y = bikes.total
bikes.groupby('hour').total.mean().plot()
<matplotlib.axes._subplots.AxesSubplot at 0x27dc36739b0>
feature_cols = ['hour', 'workingday']
X = bikes[feature_cols]
y = bikes.total
linreg = LinearRegression()
linreg.fit(X, y)
linreg.coef_
Use 10-fold cross-validation for the linear regression model.
scores = cross_val_score(linreg, X, y, cv=10, scoring='neg_mean_squared_error')
np.mean(np.sqrt(-scores))
Output: 165.2232866891297
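The two cross-validated RMSE figures reported above (about 107.6 for the depth-7 tree and 165.2 for the linear model) both follow the same pattern, which a small helper makes explicit; the synthetic data below is only a stand-in for the bikeshare features, so the numbers will differ:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor

def cv_rmse(model, X, y, cv=10):
    """10-fold cross-validated RMSE, as computed for both models above."""
    scores = cross_val_score(model, X, y, cv=cv,
                             scoring='neg_mean_squared_error')
    return np.mean(np.sqrt(-scores))

# Stand-in data; swap in the bikeshare X, y to reproduce the task's numbers.
X, y = make_regression(n_samples=400, n_features=5, noise=15.0, random_state=1)
rmse_tree = cv_rmse(DecisionTreeRegressor(max_depth=7, random_state=1), X, y)
rmse_linear = cv_rmse(LinearRegression(), X, y)
```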
Conclusion
In conclusion, this task, together with the many other studies carried out on the bike sharing
dataset, demonstrates that machine learning algorithms can be used to solve the forecasting
problem faced by bike sharing systems in various cities around the world (Diamond & Boyd
2016 pp.2909-2913). User behaviour and bike usage patterns can be observed in the
regression models implemented here. The many experiments performed on this real dataset
show how powerful regression models can be in addressing the bike sharing problem
(Salvatier, Wiecki & Fonnesbeck 2016 p.55).