Regression Analysis and Report on Bike Sharing System
Added on 2019/09/22

=============Report on Bike Sharing==================
The problem is to count the number of bikes on rent per hour and per day.
Bike-sharing systems are a new generation of traditional bike rentals in which the whole process,
from membership to rental and return, has become automatic. In this project the task is to count
how many bikes are on rent, in which season, and which parameters affect that count.
Before coding, some important steps are as follows:
1) Understand the question and think about the solution.
2) Find the missing data points.
3) Prepare the data.
4) Decide on the model you aim to use.
5) Train the model on the training data.
6) Compare the predictions to the test data.
7) Interpret the model and report the results visually and numerically.
This is a regression-type model, i.e. the output is continuous. Based on the references, I
used three types of regression models:
1) Decision Tree
2) Gradient Boosted Tree
3) Linear Regression
Looking at the dataset (.csv file), some parameters are useless, so they should be removed (dropped).
In my judgment, I removed the "Date" column. After that, I checked the shape of the whole dataset,
then examined and described the season column. In this case the target column is "cnt" (i.e. count).
The count column records how many bikes were on rent in a particular season and at a particular time.
Since the count is what matters, the main goal is to predict it.
Requirements :- Python 2 or 3.
Step 1:- Import all the libraries, such as Pandas, NumPy, plotting (matplotlib), and a number of
packages from scikit-learn.
Step 2:- Read the input file with pandas, e.g. pd.read_csv(file path).
Step 3:- Find the shape of the dataset.
Check which columns contain 0, because sometimes a 0 value carries no useful information.
Step 4:- Remove irrelevant features, such as the "Date" column.
Step 5:- An important step is to look at unique values: a column in which most values are unique is
usually best removed. Otherwise such values can overlap with others, which greatly raises the
chances of overfitting the model and hurts accuracy (i.e. the final output).
Step 6:- After that, delete the unique-value columns.
Step 7:- Another important pre-processing step is to check for null values in the dataset. Null
values are replaced by the mean, mode, or median, depending on how the data is distributed, e.g.
whether it follows a normal distribution ("bell-shaped") curve. This graph is very useful for
visualizing the data.
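The pre-processing steps above can be sketched in pandas. This is a minimal sketch on made-up sample data; the column names ("Date", "cnt", and an "instant" record ID) are assumptions following the usual bike-sharing layout:

```python
import pandas as pd
import numpy as np

# Small stand-in for the bike-sharing data (values are made up)
df = pd.DataFrame({
    "Date": ["2011-01-01", "2011-01-01", "2011-01-02", "2011-01-02"],
    "instant": [1, 2, 3, 4],             # record ID: every value is unique
    "season": [1, 1, 1, 1],
    "temp": [0.24, np.nan, 0.22, 0.23],  # one missing value
    "cnt": [16, 40, 32, 16],             # target column
})

print(df.shape)                          # Step 3: shape → (4, 5)

df = df.drop(columns=["Date"])           # Step 4: drop the irrelevant "Date"

# Steps 5/6: drop columns where every value is unique (e.g. record IDs)
unique_cols = [c for c in df.columns if df[c].nunique() == len(df)]
df = df.drop(columns=unique_cols)

# Step 7: replace nulls in numeric columns with the column mean
df = df.fillna(df.mean(numeric_only=True))
print(list(df.columns))                  # → ['season', 'temp', 'cnt']
```

On a real file, replacing nulls with the median or mode instead of the mean follows the same `fillna` pattern.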

Step 8:- Data partition. This is one of the most important steps in machine learning, because
machine learning means training on a dataset with the help of algorithms and then testing on a test
dataset. If you don't have separate test data, split the dataset in a standard way: for example,
with 100 examples you can split them so that 80% serve as training data and 20% as test data, apply
the machine learning technique to the training data, test on the test data (the 20%), and then
check whether the model performs well. We can use a small utility from scikit-learn,
train_test_split, to split the dataset. For example, test_size=0.2 means 80% of the data is used
for training.
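The split described in Step 8 can be sketched as follows (the toy arrays stand in for the real features and the "cnt" target):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy feature matrix (50 rows) and continuous target
X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# test_size=0.2 keeps 80% for training and 20% for testing;
# random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

print(len(X_train), len(X_test))  # → 40 10
```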
Step 9:- After pre-processing, apply the machine learning model. This example is a regression
problem, so regression algorithms are used, like Decision Tree, Gradient Boosted Tree, and Linear
Regression.
Step 10:- Use the regressor packages, like
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
1) First of all, import them and decide on max_depth and random_state, because these
hyperparameters matter: they determine how many trees and branches are created.
Fit the model using the random forest and predict. For the regressor, compute the r2_score and the
mean absolute error.
2) Do the same with the remaining algorithms and tune their parameters.
3) Compared to other metrics such as RMSE (root mean squared error), r2_score works best for this
regression task.
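Steps 9 and 10 can be sketched as a minimal example on synthetic data (make_regression stands in for the pre-processed bike-sharing features; the parameter values are illustrative assumptions, not tuned):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data as a stand-in for the bike-sharing dataset
X, y = make_regression(n_samples=200, n_features=5, noise=5.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# max_depth limits tree size; random_state makes the run repeatable
model = RandomForestRegressor(n_estimators=100, max_depth=8, random_state=0)
model.fit(X_train, y_train)
pred = model.predict(X_test)

# Evaluate with R-squared and mean absolute error, as in the text
print("R^2:", r2_score(y_test, pred))
print("MAE:", mean_absolute_error(y_test, pred))
```

The same fit/predict/score pattern applies to DecisionTreeRegressor and GradientBoostingRegressor; only the estimator and its tuned parameters change.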
Why regression?
One disadvantage of a decision tree classifier is that it needs a categorically scaled target
feature, like weather = {Sunny, Rainy, Overcast, Thunderstorm}.
Here a problem arises: what if we want our tree to predict, for instance, the price of a house
given some feature attributes like the number of rooms and the location? The values of the target
feature (the price) are then no longer categorically scaled but continuous: a house can
theoretically have an infinite number of different prices.
That is where regression trees come in. Regression trees work in principle the same way as
classification trees; the big difference between classifiers and regressors is that the target
feature values can now take on an infinite number of continuously scaled values.
Why RandomForestTree?
When a decision tree model is used on training data, accuracy keeps improving with more and more
splits, so there is a growing chance of overfitting the data. The advantage of a simple decision
tree is that the model is easy to interpret: you know which variable, and which value of that
variable, is used to split the data and predict the outcome.
In a random forest you can specify the number of trees you want in your forest (n_estimators), and
you can also set the maximum number of features to be used in each tree. But you cannot control the
randomness: you cannot control which feature is part of which tree in the forest, nor which data
point is part of which tree. Accuracy keeps increasing as you increase the number of trees, but
becomes constant at a certain point. Unlike a single decision tree, a random forest won't create a
highly biased model, and it reduces the variance.
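The variance argument above can be checked with a small sketch: a fully grown single tree fits the training data perfectly but generalizes worse than a forest that averages many trees (synthetic data; parameter values are illustrative assumptions):

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=8, noise=20.0, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1)

# A single unconstrained tree memorizes the training data...
tree = DecisionTreeRegressor(random_state=1).fit(X_train, y_train)
# ...while a forest of 200 trees averages away individual-tree variance
forest = RandomForestRegressor(n_estimators=200, random_state=1).fit(X_train, y_train)

print("tree  train R^2:", tree.score(X_train, y_train))  # → 1.0 (memorized)
print("tree  test  R^2:", tree.score(X_test, y_test))
print("forest test R^2:", forest.score(X_test, y_test))
```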

When to use a decision tree:
1. When you want your model to be simple.
2. When you want a non-parametric model.
3. When you don't want to worry about feature selection, regularization, or multi-collinearity.
When to use a random forest:
1. A random forest reduces the variance part of the error but not the bias part, so on a given
training dataset a decision tree may be more accurate than a random forest. But on an unseen
validation dataset, the random forest almost always wins in terms of accuracy.
Conclusion :-
For a regression-type dataset, the random forest is the best algorithm compared to the others,
because the model is much less prone to overfitting and gives accurate predictions at each node. In
a random forest there is more node splitting than in a single decision tree, so it is easier to
work on each and every node.
Reference :-
1) https://stats.stackexchange.com/questions/285834/difference-between-random-forests-and-decision-tree
2) Towards Data Science.
3) Analytics Vidhya.