Building and Evaluating Predictive Models
This article discusses data exploration and cleaning techniques and builds predictive models using a real-world business case. It covers continuous and categorical variables, summary statistics, missing values, and correlation analysis.
Part A: Data Exploration and Cleaning

The first part of the report focuses on exploring and cleaning the given data. For this part, the breakfast cereals data set has been used; it contains 76 data points and 18 different features.

Results from the data exploration

1. The continuous variables are listed below.

## [1]  "Calories"        "Protein"         "Fat"
## [4]  "Sodium"          "Fiber"           "Complex.Carbos"
## [7]  "Tot.Carbo"       "Sugars"          "Calories.fr.Fat"
## [10] "Potassium"       "Enriched"        "Wt.serving"
## [13] "cups.serv"

Table 1: List of the numerical variables

The categorical variables are shown in the table below. A categorical variable is one whose values fall into categories, and there are two types. The first is the nominal variable, where the categories have no particular order. The second is the ordinal variable, where the categories do have a particular order. In the current case, among all the categorical variables only Fiber.Gr is ordinal; all the others are nominal.

# Ordinal: Fiber.Gr
# Nominal: Name, Manufacturer, Mfr, Hot.Cold

Table 2: Ordinal and nominal variables
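The two variable lists above can be reproduced in R by checking each column's class. A minimal sketch, assuming the cereals data is loaded from a CSV file (the file name cereals.csv is an assumption):

    # Assumed file name; factors hold the categorical columns
    cereals <- read.csv("cereals.csv", stringsAsFactors = TRUE)

    names(cereals)[sapply(cereals, is.numeric)]   # continuous variables
    names(cereals)[sapply(cereals, is.factor)]    # categorical variables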
2. Summary statistics for the continuous variables have been calculated, including measures of central tendency such as the mean and the median. The results are shown in the table below.

Variable           Min.     1st Qu.   Median    Mean      3rd Qu.   Max.
Calories           50.0     110.0     120.0     140.5     192.5     250.0
Protein            1.00     2.00      3.00      3.25      5.00      7.00
Fat                0.000    0.500     1.000     1.447     2.000     9.000
Sodium             0.0      147.5     210.0     194.9     262.5     420.0
Fiber              0.000    1.000     3.000     3.066     5.000     13.000
Complex.Carbos     7.00     13.00     17.50     19.16     26.00     38.00
Tot.Carbo          11.00    24.00     27.00     31.37     41.00     50.00
Sugars             0.000    4.000     11.000    9.145     14.000    20.000
Calories.fr.Fat    0.00     5.00      10.00     12.37     20.00     50.00
Potassium          0.0      35.0      92.5      122.0     200.0     390.0
Enriched           0.00     25.00     25.00     28.62     25.00     100.00
Wt.serving         12.00    30.00     30.00     36.65     49.00     60.00   (NA's: 13)
cups.serv          0.3300   0.7500    1.0000    0.8911    1.0000    1.3300

Table 3: Summary statistics for the continuous variables

On the basis of these results, the variable with the most extreme values is Potassium: its mean is 122.0 while its values range from 0 to 390, and its median (92.5) sits well below its mean. Similarly, Calories ranges from as low as 50 to as high as 250.
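The table above is essentially R's summary() output reflowed; a quick check on the assumed cereals data frame also confirms the Potassium spread:

    summary(cereals)                         # six-number summary per numeric column
    range(cereals$Potassium, na.rm = TRUE)   # confirms the 0-to-390 spread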
Counts of the categorical variables

Name (each of the 76 cereals appears exactly once; the first six are shown, with the remaining 70 grouped as "Other"):
## 100% Bran: 1
## 100% Nat. Bran Oats & Honey: 1
## 100% Nat. Low Fat Granola w raisins: 1
## All-Bran: 1
## All-Bran with Extra Fiber: 1
## Almond Crunch w Raisins: 1
## (Other): 70

Manufacturer (with the matching Mfr code):
## American Home: 1   (A: 1)
## General Mills: 25  (G: 25)
## Kelloggs: 23       (K: 23)
## Nabisco: 5         (N: 5)
## Post: 10           (P: 10)
## Quaker Oats: 12    (Q: 12)

Fiber.Gr:
## Low: 33
## Medium: 32
## High: 11

In terms of the categorical variables, Hot.Cold is highly imbalanced: there are 73 cold cereals but only 3 hot ones. Histograms of the continuous variables are shown below.
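A minimal sketch for producing these histograms, reusing the numeric-column list from earlier:

    num_vars <- names(cereals)[sapply(cereals, is.numeric)]
    par(mfrow = c(4, 4))                                   # grid of panels
    for (v in num_vars) hist(cereals[[v]], main = v, xlab = v)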
On the basis of the histograms, the following findings can be noted:
a) The highest variability among the continuous variables is shown by Enriched, cups.serv and Fat, whose data points are scattered into the tails rather than concentrated around the mean.
b) The most highly skewed variables are Calories, Fiber, Protein and Fat; their histograms lean towards either the left or the right tail.
c) In terms of extreme values, Calories.fr.Fat and Enriched are the variables with some outliers.

4) Missing values

    varlist          nmiss  complete  complete_per  mean         median  minimum
1   Calories         0      76        100.00000     140.5263158  120.0   50.00
2   Protein          0      76        100.00000       3.2500000    3.0    1.00
3   Fat              0      76        100.00000       1.4473684    1.0    0.00
4   Sodium           0      76        100.00000     194.8684211  210.0    0.00
5   Fiber            0      76        100.00000       3.0657895    3.0    0.00
6   Complex.Carbos   0      76        100.00000      19.1578947   17.5    7.00
7   Tot.Carbo        0      76        100.00000      31.3684211   27.0   11.00
8   Sugars           0      76        100.00000       9.1447368   11.0    0.00
9   Calories.fr.Fat  0      76        100.00000      12.3684211   10.0    0.00
10  Potassium        0      76        100.00000     121.9736842   92.5    0.00
11  Enriched         0      76        100.00000      28.6184211   25.0    0.00
12  Wt.serving       13     63         82.89474      36.6507937   30.0   12.00
13  cups.serv        0      76        100.00000       0.8910526    1.0    0.33

In the current case only Wt.serving has missing values: 13 data points are missing, leaving 63 of the 76 observations (82.9%) complete. To handle the missing data, three different methods have been used: mean value imputation, median value imputation and mode value imputation. In mean value imputation the missing values are replaced by the mean of the series; likewise, in median and mode value imputation the median and the mode, respectively, are substituted for the missing values.
The effect of mean value imputation on Wt.serving is shown in the figure above: after imputation, more values lie around the mean. However, some extreme values remain in the data set.
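The three imputation strategies are straightforward in base R; only the mode needs a small helper, since base R has no mode function for data values. A minimal sketch on the assumed cereals data frame:

    # Mean value imputation for Wt.serving (swap in median() for median imputation)
    wt <- cereals$Wt.serving
    cereals$Wt.serving[is.na(wt)] <- mean(wt, na.rm = TRUE)

    # Helper for mode value imputation: most frequent non-missing value
    stat_mode <- function(x) {
      ux <- unique(x[!is.na(x)])
      ux[which.max(tabulate(match(x, ux)))]
    }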
Part B: Building Predictive Models Using a Real-World Business Case

After the data exploration in the first section, the second section deals with model building. Using sales data for the Toyota Corolla, a model for predicting car prices has been developed. The data set contains 37 different features for 1,436 cars sold in Australia (the completeness check below reports 1436 observations per variable).

1) Data Exploration and Cleaning

a) To examine the price distribution of the cars, a histogram has been used; the resulting plot is shown in the following figure. The results show that most cars are priced between 5,000 and 10,000, although some cars are priced above 30,000.
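A minimal sketch for the price histogram and its descriptive statistics, assuming the car data sits in a file named ToyotaCorolla.csv (the file name is an assumption); the summary() output appears below:

    corolla <- read.csv("ToyotaCorolla.csv")   # assumed file name
    hist(corolla$Price, main = "Distribution of Price", xlab = "Price")
    summary(corolla$Price)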
##    Min.  1st Qu.  Median    Mean  3rd Qu.    Max.
##    4350     8450    9900   10731    11950   32500

Furthermore, the descriptive statistics indicate that the average price of a Toyota Corolla is 10,731; as discussed, prices range from as low as 4,350 to as high as 32,500. This also indicates that there are some outliers in the data set.

b) Checking for the missing values

The analysis indicates that there are no missing values in the current data set: all 37 variables are 100% complete across the 1,436 observations.

    varlist           nmiss  complete  complete_per
1   Id                0      1436      100
2   Model             0      1436      100
3   Price             0      1436      100
4   Age_08_04         0      1436      100
5   Mfg_Month         0      1436      100
6   Mfg_Year          0      1436      100
7   KM                0      1436      100
8   Fuel_Type         0      1436      100
9   HP                0      1436      100
10  Met_Color         0      1436      100
11  Automatic         0      1436      100
12  cc                0      1436      100
13  Doors             0      1436      100
14  Cylinders         0      1436      100
15  Gears             0      1436      100
16  Quarterly_Tax     0      1436      100
17  Weight            0      1436      100
18  Mfr_Guarantee     0      1436      100
19  BOVAG_Guarantee   0      1436      100
20  Guarantee_Period  0      1436      100
21  ABS               0      1436      100
22  Airbag_1          0      1436      100
23  Airbag_2          0      1436      100
24  Airco             0      1436      100
25  Automatic_airco   0      1436      100
26  Boardcomputer     0      1436      100
27  CD_Player         0      1436      100
28  Central_Lock      0      1436      100
29  Powered_Windows   0      1436      100
30  Power_Steering    0      1436      100
31  Radio             0      1436      100
32  Mistlamps         0      1436      100
33  Sport_Model       0      1436      100
34  Backseat_Divider  0      1436      100
35  Metallic_Rim      0      1436      100
36  Radio_cassette    0      1436      100
37  Tow_Bar           0      1436      100
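This completeness check can be reproduced with two base R one-liners on the assumed corolla data frame:

    colSums(is.na(corolla))                     # missing count per variable
    round(100 * colMeans(!is.na(corolla)), 2)   # completeness percentage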
c) Since R is being used for the analysis, there is no need to convert the categorical variables to numeric form separately: R creates dummy variables for factors automatically when a model is fitted.

d) Correlation Analysis

Since there are 37 variables in the data set, correlation coefficients have been reported only for the variable pairs where the coefficient is high: greater than 0.8 and less than 1 (the upper bound excludes each variable's correlation with itself). This filters out the weaker correlations; a sketch of the filter is given below. The resulting dimension reduction shows that the manufacturing year and the age of the car are both highly correlated with the price of the car, so it is better to drop one of the two. Since the manufacturing year has the higher coefficient with price, age has been dropped. Furthermore, among the categorical variables, central lock and powered windows both show high correlation with price; because the coefficient between price and powered windows is higher than that between price and central lock, powered windows has been retained.
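A minimal sketch of this correlation filter on the numeric columns (corolla data frame assumed from earlier; each pair appears twice because the matrix is symmetric):

    num <- corolla[sapply(corolla, is.numeric)]
    cm  <- cor(num)
    idx <- which(cm > 0.8 & cm < 1, arr.ind = TRUE)   # high, non-self correlations
    data.frame(var1 = rownames(cm)[idx[, 1]],
               var2 = colnames(cm)[idx[, 2]],
               r    = cm[idx])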
2) Regression Modelling

Regression analysis is used to predict a response variable from a set of explanatory variables. In the current case, the price of the car is the response (dependent) variable and all the other features are the explanatory variables.

The first model is as follows:

## Residuals:
##     Min      1Q  Median      3Q     Max
## -7904.0  -718.8     2.5   728.3  6027.0
##
## Coefficients:
##                   Estimate   Std. Error  t value  Pr(>|t|)
## (Intercept)       1.058e+03   9.350e+02    1.132   0.25788
## Age_08_04        -1.169e+02   3.363e+00  -34.776   < 2e-16 ***
## Boardcomputer    -2.363e+02   1.050e+02   -2.251   0.02455 *
## Automatic_airco   2.611e+03   1.658e+02   15.744   < 2e-16 ***
## Weight            1.410e+01   7.735e-01   18.227   < 2e-16 ***
## KM               -1.799e-02   1.103e-03  -16.311   < 2e-16 ***
## CD_Player         2.784e+02   9.307e+01    2.991   0.00283 **
## Powered_Windows   4.329e+02   7.022e+01    6.165  9.18e-10 ***
## HP                2.086e+01   2.396e+00    8.708   < 2e-16 ***
## ABS              -2.111e+02   9.186e+01   -2.298   0.02168 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1222 on 1426 degrees of freedom
## Multiple R-squared: 0.8871,  Adjusted R-squared: 0.8864
## F-statistic: 1245 on 9 and 1426 DF,  p-value: < 2.2e-16

In this model the R-squared is 0.88, which shows that 88% of the variation in price is explained by the explanatory variables. In the next model, variables that were not significant, such as central lock, were removed; in the second model the variables that were not highly significant were likewise removed. The optimal model was found on the third iteration, and its results are shown below. All of its coefficients are highly significant and the R-squared is also high, so it is considered the optimal model (Bai & Ng, 2009).

Coefficients:
##                   Estimate   Std. Error  t value  Pr(>|t|)
## (Intercept)       1.058e+03   9.350e+02    1.132   0.25788
## Age_08_04        -1.169e+02   3.363e+00  -34.776   < 2e-16 ***
## Boardcomputer    -2.363e+02   1.050e+02   -2.251   0.02455 *
## Automatic_airco   2.611e+03   1.658e+02   15.744   < 2e-16 ***
## Weight            1.410e+01   7.735e-01   18.227   < 2e-16 ***
## KM               -1.799e-02   1.103e-03  -16.311   < 2e-16 ***
## CD_Player         2.784e+02   9.307e+01    2.991   0.00283 **
## Powered_Windows   4.329e+02   7.022e+01    6.165  9.18e-10 ***
## HP                2.086e+01   2.396e+00    8.708   < 2e-16 ***
## ABS              -2.111e+02   9.186e+01   -2.298   0.02168 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1222 on 1426 degrees of freedom
## Multiple R-squared: 0.8871,  Adjusted R-squared: 0.8864
## F-statistic: 1245 on 9 and 1426 DF,  p-value: < 2.2e-16

b) Evaluating the accuracy of the regression model

The optimal regression model can be evaluated on the basis of its R-squared, multicollinearity, and the statistical significance of its coefficients. In the current case the R-squared is reasonably high at 0.88, the VIF test indicates no problem of multicollinearity, and the regression coefficients are statistically significant (Armstrong, 2012; Dufour & Dagenais, 1985; Lanfranchi, Viola, & Nascimento, 2010).
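A minimal sketch of how this model and the VIF check could be run, using the variable names from the output above (corolla data frame assumed; vif() comes from the car package, an assumed dependency):

    fit <- lm(Price ~ Age_08_04 + Boardcomputer + Automatic_airco + Weight +
                KM + CD_Player + Powered_Windows + HP + ABS, data = corolla)
    summary(fit)     # coefficients, R-squared, F-statistic

    library(car)     # assumed dependency for vif()
    vif(fit)         # values near 1 suggest no multicollinearity problem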
3) Decision tree

Results from the decision tree are discussed in this section. As the figure shows, the error does not decrease beyond roughly 100 trees: the error curve is flat after that point. The randomForest package has been used for the decision tree. Four different models have been run, and the results show that after the fourth iteration the explained variance does not increase even when the number of trees is increased. The fourth model is therefore the optimal one, and its variable importances are as follows:

##                      %IncMSE  IncNodePurity
## Age_08_04        13967550.66    11424442205
## Boardcomputer      523680.86     1820432381
## Automatic_airco    279111.44     1186671614
## Weight            1218683.98     1446642212
## KM                1053264.60     2015571025
## CD_Player           50608.29      113664776
## Powered_Windows    200498.28      140180060
## HP                 389816.37      486749793
## ABS                 11802.00       46241351
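A minimal sketch using the randomForest package named above (the formula and data frame carry over from the regression section; the seed is arbitrary):

    library(randomForest)
    set.seed(42)   # arbitrary seed for reproducibility
    rf <- randomForest(Price ~ Age_08_04 + Boardcomputer + Automatic_airco +
                         Weight + KM + CD_Player + Powered_Windows + HP + ABS,
                       data = corolla, ntree = 500, importance = TRUE)
    plot(rf)         # error versus number of trees
    importance(rf)   # %IncMSE and IncNodePurity columns
    print(rf)        # includes % variance explained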
b) One of the most popular decision tree methods is the random forest. A single decision tree can fit the training data too closely, so there is a risk of overfitting when the model is evaluated on training data. The random forest technique addresses this by building many trees on the given data and averaging their predictions, which makes the predictions more stable and less prone to overfitting (Fokin & Hagrot, 2016).

4) Comparison of the models

On the basis of the results from both the regression analysis and the decision tree, it can be concluded that the decision tree (random forest) explains more of the variance (91%) than the regression model (R-squared of 0.88). However, the random forest does not show the impact of each explanatory variable on the dependent variable the way the regression coefficients do. From a business point of view, therefore, the regression model is more appropriate, as the effect of each variable can be clearly identified.

References

Armstrong, J. S. (2012). Illusions in regression analysis. International Journal of Forecasting, 6, 689–694.

Bai, J., & Ng, S. (2009). Tests for skewness, kurtosis, and normality for time series data. Journal of Business & Economic Statistics, 23(1), 49–60. https://doi.org/10.1198/073500104000000271

Dufour, J. M., & Dagenais, M. G. (1985). Durbin-Watson tests for serial correlation in regressions with missing observations. Journal of Econometrics, 27(3), 371–381. https://doi.org/10.1016/0304-4076(85)90012-0

Fokin, D., & Hagrot, J. (2016). Constructing decision trees for user behavior prediction in the online consumer market. KTH Royal Institute of Technology.

Lanfranchi, L. M. M. M., Viola, G. R., & Nascimento, L. F. C. (2010). The use of Cox regression to estimate the risk factors of neonatal death in a private NICU. Taubate.