Analyzing BMW Used Car Prices: A Business Statistics Regression Report

Added on  2020/05/08

This report presents a statistical analysis of a dataset of 1049 BMW used cars from 2016, focusing on the relationship between price (dependent variable) and independent variables such as age, mileage, engine power, and various dummy variables. The analysis includes descriptive statistics, such as central tendency, variability, and skewness, along with graphical representations like histograms and box plots. Regression analysis is performed to determine the significance of each independent variable on price, including hypothesis testing and interpretation of slope coefficients. The report also addresses model assumptions and limitations, offering suggestions for future improvements, such as including additional variables and employing stratified sampling for a more representative dataset. The analysis concludes with a predicted regression equation and an assessment of its predictive capabilities.
BUSINESS STATISTICS
Introduction
Data were collected from a used car website in 2016, and a sample of 1,049 BMWs is included in
the study. The objective of the study is to determine the underlying relationship between price
(the dependent variable) and various independent variables, which include age, kilometres
travelled and engine power rating, besides a host of other factors that are captured through dummy
variables. Various statistical techniques have been used to analyse the data and to answer the
questions raised.
Analysis
A. Descriptive statistics for the data are shown below:
The first quartile, third quartile and interquartile range for the variables Price, Kilometer and
Power KW are shown below:
Central tendency: The measures of central tendency agree when the mean, median and mode are
equal. When a variable is skewed, these measures differ and its distribution cannot be treated
as normal.
Variability: Dispersion is assessed using the variance, standard deviation, range and
interquartile range. The degree of dispersion is judged by comparing the standard deviation with
the mean.
Shape: A positive skew indicates a longer right tail, and a negative skew indicates a longer left
tail. The variables Price, Age, Convertible, Power KW, Hatchback and Coupe show positive skew,
while Automatic, Kilometer, Petrol and Sedan show negative skew.
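The measures discussed above can be computed along the following lines. This is a minimal sketch using only Python's standard library; the function name `describe` and the price list are hypothetical and are not taken from the BMW dataset. The quartiles use the "inclusive" interpolation method, which may differ slightly from the method used by the spreadsheet add-in.

```python
import statistics


def describe(values):
    """Summarise central tendency, spread and shape for a list of numbers."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation
    # "inclusive" interpolates between the minimum and maximum of the data
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    # Adjusted Fisher-Pearson sample skewness: positive means a longer right tail
    skew = n / ((n - 1) * (n - 2)) * sum(((x - mean) / sd) ** 3 for x in values)
    return {
        "mean": mean,
        "median": median,
        "sd": sd,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
        "skew": skew,
    }


# Hypothetical prices in dollars, NOT taken from the BMW dataset
prices = [5000, 7000, 8000, 9000, 10000, 12000, 15000, 20000, 45000]
summary = describe(prices)
print(summary)
```

In this illustrative sample the mean exceeds the median and the skewness is positive, which is the pattern the report describes for Price.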
Comment on dummy variables: The mean of the dummy variable Petrol is 0.63. Since a dummy variable
takes only the values 0 and 1, its mean is the share of ones, so about 63% of the vehicles are
petrol cars and the remaining 37% are diesel vehicles.
B. The distribution of price is shown in the histogram below:
The histogram shows a long right tail, which indicates positive skew. The price data are
therefore not normally distributed. In addition, nearly 70% of the vehicles are priced below
$18,081.
C. Box and whisker plot for the variable age is represented below:
The box and whisker plot shows that the variable age is positively skewed and contains some
outliers. Outliers are identified by computing lower and upper limits:
Lower Limit = Q1 – 1.5 × IQR and Upper Limit = Q3 + 1.5 × IQR
Some of the values fall outside this range and are therefore classified as outliers.
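The outlier rule above can be sketched as follows. The function name `iqr_outliers` and the list of ages are hypothetical illustrations, not values from the dataset, and the quartile method is an assumption.

```python
import statistics


def iqr_outliers(values):
    """Return the lower limit, upper limit and the values outside them."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower or x > upper]
    return lower, upper, outliers


# Hypothetical car ages in years, NOT taken from the BMW dataset
ages = [1, 2, 3, 3, 4, 5, 6, 8, 30]
lower, upper, outliers = iqr_outliers(ages)
print(lower, upper, outliers)
```

Here Q1 = 3 and Q3 = 6, so the limits are -1.5 and 10.5 and only the 30-year-old car is flagged.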
D. The pivot table of relative frequencies for the variables age and convertible is shown
below:
Probability (vehicle is a convertible) = 0.1033
Probability (convertible is older than 25 years) = 0.021
Yes, the variables age and convertible can be regarded as dependent on each other. Convertibles
in the sample tend to be younger, and this dependence is also supported by the Chi-square test.
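A Chi-square test of independence on a pivot table of counts can be sketched as below. The 2 × 2 counts are hypothetical (the report's pivot table is not reproduced here), and the function name `chi_square_statistic` is an illustration. The 5% critical value for 1 degree of freedom, 3.841, is the standard table value.

```python
def chi_square_statistic(observed):
    """Return the Chi-square statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (count - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, dof


# Hypothetical counts: rows = convertible / not convertible,
# columns = younger / older age group. NOT the report's data.
table = [[30, 10], [40, 120]]
chi2, dof = chi_square_statistic(table)
CRITICAL_5PCT_DF1 = 3.841  # standard table value for 1 degree of freedom
dependent = chi2 > CRITICAL_5PCT_DF1
print(round(chi2, 4), dof, dependent)
```

A statistic above the critical value leads to rejecting independence, which is the conclusion the report draws for age and convertible.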
E. The 95% confidence interval for the population mean price of hatchbacks, computed with the
KADD add-in for Excel, is shown below:
The 95% confidence interval for the population mean price of coupes, computed with the same
add-in, is shown below.
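A confidence interval for a mean can be approximated as in the sketch below. It assumes a large-sample normal (z) interval, whereas a spreadsheet add-in would typically use the t distribution, so results can differ slightly for small samples. The function name and the price list are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, confidence=0.95):
    """Large-sample z interval for the population mean (an approximation)."""
    n = len(values)
    centre = mean(values)
    standard_error = stdev(values) / sqrt(n)
    # Two-sided critical value, about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * standard_error
    return centre - margin, centre + margin


# Hypothetical hatchback prices in dollars, NOT taken from the BMW dataset
hatchback_prices = [10000 + 250 * i for i in range(40)]
low, high = mean_confidence_interval(hatchback_prices)
print(round(low, 2), round(high, 2))
```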
F. Level of significance = 5%
Null Hypothesis H0 : p = 0.5
Alternative Hypothesis H1 : p < 0.5
where p is the population proportion of convertibles having a manual transmission.
The p value is 0.0824, which is higher than the level of significance, so the null hypothesis is
not rejected. There is therefore insufficient evidence to support the claim that the
population proportion of convertibles having a manual transmission is less than 50%.
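A lower-tailed test for a proportion can be sketched as follows. The counts (43 manual cars out of 100 convertibles) are hypothetical and chosen only for illustration, so the p value differs from the 0.0824 reported above; the normal approximation to the sampling distribution is also an assumption.

```python
from math import sqrt
from statistics import NormalDist


def lower_tail_proportion_test(successes, n, p0=0.5):
    """One-sample z test of H0: p = p0 against H1: p < p0."""
    p_hat = successes / n
    standard_error = sqrt(p0 * (1 - p0) / n)  # computed under H0
    z = (p_hat - p0) / standard_error
    p_value = NormalDist().cdf(z)  # area in the lower tail
    return z, p_value


# Hypothetical counts, NOT taken from the BMW dataset
z, p_value = lower_tail_proportion_test(successes=43, n=100)
alpha = 0.05
reject_h0 = p_value < alpha
print(round(z, 2), round(p_value, 4), reject_h0)
```

As in the report, a p value above 5% means H0 is not rejected, which is different from showing that the claim is false.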
G. Multiple linear regression output is represented below:
H. Level of significance = 5%
Relevant hypotheses (for variable age)
Null Hypothesis: H0 : βage = 0
Alternative Hypothesis: H1 : βage ≠ 0
P value approach
Based on the above linear regression model, the test statistic for the variable age is –15.23
and the corresponding p value is approximately 0. The p value is lower than the level of
significance, so there is sufficient statistical evidence to reject the null hypothesis in favour
of the alternative hypothesis. The slope coefficient of the variable age is therefore
statistically significant and cannot be assumed equal to zero.
Critical value approach
Degrees of freedom = Sample size – Number of independent variables – 1 = 1049 – 9 – 1 = 1039
Based on the above linear regression model, the test statistic for the variable age is –15.23.
With 1039 degrees of freedom, the two-tailed critical values at the 5% level are approximately
± 1.96. The t statistic does not fall within the range of the critical values (i.e. -1.96 and
+1.96). There is therefore sufficient statistical evidence to reject the null hypothesis in
favour of the alternative hypothesis, and the slope coefficient of the variable age is
statistically significant and cannot be assumed equal to zero.
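Both decision rules can be checked numerically, as in the sketch below. It uses the t statistic of -15.23 reported above and, as an assumption, approximates the t distribution by the standard normal, which is reasonable when the degrees of freedom exceed one thousand. The function name is hypothetical.

```python
from statistics import NormalDist


def two_tailed_decision(t_stat, alpha=0.05):
    """Return critical value, approximate p value and the reject decision."""
    normal = NormalDist()
    critical = normal.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    p_value = 2 * normal.cdf(-abs(t_stat))    # two-tailed area
    reject = abs(t_stat) > critical
    return critical, p_value, reject


# t statistic for the age coefficient as reported in the regression output
critical, p_value, reject = two_tailed_decision(-15.23)
print(round(critical, 2), p_value, reject)
```

The p value and critical value approaches always give the same decision, since both compare the same statistic with the same significance level.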
I. The slope coefficients in the regression can be interpreted in the following manner. Each
coefficient is the estimated change in price with all other variables held constant.
• Automatic – A used car with automatic transmission commands a price higher by $791.51. The
positive sign of the coefficient is expected, given the greater comfort and convenience of an
automatic transmission.
• Kilometre – For each additional kilometre travelled, the used car price falls by $0.09. The
negative sign of the coefficient is expected, as greater usage leads to more wear and tear.
• Petrol – A used car running on petrol commands a price lower by $1,492.76. The negative sign
of the coefficient is expected, given the lower operating costs of a diesel car.
• Damage – A used car with damage commands a price lower by $2,166.19. The negative sign of the
coefficient is expected, as any damage lowers the value of the car.
• PowerKW – For each 1 KW increase in engine power rating, the used car price rises by $100.04.
The positive sign is expected, as cars with higher power rating engines tend to be more
expensive.
• Hatchback – A used car which is a hatchback commands a price lower by $2,116.31. The negative
sign is expected, as hatchbacks are typically used for cargo, which results in higher
depreciation of the car's value.
• Sedan – A used car which is a Sedan commands a price lower by $2,871.32. The negative sign is
expected, since used car buyers prefer higher end models, where the price differential is
higher.
• Convertible – A used car which is a convertible commands a price higher by $2,443.33. The
positive sign is expected, as convertibles tend to be more expensive than regular cars.
J. The adjusted R² is 0.6932. This indicates that 69.32% of the variation in price is accounted
for by the independent variables in the regression model highlighted above, after adjusting for
the number of predictors. Adjusted R² and R² differ only slightly, which may be because most of
the independent variables (except hatchback and automatic) proved significant.
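The link between R² and adjusted R² can be sketched as below. The adjusted value 0.6932 and the sample size 1049 come from the report; k = 9 is an assumption obtained by counting the independent variables in the reported regression equation, and the function names are hypothetical.

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R-squared for n observations and k independent variables."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)


def r_squared_from_adjusted(adj_r2, n, k):
    """Invert the adjustment to recover the unadjusted R-squared."""
    return 1 - (1 - adj_r2) * (n - k - 1) / (n - 1)


n = 1049   # sample size stated in the report
k = 9      # ASSUMPTION: number of predictors counted from the equation
implied_r2 = r_squared_from_adjusted(0.6932, n, k)
print(round(implied_r2, 4))
```

Under these assumptions the implied R² is only about 0.003 above the adjusted value, because the penalty factor (n - 1) / (n - k - 1) is very close to 1 when the sample is large relative to the number of predictors.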
K. Hypothesis Testing
H0: All slope coefficients are equal to zero, i.e. no independent variable is significant.
H1: At least one independent variable has a slope coefficient which is not equal to zero.
ANOVA output
The p value (also denoted by Significance F) from the above output is approximately 0. As the
p value < α, H0 is rejected. The regression model is therefore significant overall, owing to the
existence of at least one significant slope coefficient.
L. Since the R² value leaves a sizeable share of the variation in price unexplained, additional
independent variables should be added to the regression model. One candidate is the number of
annual services given to the car by the previous owner, as the condition of the vehicle is also
an important factor. For instance, a well maintained five year old car might be in better
condition than a two year old car that has not been well maintained. Another critical factor is
the difference in resale value associated with particular models. Although the model of the car
is recorded, the regression analysis does not currently reflect differences in resale value
between models.
M. The predicted regression equation is indicated below.
Price = 23435.55 - 735.02*Age + 791.51*Automatic - 0.09*Kilometer - 1492.76*Petrol -
2166.19*Damage + 100.04*PowerKW - 2116.31*Hatchback - 2871.32*Sedan + 2443.33*Convertible
Given the various input values, the price of the car is estimated as follows.
Price = 23435.55 - 735.02*5 + 791.51*1 - 0.09*75000 - 1492.76*1 - 2166.19*0 + 100.04*110 -
2116.31*0 - 2871.32*1 + 2443.33*0 = $ 20,279
Evaluating the rounded coefficients exactly as printed gives about $20,442. The reported figure
of $20,279 presumably reflects unrounded coefficients, since rounding the kilometre coefficient
to -0.09 has a noticeable effect when it is multiplied by 75,000 km.
Yes, it is appropriate to predict the value of the used car under these circumstances, since the
values of the independent variables lie within the range of the values which were used to derive
the regression model.
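The prediction can be reproduced with a short sketch. The coefficients are the rounded values printed in the equation above; the dictionary layout and the function name `predict_price` are hypothetical. Because the printed coefficients are rounded, the result need not match a figure computed from unrounded spreadsheet coefficients.

```python
INTERCEPT = 23435.55

# Rounded slope coefficients as printed in the regression equation
COEFFICIENTS = {
    "Age": -735.02,
    "Automatic": 791.51,
    "Kilometer": -0.09,
    "Petrol": -1492.76,
    "Damage": -2166.19,
    "PowerKW": 100.04,
    "Hatchback": -2116.31,
    "Sedan": -2871.32,
    "Convertible": 2443.33,
}


def predict_price(car):
    """Evaluate the fitted equation for one car given as a dict of inputs."""
    return INTERCEPT + sum(COEFFICIENTS[name] * car[name] for name in COEFFICIENTS)


# Input values used in the report's worked example
car = {
    "Age": 5, "Automatic": 1, "Kilometer": 75000, "Petrol": 1, "Damage": 0,
    "PowerKW": 110, "Hatchback": 0, "Sedan": 1, "Convertible": 0,
}
price = predict_price(car)
print(round(price, 2))
```

With these rounded coefficients the function returns about 20,442.28.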
N. The normal probability plot and the residual scatter plot are shown below.
[Figure: Normal Probability Plot; x-axis: Sample Percentile, y-axis: price]
[Figure: age Residual Plot; x-axis: age, y-axis: Residuals]
The residual plot against the independent variable age shows a visible pattern rather than a
random scatter of residuals. Similarly, the normal probability plot deviates from a straight
line. Hence, it is fair to conclude that the assumptions of linear regression relating to the
error terms, namely linearity, normality and homoscedasticity (constant variance), do not seem
to be satisfied for the given regression model.
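A crude numeric counterpart to reading a residual plot is to compare the spread of residuals at low and high values of a predictor. The sketch below fits a simple one-variable line by least squares and compares residual variances in the two halves of the data. It is an informal check, not a formal test, and the data and function names are hypothetical.

```python
from statistics import mean, pvariance


def fit_line(x, y):
    """Ordinary least squares intercept and slope for one predictor."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residual_spread_by_half(x, y):
    """Residual variance for the lower and upper half of the predictor."""
    intercept, slope = fit_line(x, y)
    residuals = [b - (intercept + slope * a) for a, b in zip(x, y)]
    ordered = sorted(zip(x, residuals))  # sort by predictor value
    half = len(ordered) // 2
    low = [r for _, r in ordered[:half]]
    high = [r for _, r in ordered[half:]]
    return pvariance(low), pvariance(high)


# Hypothetical data whose scatter widens as x grows, NOT the BMW dataset
x = list(range(1, 11))
y = [2 * a + (-1) ** a * 0.5 * a for a in x]
low_var, high_var = residual_spread_by_half(x, y)
print(round(low_var, 2), round(high_var, 2))
```

A much larger variance in one half suggests that the constant variance assumption is doubtful, which mirrors the fan or funnel shape one looks for in a residual plot.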
O. The given data do not indicate the actual distribution of the vehicle population in Berlin,
as the sample was collected from one particular website, which may not be representative of the
population. Stratified sampling is advisable for obtaining a more representative sample, because
it ensures that critical parameters such as model distribution and age are represented in the
sample in the same proportion as in the population. No, I would not expect the results to hold
for Mercedes, as the two are competitors and would not necessarily have comparable sales.
P. Based on the given data, number of cars with engine power exceeding 200 KW = 75
Number of these cars which are Sedans = 53
Hence, probability of a high performance vehicle being a Sedan = (53/75) = 0.7067
Also, number of trials = 5
Let X be the number of Sedans among the five trials.
P(X = 0) = BINOMDIST(0, 5, 0.7067, FALSE) = 0.0022
P(X = 1) = BINOMDIST(1, 5, 0.7067, FALSE) = 0.0261
P(X = 2) = BINOMDIST(2, 5, 0.7067, FALSE) = 0.1260
P(X = 3) = BINOMDIST(3, 5, 0.7067, FALSE) = 0.3036
P(X = 4) = BINOMDIST(4, 5, 0.7067, FALSE) = 0.3658
P(X = 5) = BINOMDIST(5, 5, 0.7067, FALSE) = 0.1763
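The same probabilities can be reproduced outside Excel with the binomial probability mass function. This is a minimal sketch; the function name `binomial_pmf` is hypothetical, and p is rounded to 0.7067 as in the calculations above.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


p_sedan = 0.7067  # 53/75, rounded to four decimals as in the report
n_trials = 5
distribution = [binomial_pmf(k, n_trials, p_sedan) for k in range(n_trials + 1)]

for k, probability in enumerate(distribution):
    print(f"P(X = {k}) = {probability:.4f}")
```

The six probabilities sum to 1, and the most likely outcome is four Sedans out of five.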
Graphical representation
[Figure: Probability Graph; x-axis: x, y-axis: Probability (P)]
Conclusion
The given data have been analysed by constructing a regression model and by summarising the
descriptive statistics. The majority of the independent variables were found to be statistically
significant in determining price. However, more predictor variables need to be introduced
into the model so that its predictive and explanatory power can be improved.
Also, the data collected do not seem representative of the population, which is a matter of
concern and needs to be addressed in future studies.