Principles of Data Science for Business

Added on  2022/08/25

Table of Contents
Section 1: Assessment of Itineract Travel Co briefing note
Section 2: Overview of investigation
Section 3: Analysis and results
Section 4: Ethical and security considerations
Section 5: Data Science in Next Steps and Potential Solutions
Report Appendix: Statistics and Methodology
References
ITINERACT TRAVEL CO – SEARCHABILITY CHALLENGE:
REPORT & RECOMMENDATIONS
Section 1: Assessment of Itineract Travel Co briefing note:
Data science has developed into a recognised field of expertise that produces results which support effective decision-making. Computing professionals now recognize that they need to learn the conventional skills of data analysis, data collection and coding at scale (Data science, 2020). Data scientists need to oversee the full scope of the data science development cycle and have enough independence and understanding to maximise the return at every step in order to discover valuable intelligence for their companies.
As Itineract Travel Co's growth strategy focuses on increasing the number of visitors to the website and extending the service to thousands of people, it will become extremely complex to match the right experience to each future customer, while time also becomes crucial to meeting the organization's development targets. The heterogeneous nature of the clients and the unique nature of the activities make product searchability challenging as the company plans to expand its user base. Given the current knowledge gap about user experience, data science tools are important for striking a balance that achieves a positive user experience, which will not only help scale up Itineract Travel Co's market but also improve the overall customer experience. Using historical data from the simple pilot recommendation system currently used in the company, in which customers see travel experiences related to the causes they initially booked, data science becomes an effective tool for matching the right experiences to individual customers, thus simplifying an otherwise complicated process.
From Itineract Travel Co's core question, the company's primary goal in resolving the current issue is to accurately match customer interactions and interests to the services offered in order to improve customer satisfaction. Growth increases the complexity of finding a product appropriate to each customer's needs and desires. If a clear information set is confirmed, Itineract Travel Co must establish and manage a state-of-the-art recommendation framework and build an internal data science team. Considering the company's vulnerability to numerous political causes, any recommendation system would need to be cautiously prepared and monitored. The multiple decision-making approaches include a variety of parameters for the consideration of choices. Strategic decision-making is also essential to company performance. As a data scientist in a digital marketing and analytics consultancy, different decision policies must be followed according to the principles, risk behaviours and expectations of future results of the decision-makers (Provost and Fawcett, 2013). The components of decision-making, for example the decision-maker, the circumstances of the decision and the problem-solving procedures, each have particular characteristics.
The dataset provided, including age, favourite cause, total revenue earned, experiences purchased and id, is adequate for answering the investigation questions, which aim to assess the functional properties of customers' willingness to travel and their cause preference, and to serve as a decision-making tool. How well it does so, however, depends on the accuracy and sample size of the data provided.
The work involves data preparation, exploratory data analysis and inferential statistical analysis. Within this report, the basic specifics are provided in the appendix. The report ends by presenting potential answers to the findings of the study, as proposed by Lakowicz (2013). The analysis was done using the Excel data analysis tool. The statistical measures used include the Pearson correlation coefficient and regression.
Section 2: Overview of investigation
The analysis started by arranging the information in a form suitable for exploratory data analysis (EDA). This included reorganizing the data and processing any data that I thought might introduce bias. EDA is "the tool for the standardized representation of all factors through the data visualization". The trends identified through EDA indicate how the shift in customers' travelling needs occurred and provide statistical insight for exploration. The result was a large number of visualizations demonstrating how the traffic problem evolved over time. The next step of the investigation was to establish a statistical pattern on which further statistical analysis could be centred. The distribution of the results was then checked, and I found that the results were not normally distributed and that the values were closer to a Poisson distribution. This helped me understand which inferential statistics are vital for decision-making. Inferential statistics were then computed using a bootstrap. This made it possible to quantify confidence intervals in order to decide whether statistically significant differences exist, and whether these differences were the result of a change or could have occurred anyway within customer preferences and tourist destinations. This was done to ensure that management can be more confident in the improvements observed and can therefore depend on them in the decision-making process. Finally, these findings can be used to consider possible approaches to the problems and to propose data science methods that could be adapted and recommended for execution and effectiveness.

The data set on customer experience, including age, favourite cause, the rating for the experience and gender across 1000 observations, was employed in the process. The data set also includes the id code allotted to each individual customer or group of customers visiting a specific place in that time period according to their desires and requirements. In addition, each observation records the total revenue generated and whether or not the customer was selected for the pilot.
Section 3: Analysis and results
To perform a proper and sound analysis and determine suitable results, regression models are used. The models help to establish whether or not the customers visiting a particular place are satisfied.
Linear regression analysis is useful for determining the values that support making proper recommendations. The four features considered are age, experiences purchased, id and total revenue.
Descriptive Statistics
| Variable | Mean | Std. Deviation | N |
|---|---|---|---|
| experiences_purchased | 1.66 | 2.037 | 1000 |
| age | 65.97 | 54.317 | 1000 |
| id | 499.50 | 288.819 | 1000 |
| total_revenue | 96.20 | 268.240 | 1000 |
Table 1: Descriptive Statistics of the main variables (Experiences Purchase)
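Correlations
| Statistic | Variable | experiences_purchased | age | id | total_revenue |
|---|---|---|---|---|---|
| Pearson Correlation | experiences_purchased | 1.000 | .011 | -.023 | .382 |
| Pearson Correlation | age | .011 | 1.000 | .030 | -.012 |
| Pearson Correlation | id | -.023 | .030 | 1.000 | -.012 |
| Pearson Correlation | total_revenue | .382 | -.012 | -.012 | 1.000 |
| Sig. (1-tailed) | experiences_purchased | . | .368 | .229 | .000 |
| Sig. (1-tailed) | age | .368 | . | .169 | .355 |
| Sig. (1-tailed) | id | .229 | .169 | . | .357 |
| Sig. (1-tailed) | total_revenue | .000 | .355 | .357 | . |
Table 2: Correlation (Experiences Purchase)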
Model Summary (b)
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .383 (a) | .146 | .144 | 1.885 |
a. Predictors: (Constant), total_revenue, id, age
b. Dependent Variable: experiences_purchased
Table 3: Model Summary (Experiences Purchase)
ANOVA (a)
| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 607.447 | 3 | 202.482 | 56.975 | .000 (b) |
| Residual | 3539.657 | 996 | 3.554 | | |
| Total | 4147.104 | 999 | | | |
a. Dependent Variable: experiences_purchased
b. Predictors: (Constant), total_revenue, id, age
Table 4: ANOVA (Experiences Purchase)
Coefficients (a)
| Model 1 | Unstandardized B | Unstandardized Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | 1.415 | .140 | | 10.119 | .000 |
| age | .001 | .001 | .016 | .537 | .591 |
| id | .000 | .000 | -.020 | -.666 | .506 |
| total_revenue | .003 | .000 | .382 | 13.043 | .000 |
a. Dependent Variable: experiences_purchased
Table 5: Model Coefficients (Experiences Purchase)
Residuals Statistics (a)
| Statistic | Minimum | Maximum | Mean | Std. Deviation | N |
|---|---|---|---|---|---|
| Predicted Value | 1.29 | 14.79 | 1.66 | .780 | 1000 |
| Residual | -10.795 | 17.659 | .000 | 1.882 | 1000 |
| Std. Predicted Value | -.475 | 16.839 | .000 | 1.000 | 1000 |
| Std. Residual | -5.726 | 9.367 | .000 | .998 | 1000 |
a. Dependent Variable: experiences_purchased
Table 6: Residual Statistics (Experiences Purchase)

Figure 1: Standardized residual histogram (Experiences Purchase)
Regression analysis between pilot, age and total revenue
Descriptive Statistics
| Variable | Mean | Std. Deviation | N |
|---|---|---|---|
| pilot | .33 | .472 | 1000 |
| age | 65.97 | 54.317 | 1000 |
| total_revenue | 96.20 | 268.240 | 1000 |
Table 7: Descriptive Statistics (Pilot)
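Correlations
| Statistic | Variable | pilot | age | total_revenue |
|---|---|---|---|---|
| Pearson Correlation | pilot | 1.000 | .040 | .080 |
| Pearson Correlation | age | .040 | 1.000 | -.012 |
| Pearson Correlation | total_revenue | .080 | -.012 | 1.000 |
| Sig. (1-tailed) | pilot | . | .105 | .005 |
| Sig. (1-tailed) | age | .105 | . | .355 |
| Sig. (1-tailed) | total_revenue | .005 | .355 | . |
| N | pilot | 1000 | 1000 | 1000 |
| N | age | 1000 | 1000 | 1000 |
| N | total_revenue | 1000 | 1000 | 1000 |
Table 8: Correlations (Pilot)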
Model Summary (b)
| Model | R | R Square | Adjusted R Square | Std. Error of the Estimate |
|---|---|---|---|---|
| 1 | .090 (a) | .008 | .006 | .470 |
a. Predictors: (Constant), total_revenue, age
b. Dependent Variable: pilot
Table 9: Model Summary (Pilot)
ANOVA (a)
| Model 1 | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | 1.806 | 2 | .903 | 4.079 | .017 (b) |
| Residual | 220.638 | 997 | .221 | | |
| Total | 222.444 | 999 | | | |
a. Dependent Variable: pilot
b. Predictors: (Constant), total_revenue, age
Table 10: ANOVA (Pilot)
Coefficients (a)
| Model 1 | Unstandardized B | Unstandardized Std. Error | Standardized Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | .297 | .024 | | 12.346 | .000 |
| age | .000 | .000 | .041 | 1.288 | .198 |
| total_revenue | .000 | .000 | .081 | 2.565 | .010 |
a. Dependent Variable: pilot
Table 11: Model Coefficients (Pilot)
Figure 2: Standardized residual histogram (Pilot)

N is the sample size, that is, the number of data points analyzed, while the mean is the average value.
Standard deviation tells how the measurements for a group are spread out from the average (mean) value. A low standard deviation means that most of the numbers are close to the average. A high standard deviation means that the numbers are more spread out.
The correlation shows how strongly pairs of variables are related. A variable will have the strongest relationship (1.0) with itself.
ANOVA refers to the analysis of variance. It is a statistical technique that is used to check if the means of two or more groups are significantly different from each other.
The coefficients of a linear regression model are the constants c and m_i from the general regression equation:
$y = c + m_1 x_1 + m_2 x_2 + \dots + m_n x_n$
A residual is the difference between the observed value of the dependent variable (y) and the predicted value (ŷ); the residual statistics summarise these differences.
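To make the roles of the constant c, the slope m and the residuals concrete, the following is a minimal sketch of a least-squares fit with a single predictor, i.e. the straight-line form y = c + m·x. The function name `fit_line` and the toy numbers are assumptions made for illustration; they are not the Itineract dataset and this is not the SPSS procedure that produced the tables in this section.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares for y = c + m * x with one predictor."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations and sum of squared deviations of x
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    m = sxy / sxx            # slope: change in y per unit of x
    c = y_bar - m * x_bar    # intercept: predicted y when x is 0
    return c, m


# Hypothetical toy data (assumption): revenue per customer and experiences bought
revenue = [0.0, 50.0, 100.0, 200.0, 400.0]
purchased = [1.0, 1.0, 2.0, 2.0, 3.0]

c, m = fit_line(revenue, purchased)
predicted = [c + m * x for x in revenue]
# Residual = observed value minus predicted value
residuals = [y - y_hat for y, y_hat in zip(purchased, predicted)]

print(f"c = {c:.3f}, m = {m:.4f}")
print("residuals:", [round(r, 3) for r in residuals])
```

With an intercept in the model, the residuals of a least-squares fit sum to zero, which is why the residual mean in a residual statistics table is reported as .000.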
From the analysis, the following is noted:
1. The experiences purchased and the total revenue are widely spread around the sample mean (standard deviations of 2.037 and 268.240 against means of 1.66 and 96.20). Total revenue is of particular interest, and the travel company should invest in market research to find the cause of the large disparity. Studying the higher-revenue records can help diagnose and raise the lower revenue values, increasing total revenue and hence the overall mean.
2. There is a moderate positive correlation (r = .382) between total revenue and experiences purchased, and it is statistically significant. This is a good reason for the company to narrow its market research towards improving the general experience, since the correlation suggests an associated increase in revenue.
3. Since the R-square value describes the share of variance that the model explains, the low R-square value (0.146) is undesirable. It shows that the model explains only about 14.6% of the variance in experiences purchased, so its predictions are not accurate. The company should consider increasing the sample size (N); note that a low R-square points to weak explanatory power rather than over-fitting, so more informative predictor variables may also be needed. If done, model accuracy should improve, making it more reliable.
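As a check on how the summary figures relate to each other, the values in Tables 3 and 4 can be reproduced from the sums of squares. The formulas below are the standard definitions for a regression with n = 1000 observations and k = 3 predictors; the arithmetic uses only numbers reported in those tables.

```latex
\begin{align*}
R^2 &= \frac{SS_{\text{regression}}}{SS_{\text{total}}}
     = \frac{607.447}{4147.104} \approx 0.146 \\[4pt]
R^2_{\text{adj}} &= 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}
     = 1 - 0.8535 \times \frac{999}{996} \approx 0.144 \\[4pt]
F &= \frac{MS_{\text{regression}}}{MS_{\text{residual}}}
   = \frac{202.482}{3.554} \approx 56.97
\end{align*}
```

The small difference from the reported F of 56.975 comes from rounding in the printed mean squares.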
Classification analysis
Case Processing Summary
| Cases | Valid N | Valid Percent | Missing N | Missing Percent | Total N | Total Percent |
|---|---|---|---|---|---|---|
| age * pilot | 1000 | 100.0% | 0 | 0.0% | 1000 | 100.0% |
Table 12: Case Processing Summary (Pilot)
Chi-Square Tests
| Test | Value | df | Asymp. Sig. (2-sided) |
|---|---|---|---|
| Pearson Chi-Square | 173.357 (a) | 177 | .563 |
| Likelihood Ratio | 202.130 | 177 | .095 |
| Linear-by-Linear Association | 1.572 | 1 | .210 |
| N of Valid Cases | 1000 | | |
a. 290 cells (81.5%) have expected count less than 5. The minimum expected count is .33.
Table 13: Chi-Square Tests (Pilot)
Symmetric Measures
| Type | Measure | Value | Asymp. Std. Error (a) | Approx. T (b) | Approx. Sig. |
|---|---|---|---|---|---|
| Interval by Interval | Pearson's R | .040 | .032 | 1.254 | .210 (c) |
| Ordinal by Ordinal | Spearman Correlation | .044 | .032 | 1.401 | .161 (c) |
| N of Valid Cases | | 1000 | | | |
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.
Table 14: Symmetric Measures (Pilot)
Figure 3: Age Bar Chart
Case Processing Summary
| Cases | Valid N | Valid Percent | Missing N | Missing Percent | Total N | Total Percent |
|---|---|---|---|---|---|---|
| total_revenue * experiences_purchased | 1000 | 100.0% | 0 | 0.0% | 1000 | 100.0% |
Table 15: Case Processing Summary (Experiences Purchase)
Chi-Square Tests
| Test | Value | df | Asymp. Sig. (2-sided) |
|---|---|---|---|
| Pearson Chi-Square | 5655.041 (a) | 476 | .000 |
| Likelihood Ratio | 901.403 | 476 | .000 |
| Linear-by-Linear Association | 145.720 | 1 | .000 |
| N of Valid Cases | 1000 | | |
a. 497 cells (95.2%) have expected count less than 5. The minimum expected count is .00.
Table 16: Chi-Square Tests (Experiences Purchase)
Symmetric Measures
| Type | Measure | Value | Asymp. Std. Error (a) | Approx. T (b) | Approx. Sig. |
|---|---|---|---|---|---|
| Interval by Interval | Pearson's R | .382 | .058 | 13.055 | .000 (c) |
| Ordinal by Ordinal | Spearman Correlation | .393 | .028 | 13.502 | .000 (c) |
| N of Valid Cases | | 1000 | | | |
a. Not assuming the null hypothesis.
b. Using the asymptotic standard error assuming the null hypothesis.
c. Based on normal approximation.
Table 17: Symmetric Measures (Experiences Purchase)
Figure 4: Total Revenue Bar Chart
The case processing summary informs the user of any data that the SPSS program could not use
in the analysis, usually due to missing values.
Chi-square (χ2) statistics are tests that measure how expectations compare to actual observed data.
Symmetric measures describe the relationship between two variables, say x and y, without
differentiating if either variable is an antecedent (or independent variable) or a consequent (or
dependent variable).
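The comparison of expected and observed counts can be sketched in a few lines. The example below computes the Pearson chi-square statistic for a small contingency table; the function name `chi_square` and the 2x2 counts are hypothetical and chosen only to keep the arithmetic easy to follow. It assumes every row and column total is positive, and it does not compute a p-value.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table.

    `table` is a list of rows of observed counts. Assumes every row and
    column total is greater than zero, so expected counts are never zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected

    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical counts (assumption): rows = in pilot / not in pilot,
# columns = bought an experience / did not buy
observed = [[30, 20],
            [20, 30]]

stat, dof = chi_square(observed)
print(f"chi-square = {stat:.3f}, df = {dof}")
```

A common rule of thumb is that the chi-square approximation becomes unreliable when many cells have expected counts below 5, which is the situation flagged in the footnotes of Tables 13 and 16.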
Section 4: Ethical and security considerations
Informed consent. Personal customer information such as gender, especially extremely sensitive data such as a bisexual identity, will only be collected when it is provided voluntarily and the customer is fully aware of how such information will be used. As detailed by Fairfield and Shtein (2014), customer autonomy will be respected.
Respect for confidentiality and anonymity. The customer's identity will not be associated with personal responses. Information such as favourite cause of travel and expenditure on such travel shall remain confidential.
Respect for privacy. Customers' private information such as attitudes, beliefs and records will not be subjected to analysis without their knowledge. This includes other details such as marital status, age and income, as suggested by Fairfield and Shtein (2014).
Cyber crime. All the data used in the investigation are to be acquired within the bounds of cyber legislation. No customer data, such as travelling characteristics, will be acquired through means that violate cyber use conventions, as suggested by Doan, Halevy and Ives (2012).
When a large amount of data is to be collected, the data is exposed to more personnel. This can jeopardize both the privacy and the confidentiality of the information as more people access the data. Besides, an increase in data is likely to bring in more sensitive attributes, such as the bisexual proportion of the sample. Cyber security also becomes an issue with a larger sample size because of the increased handling capacity required.
Section 5: Data Science in Next Steps and Potential Solutions:
The key aspects of data science used to help Itineract Travel Co achieve its objective of increasing website visitors into the millions and experience offerings into the thousands, and to simplify matching the right experience to each potential customer while achieving the company's growth objectives, include Pearson correlation, descriptive statistics, regression, EDA and APIs.
| Lifecycle Consideration | Potential Options |
|---|---|
| Objectives defined | Develop company's growth plan |
| Data preparation | Counts are summarised within specific intervals. |
| Data collection techniques to be used | Data mining |
| EDA | EDA is performed in Excel applying data through device network. |
| The analytical modelling process | Applying APIs to assess sales and growth. |
| Communication and use of results | Agreeing on defined objectives with key stakeholders and growth outcomes with regional community (Teo, 2012). |
| Practical deployment of solutions | Customise selling strategies according to preference of customers. |
| How to evaluate success | Percentage increase in sales. |
| Definition of failure | Percentage decline in sales. |
Data Science Concepts and Tools
Describing current customer population characteristics and current preference behaviours. Descriptive statistics such as the mean and variance are used for this task because they offer a concise summary of the data. The same technique remains applicable as the data set grows because it will still achieve the objective.
Evaluating the relationships among customer population factors such as age, pilot and total revenue. The Pearson correlation coefficient is computed to determine both the strength and the direction of the relationship between two particular variables. This technique is used because the relationships in the data are assumed to be linear at this stage. As the data grows, a more flexible technique will be required. This will call for the use of quotient correlation, which is more flexible and measures the anticipated non-linear relationships.
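For readers who want to see what the Pearson coefficient measures, here is a minimal sketch of its computation from first principles. The function name `pearson_r` and the sample values are illustrative assumptions, not figures from the Itineract dataset, and the sketch assumes neither variable is constant.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation: covariance scaled by both standard deviations.

    Returns a value between -1 and 1. Assumes xs and ys have the same
    length and that neither is constant (otherwise the denominator is 0).
    """
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    var_x = sum((x - x_bar) ** 2 for x in xs)
    var_y = sum((y - y_bar) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical values (assumption): customer ages and revenue per customer
ages = [22, 35, 41, 58, 63]
revenue = [40.0, 55.0, 70.0, 95.0, 120.0]

r = pearson_r(ages, revenue)
# A perfectly linear relationship gives a coefficient of 1
perfect = pearson_r([1, 2, 3], [2, 4, 6])

print(f"r = {r:.3f}, perfect = {perfect:.3f}")
```

The sign of the coefficient gives the direction of the relationship and its magnitude gives the strength, but only for linear association; a strong curved relationship can still produce a small value.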
In order to forecast the preference characteristics of a population with unique characteristics, for example when evaluating how customers of a given age bracket respond to a particular travel cause, linear regression is conducted. Given the size of the current data, this technique offers the best relational forecast, as it reflects the dependency of one variable on another. However, as the size of the data grows, linear regression will rely on many assumptions. It will therefore be replaced by more complex forecasting approaches, including predictive algorithms and Big Data.
Because there are plans to scale up the data, more complex analytical techniques such as Big Data and Machine Learning will be necessary to help Itineract Travel Co improve the customer experience of its services. Big Data will be useful in observing and evaluating customer-related trends and patterns such as favourite cause of travelling and purchase experience. These techniques will help simplify the complications that would otherwise result from large volumes of data.
Report Appendix: Statistics and Methodology:
Statistics is a method of mathematical analysis used for particular categories of empirical data and specific studies, providing quantified models, depictions and summaries. Statistics examines techniques for data collection, interpretation, analysis and drawing conclusions. Some common descriptive measures are the mean, median, mode and variance (Spitzer, 2013).
A1. Pre-processing and EDA:
Data preprocessing involves data munging and analysis. In our case, the data provided is clean, without inconsistent or missing values, so the munging process is already taken care of. For the regression case, the numerical variables are used, since regression works best on continuous numerical data. To examine these variables and their relationships, correlation is used.
Data pre-processing also entails transforming data from one form to another. For instance, in the data set given, gender might play a role in determining the best target for the travel company. However, since it is given in text form, it cannot be used directly in the regression model. Transforming the data, for instance using 1 for male and 0 for female, can render it useful in both regression and classification.
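The 1/0 transformation described above can be sketched as follows. The column name `gender`, the record layout and the helper `encode_gender` are assumptions for illustration; the actual spelling of the values in the Itineract data is not known, so anything other than "male" or "female" is mapped to `None` rather than guessed.

```python
# Mapping taken from the text: 1 for male, 0 for female
GENDER_CODES = {"male": 1, "female": 0}


def encode_gender(value):
    """Return 1 for male, 0 for female, and None for any other value."""
    return GENDER_CODES.get(value.strip().lower())


# Hypothetical records (assumption): the real column names may differ
records = [
    {"id": 1, "gender": "Male"},
    {"id": 2, "gender": "female"},
    {"id": 3, "gender": "Other"},
]

# Add a numeric column alongside the original text column
encoded = [{**row, "gender_code": encode_gender(row["gender"])} for row in records]

for row in encoded:
    print(row)
```

Keeping unmapped values as `None` makes them visible during data checks, so a decision about how to treat them can be made explicitly before modelling.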
Whether or not a mathematical model can be used, EDA should look specifically at what the data can offer beyond the structured modelling or testing task. EDA refers to the critical process of preliminary data analysis in order to establish patterns, detect anomalies, test hypotheses and check assumptions using descriptive statistics and graphical representations (Thiess and Müller, 2018). Itineract Travel Co started by processing data in order to make analysis possible, as shown below:
EDA was mainly carried out through pivot tables, summing up data and using averages or cumulative counts (Gitlin, Hayes and Weinstein, 2012). These are plotted as charts, and related charts are provided in Section 3.
A2. Statistical Distribution Investigation:
Descriptive statistics are useful for investigating the statistical distribution of continuous variables and are used in the project together with probability distributions. A probability distribution is a statistical function that supplies the probabilities of the different outcomes of an experiment. Probability distributions are used to describe various kinds of random variables and to make decisions on the basis of such models. Random variables are of two kinds: discrete and continuous. Depending on which group a random variable falls into, a statistician chooses the equation associated with that type of random variable to compute the mean, median, variance, likelihood or other statistics. A discrete distribution is used to model a discrete random variable and to show the probabilities of a random variable with countable outcomes. In this scenario, for total revenue, the Poisson distribution provides a means of depicting the non-uniformity of the flow of results when counting random events. The Poisson distribution is a discrete function, which means that occurrences can only be counted in whole numbers. Fractional events are not included in the model (Prevos, 2019). The data is also compared with the Poisson distribution (applying the POISSON.DIST function), taking the population mean to be either the sample mean or the sample median. Here the data is tested against this distribution by comparing counts with their frequencies.
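As an illustration of what a Poisson comparison involves, the sketch below evaluates the Poisson probability mass function, P(X = k) = e^(-λ) λ^k / k!, which is the non-cumulative quantity that a function such as POISSON.DIST returns. The choice of λ = 1.66 (the mean of experiences_purchased from Table 1) is an assumption made purely for the example; the report itself applies the comparison to total revenue.

```python
from math import exp, factorial


def poisson_pmf(k, lam):
    """Probability of exactly k events when the expected count is lam."""
    return exp(-lam) * lam ** k / factorial(k)


# Illustrative rate (assumption): mean of experiences_purchased in Table 1
lam = 1.66

# Probabilities of 0 to 5 events
probs = [poisson_pmf(k, lam) for k in range(6)]

for k, p in enumerate(probs):
    print(f"P(X = {k}) = {p:.4f}")
```

A Poisson variable has variance equal to its mean. Table 1 reports a standard deviation of 2.037 for experiences_purchased, a variance of about 4.15 against a mean of 1.66, so a Poisson fit to that variable would be approximate at best.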
When the median is taken as the population average, the results follow a shape similar to the Poisson distribution, whereas the fit is distorted to the left when the sample mean is used. One explanation for why the shapes differ when the sample mean is used is that age, gender and favourite cause do not meet the Poisson requirement that "variables are independent". This is because total sales are linked with age, gender and favourite cause.
A3. Bootstrapping
According to the central limit theorem, regardless of whether bootstrapping is used, the "distribution of mean of random sample... is essentially normal...regardless of how population is distributed." The statistic on the right is based on samples that have been bootstrapped and shows how the re-sampled means are spread more tightly than the values in our results. A 95% confidence interval is established, and a test is made to check whether the intervals for total sales overlap. The term "bootstrapping" is related to the phrase "boot up", which refers to starting a device's operating system. The hypothesis is that mean total revenue varies among favourite choices. The entire data set was therefore split by the favourite choice type of each customer (Dominiczak and Khansa, 2018).
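The bootstrap procedure described in this subsection can be sketched as a percentile confidence interval for a mean. The function name `bootstrap_ci`, the number of resamples, the fixed seed and the revenue values are all assumptions for illustration; the report's own resampling was carried out in a spreadsheet and may differ in detail.

```python
import random
from statistics import mean


def bootstrap_ci(sample, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    n = len(sample)
    # Resample with replacement, same size as the original sample
    means = sorted(mean(rng.choices(sample, k=n)) for _ in range(n_resamples))
    lower_idx = round((1 - level) / 2 * n_resamples)
    upper_idx = round((1 + level) / 2 * n_resamples) - 1
    return means[lower_idx], means[upper_idx]


# Hypothetical revenue values for one favourite-cause group (assumption)
revenue = [0, 0, 12, 25, 40, 60, 85, 110, 150, 500]

low, high = bootstrap_ci(revenue)
print(f"95% CI for the mean: ({low:.1f}, {high:.1f})")
```

Running the same function on each favourite-cause group and checking whether the resulting intervals overlap gives a rough indication of whether the group means differ; non-overlapping intervals suggest a difference, while overlapping intervals are inconclusive.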
A4. Sampling Error and Bias:
A sampling error arises when an analyst takes a random sample rather than observing every individual member of the population. Sampling error is the statistical error that occurs when the researcher does not choose a sample representing the entire population, so that the survey findings do not reflect the results for the whole population (Brous, Janssen and Vilminko-Heikkinen, 2016). Sampling is a process that selects multiple items from the wider population; both sampling errors and non-sampling errors can occur in the selection. A sampling bias is a difference between the sample value and the actual population value due to the fact that the sample does not represent, or only partly represents, the population. Even random samples may have sampling error, since a sample acts only as an estimate of the population. Sampling error arises because researchers select different subjects from the same group, yet individual subjects vary. Note that any sample is just a subset of the entire population, so sample values will vary accordingly. Systematic bias is the most frequent consequence of sampling error, in which the survey outcomes differ greatly from those of the population as a whole. Logically, if the survey does not reflect the population as a whole, it is most likely that its outcomes will differ from those of the entire population. Given two similar studies with equivalent sampling procedures and the same population, the study with the larger sample has less sampling error than the study with the smaller sample. As the sample size increases, the sample approaches the population as a whole and thereby captures more of the population's features, reducing sampling error (Evergreen and Metzner, 2013).
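The effect of sample size on sampling error can be quantified with the standard error of the mean. The worked example below uses the standard deviation of total_revenue from Table 1 (s = 268.240, n = 1000) and assumes simple random sampling; the larger sample of 4000 is hypothetical and assumes the same spread.

```latex
\begin{align*}
SE_{\bar{x}} &= \frac{s}{\sqrt{n}} \\[4pt]
n = 1000:\quad SE_{\bar{x}} &= \frac{268.240}{\sqrt{1000}} \approx 8.48 \\[4pt]
n = 4000:\quad SE_{\bar{x}} &= \frac{268.240}{\sqrt{4000}} \approx 4.24
\end{align*}
```

Quadrupling the sample size halves the standard error, so the gain from each additional observation shrinks as the sample grows. A larger sample reduces random sampling error but does not by itself remove systematic bias.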
REFERENCES
Books and Journals:
Provost, F. and Fawcett, T., 2013. Data Science for Business: What you need to know about data mining and data-analytic thinking. O'Reilly Media, Inc.
Provost, F. and Fawcett, T., 2013. Data science and its relationship to big data and data-driven decision making. Big Data. 1(1). pp. 51-59.
Larson, D. and Chang, V., 2016. A review and future direction of agile, business intelligence, analytics and data science. International Journal of Information Management. 36(5). pp. 700-710.
Van Der Aalst, W., 2016. Data science in action. In Process mining (pp. 3-23). Springer, Berlin, Heidelberg.
Doan, A., Halevy, A. and Ives, Z., 2012. Principles of data integration. Elsevier.
Fairfield, J. and Shtein, H., 2014. Big data, big problems: Emerging issues in the ethics of data science and journalism. Journal of Mass Media Ethics. 29(1). pp. 38-51.
Teo, B. K., 2012. EXAFS: basic principles and data analysis (Vol. 9). Springer Science & Business Media.
Brous, P., Janssen, M. and Vilminko-Heikkinen, R., 2016, September. Coordinating decision-making in data management activities: a systematic review of data governance principles. In International Conference on Electronic Government (pp. 115-125). Springer, Cham.
Evergreen, S. and Metzner, C., 2013. Design principles for data visualization in evaluation. New Directions for Evaluation. 2013(140). pp. 5-20.
Gitlin, R. D., Hayes, J. F. and Weinstein, S. B., 2012. Data communications principles. Springer Science & Business Media.
Lakowicz, J. R. ed., 2013. Principles of fluorescence spectroscopy. Springer Science & Business Media.
Copeland, L. O. and McDonald, M. F., 2012. Principles of seed science and technology. Springer Science & Business Media.
Spitzer, F., 2013. Principles of random walk (Vol. 34). Springer Science & Business Media.
Thiess, T. and Müller, O., 2018. Towards Design Principles for Data-Driven Decision Making – An Action Design Research Project in the Maritime Industry.
Prevos, P., 2019. Principles of Strategic Data Science: Creating value from data, big and small. Packt Publishing Ltd.
Dominiczak, J. and Khansa, L., 2018. Principles of automation for patient safety in intensive care: learning from aviation. The Joint Commission Journal on Quality and Patient Safety. 44(6). pp. 366-371.
Online
Data science. 2020. [Online] Available Through: <https://datascience.berkeley.edu/about/what-is-data-science/>