Data Science Project: Exploratory Data Analysis and Linear Regression


Added on 2025/06/23

Business Intelligence
Table of Contents
Task 1 Data Quality and Data Warehousing
Task 1.1
Task 1.2
Task 1.3
Task 2 Exploratory Data Analysis and Linear Regression Analysis
Task 2.1 Exploratory data analysis (EDA)
Task 2.2 Linear Regression model
Task 3 Tableau Desktop View of Weather Traffic Volume
Task 3.1
Task 3.2
References
List of Figures
Figure 1: importing data set
Figure 2: the EDA process
Figure 3: selected data
Figure 4: statistics of clouds
Figure 5: statistics of holiday
Figure 6: statistics of snow_1h
Figure 7: statistics of temp
Figure 8: statistics of traffic_volume
Figure 9: traffic_volume vs holiday
Figure 10: temp vs clouds_all
Figure 11: traffic_volume
Figure 12: Linear Regression model
Figure 13: data split in the ratio
Figure 14: results
Figure 15: Linear Regression
Task 1 Data Quality and Data Warehousing
Task 1.1
Data quality is an assessment of the fitness of data to serve its purpose in a given context, and it applies to both qualitative and quantitative values. High-quality data can be used for planning, decision making, and operations. Data is said to be of high quality when it properly represents the real-world construct it describes; assessing it means examining the reliability, efficiency, and fitness of the data for its application. In any organization, data quality is very important for operational and transactional processes, because a dataset's utility depends on how easily it can be processed and analysed for other uses by data warehouse, database, or data analytics systems. Data is of high quality when it is unambiguous and consistent; if it is not, it goes through a data cleansing process so that its quality can be raised. The activities that data quality management involves are data validation and rationalization. The main benefit is that it provides high-quality data the organization can use to make better decisions, and it is necessary for business intelligence and data analytics efforts and for good operational efficiency. Maintaining data quality requires periodic monitoring and cleaning of the data. The key components of data quality are:
Credibility: the extent to which the data is considered true and believable. It can
differ from source to source.
Completeness: the stored data should be 100% complete. It defines the level at which the
attributes of the data are supplied.
Timeliness: the extent to which the data is up to date. It is
affected by the process by which the data is collected.
Accuracy: the degree to which the data represents or measures the state of the real
world. It can be calculated using automated methods.
Integrity: checks for referential validity and the joining of data sets. Data is said
to be valid if its syntax is correct.
Consistency: measures the difference between two representations being compared. It
mainly assesses whether the facts in different datasets match.
Uniqueness: the data is not recorded more than once, i.e. it does not contain duplicates.
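Several of these components can be measured directly. The sketch below checks completeness, uniqueness, and accuracy on a small invented table with pandas; the column names, value ranges, and sample values are illustrative assumptions, not part of the assignment's data set.

```python
import pandas as pd

# Invented sample readings; sensor_id and temp are hypothetical columns.
df = pd.DataFrame({
    "sensor_id": [1, 2, 2, 3, 4],
    "temp":      [281.3, 295.0, 295.0, None, 999.0],  # Kelvin; 999.0 is implausible
})

# Completeness: share of non-missing values per column.
completeness = df.notna().mean()

# Uniqueness: count of fully duplicated records.
duplicate_rows = int(df.duplicated().sum())

# Accuracy: values must fall inside a plausible physical range.
valid_temp = df["temp"].between(180, 340)
invalid_count = int((~valid_temp & df["temp"].notna()).sum())

print(completeness["temp"])  # 0.8 — one of five readings is missing
print(duplicate_rows)        # 1
print(invalid_count)         # 1 — the 999.0 K reading fails the range check
```

Each score can then be tracked over time, which is the periodic monitoring the text above describes.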
Task 1.2
In the data warehouse, the data quality is very essential more than any operational system. The
decisions that are strategic are based on the data warehouse information which is far-reaching in
consequences and scope. The reasons for which the data quality is critical for the data warehouse
are:
It improves productivity by streamlining processes.
It enables better customer service.
It boosts confidence in decision making.
It enhances strategic decision making.
It increases the opportunity to add value to services.
It reduces costs, mainly in marketing campaigns.
It reduces the risk of disastrous decisions.
It avoids the effects of data contamination.
It helps guarantee and prioritize the effective utilization of resources.
It offers information with which services can be handled effectively.
It provides well-timed and precise information for managing services and accountability.
It helps enhance business performance.
Operational costs can be reduced by analysing each process.
It helps the sales team identify high-quality leads and good sales targets.
Customers and the support team can access trustworthy contact records and the complete
interaction history.
Task 1.3
Maintaining quality data is a challenging task in a data warehouse because it incorporates new
features. The following challenges are:
Different data sources bring many complex data and data types, which increases the
difficulty of data integration. The scope of data available for analysis has outgrown
organizations' own business systems: datasets come from sources such as the mobile
internet, the Internet of Things, observational and scientific experimental data, and
data gathered from various industries. The collected data is of different types
(structured, semi-structured, and unstructured), which increases inconsistency and conflict.
There is a huge volume of data, and judging its quality within a limited amount of
time becomes difficult. The amount of information keeps growing, and applying
operations such as cleaning, integrating, and collecting to it in order to obtain
high-quality data becomes quite difficult. This is largely the result of unstructured
data, which takes an ample amount of time to transform into structured form.
Data timeliness is very short, so the data changes very fast. If the industry takes a
long time to gather and process it, the information becomes invalid, producing
meaningless results, and the analysis and processing lead to conflicts.
Hence, analysing the data in real time is another big challenge.
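The transformation of semi-structured data into structured form mentioned above can be illustrated with a small sketch; the record layout and field names here are invented for the example, not taken from the data set.

```python
import pandas as pd

# Semi-structured records, e.g. as delivered by an IoT feed (fields invented).
records = [
    {"station": "A1", "reading": {"temp": 281.2, "snow_1h": 0.0}},
    {"station": "B7", "reading": {"temp": 279.8, "snow_1h": 0.3}},
]

# json_normalize flattens the nested fields into ordinary columns, producing
# the structured table that later cleaning and integration steps require.
flat = pd.json_normalize(records)
print(list(flat.columns))  # ['station', 'reading.temp', 'reading.snow_1h']
```

At scale this flattening step is exactly where the time cost described above is paid.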
Task 2 Exploratory Data Analysis and Linear Regression Analysis
Task 2.1 Exploratory data analysis (EDA)
EDA is an approach to data analysis that employs techniques to maximize insight into a
data set, detect anomalies and outliers, uncover underlying structure, test underlying
assumptions, extract important variables, determine optimal factor settings, and develop
parsimonious models. It is a precise approach, not merely an attitude or philosophy about
how data analysis is carried out. To run the EDA process on the data set, the first step
is to import the data into RapidMiner.
Figure 1: importing data set
The EDA process uses two operators: the first is Retrieve
29_weather_traffic_volume and the second is Select Attributes, which
allows the attributes of interest to be chosen.
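The same retrieve-and-select step can be sketched outside RapidMiner, for example with pandas; the two sample rows below are invented stand-ins for the 29_weather_traffic_volume data set.

```python
import io
import pandas as pd

# Tiny invented stand-in for the 29_weather_traffic_volume data set.
csv = io.StringIO(
    "holiday,temp,snow_1h,clouds_all,traffic_volume,weather_main\n"
    "None,288.28,0,40,5545,Clouds\n"
    "None,289.36,0,75,4516,Clouds\n"
)
full = pd.read_csv(csv)  # the "Retrieve" step: load the full data set

# The "Select Attributes" step: keep only the five columns used in the EDA.
selected = full[["clouds_all", "holiday", "snow_1h", "temp", "traffic_volume"]]
print(list(selected.columns))
```

As in the RapidMiner process, the output of the retrieve step feeds directly into the selection step.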
Figure 2: the EDA process
The five selected attributes are clouds_all, holiday, snow_1h, temp, and traffic_volume. The
operators are connected through their ports, so the output of one becomes the input of the other.
Retrieve 29_weather_traffic_volume supplies the full data set, and Select Attributes
passes on only the selected attributes. The selected data is as follows:
Figure 3: selected data
The results of the EDA process are as follows:
Figure 4: statistics of clouds
The values are:
Minimum value: 0
Maximum value: 100
Average value: 49.362
Missing value: 0
Standard deviation: 39.016
Figure 5: statistics of holiday
The values are:
Least: Washington’s Birthday (5)
Most: none (48143)
Missing value: 0
Figure 6: statistics of snow_1h
The values are:
Minimum value: 0
Maximum value: 0
Average value: 0
Missing value: 0
Standard deviation: 0
Figure 7: statistics of temp
The values are:
Minimum value: 0
Maximum value: 310.070
Average value: 281.206
Missing value: 0
Standard deviation: 13.338
Figure 8: statistics of traffic_volume
The values are:
Minimum value: 0
Maximum value: 7280
Average value: 3529.818
Missing value: 0
Standard deviation: 1986.861
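The summary statistics that RapidMiner reports for each attribute can also be computed directly; the sketch below uses a few invented traffic_volume values (the real data set's figures, e.g. the maximum of 7280 and mean of 3529.818, come from all 48,000-odd rows).

```python
import pandas as pd

# Invented sample values standing in for the traffic_volume column.
df = pd.DataFrame({
    "temp":           [280.1, 282.4, 281.0, 283.3],
    "traffic_volume": [4500, 300, 7200, 3100],
})

# Minimum, maximum, mean, and standard deviation of one attribute,
# plus the count of missing values, mirroring the RapidMiner statistics view.
stats = df["traffic_volume"].agg(["min", "max", "mean", "std"])
missing = int(df["traffic_volume"].isna().sum())
print(stats)
print("missing:", missing)
```

Running the same aggregation over every selected attribute reproduces the tables shown in Figures 4 to 8.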
The statistics are shown below in the form of graphs:
Figure 9: traffic_volume vs holiday
Figure 10: temp vs clouds_all