Comprehensive Data Analysis Project: CardioGood Fitness Data in R
Added on 2022/01/21
AI Summary
This project analyzes the CardioGood Fitness dataset using R, aiming to generate insights into customer profiles and product preferences. The analysis begins with data import, environment setup, and variable identification. It then proceeds with univariate and bivariate analysis, utilizing graphical representations such as histograms, boxplots, and stacked column charts to visualize data distributions and relationships between variables. The project explores categorical and numerical variables, examining customer demographics (age, gender, education, marital status, income) and their relationship with product choices (TM195, TM498, TM798), usage patterns, and fitness levels. The findings highlight key customer segments, product popularity, and potential areas for further investigation, such as the correlation between product features and customer characteristics. The analysis concludes with insights derived from the data, offering a comprehensive understanding of the CardioGood Fitness customer base and their interactions with the treadmill products.

Project in R - Cardio Good Fitness
Data Analysis

Table of Contents
1 Project Objective
2 Assumptions
3 Exploratory Data Analysis – Step by step approach
3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
3.1.2 Set up working Directory
3.1.3 Import and Read the Dataset
3.2 Variable Identification
3.2.1 Variable Identification – Inferences
3.3 Univariate Analysis
3.4 Bi-Variate Analysis
3.5 Missing Value Identification
3.6 Outlier Identification
3.7 Variable Transformation / Feature Creation
4 Conclusion
5 Appendix A – Source Code

1. Project Objective
The objective of the report is to explore the cardio data set (“CardioGoodFitness”) in R and generate
insights about the data set. This exploration report will consist of the following:
- Importing the dataset in R
- Understanding the structure of dataset
- Graphical exploration
- Descriptive statistics
- Insights from the dataset
2. Assumptions
The purpose of the dataset is to identify the profile of the typical customer for each treadmill
product offered by CardioGood Fitness, and to investigate whether customer characteristics differ
across the product lines. To that end, data were collected on individuals who purchased a treadmill
at a CardioGood Fitness retail store during the prior three months. The data are stored in the
CardioGoodFitness.csv file.
The following customer variables are studied: product purchased (TM195, TM498, or TM798); gender;
age (in years); education (in years); relationship status (single or partnered); annual household
income; average number of times the customer plans to use the treadmill each week; average number
of miles the customer expects to walk/run each week; and self-rated fitness on a 1-to-5 scale,
where 1 is poor shape and 5 is excellent shape.
The 180 observations in the dataset are assumed to relate to 180 unique customers of the treadmill
products. The characteristics in the dataset are linked to the fitness level and treadmill usage of
the customers. It is also assumed that the data provided are accurate as per the survey/data
collected by the company. The data dictionary below is considered for the 9 variables in the dataset:
Sl. No.  Dimension       Detail Description
1        Product         Model of treadmill product (TM195 / TM498 / TM798)
2        Age             Age of the customer (Years)
3        Gender          Gender of the customer (Male / Female)
4        Education       Education of the customer (Years)
5        Marital Status  Marital status of the customer (Single / Partnered or Married)
6        Usage           Weekly average number of times the customer plans to use the treadmill
                         (No. of times per Week)
7        Fitness         Self-rated fitness of the customer on a 1-to-5 scale,
                         5 being “very fit” and 1 being “very unfit”
8        Income          Annual income of the customer (assumed to be in US$)
9        Miles           Weekly average number of miles the customer expects to walk/run on the
                         treadmill (Miles per Week)

3. Exploratory Data Analysis – Step by step approach
A typical data exploration activity consists of the following steps:
1. Environment Set up and Data Import
2. Variable Identification
3. Univariate Analysis
4. Bi-Variate Analysis
5. Missing Value Treatment (Not in scope for our project)
6. Outlier Treatment (Not in scope for our project)
7. Variable Transformation / Feature Creation
8. Feature Exploration
We shall follow these steps in exploring the provided dataset.
Although Steps 5 and 6 are not in scope for this project, a brief description of these steps (and of
the other steps) is given, as they are important parts of the data exploration journey.
3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
Use this section to install the necessary packages and load the associated libraries. Keeping all
the package calls in one place improves code readability.
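As a rough illustration of this step, the sketch below installs a package only when it is missing and then loads it. The report does not name the packages it uses, so the package list here is a hypothetical placeholder made up of packages that ship with R.

```r
# Hypothetical package list: the report does not name its packages,
# so only packages that ship with R are listed here.
required_packages <- c("stats", "graphics", "utils")

for (pkg in required_packages) {
  # Install only when the package is not already available
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept a name held in a variable
  library(pkg, character.only = TRUE)
}
```

Replacing the placeholder vector with the packages actually needed keeps every install and load call in one place at the top of the script.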
3.1.2 Set up working Directory
Setting a working directory at the start of the R session makes importing and exporting data files
and code files easier. The working directory is the location/folder on the PC that holds the data,
code, etc. related to the project.
Please refer to Appendix A for the source code.
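A minimal sketch of setting and checking the working directory is given here. The folder is an assumption: tempdir() is used only so that the sketch runs on any machine, and in practice it would be the project folder that holds the data file.

```r
# Hypothetical project folder; tempdir() is used so the sketch runs anywhere.
# In practice this would be the folder that holds CardioGoodFitness.csv.
project_dir <- tempdir()

old_dir <- setwd(project_dir)  # setwd() returns the previous directory
current_dir <- getwd()         # confirm where R now looks for files
print(current_dir)

setwd(old_dir)                 # restore the original directory when done
```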
3.1.3 Import and Read the Dataset
The given dataset is in .csv format. Hence, the command ‘read.csv’ is used for importing the file.
For example: Cardio <- read.csv("CardioGoodFitness.csv")
Please refer to Appendix A for the source code.
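The import step can be sketched as follows. Because the real CardioGoodFitness.csv is not bundled with this text, the sketch writes three toy rows (hypothetical values) to a temporary file first; the column names are assumed from the data dictionary and may differ from the actual file.

```r
# Toy rows (hypothetical values) written to a temporary file so the sketch
# runs without the real CardioGoodFitness.csv. Column names are assumed
# from the data dictionary.
toy_csv <- tempfile(fileext = ".csv")
writeLines(c(
  "Product,Age,Gender,Education,MaritalStatus,Usage,Fitness,Income,Miles",
  "TM195,20,Male,14,Single,3,4,30000,100",
  "TM498,25,Female,16,Partnered,3,3,45000,85",
  "TM798,30,Male,18,Partnered,5,5,90000,180"
), toy_csv)

# stringsAsFactors = TRUE reads the text columns as factors
Cardio <- read.csv(toy_csv, stringsAsFactors = TRUE)
head(Cardio)
```

Since R 4.0 read.csv() keeps text columns as character by default, so stringsAsFactors = TRUE is what produces “Factor” columns in the structure output.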
3.2 Variable Identification
The following R functions are used during the analysis:
- dim(): See the dimensions (number of rows / number of columns) of the data frame.
- names(): See the feature names of the dataset.
- str(): Display the internal structure of an R object, to identify the classes of the features.
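The three functions listed above can be tried on a small stand-in data frame. The values below are hypothetical and only a few of the nine columns are included.

```r
# Small stand-in data frame with hypothetical values
Cardio <- data.frame(
  Product = factor(c("TM195", "TM498", "TM798", "TM195")),
  Age     = c(20L, 25L, 30L, 22L),
  Gender  = factor(c("Male", "Female", "Male", "Female")),
  Usage   = c(3L, 3L, 5L, 2L)
)

dims <- dim(Cardio)              # number of rows, number of columns
cols <- names(Cardio)            # feature names
str(Cardio)                      # class of the object and of each column
classes <- sapply(Cardio, class) # the same class information as a vector
print(classes)
```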
3.2.1 Variable Identification – Inferences
a. summary(<data frame>): Provides a summary of the dataset.
str(Cardio): Provides the structure of the object “Cardio”. It reports the object as a data.frame
whose variables are of the “Factor” and “int” data types.
summary(Cardio): Gives a summary of all 9 variables: the frequency of each level for the factor
variables, and the minimum, quartiles, mean and maximum for the integer variables.

b. colSums(is.na(<data frame>)): Counts the missing values in each column. There are no missing
values in the dataset.
Measure                      Count
No. of Observations          180
No. of Variables             9
No. of Females               76
No. of Males                 104
Marital Status (Partnered)   107
Marital Status (Single)      73
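To show what the missing-value check returns, the sketch below runs it on a toy data frame. The toy values are hypothetical, and two NA entries are planted on purpose; the actual dataset is reported to have none.

```r
# Toy data frame with two NA values planted on purpose (hypothetical values)
toy <- data.frame(
  Age    = c(20L, NA, 30L),
  Income = c(30000, 45000, NA),
  Gender = factor(c("Male", "Female", "Male"))
)

# is.na() gives a TRUE/FALSE matrix; colSums() counts the TRUEs per column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# anyNA() is a quick yes/no check for the whole data frame
any_missing <- anyNA(toy)
```

A result of zero for every column is what corresponds to the "no missing values" finding stated above.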

3.3 Univariate Analysis
Univariate analysis can be done on both the categorical and the numeric variables.
The categorical variables are Product, Gender and Marital Status.
Bar plots and pie charts are used to represent the distribution of each of the three categorical
variables.

Observations from Categorical Variables Analysis:
(Graph 1) - Product Sales: The count for each product type (i.e. treadmill model) is provided. TM195
has the highest count (80), followed by TM498 (60) and TM798 (40). This suggests that TM195 is the
most popular product among customers. One possible reason is that it is more affordable; this is
examined later.
(Graph 2) - Treadmill Usage by Gender: The graph shows that males (104) outnumber females (76) among
the customers. This may suggest that men are more health conscious, although the counts alone cannot
confirm it.
(Graph 3) - Treadmill Usage by Marital Status: There are more partnered customers (107) than single
customers (73).
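The counts reported in these observations can be plotted directly, as in the sketch below. The numbers are the ones quoted above; the chart titles and axis labels are illustrative choices, not necessarily those used in the original graphs.

```r
# Counts quoted in the observations above
product_counts <- c(TM195 = 80, TM498 = 60, TM798 = 40)
gender_counts  <- c(Female = 76, Male = 104)
marital_counts <- c(Partnered = 107, Single = 73)

# Bar plot of customers per product
barplot(product_counts, main = "Customers by Product",
        xlab = "Product", ylab = "No. of customers")

# Pie chart of gender split, labelled with percentages
gender_pct <- round(100 * gender_counts / sum(gender_counts), 1)
pie(gender_counts,
    labels = paste0(names(gender_counts), " (", gender_pct, "%)"),
    main = "Customers by Gender")

# Share of each product as a percentage of all customers
product_share <- round(100 * prop.table(product_counts), 1)
print(product_share)
```

With the full data frame loaded, table(Cardio$Product) would produce the same counts without typing them in by hand.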
Univariate analysis can also be done on the integer (numeric) variables.
The numeric variables are Age, Education, Usage, Fitness, Income and Miles.
Histograms and boxplots are used to represent the data and to identify outliers.
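A histogram and a boxplot for one numeric variable might be produced as sketched below. The ages are hypothetical values chosen for illustration, not the CardioGoodFitness data.

```r
# Hypothetical ages, not taken from the CardioGoodFitness data
age <- c(18, 21, 22, 23, 23, 24, 25, 26, 28, 30, 33, 35, 50)

# Histogram with 5-year bins
hist(age, breaks = seq(15, 50, by = 5),
     main = "Histogram of Age", xlab = "Age (years)")

# Horizontal boxplot; points drawn beyond the whiskers are outliers
boxplot(age, horizontal = TRUE, main = "Boxplot of Age", xlab = "Age (years)")

# boxplot.stats() lists the points lying more than 1.5 * IQR beyond the hinges
outliers <- boxplot.stats(age)$out
print(outliers)
```

The same three calls can be repeated for each of the six numeric columns once the dataset is loaded.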

Observations from above Numeric Variables Analysis:
(Graph 4 and 7) - Treadmill Usage by Age:
From the Boxplot, we can identify that the minimum age of the customers is 18 years and the maximum
age is 50 years, which is also an outlier. From the Histogram, customers aged 20-25 form the largest
group of treadmill users, with a count of 69 customers.
(Graph 5 and 8) - Treadmill Usage by Education Level:
From the Boxplot, we can identify that the minimum education level of the customers is 12 years
while the maximum is 21 years. Customers with about 15 years of education, the average level, are
the ones who mostly use the treadmill.
(Graph 6 and 9) - Treadmill Usage:
From the Boxplot, we identify that customers use the treadmill a minimum of 2 times and a maximum of
7 times per week. The very few customers who exercise more than 6 times a week can be termed
outliers in this dataset.

Observations from above Numeric Variables Analysis:
(Graph 10 and 13) - Treadmill Usage by Fitness Level:
The minimum self-rated fitness level of the customers is 1 and the maximum is 5. The average
fitness level across all the customers is 3.3. The Boxplot indicates an outlier on the lower side,
at fitness level 1.
(Graph 11 and 14) - Income Level:
From the Boxplot we can identify that the minimum income of the customers is $29,562, while the
maximum is $104,581. There are some outliers with very high income levels, above $70,000. The
average income is $53,720, and customers around this level form the bulk of the treadmill buyers.
(Graph 12 and 15) - Miles Exercised:
The minimum number of miles the customers expect to walk/run each week is 21 miles and the maximum
is 360 miles. The average across all the customers is 103.2 miles. The Boxplot indicates outliers
beyond 170 miles.
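The minimum, maximum, mean and outlier limits quoted in observations like these can be computed rather than read off the graphs. The sketch below uses hypothetical income values and the common 1.5 × IQR rule; quantile-based limits can differ slightly from the hinges that boxplot() draws.

```r
# Hypothetical incomes, not taken from the CardioGoodFitness data
income <- c(29500, 35000, 40000, 45000, 50000,
            52000, 55000, 60000, 65000, 104000)

# Minimum, quartiles, mean and maximum in one call
income_summary <- summary(income)
print(income_summary)

# Outlier limits from the 1.5 * IQR rule
q <- quantile(income, c(0.25, 0.75))
iqr <- q[[2]] - q[[1]]
lower_fence <- q[[1]] - 1.5 * iqr
upper_fence <- q[[2]] + 1.5 * iqr

# Values outside the limits are flagged as outliers
flagged <- income[income < lower_fence | income > upper_fence]
print(flagged)
```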
3.4 Bi-Variate Analysis
Bi-variate analysis is used to represent the relationship between two variables and helps users to
identify how changes in one variable affect the other variable. The following combinations are
possible:
1. Categorical Vs Categorical
2. Categorical Vs Numerical
3. Numerical Vs Numerical
Categorical Vs Categorical: To identify the relationship between two categorical variables, we can
use a Stacked Column Chart.
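A stacked column chart for two categorical variables can be sketched with table() and barplot(). The eight rows below are hypothetical and are not meant to reproduce the actual product and gender counts.

```r
# Hypothetical product/gender pairs, not the CardioGoodFitness data
toy <- data.frame(
  Product = c("TM195", "TM195", "TM498", "TM498",
              "TM798", "TM798", "TM195", "TM798"),
  Gender  = c("Male", "Female", "Male", "Female",
              "Male", "Male", "Female", "Male")
)

# Cross-tabulation: rows are genders, columns are products
counts <- table(toy$Gender, toy$Product)
print(counts)

# barplot() stacks the rows of a matrix by default (beside = FALSE)
barplot(counts, legend.text = TRUE, main = "Product Purchased by Gender",
        xlab = "Product", ylab = "No. of customers")

# Share of each gender within a product (each column sums to 1)
col_share <- prop.table(counts, margin = 2)
```

Passing beside = TRUE to barplot() would draw grouped bars instead of stacked ones, which can make within-product comparisons easier to read.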

(Graph 16) - Product Purchased by Gender Observations:
Product TM195 is the most preferred product and is equally preferred by males and females, followed
by TM498 and TM798. However, males like all three products about equally, while females like TM195
the most and TM798 the least. This raises the assumption that the model/color of TM195 might be more
unisex; since this information is not available, it is left as scope for further analysis.
(Graph 17) - Product Purchased by Marital Status and Gender Observations:
We observe that both partnered and single customers prefer product TM195. However, partnered
males have an equal liking for all the product variants.

Categorical Vs Numerical:
To interpret the relation between a categorical and a numerical variable, we can draw box plots of
the numerical variable for each level of the categorical variable.
(Graph 18) - Product Vs Age: TM195 is popular among all age groups.
(Graph 19) - Product Vs Education: TM798 is more popular among people with a higher education
level.
(Graph 20) - Product Vs Usage: Usage of TM798 is higher than that of the other products.
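Box plots of this kind can be drawn with the formula interface of boxplot(), as sketched below. The usage values are hypothetical and were chosen only so that the groups differ visibly.

```r
# Hypothetical weekly usage per product, not the CardioGoodFitness data
toy <- data.frame(
  Product = factor(c("TM195", "TM195", "TM195",
                     "TM498", "TM498", "TM498",
                     "TM798", "TM798", "TM798")),
  Usage   = c(3, 3, 4, 3, 3, 2, 5, 4, 6)
)

# One box per product: numeric variable ~ categorical variable
boxplot(Usage ~ Product, data = toy, main = "Product Vs Usage",
        xlab = "Product", ylab = "Times per week")

# Group means give a numeric counterpart to the boxes
mean_usage <- tapply(toy$Usage, toy$Product, mean)
print(round(mean_usage, 2))
```

Swapping Usage for Age or Education in the formula gives the other product comparisons described above.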

