Data Import and Read the Dataset in R - Cardio Good Fitness Data Analysis Project
Added on 2022-01-21
Project in R - Cardio Good Fitness Data Analysis
Table of Contents
1 Project Objective
2 Assumptions
3 Exploratory Data Analysis – Step by step approach
3.1 Environment Set up and Data Import
3.1.1 Install necessary Packages and Invoke Libraries
3.1.2 Set up working Directory
3.1.3 Import and Read the Dataset
3.2 Variable Identification
3.2.1 Variable Identification – Inferences
3.3 Univariate Analysis
3.4 Bi-Variate Analysis
3.5 Missing Value Identification
3.6 Outlier Identification
3.7 Variable Transformation / Feature Creation
4 Conclusion
5 Appendix A – Source Code
1. Project Objective
The objective of this report is to explore the cardio data set (“CardioGoodFitness”) in R and generate insights about it. The exploration consists of the following:
- Importing the dataset in R
- Understanding the structure of the dataset
- Graphical exploration
- Descriptive statistics
- Insights from the dataset

2. Assumptions
From the data, the purpose of the study appears to be to identify the profile of the typical customer for each treadmill product offered by CardioGood Fitness, and to investigate whether customer characteristics differ across the product lines. To that end, data were collected on individuals who purchased a treadmill at a CardioGoodFitness retail store during the prior three months. The data are stored in the CardioGoodFitness.csv file.

The dataset contains the following customer variables: product purchased (TM195, TM498, or TM798); gender; age (in years); education (in years); relationship status (single or partnered); annual household income; average number of times the customer plans to use the treadmill each week; average number of miles the customer expects to walk/run each week; and self-rated fitness on a 1-to-5 scale, where 1 is poor shape and 5 is excellent shape.

The 180 observations in the dataset relate to 180 unique customers of the treadmill products. The characteristics in the dataset are linked to the fitness level and treadmill usage of the customers. It is also assumed that the data provided are accurate as per the survey/data collected by the company.

The following data dictionary is used for the 9 variables in the dataset:

Sl. No. | Dimension | Detail Description
1 | Product | Model of treadmill product (TM195 / TM498 / TM798)
2 | Age | Age of the customer (Years)
3 | Gender | Gender of the customer (Male & Female)
4 | Education | Education of the customer (Years)
5 | Marital Status | Marital status of the customer (Single & Partnered/Married)
6 | Usage | Weekly average number of times the customer plans to use the treadmill (No. of times per Week)
7 | Fitness | Self-rated fitness of the customer on a 1-to-5 scale, 5 being “very fit” and 1 being “very unfit”
8 | Income | Annual household income of the customer (assumed to be in US$)
9 | Miles | Weekly average number of miles the customer expects to walk/run on the treadmill (Miles per Week)
3. Exploratory Data Analysis – Step by step approach
A typical data exploration activity consists of the following steps:
1. Environment Set up and Data Import
2. Variable Identification
3. Univariate Analysis
4. Bi-Variate Analysis
5. Missing Value Treatment (Not in scope for our project)
6. Outlier Treatment (Not in scope for our project)
7. Variable Transformation / Feature Creation
8. Feature Exploration
We shall follow these steps in exploring the provided dataset. Although Steps 5 and 6 are not in scope for this project, a brief description of them (and of the other steps) is given, as they are important steps in any data exploration journey.

3.1 Environment Set up and Data Import

3.1.1 Install necessary Packages and Invoke Libraries
Use this section to install the necessary packages and load the associated libraries. Keeping all package calls in one place improves code readability.

3.1.2 Set up working Directory
Setting a working directory at the start of the R session makes importing and exporting data files and code files easier. The working directory is the location/folder on the PC that holds the data, code etc. related to the project. Please refer to Appendix A for the source code.

3.1.3 Import and Read the Dataset
The given dataset is in .csv format. Hence, the command ‘read.csv’ is used for importing the file. For example: Cardio <- read.csv("CardioGoodFitness.csv")
Please refer to Appendix A for the source code.

3.2 Variable Identification
The following R functions are used during the analysis:
- dim(): Shows the dimensions (number of rows / number of columns) of the data frame.
- names(): Shows the feature names of the dataset.
- str(): Displays the internal structure of an R object, to identify the classes of the features.

3.2.1 Variable Identification – Inferences
a. summary(<data frame>): Provides a summary of the dataset.
str(Cardio): Provides the structure of the object “Cardio”.
Here the object is reported as a data.frame whose variables are of the “Factor” and “int” data types.
summary(Cardio): Gives a summary of all 9 variables, such as the frequency of each level for the factor variables and the minimum, quartiles, mean and maximum for the integer variables.
b. colSums(is.na(<data frame>)): Checks for missing values. There are no missing values in the data set.

          | No. of Observations | No. of Variables
Dimension | 180                 | 9

No. of Females | No. of Males
76             | 104

Marital Status (Partnered) | Marital Status (Single)
107                        | 73
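The import and variable identification steps described above can be sketched roughly as follows. This is a minimal illustration and not the source code from Appendix A: the four rows are made-up stand-ins for the real file, the column name MaritalStatus and the temporary file name are assumptions, and stringsAsFactors = TRUE is passed explicitly because, since R 4.0.0, read.csv no longer converts text columns to “Factor” by default.

```r
# Toy stand-in for CardioGoodFitness.csv (made-up rows, NOT the real data),
# so that the sketch runs without the original file.
toy <- data.frame(
  Product       = c("TM195", "TM195", "TM498", "TM798"),
  Age           = c(18L, 23L, 31L, 40L),
  Gender        = c("Male", "Female", "Female", "Male"),
  Education     = c(14L, 16L, 16L, 18L),
  MaritalStatus = c("Single", "Partnered", "Partnered", "Single"),
  Usage         = c(3L, 4L, 3L, 5L),
  Fitness       = c(3L, 3L, 4L, 5L),
  Income        = c(30000L, 45000L, 52000L, 90000L),
  Miles         = c(75L, 85L, 95L, 180L)
)
csv_path <- file.path(tempdir(), "CardioGoodFitness_toy.csv")  # hypothetical file name
write.csv(toy, csv_path, row.names = FALSE)

# With the real file, the working directory would be set first, e.g.
# setwd("path/to/project")   # hypothetical path

# Import: stringsAsFactors = TRUE yields the "Factor" columns the report
# describes (the default is FALSE from R 4.0.0 onwards).
Cardio <- read.csv(csv_path, stringsAsFactors = TRUE)

# Variable identification
dim(Cardio)      # number of rows and columns
names(Cardio)    # feature names
str(Cardio)      # class of each feature
summary(Cardio)  # level counts for factors, five-number summary and mean for integers

# Missing values per column, and counts for two of the categorical variables
missing_per_column <- colSums(is.na(Cardio))
gender_counts      <- table(Cardio$Gender)
marital_counts     <- table(Cardio$MaritalStatus)
print(missing_per_column)
print(gender_counts)
print(marital_counts)
```

Run against the real CardioGoodFitness.csv, the same calls would be expected to reproduce the dimension and count tables reported above.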
3.3 Univariate Analysis
Univariate analysis can be carried out on both the categorical variables and the numeric variables. The categorical variables are Product, Gender and Marital Status. The distribution of each of these three variables is represented using a bar plot and a pie chart.
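A rough sketch of how such bar plots and pie charts could be produced with base R graphics is given here. It is an illustration under assumptions rather than the report's own code: the five rows are made up, the column name MaritalStatus and the output file name are hypothetical, and the plots are written to a PDF in the temporary directory so that the example also runs non-interactively.

```r
# Made-up categorical columns standing in for the real data set.
Cardio <- data.frame(
  Product       = factor(c("TM195", "TM195", "TM498", "TM798", "TM195")),
  Gender        = factor(c("Male", "Female", "Female", "Male", "Male")),
  MaritalStatus = factor(c("Single", "Partnered", "Partnered", "Single", "Partnered"))
)

categorical <- c("Product", "Gender", "MaritalStatus")

# Frequency table for each categorical variable
freq <- lapply(Cardio[categorical], table)

# Draw into a PDF file; drop pdf()/dev.off() to plot on screen instead.
plot_file <- file.path(tempdir(), "cardio_univariate.pdf")  # hypothetical file name
pdf(plot_file, width = 9, height = 6)
par(mfrow = c(2, 3))  # top row: bar plots, bottom row: pie charts
for (v in categorical) {
  barplot(freq[[v]], main = v, ylab = "Count")
}
for (v in categorical) {
  pie(freq[[v]], main = v)
}
dev.off()
```

The frequency tables are computed once and reused for both chart types, so the bar heights and pie slices are guaranteed to describe the same counts.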