Data Cleaning Workflow: Data Collation, Formatting, and Analysis

Verified

Added on  2021/06/15

|5
|973
|180
Practical Assignment
AI Summary
This assignment details a student's data cleaning workflow, focusing on the preparation of library data from three different sources. The student collated the data, addressed formatting inconsistencies in date and language columns, and created a new client column. The process involved the use of Excel functions such as 'Filter', 'Replace', and 'Format Cell' to standardize data entries. The student also addressed issues of missing values and abbreviations. The assignment includes details of column transformations, data cleaning processes, and the handling of inconsistent entries, ultimately aiming to prepare the data for analysis. The final dataset contained 18 cleaned observations, reflecting the application of various data cleaning techniques and the removal of redundant or incomplete data.
Document Page
Data Cleaning: Work Flow
The data from the three sheets were first collated into one and the data was processed to make
the data useful for analysis.
Source Files
WestdaleLibraryDataISPPA2Feb2018
Data not Parsed Correctly. Dates were incorrectly entered in Column “ACCESSION
NUMBER”. The data was removed from the “Accession Column” to Formatted the
Column Using “Format Cell --> Date-“
In Column “LANG” , all different spellings of language “Noongar” were filtered, using
“Filter” tab ; and “Replace” Command was used to replace them with Noongar.
Similarly, “Wiradjuri” was used to replace the different variations of the spelling.
An Additional Column named “CLIENT” was created. Which contained the cocanated
data from “CLIENT_Second_Name” and “CLIENT_First_Name:.
The following Columns were Copy Pasted to the Respective colums in the Destination
__Data
Table 1 COLUMNS INTERCHANGED
Source Sheet Destination Sheet
Date DATE
Thing Number USN
Language group LANG
COLOUR M_NAME
MATERIAL MATERIAL
FABRIC FABRIC
COST PRICE (AUD)
MAKER M_NAME
CITY M_CITY
SUBURB M_SUBURB
ZIP CODE M_PC
NICKNAME M_NAME
tabler-icon-diamond-filled.svg

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Document Page
MarciaBradyDataISPPA2Feb2018
Formatted the “DATE” Column Using “Format Cell --> Date-“
Data was not parsed properly. The numeric characters were manually removed and
patsted in the empty column to the left. The empty colum was names as “Price(AUD)
In Column “Address” , the City and suburb name were together and not parsed. Another
column was added, naed as “suburb” and suburb values from the address column were
cut and paste manually.
In Column “Language group”, only initials were mention. Hence, the initial “N” was
replaced by “NOONGAR” Similarly, “Wiradjuri” was used to replace the initial W, using
the Replace command. “Westd. Council” was replaced by “Westdale City Council”.
In the column, “Made_for” proper names were added to replace common nouns.
The following Columns were Copy Pasted to the Respective colums in the Destination
__Data
Table 2 COLUMNS INTERCHANGED
Source Sheet Destination Sheet
DATE DATE
Price (AUD) PRICE (AUD)
Thing Number USN
Language group LANG
COLOUR COLOUR
MATERIAL MATERIAL
FABRIC FABRIC
MAKER M_NAME
CITY M_CITY
SUBURB M_SUBURB
Name S_NICKNAME
MADE_FOR CLIENT
DonorAuctionCatalogueDataISPPA2Feb2018
Data from Columns CLIENT_First_Name and CLIENT_Second_Name were merged
together.
The following Columns were Copy Pasted to the Respective colums in the Destination
__Data

Source Sheet Destination Sheet
Date DATE
Document Page
LANG LANG
NUM USN
M_NAME LANG
M_CITY COLOUR
M_SUBURB MATERIAL
M_PC FABRIC
CLIENT CLIENT
PRICE (AUD) PRICE (AUD)
COLOR COLOUR
NICKNAME S_NICKNAME
MATERIAL MATERIAL
FABRIC FABRIC
DESTINATION FILE
Data Labels
Date: Date describes the date of entry of the observations
USN: USN stands for Unique Sequence Number. This column describes the unique entry
number.
S_Nickname: This describes the Scarecrow nickname .
Client: This column describes the name of the client
Language: This column lists the language spoken by the client.
M_name: This column lists the name of the manufacturer.
M_PC: This column lists the Pin Code/ Zip code of the manufacturer
M_Suburb: This column describes the suburb of the maker location.
M_City: This entry lists the city that the manufacturer of the scarecrow is located in.
Price: The selling price of the scarecrow is listed in Australian Dollars.
Colour: Colour describes the colour of the scarecrow.
Fabric: Fabric describes the fabric used to decorate the scare crow
Document Page
Material: Material describes the material that the scarecrow is primarily made of.
PROCESSES
Data with null or no observations were deleted. Apart from the basic cleaning processes such as
correcting spelling mistakes, and using full forms of abbreviations, the following changes were
made , in the process of cleaning.
Name
1. All observations containing blanks, none and null observations were deleted. All
observations using 0, Null , None were filtered , using filter tab and the selected rows
were deleted.
2. Braidy was replaced with Brady in 3 observations using “Replace” function in Excel.
USN
Serial number observations were added. I n some cases, serial number was not available. Hence,
the observation is NA
Date
Date column is formatted as dd-mon-yy using the “DATE” function in excel
1. For Djeran, the season runs from April to May. Hence, the proxy 15th April was used for
date “’73 Djeran” using the Function “Replace”
2. For Birak, the season runs from January to December . Hence, the date could be either
December 1974 or January 1974 since, the financial year does ends in June, it is possible
to enter the date in the wrong financial year. Hence, the data falls in different years.
3. For ’75 Good Friday, Good Friday fell on 28th March, hence the date 28th March ws
entered.
tabler-icon-diamond-filled.svg

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser
Document Page
Fabric
The “Replace” Function was used to change the fabric
Zip Code
Data from ‘Westdale Library Data’ column, “ZIP CODE” was used for data in
“M_pc”
A total of 18 observations were remainder after the cleaning.
M_City
Different variations of “Wabbawabballup” were filtered and corrected to have a uniform
spelling “Wabbawabballup” using Find and Replace function.
M_Suburb
Different variations of “Wabbawabballup” were filtered and corrected to have a uniform
spelling “Wabbawabballup” using Find and Replace function.
chevron_up_icon
1 out of 5
circle_padding
hide_on_mobile
zoom_out_icon
[object Object]