The University of Sydney DATA2001: Data Cleaning and Exploration


Added on  2022/01/21

AI Summary
This document presents an assignment solution for DATA2001, focusing on data cleaning and exploration using Python. It covers the Python environment, including Jupyter Notebooks, and the use of Pandas for data manipulation and analysis. It delves into data types, levels of measurement (nominal, ordinal, interval, and ratio), and measures of central tendency and dispersion. It explores data acquisition, cleaning, and transformation techniques, including reading data with csv and pandas, handling missing data, and using functions to convert the values in a given column. Throughout, it emphasizes exploratory analysis workflows and gives examples and explanations for cleaning and preparing data with Python tools.
The University of Sydney Page 1
DATA2001: Data Science, Big Data and Data Diversity
Data Cleaning and Exploration with Python
Presented by Alan Fekete
Material prepared by Uwe Roehm
School of Computer Science
Jupyter Notebooks:
The Python Environment in DATA2001
Jupyter Notebooks support interactive Data Science with Python
IPython interactive command shell offers:
Introspection
Tab completion
Command history
Jupyter runs in a browser and supports:
Sharing and documenting of live code
Data cleaning, visualisation, machine learning, …
Jupyter’s gallery of interesting notebooks:
https://github.com/ipython/ipython/wiki/A-gallery-of-interesting-IPython-Notebooks
We provide Jupyter servers which run Python 3
https://ucpu0.ug.it.usyd.edu.au/ (remember to use the VPN if you are off campus)
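As a rough illustration of what introspection and tab completion expose, the sketch below uses plain Python (`dir` and the `inspect` module) rather than IPython's own shortcuts such as `obj?`. The helper name `describe` is hypothetical, chosen only for this example.

```python
import inspect

def describe(obj):
    """Return the first line of an object's docstring, or a placeholder."""
    doc = inspect.getdoc(obj)
    return doc.splitlines()[0] if doc else "(no docstring)"

# Public methods of a string: roughly what tab completion lists after "text."
text = "data2001"
public_methods = [name for name in dir(text) if not name.startswith("_")]
print(public_methods[:5])

# Introspection: read the documentation of a method without leaving the shell
print(describe(str.upper))
```

In a notebook cell, typing `text.` and pressing Tab, or running `str.upper?`, gives similar information interactively.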
1. Click here for the file-open dialogue
2. Click upload next to the file name
Installing Python and Jupyter using Anaconda
You can use our Jupyter server, but it can be slow.
If you wish, you can install Python and Jupyter privately, e.g. using the Anaconda Distribution, which includes Python, the Jupyter Notebook, and other commonly used packages for scientific computing and data science.
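After a private install, a quick check like the one below can confirm that the libraries named in these slides are importable. This is a minimal sketch using only the standard library; the package list and the function name `check_environment` are assumptions for illustration, not part of the course material.

```python
import importlib.util
import sys

# Packages named in these slides; adjust the list to suit your own setup.
PACKAGES = ["pandas", "matplotlib", "numpy", "scipy"]

def check_environment(packages):
    """Map each package name to True if it can be imported, else False."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}

print("Python", sys.version.split()[0])
for name, found in check_environment(PACKAGES).items():
    print(f"{name}: {'installed' if found else 'missing'}")
```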
Python and Data Science Libraries
Python background
Students who did DATA1002: this should mostly be revision
If you didn’t really master pandas and matplotlib before, do so now
Also note the following key differences
More sophisticated ways to consider the kinds of data (not just numerical/categorical)
Students who learned Python elsewhere (e.g. INFO1110): you need to learn how to use particular libraries (pandas, matplotlib, etc.) from the examples here and from online resources
Python general concepts
general program syntax
variables and types
integer and float numbers, string types, type conversion
list, dictionary, tuple and set
condition statements (if/elif/else)
for loops, ranges
functions
print(), len(), lower(), upper(), …
nesting of functions; example: print(len(s.upper())) for a string s
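The concepts listed above can be seen together in one short script. This is only an illustrative sketch; all variable names and values are made up for the example.

```python
# Variables, types and type conversion
count = int("42")          # string -> integer
ratio = float(count) / 8   # integer -> float
label = str(count)         # integer -> string

# The four built-in collection types
readings = [3, 1, 2]                      # list: ordered, mutable
station = {"name": "A", "fuel": "wind"}   # dictionary: key -> value
point = (151.2, -33.9)                    # tuple: ordered, immutable
fuels = {"wind", "coal", "wind"}          # set: duplicates collapse

def classify(value):
    """Condition statements: return a label for a number."""
    if value < 2:
        return "low"
    elif value == 2:
        return "medium"
    else:
        return "high"

# A for loop over a range, calling the function on each list element
labels = []
for i in range(len(readings)):
    labels.append(classify(readings[i]))

# Nested function calls: upper() runs first, then len(), then print()
print(len(station["fuel"].upper()))
print(labels, len(fuels), ratio)
```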
Data Preparation and Exploration with Python
Objective
Learn Python tools for exploring a new
data set programmatically.
Lecture
Data types, cleaning, preprocessing
Descriptive statistics, e.g., median,
quartiles, IQR, outliers
Descriptive visualisation, e.g.,
boxplots, confidence intervals
Readings
Data Science from Scratch: Ch 4-5
Exercises
matplotlib: Visualisation
numpy/scipy: Descriptive stats
TODO in W2/W3
Grok Python modules
Explore the survey data
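The descriptive statistics named in the lecture outline (median, quartiles, IQR, outliers) can be sketched with the standard library alone. The exercises point to numpy/scipy, so treat this as a stand-in rather than the course's own method. The sample values are made up, and the 1.5 × IQR fence is a common rule of thumb assumed here.

```python
import statistics

# Made-up sample values for illustration only
values = [2, 4, 4, 5, 6, 7, 8, 9, 40]

median = statistics.median(values)

# method="inclusive" treats the minimum and maximum of the data
# as the 0th and 100th percentiles
q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Rule of thumb: flag points more than 1.5 * IQR beyond the quartiles
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

print("median:", median)
print("quartiles:", q1, q3, "IQR:", iqr)
print("outliers:", outliers)
```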
Exploratory Analysis Workflow
Example: Analysis of the Major Power Stations dataset from data.gov.au
How can we load this data into Python?
Which data preparation steps are needed?
Source: https://data.gov.au/dataset/ds-ga-04661f51-82ee-144e-e054-00144fdd4fa6/details?q=power%20stations
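One possible answer to the two questions above is sketched below with Python's `csv` module; `pandas.read_csv` would be the pandas equivalent. The sample rows, the column names (`name`, `fuel`, `capacity_mw`) and the helpers `to_float` and `load_stations` are all hypothetical, since the real file's layout is not shown here.

```python
import csv
import io

# Hypothetical sample rows; the real file's columns may differ.
SAMPLE_CSV = """name,fuel,capacity_mw
Station A,Coal,1200
Station B,Wind,
Station C,Solar,55.5
"""

def to_float(value):
    """Convert a CSV cell to float, returning None for blank or invalid cells."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None

def load_stations(handle):
    """Read rows as dictionaries and convert the capacity column to numbers."""
    rows = []
    for row in csv.DictReader(handle):
        row["capacity_mw"] = to_float(row["capacity_mw"])
        rows.append(row)
    return rows

# io.StringIO stands in for open("some_file.csv", newline="")
stations = load_stations(io.StringIO(SAMPLE_CSV))
missing = [s["name"] for s in stations if s["capacity_mw"] is None]
print(len(stations), "rows; missing capacity:", missing)
```

The preparation steps shown are the two most common ones: converting text cells to numeric types and making missing values explicit instead of letting them fail silently later.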
Preliminaries:
Types of Data and Levels of Measurement