Lecture Notes: Computational Problem Solving - Week 5

Verified

Added on 2023/01/13

AI Summary

This document contains lecture notes on basic statistics, sets and tuples, dictionaries, and string/file manipulation in Python. It is part of the Computational Problem Solving course and is suitable for students studying this subject.

Contribute Materials

Your contribution can guide someone’s learning journey. Share your documents today.

Lecture Notes
Computational Problem Solving
Week 5:
Basic Statistics, Sets and Tuples, Dictionaries
and String / File Manipulation in Python
Contents:
Contents
Basic Statistics ............................................................................................................................................................... 2
Types of Data and Sampling ......................................................................................................................................... 2
Arithmetic Mean, Geometric Mean and Harmonic Mean ........................................................................................... 4
Median and Mode ........................................................................................................................................................ 8
Interquartile Range and Standard Deviation ................................................................................................................ 9
Parametric and Non-parametric Data ........................................................................................................................ 13
Regression and Correlation ........................................................................................................................................ 14
Misuse of Statistics ..................................................................................................................................................... 16
Introduction to Data Structures ................................................................................................................................. 18
Linked Lists .............................................................................................................................................................. 18
Queues .................................................................................................................................................................... 19
Rings ....................................................................................................................................................................... 23
Sets in Python ............................................................................................................................................................. 24
Dictionaries ................................................................................................................................................................. 27
Error Handling ................................................................................................................ Error! Bookmark not defined.
String Manipulation .................................................................................................................................................... 29
File Manipulation ........................................................................................................................................................ 31
Mathematics Questions .............................................................................................................................................. 32
Programming Questions ............................................................................................................................................. 33

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.

Basic Statistics
Descriptive statistics is the process of summarising data. You would have looked at some aspects of this from
school. Another area of statistics is called inference which is inferring something from the data given (which leads
to prediction). The combination of statistics and programming is often referred to as data analysis and data
science. Data science is one of the hottest areas of computer science currently and is used throughout the
modern world. For example, when you think about what to watch next on Netflix, the internal algorithms will
predict possible shows to watch based on your previous viewing habits. Likewise, when you purchase items from
amazon. One of the main tools used for data science is Python together with some libraries to make it both easier
and more efficient. Another popular programming tool for statistics is R.
One infamous story about the power of data science is the following:
Meeting with a Target manager while clutching some printed ads that had been delivered to his home, a
customer was outraged by their product placement aimed at his young daughter. 'My daughter got this in the
mail!' the man said to the manager according to the Times, showing him the store's coupons advertising baby-
related products. 'She's still in high school, and you're sending her coupons for baby clothes and cribs? Are you
trying to encourage her to get pregnant?' the man asked the manager who responded with a baffled apology. The
manager himself didn't know why the man's daughter had received the items. He followed up with a phone call to
him later on, apologizing again. What the manager didn't know was that the company's analytics department
most likely had noticed a pattern of products related to early pregnancy purchased by the man's daughter. Clues
such as vitamin supplements, large quantities of lotion, and hand sanitizers, typical to many pregnant women
according to the Target department, signal other items the consumer may need. Or as a woman's pregnancy
continues, what she will soon need, like nappys. 'I had a talk with my daughter,' the man responded later to the
Target manager after listening to his second apology on the phone. 'It turns out there's been some activities in my
house I haven't been completely aware of. She's due in August. I owe you an apology,' he said.
So, when you use your clubcard, Tescos are monitoring your data to see what you might want to buy to target
vouchers to your potential spending habits.
Types of Data and Sampling
Data is often categorised into qualitative and quantitative. Qualitative data is data which is not numerical. For
example, favourite colours would be a type of qualitative data. Conversely, quantitative data is data that is
numerical, for example shoe size, height, etc.
Quantitative data itself may also be split into two categories to make analysis easier. These two partitions are
discrete data and continuous data.
Discrete data is data such as shoe size which are distinct values, e.g. 6, 6.5, 7, 7.5. You are unable to by a size 7.12
shoe, for example. On the other hand, continuous data has no such distinct values. For example, the height of a
person may be 1.62m but if we had a better way to measure it, we might find the height is actually 1.617m, and
we might be able to measure the height even more precisely. Sometimes, it can be convenient to discretise data
to make it easier to analyse or put data into “bins”. For example, for height, as we cannot measure the exact
height of a person, we might put it into a category of, say, 160𝑐𝑚 ≤ ℎ < 165𝑐𝑚with some other bins, such as
165𝑐𝑚 ≤ ℎ > 170𝑐𝑚. It is important that when we are placing data into these bins that there are no overlaps
between the categories.
Definition 9.1
A population is the set of things that you are interested in.
A sample is a subset of the population

The population may be finite, such as the films currently in the cinema or infinite, such as the possible locations
that an archers arrow might land in. Parameters for a population can be almost impossible to accurately measure,
so it is best to take a statistic from a sample and use that result to say something about the population.
Definition 9.2
A parameter is a number that describes the entire population. A statistic is a number taken from a
single sample— you can use one or more of these to estimate the parameter.
The population might be large, say the population of the UK, and any statistics would be hard to gather. A census
is when data is collected from the whole population of interest. In terms of the UK, there is a census every ten
years to collect data about the population to help ascertain where to place money for large infrastructure
projects (amongst other uses).
A sample, as previously mentioned, is a subset of the population. There are many different techniques to gather a
sample and these depend on the situation and intended use of the data.
Sampling Method Description
Sample random
sampling
Every member of the population is equally likely to be chosen. For example,
allocate each member of the population a number. Then use random
numbers to choose a sample of the desired size.
Systematic sampling Find a sample of size 𝑛 from a population of size 𝑁by taking one member
from the first 𝑘 members of the population at random and then selecting
every 𝑘𝑡ℎmember after that, where 𝑘 = 𝑁
𝑛
Stratified sampling When you know you want distinct groups to be represented in your sample,
split the population into these distinct groups and then sample within each
group in proportion to its size.
Opportunity sampling Take samples from members of the population you have access to until you
have a sample of the desired size.
Quota sampling When you know you want distinct groups to be represented in your sample,
decide how many members of each group you wish to sample in advance
and use opportunity sampling until you have a large enough sample for
each group.
Cluster sampling Split the population into clusters that you expect to be similar to each
other, then take a sample from each of these clusters.
Within these broad categories, further refinements might be made, e.g., to choose the sample within the
stratified sampling or combined with other methods (e.g., a combination of stratified sampling and then
opportunity sampling). Often the sampling method is constrained by the data collection. Systematic sampling can
only be used if we can list the data and assign a identifier with each item. Sometimes due to ethical reasons (e.g.,
drug testing) have to use opportunity sampling as you cannot just ask random people on the street.
Sometimes there is bias within the sampling method. A bias is a flaw in the sampling procedure that makes it
more likely that the sample will not be representative of the population. Examples of where bias occurs is: faulty
sampling (sample is not representative), leading questions (questions are worded as to influence the answers),
faulty interviewing (failure to interview the entire sample, misreading questions, misinterpreting what was said),
lack of knowledge or understanding (person being interviewed does not understand or does not have the
information needed) and false answers (person being interviewed intentionally gives incorrect information).
There are some famous examples of bias within sampling, for example, during the Brexit referendum where the
remain campaign was thought to have won due to interviewers asking only select populations and people lying
due to peer pressure (this has also happened in various general elections).

Arithmetic Mean, Geometric Mean and Harmonic Mean
The arithmetic mean is the most common type of average. All averages are called a measure of central tendency
and is a metric to summarise a typical data value. The choice of average is very much dependent on both the
application and the type of data being analysed.
The arithmetic mean (or the mean) is the average used when data is symmetrical without any data which is too
extreme. If, when plotting data, the distribution looks like:
Then the arithmetic mean is a good type of average to use and is always the largest of the three main types of
mean. The above graph is an example of something called the normal (or Gaussian) distribution.
Definition 9.3
The arithmetic mean is for a set of data 𝑥𝑖 of length 𝑛 is:
𝑥̅ = 𝑎1 + 𝑎2 + 𝑎3 + ⋯ + 𝑎𝑛
𝑛
Or alternatively:
𝑥̅ = 1
𝑛∑𝑎 𝑖
𝑛
𝑖=1
The notation 𝑥̅ means the arithmetic mean of 𝑥 where 𝑥 is your dataset. This notation is used for samples. If you
have the population, you may use the Greek letter mu – 𝜇 – instead.
The above definition is a formal way to say add the numbers together and divide by the number of items there
are.
Although the arithmetic mean is the most common type of mean, there are other types of means too. Two of
which are the geometric mean and the harmonic mean.

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.

The geometric mean is best used for normalised data or average growth rates such as an average of percentages
and the value is in between the arithmetic and geometric means. Rather than adding the data values and dividing,
this time you multiply the data values and take the nth root.
Definition 9.4
The geometric mean is for a set of data 𝑥𝑖 of length 𝑛 is:
𝑥̅ = √𝑎1 × 𝑎2 × … × 𝑎𝑛
𝑛
Or alternatively:
𝑥̅ = (∏𝑎 𝑖
𝑛
𝑖=1
)
1
2
And the final type of mean that we are going to look at is the harmonic mean:
Definition 9.5
The harmonic mean is for a set of data 𝑥𝑖 of length 𝑛 is:
𝑥̅ = 𝑛
1
𝑎1 + 1
𝑎2 + ⋯ +1
𝑎𝑛
Or alternatively:
𝑥̅ = 𝑛
∑ 1
𝑎𝑖
𝑛
𝑖=1
The harmonic mean is therefore the arithmetic mean of the reciprocal of the data values. The harmonic mean is
often used to calculate the average of rates or ratios because it equalises the weights of each data point. The
harmonic mean is the lowest of the three types of mean we are looking at. The arithmetic mean, on the other
hand, places weight on the large data points while the geometric mean gives a lower weight to the smaller data
points. In finance, the harmonic mean is used to determine the average of price-to-earnings and the calculation of
the ratio of a portfolio consisting of several securities.

Worked Example:
You are a stock analyst in an investment bank. Your manager asked you to determine the P/E rati
index of the stocks of Company A and Company B. Company A reports a market capitalization of $
and earnings of $20 million, while Company B reports a market capitalization of $20 billion and ea
of $5 billion. The index consists of 40% of Company A and 60% of Company B.
Firstly, we need to find the P/E ratios of each company. Remember that the P/E ratio is essentially
market capitalization divided by the earnings.
P/E (Company A) = ($1 billion) / ($20 million) = 50
P/E (Company B) = ($20 billion) / ($5 billion) = 4
We must use the weighted harmonic mean to calculate the P/E ratio of the index. Using the formu
weighted harmonic mean, the P/E ratio of the index can be found in the following way:
P/E (Index) = (0.4+0.6) / (0.4/50 + 0.6/4) = 6.33
Note that if we calculate the P/E ratio of the index using the weighted arithmetic mean, it would
significantly overstated:
P/E (Index) = 0.4×50 + 0.6×4 = 22.4
Your turn:
You are a stock analyst in an investment bank. Your manager asked you to determine the P/E rati
index of the stocks of Company A and Company B. Company A reports a market capitalization of $
billion and earnings of $53 million, while Company B reports a market capitalization of $53.3 billi
earnings of $7.4 billion. The index consists of 35% of Company A and 65% of Company B.
Whilst the choice of the type of mean can be important, for any sufficiently large quantities of data, the
arithmetic mean is best due to both computational efficiency and a result of the law of large numbers.
Definition 9.6
The arithmetic-geometric mean combines the arithmetic and geometric means for two positive real
numbers.
Let 𝑎0 = 𝑥and 𝑔0 = 𝑦. Then
𝑥𝑛+1 = 1
2(𝑎𝑛 + 𝑔𝑛)
𝑔𝑛+1 = √𝑎𝑛 × 𝑔𝑛
These two values will converge to the same number

Worked Example:
The daily maximum gust (knots) during the first 20 days of June 2015 is recorded in Hurn. The da
shown below:
14, 15, 17, 17, 18, 18, 19, 19, 22, 22, 23, 23, 23, 24, 25, 26, 27, 28, 36, 39
Find the arithmetic mean, geometric mean and harmonic mean for this data
The arithmetic mean is
∑𝑥
𝑛
∑𝑥 = 14 + 15 + 17 + 17 + 18 + 18 + 19 + 19 + 22 + 22 + 23 + 23 + 24 + 25 + 26 + 27 + 28 + 3
∑𝑥 = 410
∴ 𝑥̅= ∑𝑥
𝑛 = 410
20 = 20.5
The geometric mean is(∏𝑥)1
𝑛
∏𝑥 = 14 × 15 × 17 × 17 × 18 × 18 × 19 × 19 × 22 × 22 × 23 × 23 × 24 × 25 × 26 × 27 × 28 × 36 ×
∏𝑥 = 6921683245 … ..
∴ 𝐺 = √∏𝑥𝑛 = √6921683245 … . .
20
= 21.98
The harmonic mean is
𝑛
∑(1
𝑥)
∑ (
1
𝑥) = 1
14+ 1
15+ 1
17+ 1
17+ 1
18+ 1
18+ 1
19+ 1
19+ 1
22+ 1
22+ 1
23+ 1
23+ 1
24+ 1
25+ 1
26+ 1
27+ 1
28+ 1
36
+ 1
39
∑ (
1
𝑥) = 0.8508 …
∴ 𝐻 = 𝑛
∑ (
1
𝑥)
= 20
0.8508 …
= 23.50656989. . . ≈ 23.5
Your turn:
Calculate the arithmetic, geometric and harmonic means for the following data:
14, 15, 18, 18, 20, 21

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

Median and Mode
The median and mode are taught frequently at school. The median is the middle value in a sorted list and the
mode is the most common.
The median is best used for data which is not symmetrical (called skewed data). For example, when calculating an
average wage.
Definition 9.7
The median is the “middle” value in your data range.
This is denoted as either 𝑚 or 𝑄2 (more later)
Definition 9.8
The mode is the data item that occurs the most often in your dataset.
It is most useful for qualitative data or when you are looking for the “most popular” item.
Worked Example:
The daily maximum gust (knots) during the first 20 days of June 2015 is recorded in Hurn. The da
shown below:
14, 15, 17, 17, 18, 18, 19, 19, 22, 22, 23, 23, 23, 24, 25, 26, 27, 28, 36, 39
Find the median and the mode
The median is the middle value
20+1
2 = 10.5𝑡ℎ which is between 23 and 23 therefore 𝑄2 = 23
The mode is the most common which is 23 with 3 entries.
Your turn:
Calculate the median and mode for the following data:
14, 15, 18, 18, 20, 21

Interquartile Range and Standard Deviation
Another useful statistic in addition to a measure of central tendency is the measure of dispersion or spread of the
data. Two different data sets may have the same arithmetic mean, say, but different spreads.
The above diagram shows three different distributions all with the same arithmetic mean. However, their spread
is different. Distribution 1 has the least amount of spread and distribution 3 the most with distribution 2 in the
middle. Like averages, we have different ways to calculate the spread of the data. The most common of which are
the interquartile range and the standard deviation. At school you would have most likely have learnt the range
(which is a crude but simple measure), the median and possibly the interquartile range. If you have studied
mathematics beyond 16 you would have also probably met the standard deviation.
Firstly, the range may be defined as:
Definition 9.8
The range is the difference of the largest value and the smallest value
The interquartile range is more difficult to calculate and first we need to be able to calculate quartiles. A quartile
is a bit like the median (in fact the median is the second quartile) and is a measurement of a typical data value at
every 25% through the data.
Definition 9.9
The lower quartile or 𝑄1 is the data point that is a quarter through the data
The upper quartile or 𝑄4 is the data point that is three quarters through the data
The interquartile range is 𝑄3 − 𝑄1 and covers 50% of the data.
The interquartile range is the metric for dispersion that should be used when using the median

The exact method to calculate the quartiles depends on whether the data is discrete or continuous:
Lower Quartile Upper Quartile
Discrete
Divide 𝑛 by 4
If whole, the LQ is between this value
and the one above
If not whole, round up and take that
data point
Divide 3𝑛 by 4
If whole, the UQ is between this value and the
one above
If not whole, round up and take that data point
Continuous Divide 𝑛 by 4 and take that data point Divide 3𝑛 by 4 and take that data point
We may illustrate the above with an example:
Worked Example:
The daily maximum gust (knots) during the first 20 days of June 2015 is recorded in Hurn. The da
shown below:
14, 15, 17, 17, 18, 18, 19, 19, 22, 22, 23, 23, 23, 24, 25, 26, 27, 28, 36, 39
Find the range and interquartile range.
The range is 39 − 14 = 25
The lower quartile is the
20
4 th value which is the 5th value. As this is exact, we take the 5.5th value which is
between 18 and 18 so 𝑄1 = 18
The upper quartle is therefore the 3 × 5 = 15th value. So we take the 15.5 value which is between 25 and
26 so therefore 𝑄3 = 25.5
The interquartile range is therefore 25.5 − 18 and 𝐼𝑄𝑅 = 7.5
Your turn:
Calculate the median and mode for the following data:
14, 15, 18, 18, 20, 21
At school it is often considered that the median, range, etc are the easiest to work out. This is because it does not
require too many complex calculations. However, this is not the case in general. Unfortunately, these methods
require that the data is sorted. For a large collection of data this is quite time consuming. The arithmetic mean

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.

and standard deviation rely mostly on addition and so is more efficient for larger datasets. Luckily, the central
limit theorem and the law of large numbers can be used to approximate any distribution as a symmetric (normal)
distribution given enough data.
The variance and standard deviation can now be defined. These are both measures of spread and involve the fact
that each data point deviates from the mean by 𝑥 − 𝑥̅, where 𝑥 is a data point and 𝑥̅ is the arithmetic mean. The
variance is then calculated as “the average of the squared distances from the mean” so that the distances of each
data point from the mean are all squared and then divided by the amount there are. This gives the following
definition:
Definition 9.10
The variance denoted as 𝜎2 where 𝜎is the Greek letter sigma may be defined as:
𝜎2 = ∑(𝑥 − 𝑥̅)2
𝑛
= ∑𝑥2
𝑛 − ( ∑𝑥
𝑛 )
2
Or equivalently as 𝜎2 = 𝑆𝑥𝑥
𝑛
If the data set is a sample rather than the population 𝑠2 is used for the variance.
The standard deviation is the positive squared root of the variance and 𝜎is used to denote the standard deviation
for a population and 𝑠 for a sample.
Definition 9.11
The standard deviation denoted as 𝜎, may be defined as:
𝜎 =√∑(𝑥 − 𝑥̅)2
𝑛
= √∑𝑥2
𝑛 − ( ∑𝑥
𝑛 )
2
The standard deviation tells you the range from the mean which contains around 68% of the data (if the data is
normally distributed). For example, if 100 students have a mean height of 150cm and a standard deviation of
10cm, 68 students are within one standard deviation and 95 students are within two standard deviations of the
mean.
The standard error is an estimate standard deviation of its sampling distribution. The sampling distribution of
mean is generated by repeated sampling from the same population and recording the sample means obtained.
This forms a different distribution with its own mean and standard deviation.

Worked Example:
For a set of data:
∑𝑥 = 120
𝑛 = 7
∑𝑥2 = 2250
What is 𝑥̅and 𝜎?
𝑥̅ = 120
7 = 17.14
𝜎2 = ∑𝑥2
𝑛 − (∑𝑥
𝑛 )
2
𝜎2 = 2250
7 − (120
7 )
2
≈ 27.55
𝜎 ≈ 5.25
Your turn:
For a set of data:
∑𝑥 = 872
𝑛 = 11
∑𝑥2 = 74945
What is 𝑥̅and 𝜎?
Definition 9.12
The standard error may be defined as:
𝑠𝑡𝑎𝑛𝑑𝑎𝑟𝑑 𝑒𝑟𝑟𝑜𝑟 =
𝜎
√𝑛
Where 𝜎is the standard deviation and 𝑛 is the size of the data
This estimate has many practical applications in hypothesis testing as well as confidence intervals. Confidence
intervals are vitally important when summarising data or creating trend lines. May predictions have a degree of
uncertainty attached. Furthermore, it is dangerous to report a measure of central tendency without a measure of
dispersion (although common!).
Definition 9.13
The confidence interval may be defined as:
[𝑥̅− 𝑐 𝑠
√𝑛, 𝑥 +
𝑐 𝑠
√𝑛]
Where 𝑠 is the standard deviation of the sample and 𝑐 is a parameter to specify the interval. For a
95% confidence interval, which is common 𝑐 is 1.96.
A 95% confidence interval does not mean that 95% of the data is within this range nor does it mean that there is a
95% probability of a sample parameter from a repeated experiment falls within this interval but rather it
represents the long-term proportion of corresponding confidence interval that end up containing the true value
of the parameter.

Parametric and Non-parametric Data
We have already alluded to parametric and non-parametric when discussing types of average and spread.
The normal distribution is a distribution that is symmetrical and looks similar to the following when plotted:
Within this graph, 68% of the data falls within one standard deviation of the mean and 95% falls within two
standard deviations of the mean. A lot of continuous data is normally distributed, such as heights and mass. You
have a few people who are relatively short or tall, but the majority of people are nearer the mean height.
Definition 9.14
Parametric data is data that follows a normal distribution. This can be approximate (considering
noise). For example, heights.
Nonparametric data is data that when plotted does not look like this curve. For example, wages.
An example of data that is nonparametric are wages which has the following distribution:
The graph is heavily skewed to the left where people earn less with relatively fewer earning over £90000.

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

Regression and Correlation
Correlation looks at the linear relationship between data. Correlation may be measured by:
𝑟 = 𝑆𝑥𝑦
√𝑠𝑥𝑠𝑦
Where 𝑆𝑥𝑦 = ∑𝑥𝑦 −
∑𝑥∑𝑦
𝑛 , 𝑠𝑥 = √∑𝑥2
𝑛 − 𝑥̅2 and 𝑠𝑦 = √∑𝑦2
𝑛 − 𝑦̅2. Most packages will calculate this for you. This
provides a value between −1 and 1 which is used to describe the correlation. We will not use this formula
ourselves unless we decide to use it within a programming example. The more important concept is that of the
correlation coefficient.
Definition 9.15
If the correlation coefficient is:
𝑟 = 1 Strong Positive Correlation
𝑟 = 0 No correlation
𝑟 = −1 Strong Negative Correlation
Values between 0 and 1 are positive correlation and between -1 and 0 are negative correlation
When calculating the correlation coefficient, it is important to look at the data. Just because the correlation
coefficient is 0 does not mean that there is no relationship. Conversely, just because the coefficient is not zero
does not mean there is a relationship (this is the adage “correlation does not mean causation”). For example:
This graph indicates that if you buy an ice cream you are likely to be attacked by a shark. However, you are more
likely to go to the beach if it is hot and swim in the sea plus also more likely to buy an ice cream. The hidden third
variable in this case is the weather. The image below shows various relationships between data and the
correlation coefficient. The last row is interesting as there is clearly a relationship but it is not linear and so
therefore the correlation coefficient is zero.

The effect size is a value measuring the strength of the relationship between two variables in a population.
Reporting estimates of effect sizes when presenting empirical research. One common metric for calculating the
effect size is Cohen’s 𝑑 which is defined as the difference of the two means divided by a standard deviation for
the data:
𝑑 =𝑥1̅̅̅− 𝑥 2̅̅̅
𝜎
Where 𝜎 =√ 𝑠1
2+𝑠2
2
𝑛1+𝑛2
and 𝑠1 and 𝑠2 are the standard deviations for the two variables.
Regression (sometimes called the line of best fit) is used to define a line that summarises the relationship
between data. Linear regression aims to minimise the distances of the data points to the line of best fit.
An example of a simple linear regression fit is
The calculation of this line is 𝑦 = 𝑚𝑥 + 𝑐where 𝑚 =∑𝑥𝑦−
∑𝑥∑𝑦
𝑛
∑𝑥2−(∑𝑥)2
𝑛
and 𝑐 = 𝑦̅ − 𝑚𝑥̅.

Misuse of Statistics
Statistics is misused frequently, as Abraham Lincoln said: “there are lies, damned lies and statistics”. Statistics
itself does not lie, but sometimes information might be hidden or the wrong measure for an average is
deliberately used to make something either look better or worse (depending on the view / arguments wanting to
be made).
The above graph looks at gun deaths in Florida after the enactment of a new law. The graph clearly shows that
the number of gun deaths is getting better. That is until you look at the scale which has been inverted. The higher
values are nearer the bottom and the lower values near the top. Therefore, this graph actually indicates that the
gun laws have not helped with the law enforcement trying to show that it has helped.
Other tricks includes changing the scales which shows the trend being better than it is:

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.

Or putting graphs in 3D which again hides information and makes the proportions different.
Generally, the following should be checked when looking at misleading graphs:
• Wrong graph type. Think about what you want to present. Trends are best displayed using lines.
• Compositions best displayed using segmented-bar-charts.
• Missing text. All tick-marks and axes must be labelled. The graph needs a title.
• Inconsistent scale. The scale must be constant across the graph; don’t change the increments between
tick marks.. Most people read increasing scales from left to right and from bottom to top.
• Comparative graphs must be plotted on the same axes to facilitate comparisons.
• Misplaced zero point. Most people assume that the zero point is at the bottom of the graph. This can give
a very misleading impression of the amount of change present in a data series.

Introduction to Data Structures
Last week within the programming component we looked at an introduction to classes and object-oriented
programming. Whilst this module is not focused on object-oriented programming itself (you explore more in
future years), we will use classes to help structure our programs.
Classes are used to help construct various data structures. You have already met lists and tuples as an inbuilt data
structure, that is one that is part of the Python programming language itself. We will look at some other types of
structures that help a lot in programming (and is often asked at interviews too!). We will not explore this in great
depth as this is a level 4 module and, again, in level 5 you will undertake a course that is devoted to data
structures and algorithms. This is very much a light introduction to this important topic.
The first data structure which we will discuss is linked lists.
Linked Lists
A linked list is a dynamic data structure that contains some data and then points to the next item. Each item may
be stored at a completely different place in memory (this is called non-contiguous). Each data item includes a
pointer to the next item. The first node is called the head and the last node has a pointer to nothing, sometimes
called the NULL pointer. In Python we may use the None keyword.
To set up a linked list in Python, we first have to construct a class called Node that stores each element in the list.
class Node:
def __init__(self, data):
self.data = data
self.next = None
The self keyword denotes a reference to a variable contained within the class. Unfortunately, there is not much
we can do with this class. In fact, we do not even need to use a class, some languages have other methods as all
we require is to group some variables together.
To hold some methods that we will need to use to actually use a linked list we will form another class called
LinkedList as follows:
class linkedList:
def __init__(self):
self.head = None
def append(self, data):

if self.head is None:
self.head = Node(data)
else:
currentNode = self.head
while currentNode.next is not None:
currentNode = currentNode.next
currentNode.next = Node(data)
The main feature of this code is the append method. We go through each element in the list until we get to the
last element (which points to None). This then inserts the new element at the end of the linked list. If the list has
no elements then the head points to None and instead we will insert the new node there.
We can add a function (or method) to print out each element in the linked list as follows:
def traverse(self):
currentNode = self.head
while currentNode.next is not None:
print(currentNode.data)
currentNode = currentNode.next
If we use some dummy information to test this, we can write in the file
And this should result in the output:
Queues
A queue is a linear data structure similar to that of a linked list. With a queue the least recently added item is
removed first. An example of this is a queue of customers in a shop, the consumer that came first is served first.

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

With a queue, we will need the following methods (taken from Queue in Python - GeeksforGeeks):
• Enqueue: Adds an item to the queue. If the queue is full, then it is said to be an Overflow condition – Time
Complexity : O(1)
• Dequeue: Removes an item from the queue. The items are popped in the same order in which they are
pushed. If the queue is empty, then it is said to be an Underflow condition – Time Complexity : O(1)
• Front: Get the front item from queue – Time Complexity : O(1)
• Rear: Get the last item from queue – Time Complexity : O(1)
To implement a queue, we may use the list datatype as follows:
Another option, is to use the implementation using collections.deque. Queue in Python may be implemented
using deque class from the collections module. Deque is preferred over list in the cases where we need quicker
append and pop operations from both the ends of container, as deque provides an O(1) time complexity for

append and pop operations as compared to list which provides O(n) time complexity. Instead of enqueue and
deque, append() and popleft() functions are used (again taken from GeeksforGeeks website).
Which results in the following output:
Finally, we may also use the implementation using queue.Queue
Queue is built-in module of Python which is used to implement a queue. queue.Queue(maxsize) initializes a
variable to a maximum size of maxsize. A maxsize of zero ‘0’ means a infinite queue. This Queue follows FIFO rule.
There are various functions available in this module:
• maxsize – Number of items allowed in the queue.
• empty() – Return True if the queue is empty, False otherwise.
• full() – Return True if there are maxsize items in the queue. If the queue was initialized with maxsize=0
(the default), then full() never returns True.
• get() – Remove and return an item from the queue. If queue is empty, wait until an item is available.
• get_nowait() – Return an item if one is immediately available, else raise QueueEmpty.
• put(item) – Put an item into the queue. If the queue is full, wait until a free slot is available before adding
the item.

• put_nowait(item) – Put an item into the queue without blocking. If no free slot is immediately available,
raise QueueFull.
• qsize() – Return the number of items in the queue.
With the following output:

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.

Rings
A ring data type, also called a circular queue or ringed buffer links the first element and the last elements
together to form a ring.
This is essentially the same as a queue with a maximum size that will loop round to cover itself in a circular
motion. These are used for buffering data streams, memory management, etc. They are also useful for solving
certain types of problems such as the Josephus problem. Please read and watch the video contained within the
page Circular Queue or Ring Buffer. Python and C Implementation. | by Brian Ward | Towards Data Science.
A python implementation, taken from Circular Queue or Ring Buffer. Python and C Implementation. | by Brian
Ward | Towards Data Science, is given below:

Please also read the webpage Linked Lists in Python: An Introduction – Real Python.
Sets in Python
As a reminder (this topic is covered in more detail within Introductory Programming and Discrete Structures), a
set is a collection of unordered objects. Tuples are where order does matter, and this was explored in unit 3.
Unlike mathematical sets, Python’s sets are mutable data objects so that their contents may change. They are not
defined, they are initialised. Elements may then be added, or removed, during the lifetime of a set data structure.
While elements may be added and removed, they must themselves be immutable data items, so we cannot
construct a set of sets.
The set function may be used for creating a new set from any iterable data structure.
For example, set( [ 1, 8, 1, 3, 5, 9] ) returns the set {2, 3, 5, 8, 9}.
Similarly, set(“Mississippi”) returns {“M”, “i”, “p”, “s”}.

So, the duplicates are removed. Also, set() returns the empty set.
We may initialise sets in extension form, by conversion or by set comprehension.
To initialise a set in extension form, we list the elements, separated by commas, enclosed in curly backets. For
example, S = {1, 3, 5, 7}.
We can then add new elements to the set using the add method, so we can create the same set as:
S = set()
S.add(1)
S.add(3)
S.add(5)
S.add(7)
We may also create the same set as:
S = set()
for x in range(8):
if x % 2 == 1: # if the remainder after dividing by 2 is one -> odd number
S.add(x)
We can create it from a list, such as
S = set([1,3,5,7])
And we may use set comprehension, as follows:
S = {x for x in range(8) if x % 2 == 1}
Use of set comprehension, like list comprehension, is a powerful and concise way to create new sets.
Alternative examples of set comprehension are:
U = (c for c in "Abracadabra" if c in "aeiou"}
the set of lowercase vowels in the string "Abracadabra", or even
T = {t*t for t in numberList if t > 0}
the set of the squares of all positive numbers drawn from the list numberList.
Python's set comprehension expression is very useful as a way of creating a set that is made up of members that
share a particular property. All members of the set must be drawn from the same data structure, the data
structure must be iterable (it must be possible to use a for loop to run through its members one by one). Also, any
condition may be used to select items from the source data structure, the chosen items may become members of
the set, or may be used to generate members of the set and finally each set comprehension expression is
equivalent to a for loop.
As we have seen, we have various options to create a set like what we saw with the set of odd integers less than
8. We've already seen the for loop that's equivalent to
S = {x for x in range(8) if x % 2 == 1}
The other two are as follows:
U = (c for c in "Abracadabra" if c in "aeiou"}
May be rewritten as:
U = set()

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

for c in "Abracadabra":
if c in "aeiou":
U.add(c)
And
T = {t*t for t in numberList if t > 0}
May be rewritten as:
T = set()
for t in numberList:
if t > 0 :
T.add(t)
Python's set comprehension expression is part of a family of comprehension expressions that may be used for
creating and initialising data structures. We can create objects using comprehension for lists and dictionaries as
well as sets.
For example, we can create a list to contain odd integers with the line:
L = [x for x in range(100) if x % 2 == 1]
Which is equivalent to:
L = [ ]
for x in range (100) :
if x % 2 == 0 :
L.append(x)
Sets also have the following properties:
x = len(s) The cardinality of s
x in s True when x is a member of s
x not in s True when x is not a member of s
s == t True when s and t have the same members
s <= t True when s is a subset of t
s < t True when s is a proper subset of t
s.isdisjoint(t) True when s and t have no members in common
Python supports a full complement of operations for creating new sets from old ones:
x = s.union(t) x = s | t
x = s.intersection(t) x = s & t
x = s.difference(t) x = s - t
x = s.symmetric_difference(t) x = s ^ t
x = s.copy()
and methods to change the contents of the set:

s.add(e) # inserts e into s (does not duplicate)
s.remove(e) # removes e from s (run-time error if e wasn't in s)
s.discard(e) # removes e from s (no error if e wasn't in s)
s.pop() # returns a random element from s and removes it
s.clear() # removes all elements from s
s |= t # updates s so that it also contains the elements in t
s &= t # updates s, keeping only elements found in both s and t
s -= t # updates s, removing elements found in t
s ^= t # updates s, removing elements found in both s and t
Python's second set type is called the frozenset. The big difference between set and frozenset data
structures is that a frozenset is immutable, so once its membership has been defined it cannot be changed.
This may not seem very useful, but - as we have established – a set may only have immutable members, so if we
want to model sets of sets we must use Python's frozenset structure.
To create a frozenset, we first need to create a regular set and then apply the frozenset function to is as:
frozenset(S)
All of the operations for the properties and the new set from old set operations work on frozensets. But any
method that changes the contents within a set will not work and if we try to use it, Python will throw a runtime
error. This is not a major problem though and just means we need to change the way we program slightly.
Later on in this course, we will use sets as well as tuples to help us construct some graph algorithms.
Please look at the questions in Python Sets (w3schools.com) and Python Set Methods (w3schools.com).
I would also like you to use this opportunity to explore the page Python Random Module (w3schools.com) and
how we use this module when we start looking at problems later on in the course.
Dictionaries
A python dictionary is a look up table. Each element within a dictionary is a pair made from its key (which must be
unique within that dictionary and immutable) and a data item (a value). If we supply a key, we can obtain the data
item.
Item = dictionary [key]
We can add new entries to a dictionary, modify entries and remove entries.
To create an empty dictionary, we may use the command
d1 = dict()
or
d2 = {}
Note that d2 is not an empty set, it is a dictionary. Any iterable data structure that contains a key – value pair may
be converted to a dictionary by the dict() function.
For example, the structures A, B, C, D below are different but the dictionaries created from them are the same:

A = [(1,5), (5,3), (7,8)]
B = [[1,5], [5,3], [7,8]]
C = ((1,5), (5,3), (7,8))
D = ([1,5], [5,3], [7,8])
A = dict(A)
B = dict(B)
C = dict(C)
D = dict(D)
Each dictionary entry may be written in full as a key : data_item pair, such as:
d3 = { “first”, [“this”, “is”, “an”, ”element] }
We may also write a whole dictionary out in full by separating the entries with commas. To make it easier to read,
it should be written on different lines such as:
d4 = {“first”, [“this”, “is”, “an”, ”element],
“second”, [“this”, “is”, “another”, ”element] }
We can then add a new element to the dictionary by using an assignment statement
d[k] = item
for example,
d4[“third”] = [“this”, “is”, “yet”, “another”, ”element]
This adds a new entry with the keyword “third” and the same technique may be used to modify the data item at
which a key points too, such as:
d4[“first”] = “I will use a different data type now to replace this element”
As previously mentioned, we may also use dictionary comprehension to build items within the dictionary.
D = {key : key*key for key in range(10)}
Once again, we can replicate this with a loop:
D = dict ()
for key in range (10):
D [key] = key * key
If we have a key, k, the item associated with that key may be obtained by performing a key-lookup.
item = D[3]
this looks like indexing, but it is different. List indices are always natural numbers whereas dictionaries can have a
key of any immutable data type. Lists are ordered, so items with a higher index number comes after those with
lower indices. Dictionaries are unordered (like sets), so there is no sense in which one item comes after another,
even if we use natural numbers as keys. As indexing is different, we cannot slice like we could do with lists.
Dictionaries are the Python implementation of associative arrays. An associative array consists of a set of (key,
value) pairs, such that each possible key appears at most once in the collection. Each key is associated with
(mapped to) a single data item. The data items of a dictionary can be any Python data type. So, dictionaries are
unordered key-value-pairs.

Secure Best Marks with AI Grader

Need help grading? Try our AI Grader for instant feedback on your assignments.

Dictionaries are implemented using hash tables. A hash table is a means of organizing data that provides fast (and
uniform time) lookup. A hash function is used to convert each key into a number. This number is used as the
address of the associated data item. Each different key hashes to a different address. The address is used to find
the data item associated with the given key. Dictionary keys must be hashable. No part of a key can be allowed to
be mutable, because if the key was changed it would hash to a different address. This would make it impossible to
find the data item that was associated with the original the key
x = len(d) The number of entries in d
x in dd True when x is a key in d
x not in d True when x is not a key in d
d1 == d2 True when d1 and d2 contain the same entries
x = list(d) A list of all the keys in d
d.clear() Removes all the elements from the dictionary, d
x = d.copy() Returns a copy of the dictionary, d
x = d.get(k) Returns the value associated with the key, k
x = d.pop(k) Returns the value associated with the key, k, and removes it from the dictionary
x = d.popitem() Removes the last inserted key-value pair from d, and returns it as a pair (a 2-tuple)
d.update(p) Updates the dictionary with the key-value pairs provided in data structure p
We can use a for loop to iterate over a dictionary. When we do so the loop variable is given the value of each key
in turn. If we want the data items, we must perform key-lookup.
for key in dictionary:
item = dictionary [key]
print(item)
From Python 3.7 onwards dictionaries are ordered data structures, so the keys will come out in the same order
each time. However, the order that is preserved is not a based on the keys themselves: instead, it is the insertion
order. Whenever we need a ‘look-up’ table. We use a dictionary in preference to a list when we wish to record
the association between a data value (the key) and an object to which it refers. Lists are ordered, and we must
refer to objects by their position in the ordering. That position must be an integer, and the first item in the list
always has the index 0. Dictionaries are ordered, but the ordering is not based on their contents. We must refer
to an object using its key value. This may be an integer, but it need not be (it can be any immutable data item)
and there is no restriction on which values can be used as keys, if no key is repeated
Please look at the questions in Python Dictionaries (w3schools.com) and Python Dictionary Methods
(w3schools.com).
String Manipulation
This section is based on the webpage: Python String Manipulation Handbook – Learn How to Manipulate Python
Strings for Beginners (freecodecamp.org). Please also read Chapter 8 on Strings from the textbook “Think
Python”.

You will have looked at strings before (indeed we have used them in this module already). However, we will
explore more details about strings now. Strings in Python are immutable, that is we cannot change then. We can,
however, overwrite variables with a new string which allows us greater flexibility.
A string, fundamentally, is an array of characters and the length of this array is the length of the string. We can set
up a string variable as:
myTown = “Hatfield”
print(type(myTown))
We may use single or double quotes for strings in Python (other languages use single quotes for characters and
double for strings).
An empty string is denoted as emptyString = “”
We can use some operations on strings to help us create larger strings. One such operation is concatenation. That
is the process of combining two (or more) strings together.
word1 = “St”
word2 = “Albans”
print(word1 + word2)
This combines the two strings to form “StAlbans”. To add a space, we will have to modify this to
print(word1 + “ “ + word2)
which will result in “St Albans”.
As previously mentioned, strings act a bit like arrays. In python we can therefore use some of the same operations
we use to select elements from a list. For example,
myTown[0] has the element “H”
myTown[3] has the element “f”
Furthermore, we can find the length of a string with the len command such as:
print(len(myTown))
which should provide the answer of 8.
We can replace parts of a string with the replace method.
“St Albans”.replace(“Albans”, “Stephens”)
Which will result in “St Stephens”.
Another useful method is the count method. For example,
word = “Welwyn Garden City”
print(word.count(“w”))
will produce the result of 1, as there is only one “w” in the word as we cannot ignore whether it is upper or lower
case.
To repeat a string, we may use the multiplication symbol such as
words = “Paris”*3
print(words)

will produce “ParisParisParis”.
Please explore in Python the keywords upper, lower, strip, title, swapcase, isalnum, etc from the document
above. You may also use the Python help too
help(words.title)
to provide more information about unknown functions (all inbuilt commands will have information, if there are
written by a third party it requires the writer to have written the help!)
Please look at the questions and tutorials in Python Strings (w3schools.com) and Python String Methods
(w3schools.com) as well as Python Dates (w3schools.com).
File Manipulation
Based on: Working With Files in Python – Real Python (bingj.com). Please also read the Chapter 14 on Files from
the book “Think Python”.
Frequently when writing some code you might wish to write data to a file or retrieve data from a file. It is
common to write data into either a binary or text file. We will be focussing on text files for now. When you write
your Python source code, this is stored in a normal text file. Sometimes, a text file might use special formatting to
make it easier to read by proving structure. Too common types are CSV (comma separated values) and XML
(eXtensible Markup Language). Source code for most languages also use text files.
To open a file, we may use the with command which also helps with error handling.
For example, we may use
with open('data.txt', 'r') as f:
data = f.read()
to read data from a file and
with open('data.txt', 'w') as f:
data = 'some data to be written to the file'
f.write(data)
to write data to a file. The open command returns a file handler which is the variable f. This provides methods
that can be used to read and write data to a file.
We can use Python to do some manipulation and finding of directories and files as well. To create a file
“example_directory”, we may use the following code:
import os
os.mkdir('example_directory/')
and list files contained within a directory as:
>>> import os
>>> entries = os.listdir('my_directory/')

Paraphrase This Document

Need a fresh take? Get an instant paraphrase of this document with our AI Paraphraser

Please look at the questions in Python File Open (w3schools.com), Python File Open (w3schools.com), Python File
Write (w3schools.com), Python Delete File (w3schools.com) and finally Python File Methods (w3schools.com).
Mathematics Questions
Q1) Describe what the arithmetic mean, geometric mean, mode and median are
Q2) Describe the difference between an average and a measure of dispersion
Q3) for the following, calculate the arithmetic mean, geometric mean, mode and median. If the data has no
mode, say that it has no mode.
a) 1, 3, 9, 10, 11, 12, 12, 15
b) 4, 12, 15, 16
Q4) What is the best measure of average to measure:
a) Heights of people
b) Size of a computer screen
c) The wages of programmers
Q5) Calculate the arithmetic mean and standard deviation of the following dataset
𝑛 = 10,∑𝑥 = 29, ∑𝑥2 = 95.03.
Q6) Describe the main differences between parametric and non-parametric data.
Q7) What are the main differences between continuous and discrete data?
Q8) What type of correlation is described when the Pearson Product Moment Coefficient is:
a) 𝑟 = 0.03
b) 𝑟 = −0.79
c) 𝑟 = 0.65
d) 𝑟 = 0.4
e) 𝑟 = −0.3
Q9) Name some aspects of the following graph that are misleading:

Q10) Research into sampling methods (such as census, simple random sampling, systematic sampling, stratified
sampling, quota sampling, opportunity sampling, cluster sampling and self-selection sampling) and write down
where they might be used and their advantages and disadvantages.
Programming Questions
1. Write a function evens that takes a pair of whole numbers, a, and b and returns a new set containing all
of the even numbers between a and b inclusive.
2. Write a function intersect that takes a list of sets, setlist, as its parameter and returns a new set
that contains just the elements that are in the intersection of all the sets in setlist
3. Write a function singletons that takes a set, S, as its parameter and returns a set of frozensets, FS, as
its result, such that each member of FS contains a single element of S and each element of S is
represented in FS. For example, if S is {1,2,3,4} FS will be { {1}, {2}, {3}, {4} }
4. Write a function divisors that takes a positive whole number as its argument and returns a set
containing all the divisors of that number. (A divisor is a number that divides evenly into another number.
For example, 4 is a divisor of 28 because 28 ÷ 4 has a remainder of 0).
5. Write a function letters that takes a character string, s, as its single parameter and returns a set
containing all of the letters that are found in s, but no other characters.
6. Write a function fewest that takes as its parameter a set of character strings called strset and returns
the character string drawn from the set that contains the fewest unique characters
7. Write a function cartesian that takes two sets, S and T as its parameters and returns the Cartesian
product of the two sets, SxT. This will be a set of ordered pairs (2-tuples). Reminder: The Cartesian
product of two sets SxT is the set of pairs combining each element of S with each element of T
8. Write a function makedict that takes two lists, a and b, as its parameters, and returns a dictionary in
which each element of list a is the key to the corresponding element of list b. You may assume that list a
contains only immutable values, and the two lists are of equal length
For example, after the statement sequence
x = ['fred','joe','bill']
y = [21,5,93]
z = makedict(x,y)
z is the dictionary { 'fred' : 21, 'joe' : 5, 'bill' : 93 }
9. Write a function override that takes two dictionaries, a and b, as its parameters and return a single
new dictionary, d, which combines a and b so that all entries from b are included, along with all those
entries from a whose keys are not in b.
10. Write a function subtract that takes two dictionaries, a and b, as its parameters and return a single
new dictionary, d, which combines a and b so that the only entries included are those with keys in a but
not in b.

11. Write a function initialize that takes a list of keys, ks, and a data value, v, as its two parameters
and returns a new dictionary in which each of the keys from ks is associated with the value, v.
12. Write a function remove that takes a dictionary, d, and a list of keys, ks, as its two parameters and
returns a new dictionary containing all entries from d except those with keys in the list ks
13. Write a function isin that takes a dictionary, d, and a data value, v, as its two parameters and returns
True if the data value is in the dictionary and False if it is not
14. Write a function changekey that takes a dictionary, d, and two keys, k1 and k2, as its three
parameters and returns a new dictionary in which the entry with key k1 is replaced by an entry with the
same data value but with the key k2. You may assume that k1 is in d and k2 is not.

1 out of 34

Your All-in-One AI-Powered Toolkit for Academic Success.

+13062052269

info@desklib.com

Available 24*7 on WhatsApp / Email

Company

Tools

Support