Python Project: Natural Language Processing of Political Speeches

Background
The American presidential election is right around the corner, and both presidential
candidates have been busy giving speeches. It is intuitively clear that Trump and
Clinton have quite different styles of speaking. In this exercise, we will see if we can
quantify such differences by analyzing the speeches using the programming tools
that we have learned through the course.
The field of research concerned with analyzing natural language is called Natural Language
Processing (NLP). The techniques that we will use in this exercise are quite basic,
and we will omit some of the details that one would normally include in an NLP analysis. However,
despite this simplicity, we will see that we can get some interesting results.
Our analyses will be based on the following two data files, containing a selection of
speeches given by the candidates since their nomination some months ago at the
national conventions for each of the parties. The files can be found here:
http://people.binf.ku.dk/wb/data/clinton_speeches.txt
http://people.binf.ku.dk/wb/data/trump_speeches.txt
When opening the files, you will see that each speech consists of a title line starting
with "#", followed by a number of speech lines:
# Greensboro, North Carolina on Oct. 14, 2016:
Thank you. Wow. Thank you, everybody. Nice place. Plenty of room.
In 25 days, we're going win the state of North Carolina, which I love, and we're going to
win the White House.
...
Formal requirements:
Format: You should hand in:
1. A PDF file called exercise.pdf containing the output from the different
exercises below
2. A Python file called exercise.py, containing the function definitions.
3. A Python file called exercise_test.py, containing test code for the individual
questions.
These should be handed in as separate files (not zipped).
Content: For the Python part, remember to comment your code, use meaningful variable
names, and include docstrings for each function. Also, please limit yourself to the curriculum
covered in the course, that is, refrain from using list and dict comprehension, map, zip, reduce,
filter and lambda for the Python part, and tools like awk for the Unix part. Also, please use only
those external modules that are explicitly mentioned in the exercise.
Question 1: Reading in the data
The rest of the exercise will be done in Python.
In the exercise.py file, write a function called read_speeches that takes a speech filename as
argument, and returns a dictionary where the keys are the titles of the speeches in the file (as
strings), and the values are the corresponding speeches (as strings). You should remove the "#"
character from the titles. Replace \n characters in the speech lines with a single space. Also, all
speech lines starting with "[" should be omitted. Remember to close the files after reading from them.
In the exercise_test.py file, call read_speeches on both the clinton_speeches.txt and the
trump_speeches.txt files, and save the results in two variables called clinton_speeches_dict and
trump_speeches_dict, respectively. Print out the size (number of keys) of each of these two
dictionaries.
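As an illustration, a minimal sketch of one possible read_speeches implementation is shown below. This is only a sketch; details such as stripping whitespace from the titles are assumptions, not requirements of the exercise:

def read_speeches(filename):
    """Return a dictionary mapping speech titles to speech texts for the given file."""
    speeches = {}
    speech_file = open(filename)
    title = None
    for line in speech_file:
        if line.startswith("#"):
            # Title line: remove the "#" (and, as an assumption, surrounding whitespace)
            title = line.replace("#", "").strip()
            speeches[title] = ""
        elif line.startswith("["):
            # Skip speech lines starting with "[" as required
            continue
        else:
            # Replace the newline character with a single space and append
            speeches[title] += line.replace("\n", " ")
    speech_file.close()
    return speeches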
If you have problems completing this exercise, and therefore do not have the requested
dictionaries, please follow the following instructions in order to be able to complete the remainder
of the exercises:
Download the following files:
trump_speeches.json
clinton_speeches.json
Insert the following lines of code in your program:
import json
json_file_trump = open("trump_speeches.json")
json_file_clinton = open("clinton_speeches.json")
trump_speeches_dict = json.load(json_file_trump)
clinton_speeches_dict = json.load(json_file_clinton)
json_file_trump.close()
json_file_clinton.close()
Note that these dictionaries are not exactly identical to the ones you get from solving the exercise
yourself (so you cannot use them for verification purposes).
2. In the exercise.py file, write a function called merge_speeches that takes a list of speeches
excluding titles (i.e. a list of strings) as argument, and returns a single string containing all speeches.
As a separator between the speech strings, use a single space.
In the exercise_test.py file, call merge_speeches to merge all of Clinton's speeches and all of
Trump's speeches, and save the results in variables called clinton_speeches_all and
trump_speeches_all, respectively (Hint: there is an easy way to get all of the values in a dictionary
as a list). Print the length of the two strings to screen.
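For illustration, one possible sketch of merge_speeches, written with a plain loop (how the separator is handled at the ends is an implementation choice):

def merge_speeches(speeches):
    """Merge a list of speech strings into one string, separated by single spaces."""
    merged = ""
    for speech in speeches:
        if merged == "":
            merged = speech
        else:
            # Add a single space between consecutive speeches
            merged = merged + " " + speech
    return merged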
If you have problems completing this exercise, and therefore do not have the requested strings
containing all speeches, please follow the following instructions in order to be able to complete
the remainder of the exercises:
Download the following files:
trump_speeches_all.txt
clinton_speeches_all.txt
Insert the following lines of code in your program:
trump_file = open("trump_speeches_all.txt")
clinton_file = open("clinton_speeches_all.txt")
trump_speeches_all = trump_file.read()
clinton_speeches_all = clinton_file.read()
trump_file.close()
clinton_file.close()
Note that these strings are not exactly identical to the ones you get from solving the exercise
yourself (so you cannot use them for verification purposes).
Question 2: Counting
In the exercise.py file, write a function called count_words that takes a text (string) as input, and returns
the number of words in this text. You can assume that words are defined as anything that is
separated by whitespace.
In the exercise_test.py file, test the count_words function on the trump_speeches_all variable,
and print the result.
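With words defined as anything separated by whitespace, a sketch of count_words can be very short:

def count_words(text):
    """Return the number of whitespace-separated words in text."""
    return len(text.split())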
If you have problems completing this exercise, and therefore do not have the requested function,
please follow the following instructions in order to be able to complete the remainder of the
exercises:
Insert the following lines of code in your program:
def count_words(text):
    """Dummy function that always returns 50000"""
    return 50000
You can use this function as a replacement for the correct function in the questions below.
2. In the exercise.py file, write a function called count_sentences that takes a text (string) as input,
and returns the number of sentences in this text. You can assume that sentences are defined as
being separated by any of the following characters: ".", "!" or "?". Hint: an easy way to solve this is
using a regular expression, combined with re.split.
In the exercise_test.py file, test the count_sentences function on the trump_speeches_all variable,
and print the result.
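A possible sketch using re.split, as suggested by the hint. Whether empty fragments (e.g. the piece after a final ".") should be counted is left open in the exercise; skipping them here is an assumption:

import re

def count_sentences(text):
    """Return the number of sentences in text, using ".", "!" and "?" as separators."""
    fragments = re.split(r"[.!?]", text)
    count = 0
    for fragment in fragments:
        # Skip empty fragments (assumption), e.g. the piece after a trailing "."
        if fragment.strip() != "":
            count += 1
    return count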
If you have problems completing this exercise, and therefore do not have the requested function,
please follow the following instructions in order to be able to complete the remainder of the
exercises:
Insert the following lines of code in your program:
def count_sentences(text):
    """Dummy function that always returns 5000"""
    return 5000
You can use this function as a replacement for the correct function in the questions below.
Question 3: The Flesch reading-ease score
The Flesch reading-ease score is a measure of how easy a text is to read. There is a direct
translation from this score to the primary-school grade in which children are generally able to read
a text of a given difficulty. For more details, see the Wikipedia page. In this question, we will use this score to evaluate the complexity of the language used by the
two presidential candidates.
The definition of the Flesch reading-ease is: 206.835 - 1.015 *
number_of_words/number_of_sentences - 84.6 * number_of_syllables/number_of_words.
Calculating the number of syllables in a text is not trivial. There is a discussion of this topic here.
The top post suggests counting the number of matches of a regular expression that describes a
syllable. We will use the same idea here, but have changed the regular expression slightly, in order
to make it work on an entire text rather than just a single word: "[aiouy]+e*|e(?!d\b|ly)[aiouye]?|[td]ed|le\b".
In the exercise.py file, write a function called count_syllables that takes a text (string) as argument
and returns the number of syllables in the text. The function should use the regular expression
defined above. You should convert the text to all lower-case letters in order for the regular
expression to work properly. Hint: the findall method of a regular expression object returns all
matches as a list.
In the exercise_test.py file, test your function on the string "I eat apples", and print the result.
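A sketch of count_syllables based on the regular expression given above and the findall hint:

import re

def count_syllables(text):
    """Return an approximate number of syllables in text."""
    syllable_regex = re.compile(r"[aiouy]+e*|e(?!d\b|ly)[aiouye]?|[td]ed|le\b")
    # Convert to lower case so the regular expression works properly
    matches = syllable_regex.findall(text.lower())
    return len(matches)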
If you have problems completing this exercise, and therefore do not have the requested function,
please follow the following instructions in order to be able to complete the remainder of the
exercises:
Insert the following lines of code in your program:
def count_syllables(text):
    """Dummy function that always returns 100000"""
    return 100000
You can use this function as a replacement for the correct function in the questions below.
2. In the exercise.py file, write a function called calculate_flesch_score that takes a text (string) as
input, and returns the Flesch score (a number), calculated using the definition above.
In the exercise_test.py file, write a loop that iterates over the trump_speeches_dict dictionary,
calculates the Flesch score for each speech using the calculate_flesch_score function, and saves
the result in a list called trump_scores. Do the same with the Clinton speeches (in
clinton_speeches_dict) and save that result to a variable called clinton_scores. Print both lists to
screen.
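A direct translation of the definition from the start of this question into code (assuming the counting functions from Question 2 and the count_syllables function above are available) might look like this:

def calculate_flesch_score(text):
    """Return the Flesch reading-ease score of text."""
    number_of_words = count_words(text)
    number_of_sentences = count_sentences(text)
    number_of_syllables = count_syllables(text)
    return (206.835
            - 1.015 * number_of_words / number_of_sentences
            - 84.6 * number_of_syllables / number_of_words)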
If you have problems completing this exercise, and therefore do not have the requested lists of
scores, please follow the following instructions in order to be able to complete the remainder of
the exercises:
Download the following files:
trump_scores.json
clinton_scores.json
Insert the following lines of code in your program:
import json
json_file_trump = open("trump_scores.json")
json_file_clinton = open("clinton_scores.json")
trump_scores = json.load(json_file_trump)
clinton_scores = json.load(json_file_clinton)
json_file_trump.close()
json_file_clinton.close()
Note that these lists are not exactly identical to the ones you get from solving the exercise yourself
(so you cannot use them for verification purposes).
3. To plot the distribution of scores for each candidate, we can use a box or violin plot. Here is a
template for a function that does this:
def plot_scores(scores, output_filename):
    """
    Plots score distributions

    Parameters
    ----------
    scores : dict object
        The values of this dictionary are lists of scores to be plotted.
        The keys of the dictionary will be used as labels for the plot.
    output_filename : str
        The name of the file where the plot will be saved.
    """
    import matplotlib.pyplot as plt

    x_values = range(1, len(scores)+1)
    y_values = scores.values()

    # Create a box/violin plot
    # plt.  # <- Fill in this line

    # Add labels on x-axis
    plt.setp(plt.gca(), xticks=x_values, xticklabels=scores.keys())

    # Save figure
    plt.savefig(output_filename)
Copy this function to your exercise.py file, and fill in the missing line to create either a box or a
violin plot.
In the exercise_test.py file, call the plotting function on a dictionary which has the two
variables trump_scores and clinton_scores as values (you should find some appropriate keys).
Include the resulting plot in your exercise.pdf file.
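For example, the call in exercise_test.py could look roughly like this (the dictionary keys and the output filename below are just illustrative choices, not prescribed by the exercise):

scores_dict = {"Clinton": clinton_scores, "Trump": trump_scores}
plot_scores(scores_dict, "flesch_scores.png")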
Question 4: n-grams
Finally, we would like to investigate how often the candidates repeat the same phrases. An n-gram
is a subsequence of a sentence consisting of n words. For instance, the sentence "I love Python"
contains three 1-grams ("I", "love", "Python"), two 2-grams ("I love" and "love Python"),
and one 3-gram ("I love Python").
In the exercise.py file, write a function called calculate_ngram_frequencies that takes two
arguments: a string of text (containing one or more sentences) and an integer n. The function
should register all sequences of words of length n in the text. For the particular value of n, the
function should save all such occurrences in a dictionary, where the keys are the n-grams (strings
consisting of n words) and the values are the number of occurrences. This dictionary should be
returned from the function. The function should only search for n-grams within sentences. You can
use the following lines of code to turn the text (in the variable text) into a list of sentences:
import re
sentences = re.findall(r'[^\.\?!"]+', text)
You should iterate over these fragments, and detect n-grams within each of them.
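A sketch of one possible implementation, assuming words within a sentence are separated by whitespace and n-grams are joined back together with single spaces:

import re

def calculate_ngram_frequencies(text, n):
    """Return a dictionary mapping each n-gram in text to its number of occurrences."""
    ngram_counts = {}
    # Split the text into sentence fragments so n-grams never cross sentence boundaries
    sentences = re.findall(r'[^\.\?!"]+', text)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            ngram = " ".join(words[i:i + n])
            if ngram in ngram_counts:
                ngram_counts[ngram] += 1
            else:
                ngram_counts[ngram] = 1
    return ngram_counts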
In the exercise_test.py file, call the calculate_ngram_frequencies function with n=6 on the
clinton_speeches_all and trump_speeches_all variables. Save the results in two variables called
clinton_6gram_dict and trump_6gram_dict, respectively.
If you have problems completing this exercise, and therefore do not have the requested 6-gram
dictionaries, please follow the following instructions in order to be able to complete the remainder
of the exercises:
Download the following files:
trump_6grams.json
clinton_6grams.json
Insert the following lines of code in your program:
import json
json_file_trump = open("trump_6grams.json")
json_file_clinton = open("clinton_6grams.json")
trump_6gram_dict = json.load(json_file_trump)
clinton_6gram_dict = json.load(json_file_clinton)
json_file_trump.close()
json_file_clinton.close()
Note that these dictionaries are not exactly identical to the ones you get from solving the exercise yourself
(so you cannot use them for verification purposes).
2. Finally, we want to see which of the 6-grams is most frequent. If you convert the result of the items()
method on the returned dictionaries to a list, you get a list of (ngram, frequency) tuples. These lists can be
sorted by using the key option of the list sort method, combined with the itemgetter function from the
operator module (or a small helper function, as below). It is illustrated on a small example here:
my_dict = {"name_1": 10, "name_2": 7, "name_3": 5}  # simple dictionary
my_dict_items = list(my_dict.items())  # .items() gives the (key, value) pairs from the dictionary
print(my_dict_items)
# produces output like: [('name_1', 10), ('name_2', 7), ('name_3', 5)]

# define a function that returns the part of each item we want to sort by
def get_second(x):
    """returns the second item of x"""
    return x[1]

# pass the above function to the .sort() method with the keyword argument "key"
# so that we sort the items by comparing the values at their second positions
my_dict_items.sort(key=get_second)  # sort the list of tuples by their second elements
print(my_dict_items)  # list is now sorted
# produces output: [('name_3', 5), ('name_2', 7), ('name_1', 10)]

# Everything in one line using sorted instead of sort
print(sorted(my_dict.items(), key=get_second))
# produces output: [('name_3', 5), ('name_2', 7), ('name_1', 10)]
Use a similar technique in your exercise_test.py file, to sort the Clinton and Trump 6-grams, and
print out the 10 most frequent 6-grams for both Clinton and Trump.
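Applied to the 6-gram dictionaries, the test code might look roughly as follows (assuming get_second is defined as above and the two 6-gram dictionaries from the previous part exist); sorting in reverse order puts the most frequent 6-grams first:

trump_items = list(trump_6gram_dict.items())
trump_items.sort(key=get_second, reverse=True)
print(trump_items[:10])    # the 10 most frequent Trump 6-grams

clinton_items = list(clinton_6gram_dict.items())
clinton_items.sort(key=get_second, reverse=True)
print(clinton_items[:10])  # the 10 most frequent Clinton 6-grams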
Final remarks
Note that this project is the official exercise for the course. This means that late hand-ins will not
be accepted. Also please ensure that all your code is well documented (e.g. doc-strings in
functions and comments), and that you use meaningful variable names. Finally, remember not to
use features like list and dict comprehension, map, zip, reduce, filter and lambda. The exercise is
designed to test basic Python skills covered in the curriculum (including writing loops), and points
will therefore be deducted for using these techniques instead of writing the corresponding loops.