
Markov Processes for Text Generation and Analysis


Added on 2019-09-30

About This Document

This assignment explores the use of Markov processes for text generation and analysis. Students will complete programming tasks to create efficient Markov models using maps and word-based models. They will also develop JUnit tests to ensure the correctness of their code. The assignment includes benchmark tests and an analysis write-up.

Project 1: Markov

Contents:
- Overview
- Programming Tasks To Complete
- Analysis to Complete
- Grading
- Starting, Running, Using Git
- Running the program
- Developing and Testing EfficientMarkov (Example; Testing Your New Model, EfficientMarkov; Debugging Your Code in EfficientMarkov)
- Developing and Testing WordGram (WordGram Constructors and Methods)
- Developing and Testing EfficientWordMarkov
- Developing JUnit Tests (JUnit Tests)

Overview

Markov processes are widely used in Computer Science and in analyzing many different forms of data. This assignment offers an occasionally amusing look at text (it's more fun than counting words) by using a Markov process to generate random text based on a training text. When run in reverse (we won't do that in this assignment), it's possible to identify the source of an unknown text based on the frequency of letters and words. This process can be used to identify spam or to ascertain whether Bacon wrote Romeo and Juliet.

If you're on Facebook, you can use the what-would-i-say app (Facebook or Android), described at http://what-would-i-say.com/about.html as follows: "Technically speaking, it trains a Markov Bot based on mixture model of bigram and unigram probabilities derived from your past post history."

You can also read about the so-called "Infinite Monkey Theorem" via its Wikipedia entry. This assignment has its roots in several places: a story named Inflexible Logic, now found on pages 91-98 of Fantasia Mathematica (Google Books) and reprinted from a 1940 New Yorker story by Russell Maloney.
The true mathematical roots are in a 1948 monograph by Claude Shannon, A Mathematical Theory of Communication, which discusses in detail the mathematics and intuition behind this assignment. This assignment also has its roots in a Nifty Assignment designed by Joe Zachary from the University of Utah, assignments from Princeton designed by Kevin Wayne and others, and work done at Duke starting with Owen Astrachan and continuing with Jeff Forbes, Salman Azhar, and the UTAs from Compsci 201.

Programming Tasks To Complete

There are four programming components of this assignment:

1. Create a faster implementation than the brute-force method you start with: a class EfficientMarkov based on using maps rather than rescanning the training text. You can test this class with JUnit tests in MarkovTest by modifying a method in that class. You can run this class using the provided MarkovDriver by changing one line.
2. Create a class WordGram that will be used in a word-markov model rather than the character-markov model you start with. You can use and run this with the existing WordMarkovDriver, and you can test it with WordGramTester, which has some JUnit tests. The driver and testing programs are set to use a BruteWordMarkov class you're not given the source code for (it is in markov.jar as a .class file, part of the project).
3. Create a class EfficientWordMarkov, modeled on EfficientMarkov and with the same structure in terms of methods, but using WordGram objects rather than Strings. The class WordMarkovDriver from the previous step will be used to run this class.
4. Create tests for the code you write to help ensure that the classes WordGram and EfficientWordMarkov are correct. You're given some test cases in the classes provided; you'll write a few more.

Analysis to Complete

You'll also need to run benchmark programs that stress your classes and then analyze the results of the benchmark tests as part of an assignment write-up you turn in. You will turn in the assignment analysis write-up as a PDF.
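To make the map-based idea in task 1 concrete, here is a minimal sketch of building a map from each k-gram in the training text to the characters that follow it, so generation can look up followers directly instead of rescanning the text. The class and method names (MarkovMapSketch, buildMap) are illustrative, not the assignment's required API:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the map-based approach behind EfficientMarkov: scan the
 * training text once, recording for every k-gram the list of strings
 * (single characters) that follow it.
 */
public class MarkovMapSketch {

    /** Map each k-gram in text to the characters that follow it. */
    public static Map<String, List<String>> buildMap(String text, int k) {
        Map<String, List<String>> follows = new HashMap<>();
        for (int i = 0; i + k <= text.length(); i++) {
            String gram = text.substring(i, i + k);
            // The character after this k-gram, or "" at the end of the text.
            String next = (i + k < text.length())
                    ? text.substring(i + k, i + k + 1) : "";
            follows.computeIfAbsent(gram, g -> new ArrayList<>()).add(next);
        }
        return follows;
    }

    public static void main(String[] args) {
        Map<String, List<String>> m = buildMap("the theatre", 2);
        // "th" occurs twice in the text, followed by "e" both times.
        System.out.println(m.get("th")); // prints [e, e]
    }
}
```

The one-time scan costs time proportional to the text length; each later lookup is a constant-time map access rather than a fresh scan of the whole training text.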
The analysis explanation is here. You may be given a challenge coding assignment as well, one that can help you understand Java interfaces. This challenge will not be required and will not add points to your grade on the Markov assignment, though we will track that you've completed it.

Grading

- 10% bonus for using Git early.
- 25% EfficientMarkov correctness
- 20% WordMarkov correctness
- 20% EfficientWordMarkov correctness
- 20% Analysis write-up
- 5% Tests developed for your code
- 10% on-time submission (late submissions can lose this)

Starting, Running, Using Git

You should start the assignment by forking the Git repo at https://git.cs.duke.edu/201fall16/markov-start-fall16, which has as its URI: git@git.cs.duke.edu:201fall16/markov-start-fall16.git

You should fork it into your own space on git.cs.duke.edu, then clone it so you have a local version on the computer you're using. Please make sure your Git repo is private! See the class instructions on using Git.

You won't be able to easily copy/paste code here because there's a configuration .project file that specifies the library markov.jar and the JUnit testing library. See the repository for the src/ and data/ folders and the .project, .classpath, and markov.jar files.

For this assignment, a bonus of 10% of the grade will be based on downloading/cloning the code and running it according to the directions below. You'll be asked to make a small change to the code, run the program, and commit/push the change to git.cs.duke.edu. You'll need to do this before Thursday, September 29 to earn the 10%: change the seed in the BruteMarkov class from 1234 to 5678, run the MarkovDriver program, and commit the changed file. Then change the seed back to RANDOM_SEED. We'll be able to see the commit.

What is a Markov Model? Background

An order-k Markov model uses strings of k letters, called k-grams, to predict text. It's also possible to use k-grams comprised of words rather than letters. An order-2 Markov model uses two-character strings, or bigrams, to calculate probabilities in generating random letters.
A string called the training text is used to calculate probabilities. For example, suppose that in the training text, using an order-2 Markov model, the bigram "th" is followed 50 times by the letter "e", 20 times by the letter "a", and 30 times by the letter "o", because the sequences "the", "tha", and "tho" occur 50, 20, and 30 times, respectively, while there are no other occurrences of "th" in the text we're modeling.

Now suppose that in generating random text using an order-2 Markov process we generate the bigram "th". Based on this bigram we must generate the next random character using the
order-2 model. The next letter will be an 'e' with probability 0.5 (50/100), an 'a' with probability 0.2 (20/100), and an 'o' with probability 0.3 (30/100). If 'e' is chosen, then the next bigram used to calculate random letters will be "he", since the last part of the old bigram is combined with the new letter to form the next bigram used in the Markov process.

Running the program

Running MarkovDriver with training files trump-convention.txt and clinton-convention.txt should generate the results below when RANDOM_SEED is used to create the myRandom object used in determining random numbers. These runs are for the character-based Markov process; runs of word-markov follow. You can use this output as part of verifying that your EfficientMarkov and EfficientWordMarkov programs work correctly (see below).

Each row below shows the order k, followed by the Trump output and then the Clinton output:

k=1:
on Sanws t, fugeve pleronoy ate arllay pplis Sad wicine da o wilelereristurl te de oun o pr os! a Syofived. STPrthilind y ce buathicorge g ves thire: omy our deare s hand strs. inendit Mat f d te lithun. mel Dedive Anch an d. nghrkind, Hemis cin t tithele resmarere w, t oss ominu An't ito al all. Lign Trery whosse tovel cous Fongot. for fit aler-ers s aca s amanane orule s thitis: ld e. a t't,

k=2:
nce day, day knocce my who our bact arecties now the wer worce priefich of overichavely I wan chareliedumightlion't we widectuare-Hill ney to peop thencom ougioneout forke siteemes. We was and is a – is dre but as ca now thillif le haven te but to mage we's resid an She to all bitarentrom torkee my is th quild a ding the saing? I why fuldn't of the we por ch cough my. Whis witionvelowe ever. H

k=3:
tuary Depare going suppone. Anot of 2017, secutory Cling the famill states on stop the probled to that we nothing residence. In the waiticial immigrades age intonightly, mysely graska. Her border betta is then you good joy, dest shard decial thand a slips wear ther, my have of threach. Buildrealth care of us.
If you will, wome that were times in the possibly foodbye effor plaimed be service. W

k=4:
g endorsement of our very way. On Monday, we end individe their police of my preside my opponents with, and steel work very trade deeply than is fairs at a moment is into other bad trade agreement dimo ceiling from her a much! That our talking people. I believe the road from the Cuban a sing people what our bells for though listed allowed to be. And I do not Wisconstitution – we doesn't first.

k=5:
nt images of the people. Anyone whole nation, then we rely on the Republican youth Korea. She was under something in poverty and their politician who the right to prosecutors to some of the plain fact's destiny is a nation so much seems that we start by some wanted to lend our fairer, and for steelworkers and I'm also imagine people for the rough on how we also — we will not only creating of Laure
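The selection step described in the background section (followers of "th": 50 'e', 20 'a', 30 'o') can be sketched as below. Picking uniformly at random from the full list of followers automatically weights each character by its frequency in the training text, so no explicit probabilities need to be stored. The class name NextCharSketch is illustrative, not part of the assignment's API:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Sketch of the next-character step of an order-2 Markov process.
 * Each occurrence of a follower appears once in the list, so a
 * uniform random choice reproduces the training-text frequencies.
 */
public class NextCharSketch {

    /** Pick one follower uniformly at random. */
    public static char nextChar(List<Character> followers, Random rng) {
        return followers.get(rng.nextInt(followers.size()));
    }

    public static void main(String[] args) {
        // Followers of "th" from the example: 50 'e', 20 'a', 30 'o',
        // so P('e') = 0.5, P('a') = 0.2, P('o') = 0.3.
        List<Character> followers = new ArrayList<>();
        for (int i = 0; i < 50; i++) followers.add('e');
        for (int i = 0; i < 20; i++) followers.add('a');
        for (int i = 0; i < 30; i++) followers.add('o');

        Random rng = new Random(1234); // fixed seed, as RANDOM_SEED does in the driver
        int eCount = 0;
        int trials = 10000;
        for (int i = 0; i < trials; i++) {
            if (nextChar(followers, rng) == 'e') eCount++;
        }
        // Fraction of 'e' draws; should be close to 0.5 over many trials.
        System.out.println((double) eCount / trials);
    }
}
```

Using a fixed seed, as the driver does with RANDOM_SEED, makes runs reproducible, which is what lets you compare your EfficientMarkov output against the sample output above.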
