Project 1: Markov.

Contents
  Overview
  Programming Tasks To Complete
  Analysis to Complete
  Grading
  Starting, Running, Using Git
  Running the program
  Developing and Testing EfficientMarkov
  Example
  Testing Your New Model, EfficientMarkov
  Debugging Your Code in EfficientMarkov
  Developing and Testing WordGram
  WordGram Constructors and Methods
  Developing and Testing EfficientWordMarkov
  Developing JUnit Tests
  JUnit Tests

Overview

Markov processes are widely used in computer science and in analyzing many different forms of data. This assignment offers an occasionally amusing look at text (it's more fun than counting words) by using a Markov process to generate random text based on a training text. When run in reverse --- we won't do that in this assignment --- it's possible to identify the source of an unknown text based on the frequency of letters and words. This process can be used to identify SPAM or to ascertain whether Bacon wrote Romeo and Juliet.

If you're on Facebook, you can use the what-would-i-say FB (or Android) app, described at http://what-would-i-say.com/about.html as "Technically speaking, it trains a Markov Bot based on a mixture model of bigram and unigram probabilities derived from your past post history." You can also read about the so-called "Infinite Monkey Theorem" via its Wikipedia entry.

This assignment has its roots in several places: a story named Inflexible Logic, now found on pages 91-98 of Fantasia Mathematica (Google Books) and reprinted from a 1940 New Yorker story by Russell Maloney.
The true mathematical roots are from a 1948 monograph by Claude Shannon, A Mathematical Theory of Communication, which discusses in detail the mathematics and intuition behind this assignment. This assignment also has its roots in a Nifty Assignment designed by Joe Zachary from U. Utah, assignments from Princeton designed by Kevin Wayne and others, and the work done at Duke starting with Owen Astrachan and continuing with Jeff Forbes, Salman Azhar, and the UTAs from Compsci 201.

Programming Tasks To Complete

There are four programming components of this assignment:

1. Create a faster implementation than the brute-force method you start with, a class EfficientMarkov based on using maps rather than rescanning the training text (see the sketch after this list). You can test this class with the JUnit tests in MarkovTest by modifying a method in that class. You can run this class using the provided MarkovDriver by changing one line.

2. Create a class WordGram that will be used in a word-markov model rather than the character-markov model you start with. You can use and run this with the existing WordMarkovDriver, and you can test it with WordGramTester, which has some JUnit tests. The driver and testing programs are set to use a BruteWordMarkov class you're not given the source code for (this is in markov.jar as a .class file, part of the project).

3. Create a class EfficientWordMarkov, modeled on EfficientMarkov and with the same structure in terms of methods --- but using WordGram objects rather than Strings. The class WordMarkovDriver from the previous step will be used to run this class.

4. Create tests for the code you write to help ensure that the classes WordGram and EfficientWordMarkov are correct. You're given some test cases in the classes provided; you'll write a few more.
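To illustrate the map-based idea in task 1, here is a minimal sketch of building a map from every k-gram in the training text to the characters that follow it. This is only an illustration of the technique: the class, field, and method names used here (EfficientMarkovSketch, myMap, setTraining, getFollows) are assumptions and may not match the starter code you're given; follow the structure of the provided classes in your actual EfficientMarkov.

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    public class EfficientMarkovSketch {
        private Map<String, List<String>> myMap = new HashMap<>();
        private int myOrder;
        private String myText;

        public EfficientMarkovSketch(int order) {
            myOrder = order;
        }

        // Build the map in one pass over the training text: for every k-gram,
        // record the single character (kept as a String) that follows each
        // occurrence. The final k-gram, which has no follower, is skipped here.
        public void setTraining(String text) {
            myText = text;
            myMap.clear();
            for (int i = 0; i + myOrder < myText.length(); i++) {
                String gram = myText.substring(i, i + myOrder);
                String follow = myText.substring(i + myOrder, i + myOrder + 1);
                myMap.putIfAbsent(gram, new ArrayList<>());
                myMap.get(gram).add(follow);
            }
        }

        // Answer "what can follow this k-gram?" with one map lookup instead of
        // rescanning the whole training text each time.
        public List<String> getFollows(String gram) {
            return myMap.getOrDefault(gram, new ArrayList<>());
        }
    }

With a map like this, generating each character of random text is a constant-time lookup rather than a scan of the entire training text, which is where the speedup over the brute-force version comes from.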
Analysis to Complete

You'll also need to run benchmark programs that stress your classes and then analyze the results of the benchmark tests as part of an assignment write-up you turn in. You will turn in the assignment analysis write-up as a PDF. The analysis explanation is here.

You may be given a challenge coding assignment as well, one that can help you understand Java interfaces. This challenge is not required and will not add points to your grade on the Markov assignment, though we will track that you've completed it.

Grading

  10% bonus for using Git early
  25% EfficientMarkov correctness
  20% WordMarkov correctness
  20% EfficientWordMarkov correctness
  20% Analysis write-up
  5% Tests developed for your code
  10% on-time submission (late submissions can lose this)

Starting, Running, Using Git

You should start the assignment by forking the Git repo at this location:

  https://git.cs.duke.edu/201fall16/markov-start-fall16

which has as its URI:

  git@git.cs.duke.edu:201fall16/markov-start-fall16.git

You should fork it into your own space on git.cs.duke.edu, then clone it so you have a local version of it on the computer you're using. Please make sure your Git repo is private! See the class instructions on using Git. You won't be able to easily copy/paste code here because there's a configuration .project file that specifies the library markov.jar and the JUnit testing library. See the repository for the src/ and data/ folders and the .project, .classpath, and markov.jar files.

For this assignment a bonus of 10% of the grade will be based on downloading/cloning the code and running it according to the directions below. You'll be asked to make a small change to the code, run the program, and commit/push the change to git.cs.duke.edu. You'll need to do this before Thursday, September 29 to earn the 10%. Change the seed in the BruteMarkov class from 1234 to 5678, run the MarkovDriver program, and commit the changed file. Then change it back to RANDOM_SEED. We'll be able to see the commit.

What is a Markov Model? Background

An order-k Markov model uses strings of k letters to predict text; these are called k-grams. It's also possible to use k-grams that are composed of words rather than letters. An order-2 Markov model uses two-character strings, or bigrams, to calculate probabilities in generating random letters. A string called the training text is used to calculate these probabilities.

For example, suppose that in the text we're using for generating random letters (the so-called training text), under an order-2 Markov model the bigram "th" is followed 50 times by the letter "e", 20 times by the letter "a", and 30 times by the letter "o", because the sequences "the", "tha", and "tho" occur 50, 20, and 30 times, respectively, while there are no other occurrences of "th" in the text we're modeling.

Now suppose that in generating random text using an order-2 Markov process we generate the bigram "th" --- then based on this bigram we must generate the next random character using the order-2 model. The next letter will be an 'e' with probability 0.5 (50/100); will be an 'a' with probability 0.2 (20/100); and will be an 'o' with probability 0.3 (30/100). If 'e' is chosen, then the next bigram used to calculate random letters will be "he", since the last part of the old bigram is combined with the new letter to create the next bigram used in the Markov process.
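As a concrete illustration of the probabilities above, the sketch below (a hedged example, not code from the starter project) picks the next character for the bigram "th" by choosing uniformly at random from the list of observed followers: a character that follows "th" 50 times out of 100 is therefore chosen with probability 0.5.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Random;

    public class NextCharDemo {
        public static void main(String[] args) {
            // Followers of the bigram "th" in the example: 50 e's, 20 a's, 30 o's.
            List<Character> follows = new ArrayList<>();
            for (int i = 0; i < 50; i++) follows.add('e');
            for (int i = 0; i < 20; i++) follows.add('a');
            for (int i = 0; i < 30; i++) follows.add('o');

            Random rand = new Random(1234);   // a fixed seed makes runs reproducible
            char next = follows.get(rand.nextInt(follows.size()));

            // Slide the window: drop the first character of "th", append the new one.
            String nextBigram = "th".substring(1) + next;   // "he" if 'e' was chosen

            System.out.println("next char = " + next + ", next bigram = " + nextBigram);
        }
    }

Choosing uniformly from a list that contains one entry per occurrence (rather than one entry per distinct character) is exactly what gives the 0.5/0.2/0.3 probabilities described above.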
Running the program

Running MarkovDriver with training files trump-convention.txt and clinton-convention.txt should generate the results below when RANDOM_SEED is used to create the myRandom object used in determining random numbers. These runs are for the character-based Markov process. Runs of word-markov follow. You can use this output as part of verifying that your EfficientMarkov and EfficientWordMarkov programs work correctly (see below; a sketch of one such check follows the table).

[Table of expected output: one row for each order k = 1 through 5, with one column of random text generated from trump-convention.txt and one column generated from clinton-convention.txt.]
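One way to use this reference output is a JUnit test that trains both the provided brute-force class and your efficient class on the same file and checks that, with the same seed, they generate identical random text. The sketch below is only illustrative: the constructors and the setTraining/getRandomText method names are assumptions, so adapt them to the interface the starter code and MarkovTest actually use.

    import static org.junit.Assert.assertEquals;

    import java.nio.file.Files;
    import java.nio.file.Paths;
    import org.junit.Test;

    public class MarkovEquivalenceTest {

        // Hypothetical check: trained on the same text and seeded the same way,
        // the efficient model should produce exactly the same text as the
        // brute-force model. Method and constructor names are assumptions.
        @Test
        public void testBruteAndEfficientAgree() throws Exception {
            String training = new String(
                    Files.readAllBytes(Paths.get("data/trump-convention.txt")));

            BruteMarkov brute = new BruteMarkov(3);
            EfficientMarkov efficient = new EfficientMarkov(3);

            brute.setTraining(training);
            efficient.setTraining(training);

            assertEquals(brute.getRandomText(500), efficient.getRandomText(500));
        }
    }

A test in this style catches most indexing and map-building mistakes quickly, because any divergence between the two implementations shows up as a mismatch in the generated text.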