This content covers topics related to Data Structures and Algorithms, including constructing keyword trees and suffix trees, determining informative sites in MSA, gene finding algorithms, and BLAST. It also includes an R script for analysis.
Running Head: DATA STRUCTURES AND ALGORITHMS
Data Structures and Algorithms
[Name of Student]
[Instructional Affiliation]
[Date of Submission]
1. Given the following DNA sequence: GGTGTAAAGAATCTT
a. Construct a keyword tree (10 points)
b. Construct a suffix tree (10 points)
[Figure: tree diagrams; node and edge labels are not recoverable from the extracted text]
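As a rough illustration of question 1, the sketch below builds a keyword tree (trie) whose keywords are all suffixes of the sequence, then merges single-child chains to obtain a suffix tree. It is a naive quadratic construction for a short sequence, not a linear-time algorithm such as Ukkonen's. The function names `build_suffix_trie`, `compress` and `count_leaves`, and the `$` terminator, are assumptions made for this example.

```python
def build_suffix_trie(text, terminator="$"):
    """Insert every suffix of text + terminator into a nested-dict trie."""
    text += terminator
    root = {}
    for start in range(len(text)):
        node = root
        for symbol in text[start:]:
            # Reuse the existing child for this symbol or create a new one.
            node = node.setdefault(symbol, {})
    return root


def compress(node):
    """Merge chains of single-child nodes so each edge carries a substring."""
    compressed = {}
    for symbol, child in node.items():
        label = symbol
        # Follow the chain while the child has exactly one outgoing edge.
        while len(child) == 1:
            (next_symbol, next_child), = child.items()
            label += next_symbol
            child = next_child
        compressed[label] = compress(child)
    return compressed


def count_leaves(node):
    """Count leaves; each leaf corresponds to one suffix of the terminated text."""
    if not node:
        return 1
    return sum(count_leaves(child) for child in node.values())


sequence = "GGTGTAAAGAATCTT"
suffix_tree = compress(build_suffix_trie(sequence))
print(sorted(suffix_tree))        # labels of the edges leaving the root
print(count_leaves(suffix_tree))  # one leaf per suffix, including "$" alone
```

The terminator guarantees that no suffix is a prefix of another, so every suffix ends at its own leaf. Calling `build_suffix_trie` alone gives the uncompressed keyword tree of the suffixes.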
2. How many different nucleotide sequences may code for the following protein sequence: Arg-Lys-Pro-Val-Ser-Ile-Ala? (10 points)
[Figure: codon sequences with their amino acid translations; not recoverable from the extracted text]
Under the standard genetic code, the amino acids are encoded by 6 (Arg), 2 (Lys), 4 (Pro), 4 (Val), 6 (Ser), 3 (Ile) and 4 (Ala) codons, so the count is 6 × 2 × 4 × 4 × 6 × 3 × 4 = 13,824. Therefore, 13,824 different nucleotide sequences.
3. Given the following MSA (Multiple Sequence Alignment), describe (in pseudocode) how you would determine which positions contained informative sites (15 pts)
AT-ACGCCGATGCAT
ATTACGACGATGCTT
ATTACGACGAAGCTT
AT-ACGACGATGCAT
4. Describe how gene finding algorithms work. Include a description of all the elements that they search for to help determine whether or not a sequence is a protein coding gene (10 pts)
How gene finding algorithms work
Gene finding is the process of identifying the regions of genomic DNA that encode protein-coding genes, and several approaches are used to improve the prediction. Gene finding algorithms use two main methods to search for these elements. In similarity-based (evidence-based) gene finding systems, the target genome is searched for sequences that are identical or similar to extrinsic evidence. Given a protein sequence, the family of possible coding DNA sequences is obtained by reverse-translating it with the genetic code. Once the candidate DNA sequences have been determined, the target genome is searched for matches. The matches can be partial or complete, exact or inexact. Consequently, a high degree of similarity to a known protein product is strong evidence that the region of the target genome is a protein-coding gene.
5. What is BLAST? Describe how the algorithm works. Be sure to include any statistical measures that are used in determining the strength of any BLAST results. (10 pts)
The Basic Local Alignment Search Tool (BLAST) is a widely used bioinformatics program. It is an algorithm for comparing primary biological sequence information, such as the amino acid sequences of proteins and the nucleotide sequences of DNA.
How the BLAST algorithm works
To run, BLAST needs a query sequence to search for and a target sequence or sequence database to search against. BLAST then finds subsequences in the database that are identical or similar to subsequences of the query (Casey, p. 22). In most usage, the query is much smaller than the database. When the algorithm searches a sequence database, it first computes the pairwise
alignment of the query against the known sequences in the database. The best-scoring, significant homologs are then reported. BLAST uses statistical methods to determine the strength of a result. The main one is assessing alignment significance. With this method, random alignments are first generated and their scores computed. The mean and the standard deviation of the random scores are then computed. The deviation of the actual alignment score from the mean of the random scores is then computed, and from it the significance of the alignment is evaluated. BLAST reports this significance as an E-value, the number of alignments scoring at least as well that would be expected by chance, together with a bit score.
6. A graduate student has written part of an R script to perform an analysis. It is listed below.
• Describe what each line does by adding comment lines to it as appropriate (10 pts)
#-------------------------------------------------------
# Attach the stats package, which provides kmeans()
library(stats)
# Copy the built-in iris data set into mydata
mydata <- iris

# Round 1 (first clustering run)
# Seed the random number generator so the clustering is reproducible
set.seed(101)
# Run k-means on the four numeric columns with 10 centroids
km <- kmeans(mydata[,1:4], 10)
# Scatter plot of column 1 against column 2 (Sepal.Length vs Sepal.Width), colored by cluster
plot(mydata[,1], mydata[,2], col=km$cluster)
# Overlay the cluster centers as large filled circles (pch=19, cex=2)
points(km$centers[,c(1,2)], col=1:3, pch=19, cex=2)
#-------------------------------------------------------
• Execute the script and show all of the output it generates (5 pts)
• Modify the script so that there are 3 centroids displayed (10 pts)
#-------------------------------------------------------
library(stats)
mydata <- iris

# Round 1
set.seed(101)
km <- kmeans(mydata[,1:4], 3)
plot(mydata[,1], mydata[,2], col=km$cluster)
points(km$centers[,c(1,2)], col=1:3, pch=19, cex=2)
#-------------------------------------------------------
• Provide the final modified script and its output. (10 pts)
References
Casey, R. M. (2005). BLAST Sequences Aid in Genomics and Proteomics. Business Intelligence Network.
Mount, D. W. (2004). Bioinformatics: Sequence and Genome Analysis (2nd ed.). Cold Spring Harbor Press.