6.096 – Algorithms for Computational Biology Sequence Alignment and Dynamic Programming Lecture 1 - Introduction Lecture 2 - Hashing and BLAST Lecture 3 - Combinatorial Motif Finding Lecture 4 - Statistical Motif Finding Challenges in Computational Biology 4 Genome Assembly 5 Regulatory motif discovery 1 Gene Finding DNA 2 Sequence alignment 6 Comparative Genomics TCATGCTAT TCGTGATAA 3 Database lookup 7 Evolutionary Theory TGAGGATAT TTATCATAT TTATGATTT 8 Gene expression analysis RNA transcript 9 Cluster discovery 10 Gibbs sampling 11 Protein network analysis 12 Regulatory network inference 13 Emerging network properties Comparing two DNA sequences • Given two possibly related strings S1 and S2 – What is the longest common subsequence? S1 A C G T C A T C A S2 T A G T G T C A SS11 AA CC GG TT CC AA TT CC AA Edit distance: •Number of changes SS22 TT AA GG TT GG TT CC AA needed for S1ÆS2 A G T T C A LCSS How can we compute best alignment S1 A C G T C A T C A S2 T A G T G T C A • Need scoring function: – Score(alignment) = Total cost of editing S1 into S2 – Cost of mutation – Cost of insertion / deletion – Reward of match • Need algorithm for inferring best alignment – Enumeration? – How would you do it? – How many alignments are there? Why we need a smart algorithm • Ways to align two sequences of length m, n n m m n m n § · ( )! 2 ¨ ¸ | ¨ ¸ m m 2 m ( !) S © ¹ • For two sequences of length n n Enumeration Today's lecture 10 184,756 100 20 1.40E+11 400 100 9.00E+58 10,000 Key insight: score is additive! i S1 A C G T C A T C A S2 T A G T G T C A j • Compute best alignment recursively i, j – For a given split ( ), the best alignment is: • Best alignment of S1[1..i] and S2[1..j] • + Best alignment of S1[ i..n] and S2[ j..m] i i S1 A C G T C A T C A S2 T A G T G T C A j j Key insight: re-use computation S1 S1 A C G T C A T C A A C G T C A T C A S2 S2 T A G T G T C A T A G T G T C A S1 A C G T C A T C A S1 A C G T C A T C A S2 T A G T G T C A S2 T A G T G T C A S1 A C G T S1 C G T C A T C A S2 T A G T G S2 T G T C A Identical sub-problems! We can reuse our work! Solution #1 – Memoization • Create a big dictionary, indexed by aligned seqs – When you encounter a new pair of sequences – If it is in the dictionary: • Look up the solution – If it is not in the dictionary • Compute the solution • Insert the solution in the dictionary • Ensures that there is no duplicated work – Only need to compute each sub-alignment once! Top down approach Solution #2 – Dynamic programming • Create a big table, indexed by (i,j) – Fill it in from the beginning all the way till the end – You know that you’ll need every subpart – Guaranteed to explore entire search space • Ensures that there is no duplicated work – Only need to compute each sub-alignment once! • Very simple computationally! Bottom up approach Key insight: Matrix representation of alignments A C G T C A T C A T A G T G T C A S1 A C G T C A T C A S2 T A A Goal: G G Find best path T T G through the matrix C/G T T C C A A
Description: