Identifying Variable Length Multi-pair Palindromic Patterns with Errors in a DNA Sequence

Hyoung rae Kim
Department of Computer Sciences, Florida Institute of Technology, Melbourne, FL 32901, USA
hokim@fit.edu

William D. Shoaff
Department of Computer Sciences, Florida Institute of Technology, Melbourne, FL 32901, USA
ds@cs.fit.edu

Abstract

The emphasis in genome projects has moved towards sequence analysis in order to extract biological meaning (e.g., the evolutionary history of particular molecules or their functions) from the sequence. In particular, palindromic or direct repeats that appear in a sequence have biophysical meaning [6]. The problem is to recognize interesting patterns and configurations of words (strings of characters) over complementary alphabets. We propose an algorithm to identify variable length palindromic pairs (longer than a threshold) where we can allow gaps (distance between words). The algorithm is called the palindrome algorithm (PA) and has O(N) time complexity. A palindromic pair forms a hairpin structure. By composing collected palindromic pairs we build n-pair palindromic patterns; this is called the structural representation algorithm (SRA). In addition, we dot some of the longest pairs in a circle to represent the structure of a DNA sequence. We run this algorithm over several selected genomes and present the results for E.coli K12.

Keywords: complement, palindrome, palindromic pattern, algorithm, string searching, DNA structure, DNA sequence, structural representation, efficient pattern extraction, Escherichia coli K12, Escherichia coli O157, salmonella.

1. Introduction

One of the problems arising in the analysis of biological sequences is the discovery of patterns that appear at different positions in a nucleic acid. Genomic science and structural biology are related through the sequence and the structure of nucleic acids.
It is well known that in a palindrome of nucleic acids a subsequence binds with the complementary subsequence running in the opposite direction on its own strand to make a stem-loop. Since topological entanglements in DNA, such as knots and catenation, are crucial to the function of cells, finding these palindromes and knots is important []. In particular, palindromic or direct repeats that appear in a sequence have biophysical meaning: they act as recognition sites of dimers, form stem-loops, and contribute to the global structure of nucleic acids; moreover, the genetic network, transduction pathway, and tissue specificity are also related to these sequences [6].

Our research focuses on finding inverted repeats (palindromes). When we start to search for palindromic pairs in a DNA sequence, we basically have no information about the words (subsequences) for which we are searching. It is easy to start by searching for fixed length palindromic pairs; however, finding fixed length palindromes has several disadvantages, in particular the need to set the length. It may be very difficult to know which length has biological meaning. We instead find variable length palindromes longer than a minimum word length (which can be either calculated automatically or assigned by a user). Palindromes may contain some mismatches in the form of gaps and defects of various other natures [4,7], so we also allow errors within a word. We may be able to compose these collected palindromic pairs (each of which forms a hairpin structure) into multiple pair palindromic patterns. Furthermore, we can use this to help understand the structure of a DNA sequence.

We divide a sequence into n-gram tokens, and then build long words by merging adjacent tokens. Our algorithm has linear time complexity in the length of a DNA sequence and O(N × LTK) space complexity, where LTK is the n-gram window size. We call this searching paradigm, which breaks a string into pieces and then merges them into real words, the "break and gather" searching technique.
A disadvantage of this algorithm is that it is not complete: it may fail to find a palindromic pair even if one exists. This, however, happens only in very synthetic strings (for example, a string that consists of only one character). Our contributions in this work are to introduce a new palindrome searching algorithm (PA), to introduce a new possibility for the structural representation of a DNA sequence, to extend palindromic representation from the hairpin structure to multi-pair palindromic patterns, and to introduce a new searching paradigm, break and gather, that we have used for finding palindromic pairs. We hope our new structural representation scheme, based on an n-pair palindromic pattern, will help biologists in visualizing DNA sequences.

The rest of this paper is organized as follows: Section 2 defines palindromes and palindromic patterns; Section 3 details our PA and the algorithm to generate patterns; Section 4 describes our experiments; Section 5 analyzes the results from the experiments; Section 6 discusses related work in searching for palindromes; Section 7 summarizes our work and suggests possible future work.

2. Palindromic pair and patterns

Palindromic pairs represent hairpin structures (stem-loop formation) that have biological meaning in DNA and RNA [6]. A palindrome is an inverted repeat of a word (subsequence); a word is a sequence of characters. When searching a DNA sequence for such a palindromic pair, we do not know the word length or how many palindromic pairs exist in the sequence. The simplest approach is to fix the length of a word. Finding fixed length palindromes, however, has the critical disadvantage that the length must be set in advance, while the length that has biological meaning may not be known. In our approach the length of a palindrome is not fixed a priori: we find variable length palindromes that are longer than a minimum word length.

A single palindromic pair can represent only a hairpin structure. With multiple palindromic pairs we extend this research to multi-pair palindromic patterns: we compose multiple palindromic pairs and construct patterns (details in Sec. 3.2). In the next sections we detail the characteristics of a palindrome and of palindromic patterns; the input and output data are also illustrated.

2.1. Palindromic pair

A palindromic pair consists of a word and its complement as shown in Figure 1. Even though its structure is not complex, it has several characteristics: gaps, length, overlap, and errors. There can be a gap between the word and its complement.
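As a minimal sketch of the definition above, the following Python snippet computes the reverse complement of a word and checks whether two words form a palindromic pair. The function names are hypothetical and are not taken from the paper's C++ implementation; the example strings are the ones that appear in the paper's figures.

```python
# Standard Watson-Crick base pairing for lowercase DNA characters.
COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}


def reverse_complement(word: str) -> str:
    """Return the complement of `word`, read in the opposite direction."""
    return "".join(COMPLEMENT[base] for base in reversed(word))


def is_palindromic_pair(word: str, candidate: str) -> bool:
    """True when `candidate` is the inverted repeat (reverse complement) of `word`."""
    return candidate == reverse_complement(word)


if __name__ == "__main__":
    print(reverse_complement("aaatg"))
    print(is_palindromic_pair("aaattaata", "tattaattt"))
```

Here the gap between the two words is ignored; it only matters once positions in a sequence are taken into account.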
Palindrome recognition differs according to the presence of a gap and differences in length; it is necessary to examine complements with gaps [6]. Long palindromes are more informative, while short ones are too common in DNA. For example, all palindromes of length 4 can be found easily, but it is hard to extract information from them.

Figure 1. Word and complement (word: aaatg; reverse complement: cattt)

There can be overlaps among palindromes, as shown in Figure 2, where two palindromic pairs overlap. There are several ways of collecting pairs in this situation: first, collect only the first pair; second, collect the longer pair; third, collect both pairs; and fourth, collect only the second pair. Since one of our purposes is to collect as many pairs as possible, we chose the third method, which collects both pairs. In some cases three or more pairs can overlap each other. However, since the biological meaning of overlapping is not known yet, we do not collect all possible pairs with multiple overlaps (we only allow one overlap). The details are explained in Section 3.1.3. Once we have the palindromic pairs, including overlapped pairs, the overlapped pairs can be removed easily if necessary.

Figure 2. Overlapping palindromes

The algorithm to find a complement token is explained in Section 3.1.3. There can be mismatches in the form of gaps and defects in a sequence itself [4,7]. In PA, tokens match exactly, but we allow errors or mismatches between tokens when they are combined. An error can therefore occur only after a full token length of exact matches. Since the length of a token is normally much smaller than that of a word, and we allow only small errors, we do not think this limitation will cause big differences in the results. This process is detailed in Section 3.1.

2.2. Palindromic pattern

A palindromic pair forms a hairpin structure; once we have multiple palindromic pairs we can easily extend the structure to multiple pair palindromic patterns.
A palindromic pattern is a pattern composed of multiple palindromic pairs. Two words and their complements form a 2-pair palindromic pattern. There are three different 2-pair patterns, which can be identified by integers or ordinal quadruples. Figure 3 presents the three types, each of which is ordered following a Closest Position First (CPF) rule: a complement is inserted in the position closest to the start of the pattern, starting from the last word. Each number inside the brackets represents a word; the repeated number represents its complement.

Type 0: (1, 1, 2, 2)
Type 1: (1, 2, 1, 2)
Type 2: (1, 2, 2, 1)
Figure 3. 2-pair palindromic patterns

Each quadruple in Figure 3 gives the order in which the two words and their complements occur; a pattern type can also be represented in a sequence s as shown in Figure 4.

Figure 4. Pattern type in a string s

Two-pair palindromic patterns may have the same restrictions as 1-pair palindromic patterns, and a further constraint can be applied: a minimum and maximum gap between words (subsequences). Words should not be closer than the minimum word gap and should not be separated by more than the maximum word gap. The capital letters in Figure 5 represent words and their complements; we call the gap between a word and its complement the complement gap and the gap between words the word gap. N-pair patterns generate many different multiple-pair palindromic patterns. In Section 3.2.1 we explain an algorithm to generate all possible n-pair palindromic patterns. Some patterns can be merged when they are isomorphic when viewed as a graph (details in Sec. 3.2.2).

Figure 5. Word and complement gap (s = acac AAAT acac CACT acac ATTT acac AGTG acac; the complement gaps separate each word from its complement, and the word gaps separate neighboring words)

There can be many palindromic pairs in a DNA sequence. Once we pick a number (n) of the longest palindromic pairs, we can represent them as an n-pair palindromic pattern. We call this a structural representation. One can identify certain DNA sequences by using enough palindromic pairs. This structure can be implemented simply, and we call the algorithm the structural representation algorithm (SRA). More details of SRA are given in Section 3.2.3. Different genomes appear to have different structural representations.

2.3. Input/output data

Our research focuses on collecting palindromic pairs from a DNA sequence. In collecting palindromes there are several options (constraints). We may need to set the token (n-gram) length. Sometimes we may want only long palindromic pairs, so we need to set the minimum length (this can be the same as the length of a token).
We can assign a gap between a word and its complement and a gap between different words (minimum and maximum complement gaps, and minimum and maximum word gaps). These gaps can take any value from 0 through the sequence length. The length of an error means the number of character mismatches allowed when tokens are combined. An example of input data is shown in Figure 6.

A DNA sequence: .. aaattaata .. aataaaaaga .. gataaact ... tattaattt ... tctttttatt agtttatc .
Token length: 8
Minimum length of a palindromic pair: 9
Minimum complement gap: 0 (< sequence size)
Maximum complement gap: the size of a sequence
Minimum word gap: 0 (< sequence size)
Maximum word gap: the size of a sequence
Length of an error: 0
Figure 6. Input data

The output consists of variable length palindromes. An example is shown in Figure 7. Our program continues by building n-pair palindromic patterns and counting the number of these patterns. The numbers of n-pair palindromic patterns of each type are also output data; in addition, using a number of collected pairs, we represent the structure of a DNA sequence by combining the pairs and plotting them in a circle.

Figure 7. Output data (position and string of each word and of its complement; the words aaattaata and aataaaaaga are paired with the complements tattaattt and tctttttatt)

3. Approach

When the objects that we are searching for are not known and lie in an enormous amount of data, the problem becomes difficult. For this kind of problem we propose a new searching method, called Break and Gather. It breaks the problem of finding long palindromes into finding small, fixed-size chunks, then gathers and combines the chunks that can be combined. Another advantage of the Break and Gather method is its easy adaptability to other pattern searching problems (e.g., searching for direct repeats). We explain the palindrome algorithm (PA), which detects palindromic pairs with variable length and gaps in a DNA sequence, in Section 3.1 and its subsections. In addition, the collected palindromic pairs can then be used for generating multi-pair palindromic patterns by synthesizing and matching, which is explained in Section 3.2.

3.1. Palindrome algorithm

We use n-gram tokens to search a DNA sequence for words. We briefly explain the process here. At first we set the length of a token (LTK). We treat characters as numbers: a = 0, c = 1, g = 2, and t = 3. The encoding of a token is explained in Section 3.1.1. Each token is stored in an open hash table, an array of pointers (details in Section 3.1.2). As we store a token, the existence of its complement is also checked. We do not convert each token to its complement and search the hash table; instead we pass the complement together with the token. The complement gaps are considered while searching for a complement token. All the token pairs that meet the constraints are retrieved from the DNA sequence. If a valid token pair that meets the complement gaps is found, it is stored in a token table (TT) that holds the token, the complement token, and their positions. This process is explained in Section 3.1.3. The structure of a token table is shown in Figure 8.

Figure 8. Structure of the token table

Once the whole DNA sequence is scanned, the token table is sorted by position. When all the pairs are sorted by position, some tokens and complements are adjacent to the next token and complement. It is clear that those pairs can be combined. Therefore we combine adjacent tokens and store them in a combined token table (CTT). This is illustrated in Section 3.1.4. Out of the combined token table we select only the combined tokens longer than the minimum word length and store them in a word table (WT); those results are the palindromic pairs.
The following subsections explain the details.

3.1.1. Token structure

Even though the goal is to collect long variable length palindromes, as a first step we break a DNA sequence into small tokens and start by searching for palindromic token pairs. Since we do not know the maximal size of the palindromes in a DNA sequence, we start from a small size. To reduce the running time we treat the characters as numbers. The problem is how to convert a long run of characters into a number and how to reduce the time complexity of converting a token to a number.

We encode and save a token as a number. The characters are treated as numbers: a = 0, c = 1, g = 2, and t = 3. For instance, the string s = caacgt is calculated as 1×4^5 + 0×4^4 + 0×4^3 + 1×4^2 + 2×4^1 + 3×4^0 = 1051. The number 1051 exactly signifies the string caacgt. Since we are converting characters into a number, there is a limitation on the token size. We use a signed long integer in C++. Of course, other data types with a larger range, such as unsigned long long, could be used. The maximum value of a signed long integer is 2,147,483,647. This implies that an integer can store only 15 characters (4^15 < 2,147,483,647 < 4^16). However, the length of a token (LTK) could be longer than 15. So we divide a token into two parts: a hash index and list indexes (the first 11 characters are used for the hash index and the remainder is grouped into 15-character chunks, except the last chunk, which may be smaller), as shown in Figure 9. The reason why we assign 11 characters to the hash index, and the structure of the open hash table, are explained in the next section.

Figure 9. Structure of a token (a hash index followed by list indexes 1 to n; a token and its complement each carry a pointer to their positions)

For example, when the token length is 50, the first 11 characters are used for the hash index, the next 15 characters for the first list index, the next 15 characters for the second list index, and the last 9 characters for the third list index.
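The encoding and the token split described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the paper's C++ code: the names `encode` and `split_token` are hypothetical, and Python integers are unbounded, so the 15-character limit is enforced only to mirror the 32-bit signed integer discussed in the text.

```python
# Character values from the text: a = 0, c = 1, g = 2, t = 3.
BASE_VALUE = {"a": 0, "c": 1, "g": 2, "t": 3}

# A 32-bit signed integer holds at most 15 base-4 digits.
MAX_CHARS_PER_INT = 15


def encode(chunk: str) -> int:
    """Encode a chunk of at most 15 characters as a base-4 number."""
    if len(chunk) > MAX_CHARS_PER_INT:
        raise ValueError("chunk does not fit in a 32-bit signed integer")
    value = 0
    for ch in chunk:
        value = value * 4 + BASE_VALUE[ch]
    return value


def split_token(token: str, hash_len: int = 11):
    """Split a token into a hash-index part and list-index chunks.

    The first `hash_len` characters form the hash index; the remainder is
    cut into 15-character chunks, the last of which may be shorter.
    """
    hash_part = token[:hash_len]
    rest = token[hash_len:]
    list_parts = [rest[i:i + MAX_CHARS_PER_INT]
                  for i in range(0, len(rest), MAX_CHARS_PER_INT)]
    return hash_part, list_parts


if __name__ == "__main__":
    print(encode("caacgt"))
    hash_part, list_parts = split_token("acgt" * 12 + "ac")  # a 50-character token
    print(len(hash_part), [len(part) for part in list_parts])
```

Keeping each chunk within 15 characters is what lets every index fit in one machine word in the original setting.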
When a fixed length token is encoded into a number, the cost depends on the token length (LTK). Given that the time complexity of converting a token to a number is O(LTK), the total time complexity of converting all tokens in a DNA sequence is O(LD × LTK), where LD stands for the length of the sequence. However, the index of the next word can be computed incrementally, simply by scanning one new character. We can view a string of c consecutive characters as representing a length-c quaternary number. Given a string T[1..d], let t_s denote the quaternary value of the length-c substring T[s+1..s+c], for s = 0, 1, ..., d-c. For a length-c string P[1..c] with value p, t_s = p if and only if T[s+1..s+c] = P[1..c]. We can compute p in time O(c):

$$p = P[c] + 4\bigl(P[c-1] + 4\bigl(P[c-2] + \cdots + 4\bigl(P[2] + 4P[1]\bigr)\cdots\bigr)\bigr)$$

The first value t_0 is computed in the same way. To compute the remaining values t_1, t_2, ..., t_{d-c} in time O(d-c), it suffices to observe that t_{s+1} can be computed from t_s in constant time [4], since

$$t_{s+1} = 4\bigl(t_s - 4^{c-1}\,T[s+1]\bigr) + T[s+c+1]$$

For example, let us say a token (LTK is 8) is acggtgat, the next token is cggtgatg, the hash word length is 3, and the length of each list word is 3. First, acg is encoded and 6 (0×4^2 + 1×4 + 2 = 6) is kept as the hash index, gtg is encoded as 46 and kept as list index 1, and at is encoded as 3 for list index 2. The first character a of the hash index in the token becomes the old character for the new hash index, while the g in list index 1 becomes the new character in the hash index. Similar shifts occur for the other indices. This is depicted in Figure 10. The new values for the new token are calculated as

new value = (4 × (old value - old char × base)) + new char,

where base is 4^(index length - 1) and index length is the number of converted characters in the index. For example, the new hash value is 26 (26 = (4 × (6 - 0 × 4^2)) + 2), where 6 is the old value, 0 is the old character (a), 4^2 is the base, the index length is 3, and 2 is the new character (g).

Figure 10. Converting characters to numbers (old token acggtgat, new token cggtgatg; the character leaving one index is its old character, and the first character of the following index becomes its new character)

How to get the old and new characters is explained in Section 3.1.3.

3.1.2. Hash table

While scanning a DNA sequence, each token is stored in an open hash table. Since we are trying to build a token table (a table that stores tokens and their complements with positions), every time a token is scanned the existence of its complement should also be checked, and if it is there, both should be stored in the token table. To increase the searching speed we use an open hash table. An open hash table is basically just an array of pointers to linked lists that store the tokens.
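The constant-time index update described for the token structure can be checked with a short Python sketch. The helper names (`roll`, `rolling_values`) are hypothetical; the sketch encodes a single index of fixed length rather than the full hash-index and list-index layout, and it reproduces the worked example in which the hash value 6 for acg becomes 26 for cgg.

```python
BASE_VALUE = {"a": 0, "c": 1, "g": 2, "t": 3}


def encode(chunk: str) -> int:
    """Horner evaluation of a chunk as a base-4 number, O(len(chunk))."""
    value = 0
    for ch in chunk:
        value = value * 4 + BASE_VALUE[ch]
    return value


def roll(old_value: int, old_char: str, new_char: str, index_length: int) -> int:
    """new value = 4 * (old value - old char * base) + new char, in O(1)."""
    base = 4 ** (index_length - 1)
    return 4 * (old_value - BASE_VALUE[old_char] * base) + BASE_VALUE[new_char]


def rolling_values(sequence: str, c: int):
    """Yield the value of every length-c window of `sequence`, left to right."""
    if len(sequence) < c:
        return
    value = encode(sequence[:c])  # only the first window costs O(c)
    yield value
    for s in range(len(sequence) - c):
        value = roll(value, sequence[s], sequence[s + c], c)
        yield value


if __name__ == "__main__":
    # Hash index of acggtgat -> hash index of cggtgatg (index length 3).
    print(roll(encode("acg"), "a", "g", 3))
    print(list(rolling_values("acggtgatg", 3)))
```

Each index of a token (hash index and list indexes) would be updated with the same `roll` step, taking its new character from the front of the following index.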
The structure is shown in Figure 11. The structure is very simple, so we focus on explaining the size of the hash table. The size of the hash table is related not to the token size but to the size of the DNA sequence (LD). The program saves all the captured tokens and their positions; therefore, the space complexity is O(LD × LTK). The number of tokens in a DNA sequence is LD - LTK + 1 ≈ LD. Therefore, we take the optimal open hash table size to be LD, and the optimal length of the hash character set is log_4 LD. For instance, the size of E.coli K12 is 4,639,221 and the number of characters used for the hash index is 11 (11.07 ≈ log_4 4,639,221).

Figure 11. Hash table (each hash table entry leads to tokens with their position arrays)

3.1.3. Finding a complement token

Above we explained how a token is inserted into a hash table. It is also necessary to check whether its complement already exists. If we converted the characters every time to check the existence of a complement, the time complexity would be O(LD × LTK). Therefore, instead of converting the characters, we pass the complement as a value. Here we explain how to compute the complement token efficiently and how to get the old and new characters for each token conversion. Later we explain how to build the token table.

As mentioned above, we do not convert each token to its complement and search the hash table; instead we pass its complement together with the token. We now explain how to find the newly added and removed characters of a token in constant time. We devised a special type of queue, called an open queue, which removes its first element and takes a new element when it is full, and which has a Peek method that reads the character at a given position, besides the standard Enqueue and Dequeue methods. We made two queues of the open queue type, each of size token length + 1: one is called the queue and the other the requeue. While scanning, all the words pass through the queue and all complement tokens through the requeue. One grows from the right and the other from the left, as shown in Figure 12. The time complexity of passing through or peeking into a queue is O(1). By peeking at certain positions, the new and old characters for each section of a word are picked.

Figure 12. Open queue and open requeue (queue: a c g g t g a t g c; requeue: g c a t c a c c g t; labels mark the new and old characters for the hash index and for each list index)

The complement gaps are considered while searching for a complement token. All the token pairs that meet the constraints are retrieved from the sequence. If a valid token pair is found, it is saved in a token table (TT) that holds the token, the complement token, and their positions. A token table is made from position arrays as follows: each token instance has a position array that stores all positions where the token occurred. When a position is added to this array, the position array of the complement token is checked, and if there is a valid pair (one with enough gap) the position pair is sent to the token table. After this, the array index of the smaller position value is advanced so that this position will not be used for future comparisons in pair searching. For example, in the case of Figure 13, the first position value arrives and is stored in the position array of a token. At this moment no comparison occurs since the position array of the complement token is empty. When the next value, 59, arrives it forms a pair with the first position, and the first position will not be used again. The pair (first position, 59) is sent to the token table. When position value 750 arrives, it pairs with 59.
And the pair (59, 750) is sent to the token table. Usually the position arrays are not very long. If one is long, it means there are many repeated tokens and palindromic pairs. To collect all possible overlapping pairs, the hashing process and the complement search should be separated: searching complements for each token has to be done after all the tokens are stored in the hash table. The token table then consists of the Cartesian products of the two sets of positions of words and their complements, where the position of a token appears ahead of its complement. This separation also ensures that we find the longest palindromic pairs, and the time complexity does not change.

Figure 13. Finding a token pair (the position array of a token and the position array of its complement token feed the token table)

Long palindromes are rare and considered more informative, so we prefer longer palindromes over shorter ones. Furthermore, if the token size is too small there may be too many token pairs, which leads to space problems in our algorithm. Let us suppose a token length of 2 and four different characters. Then the average distance at which a palindrome occurs can be calculated for the uniformly random case: there are 16 (4^2) possible tokens, and since the string is random, all 16 may be distributed evenly. The probability that the next token is the same token is 1/16, but the palindromic token is always within the 16 tokens, so we need to divide by 2. The average distance to the complement is 4^LTK / 2, where 4 is the number of different characters (a, c, g, t). Generally, a palindrome that can occur in a random sequence may not be informative; therefore we set LTK larger than the length that can statistically occur at random.

3.1.4. Combining adjacent tokens

Above we explained how to break a huge string into parts and locate token pairs. Now we explain how to combine the parts.
Once the whole DNA sequence is scanned, the token table is sorted by position. When all the pairs are sorted by position, some tokens and complements are adjacent to the next token and complement. It is clear that those pairs can be combined. Therefore we combine adjacent tokens and store them in a combined token table (CTT). In the example of Figure 14, the first two tokens, and likewise their complement tokens, overlap in all but one character. We can combine these into one larger word; this procedure is called combination. The combined token table shows fully connected tokens. Without errors, the position of the next word is p+1, where p is the position of a word. When we allow E errors (E is the number of error characters), the position of the next word lies between p+1 and p+1+E. We keep combining as long as the next positions of a word and its complement are within this range.

Figure 14. Combining adjacent tokens (token table (TT): tokens aaattaat, aattaata, aataaaaa, ataaaaag, taaaaaga, gataaact with complement tokens attaattt, tattaatt, tttttatt, ctttttat, tcttttta, agtttatc; combined token table (CTT): aaattaata, aataaaaaga, gataaact with complements tattaattt, tctttttatt, agtttatc)

PA is not a complete algorithm, which means there can exist palindromic pairs that PA cannot find. For example, take a sequence atatatat...atatat (of size S) that consists of only the two complementary characters a and t. If we do not allow a complement gap, the size of the palindromic pair should be half of the sequence (S/2). But our algorithm may not be able to find it. Suppose the token length is LTK; then the size of the word and its palindrome will be (S-LTK) and they overlap, so this pair will be removed. To detect this problem we check each token that consists of only two complementary characters. Compensating for this problem is left as future work.

To summarize, we store tokens and complements with their positions in a hash table. After that we build a token table that consists of the Cartesian products of the two sets of positions of words and their complements, where a word always comes ahead of its complement. Then we sort the token table and combine tokens and complements.
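The break-and-gather steps summarized above can be sketched end to end in Python. This is a simplified illustration, not the paper's implementation: it uses a dictionary of token strings instead of the numeric open hash table, uses 0-based positions, allows no errors (E = 0), and builds the full Cartesian product of positions, so it does not preserve the linear-time behaviour claimed for PA. The function name and the result layout are assumptions.

```python
from collections import defaultdict

COMPLEMENT = {"a": "t", "c": "g", "g": "c", "t": "a"}


def reverse_complement(word):
    return "".join(COMPLEMENT[base] for base in reversed(word))


def find_palindromic_pairs(seq, ltk, min_length, min_gap=0, max_gap=None):
    """Return (word_start, complement_start, length) for each combined pair."""
    if max_gap is None:
        max_gap = len(seq)

    # Break: record every position of every length-ltk token.
    positions = defaultdict(list)
    for p in range(len(seq) - ltk + 1):
        positions[seq[p:p + ltk]].append(p)

    # Token table: (token position, complement position) pairs in which the
    # token comes first and the complement gap is within bounds.
    token_table = set()
    for token, starts in positions.items():
        comp_starts = positions.get(reverse_complement(token), ())
        for p in starts:
            for q in comp_starts:
                if min_gap <= q - (p + ltk) <= max_gap:
                    token_table.add((p, q))

    # Gather: when the word grows one character to the right, its reverse
    # complement grows one character to the left, so (p, q) is followed
    # by (p + 1, q - 1).
    words = []
    for p, q in sorted(token_table):
        if (p - 1, q + 1) in token_table:
            continue  # interior of a run that started at an earlier position
        k = 0
        while (p + k + 1, q - k - 1) in token_table:
            k += 1
        length = ltk + k
        if length >= min_length:  # keep only sufficiently long words
            words.append((p, q - k, length))
    return words


if __name__ == "__main__":
    s = "aaattaata" + "cccc" + "tattaattt"
    for w, c, n in find_palindromic_pairs(s, ltk=8, min_length=9):
        print(w, s[w:w + n], c, s[c:c + n])
```

Because every token pair in a run satisfies the gap constraint on its own, the combined word and its complement never overlap in this sketch; the overlap handling and error tolerance discussed in the text are deliberately left out.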
PA () {
  (1) Scan the sequence and build TT by using the open hash table
  (2) Sort TT in order of position
  (3) Combine adjacent tokens and palindromes
  (4) Select combined tokens longer than the minimum length of a palindromic pair and store the selected ones in WT
  (5) return WT
}
Figure 15. Palindromic algorithm

The combined tokens are not considered words yet. Only the ones whose length is at least the minimum word length are stored in the word table (WT). For instance, if the minimum word length is 9, the 8-character combined token gataaact in Figure 14 is not selected for the word table.

3.2. Palindromic pattern algorithm

Knowing palindromes is useful, but only for predicting potential hairpins. Since structural information is important, we try to find more complicated structures (palindromic patterns). Even though we can generate very complicated patterns using multiple palindromes, we only focus on -pair palindromic patterns in counting the number of patterns in our experiment. The algorithm that generates all possible n-pair palindromic patterns is explained in Section 3.2.1. After generating all possible palindromic patterns we count each palindromic pattern to find the dominant patterns. Furthermore, for visualization, we present a simplification of the patterns, explained in Section 3.2.2. It may be possible to create a specific n-pair palindromic pattern that identifies a DNA sequence if we use a large enough number of palindromic pairs; this algorithm is detailed in Section 3.2.3.

3.2.1. Generating all n-pair palindromic patterns

This section explains an algorithm for generating n-pair palindromic patterns, called the palindromic pattern algorithm (PPA). This algorithm can be applied to generate patterns for any n. When words 1, ..., n and their complements exist in a string s, they are ordered such that word i+1 (i < n) always appears after word i and, by convention, each complement appears after its word. The words 1, ..., n and their complements can form various combinations, called n-pair palindromic patterns: for example, one pattern, (1, 1), for n = 1 and three patterns, (1, 1, 2, 2), (1, 2, 1, 2), and (1, 2, 2, 1), for n = 2. When n is small, one can build the possible patterns intuitively. But when n is big there are many multi-pair palindromic patterns. As an example, the case of 3-pair palindromic patterns is shown in Figure 16. First, the three words 1, 2, and 3 are assigned to the root node, where 2 occurs after 1 and 3 occurs after 2. Positioning the complement of 3 generates one child node, because it can occur only after 3; positioning the complement of 2 generates three child nodes, because it has three possible positions: between 2 and 3, between 3 and the complement of 3, and after the complement of 3. We follow the Closest Position First (CPF) rule. Finally, by positioning the complement of 1 in each possible location, fifteen 3-pair palindromic patterns are generated. An integer value is assigned to each pattern from left to right in the leaf nodes in Figure 16, where each number stands for a word.

Figure 16. 3-pair palindromic patterns (pattern tree whose fifteen leaf patterns are numbered 0 to 14)

The number of leaf nodes for n = 1, 2, and 3 is 1, 3, and 15 respectively. Let L(n) denote this sequence. Intuitively we can see that the number of leaf nodes, L(n), in a palindromic pattern tree can be calculated by multiplying L(n-1) by 2n-1, where L(1) = 1. Notice that 2n-1 is simply the number of words in the penultimate level of the tree. For example, for 2-pairs the number of child nodes is 3, and there are 5 positions to insert into when building 3-pair patterns.

$$L(1) = 1, \qquad L(n) = (2n-1)\,L(n-1) = (2n-1)(2n-3)\cdots 7\cdot 5\cdot 3\cdot 1$$

* String I = the words in order (ex. 1 2 ... n)
* Pattern S = {I} // store set of strings
* Stack = store complements (ex. the complements of 1 ... n)
PPA (Stack, S) {
  If (Stack is empty) return S End if
  c = Stack.pop ()
  For (each e in S)
    S' = S' & <insert c into all possible places in e, starting from the place closest to its word>
  End of For
  return PPA (Stack, S')
}
Figure 17. Palindromic pattern algorithm

3.2.2. Simplification of palindromic patterns

The 3-pair palindromic pattern generates 15 different patterns. Higher-dimensional palindromic patterns generate more. Some of these patterns are isomorphic when viewed as a graph. To reduce the number of pattern types we disregard direction, so that some patterns form the same shape. For example, two patterns that differ only in direction become one type by graph similarity. This is illustrated in Figure 18.

Figure 18. Similar pattern types in 3 dimensions

3.2.3. Structural representation algorithm

Above we explained an algorithm to generate all possible n-pair palindromic patterns. We can easily expect that if we have enough palindromic pairs collected (ex. 0 or more), then a palindromic pattern can identify a certain DNA sequence.
We call this a structural representation; the algorithm is called the structural representation algorithm (SRA). The SRA gets the sequence of pairs as an input parameter. Suppose (1) the input is two rows of data that store the positions of the palindromic pairs. (2) We sort them by the position of the words, and (3) assign the same integer id to each word and its palindrome. (4) We merge the two rows into one long row, (5) sort the row by position, and (6) read off the integer ids.

* PW is the positions of words and their complements
SRA (PW) {
  (1) sort them by position of words
  (2) assign an integer tag to each word and its complement
  (3) make the two rows into one long row
  (4) sort the row by position
  (5) return the integer tags
}
Figure 19. Structural representation algorithm
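The pattern generation of Figure 17 and the structural representation of Figure 19 can be sketched together in Python. The function names are hypothetical, the positions in the usage example are made up for illustration, and the iterative loop replaces the recursion over the stack; a pattern is a list of integers in which the first occurrence of a number is the word and the second is its complement.

```python
def generate_patterns(n):
    """Enumerate all n-pair palindromic patterns (a sketch of PPA)."""
    patterns = [list(range(1, n + 1))]       # root node: words 1..n in order
    for word in range(n, 0, -1):             # complement of the last word first
        next_patterns = []
        for pattern in patterns:
            first = pattern.index(word) + 1  # a complement follows its word
            # Closest Position First: try the nearest slot, then move right.
            for pos in range(first, len(pattern) + 1):
                next_patterns.append(pattern[:pos] + [word] + pattern[pos:])
        patterns = next_patterns
    return patterns


def structural_representation(pairs):
    """Turn (word position, complement position) pairs into integer tags (SRA)."""
    ordered = sorted(pairs)                  # sort by word position
    row = []
    for tag, (word_pos, comp_pos) in enumerate(ordered, start=1):
        row.append((word_pos, tag))          # same tag for a word ...
        row.append((comp_pos, tag))          # ... and its complement
    row.sort()                               # one long row, sorted by position
    return [tag for _, tag in row]


if __name__ == "__main__":
    print(generate_patterns(2))
    print(len(generate_patterns(3)))
    # Hypothetical positions: the second pair is nested inside the first.
    print(structural_representation([(150, 176), (1, 258)]))
```

The number of generated patterns follows L(n) = (2n-1) L(n-1), and the tag sequence returned for n pairs is always one of the patterns enumerated for that n.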

4. Experiment

The purpose of our research is to identify palindromes globally. Experiments were conducted on the DNA bases of Escherichia coli K-12 (E. coli K-12) [13], Escherichia coli O157 (E. coli O157) [12], and Salmonella [11], but only the result for E. coli K-12 (ACCESSION U00096, LOCUS ECOLI, 4,639,221 bp DNA, circular, BCT, SEP-1997) is presented in this paper, since the purpose of this conference is more related to computation than to biological results. The results for the remaining sequences can be provided upon request. The size of each of these genomes is around 5 MB. Even though this is not the size of the human genome, it is big enough to test the performance of our algorithm. We will focus on the number of palindromic pairs collected, the number of 3-pair palindromic patterns, and the structural representation of this genome. The computer that we used is a UNIX Sun Ultra 60 workstation. We used C++ for the implementation.

5. Analysis

Time complexity is one of the main evaluation criteria. To measure the time complexity we list the main processes in Table 1. The time complexity for scanning and building a token table is O(LD), where LD is the size of a DNA sequence; the space complexity is O(LD · LTK), where LTK is the length of a token. The space complexity can be reduced by storing pointers to the tokens. Sorting the token table takes O(TT log TT), where TT is the size of the token table, but generally LD is much bigger than TT, so the time complexity remains O(LD). CTT is also much smaller than LD and TT, where CTT is the size of the combined token table. We can therefore still claim that the time complexity of this algorithm is O(LD). Tsunoda [16] introduced an algorithm with O(N log N), where N is the size of the DNA sequence. Logically our algorithm has linear time complexity in the length of a DNA sequence, but it is still affected by LTK when LTK is significantly long.

Table 1. Main processes

    Processing                               Execution time    Memory space
    1. Scanning and building token table     O(LD)             O(LD · LTK)
    2. Sorting the token table               O(TT log TT)      O(TT)
    3. Building adjusted token table         O(CTT)            O(CTT)
    4. Collecting palindromic pairs          O(CTT log CTT)

    * LD is the size of a DNA sequence
    * LTK is a token size
    * TT is the size of token table
    * CTT is the size of a combined token table

The following focuses on the efficiency of PA: the number of long palindromic pairs. We also introduced a palindromic pattern algorithm (PPA) and a structural representation algorithm (SRA); we present the results of these algorithms as well. The input sequence was the E. coli K-12 genome. Other options were the token length (50), minimum length of a palindromic pair (50), minimum complement gap (0), maximum complement gap (without limitation), minimum word gap (0), and maximum word gap (without limitation). We chose a token length of 50 because this length produced a reasonably sized token table. We did not assign any constraints to the gaps and errors, to simplify our experiment.

The base counts were 1,142,136 for a, 1,179,433 for c, 1,176,775 for g, and 1,140,877 for t. There were no characters other than the four base characters. The longest word pair length was ,456, which is unusual in random sequences: in a random sequence of length 4,639,221, the maximum word length expected in the random case is $\log_4 4{,}639{,}221$. The number of word pairs with length 986 and longer was 4, the number of word pairs with length between 58 and 985 was 7, and the number of word pairs with length between 50 and 57 was 06. The total number of word pairs longer than 50 was 7.

The results from E. coli K-12 were surprising because there were a lot of very long palindromic pairs. We confirmed that the random-case value is generally the longest word pair length in a random sequence of length 4,639,221. Three random sequences of 4,639,221 characters were generated and tested; in all three, the longest word length matched this value. Intuitively, in the random case the average distance between repetitions of a word of length L = ,456 is $4^{L}$, where 4 is the number of different characters.
The probability of a repetition of the word in a sequence of length 4,639,221 is $4{,}639{,}221 / 4^{L}$ for a word of length L. Since we are looking for a palindromic pair of arbitrary length within a sequence, this value is adjusted accordingly; the probability of having a palindromic pair of length ,456 in a random sequence of length 4,639,221 remains negligibly small. This shows that E. coli K-12 is not a random sequence. E. coli O157 [12] and Salmonella [11] also had much longer palindromic pairs than expected, which indicates that these sequences are not random either.

The numbers of patterns of types 0 to 14 are listed in Table 2. The total pattern number in three dimensions was ,080,76. A bigger number means that there are more multi-pair palindromic patterns in a string that are of interest to a biologist. The actual running time on a UNIX Sun Ultra 60 workstation was .04, .0, and .0 minutes for three tests (average .0 minutes), of which .7, .5, and .49 minutes were spent collecting palindromic pairs (average .40 minutes).
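The random-sequence baseline used in this analysis can be reproduced with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the word length passed to `log10_repeat_probability` is a free parameter rather than a value taken from the paper. It evaluates $\log_4 N$ for the genome length N = 4,639,221, which comes to roughly 11, and computes the order of magnitude of $N / 4^{L}$ in log space so that very long words do not underflow.

```python
import math

GENOME_LENGTH = 4_639_221   # E. coli K-12 sequence length used in the text
ALPHABET_SIZE = 4           # a, c, g, t


def random_max_word_length(n, alphabet=ALPHABET_SIZE):
    """Word length whose expected repeat distance equals n, i.e. log_alphabet(n)."""
    return math.log(n, alphabet)


def log10_repeat_probability(n, word_length, alphabet=ALPHABET_SIZE):
    """Base-10 logarithm of n / alphabet**word_length.

    Working in logarithms keeps the result representable even when
    alphabet**word_length is astronomically large.
    """
    return math.log10(n) - word_length * math.log10(alphabet)
```

For any word length in the hundreds or thousands, `log10_repeat_probability(GENOME_LENGTH, word_length)` is a large negative number, which is the quantitative content of the claim that such long pairs are not expected in a random sequence.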

Table 2. Summarized results of E. coli K-12

    Description                              Value
    Token length                             50
    Minimum length of a palindromic pair     50
    Minimum complement gap                   0
    Maximum complement gap                   Without limitation
    Minimum word gap                         0
    Maximum word gap                         Without limitation
    Total DNA sequence                       4,639,221
    a                                        1,142,136
    c                                        1,179,433
    g                                        1,176,775
    t                                        1,140,877
    others                                   0
    Longest pair length                      ,456
    Longer than or equal to 986              4
    Between 58 and 985                       7
    Between 50 and 57                        06
    Total palindromic pairs                  7
    Type 0                                   5,488
    Type 1                                   46,77
    Type 2                                   ,6
    Type 3                                   07,06
    Type 4                                   0,55
    Type 5                                   5,7
    Type 6                                   ,547
    Type 7                                   95,46
    Type 8                                   95,955
    Type 9                                   88,0
    Type 10                                  0,65
    Type 11                                  85,99
    Type 12                                  06,084
    Type 13                                  67,94
    Type 14                                  40,977
    Total palindromic pattern                ,080,76

We put similar patterns together and show the results in Table 3. The pattern similarity was explained in the section on simplification of palindromic patterns.

Table 3. Combined similar pattern types in E. coli K-12

    Description                              Value
    Pattern 0                                5,488
    Pattern 3                                07,06
    Pattern 4                                0,55
    Pattern 9                                88,0
    Pattern 12                               06,084
    Pattern 14                               40,977
    Patterns 1, 5                            7,964
    Patterns 2, 10                           45,7
    Patterns 6, 7, 8                         0,98
    Patterns 11, 13                          5,94
    Total pattern                            ,080,76

We chose a number of the longest palindromic pairs, as shown in Table 4; the number was picked arbitrarily. We identify each pair by index and present the word length and the start and end positions of each word and its complement, from the left. The end position of a word or its complement can be calculated by adding the word length to the start position. To make overlaps easy to check, we present the end position in the table as well.

[Table 4. Long word pairs in E. coli. Columns: index, word length, word start position, word end position, complement start position, complement end position]

These pairs can identify L(n) different types, where n is the number of pairs chosen and L(n) is the product formula given earlier for the number of leaf nodes. We represent the words and complements in the order of their positions. Index 9 appeared first and then index 8, and so on.
Each number represents the index of a word and its complement (the first of two identical numbers represents the word and the second its complement). Instead of indexing by order of palindrome length, we changed the index with respect to the position of the words. We call this sequence a multi-pair palindromic pattern structural representation of E. coli K-12. When we obtained this structural representation for the other two genomes (E. coli O157 and Salmonella), we observed that they differ from each other. Even though we could not prove theoretically that this value can identify genomes, it appeared that different genomes have different structural representations.

Jensen visualized complete microbial chromosomes in a circle so that repetitive sequences in base composition or DNA structure became visible [10]. We borrowed their idea and plotted some of the longest pairs to see the DNA structure. Plotting all the palindromic pairs would be confusing, and we are more interested in the structure of some very long pairs. The five longest word pairs are dotted on a circle in Figure 10.

One of the biggest surprises in genetics is a loop in mRNA. It seems that eukaryotic genes contain loads of junk messages, that is, non-coding sequences. The genes themselves are fragmented [5]. E. coli K-12 is not a eukaryotic genome, but it has similar loops. The 1st, 2nd, and 5th palindromic pairs are close together; therefore, matching them can generate small loops. Genomes are reported to be symmetric about the origin of replication, which indicates that matching sequences tend to occur at the same distance from the origin [6]. However, the circle with the five longest word pairs in Figure 10 does not show any symmetric pattern. The only observation we can make is that the 3rd and 4th pairs show a somewhat symmetric pattern. E. coli O157 and Salmonella did not show symmetry either.

[Figure 10. Word pairs dotted in a circle. Legend: 1st pair, 2nd pair, 3rd pair, 4th pair, 5th pair; the origin is marked on the circle]

6. Related work

Bailey [15] developed a program that can detect palindromes, but because it only applies a complementary symmetry matrix at each position, it could not detect general palindromic sequences with gaps of arbitrary length. In his algorithm, the time and memory complexity explode when arbitrary gaps and lengths are considered. To solve this problem, Tsunoda [16] devised an algorithm that extracts variable-length palindromes, allowing gaps of variable length. They classified repetitive patterns into four cases: (1) direct repeats, (2) trans-strand repeats, (3) backward repeats, and (4) inverted repeats (palindromes). They clearly described the necessity of the various types of repetitive patterns. Their algorithm searched for patterns of those four types with O(N log N) time complexity, where N is the size of the sequence; they found exact matches.
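The four repeat classes can be made concrete with a small sketch. It rests on an assumption about terminology that the text does not spell out: a direct repeat is taken to be an identical copy, a backward repeat a reversed copy, a trans-strand repeat a complemented copy, and an inverted repeat (palindrome) a reverse-complemented copy. The function names are hypothetical, and the search is a naive scan for exact matches, not the O(N log N) method of [16] or the PA of this paper.

```python
# Watson-Crick complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def partner(word, kind):
    """Return the string that `word` must be followed by to form a repeat of `kind`."""
    if kind == "direct":
        return word
    if kind == "backward":
        return word[::-1]
    if kind == "trans-strand":
        return word.translate(COMPLEMENT)
    if kind == "inverted":
        return word.translate(COMPLEMENT)[::-1]
    raise ValueError(f"unknown repeat kind: {kind}")


def find_repeats(seq, start, length, kind):
    """List the start positions, to the right of the word, where its partner occurs.

    The gap between the word and its partner may have any length;
    overlaps with the word itself are excluded.
    """
    target = partner(seq[start:start + length], kind)
    hits = []
    pos = seq.find(target, start + length)
    while pos != -1:
        hits.append(pos)
        pos = seq.find(target, pos + 1)
    return hits
```

For the sequence `"GAATTGCCCCAATTC"`, the word `GAATT` at position 0 has the inverted partner `AATTC`, so `find_repeats("GAATTGCCCCAATTC", 0, 5, "inverted")` returns `[10]`: a palindromic pair separated by a gap of five bases.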
Porto [2] allowed errors within palindromes; that is, they introduced approximate palindromes. Their algorithm found all approximate complements, with K errors, of a certain word. The difference is that our algorithm finds a complement for each word. There have been parallel palindrome-searching algorithms that find all palindromes of a certain initial word, allowing complement gaps [1,3].

Jensen [10] presented a method for visualizing complete microbial chromosomes so that repetitive sequences (direct repeats, inverted repeats, etc.) or DNA structure became visible. It is difficult to get an overview of a complete genome due to its size. They used an atlas (a colored circle) for representing sequences and structures. In our research we plotted palindromic pairs. We hope this can give insight into the symmetry of a genome.

Eisen [6] found symmetry around the replication origin and terminus; that is, the distance of a particular conserved feature (DNA or protein) from the replication origin is conserved between closely related pairs of species. They also found statistically significant x-shaped patterns within some genomes, indicating that there is symmetry about the replication origin. However, in our structural representation and plotting of palindromic pairs, the results did not show a clear symmetric shape.

7. Concluding remarks

We found variable-length palindromic pairs instead of fixed-length pairs. The method of finding fixed-length palindromic pairs has the disadvantage that the length must be set even though the length that has biological meaning is not known. Even though we can allow gaps, we did not assign any constraints on them. We also collected overlapping palindromes. Some errors were allowed in a word or a complement because of mismatches in the form of gaps and defects in a sequence itself [4,7]. We propose a new palindrome algorithm (PA) using a Break and Gather mechanism.
Surprisingly, this algorithm can collect palindromic pairs with time complexity O(LD), where LD is the length of a DNA sequence, which is faster than the O(LD log LD) in [16]. PA consists of four major processes: scanning the sequences and building a token table, sorting the token table, building the combined token table, and collecting palindromic pairs. Another advantage of the Break and Gather method is its easy adaptability to other pattern-searching problems.

We tested our algorithm against the well-known E. coli K-12 genome, as well as other sequences. The longest palindromic pair found was much longer than the maximum length of word pairs that can occur in random sequences, and the number of long palindromic pairs was much larger than we expected. The longest word pair length is ,456, far above the maximum length expected for random palindromic pairs in sequences of that length. There are 4 pairs with length 986 or longer and 7 pairs with length between 58 and 985. We concluded that the sequences of E. coli K-12 are not random sequences, because the probability of having pairs of size ,456 is significantly low. E. coli O157 and Salmonella were not random either.

By composing multiple palindromic pairs we could generate multi-pair palindromic patterns. We proposed a palindromic pattern algorithm (PPA) that generates all possible combinations of n-pair palindromic patterns. PPA can be applied to generate any n-pair patterns. Using it, we counted the existing 3-pair palindromic patterns. In addition, we picked a set of the longest palindromic pairs and represented them as a multi-pair palindromic pattern. We also presented a structural representation algorithm (SRA) for this. The structural representations from different DNA sequences (E. coli K-12, E. coli O157, and Salmonella) yielded different structures, so we assume that this method can identify different genomes.

We also dotted the five longest palindromic pairs on a circle representing a sequence to see the symmetry of a genome. Some research showed symmetry of a genome [6], but under this approach E. coli K-12, E. coli O157, and Salmonella did not show this symmetric shape.

Our algorithm is not a complete algorithm; in some very artificial sequences, even though a palindromic pair exists, our algorithm cannot detect it. To compensate for this weakness we briefly proposed another algorithm; implementing this new approach is one of our future works. Mismatches in subsequences are not considered; we focused on extracting completely matched subsequences, since in general it is difficult to estimate how mismatches affect the forming of the structure of nucleic acids. Our Break and Gather method can also be used for finding direct repeats with O(N) time complexity, where N is the length of a sequence. Our algorithm can be modified to scan RNA and protein by allowing other characters such as U.
We can extend this research to the human genome. Furthermore, we need to examine whether our structural representation scheme can identify DNA sequences, by studying multiple sequences experimentally.

8. Acknowledgement

We thank Timothy J. Atkinson for his valuable comments.

9. References

[1] A. Apostolico, D. Breslauer, and Z. Galil, Parallel Detection of All Palindromes in a String, Theoretical Computer Science, 141, 1995, 163-173.
[2] A.H.L. Porto and V.C. Barbosa, Finding Approximate Palindromes in Strings, Pattern Recognition 35, 2002.
[3] D. Breslauer and Z. Galil, Finding All Periods and Initial Palindromes of a String in Parallel, Algorithmica 14:4, 1995.
[4] D. Gusfield, Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, Cambridge University Press, New York, NY, 1997.
[5] L. Gonick and M. Wheelis, The Cartoon Guide to Genetics, Barnes & Noble, New York.
[6] J.A. Eisen, J.F. Heidelberg, O. White, and S.L. Salzberg, Evidence for Symmetric Chromosomal Inversions Around the Replication Origin in Bacteria, Genome Biology 1(6), 2000.
[7] J. Jurka, Origin and Evolution of Alu Repetitive Elements, in R.J. Maraia (Ed.), The Impact of Short Interspersed Elements (SINEs) on the Host Genome, R.G. Landes, New York, NY, 1995, pp 5-4.
[8] J.T.L. Wang, Discovering Active Motifs in Sets of Related Protein Sequences and Using Them for Classification, Nucleic Acids Research, 22(14), 1994.
[9] K. Shishido, N. Komiyama, and S. Ikawa, Increased Production of a Knotted Form of Plasmid pBR322 DNA in Escherichia coli DNA Topoisomerase Mutants, Journal of Molecular Biology.
[10] L.J. Jensen, C. Friis, and D.W. Ussery, Three Views of Microbial Genomes, Res. Microbiol. 150, 1999.
[11] National Center for Biotechnology Information, Complete genome sequence of Salmonella enterica serovar Typhimurium LT2, 00, nucleotide&list_uids=67690&dopt=genbank, 00.
[12] National Center for Biotechnology Information, Genome sequence of enterohaemorrhagic Escherichia coli O157, 00, t_uids=6445&dopt=genbank, 00.
[13] National Center for Biotechnology Information, The Complete Genome Sequence of Escherichia coli K-12, 1997, t_uids=67994&dopt=genbank, Apr. 00.
[14] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms, McGraw-Hill, New York.
[15] T.L. Bailey, Discovering Motifs in DNA and Protein Sequences, Univ. of California at San Diego (Ph.D. dissertation), 1995.
[16] T. Tsunoda, M. Fukagawa, and T. Takagi, Time and Memory Efficient Algorithm for Extracting Palindromic and Repetitive Subsequences in Nucleic Acid Sequences, Pacific Symposium on Biocomputing 4, 1999, pp. 202-213.
[17] X. Guan and E.C. Uberbacher, A Fast Look-Up Algorithm for Detecting Repetitive DNA Sequences, Pacific Symposium on Biocomputing, Singapore, 1996.
