Stringhe - · PDF file1 1 2 Stringhe Stringa = dato che non si scompone, ma puo’ essere...

1

1

2

StringheStringhe

Stringa = dato che non si scompone, ma puo’ essere scritto come una lunga sequenza di caratteri

• abbbabababcbabcccba• 01010001001010• Nel#mezzo#del#cammin#di#nostra#vita#• ACCCATGATCGTACTCG

2

3

Algoritmi e strutture dati su stringheAlgoritmi e strutture dati su stringhe

• Text editing• Linguistica computazionale• Compressione di dati• Biologia computazionale

Nota: alcune delle slides seguenti sono tratte da http://www.theory.lcs.mit.edu/

Applicazioni principali:

4

TextText editingediting

• grep string-matching, regular-expression-matching

• agrep approximate-pattern-matching• sed repetition-finding, pattern-matching• awk multi-pattern-matching• lex regular-expression-matching• yacc e twig tree-pattern-matching• diff LCS, edit-distance

Esempio: alcuni comandi e linguaggi in Unix

3

5

Linguistica Linguistica computazionalecomputazionale

Dizionario = testo statico organizzato in modo da velocizzare gli accessi

(per esempio tramite indice)

Applicazioni :• Parsing linguaggi naturali • Image processing 2D-pattern-matching,

dictionary-of-sub-images

Alcune strutture dati per indici sono:

suffix-trees, suffix-arrays, DAWG, Factor-automata

6

Compressione dei datiCompressione dei dati

• Compressione di testi (senza perdita di informazione)

• Compressione di immagini o suoni (con perdita di informazioni)

Applicazioni: • telecomunicazioni (fax,…) • analisi linguaggi naturali (dizionari, …) • biologia computazionale• …

Importante: ridondanze e regolarita’ nel testo

4

7

8

5

9

Biologia molecolareBiologia molecolare

Le molecole di DNA e RNA possono essere viste come stringhe:

DNA: CAGCTGATCGATGCTAGCCTGRNA: ACGUACGUAUUGCAUCGUAC

Alfabeto dei nucleotidi (basi, bp):A = adenina, C = citosina, G =guanina, T = timinaU = uracile

A - T C - G

10

6

11

12

Studiare il DNAStudiare il DNA

• Sequencing = determinare la sequenza completa dei nucleotidi

di una lunga stringa di DNA

Ø trascrizione dal gel

Ø ricostruzione (fragment assembly)

• Collezione: costruzione e ricerca in enormi database

• Analisi: Ø Mapping

Ø Alignment

Ø Ricerca di similarita’, ridondanze, regolarita’

Ø …grande quantita’ di algoritmi su stringhe…

7

13

GenomaGenoma

Il genoma e’ tutto il DNA in un organismo

• http://www.pbs.org/wgbh/nova/genome/• http://www.ornl.gov/hgmis/ (Human Genome Project)• http://gdbwww.gdb.org/gdb/report.html (Genome

Database)• …

Nota: strutture dati centrali in alcuni progetti genome-scalesono suffix-trees, generalized-suffix-trees e suffix-arrays

14

HumanHuman GenomeGenome ProjectProject

Begun in 1990, the U.S. Human Genome Project is a 13-year effortcoordinated by the Department of Energy and the National Institutesof Health. The project originally was planned to last 15 years, buteffective resource and technological advances have accelerated the expected completion date to 2003. Project goals are to

• identify all the approximately 30,000 genes in human DNA,

• determine the sequences of the 3 billion chemical base pairsthat make up human DNA,

• store this information in databases, • improve tools for data analysis, • transfer related technologies to the private sector, and • address the ethical, legal, and social issues (ELSI) that may arise

from the project. • Several types of genome maps have already been completed, and a

working draft of the entire human genome sequence wasannounced in June 2000, with analyses published in February 2001.

8

15

Q. Q. What'What's a s a genomegenome? And ? And whywhy isis ititimportantimportant??

• A genome is all the DNA in an organism, including its genes. Genescarry information for making all the proteins required by allorganisms. These proteins determine, among other things, how the organism looks, how well its body metabolizes food or fightsinfection, and sometimes even how it behaves.

• DNA is made up of four similar chemicals (called bases and abbreviated A, T, C, and G) that are repeated millions or billions of times throughout a genome. The human genome, for example, has3 billion pairs of bases.

• The particular order of As, Ts, Cs, and Gs is extremely important. The order underlies all of life's diversity, even dictating whetheran organism is human or another species such as yeast, rice, or fruitfly, all of which have their own genomes and are themselves the focus of genome projects. Because all organisms are relatedthrough similarities in DNA sequences, insights gained fromnonhuman genomes often lead to new knowledge about humanbiology. [01/01]

16

Q. Q. HowHow big big isis the the humanhuman genomegenome??

The human genome is made up of DNA, which has four different chemical building blocks . These are called bases and abbreviated A, T, C, and G. In the humangenome, about 3 billion bases are arranged along the chromosomes in a particular order for each unique individual. To get an idea of the size of the human genome present in each of our cells , consider the following analogy: If the DNA sequenceof the human genome were compiled in books , the equivalent of 200 volumes the size of a Manhattan telephone book (at 1000 pages each) would be needed to hold it all.

It would take about 9.5 years to read out loud (withoutstopping) the 3 billion basesin a person's genome sequence. This is calculated on a reading rate of 10 basesper second, equaling 600 bases /minute, 36,000 bases /hour, 864,000 bases /day, 315,360,000 bases /year.

Storing all this information is a great challenge to computer experts known as bioinformatics specialists . One million bases (called a megabase and abbreviated Mb) of DNA sequence data is roughly equivalent to 1 megabyte of computer data storage space. Since the human genome is 3 billion base pairslong, 3 gigabytes of computer data storage space are needed to store the entire genome. This includes nucleotide sequence data only and does notinclude data annotations and other information that can be associated withsequence data.

9

17

SequencingSequencing

Sequencing una piccola porzione di DNA (circa 500 bp) e’ fattibile (tramite macchinari o a mano) mentre una lunga no.

Sequencing di una lunga porzione si ottiene mettendo tutti i pezzi piccoli in un computer. Il problema e’ di rimettere insieme i pezzi correttamente.

18

10

19

20

11

21

22

12

23

24

Problemi algoritmici nel Problemi algoritmici nel sequencingsequencing

• Trascrizione 2D-pattern-matching

• Fragment assembly

shortest-superstring, suffix-prefix

13

25

CollezioneCollezione

Gli archivi di sequenze e il software per la ricerca in essi sono diventati strumenti essenziali nella moderna biologia molecolare

‘It would be hard to find a recent DNA-baseddiscovery that didn’t use these tools’(G. Lennon)

26

DatabaseDatabase

• I piu’ grandi archivi contengono sequenze per una lunghezza totale di 500.000.000 nucleotidi; questi numeri crescono di circa il 75% all’anno

• Il ‘nonno’ degli archivi e’ GenBank, incaricato di mantenere tutte le sequenze di DNA mai rese pubbliche

• In Europa Swiss-Prot e’ il piu’ grande archivio di sequenze di proteine; ha circa 400.000 richieste al mese

14

27

Analisi sequenze biologicheAnalisi sequenze biologiche

• Mapping

• Alignment

• Ricerca similarita’, ridondanza e regolarita’

• …

28

MappingMapping

Mapping = specificare le locazioni dove compaiono markers o copie di sottostringhe di interesse

Mapping genetico e fisico

I problemi di acquisizione e utilizzo di mappe fisiche sono piu’ vicini a problemi su stringhe

15

29

30

16

31

32

Problemi algoritmici nel Problemi algoritmici nel mappingmapping

Mapping fisico e sequencing sono correlati:

shortest-superstring, suffix-prefix

17

33

AlignmentAlignment

Alignment = allineare sequenze di caratteri inserendo spazi e permettendo ‘match’ e ‘mismatch’ fra i caratteri corrispondenti

Un alignment di qacdbd e qawxb:q a c _ d b dq a w x _ b _

Dualita’ con edit-transcription e edit-distance(programmazione dinamica). Varianti coinvolgono LCS, approximate-string-matching

34

Similarita’Similarita’

Primo fatto dell’analisi di sequenze biologiche:‘’Nelle sequenze biomolecolari (DNA, RNA, …)

una alta similarita’ fra sequenze di solito implica una significativa similarita’ strutturale o funzionale’’

‘’…everything in life is so similar, that the samegenes that work in flies are the ones that work in humans’’ (Wieschaus, Premio Nobel 1995)

18

35

RidondanzaRidondanza

‘’The vast majority of extant proteins are the result of a continous series of duplicationsand subsequent modifications. As a result, … so many new sequences resemblealready known sequences.’’ (Doolittle)

‘’…all of biology is based on an enormousredundancy…’’ (Doolittle)

36

Regolarita’Regolarita’

Molte tecniche algoritmiche su stringhe molecolari sfruttano regolarita’ e ripetitivita’ nelle stringhe

Strutture ripetitive nel DNA :• Palindrome : abccba, ‘was it a cat i saw’

• Palindrome complementate: AGCTCGCGAGCT• Tandem arrays: TTAGGG si ripete alla fine di ogni cromosoma

umano

• Alu repeats: sottostringhe di circa 300 bp che si presentano quasi identiche in tutto il genoma umano (ogni frammento di DNA umano piu’ lungo di 20000 bp ne contiene una) e contengono esse stesse ripetizioni di sottostringhe di circa 40 bp e spesso dei tandem arrays

• Genomic imprinting: funzione e origine misteriose

• ….

19

37

StringString--matchingmatching

‘’Il problema su stringhe fondamentale’’

• Fondamentale in molte applicazioni

• Compare come sottoproblema (per esempio: in biologia molecolare e’ sottoproblema di multiple-sequence- comparison, large-scale sequence-comparison, database-searching, …)

• Tecniche per string-matching si estendono ad altri problemi (per esempio: 2D-pattern-matching, string-matchingapprossimato, …) Nota: approssimato = alcuni errori di vario tipo sono ammessi (vedi agrep). E’ centrale in biologia molecolare.

38

StringString--matchingmatching

• Testo = stringa di lunghezza n• Pattern = stringa di lunghezza m• Problema = trova le occorrenze del pattern nel

testo (tutte o 1)

AGGCTCACGTATATATGCGTTATAATG

TATATATA

TATA

20

39

Algoritmi per Algoritmi per stringstring--matchingmatchingPreprocessing + search

• Basati su confronti:ØNaifØKnuth-Morris-PrattØBoyer – MooreØAutomi

• Semi-numerici:ØRabin-KarpØShift-And

• Altre strutture dati:ØSuffix - trees

40

Algoritmi per Algoritmi per stringstring matchingmatching

O(|A|n + m)Preprocess testo (o pattern)Suffix-trees

Piccoli patternShift-And

Caso medioRK

O(n+|A|m)Preprocess pattern; real-timesearch

Automi

O(n + m) in media

Preprocess patternBM

O(n + m)Preprocess patternKMP

O(n m )Naif

21

41

Calendario lezioniCalendario lezioni

Ø 1. Introduzione (questa!)

Ø 2. String-matching: algoritmi naif e KMP (CLR )

Ø 3. String-matching: algoritmi su automi e BM (idea) (CLR)

Ø 4. String-matching seminumerici : KR (CLR), Shift-And,

agrep (GUS 4.1, 4.2)

Ø 5. Suffix-trees, generalized suffix-trees, longest-

common-substring, (GUS 5.1,5.2,5.3,5.4, 6.4, 7.1,7.4)

Ø 6. Suffix-prefix, shortest-superstring (GUS 7.10, 16.17)

Stringhe - · PDF file1 1 2 Stringhe Stringa = dato che non si scompone, ma puo’ essere...

Documents

Transcript of Stringhe - · PDF file1 1 2 Stringhe Stringa = dato che non si scompone, ma puo’ essere...