Theoretical physics and computational biology.

M. Caselle

Napoli, March 2005

Outline

• Introduction to computational biology:

– Main lines of research
– Teaching resources and preprint databases
– Conferences
– FB11
– The Torino group

• Example: identification of regulatory sequences in yeast and in human.

– Comparative genomics: sequence alignment
– Gene expression data: microarrays
– Functional annotations: Gene Ontology

Main lines of research

• Protein folding

• Protein–protein / protein–DNA interactions

• DNA and RNA dynamics

• Genome annotation

• Algorithm development (clustering, sequence alignment)

• Study of problems related to cell motility

• ...

Resources

Teaching

Master's programmes

• "Bioinformatica: Applicazioni Biomediche e Farmaceutiche", Università di Roma La Sapienza. http://cassandra.bio.uniroma1.it/Master

• "Bioinformatica", Università di Torino. http://www.masterbioinformatica.it

• "Bioinformatica", Università di Milano Bicocca. http://www.btbs.unimib.it/master/bioinformatica2003.htm

PhD programmes

• "Sistemi complessi applicati alla biologia post-genomica", Università di Torino. http://www.bioinformatica.unito.it/complex-systems/welcome.html

Preprints

At the end of 2003 a new "quantitative biology" archive called q-bio was launched, alongside hep-th, hep-ph, etc.

The link is http://xxx.lanl.gov/archive/q-bio

Conferences

• Intelligent Systems for Molecular Biology

ISMB 2004, Glasgow, 31 July - 4 August

http://www.iscb.org/ismbeccb2004/

ISMB 2005, Detroit, 25-29 June

http://www.iscb.org/ismb2005/

ECCB 2005, Madrid, 27-30 September

http://www.eccb05.org/

• Research in Computational Biology

RECOMB 2004, San Diego, 27-31 March

http://recomb04.sdsc.edu/

RECOMB 2005, Boston, March 2005

Topics:

– Genomics

– Molecular sequence analysis

– Recognition of genes and regulatory elements

– Molecular evolution

– Protein structure

– Structural genomics

– Gene Expression

– Gene Networks

– Drug Design

– Combinatorial libraries

– Computational proteomics

– Structural and functional genomics

FB11: Application of theoretical physics methods to biological systems

Participating sections and members

Section   Local coordinator   Participants
BA        S. Stramaglia        7
BO        A. Bazzani           5
CT        A. Rapisarda         7
FI        S. Bagnoli          11
MI        C. Destri           12
NA        L. Peliti            6
PD        A. Stella            6
Pr        R. Burioni           4
RM2       S. Morante           3
SA        S. Scarpetta         2
TO        M. Caselle           5

Total                         68

The Torino group

• M. Caselle, Dept. of Theoretical Physics

• F. Di Cunto, Dept. of Molecular Biology

• I. Pesando, Dept. of Theoretical Physics

• P. Provero, Fondazione per le Biotecnologie

• D. Corà (PhD programme: Sistemi complessi ...)

• E. Curiotto (PhD programme: Sistemi complessi ...)

• L. Martignetti (PhD programme: Sistemi complessi ...)

• I. Molineris (PhD programme: Sistemi complessi ...)

• A. Re (PhD programme: Sistemi complessi ...)

• G. Sales (undergraduate thesis student)

Collaborations with

• C. Dieterich, Max Planck Inst. for Molecular Genetics, Berlin

• C. Herrmann, Laboratoire de Génétique et de Physiologie du Développement (LGPD), Marseille

• I. Sbrana, Department of Biology, University of Pisa

Research lines

1] Study of gene regulation. In particular:

• identification of new transcription factors in yeast using Gene Ontology, microarrays and correlations.

• use of comparative genomics methods (in particular the comparison between mouse and human) to identify new regulators in human.

2] Search for UTRs in human using comparative genomics and statistical methods (Markov chains).

3] Use of graph theory techniques to study coexpression and coregulation networks.

4] Study of fragile sites in human chromosomes.

5] Study of the interaction between DNA and transcription factors by means of molecular dynamics simulations.

References

• M. Caselle, F. Di Cunto, P. Provero

"Correlating overrepresented upstream motifs to gene expression: a computational approach to regulatory element discovery in eukaryotes."

BMC Bioinformatics 2002, 3:7, physics/0203013

• M. Caselle, D. Corà, L. Silengo, F. Di Cunto, P. Provero

"Computational identification of transcription factor binding sites by functional analysis of sets of genes sharing overrepresented upstream motifs"

BMC Bioinformatics 2004, 5:57, q-bio.GN/0310040

• M. Caselle, F. Di Cunto, P. Provero

"A computational approach to regulatory element discovery in eukaryotes"

Proceedings of the 2002 ECMTB conference, cond-mat/0305279

• M. Caselle, F. Di Cunto, M. Pellegrino and P. Provero

"Finding regulatory sites from statistical analysis of nucleotide frequencies in the upstream region of eukaryotic genes"

Proceedings of the International Workshop "Modelling Bio-medical signals", Bari, September 2001, physics/0201033

1. Introduction

Genome Structure

• The density of protein-coding and RNA-coding sequences decreases as the complexity of the organism increases. It is rather high in prokaryotes, low in S. cerevisiae, and very low in the human genome, where most of the DNA (∼ 99%) is non-coding.

• The biological role of the non-coding part of the DNA is poorly understood. The common lore is that it should be involved in the regulation of gene expression.

Gene regulation

Gene expression is tightly controlled and regulated:

• All cells in the body carry the full set of genes, but only express about 20% of them at any particular time.

• Different proteins are expressed in different cells (neurons, muscle cells, ...) according to the different functions of the cell.

As more and more complete genomes are decoded, it is becoming crucially important to understand how gene expression is regulated.

The challenge is now to identify and fully characterize the network of interactions among genes and their products in an organism.

The most important example of such interactions is the transcriptional regulation of protein-coding genes. Even if this is not the only regulatory mechanism of gene expression in eukaryotes, it is certainly the most widespread one.

The goal of our research project (as of many others in the world) is to reconstruct these interactions by comparing existing biological information (like the coregulation of sets of genes) with the statistical properties of the sequence data.

Transcription factors.

Transcription factors (TFs) act by binding to specific, often short (5-10 bp) DNA sequences in the upstream noncoding region of genes.

Regulatory network

TFs themselves are proteins produced by other genes.

The genome is a complex network of interactions between genes and their products. This network pattern is ubiquitous in postgenomic biology.

The problem.

However, computational detection of regulatory sites is a difficult task, especially in eukaryotes:

• the consensus sequences recognized by transcription factors are generally rather short (5-20 bp)

• they can be quite variable

• they are in general dispersed over large distances

• they are generally active in both orientations

A simple study of the relative frequencies of sequences can therefore be meaningless.

2. Our Strategy.

We have a few tools to attack the problem:

• Binding sites are often overrepresented. One can use this to separate the signal (binding site) from the noise (background upstream sequence).

• Binding sites are often evolutionarily conserved. One can use comparative genomics to recognize them.

• Genes which share the same functions may also share the same regulatory mechanisms. One may use microarray experiments or functional annotations to identify binding sites.

Overrepresented words in the upstream regions

Many binding sites are effective only when repeated many times in the upstream region of the gene they regulate.

Example: the word GATAAG (reverse complement CTTATC) is a known binding site for nitrogen-regulated genes. Examine the 500 bp upstream of two of them.


>YPR138C upstream sequence, from -500 to -1

TCCACCTTATCTCGGCGCCAAATCCTTATC

TCTCGTAGCTGGTTTGCCCGCGATAAGGCG

GGCGAGTTATTTTGAAGTTTTCCATAAACT

GGTTTTCCATCTCGAGGTTTTTCCTCGCTT

TCCACGCTATGACCCTTTTTAGTTAAGGTA

CCCGATGGCATACTTTATATATTATATATA

TATGTTAAGTTAATATGTTTTAGCAGATTT

GATATGCTGATATGCAGCACGGACTTTCCC

TCTCCTTGTCTTATCGCATCTTATCGCAAC

AATTTGATAGATATCTTCTCCCTTTCCTAT

CTTGTAGAATAAGGTTGTGTGCTTTGAGTC

TGATAGCCGTCTTCTTTCGGTCGCTTCTTC

TCTCTTTTGGTTCTTTGATTGTCTATTACA

ATCAATGCAGGCTAGTTAAGGGTCCAATCA

CTTTTGAAATTGTTTTGTAAAAAGCGAAGG

CATTTTTTTTTTAGAAGATACAATTGAAAA

CATATAGATTTAGAGTTCAC


>YIR028W upstream sequence, from -500 to -1

ATTCTCGGGTCTAATGTGGCTCGAGGGTAT

CTCTTATCGGTATTACTTTCTTATCAATGA

AAAATTTCTGCCAGGGAAAATGCGCCCGCT

TTTTTTCCGGCCATCCTTACTCGCTGTCGC

ATACAAAATAGCGCCTCTAATCTAGTTGCG

ATAAGGAATGTGTATGTGTAATTGAAGATC

CAGGATGTTTTCCTTTTCAGGGAGATGAGA

AGGAATAATAGGATGGATTGACCGCTTTGC

TGTCACGTCGATAAGGTTCCTTTAAAAATT

GTGTCCAATGATTAGCATAGAGAGGTAGAG

TATCAGAGAAACAAGTTTGTAATCGAGAAA

CTTGATCTGCTAGTGTTGAGCATAGAAGGC

TAGGAAAACATGGGGAAGAAAAAAAAAGTA

TAAATAATTAGCTTGATGAGTAGTTTGAAT

ATATATGTTACTTTAGTTTCCCTTTTTGAC

CTTTTATATTCATCTACATCTTGTGATATA

AAACATCAACAAAGACGAGA
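The repeated occurrences in the two upstream sequences above can be checked by a simple scan. The following is a minimal sketch, not the code used in the study: the function names are hypothetical, and it assumes that a word and its reverse complement are counted together and that overlapping occurrences are allowed. The fragment is the first 60 bp of the YPR138C upstream sequence.

```python
def reverse_complement(word: str) -> str:
    """Return the reverse complement of a DNA word (A<->T, C<->G)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(word))


def count_word(sequence: str, word: str) -> int:
    """Count start positions where the word or its reverse complement occurs."""
    targets = {word, reverse_complement(word)}
    k = len(word)
    return sum(sequence[i:i + k] in targets for i in range(len(sequence) - k + 1))


# First two lines (60 bp) of the YPR138C upstream sequence shown above.
fragment = ("TCCACCTTATCTCGGCGCCAAATCCTTATC"
            "TCTCGTAGCTGGTTTGCCCGCGATAAGGCG")

print(reverse_complement("GATAAG"))    # CTTATC
print(count_word(fragment, "GATAAG"))  # 3
```

Three occurrences in 60 bp is far more than expected for a 6-letter word, which is the kind of signal the method exploits.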

Our Proposal

First step: grouping of genes based on the motifs that are overrepresented in their upstream regions. To each possible word w we associate the set Sw of all the genes in whose upstream region the word w is overrepresented.

Second step: select those sets which show some kind of functional characterization, using microarray experiments or Gene Ontology annotations.

• Microarray: for each set Sw we compare the expression distribution within the set with the genome-wide one (using for example the Kolmogorov-Smirnov test).

• Gene Ontology: for each set Sw we compute the prevalence of all GO terms among the annotated genes in the set, and the probability that such prevalence would occur in a randomly chosen set of the same size:

– hypergeometric distribution to assess the significance of the intersection
– evaluation of the false discovery rate through comparison with randomly generated gene sets (using only the best P-value for each set as the criterion for the comparison)

The words which survive this analysis are candidates to be binding sites.
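For the microarray comparison mentioned in the second step, the two-sample Kolmogorov-Smirnov statistic is the largest distance between the empirical cumulative distributions of the two samples. The sketch below is only illustrative: the function name and the toy numbers are hypothetical, and it computes the statistic D alone, without converting it into a P-value.

```python
from bisect import bisect_right


def ks_statistic(sample, reference):
    """Largest absolute difference between the two empirical CDFs."""
    xs, ys = sorted(sample), sorted(reference)
    d = 0.0
    # The empirical CDFs only change at observed values, so it is
    # enough to evaluate them there.
    for v in set(xs) | set(ys):
        cdf_x = bisect_right(xs, v) / len(xs)
        cdf_y = bisect_right(ys, v) / len(ys)
        d = max(d, abs(cdf_x - cdf_y))
    return d


# Toy log-ratios: expression in a hypothetical set Sw versus the whole genome.
set_expression = [1.2, 1.5, 2.0, 1.8]
genome_expression = [-0.3, 0.1, 0.0, 0.4, 1.3, -0.8, 0.2, 0.6]
print(ks_statistic(set_expression, genome_expression))
```

A value of D close to 1 means the set is expressed very differently from the genome as a whole; a value close to 0 means the two distributions are nearly indistinguishable.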

The Gene Ontology Consortium, "Gene Ontology: tool for the unification of biology", Nature Genetics 25 (2000) 25.

The sets S(word)

• For each word (5 to 8 bp) compute its frequency in the upstream sequences of the whole genome considered as a single sample: these will be our reference frequencies.

• Then count the occurrences of the word in the upstream region of each gene separately.

• If the number of occurrences of the word in the upstream region of gene G is statistically significant (compared to a binomial distribution based on the above reference frequencies), then the gene G belongs to the set S(word).

Choices in our study on yeast:

• upstream sequence length: 500 bp

• probability cutoff P = 0.01
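The membership test described above can be sketched as a binomial right tail. This is a hedged illustration rather than the authors' implementation: the function names are hypothetical, and it assumes one Bernoulli trial per possible start position of the word in the 500 bp region, with the reference frequency as success probability.

```python
from math import exp, lgamma, log


def binomial_tail(n_obs: int, n_trials: int, p: float) -> float:
    """P(X >= n_obs) for X ~ Binomial(n_trials, p), with 0 < p < 1."""
    log_p, log_q = log(p), log(1.0 - p)
    total = 0.0
    for i in range(n_obs, n_trials + 1):
        # log of C(n_trials, i) * p^i * (1 - p)^(n_trials - i), via log-gamma
        log_term = (lgamma(n_trials + 1) - lgamma(i + 1) - lgamma(n_trials - i + 1)
                    + i * log_p + (n_trials - i) * log_q)
        total += exp(log_term)
    return total


def gene_in_set(n_obs, upstream_len, word_len, ref_freq, cutoff=0.01):
    """True if the observed count is significant at the given cutoff."""
    # Assumption: one trial per start position of the word in the region.
    n_trials = upstream_len - word_len + 1
    return binomial_tail(n_obs, n_trials, ref_freq) < cutoff


# A 6-letter word with a hypothetical reference frequency of 0.001 per position.
print(gene_in_set(4, 500, 6, 0.001))  # True: four copies are very unlikely by chance
print(gene_in_set(1, 500, 6, 0.001))  # False: a single copy is unremarkable
```

Working with logarithms keeps the binomial coefficients from overflowing, which matters for the much longer upstream regions considered later for the human genome.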

The Gene–Ontology filter.

For each set S(m) we computed the prevalence of all Gene Ontology (GO) terms among the annotated genes in the set, and the probability that such prevalence would occur in a randomly chosen set of genes of the same size.

For a given GO term t let K(t) be the total number of ORFs annotated to it in the genome, and k(m, t) the number of ORFs annotated to it in the set S(m). If J and j(m) denote the number of ORFs in the genome and in S(m) respectively, such probability is given by the right tail of the appropriate hypergeometric distribution:

$$P(J, K(t), j(m), k(m,t)) = \sum_{h=k(m,t)}^{\min(j(m),K(t))} F(J, K(t), j(m), h)$$

where

$$F(M, m, N, n) = \frac{\binom{m}{n}\binom{M-m}{N-n}}{\binom{M}{N}}$$

In this way a P-value can be associated to each pair made of a motif and a Gene Ontology term.
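The hypergeometric right tail defined above translates directly into code. The sketch below uses exact integer binomial coefficients; the function name and the numbers in the example are hypothetical and only illustrate the formula.

```python
from math import comb


def go_p_value(J: int, K: int, j: int, k: int) -> float:
    """Right tail of the hypergeometric distribution.

    J: ORFs in the genome, K: ORFs annotated to the GO term,
    j: ORFs in the set S(m), k: ORFs in S(m) annotated to the term.
    """
    upper = min(j, K)
    # Sum of the numerators of F(J, K, j, h) for h = k .. min(j, K);
    # the common denominator C(J, j) is divided out once at the end.
    favourable = sum(comb(K, h) * comb(J - K, j - h) for h in range(k, upper + 1))
    return favourable / comb(J, j)


# Hypothetical example: 6000 ORFs, 40 annotated to a term,
# a set of 50 ORFs of which 5 carry the annotation.
print(go_p_value(6000, 40, 50, 5))
```

Summing exact integers and dividing once avoids the rounding errors that accumulate when many tiny probabilities are added.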

False discovery rate

Problem:

Given the huge number of P-values that we compute (in principle equal to the number of GO terms multiplied by the number of words analysed), it is clear that very low P-values could appear simply by chance.

The usual way of dealing with this issue, the Bonferroni correction, is not appropriate: due to the hierarchical nature of the Gene Ontology annotation scheme, the P-values we compute are very far from being independent of each other.

Our proposal

We randomly generated a large number NR of sets of genes, comparable in size to the typical sets associated to the motifs, and ranked the random sets by their best P-values.

In this way we can determine a false discovery probability pf(C) as a function of the cutoff C on the P-values.

Warning:

The lower the required FDR, the higher the precision needed in determining the function pf(C), and hence the larger the number NR of sets to be generated randomly. For instance, an FDR of 0.01 requires the generation of 3.5 × 10^6 randomly chosen sets.
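One way to read the procedure above is that pf(C) is the fraction of random sets whose best P-value falls at or below the cutoff C. The sketch below follows that reading, which is an assumption; the function names are hypothetical, and the best P-values of the random sets are taken as already computed (here a toy list of ten values).

```python
from bisect import bisect_right


def false_discovery_probability(random_best_pvalues, cutoff):
    """Fraction of random gene sets whose best P-value is <= cutoff."""
    ranked = sorted(random_best_pvalues)
    return bisect_right(ranked, cutoff) / len(ranked)


def largest_cutoff(random_best_pvalues, target):
    """Largest observed P-value C with pf(C) <= target, or None if there is none."""
    ranked = sorted(random_best_pvalues)
    passing = [c for c in ranked
               if bisect_right(ranked, c) / len(ranked) <= target]
    return passing[-1] if passing else None


# Toy best P-values from ten hypothetical random sets.
best = [0.5, 0.2, 0.03, 0.001, 0.3, 0.08, 0.6, 0.04, 0.9, 0.15]
print(false_discovery_probability(best, 0.05))  # 0.3
print(largest_cutoff(best, 0.2))                # 0.03
```

With only ten random sets pf(C) moves in steps of 0.1, so a target of 0.01 cannot be resolved at all; this is the reason the number NR has to grow as the required FDR decreases.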

The microarray filter

DNA microarrays can estimate genome-wide gene expression levels by measuring the amount of mRNA in the cell. Thousands of genes can be studied simultaneously on a single microchip.

The result of the experiment is a slide of this type:

The fluorescence level is proportional to the amount of mRNA produced in the experimental condition under study (usually one studies the ratio with respect to the expression level in some "reference" state of the cell).

Example: microarray samples in S. cerevisiae

The diauxic shift

DeRisi et al., Science 278 (1997) 680

• a yeast culture is inoculated into a glucose-rich medium

• rapid anaerobic growth fueled by fermentation, with production of ethanol, ensues

• upon glucose depletion, the yeast cells turn to ethanol as a carbon source for aerobic growth (respiration)

Expression data from DNA microarrays

• samples of cells are harvested at seven timepoints during the diauxic shift

• using DNA microarray techniques, mRNA levels for all the genes can be measured and compared to their initial values

• therefore the experiment answers the question: which genes are switched on, and which are switched off, as the available glucose becomes progressively scarcer?

The output of the experiment is, for each gene, the ratio between initial expression level and expression level at each of the seven timepoints during the diauxic shift.

The idea is to look for statistical correlation between these numerical data and the presence of binding sites in the upstream region of each gene.

Studying expression level for each set

For each set S(word) we compute the average expression level of the genes in the set at the seven timepoints of the diauxic shift experiment. More precisely, we compute the average log2 of the ratio between the measured mRNA at each timepoint and the measured initial mRNA.

This value is then compared to the average expression taken over the whole genome at each timepoint.

If the difference is larger than six standard deviations, the word defining the set is a candidate binding site for the regulation of the genes in the set.
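The six-standard-deviation criterion can be sketched as follows. The slide does not say which standard deviation is meant, so this example assumes the standard error of the mean for a random set of the same size, and it flags a set if the threshold is exceeded at any timepoint. Both choices, like the function names and the toy data, are assumptions.

```python
from math import sqrt
from statistics import mean, pstdev


def set_z_score(set_values, genome_values):
    """Deviation of the set mean from the genome-wide mean, in units of the
    standard error expected for a random set of the same size."""
    standard_error = pstdev(genome_values) / sqrt(len(set_values))
    return (mean(set_values) - mean(genome_values)) / standard_error


def is_candidate(set_by_timepoint, genome_by_timepoint, threshold=6.0):
    """True if the set deviates by more than `threshold` at any timepoint."""
    return any(abs(set_z_score(s, g)) > threshold
               for s, g in zip(set_by_timepoint, genome_by_timepoint))


# Toy log2 ratios at a single timepoint: genome-wide mean 0, standard deviation 1.
genome = [-1.0, 1.0] * 50
print(is_candidate([[1.0] * 49], [genome]))  # True: about 7 standard errors
print(is_candidate([[1.0] * 25], [genome]))  # False: 5 standard errors
```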


3. Example: Yeast

Identification of TF binding sites in yeast using Gene–Ontology

Output of the analysis:

• With the false discovery rate set at 0.01 we find a total of 108 associations between 80 different words (of 5-8 letters) and 41 Gene Ontology terms.

• The words can be organized in 12 different groups. Within each group the motifs are very similar to each other and are associated to the same or to very similar Gene Ontology terms. For each group we construct a consensus sequence ("motif") by aligning the words.

motif        C                         F                        P
AGGGTGC      -                         -                        siderophore transport
AGGGTGCA     -                         -                        siderophore transport
TGGGTGCA     -                         -                        siderophore transport
GGGTGCA      -                         -                        siderophore transport
GGGTGC       -                         -                        siderophore transport
GGTGCA       -                         heavy metal ion porter   siderophore transport
GGTGC        cell wall (sensu Fungi)   -                        -
AGGGTGCACC   (consensus)

CGGCGCC      -                         -                        tricarboxylic acid cycle
CGGCGCCG     -                         -                        tricarboxylic acid cycle
GGCGCCGA     -                         -                        tricarboxylic acid cycle
GCGCCGAG     -                         -                        tricarboxylic acid cycle
CGGCGCCGAG   (consensus)

Table 1: Two examples of motifs (C, F, P: Gene Ontology cellular component, molecular function, biological process).

Validation:

• Comparison with known TFs and binding sites (Transfac + literature survey)

• Comparison with the genome-wide ChIP experiment of T.I. Lee et al., Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298 (2002) 799.

motif      C   F   P                             TF
TGAAAC     -   -   sexual reproduction           DIG1, STE12
TGAAACA    -   -   sexual reproduction           DIG1, STE12
TGAAACA    (consensus)

ACTGTG     -   -   sulfur amino acid transport   MET4
TGTGGC     -   -   sulfur metabolism             MET4, MET31
ACTGTGGC   (consensus)

Table 2: Two examples of motifs with significant intersection with ChIP data

Results:

• All the motifs we find correspond to known binding sites (no false positives).

• For some of the motifs we are able to

– refine the putative binding sequences
– identify candidates for combinatorial regulation (example: PAC and RRPE)
– refine the functional annotation of already known TFs
– identify new potential targets of known TFs (example: Hcm1p)

MPS1 (YDL028C)
CIN8 (YEL061C)
PDS1 (YDR113C)
SPC98 (YNL126W)
VIK1 (YPL253C)
SPC25 (YER018C)
ESP1 (YGR098C)
STU2 (YLR045C)
SLI15 (YBR156C)

Table 3: Candidate targets of regulation by the Hcm1p transcription factor

motif           C                                            F   P
GATGAGA         nucleolus                                    -   ribosome biogenesis
GATGAGAT        nucleolus                                    -   ribosome biogenesis
ATGAGAT         nucleolus                                    -   ribosome biogenesis
ATGAGATG        -                                            -   ribosome biogenesis
TGAGATG         -                                            -   ribosome biogenesis and assembly
TGAGATGA        -                                            -   ribosome biogenesis and assembly
GAGATG          -                                            -   ribosome biogenesis and assembly
GAGATGAG        nucleolus                                    -   ribosome biogenesis and assembly
GAGATGA         nucleolus                                    -   ribosome biogenesis and assembly
AGATGAG         nucleolus                                    -   ribosome biogenesis
GATGAG          nucleolus                                    -   ribosome biogenesis
GATGA           -                                            -   ribosome biogenesis
ATGAGCT         nucleolus                                    -   ribosome biogenesis
TGAGCT          nucleolus                                    -   rRNA processing
GATGAGATGAGCT   (consensus)

AAAAATT         nucleolus                                    -   ribosome biogenesis
AAAAATTT        nucleolus                                    -   transcription complex from Pol I promoter
AAAATT          nucleolus                                    -   ribosome biogenesis
AAAATTT         nucleolus                                    -   ribosome biogenesis
AAAATTTT        nucleolus                                    -   ribosome biogenesis
AAATT           nucleolus                                    -   35S primary transcript processing
AAATTTTC        small nucleolar ribonucleoprotein complex    -   35S primary transcript processing
AAAAATTTTC      (consensus)

4. Binding site identification in human.

The extension of our algorithm to the human genome is not straightforward. Upstream regions at least 15,000 bp long must be taken into account, leading to a very small signal-to-noise ratio.

It is mandatory to perform a comparative analysis, selecting only those parts of the upstream regions which are conserved between human and mouse.

This can be done using the CORG database:

C. Dieterich et al., CORG: a database for comparative regulatory genomics. Nucleic Acids Res. 31 (2003) 374.

The CORG database.

CORG is a collection of conserved sequence blocks in the non-coding, upstream regions of orthologous genes from human and mouse.

These blocks are obtained by searching for statistically significant local suboptimal alignments of the 15 kb regions upstream of the translation start site.

The database contains more than 10,000 pairs of orthologous genes. The alignments were obtained using the Waterman-Eggert algorithm. We used two different choices of the PAM matrix, PAM1 and PAM10, to test the robustness of the results.

An important role in the following analysis is played by the fact that more than half of the genes in the database are annotated in the GO database.

The two releases are very different:

• PAM1

– number of genes in the database: 10999
– mean number of conserved blocks per gene: ∼ 20
– mean length of the union of conserved blocks: ∼ 500
– number of genes with a GO annotation: 6187

• PAM10

– number of genes in the database: 12943
– mean number of conserved blocks per gene: ∼ 40
– mean length of the union of conserved blocks: ∼ 900
– number of genes with a GO annotation: 7260

Results.

In the PAM10 case, out of the 43250 possible words of 5, 6, 7 and 8 letters

• 154 different words survive the GO filter

• 331 words survive the microarray filter

• the intersection between the two sets is 109 words, which corresponds to a P-value of e^−201

• similar results are obtained with PAM1. Despite the fact that the PAM1 and PAM10 CORG databases are very different, our results seem to be very robust: most of the words are present in both releases.

Clustering of words.

Due to the larger number of words and to the higher variability of the motifs, clustering of words is more delicate than in the yeast case. To decide whether two words belong to the same motif we perform a two-step analysis.

• First step: we check whether at least one of the following conditions is met:

– at least one GO term is significant for both motifs
– there is at least one time point in the cell cycle microarray (MA) experiment in which both motifs are simultaneously significant
– the intersection of the two sets of genes (labeled by the two words that we are testing) is statistically significant

• Second step: we check whether an alignment can be found between the two words with no gaps, at least 4 bases correctly aligned and at most 1 mismatch.
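The second step above amounts to sliding one word along the other and inspecting the overlap at each offset. The sketch below is one possible reading of the criterion: the function name is hypothetical, reverse complements are not considered, and only plain A, C, G, T letters are handled (no IUPAC codes).

```python
def can_align(w1: str, w2: str, min_matches: int = 4, max_mismatches: int = 1) -> bool:
    """True if some ungapped offset gives at least `min_matches` identical
    bases and at most `max_mismatches` differing bases in the overlap."""
    # offset = position of w2[0] relative to w1[0]
    for offset in range(-(len(w2) - 1), len(w1)):
        start = max(0, offset)
        end = min(len(w1), offset + len(w2))
        pairs = [(w1[i], w2[i - offset]) for i in range(start, end)]
        matches = sum(a == b for a, b in pairs)
        if matches >= min_matches and len(pairs) - matches <= max_mismatches:
            return True
    return False


print(can_align("ATGTTG", "TGTTGA"))  # True: TGTTG is shared at offset 1
print(can_align("AAAATT", "GGGTGC"))  # False: no offset has 4 matching bases
```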

Validation.

Comparing our findings with the data collected in the Transfac database, we were able to recognize some well-known TFs.

Example: NF–kB

motif        C   F                 P
GGAAATTC     -   chemoattractant   -
GGRAAKTCCC   Transfac consensus

Table 4: The putative NF–kB motif.

Example: E2F

motif      C   F   P
TTTCGCGC   -   -   DNA replication initiation
TTTSGCGC   Transfac consensus

Table 5: The putative E2F motif.

Example: A putative new motif

motif       C             F   P
AATGTTG     Golgi lumen   -   -
TGTTGA      Golgi lumen   -   -
ATGTTGA     Golgi lumen   -   -
TTATGTA     Golgi lumen   -   -
TWATGTTGA   (consensus)

Table 6: A putative motif with no reference in Transfac.

Conclusions.

We propose a new method to extract relevant biological information on the transcription factors (and more generally on the mutual interactions among genes) from the statistical distribution of oligonucleotides in the upstream region of the genes.

• The method requires complete knowledge of the upstream oligonucleotide sequences, and thus for the moment it can be applied only to those organisms whose complete genome has been sequenced.

• It does not require any external bias. The significance criterion only depends on the statistical distribution of oligonucleotides in the upstream region.

• It can be easily implemented and could be used as a standard preliminary test, to guide more refined analyses.

• It makes use of GO annotations and/or microarray data to assess the significance of the results. Both these tools are becoming more and more precise. This should lead to improved performance in future releases of our analysis.

We studied its performance in two cases: yeast and human. In both cases we found some already known TFs, which we used as a validation test of the method. In the human case we also found some previously unknown candidate binding sites, which we expect to be of biological relevance.
