2 Fontana Salsomaggiore2011

7/30/2019 2 Fontana Salsomaggiore2011

1/62

From sequence to function through genomics

Paolo Fontana

Fondazione Edmund Mach


2/62

HMM (Hidden Markov Model)

An HMM is a graph of connected states in which each state can potentially emit a symbol. The model is parameterized by probabilities that govern each state and the transitions between states.

An HMM describes the probability of a given sequence against a potentially unlimited number of sequences.

Suppose we have an alphabet of two letters (a, b) and want to build a sequence using an HMM with a two-state architecture:
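As a minimal sketch of such a two-state model, the Python below computes the probability of a sequence over the alphabet (a, b) with the forward algorithm. The state names and every probability value are illustrative assumptions, not the values from the original slide.

```python
# Hypothetical two-state HMM over the alphabet {a, b}.
# All probabilities below are made-up illustrative values.
STATES = ("s1", "s2")
START = {"s1": 0.5, "s2": 0.5}
TRANS = {"s1": {"s1": 0.7, "s2": 0.3}, "s2": {"s1": 0.4, "s2": 0.6}}
EMIT = {"s1": {"a": 0.9, "b": 0.1}, "s2": {"a": 0.2, "b": 0.8}}


def sequence_probability(seq):
    """Forward algorithm: probability of seq summed over all state paths."""
    if not seq:
        return 1.0
    # Initialisation: start in each state and emit the first symbol.
    forward = {s: START[s] * EMIT[s][seq[0]] for s in STATES}
    # Recursion: move to each state from any previous state, then emit.
    for symbol in seq[1:]:
        forward = {
            s: EMIT[s][symbol] * sum(forward[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(forward.values())
```

Because the model defines a probability distribution, the probabilities of all sequences of a given length add up to 1, which is a quick sanity check on any parameter set.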


3/62

Whole Genome Shotgun (WGS)

- If a fragment is entirely contained within a repeat, there may be several positions where it can be placed, and if the copies are not exactly identical this can cause errors in the final consensus.

- Repeats can be positioned in such a way as to cause ambiguity, so that two or more layouts are compatible with the input fragments.

To order the contigs and thus build a scaffold, BAC ends (reads located at the extremities of a BAC) are used.


4/62

Genome structural variation

A mate pair that spans a deletion event maps to the corresponding regions of the reference, but the distance between the two reads is greater than the insert size, while if the event is an insertion the distance is smaller. An inversion is detected if the orientation of the reads is flipped.

We can apply a similar concept to linked insertions and everted duplications.
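The rule described on this slide can be sketched as a small classifier. The function name, the `tolerance` parameter (allowed deviation from the expected insert size) and the boolean orientation flag are assumptions introduced for illustration; real callers work on many pairs and use the library's insert-size distribution.

```python
def classify_mate_pair(mapped_distance, insert_size, tolerance, orientation_ok):
    """Classify one mate pair from its mapping on the reference.

    mapped_distance: distance between the two reads on the reference
    insert_size:     expected insert size of the library
    tolerance:       hypothetical allowed deviation from insert_size
    orientation_ok:  False if the read orientation is flipped
    """
    if not orientation_ok:
        return "inversion"
    if mapped_distance > insert_size + tolerance:
        # Reads map farther apart than expected: sequence is missing
        # in the sample relative to the reference.
        return "deletion"
    if mapped_distance < insert_size - tolerance:
        # Reads map closer than expected: extra sequence in the sample.
        return "insertion"
    return "concordant"
```

For example, with an insert size of 3000 and a tolerance of 300, a pair mapping 5000 bases apart in the expected orientation would be reported as spanning a deletion.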


5/62


6/62

Protein-Coding Genes in Eukaryotes

Why are the proteomes of various eukaryotes similar in size, given the enormous phenotypic differences between eukaryotes?

(Proteome: the complete set of all protein-encoding genes, or all the proteins produced by them.)

Claverie calls this the N value paradox (N is for number), while Betran and Long call it the G value paradox (G is for genes).


7/62

Protein-Coding Genes in Eukaryotes

We do know that organisms such as worms and flies appear to have about 13,000 to 20,000 protein-coding genes, while plants, mice, and humans have only slightly more (about 20,000 to 40,000 genes).

Why do organisms such as humans, with so much greater biological complexity than insects and nematodes, have not even twice as many genes?

The genes of higher eukaryotes employ more complex forms of gene regulation, such as alternative splicing.

The architecture of individual genes also tends to be more complex, for example with more domains present in an average human protein than in an insect protein.


8/62

Can you find a gene here?

(The gene is human Casein Kinase II.)

Landmarks?

Signals?

(hard to see)


9/62

Introns make things harder

[Figure: gene structure along the genome (intergenic, exon, intron, exon, intron, exon, intergenic), with the start codon (ATG), the stop codon (TAG/TGA/TAA) and the splice sites marked, and the mRNA transcript with its 5' UTR and 3' UTR.]


10/62

Eukaryotic Gene Syntax

[Figure: a complete mRNA whose coding segment runs from the start codon (ATG) to the stop codon (TGA); exons alternate with introns, and each intron is delimited by a donor site (GT) and an acceptor site (AG).]

Regions of the gene outside of the CDS are called UTRs (untranslated regions), and are mostly ignored by gene finders, though they are important for regulatory functions.


11/62

Types of Exons

Three types of exons are defined, for convenience:

- initial exons extend from a start codon to the first donor site;
- internal exons extend from one acceptor site to the next donor site;
- final exons extend from the last acceptor site to the stop codon;
- single exons (which occur only in intronless genes) extend from the start codon to the stop codon.
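The definitions above depend only on which signals bound each exon, so they can be sketched as a tiny lookup. The function names are hypothetical, and the exons are assumed to be listed in transcript order.

```python
def exon_type(starts_at_start_codon, ends_at_stop_codon):
    """Name an exon from the signals at its two boundaries."""
    if starts_at_start_codon and ends_at_stop_codon:
        return "single"    # start codon ... stop codon (intronless gene)
    if starts_at_start_codon:
        return "initial"   # start codon ... first donor site
    if ends_at_stop_codon:
        return "final"     # last acceptor site ... stop codon
    return "internal"      # acceptor site ... next donor site


def label_exons(n_exons):
    """Label the coding exons of a gene with n_exons exons, in order."""
    return [exon_type(i == 0, i == n_exons - 1) for i in range(n_exons)]
```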


12/62

Known genes provide training signals for computerized gene finding

[Figure: signals extracted from known genes: start (atg), splice donor (ggtgag), splice acceptor (caggtg, cagatg, cagttg, caggcc), stop (tga).]


13/62

What is Gene Prediction?

Gene prediction is the problem of parsing a sequence into nonoverlapping coding segments (CDSs) consisting of exons separated by introns.


14/62

Gene Prediction Approaches

Intrinsic (ab initio)

GENSCAN, FGENESH, GeneMark.hmm, GlimmerM, Genie;

Extrinsic (similarity-based)

Spliced alignment: GenomeScan, EuGene, FGENESH+, FGENESH_C, GeneId+, AUGUSTUS, etc.;

Genomic comparison: TwinScan, TWAIN, SLAM, SGP, FGENESH_2, etc.


15/62










16/62

Genscan

Generalized Hidden Markov Model (GHMM): the output of a state can be a string of finite length. Moreover, the probability distribution need not be the same for all states: for example, one state may use a weight matrix to generate its output sequence, while another state may use an HMM.

The states correspond to the functional units of a gene (promoter, exons, introns, etc.) and the transitions from one state to another must be biologically consistent.


17/62


18/62

General Things to Remember about (Protein-coding) Gene Prediction Software

- It is, in general, organism-specific.
- It works best on genes that are reasonably similar to something seen previously.
- It finds protein-coding regions far better than non-coding regions.
- In the absence of external (direct) information, alternative forms will not be identified.
- It is imperfect! (It's biology, after all.)


19/62

Homology: two genes or proteins are said to be homologous if they derive from a common ancestor.

Homology is a qualitative property to which no percentage value can be assigned.

Similarity is a function that associates a numerical value with a pair of strings.

There are two different types of homology:

1. Two homologous sequences are orthologous if they belong to two different species and their divergence began with the speciation event that gave rise to the two species in question.

2. Two homologous sequences are paralogous if their divergence began with a gene duplication event.

[Figure: collinearity between Lg13 and Lg16 of apple.]


20/62

ALGORITHM

AGSGYWKATGTDKVITTEGRKVGIKKALVFYIGKAPKGTKTNWIMHEYRLLENSR----KNGSSKVD

AGSGYWKATG DK I + VGIKKALVFY GKAPKG KTNWIMHEYRL + R K S ++D

AGSGYWKATGADKPIGLP-KPVGIKKALVFYAGKAPKGEKTNWIMHEYRLADVDRSVRKKKNSLRLD

AGSGYWKATGTDKVITTEGRKVGIKKALVFYIGKAPKGTKTNWIMHEYRLLENSRKNGSSKVD

AGSGYWKATGADKPIGLPKPVGIKKALVFYAGKAPKGEKTNWIMHEYRLADVDRSVRKKKNSLRLD

ALIGN


21/62

Exact alignment algorithms

Global: Needleman and Wunsch. Local: Smith and Waterman.

1. The first step in aligning two sequences (S1 and S2) is to decide the score to assign to matches, mismatches and gaps.

2. Build an n x m matrix (n is the length of S1 and m that of S2) in which every letter of S1 is compared with every letter of S2, and each comparison is assigned a score based on the scores chosen beforehand.

3. From the matrix, derive the alignment with the highest global score.
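The three steps can be sketched in Python as a global (Needleman-Wunsch) alignment. The default scores (match = 1, mismatch = -1, gap = -1) are assumptions chosen for illustration, and "_" is used as the gap character to match the notation of these slides. Note that the matrix here has one extra row and column for the leading gaps.

```python
def needleman_wunsch(s1, s2, match=1, mismatch=-1, gap=-1):
    """Global alignment; returns (score, aligned_s1, aligned_s2)."""
    n, m = len(s1), len(s2)

    def pair_score(i, j):
        return match if s1[i - 1] == s2[j - 1] else mismatch

    # Step 2: fill the (n+1) x (m+1) matrix of best prefix scores.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair_score(i, j),  # align two letters
                score[i - 1][j] + gap,                   # gap in s2
                score[i][j - 1] + gap,                   # gap in s1
            )

    # Step 3: trace back from the bottom-right cell to recover one
    # alignment that achieves the best global score.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair_score(i, j):
            a1.append(s1[i - 1]); a2.append(s2[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            a1.append(s1[i - 1]); a2.append("_"); i -= 1
        else:
            a1.append("_"); a2.append(s2[j - 1]); j -= 1
    return score[n][m], "".join(reversed(a1)), "".join(reversed(a2))
```

Several alignments can share the best score; the traceback above returns only one of them, preferring diagonal moves.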


22/62

[Figure: dynamic-programming matrix comparing S1 = ACACACTT with S2 = AGCACACA.]

Resulting alignments:

S1: A_CACACTT
S2: AGCACAC_A

S1: A_CACACTT
S2: AGCACACA_

These algorithms are too slow to be applied to similarity searches against today's biological databases, hence BLAST.


23/62

BLAST

BLAST is based on a heuristic algorithm, which means that the alignment it produces is not exact.

The BLAST algorithm can be divided into three parts.

1) Read all the words of length W contained in the query sequence; for each of them, generate a list of related words that produce a score above a threshold T when aligned with the query word.

2) Scan all the sequences in the database, looking for W-mers that exactly match the words in the list produced in the previous step.

3) Check whether, and how far, each hit can be extended. This is done by trying to extend the alignment in both directions without inserting gaps. The result is an HSP (High-scoring Segment Pair) that cannot be extended any further. The parameter S defines a score threshold above which an HSP is considered worthy of attention.

In addition to W, T and S there is another important parameter, X, which determines how long the program should persist with a W-mer hit before stopping.

The statistics underlying BLAST also make it possible to relate the value of S to the expected number of HSPs that reach that threshold in a database of random sequences of the same size as the one considered:

$E = k\,m\,n\,e^{-\lambda S}$

where m and n are the lengths of the query and of the database, and k and $\lambda$ are statistical parameters of the scoring system.
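The seeding and extension steps can be sketched as follows. This is a deliberately simplified illustration, not BLAST's actual implementation: it uses only exact W-mers of the query (the neighbourhood words scoring above T are omitted), a simple match/mismatch score, and an `x_drop` parameter playing the role of X. All function and parameter names are assumptions.

```python
def find_seed_hits(query, subject, w):
    """Steps 1-2 (simplified): positions (i, j) where a W-mer of the
    query occurs exactly in the subject."""
    index = {}
    for i in range(len(query) - w + 1):
        index.setdefault(query[i:i + w], []).append(i)
    hits = []
    for j in range(len(subject) - w + 1):
        for i in index.get(subject[j:j + w], []):
            hits.append((i, j))
    return hits


def extend_ungapped(query, subject, i, j, w, x_drop, match=1, mismatch=-1):
    """Step 3: extend the hit at (i, j) in both directions without gaps.

    Extension in one direction stops when the running score falls more
    than x_drop below the best score seen in that direction.
    Returns (score, query_start, query_end) with query_end exclusive.
    """
    def extend(qi, sj, step):
        best = score = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= sj < len(subject):
            score += match if query[qi] == subject[sj] else mismatch
            length += 1
            if score > best:
                best, best_len = score, length
            elif best - score > x_drop:
                break
            qi += step
            sj += step
        return best, best_len

    right_score, right_len = extend(i + w, j + w, 1)
    left_score, left_len = extend(i - 1, j - 1, -1)
    total = w * match + left_score + right_score
    return total, i - left_len, i + w + right_len
```

An HSP would then be reported only if its score reaches the threshold S.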


24/62

FUNCTION?


25/62

Seeding for sequence alignment: the PatternHunter approach

BLAST looks for matches of k consecutive letters as seeds (the default value of k is 11 for nucleotide alignments). PatternHunter instead uses k non-consecutive letters as seeds. The relative positions of the k letters are called a spaced seed model, and k is its weight.

For example, if we use the weight-6 model 1110111, then the following alignments match the seed:

actgcct

acttcct

actacct

1110111
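A spaced seed check is easy to sketch: two windows match if they agree at every position where the model has a 1, while positions marked 0 are free. The function names below are hypothetical.

```python
def seed_weight(model):
    """Weight of a spaced seed model: the number of required positions."""
    return model.count("1")


def spaced_seed_hit(a, b, model):
    """True if windows a and b agree wherever the model has a '1'."""
    if not (len(a) == len(b) == len(model)):
        raise ValueError("windows and model must have the same length")
    return all(x == y for x, y, m in zip(a, b, model) if m == "1")
```

With the model 1110111, the three sequences on this slide match one another pairwise because they differ only at the fourth position, which the model ignores.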


26/62

tactgcctg

|||| ||||

tactacctg

1: 1110101
2:  1110101
3:   1110101

With BLAST's seed model, if a hit at position i is identified, the chance of a second hit at position i+1 is very high, because it requires only one extra base match. The dependency between the hits makes the detection of homologs less efficient: many regions will have more than one hit, which is unhelpful, while many other regions will be missed.

$\text{Sensitivity} = \dfrac{\text{number of TP}}{\text{number of TP} + \text{number of FN}}$


27/62

Above 30% identity, 90% of the sequences turn out to be homologous to the query; below 25%, fewer than 10% are.


28/62

ALGORITHM

AGSGYWKATGTDKVITTEGRKVGIKKALVFYIGKAPKGTKTNWIMHEYRLLENSR----KNGSSKVD

AGSGYWKATG DK I + VGIKKALVFY GKAPKG KTNWIMHEYRL + R K S ++D

AGSGYWKATGADKPIGLP-KPVGIKKALVFYAGKAPKGEKTNWIMHEYRLADVDRSVRKKKNSLRLD

AGSGYWKATGTDKVITTEGRKVGIKKALVFYIGKAPKGTKTNWIMHEYRLLENSRKNGSSKVD

AGSGYWKATGADKPIGLPKPVGIKKALVFYAGKAPKGEKTNWIMHEYRLADVDRSVRKKKNSLRLD

ALIGN

Assessing the biological significance of the alignment produced


29/62

1YEA AKESTGFKPGSAKKGATLFKTRCQQCHTIEE-------GGPNKVGPNLHGIFGRHSGQVK

1YCC ----TEFKAGSAKKGATLFKTRCLQCHTVEK-------GGPHKVGPNLHGIFGRHSGQAE

2PCBB ---------GDVEKGKKIFVQKCAQCHTVEK-------GGKHKTGPNLHGLFGRKTGQAP

5CYTR ---------GDVAKGKKTFVQKCAQCHTVEN-------GGKHKVGPNLWGLFGRKTGQAE

1CCR -ASFSEAPPGNPKAGEKIFKTKCAQCHTVDK-------GAGHKQGPNLNGLFGRQSGTTP

1CRY ---------QDAASGEQVFK-QCLVCHSIGP-------GAKNKVGPVLNGLFGRHSGTIE

1HROA -----SAPPGDPVEGKHLFHTICITCHTDIK-------G-ANKVGPSLYGVVGRHSGIEP

1CXC -------QEGDPEAGAKAFN-QCQTCHVIVDDSGTTIAGRNAKTGPNLYGVVGRTAGTQA

1C2RA ---------GDAAKGEKEFN-KCKTCHSIIAPDGTEIVKG-AKTGPNLYGVVGRTAGTYP

155C -------NEGDAAKGEKEFN-KCKACHMIQAPD-GTDIKG-GKTGPNLYGVVGRKIASEE

2C2C --------EGDAAAGEKVSK-KCLACHTFDQ-------GGANKVGPNLFGVFENTAAHKD

2mtac -----APQFFNIIDGSPLNFDD-----AMEEGRDTEAVKHFLETGENVYNEDPEILPEAE

. * : * : . .

Are there finer methods for finding functionally or structurally related protein sequences?

The idea is to identify the domains or positions that are conserved, and therefore subject to a structural or functional constraint, within proteins belonging to the same family.

Multiple alignment

The multiple alignment of three or more sequences can be defined as a hypothesis of positional homology between bases or amino acids.


30/62

Looking at a multiple alignment of related protein sequences, one can notice conserved regions, typically 20-30 amino acids long.

The basic idea is to classify different sequences as belonging to the same family if they share the same motifs.

One way to achieve this is to define profiles: that is, which residues are allowed at a given position, which positions are highly conserved or degenerate, and which positions or regions can tolerate insertions or deletions.


31/62

Multiple alignment

1. Start from N homologous sequences.
2. Compute all possible pairwise alignments.
3. Build a guide tree based on the similarity scores between all pairs.
4. Choose the pair of sequences with the highest degree of similarity and group them into a cluster, fixing their alignment.
5. The procedure ends when the multiple alignment includes all the sequences.

Limitation: if the algorithm gets one alignment wrong, the error will negatively affect all subsequent ones.


32/62


33/62

Given a multiple alignment of a set of sequences, a profile for that alignment gives the frequency with which each character appears in a given column.

A T C _ A
A T A T A
A C C T _
C T _ T C

     C1    C2    C3    C4    C5
A   .75          .25         .50
T          .75         .75
C   .25    .25   .50         .25
_                .25   .25   .25

Profile values are often converted to log ratios. If p(y,j) is the frequency of character y at position j and p(y) is the frequency with which character y appears anywhere in the multiple alignment, then the value $\log \frac{p(y,j)}{p(y)}$ is used as the entry in the profile matrix.

For a character y and a column j, let p(y,j) be the frequency with which y appears in column j of the profile, and let S(x,j) denote the score for aligning x with column j:

$S(x,j) = \sum_{y} s(x,y)\, p(y,j)$

where s(x,y) is the score for aligning character x with character y.
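The profile of the four-sequence alignment on this slide, and the score of a character against a column, can be sketched as follows. The pairwise scoring function `simple_score` (+1 for identical characters, -1 otherwise) is an assumption made for illustration; any substitution matrix could be plugged in instead.

```python
from collections import Counter

# The alignment shown on the slide ("_" is the gap character).
ALIGNMENT = ["ATC_A", "ATATA", "ACCT_", "CT_TC"]


def build_profile(alignment):
    """One dict per column mapping each character to its frequency."""
    n = len(alignment)
    return [
        {ch: count / n for ch, count in Counter(column).items()}
        for column in zip(*alignment)
    ]


def simple_score(x, y):
    """Hypothetical pairwise score s(x, y)."""
    return 1.0 if x == y else -1.0


def column_score(x, column, s=simple_score):
    """S(x, j): sum over y of s(x, y) * p(y, j) for one profile column."""
    return sum(s(x, y) * p for y, p in column.items())
```

For the first column, `build_profile(ALIGNMENT)[0]` gives A with frequency 0.75 and C with frequency 0.25, in agreement with column C1 of the table.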



34/62











35/62

This concept can be applied in biology to identify proteins belonging to the same family: one can define a set of positions in a sequence that are more or less conserved.

To this end, we define a linear chain of match, insertion and deletion states that refer to a multiple alignment of proteins (profile).

All states can emit a character except the deletion state.

The aim of all this work is to find a model that assigns a high probability to the protein sequences that belong to the same family; in this way we obtain a set of states and transitions with which we can evaluate the probability that an unknown sequence belongs to a given protein family. Naturally, several possible paths can generate the same sequence: we need to find the right one, that is, the one that maximizes the score.


36/62

Advantages

- Solid statistical foundation.
- They can be used for a considerable number of tasks, such as data mining to classify biological data, protein structure analysis, pattern discovery, etc.

Disadvantages

- Overfitting: because of the starting data, some members of a protein family may be over-represented, weighing too heavily in the construction of the model and making it too stringent.
- The result is a linear model that cannot describe higher-order correlations within a protein, such as hydrogen bonds, disulfide bridges, etc., which can occur between amino acids that are distant in the sequence but close because of the protein fold.


37/62

The figure illustrates the growth of DNA sequence data from the advent of sequencing techniques in 1975 to the present day.

Cumulative increase in molecular biology and genetics articles (dashed line) and in DNA sequence records in GenBank (solid line). Note how the exponential growth of sequence data led, around the mid-1990s, to a reversal of positions. Today, the sheer quantity of data means that the scientific publications that should describe them cannot keep pace. (Adapted from M.S. Boguski, Science 286, 453-455, 1999.)

With the techniques seen so far, one has to deal with the enormous quantity of data available in public biological databases.


38/62

One of the main tasks of bioinformatics is to organize data and extract from them information that is useful and usable by the scientific community.

There is a whole area of bioinformatics devoted precisely to

data mining

and the process by which knowledge is obtained from the analysis of the data held, for example, in primary databases, and which can generate secondary or specialized databases, goes by the name of:

KDD

Knowledge Discovery in Databases


39/62

Knowledge Discovery in Databases (KDD)

[Figure: the KDD pipeline. Data; cleaning and integration; data warehouse; selection and transformation; prepared data; data mining; patterns; evaluation and visualization; knowledge; knowledge base; application.]


40/62

Data mining (KDD) goals

The main goal of data mining is to build a knowledge base that can be used to predict the function of unknown biological data.

- Description. Annotation: the process of interpreting raw data to provide biologically usable information.
- Prediction. Building a model with predictive power.

Data mining (KDD) operations

- Verification: validating the hypothesis, statistical analysis.
- Search: data exploration, predictive models, database segmentation.


41/62

ONTOLOGY is a way to capture knowledge in a written and computable form. This means that the computer finds patterns so we don't have to.

IN PHILOSOPHY

Ontology (from Greek) is the philosophical study of the nature of being, existence or reality in general, as well as of the basic categories of being and their relations.

IN COMPUTER SCIENCE

Ontology is a formal representation of a set of concepts within a domain and the relationships between those concepts.


42/62

Gene Ontology

[Figure: the labels transcription, mRNA synthesis, DNA-directed RNA synthesis and gene expression, shown together with id: GO:0006352.]


43/62

The Gene Ontology is like a dictionary

Each concept has:

- a name. term: transcription initiation
- a definition. definition: Processes involved in the assembly of the RNA polymerase complex at the promoter region of a DNA template resulting in the subsequent synthesis of RNA from that promoter.
- an ID number. id: GO:0006352


44/62

There are also relationships between them.

Gene Ontology is a DAG (Directed Acyclic Graph).

Nucleic acid binding is a type of binding.

DNA binding is a type of nucleic acid binding.


45/62

Appropriate Relationships to Parents

GO currently has many relationships, but the most frequent types are:

Is_a

An is_a child of a parent means that the child is a complete type of its parent, but can be discriminated in some way from other children of the parent.

Example (CAR): a Ferrari is a CAR; a FIAT 500 is a CAR.


46/62

Appropriate Relationships to Parents

and: Part_of

A part_of child of a parent means that the child is always a constituent of the parent that, in combination with other constituents of the parent, makes up the parent.

Example (CAR): the wheel is a part of a CAR.


47/62

True Path Violations Create Incorrect Definitions

"...the pathway from a child term all the way up to its top-level parent(s) must always be true".

[Figure: chromosome linked to nucleus by a part_of relationship.]


48/62



[Figure: mitochondrial chromosome linked to chromosome by an is_a relationship.]


49/62


[Figure: mitochondrial chromosome is_a chromosome; chromosome part_of nucleus.]

A mitochondrial chromosome is not part of a nucleus!
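The true path rule can be checked mechanically by collecting every ancestor of a term. The sketch below uses two hypothetical mini-ontologies written as (child, relationship, parent) triples: one with the violation shown on this slide, and one restructured so that each kind of chromosome is part_of its own compartment. The edge lists are illustrations, not the actual GO graph.

```python
# Hypothetical mini-ontologies as (child, relationship, parent) triples.
BROKEN = [
    ("mitochondrial chromosome", "is_a", "chromosome"),
    ("chromosome", "part_of", "nucleus"),
]
FIXED = [
    ("nuclear chromosome", "is_a", "chromosome"),
    ("mitochondrial chromosome", "is_a", "chromosome"),
    ("nuclear chromosome", "part_of", "nucleus"),
    ("mitochondrial chromosome", "part_of", "mitochondrion"),
]


def ancestors(term, edges):
    """All terms reachable from term by following edges upward."""
    found = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for child, _relationship, parent in edges:
            if child == current and parent not in found:
                found.add(parent)
                stack.append(parent)
    return found
```

In the broken graph, "nucleus" appears among the ancestors of "mitochondrial chromosome", which is exactly the false statement the rule forbids; in the fixed graph it does not.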



50/62


[Figure: nuclear chromosome and mitochondrial chromosome are each linked to chromosome by an is_a relationship; nuclear chromosome is part_of nucleus, and mitochondrial chromosome is part_of mitochondrion.]


51/62

[Figure: chromosome, mitochondrion and nucleus connected by has_part relationships.]

To overcome this problem a new relationship has recently been added: has_part. Previously we were used to propagating gene products up the graph. With the addition of has_part this is no longer so simple.

[Figure labels: gene products ABF1 and MGM101.]


52/62

Biological process ontology

Which process is a gene product involved in?

Molecular function ontology

Which molecular function does a gene product have?

Cellular component ontology

Where does a gene product act?

The ontologies are used to categorize gene products.


53/62

AMINO ACID SEQUENCE

Similarity searches: HMM, profiles, HMM-HMM, etc.

Is there anything really similar out there?

YES: try functional transfer and annotate the sequence. Good luck!

NO: fold recognition, etc.; try to find the 3D structural model or features.


54/62

ARGOT

It is a knowledge-based and integrated approach which combines:

1. clustering of GO terms, based on their semantic similarities;

2. a weighting scheme which assesses retrieved hits sharing a certain number of biological features with the sequence to be annotated.


55/62

What do you need?

A metric based on:

1) Topology: the GO graph.
2) Information content: how informative is the term? Can you quantify it?
3) Semantic similarity: a measure to establish "How much does term A have to do with term B?"
4) A weighting scheme: finding some biological features in common between our target and known proteins annotated in GO (BLAST, HMM, etc.). How do we get and weight these features?


56/62

Are C and D similar? Are A and B similar?

Edge distance: AB = 2, CD = 2: very close!

but

Is antioxidant activity a sort of transcription regulator activity? Certainly not. For sure, glutathione peroxidase activity shares something with phospholipid-hydroperoxide glutathione peroxidase activity!

[Figure: GO graph with the nodes A, B, C and D; edge distance cutoff.]


57/62

Information content (Resnik 1999)

Semantic similarity (Lin 1998)

[Figure: GO graph with the information content of each node: A IC=2.9, B IC=1.8, C IC=4.2, D IC=5.8. List of common subsumers: IC=0 and IC=3.1.]

Are C and D similar? Are A and B similar?

Semantic similarity: AB = 0, absolutely not similar! CD = 0.62, quite similar!

With a cutoff of semantic similarity >= 0.6: A is NOT similar to B, and C is similar to D.
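Lin's measure divides twice the information content of the most informative common subsumer by the sum of the information contents of the two terms. The sketch below applies it to the IC values of this example, under the assumption that the IC=3.1 subsumer belongs to the pair C, D and the IC=0 subsumer (the root) to the pair A, B; the function names are hypothetical.

```python
import math


def information_content(probability):
    """IC of a term: minus the log of its annotation probability."""
    return -math.log(probability)


def lin_similarity(ic_a, ic_b, ic_subsumer):
    """Lin semantic similarity from three information content values."""
    if ic_a + ic_b == 0:
        return 0.0
    return 2.0 * ic_subsumer / (ic_a + ic_b)


sim_cd = lin_similarity(4.2, 5.8, 3.1)  # C and D
sim_ab = lin_similarity(2.9, 1.8, 0.0)  # A and B share only the root
```

Unlike the edge distance, which is 2 for both pairs, this measure separates the two cases: the pair whose only common ancestor is the uninformative root scores 0.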


58/62

Step I

Trimming the GO graph

Keeping only the nodes of BLAST hits (black circles) and their parents (white circles).


59/62

Step II

1) Calculating IC

2) Calculating weights: the absolute value of the sum of the log of the child nodes' BLAST e-values.
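The weight described in Step II can be sketched in a few lines. The slide does not state the base of the logarithm, so base 10 is an assumption here, as are the function name and the `min_evalue` floor used to avoid taking the log of an e-value reported as 0.

```python
import math


def node_weight(child_evalues, min_evalue=1e-300):
    """Absolute value of the sum of the log10 of the BLAST e-values.

    Base 10 and the min_evalue floor are assumptions of this sketch.
    """
    return abs(sum(math.log10(max(e, min_evalue)) for e in child_evalues))
```

Smaller e-values give larger weights, so a node supported by many significant hits ends up with a high score.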


60/62

Step III

1) Discarding nodes with Z-score < 0:

$Z_i = \dfrac{S_i - \bar{S}}{\sigma}$

where $\bar{S}$ is the average, calculated as the score of the root node divided by the total number of nodes that compose the initial trimmed GO graph, $S_i$ is the score of node i, and $\sigma$ is the standard deviation, assuming a Gaussian distribution of the weights.

2) Clustering of nodes based on semantic similarity (stringent 0.7 threshold).

ROC plots (10,000)


61/62

Specificity = TN/(TN+FP). Sensitivity = TP/(TP+FN). Y-axis = sensitivity; X-axis = 1 - specificity.

In (a) the results of InC, AC and TS scores are reported for hits under 100% sequence identity (ROC 100 plots). In (b) the performances of the three indexes are reported for low sequence similarity hits below 40% identity (ROC 40 plots). In (c), (d), and (e) the AC, TS, and InC scores are shown respectively, with comparisons of their trends at low (ROC 40 plots) and high (ROC 100 plots) sequence similarity. In (f) the annotations of up to the first top five BLAST hits are evaluated (TOPBLAST).
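Each point of a ROC plot comes from the four counts of a confusion matrix at one score cutoff. A minimal sketch, with a hypothetical function name:

```python
def roc_point(tp, fp, tn, fn):
    """Return (x, y) = (1 - specificity, sensitivity) for one cutoff."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return 1.0 - specificity, sensitivity
```

Sweeping the cutoff over the range of scores and plotting the resulting points gives the full curve.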


62/62

http://www.medcomp.medicina.unipd.it/Argot2/
