G. Paolella Napoli, 18/12/ 2007 1
G. Paolella
High performance computing for the annotation
and mining of whole genomes
DG-CST
1022 genes related to genetically transmitted disease
CST
Identification and characterization in other species of nucleotide sequences conserved between human and mouse (CSTs).
H. Sapiens
M. Musculus
CSTs
CSTs identified in disease-associated genes: 64,495. Analysis to be carried out with BLAST against other genomes (rat, dog, monkey, chicken, etc.).
KinWeb
500 genes coding for human protein kinases
KinWeb DB
Annotation is carried out by a pipeline that runs through its phases without human assistance. CPU-intensive tasks, such as BLAST homology searches, are spread across several collaborating servers by a system developed specifically for load distribution and monitoring.
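As an illustration of spreading CPU-intensive tasks over several servers, here is a minimal sketch of a greedy "least-loaded server first" assignment. It is not the system described on the slide: the function name `distribute`, the server names and the task costs (taken loosely from the per-sequence timings shown later in the talk) are all assumptions.

```python
import heapq

def distribute(tasks, servers):
    """Assign (name, cost) tasks to servers, always picking the least-loaded one."""
    heap = [(0.0, server) for server in servers]   # (current load, server)
    heapq.heapify(heap)
    assignment = {server: [] for server in servers}
    # Placing the most expensive tasks first keeps the final loads balanced.
    for name, cost in sorted(tasks, key=lambda task: -task[1]):
        load, server = heapq.heappop(heap)
        assignment[server].append(name)
        heapq.heappush(heap, (load + cost, server))
    return assignment

# Hypothetical task list: (task name, relative cost).
tasks = [("blast_est", 15), ("blast_genome", 13), ("repeatmasker", 5), ("blast_self", 1)]
plan = distribute(tasks, ["srv1", "srv2"])
```

With two servers the plan ends with loads of 16 and 18, which is the best split available for these four costs.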
CST ANNOTATION
CSTs
- chromosome position
- type (i.e. intergenic, intronic, exonic, etc.)
- coding %
- closest gene and relative distances
- .......
ENSEMBL gene and gene structure data
- Max L-Score
- Avg L-Score
- .......
UCSC Log Score data
Matches with:
- EST
- Other genomes
- Proteins (BlastX)
BLAST
- repeats type
- repeats %
Repeat Masker
Coding Potential Score
CPS
- Redundancy
- Overlapping
- ........
PHP Scripts
DB
Remote Servers
Pipeline units
Assemble
…
Contigs Scaffolds
…
geneA tRNA prom oprA oprB
geneCluster A
Annotation
High throughput sequencing
Sequencing
At CEINGE, a Nonomuraea genome sequencing project using 454 FLX is in progress.
[Figure: N (0 to 40000) and Bases (Mbps, 0 to 12) plotted against Coverage (0 to 21); gaps series: 100, 200, 500, 1000, 2000, 5000, 10000, 20000, 50000, 100000]
Annotation
• Identification of genes and other genetic elements.
• Protein functional annotation.
• Cellular process annotation.

• Identification of ORFs, tRNAs, rRNAs
• Scanning for signals, such as promoters and microRNAs
• Identification of operons and gene clusters
• Comparison with known genomes/proteins
• Identification of orthologs and paralogs
• Characterization of protein domains
• Reconstruction of complete metabolic pathways
• …
Annotation Steps
Stem Loop Structure (SLS)
Protein and coding genes; Forward strand; Reverse strand
E. coli k12
ERIC
Rib (BIME family)
SLS in Bacteria
Identification of SLSs in bacterial genomes
real / shuffled
Blast e-value
Markov clustering (MCL)
XSLSs
BLAST all vs all
Clustering by sequence similarity
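The clustering step named above (all-vs-all BLAST followed by Markov clustering) can be sketched in a few lines. This is a toy pure-Python version of the MCL expansion/inflation loop, not the MCL program itself. The e-value to weight transform (`-log10`, capped), the self-loop weight and all parameter values are assumptions chosen for illustration.

```python
import math

def evalue_to_weight(evalue, cap=200.0):
    """Turn a BLAST e-value into an edge weight (assumed transform: -log10, capped)."""
    if evalue <= 0.0:
        return cap
    return min(cap, max(0.0, -math.log10(evalue)))

def normalize_columns(m):
    """Rescale every column so that it sums to 1 (column-stochastic matrix)."""
    n = len(m)
    for j in range(n):
        total = sum(m[i][j] for i in range(n))
        for i in range(n):
            m[i][j] /= total

def mcl(n, edges, inflation=2.0, iterations=30, eps=1e-6):
    """Toy Markov clustering over n sequences; edges maps (a, b) to a weight."""
    m = [[0.0] * n for _ in range(n)]
    for (a, b), weight in edges.items():
        m[a][b] = m[b][a] = weight
    for i in range(n):
        m[i][i] = max(max(m[i]), 1.0)          # self-loop, assumed weight
    normalize_columns(m)
    for _ in range(iterations):
        # Expansion: one matrix squaring (random-walk step).
        m = [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
             for i in range(n)]
        # Inflation: element-wise power, then renormalize.
        m = [[value ** inflation for value in row] for row in m]
        normalize_columns(m)
    # Each row that keeps non-negligible entries lists the members of one cluster.
    clusters = {tuple(j for j in range(n) if row[j] > eps) for row in m}
    clusters.discard(())
    return sorted(clusters)

# Hypothetical BLAST hits: (query index, subject index, e-value).
hits = [(0, 1, 1e-30), (1, 2, 1e-30), (0, 2, 1e-30),
        (3, 4, 1e-30), (4, 5, 1e-30), (3, 5, 1e-30)]
edges = {(a, b): evalue_to_weight(e) for a, b, e in hits}
clusters = mcl(6, edges)
```

The dense matrix product makes this cubic in the number of sequences, so it is only suitable for small examples; for millions of SLSs a sparse implementation is needed.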
G. Paolella Napoli, 18/12/ 2007 14
RESULTSFolding probability
Clustered SLSs SLSs Random sequences
p = probability that the Minimum Free Energy (MFE) of a given sequence is equal to a distribution of MFE computed with random sequences. (RANDFOLD)
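A shuffle-based p-value of this kind can be sketched as follows. The energy function here is a deliberately crude stand-in (it only counts base pairs when the sequence is folded in half) and is not a real MFE calculation; a folding program would be plugged in instead. The estimator R/(N+1), the simple mononucleotide shuffle and all names are assumptions.

```python
import random

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}

def toy_energy(seq):
    """Stand-in for MFE: -1 for every base pair when the sequence is folded in half."""
    half = len(seq) // 2
    return -sum((seq[i], seq[-1 - i]) in PAIRS for i in range(half))

def shuffle_pvalue(seq, energy=toy_energy, n_shuffles=999, seed=0):
    """Fraction of shuffled sequences whose energy is as low as the real one."""
    rng = random.Random(seed)
    observed = energy(seq)
    letters = list(seq)
    as_low = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)                    # keeps the base composition
        if energy("".join(letters)) <= observed:
            as_low += 1
    return as_low / (n_shuffles + 1)

hairpin = "GGGGGGGGAAAACCCCCCCC"
p = shuffle_pvalue(hairpin)
```

A low p means that shuffled sequences of the same composition rarely reach the energy of the real one, which is the signal used to flag a sequence as potentially structured.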
G. Paolella Napoli, 18/12/ 2007 15
RESULTSGrouping clusters
sequence strand location IS rRNA
B. anthracis 4 3 2 2B. halodurans 6 6 4 3 1B. subtilis 2 2 1 1 1C. perfringens 6 2 1 1C. tetani 14 13 10 6 3E. faecalis 7 5 3 3 1L. johnsonii 3 3 2 2 1S. aureus 11 7 5 4S. pneumoniae 28 22 13 9 6
M. genitalium 1 1 1 1M. pneumoniae 20 20 18 12
C. diphtheriae 9 7 5 4 1M. leprae 29 18 11 5M. tuberculosis 59 36 21 15 3
B. melitensis 11 7 5 4R. conorii 19 6 4 4
B. bronchiseptica 26 8 5 4B. parapertussis 30 16 10 5 4B. pertussis 52 28 16 4 3N. meningitidis 44 9 7 6
E. coli 12 8 6 6 2H. influenzae 3 1 1 1P. multocida 1 1 1 1P. aeruginosa 9 5 4 4P. putida 75 35 26 14 4 2S. typhi 8 4 3 3 2S. typhimurium 7 6 4 4 1V. cholerae 7 7 5 4 2Y. pestis 20 15 11 5 2
Total 523 301 205 137 28 11
Species Located withinGrouped byClusters
137- 39= 98 clusters
manualrefinement
92 families
G. Paolella Napoli, 18/12/ 2007 16
Genome search
Identification of all family members by HMM
Sequence alignment
matches
HMM
New alignment
Final elements
HMM build
Align & extend
An example of the elongation process: Myg-1 (M. genitalium)
Pae-1 (P. aeruginosa)
Examples: Complex structures
Efa-1 (E. faecalis)
Bacterial SLSs
Pae-1 (Pseudomonas aeruginosa)
ERIC (Escherichia coli)
RNAz test
Predicted to be structured (57): Known (20), Novel (37)
Predicted to be not structured (35): Known (5), Novel (30)
Contain known motifs: (14), (12)
[Axis: % of genic families, 0 / 50 / 100]
Secondary structure prediction analysis of families
Cluster
4x14x2 = 112 procs, 2.8 GHz
4x14x2 = 112 GB RAM
2 GB/s per card, 4 GB/s aggregate
Each node: 2.8 GHz biproc., 2 GB RAM, 160 GB HD
Processing time

Time (sec)
Operation                   SLSs Proj   CSTs Proj
Sequences                           1           1
BLAST vs self                       1           1
BLAST vs hum EST                    -          15
BLAST vs mus EST                    -          12
BLAST vs Hum Genome                 -          13
BLAST vs Mus/Rat Genome             -          10
BLAST vs Small Genomes              -           6
RepeatMasker                        3           5
Mfold                               2           -
RandFold                           30          30
RNA-z                             0.5         0.5

Time (days)
Operation                   SLSs Proj   CSTs Proj
Sequences                     2469003      103340
BLAST vs self                    28.6         1.2
BLAST vs hum EST                    -        17.9
BLAST vs mus EST                    -        14.4
BLAST vs Hum Genome                 -        15.5
BLAST vs Mus/Rat Genome             -          12
BLAST vs Small Genomes              -         7.2
RepeatMasker                     85.7           6
Mfold                            57.2           -
RandFold                        857.3        35.9
RNA-z                            14.3         0.6

Time (months)
Operation                   SLSs Proj   CSTs Proj
Sequences                     2469003      103340
ALL                              33.6         3.6
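The day and month figures in the processing-time tables follow from the per-sequence timings by simple multiplication, as this short sketch shows. The dictionary layout and function names are assumptions; the timings and sequence counts are the ones on the slide. The monthly totals are reproduced when a month is taken as 31 days, which is an inference from the numbers rather than something the slide states.

```python
SECONDS_PER_DAY = 86_400

# Seconds per sequence as (SLSs project, CSTs project); None means "not run".
PER_SEQUENCE_SEC = {
    "BLAST vs self": (1, 1),
    "BLAST vs hum EST": (None, 15),
    "BLAST vs mus EST": (None, 12),
    "BLAST vs Hum Genome": (None, 13),
    "BLAST vs Mus/Rat Genome": (None, 10),
    "BLAST vs Small Genomes": (None, 6),
    "RepeatMasker": (3, 5),
    "Mfold": (2, None),
    "RandFold": (30, 30),
    "RNA-z": (0.5, 0.5),
}
N_SEQUENCES = (2_469_003, 103_340)  # SLSs project, CSTs project

def days(seconds_per_sequence, n_sequences):
    """Serial running time in days for one operation over all sequences."""
    return seconds_per_sequence * n_sequences / SECONDS_PER_DAY

def project_days(column):
    """Total days for one project (0 = SLSs, 1 = CSTs), skipping operations not run."""
    return sum(days(timing[column], N_SEQUENCES[column])
               for timing in PER_SEQUENCE_SEC.values()
               if timing[column] is not None)

randfold_slss_days = days(30, N_SEQUENCES[0])   # about 857 days
slss_months = project_days(0) / 31              # assumed 31-day month
csts_months = project_days(1) / 31
```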
The procedure requires high performance computing
Blast + MCL
Blast + MCL
Blast + MCL
Pcma
HMMbuild
HMMcalibrate
HMMsearch
Pcma
n
Identification Characterization
Randfold
RNAz
Infernal
SCoPE GRID computing
Medicine site
• HD attached to the system:
• 1 Cluster Element (CE)
• 5 Worker nodes (WN) biproc (expandable up to 40)
• 1 Storage Element (SE) with 50 Gb
• 1 User Interface (UI)
BLAST
• Executable submitted from a local repository of programs
• Genomic data libraries stored on the SE
• Example: BLAST of the 65,597 CSTs against the dog, chicken, monkey and rat genomes.
• Number of jobs submitted: 67
• Input sequence group: 1000 sequences
• Total execution time of the 67 jobs: 4 hours
• Mean time per job: 18 minutes (2 spent downloading the dataset).
• CPU time
• Search of 1 sequence in the mouse genome => 5 sec.
• 64,495 sequences => 3.75 days
• 10 genomes => 37.5 days
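The CPU-time figures in this list can be checked with a short calculation. The sketch below is illustrative only and the function names are assumptions. With 5 seconds per sequence it gives about 3.73 days for one genome and 37.3 days for ten, which the slide rounds to 3.75 and 37.5. The job figures (67 jobs of 18 minutes completed in 4 hours) correspond to roughly five jobs running at once, in line with the five worker nodes listed for the site.

```python
SECONDS_PER_DAY = 86_400

def cpu_days(n_sequences, seconds_per_sequence, n_genomes=1):
    """Serial CPU time, in days, to search every sequence against every genome."""
    return n_sequences * seconds_per_sequence * n_genomes / SECONDS_PER_DAY

def mean_concurrency(n_jobs, minutes_per_job, wall_clock_hours):
    """Average number of jobs that must run at once to finish in the wall-clock time."""
    return n_jobs * minutes_per_job / (wall_clock_hours * 60)

one_genome = cpu_days(64_495, 5)
ten_genomes = cpu_days(64_495, 5, n_genomes=10)
running_at_once = mean_concurrency(67, 18, 4)
```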
Secondary structure search
Identification and characterization, in bacterial genomes, of families of repeated sequences that share a conserved secondary structure.
Analysis to be carried out on over 300 bacterial genomes.
Example
Search for one family in one genome =====> 6 hours.
Search for 50 families in one genome =====> 12.5 days.
Search for 50 families in 300 genomes =====> 10 years.
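The scaling in this example is linear in both the number of families and the number of genomes, so it is easy to reproduce and to extend to a parallel setting. The sketch below assumes perfect scaling over processors, which real runs will not reach; the function name and the use of the 112 processors of the cluster described earlier are illustrative assumptions.

```python
def search_days(hours_per_search, n_families, n_genomes, n_processors=1):
    """Days needed for all family-by-genome searches, assuming ideal scaling."""
    return hours_per_search * n_families * n_genomes / (24 * n_processors)

one_genome = search_days(6, 50, 1)                       # 12.5 days
all_genomes = search_days(6, 50, 300)                    # 3750 days
years_serial = all_genomes / 365                         # a little over 10 years
on_cluster = search_days(6, 50, 300, n_processors=112)   # about 33.5 days
```

Under this idealized assumption, a ten-year serial analysis drops to a few weeks on a cluster of that size.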
RandFold
Identification of potentially structured sequences in the human transcriptome.
Analysis to be carried out with RANDFOLD on genes split, in 50-base windows, into 150-base sequences.
Example
Genes: 408, about 14 million bases
150-base sequences generated: 291,589
Analysis of 1 sequence =====> 45 sec.
Analysis of 291,589 sequences =====> 152 days.
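One plausible reading of the fragmentation step is a sliding window of 150 bases advanced 50 bases at a time. That reading is an assumption, although it is consistent with the counts given (about 14 million bases divided by 50 is close to the 291,589 sequences reported). A minimal sketch, with hypothetical names:

```python
def windows(seq, length=150, step=50):
    """Yield (start, fragment) pairs for overlapping windows along a sequence.

    Sequences shorter than `length` yield a single, shorter fragment.
    """
    last_start = max(len(seq) - length, 0)
    for start in range(0, last_start + 1, step):
        yield start, seq[start:start + length]

gene = "ACGU" * 100                      # hypothetical 400-base gene
fragments = list(windows(gene))          # starts at 0, 50, 100, 150, 200, 250

# Serial running time for the example on the slide: 45 seconds per fragment.
total_days = 291_589 * 45 / 86_400
```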
What about more interactive uses?
CAPRI
[Diagram: web requests; Access servers, each with a scheduler; Broker (getNode); http request for an available node; rsh launch on the node; web display of results; DBs; Private network; Cluster]
Requests distribution on the cluster
[Diagram: Cluster Status; StatusDB; sql updates from node agents; ClusterManager; Display server (sql)]
Cluster activity
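The diagram labels suggest that node agents write their state to a status database and that a broker answers "getNode" requests from it. The sketch below shows one way such a lookup could work, using an in-memory SQLite database. The table layout, column names and the "least-loaded live node" policy are all assumptions and are not taken from the slides.

```python
import sqlite3

def make_status_db():
    """Create an in-memory status table (hypothetical schema)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE node_status ("
        "node TEXT PRIMARY KEY, cpu_load REAL, alive INTEGER)"
    )
    return db

def report(db, node, cpu_load, alive=True):
    """What a node agent would do periodically: upsert its current state."""
    db.execute(
        "INSERT OR REPLACE INTO node_status VALUES (?, ?, ?)",
        (node, cpu_load, int(alive)),
    )

def get_node(db):
    """Broker lookup: the least-loaded node that is alive, or None."""
    row = db.execute(
        "SELECT node FROM node_status WHERE alive = 1 "
        "ORDER BY cpu_load ASC, node ASC LIMIT 1"
    ).fetchone()
    return row[0] if row else None

db = make_status_db()
report(db, "node01", 0.9)
report(db, "node02", 0.2)
report(db, "node03", 0.1, alive=False)
chosen = get_node(db)
```

Here `chosen` is "node02": node03 has the lowest load but is marked as not alive, so it is skipped.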
[Diagram: Broker; virtual nodes; DBs; Grid; nodes]
Hierarchical node organization
PROJECT
*Cell line
*Culture conditions
*Fixation and inclusion methods, stainings, etc.
*Objective
*Focus position
*Stage position x/y
*Project title
*Experiment name
*Author, group, group leader, etc.
*Exposure time
*Resolution, etc.
Digital images are produced by a variety of microscope devices.
Managing large numbers of images requires the use of databases (DB).
Processing of the acquired images is often necessary to enhance the visibility of cell features that would otherwise be hidden.
Integrated image storage and processing environment
The image processing system: IPROC
Version number 1 features tab-delimited
Name filename
Depth size 16bit
wdim size 4 where files
cdim size 3 where files
pdim size n where files
tdim size n unit min scale 10 where files
ldim size n unit µm scale 0.4 where layers
[Diagram: Time 1, Time 2 ... Time n; well1 to well4; Channel 1 to Channel 3; Position 1 ... Position n; layers l1 ... ln]
File format
Data input: text description
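A description file of this kind is simple to read programmatically. The sketch below assumes that each line is a field name followed by alternating attribute/value tokens separated by tabs or spaces, and that a line with a single token after the name carries a bare value. This layout is inferred from the lines shown on the slide and may not match the real IPROC format; the function name is hypothetical.

```python
def parse_description(text):
    """Parse 'name key value key value ...' lines into a dict of dicts."""
    fields = {}
    for line in text.splitlines():
        tokens = line.split()            # splits on tabs and spaces alike
        if not tokens:
            continue
        name, rest = tokens[0], tokens[1:]
        if len(rest) % 2:                # odd count: treat as one bare value
            fields[name] = {"value": " ".join(rest)}
        else:
            fields[name] = dict(zip(rest[0::2], rest[1::2]))
    return fields

# Lines copied from the slide; "n" is the slide's own placeholder.
description = """Version\tnumber\t1\tfeatures\ttab-delimited
Name\tfilename
Depth\tsize\t16bit
wdim\tsize\t4\twhere\tfiles
tdim\tsize\tn\tunit\tmin\tscale\t10\twhere\tfiles
ldim\tsize\tn\tunit\tµm\tscale\t0.4\twhere\tlayers
"""
parsed = parse_description(description)
```

Judging by the figure labels, the dimension prefixes appear to stand for wells (w), channels (c), positions (p), time points (t) and layers (l), but that mapping is also an inference.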
Acquisition parameters
Buttons to slide the acquisition
Image processing menus
Info panel for each frame
hide/show control command
IPROC
Image processing
[Diagram: image in; iProcStep; iProcStepImageMagick, iProcStepPHP, iProcStepPerl, commandLine program; adapters; ImageMagick Package, PHP Package, PERL Package, Command Line Packages; image out]
Image processing modules
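The module layout in the diagram (a common step type, with adapters in front of different back ends) resembles the adapter pattern. The sketch below shows that idea in Python under stated assumptions: the class names are hypothetical, images are passed around as raw bytes, and the command-line adapter pipes the image through an external program's standard input and output. It is not IPROC's actual code, which the slide suggests is built around PHP, Perl and ImageMagick.

```python
import subprocess
import sys
from abc import ABC, abstractmethod

class IProcStep(ABC):
    """One processing step: takes image bytes, returns image bytes."""
    @abstractmethod
    def run(self, image: bytes) -> bytes: ...

class FunctionStep(IProcStep):
    """Adapter around an in-process function."""
    def __init__(self, func):
        self.func = func
    def run(self, image: bytes) -> bytes:
        return self.func(image)

class CommandLineStep(IProcStep):
    """Adapter around an external program that reads stdin and writes stdout."""
    def __init__(self, argv):
        self.argv = argv
    def run(self, image: bytes) -> bytes:
        done = subprocess.run(self.argv, input=image,
                              stdout=subprocess.PIPE, check=True)
        return done.stdout

def run_pipeline(image, steps):
    """Feed the output of each step into the next one."""
    for step in steps:
        image = step.run(image)
    return image

# Toy stand-ins for real filters: upper-case the bytes, then reverse them
# in a child Python process (used here only because it is always available).
reverse_cmd = [sys.executable, "-c",
               "import sys; sys.stdout.buffer.write(sys.stdin.buffer.read()[::-1])"]
result = run_pipeline(b"abc", [FunctionStep(bytes.upper), CommandLineStep(reverse_cmd)])
```

Because every step exposes the same `run` method, a new back end only needs its own adapter class; the pipeline code does not change.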
[Diagram: HPC on Cluster nodes; Gateway; iPage; image area; data + images; page; iPane (x3); proc-steps]
IPROC architecture
[Diagram: IPROC; Access Server (x3); Cluster Nodes; CLUSTER]
Parallel processing
The group
Angelo Boccia
Gianluca Busiello
Mauro Petrillo
Concita Cantarella
Luca Cozzuto
Leandra Sepe
Vittorio Lucignano
Marisa Passaro