Apache Solr, the open source enterprise search engine

Luca Bonesini | Titulus User Group, Kion, Bologna, 4 December 2013 | Apache Solr: the enterprise search platform

description

Titulus User Group event held on 4 December 2013, organized by Kion/Cineca in Bologna.

Transcript of Apache Solr, the open source enterprise search engine

Page 1

Luca Bonesini | Titulus User Group, Kion, Bologna, 4 December 2013

Apache Solr: the enterprise search platform

Page 2

Who I am: Luca Bonesini

● IT professional
● Javelin thrower
● Programmer
● Bass guitar player
● Systems administrator
● Entrepreneur
● IT Manager
● Husband
● Pre-sales engineer
● Mountain biker
● Webmaster
● Father²
● Salesman
● Choir singer
● Marketer

http://www.lucabonesini.it

@lbonesini

http://it.linkedin.com/in/lucabonesini/

[email protected]

+39 366 688 7125

Page 3

Sourcesense: Making sense of Open Source

Contributors: Lucene/Solr, Apache Chemistry, Apache Jackrabbit, OpenSSO-Alfresco

Committers: Hibernate Search Project, Apache UIMA project, JBoss GateIn Portal

Lead developer: Lucene Infinispan integration

Page 4

Lucene and Solr: what are they?

Page 5

Apache Lucene (core) Search by ASF

“Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.”

http://lucene.apache.org/core/

● Fast and efficient scoring and indexing algorithms
● Lots of contributions to make common tasks easier: highlighting, spatial, query parsers, benchmarking tools, etc.
● Most widely deployed search library on the planet

Page 6

Apache Solr Search by ASF

“Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search.”

Highly reliable, scalable, fault tolerant, distributed indexing, replication, load-balanced querying, automated failover and recovery, centralized configuration.

Page 7

Apache Solr Search by ASF

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Jetty. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language.

http://lucene.apache.org/solr

● Access Lucene over HTTP: Java, XML, Ruby, Python, .NET, JSON, PHP, etc.
● Most programming tasks in Lucene are configuration tasks in Solr
● Faceting (guided navigation, filters, etc.)
● Replication and distributed search support
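To make the "Lucene over HTTP" idea concrete, here is a minimal Python sketch of how a client might build a request for Solr's `/select` handler and read a JSON response. The host, the core name `documents`, the `title` field and the sample response are assumptions made up for illustration; they do not come from the slides, and no network call is made.

```python
import json
from urllib.parse import urlencode


def build_select_url(base_url, core, query, rows=10, start=0):
    """Build a URL for Solr's /select request handler."""
    params = {
        "q": query,      # main query string
        "wt": "json",    # ask for a JSON response
        "rows": rows,    # page size
        "start": start,  # offset of the first result
    }
    return f"{base_url}/{core}/select?{urlencode(params)}"


def titles_from_response(body):
    """Collect the (assumed) 'title' field of each returned document."""
    data = json.loads(body)
    return [doc.get("title") for doc in data["response"]["docs"]]


# Hypothetical host and core name, used only for illustration.
url = build_select_url("http://localhost:8983/solr", "documents", "title:protocollo")

# Hand-written sample shaped like a Solr JSON response (not real output).
sample = (
    '{"response": {"numFound": 1, "start": 0, '
    '"docs": [{"id": "1", "title": "Protocollo informatico"}]}}'
)
titles = titles_from_response(sample)
```

Any language with an HTTP client and a JSON parser could do the same, which is the point the slide makes.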

Page 8

Enterprise Search: search wearing a tie

Page 9

Enterprise Search: what and how.

“Enterprise search is the practice of making content from multiple enterprise-type sources, such as databases and intranets, searchable to a defined audience.” [Wikipedia]

Ingestion → Processing and analysis → Indexing → Query parsing → Matching

● Ingestion: pull, push, integration API, crawler, connector.
● Processing and analysis: document types and formats (XML, HTML, Office, etc.) converted to plain text; stemming, lemmatization, synonym expansion, entity extraction, part-of-speech tagging, tokenization.
● Indexing: dictionary of all unique words in the corpus; ranking; term frequency.
● Query parsing: user query; faceting; paging.
● Matching: query-index comparison; references to source documents.
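The five stages can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how Lucene or Solr work internally: the suffix-stripping `naive_stem` is a crude stand-in for a real stemmer, scoring is plain summed term frequency, and the three sample documents are invented.

```python
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Processing: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Analysis chain: tokenize, then stem each token."""
    return [naive_stem(t) for t in tokenize(text)]


def build_index(docs):
    """Indexing: map each term to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(analyze(text)).items():
            index[term][doc_id] = tf
    return index


def search(index, query):
    """Query parsing and matching: rank documents by summed term frequency."""
    scores = Counter()
    for term in analyze(query):
        for doc_id, tf in index.get(term, {}).items():
            scores[doc_id] += tf
    return [doc_id for doc_id, _ in scores.most_common()]


# Ingestion: invented sample documents keyed by id.
docs = {
    "d1": "Fishing boats and fished waters",
    "d2": "Indexing documents for enterprise search",
    "d3": "Searching indexed documents",
}
index = build_index(docs)
results = search(index, "fishing")
```

Because the query goes through the same `analyze` chain as the documents, a search for "fishing" matches both "Fishing" and "fished" in `d1`.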

Page 10

Enterprise Search: what and how.

Page 11

Enterprise Search: what and how.

● Crawler: an Internet bot that systematically browses the World Wide Web, typically for the purpose of Web indexing (also called Web spider, ant, automatic indexer, web scutter).

● Precision/Recall: in pattern recognition and information retrieval, precision (also called positive predictive value) is the fraction of retrieved instances that are relevant, while recall (also known as sensitivity) is the fraction of relevant instances that are retrieved.

● Stemming: the process of reducing inflected (or sometimes derived) words to their stem, base or root form (e.g. "fishing", "fished", and "fisher" to the root word "fish").

● Lemmatization: in linguistics, the process of grouping together the different inflected forms of a word so they can be analysed as a single item (e.g. the word "better" has "good" as its lemma).

● Named-entity recognition (entity extraction): a subtask of information extraction that seeks to locate and classify atomic elements in text into predefined categories such as the names of persons, organizations, locations, expressions of times, quantities, monetary values, percentages, etc.

● Part of speech: a linguistic category of words (or more precisely lexical items), which is generally defined by the syntactic or morphological behaviour of the lexical item in question (e.g. noun and verb).

● Tokenization: the process of demarcating and possibly classifying sections of a string of input characters. The resulting tokens are then passed on to some other form of processing. The process can be considered a sub-task of parsing input.
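The precision and recall definitions above reduce to two ratios over sets. A minimal Python sketch follows; the document ids are invented for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a result list against a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # retrieved documents that are also relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Invented example: 4 documents returned, 3 documents actually relevant.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d2", "d4", "d7"]
p, r = precision_recall(retrieved, relevant)
```

Here two of the four returned documents are relevant (precision 2/4), and two of the three relevant documents were found (recall 2/3).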

Page 12

Search and Open Source

Page 13

Enterprise Search: products and vendors

Vendors of proprietary enterprise search software

AskMeNow, Attivio, Concept Searching Limited, Content Analyst Company LLC, Coveo, Dassault Systèmes (acquired Exalead), Denodo, Dieselpoint, Inc., dtSearch Corp., EMC Corp., Exorbyte GmbH, Expert System S.p.A., Exterro, Inc., Fabasoft, Funnelback, Google Search Appliance, HP (acquired Autonomy Corporation which in turn acquired Verity K2 and Ultraseek), IBM (acquired Vivisimo), Inbenta, inter:gator Enterprise Search, ISYS Search Software, MarkLogic, Microsoft (includes Microsoft Search Server, Fast Search & Transfer), Mindbreeze, Neofonie (includes WeFind), Omniture (acquired by Adobe Systems), Open Text Corporation, Oracle Corporation (includes Secure Enterprise Search and Endeca Technologies Inc.), Perception Software, PolySpot, Q-go, Q-Sensei, Recommind, SAP (includes SAP NetWeaver Enterprise Search, Search Services in SAP NetWeaver AS ABAP, and Search and Classification TREX), Sinequa, SLI Systems, Sophia Search Limited, TeraText, X1 Technologies, Inc., ZyLAB Technologies, ZL Technologies

Free and open source enterprise search software

Apache Solr, DataparkSearch, ElasticSearch, ht://Dig, Jumper 2.0, mnoGoSearch, OpenSearchServer, Searchdaimon, Sphinx

Vendors of open source enterprise search software

30 Digits, Apache Software Foundation, Lucid Works, Sematext, Flax

Page 14

Open Source: they do it too.

Page 15

Because Innovation = Bu$ine$$

Open Source

Open Standard

Innovation

OAGi, OASIS, W3C, IETF, IEEE, ETSI, Ecma, OGF, IEC, ISO, ITU, CENELEC, CEN, BSI, UNI, CEI, DKE, DIN, AFNOR, GIETS, LDTI

Interoperability

Page 16

Solr and Business

Page 17

Solr features

● Advanced Full-Text Search Capabilities
● Optimized for High Volume Web Traffic
● Standards Based Open Interfaces - XML, JSON and HTTP
● Comprehensive HTML Administration Interfaces
● Server statistics exposed over JMX for monitoring
● Linearly scalable, auto index replication, auto failover and recovery
● Near Real-time indexing
● Flexible and Adaptable with XML configuration
● Extensible Plugin Architecture
● A Real Data Schema, with Numeric Types, Dynamic Fields, Unique Keys
● Powerful Extensions to the Lucene Query Language
● Faceted Search and Filtering
● Geospatial Search with support for multiple points per document and geo polygons
● Advanced, Configurable Text Analysis
● Highly Configurable and User Extensible Caching
● Performance Optimizations
● External Configuration via XML
● An AJAX based administration interface
● Monitorable Logging
● Fast near real-time incremental indexing and index replication
● Highly Scalable Distributed search with sharded index across multiple hosts
● JSON, XML, CSV/delimited-text, and binary update formats
● Easy ways to pull in data from databases and XML files from local disk and HTTP sources
● Rich Document Parsing and Indexing (PDF, Word, HTML, etc) using Apache Tika
● Apache UIMA integration for configurable metadata extraction
● Multiple search indices

Related Projects: Apache Hadoop, Apache ManifoldCF, Apache Lucene.Net, Apache Lucy, Apache Mahout, Apache Nutch, Apache OpenNLP, Apache Tika, Apache Zookeeper
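As an illustration of the "Faceted Search and Filtering" item, the Python sketch below builds the query string for a faceted request and decodes facet counts. It assumes Solr's standard `facet`, `facet.field` and `fq` parameters and the flat `[value, count, value, count, ...]` list that Solr's JSON output uses for field facets by default. The field names `category` and `year` and the sample counts are invented.

```python
from urllib.parse import urlencode


def facet_params(query, facet_field, filters=()):
    """Build a query string asking Solr for facet counts on one field."""
    params = [
        ("q", query),
        ("wt", "json"),
        ("facet", "true"),             # turn faceting on
        ("facet.field", facet_field),  # field to count values for
    ]
    # Each filter query narrows the result set without affecting scoring.
    params += [("fq", f) for f in filters]
    return urlencode(params)


def pair_facet_counts(flat):
    """Turn [value, count, value, count, ...] into a list of (value, count)."""
    return list(zip(flat[0::2], flat[1::2]))


# Hypothetical field names and counts, for illustration only.
qs = facet_params("*:*", "category", filters=["year:2013"])
sample_facets = ["contracts", 12, "invoices", 7, "memos", 0]
facets = pair_facet_counts(sample_facets)
```

A user interface would typically render `facets` as clickable filters, and each click would add another `fq` to the next request.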

Page 18

Search: already a 'commodity'

Search is everywhere! Keyword search is a commodity. A holistic view of the data and the users is critical. Scalable search, discovery and analytics are the key to unlocking this view of users and data.

Documents

Access

Content Relationships

User interaction

Traditional

• Fast, fuzzy text matching across a large document collection
• De-normalized data, “light” relational
• Top N problems
• Key-value (top 1)
• Recommendations
• “Good enough” classification, clustering
• Faceting, slicing and dicing of enumerated data
• Spatial, spell checking, record linkage, highlighting
• NoSQL

And:

● eCommerce
● Search + Recs + Analysis of users
● Knowledge Management
● Financial, transportation, pharma
● Fraud detection
● Social media
● Trend monitoring
● Information technology
● Log monitoring, analysis
● Healthcare
● DNA Analysis

Page 19

Smart without Search?

Page 20

Solr: who uses it?

Buy.com

Page 21

Beyond Search

Page 22

A success story


Page 39

Happy searching, everyone.

Thank you!

Luca Bonesini

www.sourcesense.com

[email protected]

+39 366 688 7125

www.lucabonesini.it

twitter: @lbonesini

skype: lbonesini