Apache Solr vs ElasticSearch

May 16, 2013

The other day I had a really good discussion about Apache Solr. I was fascinated by Solr when I used it on my last project. During the discussion a competing product, ElasticSearch, came up. I haven’t started playing around with ElasticSearch yet, so here is a start on my pet project.

The Feature Smackdown


API

Feature Solr 4.2 ElasticSearch 0.90.0.RC1
Format XML,CSV,JSON JSON
HTTP REST API
Binary API   SolrJ  TransportClient, Thrift (through a plugin)
JMX support
Client libraries  PHP, Ruby, Perl, Scala, Python, .NET, Javascript PHP, Ruby, Perl, Scala, Python, .NET, Javascript, Erlang, Clojure
3rd-party product integration (open-source) Drupal, Magento, Django, ColdFusion, WordPress, OpenCMS, Plone, Typo3, ez Publish, Symfony2, Riak (via Yokozuna) Django, Symfony2
3rd-party product integration (commercial) DataStax Enterprise Search SearchBlox

Indexing

Feature Solr 4.2 ElasticSearch 0.90.0.RC1
Data Import DataImportHandler – MySQL, CSV, XML, Tika, URL, Flat File Rivers modules – Wikipedia, MongoDB, CouchDB, RabbitMQ, RSS, Sofa, JDBC, FileSystem, Dropbox, ActiveMQ, LDAP, Amazon SQS, St9, OAI, Twitter
ID field for updates and deduplication
Partial Doc Updates   with stored fields  with _source field
Custom Analyzers and Tokenizers 
Per-field analyzer chain 
Per-doc/query analyzer chain 
Synonyms   Supports Solr and Wordnet synonym format
Multiple indexes 
Near-Realtime Search/Indexing 
Complex documents   Flat document structure. No native support for nesting documents
Multiple document types per schema   One set of fields per schema, one schema per core
Online schema changes   Schema change requires restart. Workaround possible using MultiCore.  Only backward-compatible changes.
Apache Tika integration 
Dynamic fields 
Field copying   via multi-fields
Hash-based deduplication 

Searching

Feature Solr 4.2 ElasticSearch 0.90.0.RC1
Lucene Query parsing 
Structured Query DSL   Need to programmatically create queries if going beyond Lucene query syntax.
Span queries   via SOLR-2703
Spatial search 
Multi-point spatial search 
Faceting   The way top N facets work now is by getting the top N from each shard, and merging the results. This can give incorrect counts when num shards > 1.
Pivot Facets 
More Like This
Boosting by functions 
Boosting using scripting languages 
Push Queries   Percolation
Field collapsing/Results grouping   possibly 0.20+
Spellcheck  Suggest API
Autocomplete Beta implementation from community plugin
Query elevation 
Joins   It’s not supported in distributed search. See LUCENE-3759.  via has_children and top_children queries
Filter queries   also supports filtering by native scripts
Filter execution order   local params and cache property  _cache and _cache_key property
Alternative QueryParsers   DisMax, eDisMax  query_string, dis_max, match, multi_match etc
Negative boosting   but awkward. Involves positively boosting the inverse set of negatively-boosted documents.
Search across multiple indexes  it can search across multiple compatible collections
Result highlighting
Custom Similarity 
Searcher warming on index reload   Warmers API
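
The note in the Faceting row (top N taken from each shard, then merged) can be illustrated with a small simulation. This is only a sketch of the naive merge as the table describes it, not code from Solr or ElasticSearch; the function names and the sample counts are made up for illustration.

```python
from collections import Counter


def distributed_top_n(shards, n):
    """Naive distributed faceting: merge only each shard's local top-n counts."""
    merged = Counter()
    for shard in shards:
        # Each shard reports just its own n most frequent terms.
        for term, count in Counter(shard).most_common(n):
            merged[term] += count
    return merged.most_common(n)


def exact_top_n(shards, n):
    """Reference result: sum every term over all shards, then take the top n."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return total.most_common(n)


# Hypothetical per-shard facet counts for one field.
shard_a = {"red": 10, "blue": 9, "green": 8}
shard_b = {"green": 12, "blue": 9, "red": 1}

print(distributed_top_n([shard_a, shard_b], 2))  # [('blue', 18), ('green', 12)]
print(exact_top_n([shard_a, shard_b], 2))        # [('green', 20), ('blue', 18)]
```

With these counts "green" falls just outside the top 2 on the first shard, so the merged result undercounts it (12 instead of 20) and ranks it below "blue". With a single shard the two functions always agree, which matches the "num shards > 1" condition in the table.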

Customizability

Feature Solr 4.2 ElasticSearch 0.90.0.RC1
Pluggable API endpoints 
Pluggable search workflow   via SearchComponents
Pluggable update workflow 
Pluggable Analyzers/Tokenizers
Pluggable Field Types
Pluggable Function queries
Pluggable scoring scripts
Pluggable hashing 

Distributed

Feature Solr 4.2 ElasticSearch 0.90.0.RC1
Self-contained cluster   Depends on separate ZooKeeper server  Only ElasticSearch nodes
Automatic node discovery  ZooKeeper  internal Zen Discovery or ZooKeeper
Partition tolerance  Solr: The partition without a ZooKeeper quorum will stop accepting indexing requests or cluster state changes, while the partition with a quorum continues to function.  ElasticSearch: Partitioned clusters can diverge unless discovery.zen.minimum_master_nodes is set to at least N/2+1, where N is the size of the cluster. If configured correctly, the partition without a quorum will stop operating, while the other continues to work.
Automatic failover  If all nodes storing a shard and its replicas fail, client requests will fail, unless requests are made with the shards.tolerant=true parameter, in which case partial results are returned from the available shards.
Automatic leader election
Shard replication
Sharding 
Automatic shard rebalancing   it can be machine, rack, availability zone, and/or data center aware. Arbitrary tags can be assigned to nodes, and it can be configured not to assign a shard and its replicas to nodes with the same tags.
Change # of shards  Solr: specified at index-creation time, with command-line param -DnumShards=n. Cannot be changed once index is created. Shard splitting is a work in progress (SOLR-3755). Additional replicas can be created.  ElasticSearch: each index has 5 shards by default. Number of primary shards cannot be changed once the index is created. Replicas can be increased anytime.
Relocate shards and replicas   can move shards and replicas to any node in the cluster on demand
Control shard routing   with some config changes  routing parameter
Consistency  Solr: Indexing requests are synchronous with replication. An indexing request won’t return until all replicas respond. No check for downed replicas. They will catch up when they recover. When new replicas are added, they won’t start accepting and responding to requests until they are finished replicating the index.  ElasticSearch: Replication between nodes is synchronous by default, thus ES is consistent by default, but it can be set to asynchronous on a per-document indexing basis. Index writes can be configured to fail if there are not sufficient active shard replicas. The default is quorum, but all or one are also available.
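
The quorum rule in the Partition tolerance row (N/2+1, with N the size of the cluster) can be worked through with a few lines of Python. This is a rough sketch of the arithmetic only, assuming integer division and treating every node as counting toward N; the helper names are hypothetical and nothing here talks to a real cluster.

```python
from typing import List


def minimum_master_nodes(cluster_size: int) -> int:
    """Majority quorum N/2 + 1, using integer division."""
    if cluster_size < 1:
        raise ValueError("cluster_size must be at least 1")
    return cluster_size // 2 + 1


def partitions_that_keep_operating(partition_sizes: List[int]) -> List[bool]:
    """For each side of a network split, report whether it still holds a quorum."""
    quorum = minimum_master_nodes(sum(partition_sizes))
    return [size >= quorum for size in partition_sizes]


print(minimum_master_nodes(5))                 # 3
print(partitions_that_keep_operating([3, 2]))  # [True, False]
print(partitions_that_keep_operating([2, 2]))  # [False, False]
```

The last line shows why an even split is awkward: with 4 nodes the quorum is 3, so neither half of a 2/2 split can keep operating. At most one side can ever hold a quorum, which is what keeps the partitions from diverging.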

Thoughts…

As a number of folks point out in the discussion below, feature comparisons are inherently shallow and only go so far. I think they serve a purpose, but shouldn’t be taken to be the last word on these 2 fantastic search products.

If you’re running a smallish site and need search features without the distributed bells-and-whistles, I think you’ll be very happy with either Solr or ElasticSearch.

The exception to this is if you need RIGHT NOW some very specific feature like field grouping which is currently implemented in Solr and not ElasticSearch. Because of the considerable momentum behind ElasticSearch, it is very likely that the feature-set between the 2 products will converge considerably in the near future.

If you’re planning a large installation that requires running distributed search instances, I suspect you’re going to be happier with ElasticSearch.

As Matt Weber points out below, ElasticSearch was built to be distributed from the ground up, not tacked on as an ‘afterthought’ like it was with Solr. This is totally evident when examining the design and architecture of the 2 products, and also when browsing the source code.



Contribute

If you see any mistakes, or would like to append to the information on this webpage, you can clone the GitHub repo for this site with:

git clone https://github.com/superkelvint/solr-vs-elasticsearch

and submit a pull request.
