Building Search Interface Using Apache Solr in .NET

August 7, 2013

A quick, granular and accurate search interface is of prime importance for most web applications today. Many applications have traditional search interfaces where the scope of a search is restricted to specific fields, which limits the relevance of the results they can return. Most commercial web sites, however, require an advanced search interface that indexes the content of documents, builds complex queries based on multiple criteria and fetches the maximum number of relevant results. In this paper we introduce the Apache Solr search engine, which, most importantly, provides content search; we explain how to construct queries involving multiple search criteria using Solr and how to integrate it with the application to build a quicker, more accurate and more refined search interface.

Apache Solr is an open source enterprise search server based on the Lucene Java search library, with XML/HTTP and JSON APIs. It runs in a Java servlet container such as Tomcat. The input to Solr is a document with optional metadata. We put documents into it (this is called indexing) via XML, JSON, CSV or binary over HTTP, and we query it via HTTP GET, receiving XML, JSON, CSV or binary results. It is architected to deliver very fast search operations across a wide variety of data.

Features

  • Advanced full-text search – Users can search for one or more words or phrases in the content of documents, in specific fields, or in combinations of fields, producing results that match the user’s interests.
  • Faceted Search – Users can narrow down the search results by applying filters on fields (numeric, date or unique fields) to drill down, which provides a categorized search.
  • Sort – Users can prioritize the search results by sorting on fields (for example, a count field such as the number of Likes).
  • Pagination – User can display the search results in pages of fixed size.
  • Hit-Term Highlighting – Provides highlighting of the search keyword in the document.
  • It is optimized for high-volume web traffic.
  • It supports rich Document Parsing and Indexing (PDF, Word, HTML, etc.)
  • Admin UI – It has a very simple and user-friendly interface for designing and executing queries over the data.
  • Caching – It caches the results of filter queries, thus delivering faster search operations.

Architecture

[Block diagram]

The above block diagram shows the sequence of actions for uploading documents to Solr and executing Search queries as per specific search criteria to get the relevant matches.

Building Search Interface for Your Web Application:

We assume that Solr is already configured and running; you will need to know the Solr endpoint.

For more information on installing, configuring and running Solr, refer to the official Apache Solr documentation.

We propose a generic search interface that can be implemented to search any application-specific entity indexed by Solr. The Search method accepts a SearchParameters object and returns a SearchResult of the generic type <T>.



public interface ISearch<T>
{
    SearchResult<T> Search(SearchParameters parameters);
}

Let us see what the SearchParameters class looks like and how it is constructed.

public class SearchParameters
{
    public const int DefaultPageSize = 4;

    public SearchParameters()
    {
        SearchFor = new Dictionary<string, string>();
        Exclude = new Dictionary<string, string>();
        SortBy = new List<SortQuery>();
        FilterBy = new List<FilterQuery>();
        PageSize = DefaultPageSize;
        PageIndex = 1;
    }

    public string FreeSearch { get; set; }
    public int PageIndex { get; set; }
    public int PageSize { get; set; }
    public IDictionary<string, string> SearchFor { get; set; }
    public IDictionary<string, string> Exclude { get; set; }
    public IList<SortQuery> SortBy { get; set; }
    public IList<FilterQuery> FilterBy { get; set; }
}
  • SearchFor: We add the advanced full-text search parameters to this dictionary. The key is the name of the field and the value is the word/phrase to search for, for example:

    Field (key)   Word/phrase (value)
    Title         azure, "cloud computing"
    tags          azure, "cloud computing"

  • Exclude: Parameters to be excluded from the advanced full-text search, added to a dictionary in the same key/value pattern:

    Field (key)   Word/phrase (value)
    Title         business
    tags          business
  • SortBy: This is a list of SortQuery items. FieldName maps to the field on which the sorting needs to be done; order indicates the SortOrder (Ascending/Descending).

public class SortQuery
{
    public string FieldName { get; set; }
    public SortOrder order { get; set; }

    public enum SortOrder
    {
        Ascending,
        Descending
    }
}

    FieldName       order        Description
    like_integer    Descending   Sorts the search results in descending order of the number of Likes
  • FilterBy: This is a list of FilterQuery items.
public class FilterQuery
{
    public string FieldName { get; set; }
    public string LowerLimit { get; set; }
    public string UpperLimit { get; set; }
    public string Value { get; set; }
    public string DataType { get; set; }
}
  • FieldName = The field on which filtering needs to be done
  • Value = Value of the filter field
  • DataType = If filtering values are restricted to a particular range, this will indicate datatype of the filter field
  • LowerLimit = If filtering values are restricted to a particular range, this will indicate lowerlimit of the filter field
  • UpperLimit = If filtering values are restricted to a particular range, this will indicate upperlimit of the filter field


  • PageSize: Used for pagination to specify the number of search results per page.
  • PageIndex: Used for pagination to specify which page of results to return; the offset into the query’s result set is computed from it, and Solr returns results starting from that offset. A sample construction of SearchParameters follows.
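To make these parameters concrete, here is a minimal, hedged sketch of building a SearchParameters instance. The field names (title_text, tags_text, like_integer, created_on_datetime) and the DataType value are illustrative assumptions; they must match the fields actually indexed in your Solr schema and the constant expected by the filter-building code shown later.

var parameters = new SearchParameters
{
    PageIndex = 1,
    PageSize = 10
};

// Full-text search: look for "azure" or the phrase "cloud computing" in the title field.
parameters.SearchFor.Add("title_text", "azure \"cloud computing\"");

// Exclude documents whose tags mention "business".
parameters.Exclude.Add("tags_text", "business");

// Sort by the number of likes, most liked first.
parameters.SortBy.Add(new SortQuery { FieldName = "like_integer", order = SortQuery.SortOrder.Descending });

// Restrict results to documents created within a date range.
// "DateTime" here is a hypothetical value of Constants.DATE_DATATYPE, which is not shown in this article.
parameters.FilterBy.Add(new FilterQuery
{
    FieldName = "created_on_datetime",
    DataType = "DateTime",
    LowerLimit = "2013-01-01",
    UpperLimit = "2013-06-30"
});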

Implementing search interface:

SolrNet is a free, open source .NET client library that can be integrated with your web application to build queries programmatically and execute them against Solr.

Creating Solr search result entity

We first create a class with properties that map to the fields returned in the Solr search result. To identify the fields, we can run the search query in the Solr Admin UI or ask the Solr administrator.

For example:
To get search results for the keyword “twitter”, we enter the keyword in the “Query String” textbox of the Solr Admin UI and hit the Search button.

A query such as http://endpoint/solr/select?q=twitter&fq=&start=0&rows=10 will appear in the browser; this is the search query.

The response will be XML with a node for each search result. You can identify the fields returned in this response and create the properties accordingly.

The SolrNet.Attributes namespace contains attributes that map fields from the Solr search result to an entity. These attributes can be used to augment an existing application entity or to create a parallel entity.

Example:

public class Product
{
    [SolrUniqueKey("company_id_text")]
    public string CompanyId { get; set; }

    [SolrField("product_count_integer")]
    public int ProductCount { get; set; }

    [SolrField("title_text")]
    public string Title { get; set; }

    [SolrField("created_on_datetime")]
    public DateTime CreatedOn { get; set; }

    [SolrField("downloadable_boolean")]
    public bool Downloadable { get; set; }
}

SolrUniqueKey uniquely identifies the document; in database terms, it is the primary key. Map a property with SolrUniqueKey when the field is unique for each document; otherwise use SolrField. The field name passed to the attribute must exactly match the field name in the Solr response XML.

Initialization:

using SolrNet;
using Microsoft.Practices.ServiceLocation; // provides ServiceLocator (CommonServiceLocator)

public class SearchProduct : ISearch<Product>
{
    static ISolrReadOnlyOperations<Product> solr;
    static SolrConnection connection;

    static SearchProduct()
    {
        connection = new SolrConnection("solrendpoint"); // e.g. http://localhost:8983/solr
        Startup.Init<Product>(connection);
        solr = ServiceLocator.Current.GetInstance<ISolrReadOnlyOperations<Product>>();
    }
}

The above code snippet initializes the SolrConnection. We also obtain an ISolrReadOnlyOperations<T> instance that we will use to build and execute the Solr query. Here Product is the type of search result to be fetched.

Building Queries:

public SearchResult<Product> Search(SearchParameters parameters)
{
    int? start = null;
    int? rows = null;

    if (parameters.PageIndex > 0)
    {
        start = (parameters.PageIndex - 1) * parameters.PageSize;
        rows = parameters.PageSize;
    }

    var matchingProducts = solr.Query(BuildQuery(parameters), new QueryOptions
    {
        FilterQueries = BuildFilterQueries(parameters),
        Rows = rows,
        Start = start,
        OrderBy = GetSelectedSort(parameters),
    });

    return new SearchResult<Product>(matchingProducts)
    {
        TotalResults = matchingProducts.NumFound,
    };
}

The Query() method has the following prototype:

SolrQueryResults<T> Query (ISolrQuery query, QueryOptions options); 
  • Query = the advanced full-text search query
  • Options = Filter, Sort, Pagination options.
  • Returns SolrQueryResults of type T. In our case T = Product.
  • In the above code, FilterQueries = the filter queries to be executed on the search results obtained after applying the full-text search query.
  • OrderBy = the sort queries to be executed on the search results obtained after filtering.
  • Rows = for pagination, specifies the number of search results to be returned.
  • Start = for pagination, specifies the offset in the Solr response from which the results are fetched.

Build advanced full-text search query:

public ISolrQuery BuildQuery(SearchParameters parameters)
{
    if (!string.IsNullOrEmpty(parameters.FreeSearch))
        return new SolrQuery(parameters.FreeSearch);

    AbstractSolrQuery searchquery = null;
    List<SolrQuery> solrQuery = new List<SolrQuery>();
    List<SolrQuery> solrNotQuery = new List<SolrQuery>();

    foreach (var searchType in parameters.SearchFor)
    {
        solrQuery.Add(new SolrQuery(string.Format("{0}:{1}", searchType.Key, searchType.Value)));
    }

    if (solrQuery.Count > 0)
        searchquery = new SolrMultipleCriteriaQuery(solrQuery, SolrMultipleCriteriaQuery.Operator.OR);

    foreach (var excludeType in parameters.Exclude)
    {
        solrNotQuery.Add(new SolrQuery(string.Format("{0}:{1}", excludeType.Key, excludeType.Value)));
    }

    if (solrNotQuery.Count > 0)
    {
        searchquery = (searchquery ?? SolrQuery.All) - new SolrMultipleCriteriaQuery(solrNotQuery, SolrMultipleCriteriaQuery.Operator.OR);
    }

    return searchquery ?? SolrQuery.All;
}
  • new SolrQuery("fieldname:value") – creates a Solr query for a full-text search of the given word/phrase in the named field.
  • new SolrMultipleCriteriaQuery(queries, operator) – applies the given boolean operator (AND, OR, etc.) to a list of Solr queries; a small composition sketch follows this list.
  • SolrQuery.All – returns all the documents in Solr without applying any search query.
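As a rough illustration of how these building blocks compose, consider the snippet below. The field names title_text and tags_text are made-up examples, and the comment only approximates the query Solr ultimately receives, since the exact string depends on SolrNet’s serialization.

var include = new SolrMultipleCriteriaQuery(
    new[] { new SolrQuery("title_text:azure") },
    SolrMultipleCriteriaQuery.Operator.OR);

var exclude = new SolrMultipleCriteriaQuery(
    new[] { new SolrQuery("tags_text:business") },
    SolrMultipleCriteriaQuery.Operator.OR);

// The '-' operator, also used in BuildQuery above, subtracts the excluded set,
// yielding roughly: (title_text:azure) AND NOT (tags_text:business)
AbstractSolrQuery query = include - exclude;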

Build Filter Queries:

public ICollection<ISolrQuery> BuildFilterQueries(SearchParameters parameters)
{
    List<ISolrQuery> filter = new List<ISolrQuery>();

    foreach (var filterBy in parameters.FilterBy)
    {
        if (!String.IsNullOrEmpty(filterBy.DataType) && filterBy.DataType.Equals(Constants.DATE_DATATYPE))
        {
            DateTime upperlim = Convert.ToDateTime(filterBy.UpperLimit);
            DateTime lowerlim = Convert.ToDateTime(filterBy.LowerLimit);

            if (upperlim.Equals(lowerlim))
            {
                upperlim = upperlim.AddDays(1);
            }

            filter.Add(new SolrQueryByRange<DateTime>(filterBy.FieldName, lowerlim, upperlim));
        }
        else
        {
            string[] filterValues;
            if (filterBy.Value.Contains(";"))
            {
                filterValues = filterBy.Value.Split(';');
                List<SolrQueryByField> filterForProduct = new List<SolrQueryByField>();

                foreach (string filterVal in filterValues)
                {
                    filterForProduct.Add(new SolrQueryByField(filterBy.FieldName, filterVal) { Quoted = false });
                }

                filter.Add(new SolrMultipleCriteriaQuery(filterForProduct, SolrMultipleCriteriaQuery.Operator.OR));
            }
            else
            {
                filter.Add(new SolrQueryByField(filterBy.FieldName, filterBy.Value));
            }
        }
    }

    return filter;
}

There are 2 types of filters:

  • SolrQueryByField:
new SolrQueryByField(filterBy.FieldName, filterVal) { Quoted = false }

This accepts the field name and the filter value. Quoted = false disables escaping of special characters in the value.

  • SolrQueryByRange:
new SolrQueryByRange<DateTime>(filterBy.FieldName, lowerlim, upperlim)

We use SolrQueryByRange when the filter values fall within a range. It takes the datatype of the field (DateTime in our case), the field name to filter on, its lower limit and its upper limit. A numeric example is shown below.
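For instance, a numeric range filter can be built in the same way; the field name product_count_integer is only an assumption taken from the Product entity above.

// Keep only documents whose product_count_integer lies between 10 and 100.
var countFilter = new SolrQueryByRange<int>("product_count_integer", 10, 100);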

Build Sort Queries:

private ICollection<SortOrder> GetSelectedSort(SearchParameters parameters)
{
    List<SortOrder> sortQueries = new List<SortOrder>();

    foreach (var sortBy in parameters.SortBy)
    {
        if (sortBy.order.Equals(SortQuery.SortOrder.Ascending))
            sortQueries.Add(new SortOrder(sortBy.FieldName, Order.ASC));
        else
            sortQueries.Add(new SortOrder(sortBy.FieldName, Order.DESC));
    }

    return sortQueries;
}

SolrNet’s SortOrder class accepts the field name to sort on along with the sort order (Order.ASC/Order.DESC). With the query, filter and sort builders in place, the interface can be consumed end to end, as sketched below.
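A hedged end-to-end usage sketch: it assumes the SearchParameters instance built earlier, and it assumes that SearchResult<T> (whose definition is not shown in this article) exposes the TotalResults property set in the Search method and enumerates the page of documents returned by Solr.

ISearch<Product> productSearch = new SearchProduct();

SearchResult<Product> result = productSearch.Search(parameters);

Console.WriteLine("Total matches in the index: " + result.TotalResults);

// Assumption: SearchResult<T> enumerates the documents of the requested page.
foreach (Product product in result)
{
    Console.WriteLine(product.Title + " (created " + product.CreatedOn + ")");
}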

Relevance Score

By default, Solr orders the search results by a relevancy score that measures how relevant a given document is to the user’s query. Roughly, the more often a query term appears in a document relative to how often it appears across all documents in the collection, the more relevant that document is to the query. So the ordering of search results stays sensible unless a sort parameter is given explicitly in the search query.
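As a rough sketch of that intuition, the classic tf-idf weighting underlying Lucene scoring can be written as follows (Lucene’s full formula also folds in field norms, coordination and boost factors):

\[ \mathrm{score}(t, d) \;\propto\; \mathrm{tf}(t, d) \times \log \frac{N}{\mathrm{df}(t)} \]

where tf(t, d) is the number of times term t occurs in document d, df(t) is the number of documents containing t, and N is the total number of documents in the collection.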

Conclusion

This paper has provided a detailed description of the search options available with Apache Solr and will hopefully serve as a thorough guide for deciding on search parameters, constructing queries and building an advanced search interface.


Apache Solr vs ElasticSearch

May 16, 2013

The other day I had a really good discussion about Apache Solr. I was fascinated by Solr when I used it on my last project. During the discussion another competing product came up: ElasticSearch. I still haven’t started playing around with ElasticSearch, so here is the start of my pet project.


The Feature Smackdown


API

Feature Solr 4.2 ElasticSearch 0.90.0.RC1
Format XML,CSV,JSON JSON
HTTP REST API
Binary API   SolrJ  TransportClient, Thrift (through a plugin)
JMX support
Client libraries  PHP, Ruby, Perl, Scala, Python, .NET, Javascript PHP, Ruby, Perl, Scala, Python, .NET, Javascript, Erlang, Clojure
3rd-party product integration (open-source) Drupal, Magento, Django, ColdFusion, WordPress, OpenCMS, Plone, Typo3, ez Publish, Symfony2, Riak (via Yokozuna) Django, Symfony2
3rd-party product integration (commercial) DataStax Enterprise Search SearchBlox

Indexing

Feature Solr 4.2 ElasticSearch 0.90.0.RC1
Data Import DataImportHandler – MySQL, CSV, XML, Tika, URL, Flat File Rivers modules – Wikipedia, MongoDB, CouchDB, RabbitMQ, RSS, Sofa, JDBC, FileSystem, Dropbox, ActiveMQ, LDAP, Amazon SQS, St9, OAI, Twitter
ID field for updates and deduplication
Partial Doc Updates   with stored fields  with _source field
Custom Analyzers and Tokenizers 
Per-field analyzer chain 
Per-doc/query analyzer chain 
Synonyms   Supports Solr and Wordnet synonym format
Multiple indexes 
Near-Realtime Search/Indexing 
Complex documents   Flat document structure. No native support for nesting documents
Multiple document types per schema   One set of fields per schema, one schema per core
Online schema changes   Schema change requires restart. Workaround possible using MultiCore.  Only backward-compatible changes.
Apache Tika integration 
Dynamic fields 
Field copying   via multi-fields
Hash-based deduplication 

Searching

Feature Solr 4.2 ElasticSearch 0.90.0.RC1
Lucene Query parsing 
Structured Query DSL   Need to programmatically create queries if going beyond Lucene query syntax.
Span queries   via SOLR-2703
Spatial search 
Multi-point spatial search 
Faceting   The way top N facets work now is by getting the top N from each shard, and merging the results. This can give incorrect counts when num shards > 1.
Pivot Facets 
More Like This
Boosting by functions 
Boosting using scripting languages 
Push Queries   Percolation
Field collapsing/Results grouping   possibly 0.20+
Spellcheck  Suggest API
Autocomplete Beta implementation from community plugin
Query elevation 
Joins   It’s not supported in distributed search. See LUCENE-3759.  via has_children and top_children queries
Filter queries   also supports filtering by native scripts
Filter execution order   local params and cache property  _cache and _cache_key property
Alternative QueryParsers   DisMax, eDisMax  query_string, dis_max, match, multi_match etc
Negative boosting   but awkward. Involves positively boosting the inverse set of negatively-boosted documents.
Search across multiple indexes  it can search across multiple compatible collections
Result highlighting
Custom Similarity 
Searcher warming on index reload   Warmers API

Customizability

Feature Solr 4.2 ElasticSearch 0.90.0.RC1
Pluggable API endpoints 
Pluggable search workflow   via SearchComponents
Pluggable update workflow 
Pluggable Analyzers/Tokenizers
Pluggable Field Types
Pluggable Function queries
Pluggable scoring scripts
Pluggable hashing 

Distributed

Feature Solr 4.2 ElasticSearch 0.90.0.RC1
Self-contained cluster   Depends on separate ZooKeeper server  Only ElasticSearch nodes
Automatic node discovery  ZooKeeper  internal Zen Discovery or ZooKeeper
Partition tolerance  The partition without a ZooKeeper quorum will stop accepting indexing requests or cluster state changes, while the partition with a quorum continues to function.  Partitioned clusters can diverge unless discovery.zen.minimum_master_nodes set to at least N/2+1, where N is the size of the cluster. If configured correctly, the partition without a quorum will stop operating, while the other continues to work. See this
Automatic failover  If all nodes storing a shard and its replicas fail, client requests will fail, unless requests are made with the shards.tolerant=true parameter, in which case partial results are returned from the available shards.
Automatic leader election
Shard replication
Sharding 
Automatic shard rebalancing   it can be machine, rack, availability zone, and/or data center aware. Arbitrary tags can be assigned to nodes and it can be configured to not assign the same shard and its replicates on a node with the same tags.
Change # of shards  specified at index-creation time, with command-line param -DnumShards=n. Cannot be changed once index is created. Shard splitting is a work in progress (SOLR-3755). Additional replicas can be created.  each index has 5 shards by default. Number of primary shards cannot be changed once the index is created. Replicas can be increased anytime.
Relocate shards and replicas   can move shards and replicas to any node in the cluster on demand
Control shard routing   with some config changes  routing parameter
Consistency Indexing requests are synchronous with replication. An indexing request won’t return until all replicas respond. No check for downed replicas. They will catch up when they recover. When new replicas are added, they won’t start accepting and responding to requests until they are finished replicating the index. Replication between nodes is synchronous by default, thus ES is consistent by default, but it can be set to asynchronous on a per-document indexing basis. Index writes can be configured to fail if there are not sufficient active shard replicas. The default is quorum, but all or one are also available.

Thoughts…

As a number of folks point out in the discussion below, feature comparisons are inherently shallow and only go so far. I think they serve a purpose, but shouldn’t be taken to be the last word on these 2 fantastic search products.

If you’re running a smallish site and need search features without the distributed bells-and-whistles, I think you’ll be very happy with either Solr or ElasticSearch.

The exception to this is if you need RIGHT NOW some very specific feature like field grouping which is currently implemented in Solr and not ElasticSearch. Because of the considerable momentum behind ElasticSearch, it is very likely that the feature-set between the 2 products will converge considerably in the near future.

If you’re planning a large installation that requires running distributed search instances, I suspect you’re going to be happier with ElasticSearch.

As Matt Weber points out below, ElasticSearch was built to be distributed from the ground up, not tacked on as an ‘afterthought’ like it was with Solr. This is totally evident when examining the design and architecture of the 2 products, and also when browsing the source code.




Contribute

If you see any mistakes, or would like to append to the information on this webpage, you can clone the GitHub repo for this site with:

git clone https://github.com/superkelvint/solr-vs-elasticsearch

and submit a pull request.

Securing Solr on Tomcat

September 10, 2012

Note: Throughout this document the reference [Tomcat install dir] is the directory where Tomcat is installed. Typically, this is C:\Program Files\Apache Software Foundation\Tomcat 7.0 or C:\Program Files (x86)\Apache Software Foundation\Tomcat 7.0.

Restricting access using a user account

  1. Open [Tomcat install dir]\tomcat-users.xml for editing.
  2. Add the following lines within the <tomcat-users> element and save the changes (using your own username and password):
    <role rolename="solr_admin"/>
    <user username="your_username"
          password="your_password"
          roles="solr_admin"
          />

    Open [Tomcat install dir]\webapps\solr\WEB-INF\web.xml for editing.

    “solr” in the path is the name of the instance you want to secure. Typically this is “solr”, but it may be different if you are running an advanced setup.

  3. Add the following lines within the <web-app> element:
      <security-constraint>
        <web-resource-collection>
          <web-resource-name>Solr Lockdown</web-resource-name>
          <url-pattern>/</url-pattern>
        </web-resource-collection>
        <auth-constraint>
          <role-name>solr_admin</role-name>
          <role-name>admin</role-name>
        </auth-constraint>
      </security-constraint>
      <login-config>
        <auth-method>BASIC</auth-method>
        <realm-name>Solr</realm-name>
      </login-config>
  4. Save the changes and restart Tomcat. Test your changes by starting a new browser session and navigating to your site, for ex. http://localhost:8080/solr/. You should be prompted for credentials.
  5. Download the following communityserver_override.config file.
  6. Place this file in the root of your Web site directory.
  7. Find the section of this file shown below and change the host value to read
    http://your_username:your_password@localhost:8080/solr
    replacing “your_username” and “your_password” and the host URI with your configured values.

    <Override xpath="/CommunityServer/Search/Solr"
              mode="change"
              name="host"
              value="http://localhost:8080/solr"
              />
  8. Save changes.

Restricting by IP Address

You can work with your IT department to limit connections, but in case this is not an option you can enforce connection rules using the Tomcat configuration.

  1. Open the solr.xml file in [Tomcat Install Dir]/conf/Catalina/localhost/. Create the file if it does not exist.

    The name of the file should match your instance name. A typical setup is running Solr at http://localhost:8080/solr, so the file should be named solr.xml (all lowercase, case-sensitive characters). If you set up Solr to run at a different location, e.g., http://localhost/solr1, the file must be named solr1.xml.

  2. Add the following snippet to the file, but update the docBase value with the path to where your solr.war file resides. Typically this would be [Tomcat install dir]/webapps/.

    For example, the following configuration only allows connections from the local computer and from 172.15.1.1:

    <Context docBase="C:\Program Files\Apache Software Foundation\Tomcat 6.0\webapps\solr.war"
             debug="0" crossContext="true"> 
     <Valve className="org.apache.catalina.valves.RemoteAddrValve"
            allow="127.0.0.1,172.15.1.1" /> 
    </Context>
  3. Save your work and restart Tomcat.

Full text search

July 19, 2012

In text retrieval, full text search refers to techniques for searching a single computer-stored document or a collection in a full text database. Full text search is distinguished from searches based on metadata or on parts of the original texts represented in databases (such as titles, abstracts, selected sections or bibliographical references).

In a full text search, the search engine examines all of the words in every stored document as it tries to match search criteria (e.g., words supplied by a user). Full text searching techniques became common in online bibliographic databases in the 1990s. Many web sites and application programs (such as word processing software) provide full-text search capabilities. Some web search engines such as AltaVista employ full text search techniques, while others index only a portion of the web pages examined by their indexing systems.

Indexing

When dealing with a small number of documents it is possible for the full-text search engine to directly scan the contents of the documents with each query, a strategy called serial scanning. This is what some rudimentary tools, such as grep, do when searching.

However, when the number of documents to search is potentially large or the quantity of search queries to perform is substantial, the problem of full text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms, often called an index, but more correctly named a concordance. In the search stage, when performing a specific query, only the index is referenced rather than the text of the original documents.

The indexer will make an entry in the index for each term or word found in a document and possibly its relative position within the document. Usually the indexer will ignore stop words, such as the English “the”, which are both too common and carry too little meaning to be useful for searching. Some indexers also employ language-specific stemming on the words being indexed, so for example any of the words “drives”, “drove”, or “driven” will be recorded in the index under a single concept word “drive”.
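To make the two-stage idea concrete, here is a toy C# sketch of an indexer that records each term with its document and position while skipping stop words. It is only an illustration of the data structure described above, not how Lucene or Solr actually implement their indexes, and the stop-word list and tokenization rules are made up for the example.

using System;
using System.Collections.Generic;
using System.Linq;

// Toy inverted index: maps each term to (documentId, position) postings.
class ToyIndexer
{
    static readonly HashSet<string> StopWords = new HashSet<string> { "the", "a", "of" };

    // term -> postings list of (document id, position within the document)
    readonly Dictionary<string, List<(int DocId, int Position)>> index =
        new Dictionary<string, List<(int DocId, int Position)>>();

    public void AddDocument(int docId, string text)
    {
        var terms = text.ToLowerInvariant()
                        .Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries);

        for (int position = 0; position < terms.Length; position++)
        {
            string term = terms[position];
            if (StopWords.Contains(term)) continue;           // ignore stop words

            if (!index.TryGetValue(term, out var postings))
                index[term] = postings = new List<(int DocId, int Position)>();

            postings.Add((docId, position));
        }
    }

    // The search stage consults only the index, never the original documents.
    public IEnumerable<int> Search(string term) =>
        index.TryGetValue(term.ToLowerInvariant(), out var postings)
            ? postings.Select(p => p.DocId).Distinct()
            : Enumerable.Empty<int>();
}

After indexing a few documents, Search("drive") returns the ids of documents containing that exact token; a real engine would additionally stem “drives”, “drove” and “driven” onto the same index entry, as described above.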

The precision vs. recall tradeoff

[Diagram: a low-precision, low-recall search, as described in the text below.]

Recall measures the quantity of results returned by a search and precision is the measure of the quality of the results returned. Recall is the ratio of relevant results returned divided by all relevant results. Precision is the number of relevant results returned divided by the total number of results returned.

The diagram represents a low-precision, low-recall search. In the diagram the red and green dots represent the total population of potential search results for a given search. Red dots represent irrelevant results, and green dots represent relevant results. Relevancy is indicated by the proximity of search results to the center of the inner circle. Of all possible results shown, those that were actually returned by the search are shown on a light-blue background. In the example only one relevant result of three possible relevant results was returned, so the recall is a very low ratio of 1/3, or 33%. The precision for the example is a very low 1/4, or 25%, since only one of the four results returned was relevant.
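Written as formulas (the standard definitions, with the numbers from the example above plugged in):

\[
\text{recall} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{relevant}|} = \frac{1}{3} \approx 33\%,
\qquad
\text{precision} = \frac{|\text{relevant} \cap \text{retrieved}|}{|\text{retrieved}|} = \frac{1}{4} = 25\%
\]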

Due to the ambiguities of natural language, full text search systems typically include options such as stop words to increase precision and stemming to increase recall. Controlled-vocabulary searching also helps alleviate low-precision issues by tagging documents in such a way that ambiguities are eliminated. The trade-off between precision and recall is simple: an increase in precision can lower overall recall, while an increase in recall lowers precision.

False-positive problem

Free text searching is likely to retrieve many documents that are not relevant to the intended search question. Such documents are called false positives. The retrieval of irrelevant documents is often caused by the inherent ambiguity of natural language. In the sample diagram above, false positives are represented by the irrelevant results (red dots) that were returned by the search (on a light-blue background).

Clustering techniques based on Bayesian algorithms can help reduce false positives. For a search term of “football”, clustering can be used to categorize the document/data universe into “American football”, “corporate football”, etc. Depending on the occurrences of words relevant to those categories, a search result can be placed in one or more of them. This technique is extensively deployed in the e-discovery domain.

Performance improvements

The deficiencies of free text searching have been addressed in two ways: By providing users with tools that enable them to express their search questions more precisely, and by developing new search algorithms that improve retrieval precision.

Improved querying tools

  • Keywords. Document creators (or trained indexers) are asked to supply a list of words that describe the subject of the text, including synonyms of words that describe this subject. Keywords improve recall, particularly if the keyword list includes a search word that is not in the document text.
  • Field-restricted search. Some search engines enable users to limit free text searches to a particular field within a stored data record, such as “Title” or “Author.”
  • Boolean queries. Searches that use Boolean operators (for example, “encyclopedia” AND “online” NOT “Encarta”) can dramatically increase the precision of a free text search. The AND operator says, in effect, “Do not retrieve any document unless it contains both of these terms.” The NOT operator says, in effect, “Do not retrieve any document that contains this word.” If the retrieval list retrieves too few documents, the OR operator can be used to increase recall; consider, for example, “encyclopedia” AND “online” OR “Internet” NOT “Encarta”. This search will retrieve documents about online encyclopedias that use the term “Internet” instead of “online.” This increase in precision is very commonly counter-productive since it usually comes with a dramatic loss of recall.
  • Phrase search. A phrase search matches only those documents that contain a specified phrase, such as “Wikipedia, the free encyclopedia.”
  • Concept search. A search that is based on multi-word concepts, for example Compound term processing. This type of search is becoming popular in many e-Discovery solutions.
  • Concordance search. A concordance search produces an alphabetical list of all principal words that occur in a text with their immediate context.
  • Proximity search. A proximity search matches only those documents that contain two or more words separated by no more than a specified number of words; a search for “Wikipedia” WITHIN2 “free” would retrieve only those documents in which the words “Wikipedia” and “free” occur within two words of each other.
  • Regular expression. A regular expression employs a complex but powerful querying syntax that can be used to specify retrieval conditions with precision.
  • Fuzzy search. A fuzzy search matches documents containing the given terms or close variations of them (for instance, using edit distance as a threshold for the allowed variation).
  • Wildcard search. A search that substitutes one or more characters in a search query for a wildcard character such as an asterisk. For example using the asterisk in a search query “s*n” will find “sin”, “son”, “sun”, etc. in a text.

Improved search algorithms

The PageRank algorithm developed by Google gives more prominence to documents to which other Web pages have linked.

Software

The following is a partial list of available software products whose predominant purpose is to perform full text indexing and searching. Some of these are accompanied with detailed descriptions of their theory of operation or internal algorithms, which can provide additional insight into how full text search may be accomplished.

Free and open source software

  • Apache Solr
  • BaseX
  • Clusterpoint Server (freeware licence for a single-server)
  • DataparkSearch
  • ElasticSearch (Apache License, Version 2.0)
  • Ferret
  • ht://Dig
  • Hyper Estraier
  • KinoSearch
  • Lemur/Indri
  • Lucene
  • mnoGoSearch
  • Sphinx
  • Swish-e
  • Xapian

Proprietary software

  • askSam
  • Attivio
  • Autonomy Corporation
  • BA Insight
  • Brainware
  • BRS/Search
  • Clusterpoint Server (cluster license)
  • Concept Searching Limited
  • Dieselpoint
  • dtSearch
  • Endeca
  • Exalead
  • Fast Search & Transfer
  • Inktomi
  • Locayta
  • Lucid Imagination
  • MarkLogic
  • Vivísimo

SOLR: What is schema.xml?

April 28, 2012

One of the configuration files that describes every Solr implementation is the schema.xml file. It describes one of the most important aspects of the implementation: the structure of the data index. The information contained in this file lets you control how Solr behaves when indexing data or when executing queries. Schema.xml is not only the structure of the index; it also contains detailed information about data types, which have a large influence on Solr’s behavior and are usually treated with neglect. This entry tries to bring some insight into schema.xml.

Schema.xml file consists of several parts:

  • version,
  • type definitions,
  • field definitions,
  • copyField section,
  • additional definitions.

Version

The first thing we come across in the schema.xml file is the version. This tells Solr how to treat some of the attributes in the schema.xml file. The definition is as follows:

<schema name="example" version="1.3">

Please note that this is not the definition of the version from the perspective of your project. At this point Solr supports four versions of a schema.xml file:

  • 1.0 – multiValued attribute does not exist, all fields are multivalued by default.
  • 1.1 – introduced multiValued attribute, the default attribute value is false.
  • 1.2 – introduced omitTermFreqAndPositions attribute, the default value is true for all fields, besides text fields.
  • 1.3 – removed the possibility of an optional compression of fields.

Type definitions

Type definitions can be logically divided into two separate sections: simple types and complex types. Simple types, as opposed to complex types, do not have filters and a tokenizer defined.

Simple types

The first thing we see in the schema.xml file after the version is the type definitions. Each type is described by a number of attributes defining the behavior of that type. First, the attributes that describe each type and are mandatory:

  • name – name of the type (required attribute).
  • class – the class that is responsible for the implementation. Please note that classes delivered in the standard Solr package have names with the ‘solr.’ prefix.

Besides the two mentioned above, types can have the following optional attributes:

  • sortMissingLast – attribute specifying how values in a field based on this type should be treated in case of sorting. When set to true, documents without a value in a field of this type will always be at the end of the results list regardless of sort order. The default attribute value is false. The attribute can be used only for types that are considered by Lucene as a string.
  • sortMissingFirst – attribute specifying how values in a field based on this type should be treated in case of sorting. When set to true, documents without a value in a field of this type will always be at the first positions of the results list regardless of sort order. The default attribute value is false. The attribute can be used only for types that are considered by Lucene as a string.
  • omitNorms – attribute specifying whether field normalization should take place.
  • omitTermFreqAndPositions – attribute specifying whether term frequency and term positions should be calculated.
  • indexed – attribute specifying whether fields based on this type will be indexed (searchable).
  • stored – attribute specifying whether fields based on this type will keep their original values.
  • positionIncrementGap – attribute specifying how many positions Lucene should skip between the values of a multi-valued field.

It is worth remembering that with the default settings of the sortMissingLast and sortMissingFirst attributes, Lucene places documents with blank field values at the beginning of the results list for ascending sorts, and at the end of the results list for descending sorts.

One more option exists for simple types, but only for those based on the Trie*Field classes:

  • precisionStep – attribute specifying the number of bits of precision. The greater the number of bits, the faster the queries based on numerical ranges. This however, also increases the size of the index, as more values are indexed. Set attribute value to 0 to disable the functionality of indexing at various precisions.

An example of a simple type defined:

<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

Complex types

In addition to simple types, schema.xml file may include types consisting of a tokenizer and filters. Tokenizer is responsible for dividing the contents of the field in the tokens, while the filters are responsible for further token analysis. For example, the type that is responsible for dealing with the texts in Polish, would consist of a tokenizer in charge of the division of words based on whitespace, commas and periods. Filters for that type could be responsible for bringing generated tokens to lowercase, further division of tokens (for example on the basis of dashes), and then bringing tokens to the basic form.

Complex types, like simple types, have a name (the name attribute) and a class responsible for the implementation (the class attribute). They can also be characterized by the other attributes described for simple types (on the same basis). In addition, complex types can define a tokenizer and filters to be used at the indexing stage and at the query stage. As most of you know, for a given phase (indexing or query) there can be many filters defined, but only one tokenizer. For example, this is what the text type definition looks like in the example schema provided with Solr:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
   <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
   <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
</fieldType>

It is worth noting that there is an additional attribute for the text field type:

  • autoGeneratePhraseQueries

This attribute tells filters how to behave when dividing tokens. Some filters (such as WordDelimiterFilter) can divide a token into a set of tokens. Setting the attribute to true (the default value) will automatically generate phrase queries. This means that WordDelimiterFilter will divide the word “wi-fi” into two tokens, “wi” and “fi”. With autoGeneratePhraseQueries set to true, the query sent to Lucene will look like "field:wi fi" (a phrase), while with it set to false the Lucene query will look like field:wi OR field:fi. However, please note that this attribute only behaves well with tokenizers based on whitespace.

Returning to the type definition: as you can see, the example above has two main sections:

<analyzer type="index">

and

<analyzer type="query">

The first section defines the analysis used when indexing documents; the second defines the analysis used for queries against fields based on this type. Note that if you want to use the same definition for both the indexing and the query phase, you can use a single analyzer section without the type attribute. Then our definition will look like this:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
   <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
</fieldType>

As I mentioned, each complex type definition contains a tokenizer and a series of filters (though not necessarily). I will not describe every filter and tokenizer available in Solr; this information is available at the following address: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

At the end I wanted to add an important thing. Starting from Solr 1.4, the tokenizer does not need to be the first mechanism that analyzes the field. Solr 1.4 introduced new filters – CharFilters – that operate on the field content before the tokenizer and pass their result on to it. It is worth knowing because it might come in useful.

Multi-dimensional types

At the end I left myself a little addition – a novelty in Solr 1.4 – multi-dimensional fields: fields consisting of a number of other fields. Generally speaking, the assumption behind this type of field is simple – to store in Solr pairs, triples or larger groups of related values, such as geographical point coordinates. In practice this is realized by means of dynamic fields, but let me not get into the implementation details. A sample type definition that consists of two fields:

<fieldType name="location" class="solr.PointType" dimension="2" subFieldSuffix="_d"/>

In addition to standard attributes: name and class there are two others:

  • dimension – the number of dimensions (used by the solr.PointType class).
  • subFieldSuffix – the suffix that will be added to the dynamic fields created by this type. It is important to remember that a field based on the presented type will create three fields in the index – the actual field (for example named mylocation) and two additional dynamic fields.

Field Definitions

Field definitions are another section of the schema.xml file – the section which, in theory, should interest us the most when designing a Solr index. As a rule, we find two kinds of field definitions here:

  1. Static Fields
  2. Dynamic Fields

These fields are treated differently by Solr. The first kind, static fields, are fields available under a single name. Dynamic fields are fields available under many names – their names are actually simple patterns (a name starting or ending with a ‘*’ sign). Please note that Solr first matches static fields, then dynamic fields. In addition, if a field name matches more than one definition, Solr selects the field with the longer name pattern.

Returning to the definition of the fields (both static and dynamic), they consist of the following attributes:

  • name – the name of the field (required attribute).
  • type – type of field, which is one of the pre-defined types (required attribute).
  • indexed – if a field is to be indexed (set to true, if you want to search or sort on this field).
  • stored – whether you want to store the original values (set to true if you want to retrieve the original value of the field).
  • omitNorms – whether norms should be omitted for the field; norms are used for length normalization and index-time boosting, which mainly matter for fields used in full-text search.
  • termVectors – set to true when you want to keep so-called term vectors. The default value is false. Some features require setting this parameter to true (e.g. MoreLikeThis or FastVectorHighlighting).
  • termPositions – set to true if you want to keep term positions with the term vector. Setting it to true will cause the index to grow in size.
  • termOffsets – set to true if you want to keep term offsets together with the term vector. Setting it to true will cause the index to grow in size.
  • default – the default value given to the field when the document does not provide any value for it.

The following are example field definitions:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="includes" type="text" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
<field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>
<dynamicField name="*_i" type="int" indexed="true" stored="true"/>

And finally, additional information to remember: in addition to the attributes listed above, a field definition can override attributes that were defined for its type (e.g. whether a field is multiValued – see the example above for the field called timestamp). Sometimes this functionality is useful if you need a specific field whose behavior differs slightly from its type (as in the example – only the multiValued attribute). Of course, keep in mind the limitations imposed on the individual attributes associated with types.

CopyField section

In short, this section is responsible for copying the contents of fields to other fields. We define the field whose value should be copied, and the destination field. Please note that copying takes place before the field value is analyzed. An example copyField definition:

<copyField source="category" dest="text"/>

For the sake of accuracy, occurring attributes mean:

  • source – the source field,
  • dest – the destination field.

Additional definitions

1. Unique key definition

The definition of a unique key makes it possible to unambiguously identify a document. Defining a unique key is not necessary, but it is recommended. Sample definition:

<uniqueKey>id</uniqueKey>

2. Default search field definition

This section is responsible for defining the default search field, which Solr uses when the query does not specify a field. Sample definition:

<defaultSearchField>content</defaultSearchField>

3. Default logical operator definition

This section is responsible for the definition of default logical operator that will be used. Sample definition looks as follows:

<solrQueryParser defaultOperator="OR" />

Possible values are: OR and AND.

4. Defining similarity

Finally, we define the similarity that we will use. It is rather a topic for another post, but you should know that, if necessary, you can change the default similarity (currently in the Solr trunk there are already two similarity classes). The sample definition is as follows:

<similarity class="pl.solr.similarity.CustomSimilarity" />

A few words at the end

The information presented above should give some insight into what the schema.xml file is and what the different sections in it correspond to. Soon I will try to write about what you should avoid when designing the index.
