Solr is an open source enterprise search platform, written in Java, from the Apache Lucene project. It helps us find the required information from large data sources like Hadoop. Its capabilities don't stop with searching: it can also be used for storage, like NoSQL databases, which makes it a non-relational data storage and processing technology.
Solr is widely used for enterprise search and analytics use cases and has an active development community. It runs as a standalone search server, uses Lucene (a Java search library) at its core, and exposes REST-like HTTP APIs (XML, JSON) which make it accessible from distributed applications written in various programming languages.
Felt textbook-ish? ha ha.. Let's make it simple..
Let's say you are preparing some research notes and want to refer to books from the university central library. You got the book you want, say "Advanced Algorithm Design", and want to read about "Pattern Matching Algorithms". What is the first thing you do? You open the index of the book to find which page your topic is on, and jump directly to that page to read.
Solr does something similar.. it stores the information you give it against a reference called an index, and fetches it back using that index when you search. So, Solr works on the concept of indexing.
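To make the idea concrete, here is a toy inverted index in Python. This is only a sketch of the concept with made-up documents; the real index Lucene builds for Solr is far more sophisticated (tokenization, scoring, compression, and more).

```python
# A toy inverted index: map each term to the set of documents containing it,
# just like a book's index maps a topic to page numbers.
docs = {
    1: "advanced algorithm design",
    2: "pattern matching algorithm",
    3: "graph algorithms in java",
}

index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

# Searching is now a dictionary lookup instead of scanning every document.
print(sorted(index["algorithm"]))
```

Notice that lookup cost no longer depends on reading every document, which is exactly why indexing makes search fast.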
Features of Solr: Let's take a look at what Apache Solr is capable of doing..
- Document parsing (to store)
- Full-Text search using Lucene
- Text highlighting
- JSON, XML, PHP, Ruby, Python, XSLT, Velocity and custom Java binary output formats over HTTP
- Distributed search through Sharding
- Geo-spatial search
- Clustering of search results, and so on..
Architecture: This is how your search request is processed to produce results..
- Request Handling: The request we send to Solr is taken care of by a Request Handler. There are different request handlers for different types of requests. We choose the appropriate request handler by mapping it to an end-point URI (/select, /update) to get our request resolved.
- Search: Solr is equipped with multiple search components, like the spell-checking component, highlighting component, query component, etc., which perform the search with the help of search handlers.
- Query Parsing: After a quick syntactical check, Solr parses the queries we pass to it into a form the underlying Lucene understands.
- Response Writing: Once the search is complete, Response Writer components format the result into one of the supported formats like XML, JSON, CSV, etc.
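The pieces above map directly onto the parameters of a Solr request. Here is a small Python sketch that builds such a request URL (the host and core name `techproducts` are assumptions for illustration; nothing is actually sent):

```python
from urllib.parse import urlencode

# Each part of the URL corresponds to a stage of the architecture:
# the end-point (/select) picks the request handler, "q" is the query
# that gets parsed, and "wt" tells the response writer which format to use.
base = "http://localhost:8983/solr/techproducts"  # assumed host and core
params = {"q": "cat:electronics", "wt": "json", "rows": 10}
url = base + "/select?" + urlencode(params)
print(url)
```

Swapping `/select` for `/update`, or `wt=json` for `wt=xml`, routes the same request through a different handler or response writer.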
Get Solr.. Now, let's get our hands dirty with Solr.. Download Solr from here..
NOTE: As Apache Solr is open source, even the source code (solr-xxx-src.zip) is available to download. As we just want to use it, make sure you download solr-xxx.zip.
Let's get Solr up and running..
Once you download it, unzip and copy the folder to some other location (better not to leave it in ../Downloads). Let's say it is now in D:/varun/tools/ApacheSolr
- Launch command prompt (Terminal in case of Mac)
- Navigate to the folder
- Now start the Solr server using the start command:
bin/solr start
After printing some verbose output.. it finally says:
Waiting up to 180 seconds to see Solr running on port 8983
Started Solr server on port 8983 (pid=8248). Happy searching!
There you go.. You have successfully started the Solr server..
- Now navigate to localhost:8983/solr (the Solr Admin UI) in your browser and you should see the following screen
Now that you have started the Solr Server, you need to know some basic commands of Solr to operate it.
- By default Solr uses port 8983; if you want to start the server on some other port, pass the port number as
bin/solr start -p 4567
- To stop the server, use
bin/solr stop -p 4567
Sending stop command to Solr running on port 4567 ... waiting 5 seconds to
allow Jetty process 6035 to stop gracefully.
This means Solr is saying goodbye to you.
- Sometimes you may want to restart the server. Then execute
bin/solr restart -p 4567
The server should restart.
For any help, use
bin/solr -help
It shows the different commands you can use:
Usage: solr COMMAND OPTIONS where COMMAND is one of: start, stop, restart, status, healthcheck, create, create_core, create_collection, delete, version, zk, auth, assert, config
Standalone server example (start Solr running in the background on port 4567):
bin/solr start -p 4567
SolrCloud example (start Solr running in SolrCloud mode using localhost:2181 to connect to Zookeeper, with 1g max heap size and remote Java debug options enabled):
bin/solr start -c -m 1g -z localhost:2181 -a "-Xdebug -Xrunjdwp:transport=dt_socket,server=y,suspend=n,address=1044"
Pass -help after any COMMAND to see command-specific usage information, such as: bin/solr start -help or bin/solr stop -help
If you observe, the highlighted region of the above screenshot says 'No cores available'. So, what are cores? To understand that, let us get familiar with the terminology..
- Instance is a Solr server instance (just like a Tomcat instance). We can see the instances in the Solr home directory. A Solr instance runs in a JVM, and an instance contains one or more cores.
- core is a unit within an instance into which you store the indexes. Instead of maintaining multiple instances of Solr servers, we have one or more cores in a single instance.
- document is the basic unit of information in Solr: a set of data that describes one thing. Say a person is the document; name, age, gender, height, etc., are the data that describe the person.
- fields are what describe the document. In the above example, name, age, gender and height are the fields.
- field type says what type of data is in a field. If you properly specify the field type, like name as a string and age as an int, it makes it easy for Solr to search and return results.
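Putting the terminology together, here is a hypothetical "person" document sketched as a Python dict. The field names and values are made up for illustration; in Solr the corresponding field types would be declared in the schema.

```python
# A Solr document is just a set of fields describing one thing.
# Each key is a field; the comments show the field type the schema
# would declare for it.
person = {
    "id": "person-1",   # string field (unique key)
    "name": "Varun",    # string field
    "age": 29,          # int field
    "height": 1.75,     # float field
    "gender": "male",   # string field
}

# Once indexed, every field becomes searchable, e.g. q=name:Varun
print(sorted(person.keys()))
```

A core holds many such documents, and an instance holds one or more cores.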
Let's feed Solr..
Along with data, Solr takes some configuration as well. Following are the configuration files:
solr.xml is the configuration file which carries information about the cores. Solr uses this file to load / identify the cores.
schema.xml is the file which defines the fields and field types — the complete schema. It is also named managed-schema.xml.
solrconfig.xml carries core-specific configurations like request handlers (which process your requests to Solr), response formatting and lots more.
data-config.xml is required when you are pulling data from databases. In this file you configure the dataSource (URL, username, password), entities (tables) and fields (columns).
Now let us give it the content.. We call this process indexing, which makes the content searchable.
A Solr index can accept data from many different sources including XML, CSV, Word documents, PDFs and data from databases as well.
Now, to post the data from documents into Solr, you can find the sample data in the ../../example/exampledocs directory
On Linux / Mac:
bin/post -c techproducts example/exampledocs/*
On Windows:
java -jar -Dc=techproducts -Dauto example\exampledocs\post.jar example\exampledocs\*
(* means all the files in the given directory)
You should see some verbose output saying it is POSTing the files one after another..
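Under the hood, the post tool is just sending documents to Solr's /update handler as JSON (or XML). Here is a rough Python sketch of building such a request body — the document is made up for illustration, and nothing is actually sent:

```python
import json

# Documents go to Solr's /update handler as a JSON array.
doc = {"id": "book-1", "name": "Advanced Algorithm Design", "inStock": True}
payload = json.dumps([doc])

# The request would target something like this URL (assumed host and core);
# commit=true makes the new document visible to searches immediately.
update_url = "http://localhost:8983/solr/techproducts/update?commit=true"
print(update_url)
print(payload)
```

Any HTTP client (curl, POSTMAN, a native client library) can send this same payload with Content-Type: application/json.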
Let's Search now..
We can search via REST clients, POSTMAN on Chrome, curl, etc., and also with native clients for many programming languages.
For now, we will explore the Solr Admin UI. If you look, the techproducts core has loaded (if not, click on the dropdown and choose it). Click on the "Execute Query" button and you'll get 10 results (the default) displayed in JSON format.
In this way, we can search for information by changing the query in the "q" field. We can search for: a single term (inStock:true), a particular field (cat:electronics), or a combined query (+electronics -music, which means results with electronics and without music).
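When you move from the Admin UI to curl or a REST client, these queries must be URL-encoded in the "q" parameter. A small Python sketch of what the query strings above look like on the wire:

```python
from urllib.parse import quote

# The same query syntax examples, URL-encoded as they would appear
# in the "q" parameter of a /select request.
queries = {
    "single term":      "inStock:true",
    "particular field": "cat:electronics",
    "combined":         "+electronics -music",  # with electronics, without music
}
for label, q in queries.items():
    print(label, "->", "q=" + quote(q))
```

Note that + and the space must be percent-encoded, otherwise the server will misread the combined query.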
You can explore by changing the query.
That's all for now.. Hope this blog gave you at least a basic understanding of Apache Solr. In the upcoming blog, we'll see how to get data from a database and explore the client APIs.
Till then, Happy Searching !!