Installing SolrCloud on Hadoop

11.05.2015

Apache Solr is a powerful open-source search platform. In this article, we discuss how to set up a Solr cluster (SolrCloud) on Hadoop so that the Solr index data is stored in HDFS and is available for search.

Terminology

For more information on Solr terminology, please refer to https://wiki.apache.org/solr/SolrTerminology

Pre-requisites

  • A working Hadoop cluster
  • Solr 4.10.2
  • Apache Tomcat 8.0.21
  • Apache Zookeeper 3.4.6

Architecture

The illustration below depicts how the various elements are distributed across the machines in the cluster. For our purposes, we have configured each machine to work as both a Hadoop data node and a SolrCloud node. However, these roles can be placed on separate machines, depending on the business need.

[Figure: 1.png, distribution of the Hadoop and SolrCloud elements across the cluster machines]

In this demonstration, we create a Solr collection called ‘whishworks-solr-collection’ with 3 shards and a replication factor of 2, which gives 6 shard replicas in total. As there are three machines in our SolrCloud cluster, each machine hosts a maximum of 2 of them.
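
The sizing above is simple arithmetic, and a small sketch makes it easy to redo for a different cluster. The variable names below are illustrative only and are not Solr settings; the values are the ones used in this article.

```shell
#!/bin/sh
# Illustrative sizing check; variable names are hypothetical.
NUM_SHARDS=3
REPLICATION_FACTOR=2
NUM_NODES=3

# Every shard is stored REPLICATION_FACTOR times, so this is the total number of cores.
TOTAL_CORES=$((NUM_SHARDS * REPLICATION_FACTOR))

# Round up so that an uneven split still fits on the cluster.
CORES_PER_NODE=$(( (TOTAL_CORES + NUM_NODES - 1) / NUM_NODES ))

echo "total cores: $TOTAL_CORES, cores per node: $CORES_PER_NODE"
```

The cores-per-node figure is the smallest value that can be passed as maxShardsPerNode for this layout.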

Steps

  1. Ensure Java is installed on each of the Solr node machines and that the JAVA_HOME and PATH variables point to the correct locations
  2. On one of the machines, download Solr 4.10.2 and extract the contents to the desired folder. This folder will be referred to as SOLR_HOME in subsequent steps. We chose Slave 1 as the machine on which Solr is downloaded and extracted.
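
The Java check in step 1 can be scripted so that it is easy to repeat on every node. The sketch below is only one way of doing it; the function name check_java_home is hypothetical and not part of Solr or Tomcat.

```shell
#!/bin/sh
# Hypothetical pre-flight check; the function name is illustrative.
# Succeeds only if JAVA_HOME points at a directory that contains an executable bin/java.
check_java_home() {
  if [ -z "${JAVA_HOME:-}" ]; then
    echo "JAVA_HOME is not set" >&2
    return 1
  fi
  if [ ! -x "$JAVA_HOME/bin/java" ]; then
    echo "no executable found at $JAVA_HOME/bin/java" >&2
    return 1
  fi
  echo "JAVA_HOME looks valid: $JAVA_HOME"
}

# Usage on a node: check_java_home && "$JAVA_HOME/bin/java" -version
```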

Upload configuration to Zookeeper

  1. For SolrCloud, configuration information is maintained centrally in Zookeeper. Hence, the configuration is modified on any one machine in the cluster and then uploaded to Zookeeper. Here again, we make the necessary changes on ‘Slave 1’ and upload them from there.

Note: As this article only covers setting up the SolrCloud cluster, only solrconfig.xml is considered. In a real deployment, schema.xml also has to be modified as per the schema requirements before the configuration is uploaded to Zookeeper. Modifications to schema.xml are not covered in this blog.

  1. In order to allow Solr to store index data in HDFS, modify the solrconfig.xml file present under the SOLR_HOME/example/solr/collection1/conf folder
  2. Search for the tag directoryFactory. Replace the complete element, i.e. everything from the opening tag

    <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}">

    up to and including the closing </directoryFactory>, with the snippet below

    <directoryFactory name="DirectoryFactory" class="solr.HdfsDirectoryFactory">
      <str name="solr.hdfs.home">hdfs://name_node:8020/solr_location</str>
      <bool name="solr.hdfs.blockcache.enabled">true</bool>
      <int name="solr.hdfs.blockcache.slab.count">1</int>
      <bool name="solr.hdfs.blockcache.direct.memory.allocation">true</bool>
      <int name="solr.hdfs.blockcache.blocksperbank">16384</int>
      <bool name="solr.hdfs.blockcache.read.enabled">true</bool>
      <bool name="solr.hdfs.blockcache.write.enabled">true</bool>
      <bool name="solr.hdfs.nrtcachingdirectory.enable">true</bool>
      <int name="solr.hdfs.nrtcachingdirectory.maxmergesizemb">16</int>
      <int name="solr.hdfs.nrtcachingdirectory.maxcachedmb">192</int>
    </directoryFactory>
  1. Replace name_node and solr_location with the appropriate values for the cluster. Once the cluster is set up, Solr index data will be stored in HDFS under solr_location
  2. Now, search for the tag lockType in solrconfig.xml and change its value to hdfs. After the modification, the tag should read as below:
<lockType>hdfs</lockType>
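
Before uploading, it can be worth confirming that both solrconfig.xml changes are actually in place. The following is a minimal sketch that assumes the file was edited exactly as described above; the function name check_hdfs_config is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: returns 0 only if the file selects HdfsDirectoryFactory
# and uses the hdfs lock type. -F makes grep treat the patterns as fixed strings.
check_hdfs_config() {
  conf_file="$1"
  grep -qF 'class="solr.HdfsDirectoryFactory"' "$conf_file" || return 1
  grep -qF '<lockType>hdfs</lockType>' "$conf_file" || return 1
  return 0
}

# Usage (path assumes the example layout used in this article):
# check_hdfs_config SOLR_HOME/example/solr/collection1/conf/solrconfig.xml && echo "HDFS settings present"
```
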
  1. In order to upload the configuration to Zookeeper, we use Solr's Zookeeper command-line tool (zkcli.sh) with the commands upconfig and linkconfig. Prior to this, we need to create a folder which contains all the libraries required by the tool while uploading the configuration. These libraries are put on the classpath when the commands are run.
  2. Create a temporary directory, referred to here as TEMP, and extract SOLR_HOME/dist/solr-4.10.2.war into it using the command below

unzip SOLR_HOME/dist/solr-4.10.2.war -d TEMP

  1. Copy all the jar libraries present under TEMP/WEB-INF/lib to a folder called SOLR_LIB, creating the folder first if it does not exist. The command for the same is

cp TEMP/WEB-INF/lib/* SOLR_LIB/

  1. Solr requires SLF4J and other logging libraries. Hence, we need to copy those libraries to SOLR_LIB as well, using the command below

cp SOLR_HOME/example/lib/ext/* SOLR_LIB/

  1. SOLR_LIB now contains all the libraries needed by zkcli.sh, so the configuration can be uploaded to Zookeeper.

Note: The complete conf folder that comes with the example server in the Solr installation is uploaded, although only solrconfig.xml has been modified for our purposes.

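  1. Navigate to the folder SOLR_HOME/example/scripts/cloud-scripts/ where the zkcli.sh script is present. The script is a thin wrapper around the Java class org.apache.solr.cloud.ZkCLI, which is invoked directly below so that the libraries in SOLR_LIB are used
  2. Run the below command to upload the configuration to Zookeeper. Note that the -zkhost value is a comma-separated list and must not contain spaces

java -classpath ".:SOLR_LIB/*" org.apache.solr.cloud.ZkCLI -cmd upconfig -zkhost slave1:2181,slave2:2181,slave3:2181 -confdir SOLR_HOME/example/solr/collection1/conf -confname whishworks_solr_conf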

  1. The uploaded configuration needs to be linked with the collection. Run the below command for the same.

java -classpath ".:SOLR_LIB/*" org.apache.solr.cloud.ZkCLI -cmd linkconfig -collection whishworks-solr-collection -confname whishworks_solr_conf -zkhost slave1:2181,slave2:2181,slave3:2181
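
The Zookeeper connection string passed with -zkhost is easy to get wrong by hand, since a stray space after a comma splits it into separate shell arguments. As a sketch, it can be assembled from the host names instead; the helper name build_zk_host is hypothetical, and the hosts and port are the ones used in this article.

```shell
#!/bin/sh
# Hypothetical helper: joins host names into a Zookeeper connection string
# of the form host1:port,host2:port with no whitespace.
build_zk_host() {
  port="$1"
  shift
  result=""
  for host in "$@"; do
    # Append a comma only between entries, never before the first one.
    result="${result:+$result,}$host:$port"
  done
  echo "$result"
}

ZK_HOST=$(build_zk_host 2181 slave1 slave2 slave3)
echo "$ZK_HOST"
```

The resulting value of ZK_HOST can then be passed as the -zkhost argument of both the upconfig and the linkconfig command.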

 Tomcat and solr.xml changes

The following changes and installations are to be made on ALL the Solr node machines. For simplicity, the changes below are described from the perspective of Slave 1. Repeat these steps for each machine in the Solr cluster.

  1. Download and install Tomcat. The Tomcat installation directory will be referred to as TOMCAT_HOME
  2. Copy solr-4.10.2.war from the SOLR_HOME/dist folder of the Slave 1 machine to TOMCAT_HOME/webapps on each of the cluster nodes and rename it to solr.war
  3. Copy the logger libraries from the SOLR_HOME/example/lib/ext/ folder of the Slave 1 machine to TOMCAT_HOME/lib. Additionally, it is advised to download apache-commons.jar and place it under the TOMCAT_HOME/lib folder
  4. Create setenv.sh under the TOMCAT_HOME/bin folder and add the below content to it
#!/bin/sh
# Replace YOUR_JAVA_HOME with the path of the Java installation
JAVA_HOME=YOUR_JAVA_HOME
JAVA_OPTS="$JAVA_OPTS -server"
JAVA_OPTS="$JAVA_OPTS -Xms128m -Xmx2048m"
JAVA_OPTS="$JAVA_OPTS -XX:PermSize=64m -XX:MaxPermSize=128m -XX:+UseG1GC"
# Keep SOLR_OPTS on one line; -Dhost is the name of the current node and -DzkHost must not contain spaces
SOLR_OPTS="-Dsolr.solr.home=SOLR_CORE_HOME -Dhost=slave1 -Dport=8080 -DhostContext=solr -DzkClientTimeout=20000 -DzkHost=slave1:2181,slave2:2181,slave3:2181"
JAVA_OPTS="$JAVA_OPTS $SOLR_OPTS"
  1. Create a folder, referred to as SOLR_CORE_HOME, which will contain a file called solr.xml
  2. Create the solr.xml file under SOLR_CORE_HOME with the following content

<?xml version="1.0" encoding="UTF-8" ?>
<solr>
  <solrcloud>
    <str name="host">${host:}</str>
    <int name="hostPort">${port:}</int>
    <str name="hostContext">${hostContext:}</str>
    <int name="zkClientTimeout">${zkClientTimeout:}</int>
    <bool name="genericCoreNodeNames">${genericCoreNodeNames:true}</bool>
  </solrcloud>
  <shardHandlerFactory name="shardHandlerFactory"
      class="HttpShardHandlerFactory">
    <int name="socketTimeout">${socketTimeout:0}</int>
    <int name="connTimeout">${connTimeout:0}</int>
  </shardHandlerFactory>
</solr>
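
Since setenv.sh has to be created on every node, the SOLR_OPTS line can be generated per machine rather than edited by hand. The sketch below assumes that each node advertises its own host name through -Dhost and that all other values stay as in this article; the function name solr_opts_for is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: prints the SOLR_OPTS value for one node.
# Only -Dhost changes between nodes; the other values are the ones used in this article.
solr_opts_for() {
  target="$1"
  echo "-Dsolr.solr.home=SOLR_CORE_HOME -Dhost=$target -Dport=8080 -DhostContext=solr -DzkClientTimeout=20000 -DzkHost=slave1:2181,slave2:2181,slave3:2181"
}

# Print the line to place in setenv.sh on each machine.
for node in slave1 slave2 slave3; do
  echo "$node: SOLR_OPTS=\"$(solr_opts_for "$node")\""
done
```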

SolrCloud – Collection and Shards

Now that the configuration changes have been made and Tomcat has been installed, start Tomcat on every node and then create the shards on Solr by calling the appropriate REST APIs

  1. Run the below command to create the collection and its shards, and to set the replication factor and the maximum number of shards per node

curl 'http://slave1:8080/solr/admin/collections?action=CREATE&name=whishworks-solr-collection&numShards=3&replicationFactor=2&maxShardsPerNode=2'

  1. For each of the shards, we need to create the replicas on the corresponding nodes. For our purposes, let us use the following setup

Shard 1 – Slave 1 & Slave 2

Shard 2 – Slave 2 & Slave 3

Shard 3 – Slave 1 & Slave 3

  1. Run the commands below in sequence so that the shards and their replicas are created on the intended Solr node machines

curl 'http://slave1:8080/solr/admin/cores?action=CREATE&name=shard1-replica-1&collection=whishworks-solr-collection&shard=shard1'

curl 'http://slave2:8080/solr/admin/cores?action=CREATE&name=shard1-replica-2&collection=whishworks-solr-collection&shard=shard1'

curl 'http://slave2:8080/solr/admin/cores?action=CREATE&name=shard2-replica-1&collection=whishworks-solr-collection&shard=shard2'

curl 'http://slave3:8080/solr/admin/cores?action=CREATE&name=shard2-replica-2&collection=whishworks-solr-collection&shard=shard2'

curl 'http://slave3:8080/solr/admin/cores?action=CREATE&name=shard3-replica-1&collection=whishworks-solr-collection&shard=shard3'

curl 'http://slave1:8080/solr/admin/cores?action=CREATE&name=shard3-replica-2&collection=whishworks-solr-collection&shard=shard3'
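
The six core-creation requests follow directly from the shard placement listed earlier, so they can also be generated from it. This is a sketch only: the PLACEMENT variable and the core_create_urls function are hypothetical names, and the script prints the URLs as a dry run rather than calling Solr.

```shell
#!/bin/sh
# Placement from the article: "<shard> <first node> <second node>" per line.
PLACEMENT="shard1 slave1 slave2
shard2 slave2 slave3
shard3 slave3 slave1"

COLLECTION="whishworks-solr-collection"

# Hypothetical helper: prints one CoreAdmin CREATE URL per replica.
core_create_urls() {
  echo "$PLACEMENT" | while read -r shard node1 node2; do
    replica=1
    for node in "$node1" "$node2"; do
      echo "http://$node:8080/solr/admin/cores?action=CREATE&name=$shard-replica-$replica&collection=$COLLECTION&shard=$shard"
      replica=$((replica + 1))
    done
  done
}

# Dry run: print the URLs without contacting the cluster.
core_create_urls
```

To execute the requests for real, each printed URL can be passed to curl in quotes, for example with core_create_urls | while read -r url; do curl "$url"; done.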

Once all the commands have been run, SolrCloud should be set up. To check that everything is working, open the URL http://slave1:8080/solr and click on Cloud in the side navigation bar. On the right, the SolrCloud cluster view is displayed with all the shards and their corresponding replicas.
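
The same check can be made from the command line. The sketch below assumes the CLUSTERSTATUS action of the Collections API, which, as far as we know, is available from Solr 4.8 onwards; the function name cluster_status_url is hypothetical and the script only prints the URL.

```shell
#!/bin/sh
# Hypothetical helper: builds the Collections API CLUSTERSTATUS URL for a node.
cluster_status_url() {
  echo "http://$1:8080/solr/admin/collections?action=CLUSTERSTATUS&wt=json"
}

# Dry run: print the URL.
# On a live cluster, fetch it with: curl "$(cluster_status_url slave1)"
cluster_status_url slave1
```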

