documentation updates for scaling the cluster

This commit is contained in:
mwells
2014-06-04 07:17:34 -07:00
parent 585e6a357f
commit 3fd973a53e
2 changed files with 92 additions and 30 deletions

@@ -30,9 +30,19 @@ A work-in-progress <a href=/compare.html>comparison to SOLR</a>.
<!--<a href=#weighting>Weighting Query Terms</a> - how to pass in your own query term weights<br><br>-->
<a href=#requirements>Hardware Requirements</a> - what is required to run gigablast<br><br><a href=#perf>Performance Specifications</a> - various statistics.
<a href=#requirements>Hardware Requirements</a> - what is required to run gigablast
<br>
<br>
<a href=#perf>Performance Specifications</a> - various statistics.
<br>
<br>
<a href=#scaling>Scaling the Cluster</a> - how to add more gb instances.
<br>
<br>
<!--<a href=#files>List of Files</a> - the necessary files to run Gigablast
<br>
<br>-->
@@ -248,6 +258,10 @@ Also, keep in mind not to use these weighting operators between another query op
At least one computer with 4GB RAM, 10GB of hard drive space and any distribution of Linux with the 2.4.25 kernel or higher. For decent performance invest in Intel Solid State Drives. I tested other brands around 2010 and found that they would freeze for up to 500ms every hour or so to do "garbage collection". That is generally unacceptable for a search engine.
Plus, Gigablast reads and writes a lot of data at the same time under heavy spider and query loads, so disk will probably be your MAJOR bottleneck.<br><br>
<br>
<a name=perf></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Performance Specifications</td></tr></table>
@@ -297,6 +311,36 @@ Each directory should have the following files and subdirectories:<br><br>
<br>
-->
<a name=scaling></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Scaling the Cluster</td></tr></table>
<br>
&lt;<i>Last Updated May 2014</i>&gt;
<br>
<br>
1. Turn off spidering in the <a href=/admin/master>master controls</a>.
<br>
2. Shut down the cluster by running a <b>gb stop</b> command on the command line OR by clicking "save & exit" in the <a href=/admin/master>master controls</a>.
<br>
3. Edit the hosts.conf file in the working directory to add the new hosts. (<a href=/hosts.conf.txt>sample hosts.conf</a>)
<br>
4. Ensure you can do passwordless ssh from host #0 to each new IP address you added. This generally requires running <b>ssh-keygen -t dsa</b> on host #0 to create the files <i>~/.ssh/id_dsa</i> and <i>~/.ssh/id_dsa.pub</i>. Then insert the key in <i>~/.ssh/id_dsa.pub</i> into the <i>~/.ssh/authorized_keys2</i> file on every host in your cluster, including host #0. Finally, do a <b>chmod 600 ~/.ssh/authorized_keys2</b> on each host; sshd rejects an authorized keys file that is writable by group or others, so passwordless ssh will not work without the right permissions. (See the example session after these steps.)
<br>
5. Run <b>gb install &lt;hostid&gt;</b> on host #0 for each new hostid to copy the required files from host #0 to the new hosts. This does an <i>scp</i>, which requires the passwordless ssh. &lt;hostid&gt; can also be a range of hostids like <i>5-12</i>.
<br>
6. Run <b>gb start</b> on the command line to start up all gb instances/processes in the cluster.
<br>
7. Click on <b>rebalance shards</b> in the <a href=/admin/master>master controls</a> to begin moving data from the old shards to the new shards. The <a href=/admin/hosts>hosts table</a> will let you know when the rebalance operation is complete. The cluster should still be able to serve queries during the rebalance, but spidering cannot resume until it has completed.
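<br>
<br>
For example, assuming two new hosts with hypothetical hostids 4 and 5 were just added to hosts.conf, the whole procedure run from host #0's working directory might look like the sketch below. Adjust the hostids, paths and key names to your own setup.
<br>
<pre>
# step 2: stop all gb instances in the cluster
./gb stop

# step 3: edit hosts.conf and add a line for each new host

# step 4: set up passwordless ssh from host #0 (one-time step)
ssh-keygen -t dsa
# append ~/.ssh/id_dsa.pub to ~/.ssh/authorized_keys2 on every host,
# then: chmod 600 ~/.ssh/authorized_keys2

# step 5: copy the required files from host #0 to the new hosts
./gb install 4-5

# step 6: restart all gb instances in the cluster
./gb start
</pre>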
<br>
<br>
<br>
<a name=clustermaint></a>
<table cellpadding=1 border=0 width=100% bgcolor=#0079ba>
<tr><td><center><b><font color=#ffffff size=+1>Cluster Maintenance</td></tr></table>

@@ -2,18 +2,17 @@
# Tells us what hosts are participating in the distributed search engine.
# the working directory for the gb process:
working-dir: /home/mwells/github/
0 5998 7000 8000 9000 127.0.0.1 127.0.0.1 /home/mwells/gigablast/
# How many mirrors do you want? If this is 0 then your data
# will NOT be replicated. If it is 1 then each host listed
# below will have one host that mirrors it, thereby decreasing
# total index capacity, but increasing redundancy. If this is
# 1 then the first half of hosts will be replicated by the
# second half of the hosts listed below.
# This is how many pieces you want the index split into.
# So if you have 64 machines, and you want a unique piece of index on
# each machine, then make this 64. But if you have 64 machines and you
# want one level of redundancy then make this 32.
index-splits: 1
num-mirrors: 0
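# For example, with eight hosts listed below and num-mirrors set to 1, the
# index is split into four shards: hosts 0-3 each hold a unique shard and
# hosts 4-7 hold mirror copies of those shards.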
@@ -31,9 +30,20 @@ index-splits: 1
# seventh column: like sixth column but for secondary ethernet port. (optional)
# By default just use the local host as the single host.
# The client DNS uses port 6000, https listens on 7000, http listens on port
# 8000 and the udp server listens on port 9000.
# This file consists of a list of lines like this:
#
# <ClientDnsPort> <HttpsPort> <HttpPort> <UdpPort> <IP1> <IP2> <Path>
#
# By default just use the local host as the single host as listed below.
#
# The client DNS uses port 5998, https listens on port 7000, http listens on
# port 8000 and the udp server listens on port 9000. We used to use port 6000
# for DNS listening but it seemed to have some issues. If your DNS keeps
# timing out, try a port other than 5998.
#
# If your server only has one IP then just repeat it as IP1 and IP2. You
# can also use an alphanumeric name from /etc/hosts in place of a direct
# IP address. (see example below)
#
# Use './gb N' to run the gb process as host #N where N is 0 to run as
# the first host in the list below.
@@ -45,20 +55,28 @@ index-splits: 1
# Use './gb kstart N' to run the Nth host in a bash keep-alive loop. So if it
# cores it will restart. It will send out an email alert if it restarts.
#
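# For example, './gb 0' runs this gb process as host #0 in the list below,
# and './gb kstart 0' runs host #0 inside the keep-alive loop.
#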
0 6000 7000 8000 9000 127.0.0.1
# The working directory is the last string on each line. That is where the
# 'gb' binary resides.
#
# Example of a four-node distributed search index:
#
# Example of a four-node distributed search index running on a single
# server with four cores. The working directories are /home/mwells/hostN/.
# The 'gb' binary resides in the working directories. We have to use
# different ports for each gb instance since they are all on the same
# server.
#
# Use './gb 2' to run as the host on IP 1.2.3.8 for example.
#
#0 6000 7000 8000 9000 1.2.3.4 1.2.3.5
#1 6000 7000 8000 9000 1.2.3.6 1.2.3.7
#2 6000 7000 8000 9000 1.2.3.8 1.2.3.9
#3 6000 7000 8000 9000 1.2.3.10 1.2.3.11
#0 5998 7000 8000 9000 1.2.3.4 1.2.3.5 /home/mwells/host0/
#1 5997 7001 8001 9001 1.2.3.4 1.2.3.5 /home/mwells/host1/
#2 5996 7002 8002 9002 1.2.3.4 1.2.3.5 /home/mwells/host2/
#3 5995 7003 8003 9003 1.2.3.4 1.2.3.5 /home/mwells/host3/
# A four-node cluster on four different servers:
#0 5998 7000 8000 9000 1.2.3.4 1.2.3.5 /home/mwells/gigablast/
#1 5998 7000 8000 9000 1.2.3.6 1.2.3.7 /home/mwells/gigablast/
#2 5998 7000 8000 9000 1.2.3.8 1.2.3.9 /home/mwells/gigablast/
#3 5998 7000 8000 9000 1.2.3.10 1.2.3.11 /home/mwells/gigablast/
#
@@ -66,14 +84,14 @@ index-splits: 1
# Each line represents a single gb process with dual ethernet ports
# whose IP addresses are in /etc/hosts under se0, se0b, se1, se1b, ...
#
#0 6000 7000 8000 9000 se0 se0b
#1 6000 7000 8000 9000 se1 se1b
#2 6000 7000 8000 9000 se2 se2b
#3 6000 7000 8000 9000 se3 se3b
#4 6000 7000 8000 9000 se4 se4b
#5 6000 7000 8000 9000 se5 se5b
#6 6000 7000 8000 9000 se6 se6b
#7 6000 7000 8000 9000 se7 se7b
#0 5998 7000 8000 9000 se0 se0b /home/mwells/gigablast/
#1 5998 7000 8000 9000 se1 se1b /home/mwells/gigablast/
#2 5998 7000 8000 9000 se2 se2b /home/mwells/gigablast/
#3 5998 7000 8000 9000 se3 se3b /home/mwells/gigablast/
#4 5998 7000 8000 9000 se4 se4b /home/mwells/gigablast/
#5 5998 7000 8000 9000 se5 se5b /home/mwells/gigablast/
#6 5998 7000 8000 9000 se6 se6b /home/mwells/gigablast/
#7 5998 7000 8000 9000 se7 se7b /home/mwells/gigablast/
# Proxies