"To get search results from Gigablast use a url like: <b><a href=\"/search?q=test&sc=0&dr=0&raw=8&topics=20+100\">http://www.gigablast.com/search?q=test&sc=0&dr=0&raw=8&topics=20+100</a></b> where:<br>"
""
"<br>"
"<table cellpadding=4>\n"
"\n"
"<tr><td bgcolor=#eeeeee>n=X</b></td>"
"<td bgcolor=#eeeeee>returns X search results. Default is 10. Max is 50.</td></tr>"
"\n"
"<tr><td>s=X</b></td>\n"
"<td>returns results starting at result #X. The first result is result #0. Default is 0. Max is 499.</td></tr>\n"
"\n"
"<tr><td bgcolor=#eeeeee>ns=X</b></td>"
"<td bgcolor=#eeeeee>returns X <b>summary excerpts</b> in the summary of each search result. Default is defined on a per collection basis in the <a href=\"/admin\">Display Controls</a>.</td></tr>"
"\n"
"<tr><td>site=X</b></td>\n"
"<td>returned results will have URLs from the site, X.</td></tr>\n"
"\n"
"<tr><td bgcolor=#eeeeee>plus=X</b></td>"
"<td bgcolor=#eeeeee>returned results will have all words in X. Like a default AND.</td></tr>"
"\n"
"<tr><td>minus=X</b></td>\n"
"<td>returned results will not have any words in X.</td></tr>\n"
"\n"
"<tr><td bgcolor=#eeeeee>rat=1</b></td>"
"<td bgcolor=#eeeeee>returned results will have ALL query terms. This is also known as a <i>default and</i> search. <i>rat</i> means Require All Terms. </td></tr>"
"\n"
"<tr><td>sc=X</b></td>\n"
"<td>X can be 0 or 1 to respectively disable or enable <a href=#siteclustering><b>site clustering</b></a>. Default is 1, but 0 if the <i>raw</i> parameter is used.</td></tr>\n"
"\n"
"<tr><td bgcolor=#eeeeee>dr=X</b></td>"
"<td bgcolor=#eeeeee>X can be 0 or 1 to respectively disable or enable <a href=#dupremoval><b>duplicate result removal</b></a>. Default is 1, but 0 if the <i>raw</i> parameter is used.</td></tr>"
"\n"
"<tr><td>raw=X</b></td>\n"
"<td>X ranges from 0 to 8 to specify the format of the search results. raw=8 requests the <b>XML feed</b>.</td></tr>\n"
"\n"
"<tr><td>raw=2</b></td>"
"<td>Just display a list of docids between <pre> tags. Will display one extra docid than requested if possible, so you know if you have more docids available or not. Does not have to generate summaries so it is a bit faster, especially if you do not perform <a href=\"#siteclustering\">site clustering</a> or <a href=\"#dupremoval\">dup removal</a>.</td></tr>"
""
"<tr><td bgcolor=#eeeeee>qh=X</b></td>"
"<td bgcolor=#eeeeee>X can be 0 or 1 to respectively disable or enable <b>highlighting</b> of query terms in the titles and summaries. Default is 1, but 0 if the <i>raw</i> parameter is used.</td></tr>"
"\n"
"<tr><td>usecache=X</b></td>\n"
"<td>X can be 0 or 1 to respectively disable or enable <b>caching</b> of the search results pages. Default is 1.</td></tr>\n"
"\n"
"<tr><td bgcolor=#eeeeee>rcache=X</b></td>"
"<td bgcolor=#eeeeee>X can be 0 or 1 to respectively disable or enable reading from the search results page cache. Default is 1.</td></tr>"
"\n"
"<tr><td>wcache=X</b></td>\n"
"<td>X can be 0 or 1 to respectively disable or enable writing to the search results page cache. Default is 1.</td></tr>\n"
"\n"
"<tr><td bgcolor=#eeeeee>bq=X</b></td>"
"<td bgcolor=#eeeeee>X can be 0 or 1 or 2. 0 means the query is NOT boolean, 1 means the query is boolean and 2 means to auto-detect. Default is 2.</td></tr>"
"\n"
"<a name=rt></a>\n"
"<tr><td>rt=X</b></td>\n"
"<td>X can be 0 or 1 to respectively disable or enable <b>real time searches</b>. If enabled, query response time will suffer because Gigablast will have to read from multiple files, usually 3 or 4, of varying ages, to satisfy a query. Default value of rt is 1, but 0 if the <i>raw</i> paramter is used.</td></tr>\n"
"\n"
"<a name=dt></a>\n"
"<tr><td bgcolor=#eeeeee>dt=X</b></td>"
"<td bgcolor=#eeeeee>X is a space-separated string of <b>meta tag names</b>. Do not forget to url-encode the spaces to +'s or %%20's. Gigablast will extract the contents of these specified meta tags out of the pages listed in the search results and display that content after each summary. i.e. <i>&dt=description</i> will display the meta description of each search result. <i>&dt=description:32+keywords:64</i> will display the meta description and meta keywords of each search result and limit the fields to 32 and 64 characters respectively. When receiving the XML feed from gigablast, the <i><display name=\"meta_tag_name\">meta_tag_content</display></i> XML tag will be used to convey each requested meta tag's content.</td></tr>\n"
"\n"
"<a name=spell></a>\n"
"<tr><td>spell=X</b></td>\n"
"<td>X can be 0 or 1 to respectively disable or enable <b>spell checking</b>. If enabled while using the XML feed, when Gigablast finds a spelling recommendation it will be included in the XML <spell> tag. Default is 0 if using an XML feed, 1 otherwise.</td></tr>\n"
"<b>NUM</b> is how many <b>related topics</b> you want returned. \n"
"<br><br>\n"
"<b>MAX</b> is the maximum number of topics to generate and store in cache, so if TW is increased, but still below MT, it will result in a fast cache hit.\n"
"<br><br>\n"
"<b>SCAN</b> is how many documents to scan for related topics. If this is 30, for example, then Gigablast will scan the first 30 search results for related topics.\n"
"<br><br>\n"
"<b>MIN</b> is the minimum score of returned topics. Ranges from 0%% to over 100%%. 50%% is considered pretty good. BUG: This must be at least 1 to get any topics back.\n"
"<br><br>\n"
"<b>MAXW</b> is the maximum number of words per topic.\n"
"<br><br>\n"
"<b>META</b> is the meta tag name to which Gigablast will restrict the content used to generate the topics. Do not specify thie field to restrict the content to the body of each document, that is the default.\n"
"<br><br>\n"
"\n"
"<b>DEL</b> is a single character delimeter which defines the topic candidates. All candidates must be separated from the other candidates with the delimeter. So <meta name=test content=\" cat dog ; pig rabbit horse\"> when using the ; as a delimeter would only have two topic candidates: \"cat dog\" and \"pig rabbit horse\". If no delimeter is provided, default funcationality is assumed.\n"
"<br><br>\n"
""
"<b>IDF</b> is 1, the default, if you want Gigablast to weight topic candidates by their idf, 0 otherwise."
"<br><br>\n"
""
"<b>DEDUP</b> is 1, the default, if the topics should be deduped. This involves removing topics that are substrings or superstrings of other higher-scoring topics."
"<br><br>\n"
""
""
"Example: topics=49+100+30+1+6+author+%%3B+0+0"
"<br><br>\n"
"The default values for those parameters with unspecifed defaults can be defined on the \"Search Controls\" page. "
"<br><br>\n"
""
"XML feeds will contain the generated topics like: <topic><name><![CDATA[some topic]]></name><score>13</score><from>metaTagName</from></topic>"
"<br><br>\n"
"Even though somewhat nonstandard, you can specify multiple <i>&topic=</i> parameters to get back multiple topic groups."
"<br><br>\n"
"Performance will decrease if you increase the MAX, SCAN or MAXW."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td align=top>rdc=X</b></td>\n"
"<td>\n"
"<a name=rdc></a>\n"
"X is 1 if you want Gigablast to return the number of documents that "
"contained each topic."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td align=top bgcolor=#eeeeee>rd=X</b></td>\n"
"<td bgcolor=#eeeeee>\n"
"<a name=rd></a>\n"
"X is 1 if you want Gigablast to return the list of docIds that "
"contained each topic."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td align=top>rp=X</b></td>\n"
"<td>\n"
"<a name=rd></a>\n"
"X is 1 if you want Gigablast to return the popularity of each topic."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td align=top bgcolor=#eeeeee>mdc=X</b></td>\n"
"<td bgcolor=#eeeeee>\n"
"<a name=rd></a>\n"
"Gigablast will not display topics that are not contained in at least X "
"documents. The default is configurable in the Search Controls page on a per "
"collection basis."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td align=top>t0=X</b></td>\n"
"<td>\n"
"<a name=exact></a>\n"
"Gigablast will use at least X docids from each termlist. Used to get more accurate hit counts."
"<br><br>\n"
"For performance reasons, most large search engines nowadays only return a rough estimate of the number of search results, but you may desire to get a better approximation or even an exact count. Gigablast allows you to do this, but it may be at the expense of query resonse time."
"By using the <b>t0</b> variable you can tell Gigablast to use a minimum number of docids from each termlist. Typically, <b>t0</b> defaults to something of around 10,000 docids. Often more docids than that are used, but this is just the minimum. So if Gigablast is forced to use more docids it will take longer to compute the search results on average, but it will give you a more precise hit count. By setting <b>t0</b> to the truncation limit or higher you will max out the hit count precision."
"It is often undesirable to have many results listed from the same site. Site Clustering will essentially limit the number returned results from any given site to two, but it will provide a link which says \"more results from this site\" in case the searcher wishes it.\n"
"When dup results removal is enabled Gigablast will remove results that have the same content as other results. Right now the comparison is very strict, but will be somewhat relaxed in the future.\n"
""
"<br><br>\n"
"<b><a name=dupremoval>Cached Web Page Parameters</a></b> "
"<br><br>\n"
"To get a cached web page from Gigablast use a url like: <b>http://www.gigablast.com/get?d=12345&ih=1&q=my+query</b> where: <br>\n"
""
"<br>\n"
"<table cellpadding=4>\n"
""
"<tr><td bgcolor=#eeeeee>d=X</b></td>\n"
"<td bgcolor=#eeeeee>X is the docId of the page you want returned. DocIds are 64-bit, so you'll need 8 bytes to hold one.</td></tr>\n"
""
"<tr><td>ih=X</b></td>\n"
"<td>X is 1 to include the Gigablast header in the returned page, and 0 to exclude it.</td></tr>\n"
""
"<tr><td bgcolor=#eeeeee>ibh=X</b></td>\n"
"<td bgcolor=#eeeeee>X is 1 to include the Gigablast BASE HREF tag in the cached page. The default is 1.</td></tr>\n"
""
"<tr><td>q=X</b></td>\n"
"<td>X is the the query that, when present, will cause Gigablast to highlight the query terms on the returned page.</td></tr>\n"
""
"<tr><td bgcolor=#eeeeee>cas=X</b></td>\n"
"<td bgcolor=#eeeeee>"
"X can be 0 or 1 to respectively disable or enable click and scroll. Default is 1.</td></tr>\n"
""
"<tr><td>strip=X</b></td>\n"
"<td>"
"X can be 0, 1 or 2. If X is 0 then no stripping is performed. If X is 1 then image and other tags are removed. An X of 2 is another form of removing tags. Default is 0.</td></tr>\n"
"<tr><td><center><b><font color=#ffffff size=+1>The XML Feed\n"
"</td></tr></table>\n"
"<br><br>\n"
"Gigablast allows you to receive the search results in a number of formats useful for interfacing to your program. By specify a \"raw=8\" as a cgi parameter you can receive the results in XML. Here is an <b><a href=/example.xml>example</a></b> of the raw=8 feed.\n"
"Additionally, raw=9 may be used to obtain the feed encoded in UTF-8.\n"
"<br><br>\n"
"The XML reply has the following format (but without the comments):\n"
"<br><br>\n"
"<pre>\n"
"# The XML reply uses the Latin-1 Character Set (ISO 8859-1) when using raw=8\n"
"Gigablast allows you to pass in weights for each term in the provided query. The query term weight operator, which is directly inserted into the query, takes the form: <b>[XY]</b>, where <i>X</i> is the weight you want to apply and <i>Y</i> is <b><i>a</i></b> if you want to make it an absolute weight or <b><i>r</i></b> for a relative weight. Absolute weights cancel any weights that Gigablast may place on the query term, like weights due to the term's popularity, for instance. The relative weight, on the other hand, is multiplied by any weight Gigablast may have already assigned."
"<br><br>\n"
"The query term weight operator will affect all query terms that follow it. To turn off the effects of the operator just use the blank operator, <b>[]</b>. Any weight operators you apply override any previous weight operators."
"<br><br>\n"
"The weight applied to a phrase is unaffected by the weights applied to its constituent terms. In order to weight a phrase you must use the <b>[XYp]</b> operator. To turn off the affects of a phrase weight operator, use the phrase blank operator, <b>[p]</b>."
"<br><br>\n"
"Applying a relative weight of 0 to a query term, like <b>[0r]</b>, has the effect of still requiring the term in the search results (if it was not ignored), but not allowing it to contribute to the ranking of the search results. However, when doing a default OR search, if a document contains two such terms, it will rank above a document that only contains one such term. "
"<br><br>\n"
"Applying an absolute weight of 0 to a query term, like <b>[0a]</b>, causes it to be completely ignored and not used for generating the search results at all. But such ignored or devalued query terms may still be considered in a phrase context. To affect the phrases in a similar manner, use the phrase operators, <b>[0rp]</b> and <b>[0ap]</b>."
"<br><br>\n"
"Example queries:"
"<br><br>\n"
"<b>[10r]happy [5rp][13r]day []lucky</b><br>\n"
"<i>happy</i> is weighted 10 times it's normal weight.<br>\n"
"<i>day</i> is weighted 13 times it's normal weight.<br>\n"
"<i>\"day lucky\"</i>, the phrase, is weighted 5 times it's normal weight.<br>\n"
"<i>lucky</i> is given it's normal weight assigned by Gigablast."
"<br><br>\n"
"Also, keep in mind not to use these weighting operators between another query operator, like '+', and its affecting query term. If you do, the '+' or '-' operator will not work."
"Create one directory for every Gigablast process you would like to run. Each Gigablast process is also called a <i>host</i>. "
"<br><br>\n"
""
"<b>2.</b>\n"
"Populate each directory with the following files and subdirectories:"
"<br><br>\n"
"<dd>\n"
"<table cellpadding=3>\n"
"<tr><td><b>gb</b></td><td>The Gigablast executable. Contains the web server, the database and the spider. This file is required to run gb.</td></tr>\n"
"<tr><td><b><a href=#hosts>hosts.conf</a></b></td><td>This file describes each host (gb process) in the Gigablast network. Every gb process uses the same hosts.conf file. This file is required to run gb."
""
"<tr><td><b><a href=#config>gb.conf</a></b></td><td>Each gb process is called a <i>host</i> and each gb process has its own gb.conf file. This file is required to run gb."
"<tr><td><b>coll.XXX.YYY/</b></td><td>For every collection there is a subdirectory of this form, where XXX is the name of the collection and YYY is the collection's unique id. Contained in each of these subdirectories is the data associated with that collection.</td></tr>"
"<tr><td><b>coll.XXX.YYY/coll.conf</b></td><td>Each collection contains a configuration file called coll.conf. This file allows you to configure collection specific parameters. Every parameter in this file is also controllable via your the administrative web pages as well.</td></tr>"
"<tr><td><b>trash/</b></td><td>Deleted collections are moved into this subdirectory. A timestamp in milliseconds since the epoch is appended to the name of the deleted collection's subdirectory after it is moved into the trash sub directory. Gigablast doesn't physically delete collections in case it was a mistake.</td></tr>"
"<tr><td><b><a href=#ruleset>tagdbN.xml</a></b></td><td>Several files where N is an integer. The files must be contiguous, starting with an N of 0. Each one of these files is a <a href=#ruleset>ruleset</a> file. This file is required for indexing and deleting documents.</tr>\n"
"<tr><td><b>html/</b></td><td>A subdirectory that holds all the html files and images used by Gigablast. Includes Logos and help files.</tr>\n"
"<tr><td><b>dict/</b></td><td>A subdirectory that holds files used by the spell checker and the GigaBits generator. Each file in dict/ holds all the words and phrases starting with a particular letter. The words and phrases in each file are sorted by a popularity score.</tr>\n"
"<tr><td><b>antiword</b></td><td>Executable called by gbfilter to convert Microsoft Word files to html for indexing.</tr>\n"
"<tr><td><b>.antiword/</b></td><td>A subdirectory that contains information needed by antiword.</tr>\n"
"<tr><td><b>pdftohtml</b></td><td>Executable called by gbfilter to convert PDF files to html for indexing.</tr>\n"
"<tr><td><b>pstotext</b></td><td>Executable called by gbfilter to convert PostScript files to text for indexing.</tr>\n"
"<tr><td><b>ppthtml</b></td><td>Executable called by gbfilter to convert PowerPoint files to html for indexing.</tr>\n"
"<tr><td><b>xlhtml</b></td><td>Executable called by gbfilter to convert Microsoft Excel files to html for indexing.</tr>\n"
"<tr><td><b>gbfilter</b></td><td>Simple executable called by Gigablast with document HTTP MIME header and document content as input. Output is an HTTP MIME and html or text that can be indexed by Gigablast.</tr>\n"
"<tr><td><b><a href=#gbstart>gbstart</a></b></td><td>An optional simple script used to start up the gb process(es) on each computer in the network. Otherwise, iff you have passwordless ssh capability then you can just use './gb start' and it will spawn an ssh command to start up a gb process for each host listed in hosts.conf.</tr>\n"
"</table>\n"
"<br><br>\n"
""
"<b>2.</b> "
"Edit or create the <a href=#hosts>hosts.conf</a> file."
"<br><br>\n"
""
"<b>3.</b> "
"Edit or create the <a href=#config>gb.conf</a> file."
"<br><br>\n"
/*""
"<b>4.</b> "
"Edit or create the <a href=#gbstart>gbstart</a> shell script on each participating computer so it will run all the required gb processes on that computer."
"<br><br>\n"
""
"<b>5.</b> "
"Execute the <a href=#gbstart>gbstart</a> shell script on each participating computer."
"<br><br>\n"
*/
""
"<b>4.</b> "
"Direct your browser to any host listed in the <a href=#hosts>hosts.conf</a> file to begin administration."
"For the purposes of this section, we assume the name of the cluster "
"is gf and all hosts in the cluster "
"are named gf*. The Master host of the cluster is gf0. The "
"gigablast working directory is assumed to be /a/ and the /etc/dsh/machines.list file contains only the machine names in the cluster."
"<br><br>"
"<b>To perform operations on all machines on the network:</b>"
"<ul>"
"<li>To setup dsh:"
"<ul>"
"<li> Install the dsh package, on debian it would be:<br>"
" <b> $ apt-get install dsh</b><br>"
"<li> Add the names of all of the machines in the cluster to /etc/dsh/machines.list (newline separated, but does not end in a new line)<br>"
"</ul>"
"<b>To setup dsh on a machine on which we do not have root:</b>"
"<ul>\n"
"<li>cd to the working directory\n"
"<li>Copy /usr/lib/libdshconfig.so.1.0.0 to the working directory.\n"
"<li><b>export LD_PATH=.\n</b>"
"<li>run <b>dsh -r rcp -f filename hostname</b> as a test. Use scp if rcp not available. filename is the file that contains the hostnames to dsh to.\n"
"</ul>\n"
"<li>To use the dsh command."
"<ul>"
"<li>to copy a master configuration file to all hosts:<br>\n"
" <b>$ dsh -a 'scp gf0:/a/coll.conf /a/coll.conf'</b><br>\n"
"<li>to check running processes on all machines concurrently (-c option):<br>\n"
" <b>$ dsh -ac 'ps auxww'</b><br>\n"
"</ul>"
"</ul>"
"<b>To prepare a new cluster or erase an old cluster:</b>"
"<ul>\n"
"<li>Save <b>/a/gb.conf</b>, <b>/a/hosts.conf</b>, and <b>/a/coll.*.*/coll.conf</b> files somewhere besides on /dev/md0 if they exist and you want to keep them.\n"
"<li>cd to a directory not on /dev/md0\n"
"<li>Login as root using <b>su</b>\n"
"<li>Use <b>dsh -ac 'umount /dev/md0'</b> to unmount the working directory. All login shells must exit or cd to a different directory, and all processes with files opened in /dev/md0 must exit for the unmount to work.\n"
"<li>Use <b>dsh -ac 'umount /dev/md0'</b> to unmount the working directory.\n"
"<li>Use <b>dsh -ac 'mke2fs -b4096 -m0 -N20000 -R stride=32 /dev/md0'</b> to revuild the filesystem on the raid. CAUTION!!! WARNING!! THIS COMPLETELY ERASES ALL DATA ON /dev/md0\n"
"<li>Use <b>dsh -ac 'mount /dev/md0'</b> to remount it.\n"
"<li>Use <b>dsh -ac 'mkdir /mnt/raid/a ; chown mwells:mwells /mnt/raid/a</b> to create the 'a' directory and let user mwells, or other search engine administrator username, own it.\n"
"<li>Recopy over the necessary gb files to every machine.\n"
"<li>\n"
"</ul>\n"
"<br>\n"
"<b>To test a new gigablast executable:</b>"
"<ul>\n"
"<li>Change to the gigablast working directory.<br>"
" <b>$ cd /a</b>"
"<li>Run gb stop on the running gb executable on gf0.<br>"
" <b>$ gb stop</b>"
"<li>Wait until all hosts have stopped and saved their data. "
"(the following line should not print anything)<br>"
" <b>$ dsh -a 'ps auxww' | grep gb</b>"
"<li>Move the current gb executable to gb.SAVED.<br>"
" <b>$ mv gb gb.SAVED </b>"
"<li>Copy the new executable onto gf0<br>"
" <b>$ scp gb user@gf0:/a/</b>"
"<li>Install the executable on all machines.<br>"
" <b>$ gb installgb</b><br>"
"<li>This will copy the gb executable to all hosts. You"
" must wait until all of the scp processes have completed"
" before starting the gb process. Run ps to verify that all of"
" the scp processes have finished.<br>"
" <b>$ ps auxww</b>"
"<li>Run gb start<br>"
" <b>$ gb start </b>"
"<li>As soon as all of the hosts have started, you can use the "
"web interface to gigablast.<br>"
"</ul>\n"
"<b>To switch the live cluster from the current (cluster1) to another"
" (cluster2):</b>"
"<ul>\n"
"<li>Ensure that the gb.conf of cluster2 matches that of cluster1,"
" excluding any desired changes.<br>"
"<li>Ensure that the coll.conf for each collection on cluster2 matches those"
" of cluster1, excluding any desired changes.<br>"
"<li>Thoroughly test cluster2 using the blaster program.<br>"
"<li>Test duplicate queries between cluster1 and cluster2 and ensure results"
" properly match, with the exception of any known new changes.<br>"
"<li>Make sure port 80 on cluster2 is directing to the correct port for gb.<br>"
"<b>A host in the network crashed. How do I temporarily decrease query latency on the network until I get it up again?</b><br>"
"You can go to the <i>Search Controls</i> page and cut all nine tier sizes in half. This will reduce search result recall, but should cut query latency times in half for slower queries until the crashed host is recovered. "
"<br><br>"
"<b>A host in the network crashed. What is the recovery procedure?</b><br>"
"First determine if the host's crash was clean or unclean. It was clean "
"if the host was able to save all data in memory before it crashed. If the "
"log ended with <i>allExit: dumping core after saving</i> then the crash "
"was clean, otherwise it was not."
"<br><br>"
"If the crash was clean then you can simply restart the crashed host by typing"
" <b>gb start <i>i</i></b> where <i>i</i> is the hostId of the crashed host. "
"However, if the crash was not clean, like in the case of a sudden power "
"outtage, then in order to ensure no data gets lost, you must copy the data "
"of the crashed host's twin. "
"If it does not have a twin then there may be some data loss and/or "
"corruption. In that case try reading the section below, <i>How do I minimize "
"the damage after an unclean crash with no twin?</i>, but you may be better "
"off starting the index build from "
"scratch. To recover from an unclean crash using the twin, follow the steps "
"below: "
"<br><br>"
"a. Click on 'all spiders off' in the 'master controls' of host #0, or "
"host #1 if host #0 was the host that crashed.<br>"
"b. If you were injecting content directly into Gigablast, stop.<br>"
"c. Click on 'all just save' in the 'master controls' of host #0 or host #1 "
"if host #0 was the one that crashed.<br>"
"d. Determine the twin of the crashed host by looking in the "
"<a href=\"#hosts\">hosts.conf</a> file. "
"The twin will have the same group number as the crashed host.<br>"
"e. Recursively copy the working directory of the twin to a spare host "
" using rcp since it is much faster than scp.<br>"
"f. Restart the crashed host by typing <b>gb start <i>i</i></b> where "
"<i>i</i> is the hostId of the crashed host. If it is not restartable, then "
"skip this step.<br>"
"g. If the crashed host was restarted, wait for it to come back up. Monitor "
"another host's <i>hosts</i> table to see when it is up, or watch the log of "
"the crashed host.<br>"
"h. If the crashed host was restarted, wait a minute for it to absorb all "
"of the data add requests that may still be lingering. Wait for all hosts' "
"<i>spider queues</i> of urls currently being spidered to be empty of "
"urls.<br>"
"i. Perform another <i>all just save</i> command to relegate any new data "
"to disk.<br>"
"j. After the copy completes edit the hosts.conf on host #0 and replace the "
"ip address of the crashed host with that of the spare host.<br>"
"k. Do a <b>gb stop</b> to safely shut down all hosts in the network.<br>"
"l. Do a <b>gb installconf</b> to propagate the hosts.conf file from host #0 "
"to all other hosts in the network (including the spare host, but not the "
"crashed host)<br>"
"m. Do a <b>gb start</b> to bring up all hosts under the new hosts.conf file."
"<br>"
"n. Monitor all logs for a little bit by doing <i>dsh -ac 'tail -f /a/log? "
"/a/log\?\?'</i><br>"
"o. Check the <i>hosts</i> table to ensure all hosts are up and running.<br>"
"<br><br>"
"<b>How do I minimize the damage after an unclean crash with no twin?</b><br>"
"You may never be able to get the index 100%% back into shape right now, but "
"in the near future there may be some technology that allows gigablast to "
"easily recover from these situations. For now though, "
"2. Try to determine the last url that was indexed and *fully* saved to disk. "
"Every time you index a url some data is added to all of these databases: "
"checksumdb, indexdb, spiderdb, titledb and tfndb. These databases all have "
"in-memory data that is periodically dumped to disk. So you must determine "
"the last time each of these databases dumped to disk by looking at the "
"timestamp on the corresponding files in the appropriate collection "
"subdirectories contained in the working directory. If tfndb was "
"means your disk is slowing everything down. Make sure that if you are doing "
"realtime queries that you do not have too many big indexdb files. If you "
"tight merge everything it should fix that problem. Otherwise, consider "
"getting a raid level 0 and faster disks. Perhaps the filesystem is "
"severly fragmented."
"Or maybe your query traffic is repetetive. If the queries are sorted "
"alphabetically, or you have many duplicate queries, then most of the "
"workload might be falling on one particular host in the network, thus "
"bottle-necking everything."
"<br><br>"
"<b>I get different results for the XML feed "
"(raw=X) as compared to the HTML feed. "
"What is going on?</b><br>"
" Try adding the &rt=1 cgi parameter to the "
"search string to tell Gigablast to return real "
"time results."
"rt is set to 0 by default for the XML feed, but "
"not for the HTML feed. That means Gigablast will "
"only look at the root indexdb file when looking up "
"queries. Any newly added pages will be indexed "
"outside of the root file until a merge is done. "
"This is done for performance reasons. You can enable "
"real time look ups by adding &rt=1 to the search "
"string. Also, in your search controls there are "
"options to enable or disable real time lookups for "
"regular queries and XML feeds, labeled as \"restrict "
"indexdb for queries\" and \"restrict indexdb for "
"xml feed\". Make sure both regular queries and "
"xml queries are doing the same thing when comparing "
"results."
"<br>"
"<br>"
"Also, you need to look at the tier sizes at the "
"top of the Search Controls page. The tier sizes "
" (tierStage0, tierStage1, ...) listed for the "
"raw (XML feed) queries needs to match non-raw "
"in order to get exactly the same results. Smaller "
"tier sizes yield better performance but yield "
"less search results."
"<br><br>"
"<b>The spider is on but no urls are showing up in the Spider Queue table "
"as being spidered. What is wrong?</b><br>"
"<table width=100%%>"
"<tr><td>1. Set <i>log spidered urls</i> to YES on the <i>log</i> page. Then "
"check the log to see if something is being logged."
"</td></tr>"
"<tr><td>2. Check the <i>master controls</i> page for the following:<br>"
" a. the <i>spider enabled</i> switch is set to YES.<br>"
" b. the <i>spider max pages per second</i> control is set "
"high enough.<br>"
" c. the <i>spider max kbps</i> control is set high enough.</td></tr>"
"</td></tr>"
"<tr><td>3. Check the <i>spider controls</i> page for the following:<br>"
" a. the collection you wish to spider for is selected (in red).<br>"
" a. the <i>old</i> or <i>new spidering</i> is set to YES.<br>"
" b. the appropriate <i>old</i> and <i>new spider priority</i> "
"checkboxes are checked.<br>"
" c. the <i>spider start</i> and <i>end times</i> are set "
"appropriately.<br>"
" d. the <i>use current time</i> control is set correctly.<br>"
" e. the <i>spider max pages per second</i> control is set "
"high enough.<br>"
" f. the <i>spider max kbps</i> control is set high enough.</td></tr>"
"<tr><td>3. If you have urls from only a few domains then the <i>same domain "
"wait</i> or <i>same ip wait</i> controls could be limiting the spidering "
"of the urls such that you do not see any in the Spider Queue table. If the "
"indexed document count on the home page is increasing then this may be the "
"case. Even if the count is not increasing, it may still be the case if the "
"documents all have errors, like 404 not found.</td></tr>"
"<tr><td>"
"4. Make sure you have urls to spider by running 'gb dump s <collname>' "
"on the command line to dump out spiderdb. See 'gb -h' for the help menu and "
"more options.</td></tr>"
"</table>"
"<br><br>"
"<b>The spider is slow.</b><br>"
"<table width=100%%>"
"<tr><td>In the current spider queue, what are the statuses of each url? If "
"they are mostly \"getting cached web page\" and the IP address column is "
"mostly empty, then Gigablast may be bogged down looking up the cached web "
"pages of each url in the spider queue only to discover it is from a domain "
"that was just spidered. This is a wasted lookup, and it can bog things down "
"pretty quickly when you are spidering a lot of old urls from the same "
"domain. "
"Try setting <i>same domain wait</i> and <i>same ip wait</i> both to 0. This "
"will pound those domain's server, though, so be careful. Maybe set it to "
"1000ms or so instead. We plan to fix this in the future."
"</td></tr>"
"</table>"
"<br><br>"
"<b>The spider is always bottlenecking on <i>adding links</i>.</b><br>\n"
"<table width=100%%>\n"
"<tr><td>Try increasing the <tfnbMaxPageCacheMem> in the gb.conf for all hosts in the cluster to minimize the <i>disk seeks</i> into tfndb as seen on the Stats page. Stop all gb processes then use <i>./gb installconf</i> to distribute the gb.conf to all hosts in the cluster. You migh also try decreasing the size of the url filters table, every regular expression in that table is consulted for every link added and it can really block the cpu.\n"
"<tr><td><center><b><font color=#ffffff size=+1>Building an Index\n"
"</td></tr></table>\n"
"<br>\n"
"\n"
"<b>1.</b> Determine a collection name for your index. You may just want to use the default, unnamed collection. Gigablast is capable of handling many sub-indexes, known as collections. Each collection is independent of the other collections. You can add a new collection by clicking on the <b>add new collection</b> link on the <a href=\"/admin/spider\">Spider Controls</a> page."
"<br><br>\n"
"\n"
"<b>2.</b> Describe the set of URLs you would like to index to Gigablast by inputting <a href=\"http://www.phpbuilder.com/columns/dario19990616.php3\">regular expressions</a> into Gigablast's \n"
"On that page you can tell Gigablast how often to \n"
"re-index a URL in order to pick up any changes to that URL's content.\n"
"You can assign <a href=#ruleset>rulesets</a> and spider priorities to URLs as well. Furthermore, you can assign a default ruleset and spider priority for all URLs not conforming to any regular expressions you entered.\n"
"<br><br>\n"
"\n"
"<b>3.</b> Test your Regular Expressions. Once you've submitted your \n"
"regular expressions try entering some URLs in the second pink box, entitled,\n"
"<i>URL Filters Test</i> on the <a href=\"/admin/filters\">URL Filters page</a>. This will help you make sure that you've entered your regular expressions correctly.\n"
"<br><br>\n"
"\n"
"<b>4.</b> Enable \"add url\". By enabling the add url interface you will be able to tell Gigablast to index some URLs. You must make sure add url is enabled on the <a href=\"/master\">Master Controls</a> page and also on the <a href=\"/admin/spider\">Spider Controls</a> page for your collection. If it is disabled on the Master Controls page then you will not be able to add URLs for *any* collection.\n"
"<br><br>\n"
"\n"
"<b>5.</b> Submit some seed URLs. Go to the <a href=\"/addurl\">add url \n"
"page</a> for your collection and submit some URLs you'd like to put in your\n"
"index. Usually you want these URLs to have a lot of outgoing links that \n"
"point to other pages you would like to have in your index as well. Gigablast's\n"
"spiders will follow these links and index whatever web pages they point to,\n"
"then whatever pages the links on those pages point to, ad inifinitum. But you\n"
"must make sure that <b>spider links</b> is enabled on the <a href=\"/admin/spider\">Spider Controls</a> page for your collection.\n"
"<br><br>\n"
"\n"
"<b>5.a.</b> Check the spiders. You can go to the <b>Spider Queue</b> page to "
"see what urls are currently being spidered from all collections, as well as see what urls exist in various priority queues, and what urls are cached from various priority queues. If you urls are not being spidered check to see if they are in the various spider queues. Urls added via the add url interface usually go to priority queue 5 by default, but that may have been changed on the Spider Controls page to another priority queue. And it may have been added to any of the hosts' priority queue on the network, so you may have to check each one to find it."
"<br><br>\n"
"If you do not see it on any hosts you can do an <b>all just save</b> in the Master Controls on host #0 and then dump spiderdb using gb's command line dumping function, <b>gb dump s 0 -1 1 -1 5</b> (see gb -h for help) on every host in the cluster and grep out the url you added to see if you can find it in spiderdb."
"<br><br>"
"Then make sure that your spider start and end time on the Spider Controls encompas, and old or new spidering is enabled, and spidering is enabled for that priority queue. If all these check out the url should be spidered asap."
"<br><br>\n"
"<b>6.</b> Regulate the Spiders. Given enough hardware, Gigablast can index \n"
"millions of pages PER HOUR. If you don't want Gigablast to thrash your or\n"
"someone else's website\n"
"then you should adjust the time Gigablast waits between page requests to the\n"
"same web server. To do this go to the \n"
"<a href=\"/admin/spider\">Spider Controls</a> page for your collection and set\n"
"the <b>same domain wait</b> and <b>same ip wait</b> values to how long you want Gigablast to wait in between page requests to the same domain or the same IP address respectively. This value is in milliseconds (ms). There are 1000"
"Gigabot/1.0 is used for the User-Agent field of all HTTP mime headers that Gigablast transmits. \n"
"Gigabot respects the <a href=/spider.html>robots.txt convention</a> (robot exclusion) as well as supporting the meta noindex, noarchive and nofollow meta tags. You can tell Gigabot to ignore robots.txt files on the <a href=\"/admin/spider\">Spider Controls</a> page.\n"
"<li>Using the <a href=\"/admin/tagdb\">tagdb interface</a>, you can assign a <a href=#ruleset>ruleset</a> to a set of sites. All you do is provide Gigablast with a list of sites and the ruleset to use for those sites.\n"
"You can enter the sites via the <a href=\"/admin/tagdb\">HTML form</a> or you can provide Gigablast with a file of the sites. Each file must be limited to 1 Megabyte, but you can add hundreds of millions of sites. \n"
"Sites can be full URLs, hostnames, domain names or IP addresses.\n"
"If you add a site which is just a canonical domain name with no explicit host name, like gigablast.com, then any URL with the same domain name, regardless of its host name will match that site. That is, \"hostname.gigablast.com\" will match the site \"gigablast.com\" and therefore be assigned the associated ruleset.\n"
"Sites may also use IP addresses instead of domain names. If the least significant byte of an IP address that you submit to tagdb is 0 then any URL with the same top 3 IP bytes as that IP will be considered a match.\n"
"<li>You can specify a regular expression to describe a set of URLs using the interface on the <a href=\"/admin/filters\"></a>URL filters</a> page. You can then assign a <a href=#ruleset>ruleset</a> that describes how to spider those URLs and how to index their content. Currently, you can also explicitly assign a spider frequency and spider queue to matching URLs. If these are specified they will override any values in the ruleset."
"</ul>\n"
"If the URL being spidered matches a site in tagdb then Gigablast will use the corresponding ruleset from that and will not bother searching the regular expressions on the <a href=\"/admin/filters\"></a>URL filters</a> page.\n"
"Gigablast uses spider queues to hold and partition URLs. Each spider queue has an associated priority which ranges from 0 to 7. Furthermore, each queue is either denoted as <i>old</i> or <i>new</i>. Old spider queues hold URLs whose content is currently in the index. New spider queues hold URLs whose content is not in the index. The priority of a URL is the same as the priority of the spider queue to which it belongs. You can explicitly assign the priority of a URL by specifying it in a <a href=#ruleset>ruleset</a> to which that URL has been assigned or by assigning it on the <a href=\"/admin/filters\"></a>URL filters</a> page.\n"
"On the <a href=\"/admin/spider\">Spider Controls</a> page you can toggle the spidering of individual spider queues as well as link harvesting. More control on a per queue basis will be available soon, perhaps including the ability to assign a ruleset to a spider queue.\n"
"<br><br>\n"
"The general idea behind spider queues is that it allows Gigablast to prioritize its spidering. If two URLs are overdue to be spidered, Gigabot will download the one in the spider queue with the highest priority before downloading the other. If the two URLs have the same spider priority then Gigabot will prefer the one in the new spider queue. If they are both in the new queue or both in the old queue, then Gigabot will spider them based on their scheduled spider time.\n"
"<br><br>\n"
"Another aspect of the spider queues is that they allow Gigabot to perform depth-first spidering. When no priority is explicitly given for a URL then Gigabot will assign the URL the priority of the \"linker from which it was found\" minus one.\n"
"<br><br>\n"
"<b>Custom Filters</b>\n"
"<br><br>\n"
"You can write your own filters and hook them into Gigablast. A filter is an executable that takes an HTTP reply as input through stdin and makes adjustments to that input before passing it back out through stdout. The HTTP reply is essentially the reply Gigabot received from a web server when requesting a URL. The HTTP reply consists of an HTTP MIME header followed by the content for the URL.\n"
"\n"
"<br><br>\n"
"Gigablast also appends <b>Last-Indexed-Date</b>, <b>Collection</b>, <b>Url</b> and <b>DocId</b> fields to the MIME in order to supply your filter with more information. The Last-Indexed-Date is the time that Gigablast last indexed that URL. It is -1 if the URL's content is currently not in the index.\n"
"<br><br>\n"
"You can specify the name of your filter (an executable program) on the <a href=\"/admin/spider\">Spider Controls</a> page. After Gigabot downloads a web page it will write the HTTP reply into a temporary file stored in the /tmp directory. Then it will pass the filename as the first argument to the first filter by calling the system() function. popen() was used previously but was found to be buggy under Linux 2.4.17. Your program should send the filtered reply back out through stdout.\n"
"<br><br>\n"
"You can use multiple filters by using the pipe operator and entering a filter like \"./filter1 | ./filter2 | ./filter3\". In this case, only \"filter1\" would receive the temporary filename as its argument, the others would read from stdin.\n"
"<br><br>\n"
"<a name=quotas></>\n"
"<b>Document Quotas</b>\n"
"<br><br>\n"
"You can limit the number of documents on a per site basis. By default "
"the site is defined to be the full hostname of a url, like, "
"<i>www.ibm.com</i>. However, using tagdb you can define the site as a "
"domain or even a subfolder within the url. By adjusting the "
"<maxDocs> "
"parameter in the <a href=#ruleset>ruleset</a> for a particular url you "
"can control how many documents are allowed into the index from that site. "
"Additionally, the quotaBoost tables in the same ruleset file allow you to "
"influence how a quota is changed based on the quality of the url being "
"indexed and the quality of its root page. Furthermore, the Spider Controls "
"allow you to turn quota checking on and off for old and new documents. "
"<br><br>"
"The quota checking routine quickly obtains a decent approximation of how "
"many documents a particular site has in the index, but this approximation "
"becomes "
"higher than the actual count as the number of big indexdb files increases, "
"so you may want to keep <indexdbMinFilesToMerge> in "
"<a href=#config>gb.conf</a> "
"down to a value of around "
"five or so to ensure a half way decent approximation. Typically you can "
"excpect to be off by about 1000 to 2000 documents for every indexdb file "
"Gigablast allows you to inject documents directly into the index by using the command <b>gb [-c <<a href=#hosts>hosts.conf</a>>] <hostId> --inject <file></b> where <file> must be a sequence of HTTP requests as described below. They will be sent to the host with id <hostId>."
"<br><br>\n"
"You can also inject your own content a second way, by using the <a href=\"/admin/inject\">Inject URL</a> page. "
"<br><br>\n"
"Thirdly you can use your own program to feed the content directly to Gigablast using the same form parameters as the form on the Inject URL page."
"<br><br>\n"
"In any of the three cases, be sure that url injection is enabled on the <a href=/master>Master Controls</a> page."
""
"<br><br><br>\n"
"<b>Input Parameters</b>\n"
"<br><br>\n"
"When sending an injection HTTP request to a Gigablast server, you may optionally supply an HTTP MIME in addition to the content. This MIME is treated as if Gigablast's spider downloaded the page you are injecting and received that MIME. If you do supply this MIME you must make sure it is HTTP compliant, preceeds the actual content and ends with a \"\r\n\r\n\" followed by the content itself. The smallest mime header you can get away with is \"HTTP 200\r\n\r\n\" which is just an \"OK\" reply from an HTTP server."
""
"<br><br>\n"
"The cgi parameters accepted by the /inject URL for injecting content are the following: (<b>remember to map spaces to +'s, etc.</b>)"
"<br><br>\n"
"\n"
"<table cellpadding=4>\n"
"\n"
"<tr><td bgcolor=#eeeeee>u=X</b></td>\n"
"<td bgcolor=#eeeeee>X is the url you are injecting. This is required.</td></tr>\n"
"\n"
"<tr><td>c=X</b></td>\n"
"<td>X is the name of the collection into which you are injecting the content. This is required.</td></tr>\n"
"\n"
"<tr><td bgcolor=#eeeeee>delete=X</b></td>\n"
"<td bgcolor=#eeeeee>X is 0 to add the URL/content and 1 to delete the URL/content from the index. Default is 0.</td></tr>\n"
"\n"
"<tr><td>ip=X</b></td>\n"
"<td>X is the ip of the URL (i.e. 1.2.3.4). If this is ommitted or invalid then Gigablast will lookup the IP, provided <i>iplookups</i> is true. But if <i>iplookups</i> is false, Gigablast will use the default IP of 1.2.3.4.</td></tr>\n"
"\n"
"<tr><td bgcolor=#eeeeee>iplookups=X</b></td>\n"
"<td bgcolor=#eeeeee>If X is 1 and the ip of the URL is not valid or provided then Gigablast will look it up. If X is 0 Gigablast will never look up the IP of the URL. Default is 1.</td></tr>\n"
"\n"
"<!--<tr><td>isnew=X</b></td>\n"
"<td>If X is 0 then the URL is presumed to already be in the index. If X is 1 then URL is presumed to not be in the index. Omitting this parameter is ok for now. In the future it may be put to use to help save disk seeks. Default is 1.</td></tr>-->\n"
"\n"
"<tr><td>dedup=X</b></td>\n"
"<td>If X is 1 then Gigablast will not add the URL if another already exists in the index from the same domain with the same content. If X is 0 then Gigablast will not do any deduping. Default is 1.</td></tr>\n"
"\n"
"<tr><td bgcolor=#eeeeee>rs=X</b></td>\n"
"<td bgcolor=#eeeeee>X is the number of the <a href=#ruleset>ruleset</a> to use to index the URL and its content. It will be auto-determined if <i>rs</i> is omitted or <i>rs</i> is -1.</td></tr>\n"
"\n"
"<tr><td>quick=X</b></td>\n"
"<td>If X is 1 then the reply returned after the content is injected is the reply described directly below this table. If X is 0 then the reply will be the HTML form interface.</td></tr>\n"
"\n"
"<tr><td bgcolor=#eeeeee>hasmime=X</b></td>\n"
"<td bgcolor=#eeeeee>X is 1 if the provided content includes a valid HTTP MIME header, 0 otherwise. Default is 0.</td></tr>\n"
"\n"
"<tr><td>content=X</b></td>\n"
"<td>X is the content for the provided URL. If <i>hasmime</i> is true then the first part of the content is really an HTTP mime header, followed by \"\r\n\r\n\", and then the actual content.</td></tr>\n"
"\n"
"<tr><td bgcolor=#eeeeee>ucontent=X</b></td>\n"
"<td bgcolor=#eeeeee>X is the UNencoded content for the provided URL. Use this one <b>instead</b> of the <i>content</i> cgi parameter if you do not want to encode the content. This breaks the HTTP protocol standard, but is convenient because the caller does not have to convert special characters in the document to their corresponding HTTP code sequences. <b>IMPORTANT</b>: this cgi parameter must be the last one in the list.</td></tr>\n"
"\n"
"</table>\n"
"\n"
"<br><br>\n"
"<b>Sample Injection Request</b> (line breaks are exclusively specified by \r\n sequences):<br>\n"
"<html>This is the unencoded content of the page we are injecting.</html>\n"
"</pre>\n"
"<br>\n"
"<b>The Reply</b>\n"
"<br><br>\n"
"<a name=ireply></a>"
"The reply is always a typical HTTP reply, but if you defined <i>quick=1</i> then the *content* (the stuff below the returned MIME) of the HTTP reply to the injection request is of the format:"
"Where <X> is a string of digits in ASCII, corresponding to the error code. X is 0 on success (no error) in which case it will be followed by a <b>int64_t</b> docId and a hostId, which corresponds to the host in the <a href=#hosts>hosts.conf</a> file that stored the document. Any twins in its group will also have copies. If there was an error then X will be greater than 0 and may be followed by a space then the error message itself. If you did not define <i>quick=1</i>, then you will get back a response meant to be viewed on a browser."
"You can delete documents from the index two ways:"
"<ul>\n"
"<li>Perhaps the most popular is to use the <a href=\"/admin/reindex\">Reindex URLs</a> tool which allows you to delete all documents that match a simple query. Furthermore, that tool allows you to assign rulesets to all the domains of all the matching documents. All documents that match the query will have their docids stored in a spider queue of a user-specified priority. The spider will have to be enabled for that priority queue for the deletion to take place. Deleting documents is very similar to adding documents."
"<br><br>\n"
"<li>To delete a single document you can use the <a href=\"/admin/inject\">Inject URL</a> page. Make sure that url injection is enabled on the <a href=/master>Master Controls</a> page."
"<tr><td><center><b><font color=#ffffff size=+1>Scoring A Document</font></a>\n"
"</td></tr></table>\n"
"<br><br>\n"
"Gigablast scores word and phrases in a document based on the following criteria:"
"<br>\n"
"<ul>\n"
"<li>the <b>quality</b> of the document"
"<li>the <b>count</b> of the word or phrase in the document"
"<li>the <b>locations</b> of the word or phrase in the document"
"<li>the <b>spaminess</b> of the word or phrase in the document"
"<li>the <b>length</b> of the part of the document being indexed"
"</ul>\n"
"<br>\n"
"By assigning a <a href=\"#ruleset\">ruleset</a> to a document, you can control exactly how Gigablast uses these criteria to generate the score of a word or phrase in that document."
"<br><br>\n"
"<br>\n"
"<b>Index Rules</b></center>\n"
"<br>\n"
"When Gigablast indexes a document it first chains down the <a href=\"#indexblock\"><index> tags</a> that are listed in the ruleset in the order they are presented. Each of these <index> tags, and all that it contains, represents one <b><font color=red>index rule</font></b>. Each index rule describes how to index a portion of the document. Different portions of the document may be indexed and scored in different ways. The order of these index rules can be very important, since the same word's score accumulates from one index rule to the next, and different index rules may have different score ceilings for that word."
""
"<br><br>\n"
"In addition to describing the various sub tags of an index rule in the <a href=\"#indexingsection\">sample ruleset file</a>, they are further described in the following table:"
"<td bgcolor=#eeeeee>This tag tells Gigablast what part of the document to index. You can have multiple <name> tags in the same index rule. <i>X</i> can be one of the following values:<br><br>\n"
""
"<table cellpadding=1>\n"
"<tr><td><i>ABSENT</i></td><td>If the <name> tag is not present Gigablast will restrict itself to all words and phrases in the document that are not in between a < and > of any tag in the document.</td></tr>\n"
"<tr><td>title</td><td>Tells Gigablast to index the words and phrases in between the first pair of <title> and </title> tags present in the document, if one exists.</td></tr>\n"
"<tr><td>meta.keywords</td><td>Tells Gigablast to index the words and phrases contained in the content field of the <meta name=keywords content=X> tag.</td></tr>"
"<tr><td>meta.keyword</td><td> See above</td></tr>"
"<tr><td>meta.summary</td><td> See above</td></tr>"
"<tr><td>meta.description</td><td> See above</td></tr>"
"<tr><td>meta.foobar</td><td>COMING SOON (user-defined meta tags)</td></tr>"
"<tr><td>foobar</td><td>You can add your own tag, like <i><foobar>index this</></i> to the document and then have an index rule with a <name>foobar</> that contains rules on how to index it.</td></tr>\n"
"</table>\n"
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td>\n"
"<prefix>X</>"
"</td>\n"
"<td>\n"
"If present, Gigablast will index the words and phrases with the specified prefix, <i>X</i>. Fielded searches can then be performed. Example: <prefix>title</>"
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td bgcolor=#eeeeee>\n"
"<maxQualityForSpamDetect>X</>"
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"Spam detection will be performed on the words and phrases if the document's quality is <i>X</i> or lower. Spam detection generally lowers the scores of repeated words and phrases based on the degree of repetition."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td>\n"
"<minQualityToIndex>X</>"
"</td>\n"
"<td>\n"
"If the document's quality is below <i>X</i>, then do not index the words and phrases for this index rule."
"</td>\n"
"</tr>\n"
""
""
"<tr>\n"
"<td bgcolor=#eeeeee>\n"
"<filterHtmlEntities>X</>"
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"If <i>X</i> is <i>yes</i> then convert HTML entities, like &gt;, into their represented characters before indexing."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td>\n"
"<indexIfUniqueOnly>X</>"
"</td>\n"
"<td>\n"
"If <i>X</i> is <i>yes</i> then each word or phrase will only be indexed if not already indexed by a previous index rule in the ruleset, and only the first occurence of the word or phrase will be indexed, subsequent occurences will not count towards the score."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td bgcolor=#eeeeee>\n"
"<indexSingletons>X</>"
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"If <i>X</i> is <i>yes</i> then index the words, otherwise do not."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td>\n"
"<indexPhrases>X</>"
"</td>\n"
"<td>\n"
"If <i>X</i> is <i>yes</i> then index the phrases, otherwise do not."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td bgcolor=#eeeeee>\n"
"<indexAsWhole>X</>"
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"If <i>X</i> is <i>yes</i> then index the whole sequence of indexable words as a checksum."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td>\n"
"<useStopWords>X</>"
"</td>\n"
"<td>\n"
"If <i>X</i> is <i>yes</i> then use <a href=\"#stopwords\">stop words</a> when forming phrases."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td bgcolor=#eeeeee>\n"
"<useStems>X</>"
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"If <i>X</i> is <i>yes</i> then index stems. Currently unsupported."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td>\n"
""
"<quality11> X1 </><br>\n"
"<quality12> X2 </>...<br>\n"
"<quality1N> XN </><br>\n"
"<maxLen11> Y1 </><br>\n"
"<maxLen12> Y2 </>...<br>\n"
"<maxLen1N> YN </><br>\n"
""
"</td>\n"
"<td>\n"
"This maps the quality of the document to a maximum number of CHARACTERS to index. <a name=\"mapdesc\"></a><b>The (Xn,Yn) points form a piecewise function which is linearly interpolated between points. The edges are horizontal, meaning, if X is 0 Y will be Y1, or if X is infinite, Y will be YN.</b>\n"
"</td>\n"
"</tr>\n"
""
""
"<tr>\n"
"<td bgcolor=#eeeeee>\n"
"<a name=\"v1\"></a>\n"
"<quality21> X1 </><br>\n"
"<quality22> X2 </>...<br>\n"
"<quality2N> XN </><br>\n"
"<maxScore21> Y1 </><br>\n"
"<maxScore22> Y2 </>...<br>\n"
"<maxScore2N> YN </><br>\n"
""
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"This maps the quality of the document to a percentage of the absolute max score a word or phrase can have. This is the QUALITY_WEIGHT_MAX value in the <a href=\"#formula\">formula</a>."
"</td>\n"
"</tr>\n"
""
""
"<tr>\n"
"<td>\n"
"<a name=\"v2\"></a>\n"
"<quality31> X1 </><br>\n"
"<quality32> X2 </>...<br>\n"
"<quality3N> XN </><br>\n"
"<scoreWeight31> Y1 </><br>\n"
"<scoreWeight32> Y2 </>...<br>\n"
"<scoreWeight3N> YN </><br>\n"
""
"</td>\n"
"<td>\n"
"This maps the quality of the document to a percentage weight on the base score of the words and phrases being indexed. This is the QUALITY_WEIGHT value in the <a href=\"#formula\">formula</a>."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td bgcolor=#eeeeee>\n"
"<a name=\"v3\"></a>\n"
"<len41> X1 </><br>\n"
"<len42> X2 </>...<br>\n"
"<len4N> XN </><br>\n"
"<scoreWeight41> Y1 </><br>\n"
"<scoreWeight42> Y2 </>...<br>\n"
"<scoreWeight4N> YN </><br>\n"
""
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"This maps the length (in characters) of the what is being indexed to a percentage weight on the base score of the words and phrases being indexed. This is the LENGTH_WEIGHT value in the <a href=\"#formula\">formula</a>."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td>\n"
"<a name=\"v4\"></a>\n"
"<len51> X1 </><br>\n"
"<len52> X2 </>...<br>\n"
"<len5N> XN </><br>\n"
"<maxScore51> Y1 </><br>\n"
"<maxScore52> Y2 </>...<br>\n"
"<maxScore5N> YN </><br>\n"
"</td>\n"
"<td>\n"
"This maps the length (in characters) of the what is being indexed to a percentage of the absolute maximum score a word or phrase can have. This is the LENGTH_WEIGHT_MAX value in the <a href=\"#formula\">formula</a>."
"</td>\n"
"</tr>\n"
"</table>\n"
""
"<a name=\"formula\"></>\n"
"<br><br>\n"
"<b>Computing the Score</b>\n"
"<br><br>\n"
"Each word in the document is assigned a base score with this formula : "
"See above table for descriptions of these variables. The BOOST is 256 if the page links to gigablast.com or has a submission for to gigablast.com, but if not, the BOOST is 0."
""
"<br>\n"
"<br>\n"
"After the base score is computed, it is multiplied by the number of occurences of the word or phrase in the portion of the document being indexed as specified by the index rule. This score may then be reduced if spam detection occurred and the word or phrase was deemed repetitious. Spam detection is triggered when the quality of the document is at or below the value specified in the <minQualityForSpamDetect> tag in the index rule. Finally, the score is mapped into an 8 bit value, from 1 to 255, and stored in the index."
"To see the scoring algorithm in action you can use the <b><a href=\"/admin/parser\">Parser Tool</a></b>. It will show each indexed word and phrase and its associated score, as well as some attributes associated with the indexed document."
"Attributes which describe the document are often indexed but are deemed too simple to require their own index rule. This includes indexing portions of the url in various ways and indexing the content type of the document, as the table below illustrates."
"<br><br>\n"
"<table cellpadding=4>\n"
"<tr>\n"
"<td><b>Item</b></td>\n"
"<td><b>Ruleset Tag</b></td>\n"
"<td><b>Desription</b></td>\n"
"</tr>\n"
""
"<td bgcolor=#eeeeee>\n"
"<meta name=foo content=bar>"
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"--"
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"User-defined meta tags use the quality of the document multiplied by 256 as their score. If this product is 0 it is upped to 1. This score is then mapped to an 8-bit final score an indexed. Furthermore, when indexing user-defined meta tags, only one occurence of each word or phrase is counted. In the future, these meta tags may have their own index rule."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td>\n"
"http://www.xxx.com/abc"
"</td>\n"
"<td>\n"
"<indexUrl>X</>"
"</td>\n"
"<td>If X is <i>yes</i> then the entire url is indexed as one word with a BASE_SCORE of 1 and with a url: prefix so a search for <i>url:http://www.xxx.com/</i> will bring up the document."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td bgcolor=#eeeeee>\n"
"http://www.xxx.com/abc"
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"<indexSubUrl>X</>"
"</td>\n"
"<td bgcolor=#eeeeee>If X is <i>yes</i> then the url is indexed as if it occured in the document, but with a random BASE_SCORE (based on url hash) and a suburl: prefix so a search for <i>suburl:\"com/abc\"</i> will bring up the document."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td>\n"
"http://www.xxx.com/abc"
"</td>\n"
"<td>\n"
"<indexIp>X</>"
"</td>\n"
"<td>If X is <i>yes</i> then the IP of the url will be indexed as if it were one word but with a random BASE_SCORE (based on url hash). Furthermore, the last number of the IP address is replaced with a zero and that IP address is indexed in order to provide an IP domain search ability. So if a url has the IP address 1.2.3.4 then a search for ip:1.2.3.4 or for ip:1.2.3 should bring it up."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td bgcolor=#eeeeee>\n"
"http://www.xxx.com/abc?q=hi"
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"<indexSite>X</>"
"</td>\n"
"<td bgcolor=#eeeeee>If X is <i>yes</i> then the following terms would be indexed with a base score of BASE_SCORE (but multiplied by 3 if the url is a root url): "
"<ul>\n"
"<li>site:www.xxx.com/abc?q=hi"
"<li>site:www.xxx.com/abc?"
"<li>site:www.xxx.com/"
"<li>site:xxx.com/abc?q=hi"
"<li>site:xxx.com/abc?"
"<li>site:xxx.com/"
"</ul>\n"
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td bgcolor=#eeeeee>\n"
"http://www.xxx.com/form.php"
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"<indexExt>X</>"
"</td>\n"
"<td bgcolor=#eeeeee>If X is <i>yes</i> then the file extension, if any, of the url would be indexed with the ext: prefix and a score of BASE_SCORE. So a query of ext:php would bring up the document in this example case."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td>\n"
"links"
"</td>\n"
"<td>\n"
"<indexLinks>X</>"
"</td>\n"
"<td>If X is <i>yes</i> then the various links in the document will be indexed with a link: prefix. Scores are special in this case."
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td bgcolor=#eeeeee>\n"
"collection name"
"</td>\n"
"<td bgcolor=#eeeeee>\n"
"--"
"</td>\n"
"<td bgcolor=#eeeeee>The collection name of the document is indexed with the coll: prefix and a BASE_SCORE of 1. "
"</td>\n"
"</tr>\n"
""
"<tr>\n"
"<td>\n"
"content type"
"</td>\n"
"<td>\n"
"--"
"</td>\n"
"<td>The content type of the document is indexed with the type: (or filetype:) prefix and a BASE_SCORE of 1. If the content type is not one of these supported content types, then nothing will be indexed: "
"<tr><td><center><b><font color=#ffffff size=+1>Indexing User-Defined Meta Tags"
"</td></tr></table>\n"
"<br>\n"
"Gigablast supports the indexing, searching and displaying of user-defined meta tags. For instance, if you have a tag like <i><meta name=\"foo\" content=\"bar baz\"></i> in your document, then you will be able to do a search like <i><a href=\"/search?q=foo%%3Abar&dt=foo\">foo:bar</a></i> or <i><a href=\"/search?q=foo%%3A%%22bar+baz%%22&dt=foo\">foo:\"bar baz\"</a></i> and Gigablast will find your document. "
"<br><br>\n"
"You can tell Gigablast to display the contents of arbitrary meta tags in the search results, like <a href=\"/search?q=gigablast&s=10&dt=author+keywords%%3A32\">this</a>. Note that you must assign the <i>dt</i> cgi parameter to a space-separated list of the names of the meta tags you want to display. You can limit the number of returned characters of each tag to X characters by appending a <i>:X</i> to the name of the meta tag supplied to the <i>dt</i> parameter. In the link above, I limited the displayed keywords to 32 characters. The content of the meta tags is also provided in the <display> tags in the <a href=\"#output\">XML feed</a>\n"
"<br><br>\n"
"Gigablast will index the content of all meta tags in this manner. Meta tags with the same <i>name</i> parameter as other meta tags in the same document will be indexed as well.\n"
"<br><br>\n"
"Why use user-defined metas? Because it is very powerful. It allows you to embed custom data in your documents, search for it and retrieve it."
"<br>\n"
"<br>\n"
"You can also explicitly specify how to index certain meta tags by making an <index> tag in the <a href=\"#ruleset\">ruleset</a> as shown <a href=\"#rsmetas\">here</a>. The specified meta tags will be indexed in the user-defined meta tag fashion as described above, in addition to any method described in the ruleset."
"<tr><td><center><b><font color=#ffffff size=+1>Indexing Big Documents"
"</td></tr></table>\n"
"<br>\n"
"When indexing a document you will be bound by the available memory of the machine that is doing the indexing. A document that is dense in words can takes as much as ten times the memory as the size of the document in order to process it for indexing. Therefore you need to make sure that the amount of available memory is adequate to process the document you want to index. You can turn off Spam detection to reduce the processing overhead by a little bit."
"<br>\n"
"<br>\n"
"The <b><maxMem></b> tag in the <a href=#config>gb.conf</a> file controls the maximum amount of memory that the whole Gigablast process can use. HOWEVER, this memory is shared by databases, thread stacks, protocol stacks and other things that may or may not use most of it. Probably, the best way to see much memory is available to the Gigablast process for processing a big document is to look at the <b>Stats Page</b>. It shows you exactly how much memory is being used at the time you look at it. Hit refresh to see it change."
"<br>\n"
"<br>\n"
"You can also check all the tags in the gb.conf file that have the word \"mem\" in them to see where memory is being allocated. In addition, you will need to check the first 100 lines of the log file for the gigablast process to see how much memory is being used for thread and protocol stacks. These should be displayed on the Stats page, but are currently not."
"<br>\n"
"<br>\n"
"After ensuring you have enough extra memory to handle the document size, you will need to make sure the document fits into the tree that is used to hold the documents in memory before they get dumped to disk. The documents are compressed using zlib before being added to the tree so you might expect a 5:1 compression for a typical web page. The memory used to hold document in this tree is controllable from the <b><titledbMaxTreeMem></b> parameter in the gb.conf file. Make sure that is big enough to hold the document you would like to add. If the tree could accomodate the big document, but at the time is partially full, Gigablast will automatically dump the tree to disk and keep trying to add the big document."
"Finally, you need to ensure that the <b>max text doc len</b> and <b>max other doc len</b> controls on the <b>Spider Controls</b> page are set to accomodating sizes. Use -1 to indicate no maximum. <i>Other</i> documents are non-text and non-html documents, like PDF, for example. These controls will physically prohibit the spider from downloading more than this many bytes. This causes excessively int32_t documents to be truncated. If the spider is downloading a PDF that gets truncated then it abandons it, because truncated PDFs are useless."
"<tr><td><center><b><font color=#ffffff size=+1>Indexing Different Languages"
"</td></tr></table>\n"
"<br>\n"
"Gigablast currently just supports indexing the ISO-8859-1 (aka Latin-1) character set. This character set uses one byte (8 bits) per character. It covers most West European languages such as French, Spanish, Catalan, Galician, Basque, Portuguese, Italian, Albanian, Afrikaans, Dutch, German, Danish, Swedish, Norwegian, Finnish, Faroese, Icelandic, Irish, Scottish and English.<br><br>"
"Gigablast has a switch in the Spider Controls for enabling and disabling "
"the indexing of Asian character sets. If the spider is downloading a "
"document and Asian character sets are disallowed, then it will check for "
"the Content-Encoding field in the mime of the HTTP reply from the web "
"server. If the name of the character set is one of the Asian character "
"sets that Gigablast recognizes, then the document will NOT be indexed. "
"This control covers some of the more popular "
"characters sets in Asia, but if the character set is not recognized by "
"Gigablast it will be indexed as if it were the Latin-1 character set. "
"<tr><td><center><b><font color=#ffffff size=+1>Rolling the New Index"
"</td></tr></table>\n"
"<br>\n"
"Just because you have indexed a lot of pages does not mean those pages are being searched. If the <b>restrict indexdb for queries</b> switch on the <a href=\"/admin/spider\">Spider Controls</a> page is on for your collection then any query you do may not be searching some of the more recently indexed data. You have two options:"
"<br><br>\n"
"<b>1.</b>You can turn this switch off which will tell Gigablast to search all the files in the index which will give you a realtime search, but, if <indexdbMinFilesToMerge> is set to <i>X</i> in the <a href=#config>gb.conf</a> file, then Gigablast may have to search X files for every query term. So if X is 40 this can destroy your performance. But high X values are indeed useful for speeding up the build time. Typically, I set X to 4 on gigablast.com, but for doing initial builds I will set it to 40."
"<br><br>\n"
"<b>2.</b>The second option you have for making the newer data searchable is to do a <i>tight merge</i> of indexdb. This tells Gigablast to combine the X files into one. Tight merges typically take about 2-4 minutes for every gigabyte of data that is merged. So if all of your indexdb* files are about 50 gigabytes, plan on waiting about 150 minutes for the merge to complete."
"<br><br>\n"
"<b>IMPORTANT</b>: Before you do the tight merge you should do a <b>disk dump</b> which tells Gigablast to dump all data in memory to disk so that it can be merged. In this way you ensure your final merged file will contain *all* your data. You may have to wait a while for the disk dump to complete because it may have to do some merging right after the dump to keep the number of files below <indexdbMinFilesToMerge>."
""
"<br><br>\n"
"Now if you are <a href=#input>interfacing to Gigablast</a> from another program you can use the <b>&rt=[0|1]</b> real time search cgi parameter. If you set this to 0 then Gigablast will only search the first file in the index, otherwise it will search all files."
"<td>Messages printed when a document was not indexed because the document quota specified in the ruleset was breeched. Also, urls that were truncated because they were too long. Or a robots.txt file was too big and was truncated.</td>\n"
"Gigablast is a fairly sophisticated database that has a few things you can tweak to increase query performance or indexing performance.\n"
"<br><br>\n"
"\n"
"<b>Query Optimizations:</b>\n"
"\n"
"<ul>\n"
"<li> Set <b>restrict indexdb for queries</b> on the \n"
"<a href=\"/admin/spider\">Spider Controls</a> page to YES.\n"
"This parameter can also be controlled on a per query basis using the \n"
"<a href=#rt><b>rt=X</b></a> cgi parm.\n"
"This will decrease freshness of results but typically use\n"
"3 or 4 times less disk seeks.\n"
"\n"
"<li> If you want to spider at the same time, then you should ensure\n"
"that the <b>max spider disk threads</b> parameter on the\n"
"<a href=\"/master\">Master Controls</a> page is set to around 1 \n"
"so the indexing/spidering processes do not hog the disk.\n"
"\n"
"<li> Set Gigablast to read-only mode to true to prevent Gigablast from using \n"
"memory to hold newly indexed data, so that this memory can \n"
"be used for caches. Just set the <b><readOnlyMode></b> parameter in your config file to 1.\n"
"\n"
"<li> Increase the indexdb cache size. The <b><indexdbMaxCacheMem></b> \n"
"parameter in\n"
"your config file is how many bytes Gigablast uses to store <i>index lists</i>.\n"
"Each word has an associated index list which is loaded from disk when that\n"
"word is part of a query. The more common the word, the bigger its index list.\n"
"By enabling a large indexdb cache you can save some fairly large disk reads.\n"
"\n"
"<li> Increase the clusterdb cache size. The <b><clusterdbMaxCacheMem></b>\n"
"parameter in\n"
"your config file is how many bytes Gigablast uses to store cluster records.\n"
"Cluster records are used for site clustering and duplicate removal. Every\n"
"URL in the index has a corresponding cluster record. When a url appears as a \n"
"search result its cluster record must be loaded from disk. Each cluster \n"
"record is about 12 to 16 bytes so by keeping these all in memory you can\n"
"save around 10 disk seeks every query.\n"
"\n"
"<li> Disable site clustering and dup removal. By specifying <i>&sc=0&dr=0</i>\n"
"in your query's URL you ensure that these two services are avoided and no\n"
"cluster records are loaded. You can also turn them off by default on the\n"
"<a href=\"/admin/spiderdb\">Spider Controls</a> page. But if someone explicitly\n"
"specifies <i>&sc=1</i> or <i>&dr=1</i> in their query URL then they will\n"
"override that switch.\n"
"\n"
"<li>If you are experiencing a high average query latency under a high query throughput then consider adding more twins to your architecture. If you do not have any twins, and are serving a large query volume, then data requests tend to clump up onto one particular server at random, slowing everybody else down. If that server has one or more twins available, then its load will be evened out through Gigablast's dynamic load balancing and the average query latency will decrease."
"\n"
"</ul>\n"
"\n"
"<br>\n"
"\n"
"<b>Build Optimizations:</b>\n"
"<ul>\n"
"<li> Set <b>restrict indexdb for spidering</b> on the \n"
"<li> Disable dup checking. Gigablast will not allow any duplicate pages\n"
"from the same domain into the index when this is enabled. This means that\n"
"Gigablast must do about one disk seek for every URL indexed to verify it is\n"
"not a duplicate. If you keep checksumdb all in memory this will not be a\n"
"problem.\n"
"<li> Disable <b>link voting</b>. Gigablast performs at least one disk seek\n"
"to determine who link to the URL being indexed. If it does have some linkers\n"
"then the Cached Copy of each linker (up to 200) is loaded and the corresponding\n"
"link text is extracted. Most pages do not have many linkers so the disk\n"
"load is not too bad. Furthermore, if you do enable link voting, you can\n"
"restrict it to the first file of indexdb, <b>restrict indexdb for \n"
"spidering</b>, to ensure that about one seek is used to determine the linkers.\n"
"<li> Enable <b>use IfModifiedSince</b>. This tells the spider not to do \n"
"anything if it finds that a page being reindexed is unchanged since the last\n"
"time it was indexed. Some web servers do not support the IfModifiedSince tag,\n"
"so Gigablast will compare the old page with the new one to see if anything\n"
"changed. This backup method is not quite as efficient as the first, \n"
"but it can still save ample disk resources.\n"
"<!--<li> Don't let Linux's bdflush flush the write buffer to disk whenever it \n"
"wants. Gigablast needs to control this so it won't perform a lot of reads\n"
"when a write is going on. Try performing a 'echo 1 > /proc/sys/vm/bdflush'\n"
"to make bdflush more bursty. More information about bdflush is available\n"
"in the Linux kernel source Documentation directory in the proc.txt file.-->\n"
"</ul>\n"
"\n"
"<br>\n"
"\n"
"<b>General Optimizations:</b>\n"
"<ul>\n"
"<li> Prevent Linux from unnecessary swapping. Linux will often swap out\n"
"Gigablast pages to satisfy Linux's disk cache. By using the swapoff command\n"
"to turn off swap you can increase performance, but if the computer runs out\n"
"of memory it will start killing processes withouth giving them a chance\n"
"to save their data.\n"
"<!--using Rik van Riel's\n"
"patch, rc6-rmap15j, applied to kernel 2.4.21, you can add the \n"
"/proc/sys/vm/pagecache control file. By doing a \n"
"'echo 1 1 > /proc/sys/vm/pagecache' you tell the kernel to only use 1%% of\n"
"the swap space, so swapping is effectively minimized.-->\n"
"Every gb process uses the same hosts.conf file. The hosts.conf file describes the hosts (gb processes) participating in the network.\n"
"Each line in this file is a host entry. The number of participating hosts must be a power of 2. Each host entry uses the following fields: <br><br>\n"
"<table cellpadding=3>\n"
"<tr><td><b>ID</b></td><td>Each host has a unique id. The ids must be contiguous.</td></tr>\n"
"<tr><td><b>IP</b></td><td>Each host has an IP. If you are running multiple hosts on the same computer they may all use the same IP.</td></tr>\n"
"<tr><td><b>LINKIP</b></td><td>This is the IP of this host as viewed externally. It may or may not be different from the internal IP. It is only used for generating absolute (non-relative) links for <a href> tags on dynamic HTML pages.\n"
"<tr><td><b>UDP1</b></td><td>This is the low priority udp port used by the host. Hosts on the same computer must have different ports. Port numbers must be above 2000 or so, because only root has permission to use those ports.</td></tr>\n"
"<tr><td><b>UDP2</b></td><td>This is the high priority udp port used by the host. Hosts on the same computer must have different ports. Port numbers must be above 2000 or so, because only root has permission to use those ports.</td></tr>\n"
"<tr><td><b>DNS</b></td><td>This is the client port we use locally when talking to the dns server.</td></tr>\n"
"<tr><td><b>HTTP</b></td><td>This is the HTTP port used by the host. To avoid conflicts, hosts on the same computer must have different ports. Port numbers must be above 2000 or so, because only root has permission to use those ports.</td></tr>\n"
"<tr><td><b>IDE</b></td><td>The IDE channel number that the host uses. Hosts on the same computer that share the same IDE bus must have this number be the same.</td></tr>\n"
"<tr><td><b>GRP</b></td><td>The redundancy group number to which the host belongs. Hosts that are mirror images (twins) of each other have the same redundancy group number.</td></tr>\n"
"A <b>ruleset</b> is a set of rules used for spidering and indexing the content of a URL. This <a href=\"#classifying\">section</a> talks about how to assign a ruleset to a URL. Each ruleset is a file in Gigablast's working directory with a file name like tagdb*.xml, where '*' is a number.\n"
"<br><br>\n"
"<b>IMPORTANT:</b> Do not change the <a href=\"#indexingsection\">indexing section</a> or the <linksUnbanned>, <linksClean> or <linksDirty> tags of a ruleset file if some documents in the index were indexed with that ruleset file. To do so might create some unrepairable data corruption.\n"
"<br><br>\n"
"The following is an example ruleset for a particular URL (\"the URL\"):\n"
"\n"
"<pre>\n"
"\n"
"# This is the unique name of the ruleset which is used for \n"
"# display in drop-down menus in administrative, web-based GUIs.\n"
"<b><name>default</></b>\n"
"<a name=\"qualitysection\"></a>\n"
"\n"
"# This is the accompanying description displayed on the Sitedb tool and\n"
"# URL Filters pages.\n"
"<b><description>This is the default ruleset used for most urls.</></b>\n"
"This simple script is used to start up all the gb hosts (processes) native to a particular computer. It also redirects the gb programs standard error to a log file. Notice that the gb executable takes the gb.conf filename as its first argument."