<html>
<!--TODO: make interactive comparison-->
<title>Comparison of the Gigablast and Solr Open Source Search Engines</title>
<h2>Comparing Gigablast to Solr</h2>
<table cellspacing=10 border=1>
<tr>
<td style=max-width:100px;min-width:10%;></td>
<td style=min-width:30%><b><a href=http://www.gigablast.com/>Gigablast</a></b></td>
<td style=min-width:30%><b><a href=http://lucene.apache.org/solr/>Solr</a></b></td>
<!--
<td><b><a href=http://www.elasticsearch.org/>ElasticSearch</a></b></td>-->
</tr>
<tr valign=top>
<td><b>Package Installation</b></td>
<!-- gb install -->
<td>
<a href=/admin.html#quickstart>Download packages for Ubuntu or Red Hat</a>
</td>
<!-- solr install-->
<td>
<a href=http://wiki.apache.org/solr/SolrInstall>Instructions</a>
</td>
<!-- elastic search install-->
<!--
<td>
<ul>
<li>wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.zip
<li>unzip elasticsearch-0.90.3.zip
<li>cd elasticsearch-0.90.3
<li>cd bin
<li>./elasticsearch -f
<li>curl -X GET http://localhost:9200/
</ul>
</td>
-->
</tr>
<tr valign=top>
<td><b>Source Language</b></td>
<!-- gb install -->
<td>
<font color=green><b>
C/C++
</b></font>
</td>
<!-- solr install-->
<td>
Java
</td>
</tr>
<tr valign=top>
<td><b>Runs on Linux</b></td>
<!-- gb install -->
<td>
Yes.
</td>
<!-- solr install-->
<td>
Yes.
</td>
</tr>
<tr valign=top>
<td><b>Runs on Windows</b></td>
<!-- gb install -->
<td>
Yes, with VirtualBox. Native support coming soon.
</td>
<!-- solr install-->
<td>
Yes.
</td>
</tr>
<tr valign=top>
<td><b>License</b></td>
<!-- gb install -->
<td>
Apache License 2.0
</td>
<!-- solr install-->
<td>
Apache License 2.0
</td>
</tr>
<tr valign=top>
<td><b>Release Date</b></td>
<!-- gb install -->
<td>
2000
</td>
<!-- solr install-->
<td>
2007
</td>
</tr>
<tr valign=top>
<td><b>Scalability</b></td>
<!-- gb install -->
<td>
<font color=green><b>
Has scaled to over 12 billion unique web pages.
Can scale to over 100 billion pages in a single collection.
</b></font>
</td>
<!-- solr install-->
<td>
Good luck!
</td>
</tr>
<tr valign=top>
<td><b>HTTP API</b></td>
<!-- gb install -->
<td>
<a href=/admin/api>here</a>
</td>
<!-- solr install-->
<td>
<a href=http://wiki.apache.org/solr/SchemaRESTAPI>here</a>
</td>
</tr>
<tr valign=top>
<td><b>Search Results</b></td>
<!-- gb install -->
<td>
<a href=http://www.google.com/search?q=gigablast>here</a>
</td>
<!-- solr install-->
<td>
<a href=http://www.google.com/search?q=solr>here</a>
</td>
</tr>
<tr valign=top>
<td><b>Source Repository</b></td>
<!-- gb install -->
<td>
<a href=https://github.com/gigablast/open-source-search-engine>GitHub</a>
</td>
<!-- solr install-->
<td>
<a href=https://github.com/apache/lucene-solr>GitHub</a>
</td>
</tr>
<tr valign=top>
<td><b>GitHub Stars</b></td>
<!-- gb install -->
<td>
<a href=https://github.com/gigablast/open-source-search-engine>326</a> (as of 8/2/2014)
</td>
<!-- solr install-->
<td>
<a href=https://github.com/apache/lucene-solr>767</a> (as of 8/2/2014)
</td>
</tr>
<tr valign=top>
<td><b>Source Installation</b></td>
<!-- gb install -->
<td>
<font color=green><b>
Just a <a href=/admin.html#src>few simple steps</a>
</b></font>
</td>
<!-- solr install-->
<td>
<a href=http://lucene.apache.org/solr/downloads.html>Source download instructions</a>
</td>
<!-- elastic search install-->
<!--
<td>
<ul>
<li>wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.zip
<li>unzip elasticsearch-0.90.3.zip
<li>cd elasticsearch-0.90.3
<li>cd bin
<li>./elasticsearch -f
<li>curl -X GET http://localhost:9200/
</ul>
</td>
-->
</tr>
<tr>
<td>
<b>Complete Web GUI</b>
</td>
<!--gigablast-->
<td>
<font color=green><b>
Yes.
</b></font>
</td>
<!--solr-->
<td>
???
</td>
</tr>
<tr>
<td>
<b>Operating Layout</b>
</td>
<!--gigablast-->
<td>
<font color=green><b>
A single binary containing the web server, database, admin tools, spider logic, etc.
</b></font>
</td>
<!--solr-->
<td>
Many different packages stitched together: Apache, MySQL, Lucene, Tika, ZooKeeper, Solr, Nutch, ...
</td>
</tr>
<tr>
<td>
<b>Indexing a Single File Containing Multiple Documents via cmdline</b>
</td>
<!--gigablast-->
<td>
<font color=green><b>
Use curl with the arguments (including <i>delim</i>) listed <a href=/admin/api#/admin/inject>here</a>.
</b></font>
<br>
</td>
<!--solr-->
<td>
Unsupported.
</td>
</tr>
<tr>
<td>
<b>Indexing an Individual File via cmdline</b>
</td>
<!--gigablast-->
<td>
Use curl to post the content of the file with the arguments listed
<a href=/admin/api#/admin/inject>here</a>.
</td>
<!--solr-->
<td>
You can index individual local files like this:
<b>curl "http://127.0.0.1:8080/solr/update" --data-binary @myfile.html -H 'Content-type: text/html'</b>
but it does not seem to work unless your HTML meets stringent requirements.
</td>
</tr>
<tr>
<td>
<b>Indexing an Individual URL via cmdline</b>
</td>
<!--gigablast-->
<td>
Use curl to inject the URL with the arguments listed
<a href=/admin/api#/admin/inject>here</a>.
</td>
<!--solr-->
<td>
???
</td>
</tr>
<tr>
<td>
<b>Indexing a File of URLs via cmdline</b>
</td>
<!--gigablast-->
<td>
Use one curl command for each URL, using the interface described
<a href=/admin/api#/admin/inject>here</a>.
</td>
<!--solr-->
<td>
???
</td>
</tr>
<tr>
<td>
<b>Deleting Documents via cmdline</b>
</td>
<!--gigablast-->
<td>
Use a curl command to delete a URL, using the interface described
<a href=/admin/api#/admin/inject>here</a>.
</td>
<!--solr-->
<td>
You can delete individual documents by specifying queries that match just those documents:
<b>java -Dcommit....</b>
</td>
</tr>
<tr>
<td><b>Getting Results via cmdline</b></td>
<td>
Use a curl command to run a search, using the interface described
<a href=/admin/api#/search>here</a>.
</td>
<td>
???
</td>
</tr>
<tr>
<td><b>Facets</b></td>
<td>
Yes, with basic support. See the gbfacet operators in the <a href=/help.html>help file</a>.
</td>
<td>
Yes.
</td>
</tr>
<tr>
<td><b>Search Result Limitations Based on Facet Value Counts</b></td>
<td>
Coming soon.
</td>
<td>
<font color=green><b>
Yes.
</b></font>
</td>
</tr>
<tr>
<td><b>Numeric Fields</b></td>
<td>
You can sort (forward or reverse) by numeric fields and constrain results by them.
</td>
<td>
You can sort (forward or reverse) by numeric fields and constrain results by them.
</td>
</tr>
<tr>
<td><b>Boolean Search</b></td>
<td>
Fully nested boolean search with AND, OR, and NOT.
</td>
<td>
Fully nested boolean search with AND, OR, and NOT.
</td>
</tr>
<!-- title: inurl: -->
<tr>
<td><b>Searchable Fields</b></td>
<td>
Yes. Any meta tag, or any field when indexing JSON or XML.
</td>
<td>
???
</td>
</tr>
<!-- CTS -->
<tr>
<td><b>Site Restricted Searches</b></td>
<td>
Yes, using the site: query operator. Or use &amp;sites=... to constrain your search to up to 500 sites.
</td>
<td>
???
</td>
</tr>
<tr>
<td><b>Spell Checker</b></td>
<td>
Yes, but currently disabled until it is improved.
</td>
<td>
<font color=green><b>
Yes.
</b></font>
</td>
</tr>
<tr>
<td><b>Language Identification</b></td>
<td>
<font color=green><b>
Yes, at the per-word level for searching purposes.
</b></font>
</td>
<td>
Yes, but not at the per-word level for searching purposes.
</td>
</tr>
<tr>
<td><b>Index Multiple Languages</b></td>
<td>
Yes. Can expand words in many languages to all their different forms. More forms are coming soon.
</td>
<td>
Yes, but stemming/expansion may be limited.
</td>
</tr>
<tr>
<td><b>Show Images in Search Results</b></td>
<td>
<font color=green><b>
Yes.
</b></font>
</td>
<td>
No.
</td>
</tr>
<!-- gigabits -->
<tr>
<td><b>Related Concepts</b></td>
<td>
<font color=green><b>
Yes. Called <i>Gigabits</i>.
</b></font>
</td>
<td>
No.
</td>
</tr>
<tr>
<td><b>Query Expansion (Synonyms)</b></td>
<td>
Yes. Also reads a mysynonyms.txt file so you can add your own expansion terms.
</td>
<td>
???
</td>
</tr>
<tr>
<td><b>Cached Pages</b></td>
<td>
Yes.
</td>
<td>
???
</td>
</tr>
<tr>
<td><b>RESTful/XML/JSON APIs</b></td>
<td>
Yes, XML. JSON coming soon.
</td>
<td>
???
</td>
</tr>
<tr>
<td><b>Schemas</b></td>
<td>
<font color=green><b>
You do not need to define schemas to begin indexing files and URLs.
</b></font>
</td>
<td>
You have to define annoying schemas.
</td>
</tr>
<tr>
<td><b>Spidering</b></td>
<td>
<font color=green><b>
Gigablast has a complete distributed web spider with powerful controls.
</b></font>
</td>
<td>
Solr has no spider. You can try to integrate Nutch.
</td>
</tr>
<tr>
<td><b>Document Filters</b></td>
<td>
antiword (for Microsoft Word)<br>
pdftohtml (for PDF)<br>
xlstohtml (for Excel)<br>
ppthtml (for PowerPoint)<br>
pstotext (for PostScript)
</td>
<td>
Uses Apache Tika for several formats.
</td>
</tr>
<tr>
<td><b>Scalability</b></td>
<td>
<font color=green><b>
Highly scalable. Has scaled to over
12 billion pages while serving millions
of queries per day. You can add new servers to the
hosts.conf file and click <i>rebalance shards</i> to
rebalance the data.
</b></font>
</td>
<td>
Has not scaled nearly as far, to our knowledge. Not originally built for more than one server.
</td>
</tr>
<tr>
<td><b>Cluster Administration</b></td>
<td>
<font color=green><b>
Built into the web GUI.
</b></font>
</td>
<td>
Requires a separate ZooKeeper package installation.
</td>
</tr>
<tr>
<td><b>Performance</b></td>
<td>
<font color=green><b>
High performance. Written in C/C++.
</b></font>
</td>
<td>
Slower. Written in Java. Has garbage collection, etc.
</td>
</tr>
<!--
<tr>
<td><b>Configuration Files and Descriptions</b></td>
<td>
</td>
<td>
</td>
</tr>

<tr>
<td><b>Duplicate Content</b></td>
<td>
</td>
<td>
</td>
</tr>

<tr>
<td><b>Duplicate Sections</b></td>
<td>
Can remove duplicate content at spider time
or query time.
</td>
<td>
</td>
</tr>

<tr>
<td><b>Section Classification</b></td>
<td>
</td>
<td>
</td>
</tr>
-->
<!--
<tr>
<td><b>Phrases</b></td>
<td>
</td>
<td>
</td>
</tr>

<tr>
<td><b>Query Weighting</b></td>
<td>
</td>
<td>
</td>
</tr>
-->
<!--
<tr>
<td><b>Index Layout</b></td>
<td>
</td>
<td>
</td>
</tr>
-->
<tr>
<td><b>Ranking Algorithm</b></td>
<td>
<font color=green><b>
Custom algorithm based on query term proximity. Superior to TF/IDF or cosine methods.
</b></font>
</td>
<td>
Old-school TF/IDF based on simple statistics.
</td>
</tr>
<tr>
<td><b>Scoring Explanations</b></td>
<td>
Complete scoring information provided.
</td>
<td>
Complete scoring information provided.
</td>
</tr>
<tr>
<td><b>Inlink Text</b></td>
<td>
<font color=green><b>
Indexes incoming link text and compensates for link spam.
</b></font>
</td>
<td>
None. Not geared for web search.
</td>
</tr>
<tr>
<td><b>Page Rank</b></td>
<td>
<font color=green><b>
Uses <i>Site Rank</i>, based on the number of incoming links to a site
from other sites. Detects link spam and compensates accordingly.
</b></font>
</td>
<td>
None. Not geared for web search.
</td>
</tr>
<tr>
<td><b>On-Page Spam</b></td>
<td>
<font color=green><b>
Demotes terms deemed spammy on a page.
</b></font>
</td>
<td>
None.
</td>
</tr>
<tr>
<td><b>Reliability</b></td>
<td>
Pretty good.
</td>
<td>
Pretty good.
</td>
</tr>
<!--
<tr>
<td><b>Administration</b></td>
<td>
Simple web-based GUI and API.
</td>
<td>
</td>
</tr>
-->
<!--
<tr>
<td><b>File Descriptions</b></td>
<td>
</td>
<td>
</td>
</tr>
-->
<tr>
<td><b>Developer Documentation</b></td>
<td>
Yes. <a href=/developer.html>Here</a>.
</td>
<td>
Yes. Lots of documentation.
</td>
</tr>
<tr>
<td><b>Graphing</b></td>
<td>
Graphs performance of various subroutines and query times.
</td>
<td>
Unknown.
</td>
</tr>
<tr>
<td><b>Monitoring</b></td>
<td>
<font color=green><b>
Monitors drive temperature, disk space, query latency, and shard uptime. Sends email alerts.
</b></font>
</td>
<td>
None known.
</td>
</tr>
<tr>
<td><b>Geospatial</b></td>
<td>
Supported through the numeric gbminint: and gbmaxint: query operators on lat/lon fields.
See the <a href=/help.html>help file</a> for examples using these operators.
</td>
<td>
Yes.
</td>
</tr>
<tr>
<td><b>Dynamic Summaries</b></td>
<td>
Yes. Contain query terms.
</td>
<td>
Yes. Contain query terms.
</td>
</tr>
<tr>
<td><b>Site Clustering</b></td>
<td>
Yes.
</td>
<td>
???
</td>
</tr>
<tr>
<td><b>More Like This</b></td>
<td>
Coming soon.
</td>
<td>
<font color=green><b>
Yes.
</b></font>
</td>
</tr>
<tr>
<td><b>Sort by Date</b></td>
<td>
<i>gbsortbyint:gbspiderdate</i><br>
<i>gbsortbyint:gbindexdate</i><br>
<i>gbrevsortbyint:gbspiderdate</i><br>
<i>gbrevsortbyint:gbindexdate</i><br>
See the <a href=/help.html>help file</a> for examples using these operators.
</td>
<td>
???
</td>
</tr>
<tr>
<td><b>Query Completion</b></td>
<td>
Coming soon.
</td>
<td>
<font color=green><b>
Available with an additional module.
</b></font>
</td>
</tr>
<tr>
<td><b>Document Collections</b></td>
<td>
<font color=green><b>
Supports tens of thousands of separate collections,
and federated search across them.
</b></font>
</td>
<td>
???
</td>
</tr>
</table>
<br><br><br>