open-source-search-engine/html/compare.html
2014-06-21 07:27:45 -07:00


<html>
<title>LIVE Interactive Comparison of Gigablast vs SOLR Open Source Search Engine</title>
<h2>Comparing Gigablast to SOLR</h2>
<table cellspacing=10 border=1>
<tr>
<td style=max-width:100px;min-width:10%;></td>
<td style=min-width:30%><b><a href=http://www.gigablast.com/>Gigablast</a></b></td>
<td style=min-width:30%><b><a href=http://lucene.apache.org/solr/>SOLR</a></b></td>
<!--
<td><b><a href=http://www.elasticsearch.org/>ElasticSearch</a></b></td>-->
</tr>
<tr valign=top>
<td><b>Package Installation</b></td>
<!-- gb install -->
<td>
<a href=/admin.html#quickstart>Download packages for Ubuntu or RedHat</a>
</td>
<!-- solr install-->
<td>
<a href=http://wiki.apache.org/solr/SolrInstall>Instructions</a>
</td>
<!-- elastic search install-->
<!--
<td>
<ul>
<li>wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.zip
<li>unzip elasticsearch-0.90.3.zip
<li>cd elasticsearch-0.90.3
<li>cd bin
<li>./elasticsearch -f
<li>curl -X GET http://localhost:9200/
</ul>
</td>
-->
</tr>
<tr valign=top>
<td><b>Source Installation</b></td>
<!-- gb install -->
<td>
Just a <a href=/admin.html#src>few simple steps</a>
</td>
<!-- solr install-->
<td>
<a href=http://lucene.apache.org/solr/downloads.html>Source download</a>
</td>
<!-- elastic search install-->
<!--
<td>
<ul>
<li>wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-0.90.3.zip
<li>unzip elasticsearch-0.90.3.zip
<li>cd elasticsearch-0.90.3
<li>cd bin
<li>./elasticsearch -f
<li>curl -X GET http://localhost:9200/
</ul>
</td>
-->
</tr>
<tr>
<td>
<b>Complete Web GUI</b>
</td>
<!--gigablast-->
<td>
<font color=green><b>
Yes.
</b></font>
</td>
<!--solr-->
<td>
</td>
</tr>
<tr>
<td>
<b>Indexing a Single File Containing Multiple Documents via cmdline</b>
</td>
<!--gigablast-->
<td>
Use curl with the args listed <a href=/api.html#/admin/inject>here</a>.
<br>
</td>
<!--solr-->
<td>
unsupported
</td>
</tr>
<tr>
<td>
<b>Indexing an Individual File via cmdline</b>
</td>
<!--gigablast-->
<td>
Use curl to post the content of the file with args listed
<a href=/api.html#/admin/inject>here</a>
</td>
<!--solr-->
<td>
You can index an individual local file like this:
<b>curl "http://127.0.0.1:8080/solr/update" --data-binary @myfile.html -H 'Content-type: text/html'</b>
but this does not seem to work unless your HTML meets stringent requirements.
</td>
</tr>
<tr>
<td>
<b>Indexing an Individual URL via cmdline</b>
</td>
<!--gigablast-->
<td>
Use curl to inject the url with args listed
<a href=/api.html#/admin/inject>here</a>
</td>
<!--solr-->
<td>
</td>
</tr>
<tr>
<td>
<b>Indexing a File of URLs via cmdline</b>
</td>
<!--gigablast-->
<td>
Use one curl command for each url, using the interface described
<a href=/api.html#/admin/inject>here</a>.
</td>
<!--solr-->
<td>
</td>
</tr>
<tr>
<td>
<b>Deleting Documents via cmdline</b>
</td>
<!--gigablast-->
<td>
Use a curl command to delete a url, using the interface described
<a href=/api.html#/admin/inject>here</a>.
</td>
<!--solr-->
<td>
You can delete individual documents by specifying queries that match just those documents:
<b>java -Dcommit....</b>
</td>
</tr>
<tr>
<td><b>Getting Results via cmdline</b></td>
<td>
Use a curl command to do a search, using the interface described
<a href=/api.html#/search>here</a>.
</td>
<td>
</td>
</tr>
<tr>
<td><b>Faceted Search</b></td>
<td>
Coming soon.
</td>
<td>
Yes.
</td>
</tr>
<tr>
<td><b>Numeric Fields</b></td>
<td>
You can forward/reverse sort by and constrain by numeric fields.
</td>
<td>
You can forward/reverse sort by and constrain by numeric fields.
</td>
</tr>
<tr>
<td><b>Boolean Search</b></td>
<td>
Fully nested boolean search with AND OR NOT.
</td>
<td>
Fully nested boolean search with AND OR NOT.
</td>
</tr>
<!-- title: inurl: -->
<tr>
<td><b>Searchable Fields</b></td>
<td>
Yes. Any meta tag, or any field when indexing JSON or XML.
</td>
<td>
</td>
</tr>
<!-- CTS -->
<tr>
<td><b>Site Restricted Searches</b></td>
<td>
Yes. Using the site: query operator.
</td>
<td>
</td>
</tr>
<tr>
<td><b>Spell Checker</b></td>
<td>
Yes. But currently disabled until improved.
</td>
<td>
Yes.
</td>
</tr>
<tr>
<td><b>Language Identification</b></td>
<td>
Yes.
</td>
<td>
Yes.
</td>
</tr>
<!-- gigabits -->
<tr>
<td><b>Related Concepts</b></td>
<td>
<font color=green><b>
Yes. Called <i>Gigabits</i>.
</b></font>
</td>
<td>
No.
</td>
</tr>
<tr>
<td><b>Query Expansion (Synonyms)</b></td>
<td>
Yes. Add your own expansion terms in the mysynonyms.txt file.
</td>
<td>
</td>
</tr>
<tr>
<td><b>Cached Pages</b></td>
<td>
Yes.
</td>
<td>
???
</td>
</tr>
<tr>
<td><b>RESTful/XML/JSON APIs</b></td>
<td>
Yes, JSON and XML.
</td>
<td>
</td>
</tr>
<tr>
<td><b>Schemas</b></td>
<td>
<font color=green><b>
You do not need to define schemas to begin indexing files and urls.
</b></font>
</td>
<td>
You have to define annoying schemas.
</td>
</tr>
<tr>
<td><b>Spidering</b></td>
<td>
<font color=green><b>
Gigablast has a complete web spider.
</b></font>
</td>
<td>
SOLR has no spider.
</td>
</tr>
<tr>
<td><b>Document Filters</b></td>
<td>
antiword (for Microsoft Word)<br>
pdftohtml (for PDF)<br>
xlstohtml (for Excel)<br>
ppthtml (for PowerPoint)<br>
pstotext (for PostScript)
</td>
<td>
Uses Apache Tika for several formats.
</td>
</tr>
<tr>
<td><b>Scalability</b></td>
<td>
<font color=green><b>
Highly scalable. Has scaled to over
12 billion pages while serving millions
of queries per day.
</b></font>
</td>
<td>
To our knowledge, has not scaled nearly as high.
</td>
</tr>
<tr>
<td><b>Performance</b></td>
<td>
<font color=green><b>
High performance. Written in C/C++.
</b></font>
</td>
<td>
Slower. Written in Java. Has garbage collection, etc.
</td>
</tr>
<!--
<tr>
<td><b>Configuration Files and Descriptions</b></td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td><b>Duplicate Content</b></td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td><b>Duplicate Sections</b></td>
<td>
Can remove duplicate content at spider time
or query time.
</td>
<td>
</td>
</tr>
<tr>
<td><b>Section Classification</b></td>
<td>
</td>
<td>
</td>
</tr>
-->
<!--
<tr>
<td><b>Phrases</b></td>
<td>
</td>
<td>
</td>
</tr>
<tr>
<td><b>Query Weighting</b></td>
<td>
</td>
<td>
</td>
</tr>
-->
<!--
<tr>
<td><b>Index Layout</b></td>
<td>
</td>
<td>
</td>
</tr>
-->
<tr>
<td><b>Ranking Algorithm</b></td>
<td>
<font color=green><b>
Custom query term proximity based algorithm. Superior to TF/IDF or Cosine methods.
</b></font>
</td>
<td>
Old school TF/IDF based on simple statistics.
</td>
</tr>
<tr>
<td><b>Scoring Explanations</b></td>
<td>
Complete scoring information provided.
</td>
<td>
Complete scoring information provided.
</td>
</tr>
<tr>
<td><b>Inlink Text</b></td>
<td>
<font color=green><b>
Indexes incoming link text and compensates for link spam.
</b></font>
</td>
<td>
None. Not geared for web search.
</td>
</tr>
<tr>
<td><b>Page Rank</b></td>
<td>
<font color=green><b>
Uses <i>Site Rank</i> based on number of incoming links to a site
from other sites. Detects link spam and compensates accordingly.
</b></font>
</td>
<td>
None. Not geared for web search.
</td>
</tr>
<tr>
<td><b>On-Page Spam</b></td>
<td>
Demotes terms deemed spammy on a page.
</td>
<td>
None.
</td>
</tr>
<tr>
<td><b>Reliability</b></td>
<td>
Pretty good.
</td>
<td>
Pretty good.
</td>
</tr>
<!--
<tr>
<td><b>Administration</b></td>
<td>
Simple web-based GUI and API.
</td>
<td>
</td>
</tr>
-->
<!--
<tr>
<td><b>File Descriptions</b></td>
<td>
</td>
<td>
</td>
</tr>
-->
<tr>
<td><b>Developer Documentation</b></td>
<td>
Yes. <a href=/developer.html>Here</a>.
</td>
<td>
Yes. Lots of documentation.
</td>
</tr>
<tr>
<td><b>Graphing</b></td>
<td>
Graphs performance of various subroutines and query times.
</td>
<td>
Unknown.
</td>
</tr>
<tr>
<td><b>Monitoring</b></td>
<td>
<font color=green><b>
Monitors drive temperature, disk space, query latency and shard uptime. Sends email alerts.
</b></font>
</td>
<td>
None known.
</td>
</tr>
<tr>
<td><b>Geospatial</b></td>
<td>
Supported through the numeric gbminint: and gbmaxint: query operators on lat/lon fields.
</td>
<td>
Yes.
</td>
</tr>
<tr>
<td><b>Dynamic Summaries</b></td>
<td>
Yes. Contain query terms.
</td>
<td>
Yes. Contain query terms.
</td>
</tr>
<tr>
<td><b>Site Clustering</b></td>
<td>
Yes.
</td>
<td>
???
</td>
</tr>
<tr>
<td><b>More Like This</b></td>
<td>
Coming soon.
</td>
<td>
Yes.
</td>
</tr>
<tr>
<td><b>Sort by Date</b></td>
<td>
<i>gbsortbyint:gbspiderdate</i><br>
<i>gbsortbyint:gbindexdate</i><br>
<i>gbrevsortbyint:gbspiderdate</i><br>
<i>gbrevsortbyint:gbindexdate</i>
</td>
<td>
</td>
</tr>
<tr>
<td><b>Query Completion</b></td>
<td>
Coming soon.
</td>
<td>
Available with additional module.
</td>
</tr>
<tr>
<td><b>Document Collections</b></td>
<td>
<font color=green><b>
Supports tens of thousands of separate collections,
and federated search across them.
</b></font>
</td>
<td>
</td>
</tr>
</table>
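<p>
Several Gigablast rows above describe query operators (<i>site:</i>, <i>gbsortbyint:</i>, <i>gbrevsortbyint:</i>) and a <i>/search</i> interface used from the command line. The sketch below shows one way such a request URL might be assembled in Python before handing it to curl. The host and port, the <i>build_search_url</i> helper, and the <i>q</i> and <i>format</i> parameter names are assumptions for illustration and are not confirmed by this page; check <a href=/api.html#/search>the search API description</a> for the real argument names.
</p>

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Assumption: a local Gigablast instance; host and port are placeholders.
BASE_URL = "http://localhost:8000"


def build_search_url(terms, site=None, sort_field=None, reverse=False, fmt="json"):
    """Compose a /search URL from the query operators named in the table.

    The "q" and "format" parameter names are assumptions, not taken
    from this page.
    """
    parts = [terms]
    if site:
        # Site Restricted Searches row: the site: operator.
        parts.append("site:" + site)
    if sort_field:
        # Sort by Date row: gbsortbyint: or gbrevsortbyint: plus a field name.
        operator = "gbrevsortbyint" if reverse else "gbsortbyint"
        parts.append(operator + ":" + sort_field)
    query = " ".join(parts)
    # urlencode escapes spaces and colons so the URL is safe to pass to curl.
    return BASE_URL + "/search?" + urlencode({"q": query, "format": fmt})


if __name__ == "__main__":
    url = build_search_url("search engine", site="example.com",
                           sort_field="gbspiderdate", reverse=True)
    print(url)
    # Decode the query string again to show the operators that were sent.
    print(parse_qs(urlsplit(url).query)["q"][0])
```

<p>
The resulting URL can be passed to curl in quotes. Keeping the operators in one helper makes it easy to combine a site restriction with a date sort without escaping the query by hand.
</p>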