Commit Graph

  • b92853ae50 update built-in gb cmd line tests for ssd performance. Matt 2015-11-30 18:47:44 -0700
  • de777e8d28 fix truncation of search results some more hopefully Matt Wells 2015-11-30 16:33:03 -0800
  • c779bdb70d Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-11-30 13:27:38 -0800
  • fc4731b11c fix a couple of cores happening on crawlbot. fix bug of a urls.csv or other streaming download being truncated because gb thinks a shard is down. even if it is down, wait for it to come back up. Matt Wells 2015-11-30 13:26:43 -0800
  • 6b696c49db Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing Zak Betz 2015-11-29 21:36:44 -0700
  • fc3ba95226 Fix host selection for downloading when nospider directives are present. It was always choosing the first host sequentially with spidering enabled. Now it looks at the other hosts in the shard selected. Zak Betz 2015-11-29 21:36:19 -0700
  • 50d539ab85 make gb easier to compile by removing a dynamically sized array on the stack. Matt 2015-11-28 23:40:20 -0700
  • ada7bb8eb9 Merge branch 'diffbot-testing' into testing Matt 2015-11-24 10:03:05 -0700
  • 6b2b0c7518 Merge branch 'diffbot' into diffbot-testing Matt 2015-11-24 10:02:49 -0700
  • ec5c38bab5 fix urgent merge mode bug some more? limit spiders to 5 per custom crawl coll per shard. Matt Wells 2015-11-24 08:51:18 -0800
  • add6f84b79 Merge branch 'diffbot-testing' into testing Matt Wells 2015-11-21 10:44:14 -0800
  • 398225dde1 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-11-21 10:44:05 -0800
  • d55932d0b6 fix spider proxy table bug that seemed to be the reason for the table getting so full. but in case it does get full again added a call to hashtablex::empty() so we don't freeze up any more. Matt Wells 2015-11-21 10:43:23 -0800
  • b3729ed214 tune spider proxy table flushing logic a bit Matt Wells 2015-11-21 10:29:02 -0800
  • 425fc699f8 Merge branch 'diffbot-testing' into testing Matt Wells 2015-11-21 10:21:04 -0800
  • 0964fb9715 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-11-21 10:20:49 -0800
  • 3c766451d1 try to fix the proxy load balancing table logic some more. seems to not cleanup after itself very well. Matt Wells 2015-11-21 10:20:20 -0800
  • dbe101d759 Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing Zak Betz 2015-11-20 10:42:53 -0700
  • a46c5b8f86 Fix anomalous link text detector to take into consideration the total number of inlinkers instead of just counting the matched link texts. Zak Betz 2015-11-20 10:42:46 -0700
  • aeaca04df3 fix bug of losing the line waiter header in linkdb.cpp for incoming msg25 requests. start to show more info in sockets table by parsing the request. Matt 2015-11-19 19:40:30 -0700
  • bdceec1796 Merge branch 'master' into testing Matt 2015-11-19 16:24:45 -0700
  • 7bc27a521e fix compiler error on 32bit arches Matt 2015-11-19 16:24:29 -0700
  • 6e7e267cfb Merge branch 'master' into testing Matt 2015-11-19 16:14:24 -0700
  • cd875f4ab9 fix empty url condition in add url. Matt 2015-11-19 16:14:12 -0700
  • b4ef9ca29f Merge branch 'diffbot-testing' into testing Matt 2015-11-19 16:11:54 -0700
  • eb57f0a8c3 Merge branch 'master' into testing Matt 2015-11-19 16:11:38 -0700
  • 87af33db66 Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing Zak Betz 2015-11-19 12:25:33 -0700
  • 0bc50deb42 Filter link text anomalies at query time. If a search result only has a few matches for a term in link text, then don't return it in search results for that query. Zak Betz 2015-11-19 12:25:25 -0700
  • 7783c4655e Merge pull request #66 from alc-privacore/master Gigablast 2015-11-18 23:39:49 -0700
  • 68f41bd22a debug why we don't dump core sometimes. Matt 2015-11-18 16:11:27 -0700
  • e0f4ba65c1 remove fixme log comment Matt 2015-11-18 08:11:45 -0700
  • feff30b6dc Merge branch 'diffbot' into testing Matt 2015-11-17 11:04:56 -0700
  • b8d57dcd3a fix bug of dumping too many files to disk and not being able to merge, and corrupting RdbBase::m_files[] array and associated arrays. Matt Wells 2015-11-17 09:52:41 -0800
  • 690b4c5069 fix core from bogus url some more. Matt 2015-11-16 12:51:18 -0700
  • 1a3c69af6b fix core dump from empty url Matt 2015-11-16 12:08:16 -0700
  • 296651d416 fix getLeastLoadedInShard() to only return the appropriate nospider/noquery hosts when using nospider/noquery in hosts.conf. Matt 2015-11-16 09:53:40 -0700
  • 1b60cbd46e fix core in Url.cpp Matt 2015-11-16 09:29:08 -0700
  • 6e12f96aea Merge branch 'testing' Matt 2015-11-14 10:57:27 -0700
  • 9ff387a898 More fixes to prevent spider traffic from hitting hosts with nospider directive. Bug fix for msg20 lookups always being directed away from noquery hosts. Zak Betz 2015-11-13 15:03:02 -0700
  • 8b84297392 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-11-11 08:27:25 -0800
  • 6cf6abf3d9 fix spider proxy clean up algo a little so it won't freeze up Matt Wells 2015-11-11 08:27:09 -0800
  • 5f1695fab8 fix url.cpp Matt 2015-11-10 00:29:42 -0700
  • 5061e5d7b5 normalize utf8 url paths into url encoded sequences. Matt 2015-11-09 13:54:32 -0700
  • 80991c943f complete merge of ia code into testing. make indexing warcs/arcs a switch in spider controls. Matt 2015-11-09 12:46:06 -0700
  • fe448173d5 Merge branch 'ia' into testing Matt 2015-11-09 11:14:00 -0700
  • 37cc4f2ba8 Merge branch 'diffbot-testing' into testing Matt 2015-11-09 11:13:42 -0700
  • 1351d9f994 Code cleanup. Zak Betz 2015-11-09 09:01:20 -0700
  • dbe93c2ccf fix bug of not always dumping core? Matt 2015-11-08 08:54:46 -0700
  • 3db9ae5d4d rebuild fix Matt Wells 2015-11-07 13:14:38 -0800
  • 93ec3138c5 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-11-06 13:31:02 -0800
  • 44e3b0ca19 try to fix spider proxy load table pruning bug. Matt Wells 2015-11-06 13:30:42 -0800
  • c1bbd0207d Don't bias tagdb lookups to a single host, use the host with the lowest number of outstanding requests. The original reasoning was that one host would handle all lookups for a site and that lookup would remain in cache. Given that there are mega hubs like youtube and facebook there should be as many hosts as possible handling requests for these sites and the tagdb entries should stay in cache in all of the hosts that have the key. Zak Betz 2015-11-04 15:37:49 -0700
  • afbedba858 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-11-03 18:49:49 -0800
  • c8c29db56b fix core Matt 2015-11-03 18:49:42 -0800
  • baa817b51d Fix load balance of msg22s to use the udp slots in pinginfo. Fix sigchild interrupting popen, when pdftohtml segfaults popen was hanging forever. Fix another bug when content length in http header was one off. Zak Betz 2015-11-03 11:51:19 -0700
  • 7608c5c29c default I/O error detection to enabled so we see hosts with I/O errors in the hosts table. Matt 2015-11-03 11:33:16 -0700
  • 95d70b110e fix bug in rebuild pipeline. need to merge the files lest we max the # of files out. Matt 2015-11-03 11:12:39 -0700
  • 6526accb50 Fix coredump when using add URL Ai Lin Chia 2015-11-02 17:23:01 +0100
  • 08b6fa67d7 improve spider performance when we have lots of collections. fix core from corrupt titledb rec of some sort. automatically turn off profiler when you get data back for simplicity. Matt Wells 2015-11-01 20:23:18 -0800
  • ff6caf79a2 Increase time to mark item as stale in warc injector. Zak Betz 2015-11-01 19:45:29 -0700
  • cc305eb73a fix so we can generate posdb map for headless data files. Matt 2015-11-01 14:56:39 -0800
  • 23d376f6c7 fix core from a bad title rec fetch Matt 2015-10-29 19:43:02 -0700
  • aeca57e9f4 Pass in the buffer size of an injection request so that if the content length header field is bigger than the actual buffer we won't index random memory. Fixes bug with truncated warc captures. Zak Betz 2015-10-28 00:38:08 -0600
  • 555844ce1f Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into warc-stream Zak Betz 2015-10-26 22:17:03 -0600
  • f7bb617b85 Fixes for bad content lengths when injecting warcs. Zak Betz 2015-10-26 22:15:03 -0600
  • 18e0b9ea9c Fix warc injection so that pdfs, xls, ps, docs work. Crank up max warc rec size to 5mb because pdfs are rarely < 1MB. Zak Betz 2015-10-25 23:09:43 -0600
  • 66145e4396 fix core when exiting while merging Matt Wells 2015-10-24 12:50:57 -0700
  • 776b94396e a new ban msg for http status 503 Matt Wells 2015-10-22 13:23:02 -0700
  • 488db03f60 do not send summary requests to non queryable hosts Matt 2015-10-22 11:46:13 -0600
  • 998c25e29b spider proxy fixes for negative ports Matt Wells 2015-10-21 15:32:58 -0700
  • b2af4a00ae remove old code preventing proxies from being passed to diffbot Matt Wells 2015-10-21 14:46:38 -0700
  • 5f965c2c9a reset proxy table every hour Matt Wells 2015-10-21 13:30:48 -0700
  • 5241f2e1c7 Fix double call of gotSummary when computing facets in msg40. Fixes missing results on page > 1 when searching for facets. Zak Betz 2015-10-20 17:21:37 -0600
  • 089b36e050 Injector fixes. Zak Betz 2015-10-20 17:01:05 -0600
  • 51d68c4b3d pass proxy info back to diffbot Matt 2015-10-20 15:53:16 -0600
  • 2d8c84b29c fix bug of not shutting down right away Matt Wells 2015-10-20 13:26:24 -0700
  • b0b716010e turn off proxyauth stuff for now Matt Wells 2015-10-20 13:06:59 -0700
  • 771f4d7799 Merge branch 'diffbot-testing' into diffbot Matt Wells 2015-10-20 11:48:44 -0700
  • 928511f036 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-10-20 09:45:15 -0700
  • b9451a1f6f fix token expired bug Matt Wells 2015-10-20 09:44:50 -0700
  • 925fea29f4 Bug fix for search with facets with s=N | N > 0. Make warc injector more resilient to advancedsearch.php failure. Zak Betz 2015-10-19 18:28:15 -0600
  • e6d2cb5962 backwards compatible fix Matt Wells 2015-10-19 16:12:41 -0600
  • 3afd768a32 make rel no follow a separate switch, but still just use the robots.txt switch for diffbot crawls. Matt Wells 2015-10-19 15:34:57 -0600
  • 2df573acd8 enable diffbot proxyauth stuff for http urls only Matt 2015-10-19 10:18:20 -0600
  • 667e65ce01 Progress bar for warc injector. Zak Betz 2015-10-19 10:08:04 -0600
  • 5d07e24c01 use rel no follow switch support. Matt 2015-10-19 10:05:46 -0600
  • ea139a65e6 Warc stream busy loop fixes. Load balance msg22 to the one with the least outstanding requests. Zak Betz 2015-10-15 22:30:07 -0600
  • 75b72cc233 fix add url seg fault Matt 2015-10-14 13:57:47 -0600
  • e57e3481b4 fix innerloop strangeness when counting keys in buckets Matt 2015-10-14 13:52:42 -0600
  • 3e19d43aa5 fix core Matt 2015-10-14 12:03:12 -0600
  • a4901431be a couple little fixes to pass smokes Matt 2015-10-14 11:53:05 -0600
  • c37ab2697e Merge branch 'ia' into testing Matt 2015-10-12 10:40:16 -0600
  • 08877b6334 Merge branch 'diffbot-testing' into testing Matt 2015-10-12 10:39:35 -0600
  • 6a40315237 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into warc-stream Zak Betz 2015-10-12 00:32:52 -0600
  • ac25435b54 Warc pipe fixes. Fix arcs not processing https. Fix nulls being left in warc read buffer causing second pass to fail. Zak Betz 2015-10-12 00:30:28 -0600
  • d045138c22 Merge branch 'ia' into ia-zak Matt 2015-10-10 14:15:46 -0600
  • 4d7d5b12a2 Merge branch 'diffbot-testing' into ia Matt 2015-10-10 14:15:36 -0600
  • fa691bf06c also fix for numbers like for facet termlists Matt 2015-10-10 14:15:09 -0600
  • 43c8949841 Merge branch 'ia' into ia-zak Matt 2015-10-10 14:07:15 -0600
  • 298ae7e7b2 Merge branch 'diffbot-testing' into ia Matt 2015-10-10 14:05:27 -0600