Commit Graph

  • d4e16a4dab pass a crawlbot nightly smoke Matt Wells 2016-03-04 13:14:28 -0800
  • e75d80abbe ignore meta redirect tags in html comment tags. Matt Wells 2016-02-22 12:41:03 -0800
  • 412b04bbd4 fix neverending crawl rounds by only trying each url once per round. updated url filters. Matt Wells 2016-02-22 09:28:46 -0800
  • da9949f462 try to fix a couple more core dumps. Matt Wells 2016-02-19 08:54:48 -0800
  • c7696a69eb fix core from a federated query and null msg20 Matt Wells 2016-02-18 10:53:20 -0800
  • f649944573 if spidered time is in future, consider the spiderreply corrupt and ignore it. if you set back the OS clock then you might end up ignoring some spider replies but hopefully it won't be such a big deal. Matt Wells 2016-02-16 12:25:49 -0800
  • 8748cd06ac Merge pull request #73 from AppChecker/master Gigablast 2016-02-13 23:00:37 -0700
  • f11595efc3 fix core dump from deleting an active/dumping collection Matt Wells 2016-02-12 16:54:03 -0800
  • bf4bdd6bfd Merge branch 'diffbot-testing' into testing Matt Wells 2016-02-10 09:50:53 -0800
  • e68406f073 fix core in posdbtable from docid of 0. no idea why docid was 0, but why core? Matt Wells 2016-02-09 22:43:09 -0800
  • e376b97814 let's generalize it. if a redirect sets cookies then follow it through, don't stop in the middle because we think it is 'simplified'. Matt Wells 2016-02-09 13:47:12 -0800
  • d7a6a0a1ff fix gap.com redirects that require us setting multiple cookies in the spider request. Matt Wells 2016-02-09 13:38:59 -0800
  • 92934bbd5c use http/1.0 since we dont support chunked transfer encoding Matt 2016-02-09 12:04:05 -0700
  • 7ca0b1f738 Merge branch 'diffbot-testing' into testing Matt 2016-02-09 10:39:04 -0700
  • ef97462acc thanks for the bug fix, ivan! Matt 2016-02-09 10:38:46 -0700
  • 1d2dfe1456 bring back max doc len parms. index gbssIsContentTruncated field. fix 30-day wait for >= 3 errors. fix gbss formatting some more. Matt Wells 2016-02-08 14:10:04 -0800
  • 97c2f09225 move gbssFirstIndexed up so the new shard fields dont mess up printing of the human readable dates Matt Wells 2016-02-08 10:26:02 -0800
  • cdb8a5f86a fix core from treating corrupted titlerecs as non-existent. Matt Wells 2016-02-08 09:49:04 -0800
  • c1a72213d7 watch out for negative datasize spider requests in doledb when calling xmldoc::set4 so we don't core any more. Matt Wells 2016-02-08 09:18:33 -0800
  • 3b6704d318 also show what shardnum stores the docid so we can track down titlerec corruption easier. Matt Wells 2016-02-08 08:58:37 -0800
  • d183525db6 treat corrupted titlerecs as not founds so spidering can continue despite it. update gbss result display to have human readable dates and a link to the docid. added gbssSpideredByHostId so we can track down issues faster. Matt Wells 2016-02-08 08:42:04 -0800
  • 9247e15210 make spider use HTTP/1.1 not 1.0 since some sites have been found to return 406 unacceptable periodically because of it. Matt Wells 2016-02-05 10:14:09 -0800
  • 701b14ada7 Fix: possible double free appchecker 2016-02-05 16:11:53 +0300
  • 4eec8eb5b7 fix invalid <base href=/> tag Matt Wells 2016-02-02 12:52:45 -0800
  • ef8dc25b55 fix compiler warning Matt 2016-01-30 14:31:01 -0700
  • 19de1d2eae Merge branch 'diffbot-testing' into testing Matt 2016-01-30 14:27:08 -0700
  • 1a0378b76e do not report edocunchanged for bulk jobs ever. more skipped shards bug fixes. remove some log spam. Matt Wells 2016-01-30 11:14:12 -0800
  • 3e9ee2f6d0 bulk robots hack fix Matt Wells 2016-01-28 09:14:38 -0800
  • 04bdda20cf fix for empty queries saying a shard is down Matt Wells 2016-01-28 08:51:04 -0800
  • 4c7f969988 fix critical spider issue of an IP corking an entire spider priority. also exit faster on 'save & exit' if in evalIpLoop(). Matt Wells 2016-01-23 10:32:14 -0800
  • 7e13b147e4 updated dmoz docs Matt 2016-01-23 08:54:35 -0700
  • fb3b179666 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2016-01-21 09:58:38 -0800
  • a652d60b2b detect more corrupted spider records caused by saving memory to disk after a segv corrupts the memory. Matt Wells 2016-01-21 09:57:43 -0800
  • 4d88ca5d28 improve spiderrequest::isCorrupt() Matt Wells 2016-01-19 22:24:52 -0800
  • 1e248218da added 4 more diffbot errors so hopefully no more 'unknown diffbot error' error codes in crawlbot. Matt 2016-01-11 16:12:33 -0800
  • 032f597a16 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2016-01-11 15:30:53 -0800
  • 33a40480b4 try to fix Unknown Diffbot Error error. Matt 2016-01-11 15:28:30 -0800
  • 422ffae8e3 fix a core from dumping doledb out to stdout Matt Wells 2016-01-11 10:32:47 -0800
  • 4fb900da03 fix strange corruption in doledb core Matt Wells 2016-01-11 09:40:55 -0800
  • 344732ac19 Don't try to match implicit non-required phrases when verifying doc has query terms. Zak Betz 2016-01-08 10:09:34 -0700
  • 008b21ee6b Fix query "the" and "the" not matching all of the terms. Zak Betz 2016-01-07 15:30:45 -0700
  • 6d73b57243 fix core dump from bad langid of 99 Matt Wells 2016-01-05 14:32:02 -0800
  • 03aad9db54 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2016-01-05 13:50:18 -0700
  • 5804e28230 fix bug of losing the hopcount 0 spiderrequest because it gets overridden by a link to itself then it becomes hopcount 1. Matt 2016-01-05 13:49:54 -0700
  • bb316b0e88 respider all diffbot urls even if 404 or 500. that sometimes happens TEMPORARILY to a website. Matt Wells 2016-01-05 12:47:04 -0800
  • 953d636448 added trivial link on cached page to gb root page Matt Wells 2016-01-03 11:27:24 -0800
  • 60e0306d0d fix xmldoc::getRootXmlDoc() to put an http:// or https:// in front of the root url before putting into spiderrequest so it is not perceived as corrupted. Matt Wells 2015-12-29 08:22:14 -0800
  • dd46a82bb7 fix again Matt Wells 2015-12-28 14:50:11 -0800
  • 1091c5c4ea fix core Matt Wells 2015-12-28 14:48:03 -0800
  • 21c337be27 fix the spider status msg fix some more Matt Wells 2015-12-28 12:03:50 -0800
  • d6a9db35d2 instead of showing SP_ROUNDDONE, show SP_MAXROUNDS if necessary so we can pass the crawlbot nightly smoke tests. Matt Wells 2015-12-28 11:50:56 -0800
  • 5d049213b9 try to fix host 31 from coring. possible corrupt spider request. Matt 2015-12-28 12:22:39 -0700
  • 9b423690ee fix compiler error for maxpp Matt Wells 2015-12-28 11:21:36 -0800
  • 17f21b9498 PageResults.cpp: Taking out maxpp requires taking it out of log entry. Just leave it in for now? MikeLx 2015-12-24 00:55:37 -0500
  • 40ff79c8ff Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-12-23 22:18:49 -0700
  • 7be4224817 change try agains recvd to try agains sent in the hosts table. Matt 2015-12-23 22:18:24 -0700
  • da5f0c8d76 Merge branch 'testing' into diffbot-testing Matt Wells 2015-12-23 14:23:05 -0800
  • 9147d6bb02 fix some diffbot crawls. do not spider pages at the hopcount limit when 'only spider urls if new' is enabled. meaning only spider each url once. (unless there is a temporary error) fix malformed url bug some more. added some commented out code for indexing spider replies (gbss docs) for certain fatal/critical errors, in which case they are not being indexed. Matt Wells 2015-12-23 13:49:21 -0800
  • a5e6a12ff8 added support for TLS SNI (Server Name Indication) Matt 2015-12-23 13:30:49 -0700
  • 6f9f177d7c another fix for the parms not getting updated fix. Matt 2015-12-17 14:30:35 -0700
  • 770e94b4cc fix it so we don't call the page handler before all parms have been digested. Matt Wells 2015-12-17 13:11:57 -0800
  • a07c840a6a join with threads when exiting -- to no avail exit status is still foobar. Matt Wells 2015-12-17 10:15:39 -0800
  • c5f21a721f zero out crazy local spider stats. corrupted from saving after memory got corrupted. Matt Wells 2015-12-17 09:43:41 -0800
  • f2be319dcd try to fix exiting w/ pthreads some more (part 2) Matt 2015-12-17 08:38:12 -0700
  • 73e9ed0719 try to fix cores not being dropped. Matt 2015-12-17 06:44:25 -0700
  • ede7a78594 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-12-17 06:08:47 -0700
  • 83ade5c58d fix spider req corruption Matt Wells 2015-12-16 20:33:45 -0800
  • becc244e12 add new link to page crawlbot to see spider attempt gbss docs. Matt Wells 2015-12-15 16:22:53 -0800
  • 9a20a4387f improve spiderreq corruption detection Matt Wells 2015-12-15 14:23:21 -0800
  • 879ef32db0 fix for all urls getting malformed url (EBADURL) while spidering. had to add 'errorcode==' to urlfilters to just redo those pages otherwise having that error, a non-temporary error, would have barred them from being retried in the future. Matt Wells 2015-12-15 10:06:06 -0800
  • 23a20ff639 more fixes for ebadurl bug Matt Wells 2015-12-14 17:49:06 -0800
  • e1cfeb4c82 have diffbot retry non tmp errors to make up for bug of calling valid urls malformed EBADURLs Matt Wells 2015-12-14 17:30:41 -0800
  • 4b9240e42e fix a fix Matt Wells 2015-12-14 17:06:58 -0800
  • 6d30f21ad9 use small dgrams to avoid splitting at the kernel level down to the mtu. increase 0xc1 msg request delay from 3 to 20 secs. need to make it linear order. Matt Wells 2015-12-14 10:47:07 -0800
  • cb0d75b343 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-12-12 19:21:06 -0800
  • c161f17196 fix punycode bug in firsturl of bad punycode. Matt Wells 2015-12-12 19:18:28 -0800
  • dc505aefa8 Merge branch 'testing' into diffbot-testing Matt 2015-12-11 12:35:15 -0700
  • f548b6a728 smaller dgram size works on more networks and over the internet. Matt 2015-12-11 12:34:51 -0700
  • b92763ebc7 Merge branch 'testing' into diffbot-testing Matt 2015-12-09 23:11:37 -0700
  • 657c27a0ee Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-12-09 21:35:30 -0800
  • 2b60faf5df Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-12-09 21:34:54 -0800
  • a5b8276299 Merge branch 'diffbot' into diffbot-testing Matt 2015-12-09 17:56:44 -0700
  • 15d1ac7b5b do not dedup if &links is in diffbot api url (or ?links) Matt Wells 2015-12-09 16:52:11 -0800
  • 27e49df739 fix for www.gov.uk having iswwwdup bug. because .gov.uk is a tld and so is .uk. also added code to handle seg fault signals better and run the default handler after saving rather than calling abort(). hopefully a core will be dumped all the time now. Matt Wells 2015-12-09 15:54:53 -0800
  • 256fadf294 use https for robots.txt if that was the protocol of the original url. Matt 2015-12-09 16:51:29 -0700
  • ea71deefea Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-12-08 10:31:03 -0800
  • 58eb71e4c1 added FIXBUG code to fix seg fault from deleting a collection while it is renaming/unlinking files after a merge. Matt Wells 2015-12-08 10:30:16 -0800
  • bea2c8ab00 try calling original sigsegv handler to guarantee we dump core. Matt 2015-12-07 16:11:55 -0700
  • 9c3644cef2 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-12-05 15:59:01 -0700
  • a3a7635dcf Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-12-04 09:03:16 -0800
  • b049554531 added some more quickpolls. improved heartbeat log msg. timed pthread_join. brought back max heart beat delay parm. Matt Wells 2015-12-04 09:02:03 -0800
  • 24dc98bd9c do not do spider-time deduping if &links is in the diffbot api url. Matt 2015-12-02 13:09:42 -0700
  • 929ba33a5f Revert "hash the normalized outlinks in the diffbot reply" Matt 2015-12-02 13:04:56 -0700
  • 5f6872041a Revert "fix possible core" Matt 2015-12-02 13:04:54 -0700
  • de0f995239 fix possible core Matt 2015-12-02 12:38:52 -0700
  • 908dfa602a hash the normalized outlinks in the diffbot reply as part of the page crc that we use for de-duping. that way if all pages are rendered they won't appear to be dups of one another. Matt 2015-12-02 12:30:04 -0700
  • a34bce91e5 Merge branch 'diffbot' into diffbot-testing Matt 2015-12-02 11:48:37 -0700
  • 72017925fb fix for RdbList::constrain() core Matt Wells 2015-12-01 17:09:34 -0700
  • 58bb17d5ae undo a change that didn't work. Matt Wells 2015-12-01 09:03:59 -0800
  • 34b33f478a added gb rwtest and exposed seektest and thrutest in gb -h. use -o sync when mounting ssds to avoid really slow and spiky linux file/page cache. allow launching of more than 1 non-disk thread again. should help with unlinking, intersects, etc. Matt Wells 2015-11-30 21:29:17 -0700