d4e16a4dabpass a crawlbot nightly smoke
Matt Wells
2016-03-04 13:14:28 -0800
e75d80abbeignore meta redirect tags in html comment tags.
Matt Wells
2016-02-22 12:41:03 -0800
412b04bbd4fix neverending crawl rounds by only trying each url once per round. updated url filters.
Matt Wells
2016-02-22 09:28:46 -0800
da9949f462try to fix a couple more core dumps.
Matt Wells
2016-02-19 08:54:48 -0800
c7696a69ebfix core from a federated query and null msg20
Matt Wells
2016-02-18 10:53:20 -0800
f649944573if spidered time is in future, consider the spiderreply corrupt and ignore it. if you set back the OS clock then you might end up ignoring some spider replies but hopefully it won't be such a big deal.
Matt Wells
2016-02-16 12:25:49 -0800
8748cd06acMerge pull request #73 from AppChecker/master
Gigablast
2016-02-13 23:00:37 -0700
f11595efc3fix core dump from deleting an active/dumping collection
Matt Wells
2016-02-12 16:54:03 -0800
bf4bdd6bfdMerge branch 'diffbot-testing' into testing
Matt Wells
2016-02-10 09:50:53 -0800
e68406f073fix core in posdbtable from docid of 0. no idea why docid was 0, but why core?
Matt Wells
2016-02-09 22:43:09 -0800
e376b97814let's generalize it. if a redirect sets cookies then follow it through, don't stop in the middle because we think it is 'simplified'.
Matt Wells
2016-02-09 13:47:12 -0800
d7a6a0a1fffix gap.com redirects that require us setting multiple cookies in the spider request.
Matt Wells
2016-02-09 13:38:59 -0800
92934bbd5cuse http/1.0 since we dont support chunked transfer encoding
Matt
2016-02-09 12:04:05 -0700
7ca0b1f738Merge branch 'diffbot-testing' into testing
Matt
2016-02-09 10:39:04 -0700
ef97462accthanks for the bug fix, ivan!
Matt
2016-02-09 10:38:46 -0700
1d2dfe1456bring back max doc len parms. index gbssIsContentTruncated field. fix 30-day wait for >= 3 errors. fix gbss formatting some more.
Matt Wells
2016-02-08 14:10:04 -0800
97c2f09225move gbssFirstIndexed up so the new shard fields dont mess up printing of the human readable dates
Matt Wells
2016-02-08 10:26:02 -0800
cdb8a5f86afix core from treating corrupted titlerecs as non-existent.
Matt Wells
2016-02-08 09:49:04 -0800
c1a72213d7watch out for negative datasize spider requests in doledb when calling xmldoc::set4 so we don't core any more.
Matt Wells
2016-02-08 09:18:33 -0800
3b6704d318also show what shardnum stores the docid so we can track down titlerec corruption easier.
Matt Wells
2016-02-08 08:58:37 -0800
d183525db6treat corrupted titlerecs as not founds so spidering can continue despite it. update gbss result display to have human readable dates and a link to the docid. added gbssSpideredByHostId so we can track down issues faster.
Matt Wells
2016-02-08 08:42:04 -0800
9247e15210make spider use HTTP/1.1 not 1.0 since some sites have been found to return 406 Not Acceptable periodically because of it.
Matt Wells
2016-02-05 10:14:09 -0800
701b14ada7Fix: possible double free
appchecker
2016-02-05 16:11:53 +0300
4eec8eb5b7fix invalid <base href=/> tag
Matt Wells
2016-02-02 12:52:45 -0800
ef8dc25b55fix compiler warning
Matt
2016-01-30 14:31:01 -0700
19de1d2eaeMerge branch 'diffbot-testing' into testing
Matt
2016-01-30 14:27:08 -0700
1a0378b76edo not report edocunchanged for bulk jobs ever. more skipped shards bug fixes. remove some log spam.
Matt Wells
2016-01-30 11:14:12 -0800
3e9ee2f6d0bulk robots hack fix
Matt Wells
2016-01-28 09:14:38 -0800
04bdda20cffix for empty queries saying a shard is down
Matt Wells
2016-01-28 08:51:04 -0800
4c7f969988fix critical spider issue of an IP corking an entire spider priority. also exit faster on 'save & exit' if in evalIpLoop().
Matt Wells
2016-01-23 10:32:14 -0800
7e13b147e4updated dmoz docs
Matt
2016-01-23 08:54:35 -0700
fb3b179666Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2016-01-21 09:58:38 -0800
a652d60b2bdetect more corrupted spider records caused by saving memory to disk after a segv corrupts the memory.
Matt Wells
2016-01-21 09:57:43 -0800
4d88ca5d28improve spiderrequest::isCorrupt()
Matt Wells
2016-01-19 22:24:52 -0800
1e248218daadded 4 more diffbot errors so hopefully no more 'unknown diffbot error' error codes in crawlbot.
Matt
2016-01-11 16:12:33 -0800
032f597a16Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2016-01-11 15:30:53 -0800
33a40480b4try to fix Unknown Diffbot Error error.
Matt
2016-01-11 15:28:30 -0800
422ffae8e3fix a core from dumping doledb out to stdout
Matt Wells
2016-01-11 10:32:47 -0800
4fb900da03fix strange corruption in doledb core
Matt Wells
2016-01-11 09:40:55 -0800
344732ac19Don't try to match implicit non-required phrases when verifying doc has query terms.
Zak Betz
2016-01-08 10:09:34 -0700
008b21ee6bFix query "the" and "the" not matching all of the terms.
Zak Betz
2016-01-07 15:30:45 -0700
6d73b57243fix core dump from bad langid of 99
Matt Wells
2016-01-05 14:32:02 -0800
03aad9db54Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2016-01-05 13:50:18 -0700
5804e28230fix bug of losing the hopcount 0 spiderrequest because it gets overridden by a link to itself then it becomes hopcount 1.
Matt
2016-01-05 13:49:54 -0700
bb316b0e88respider all diffbot urls even if 404 or 500. that sometimes happens TEMPORARILY to a website.
Matt Wells
2016-01-05 12:47:04 -0800
953d636448added trivial link on cached page to gb root page
Matt Wells
2016-01-03 11:27:24 -0800
60e0306d0dfix xmldoc::getRootXmlDoc() to put an http:// or https:// in front of the root url before putting into spiderrequest so it is not perceived as corrupted.
Matt Wells
2015-12-29 08:22:14 -0800
dd46a82bb7fix again
Matt Wells
2015-12-28 14:50:11 -0800
1091c5c4eafix core
Matt Wells
2015-12-28 14:48:03 -0800
21c337be27fix the spider status msg fix some more
Matt Wells
2015-12-28 12:03:50 -0800
d6a9db35d2instead of showing SP_ROUNDDONE, show SP_MAXROUNDS if necessary so we can pass the crawlbot nightly smoke tests.
Matt Wells
2015-12-28 11:50:56 -0800
5d049213b9try to fix host 31 from coring. possible corrupt spider request.
Matt
2015-12-28 12:22:39 -0700
9b423690eefix compiler error for maxpp
Matt Wells
2015-12-28 11:21:36 -0800
17f21b9498PageResults.cpp: Taking out maxpp requires taking it out of log entry. Just leave it in for now?
MikeLx
2015-12-24 00:55:37 -0500
40ff79c8ffMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-12-23 22:18:49 -0700
7be4224817change try agains recvd to try agains sent in the hosts table.
Matt
2015-12-23 22:18:24 -0700
da5f0c8d76Merge branch 'testing' into diffbot-testing
Matt Wells
2015-12-23 14:23:05 -0800
9147d6bb02fix some diffbot crawls. do not spider pages at the hopcount limit when 'only spider urls if new' is enabled. meaning only spider each url once. (unless there is a temporary error) fix malformed url bug some more. added some commented out code for indexing spider replies (gbss docs) for certain fatal/critical errors, in which case they are not being indexed.
Matt Wells
2015-12-23 13:49:21 -0800
a5e6a12ff8added support for TLS SNI (Server Name Indication)
Matt
2015-12-23 13:30:49 -0700
6f9f177d7canother fix for the parms not getting updated fix.
Matt
2015-12-17 14:30:35 -0700
770e94b4ccfix it so we don't call the page handler before all parms have been digested.
Matt Wells
2015-12-17 13:11:57 -0800
a07c840a6ajoin with threads when exiting -- to no avail exit status is still foobar.
Matt Wells
2015-12-17 10:15:39 -0800
c5f21a721fzero out crazy local spider stats. corrupted from saving after memory got corrupted.
Matt Wells
2015-12-17 09:43:41 -0800
f2be319dcdtry to fix exiting w/ pthreads some more (part 2)
Matt
2015-12-17 08:38:12 -0700
73e9ed0719try to fix cores not being dropped.
Matt
2015-12-17 06:44:25 -0700
ede7a78594Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-12-17 06:08:47 -0700
83ade5c58dfix spider req corruption
Matt Wells
2015-12-16 20:33:45 -0800
becc244e12add new link to page crawlbot to see spider attempt gbss docs.
Matt Wells
2015-12-15 16:22:53 -0800
9a20a4387fimprove spiderreq corruption detection
Matt Wells
2015-12-15 14:23:21 -0800
879ef32db0fix for all urls getting malformed url (EBADURL) while spidering. had to add 'errorcode==' to urlfilters to just redo those pages otherwise having that error, a non-temporary error, would have barred them from being retried in the future.
Matt Wells
2015-12-15 10:06:06 -0800
23a20ff639more fixes for ebadurl bug
Matt Wells
2015-12-14 17:49:06 -0800
e1cfeb4c82have diffbot retry non tmp errors to make up for bug of calling valid urls malformed EBADURLs
Matt Wells
2015-12-14 17:30:41 -0800
4b9240e42efix a fix
Matt Wells
2015-12-14 17:06:58 -0800
6d30f21ad9use small dgrams to avoid splitting at the kernel level down to the mtu. increase 0xc1 msg request delay from 3 to 20 secs. need to make it linear order.
Matt Wells
2015-12-14 10:47:07 -0800
cb0d75b343Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-12-12 19:21:06 -0800
c161f17196fix punycode bug in firsturl of bad punycode.
Matt Wells
2015-12-12 19:18:28 -0800
dc505aefa8Merge branch 'testing' into diffbot-testing
Matt
2015-12-11 12:35:15 -0700
f548b6a728smaller dgram size works on more networks and over the internet.
Matt
2015-12-11 12:34:51 -0700
b92763ebc7Merge branch 'testing' into diffbot-testing
Matt
2015-12-09 23:11:37 -0700
657c27a0eeMerge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-12-09 21:35:30 -0800
2b60faf5dfMerge branch 'diffbot' into diffbot-testing
Matt Wells
2015-12-09 21:34:54 -0800
a5b8276299Merge branch 'diffbot' into diffbot-testing
Matt
2015-12-09 17:56:44 -0700
15d1ac7b5bdo not dedup if &links is in diffbot api url (or ?links)
Matt Wells
2015-12-09 16:52:11 -0800
27e49df739fix for www.gov.uk having iswwwdup bug. because .gov.uk is a tld and so is .uk. also added code to handle seg fault signals better and run the default handler after saving rather than calling abort(). hopefully a core will be dumped all the time now.
Matt Wells
2015-12-09 15:54:53 -0800
256fadf294use https for robots.txt if that was the protocol of the original url.
Matt
2015-12-09 16:51:29 -0700
ea71deefeaMerge branch 'diffbot' into diffbot-testing
Matt Wells
2015-12-08 10:31:03 -0800
58eb71e4c1added FIXBUG code to fix seg fault from deleting a collection while it is renaming/unlinking files after a merge.
Matt Wells
2015-12-08 10:30:16 -0800
bea2c8ab00try calling original sigsegv handler to guarantee we dump core.
Matt
2015-12-07 16:11:55 -0700
9c3644cef2Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-12-05 15:59:01 -0700
a3a7635dcfMerge branch 'diffbot' into diffbot-testing
Matt Wells
2015-12-04 09:03:16 -0800
b049554531added some more quickpolls. improved heartbeat log msg. timed pthread_join. brought back max heart beat delay parm.
Matt Wells
2015-12-04 09:02:03 -0800
24dc98bd9cdo not do spider-time deduping if &links is in the diffbot api url.
Matt
2015-12-02 13:09:42 -0700
929ba33a5fRevert "hash the normalized outlinks in the diffbot reply"
Matt
2015-12-02 13:04:56 -0700
5f6872041aRevert "fix possible core"
Matt
2015-12-02 13:04:54 -0700
de0f995239fix possible core
Matt
2015-12-02 12:38:52 -0700
908dfa602ahash the normalized outlinks in the diffbot reply as part of the page crc that we use for de-duping. that way if all pages are rendered they won't appear to be dups of one another.
Matt
2015-12-02 12:30:04 -0700
a34bce91e5Merge branch 'diffbot' into diffbot-testing
Matt
2015-12-02 11:48:37 -0700
72017925fbfix for RdbList::constrain() core
Matt Wells
2015-12-01 17:09:34 -0700
58bb17d5aeundo a change that didn't work.
Matt Wells
2015-12-01 09:03:59 -0800
34b33f478aadded gb rwtest and exposed seektest and thrutest in gb -h. use -o sync when mounting ssds to avoid really slow and spiky linux file/page cache. allow launching of more than 1 non-disk thread again. should help with unlinking, intersects, etc.
Matt Wells
2015-11-30 21:29:17 -0700