c7c8c9e5ad
Merge branch 'diffbot-testing' into testing
Matt
2016-03-16 00:54:49 -06:00
0b5f417349
if old title rec was corrupted we would get a random docid when re-spidering the url causing some chaos. now things should return to normal and we should overwrite the corrupted titlerec on the next spidering. also, no longer do robots.txt titlerec lookups. silly.
Matt Wells
2016-03-15 23:26:57 -07:00
58993dbbf9
do not allow crawlbot seeds to be deduped out
Matt Wells
2016-03-15 20:42:28 -07:00
bf45db6f48
Merge branch 'diffbot-testing' into testing
Matt Wells
2016-03-15 15:55:55 -07:00
8a65d21371
fix the source of lots of corruption in spiderdb and titledb. rdbmem.cpp was storing in secondary mem which got reset when dump completed. also do not add keys that are in collnum and key range of list currently being dumped, return ETRYAGAIN. added verify writes parm. clean out tree of titledb and spiderdb corruption on startup.
Matt Wells
2016-03-15 15:54:12 -07:00
0fdbaa4196
makefile optimizations
Matt Wells
2016-03-14 16:34:24 -07:00
0dbc304bbf
fix to allow us to gather ip-only url outlinks again
Matt
2016-03-14 10:56:33 -06:00
2c167aada7
fix redirect to self bug that requires setting cookie
Matt
2016-03-14 10:33:05 -06:00
d6fe684b99
fix another core caused by deleted coll
Matt Wells
2016-03-07 10:20:25 -08:00
d4e16a4dab
pass a crawlbotnightly smoke
Matt Wells
2016-03-04 13:14:28 -08:00
e75d80abbe
ignore meta redirect tags in html comment tags.
Matt Wells
2016-02-22 12:41:03 -08:00
412b04bbd4
fix neverending crawl rounds by only trying each url once per round. updated url filters.
Matt Wells
2016-02-22 09:28:46 -08:00
da9949f462
try to fix a couple more core dumps.
Matt Wells
2016-02-19 08:54:48 -08:00
c7696a69eb
fix core from a federated query and null msg20
Matt Wells
2016-02-18 10:53:20 -08:00
f649944573
if spidered time is in future, consider the spiderreply corrupt and ignore it. if you set back the OS clock then you might end up ignoring some spider replies but hopefully it won't be such a big deal.
Matt Wells
2016-02-16 12:25:49 -08:00
f11595efc3
fix core dump from deleting an active/dumping collection
Matt Wells
2016-02-12 16:54:03 -08:00
bf4bdd6bfd
Merge branch 'diffbot-testing' into testing
Matt Wells
2016-02-10 09:50:53 -08:00
e68406f073
fix core in posdbtable from docid of 0. no idea why docid was 0, but why core?
Matt Wells
2016-02-09 22:43:09 -08:00
e376b97814
let's generalize it. if a redirect sets cookies then follow it through, don't stop in the middle because we think it is 'simplified'.
Matt Wells
2016-02-09 13:47:12 -08:00
d7a6a0a1ff
fix gap.com redirects that require us setting multiple cookies in the spider request.
Matt Wells
2016-02-09 13:38:59 -08:00
92934bbd5c
use http/1.0 since we don't support chunked transfer encoding
Matt
2016-02-09 12:04:05 -07:00
7ca0b1f738
Merge branch 'diffbot-testing' into testing
Matt
2016-02-09 10:39:04 -07:00
ef97462acc
thanks for the bug fix, ivan!
Matt
2016-02-09 10:38:46 -07:00
1d2dfe1456
bring back max doc len parms. index gbssIsContentTruncated field. fix 30-day wait for >= 3 errors. fix gbss formatting some more.
Matt Wells
2016-02-08 14:10:04 -08:00
97c2f09225
move gbssFirstIndexed up so the new shard fields don't mess up printing of the human readable dates
Matt Wells
2016-02-08 10:26:02 -08:00
cdb8a5f86a
fix core from treating corrupted titlerecs as non-existent.
Matt Wells
2016-02-08 09:49:04 -08:00
c1a72213d7
watch out for negative datasize spider requests in doledb when calling xmldoc::set4 so we don't core any more.
Matt Wells
2016-02-08 09:18:33 -08:00
3b6704d318
also show what shardnum stores the docid so we can track down titlerec corruption easier.
Matt Wells
2016-02-08 08:58:37 -08:00
d183525db6
treat corrupted titlerecs as not founds so spidering can continue despite it. update gbss result display to have human readable dates and a link to the docid. added gbssSpideredByHostId so we can track down issues faster.
Matt Wells
2016-02-08 08:42:04 -08:00
9247e15210
make spider use HTTP/1.1 not 1.0 since some sites have been found to return 406 Not Acceptable periodically because of it.
Matt Wells
2016-02-05 10:14:09 -08:00
701b14ada7
Fix: possible double free
appchecker
2016-02-05 16:11:53 +03:00
4eec8eb5b7
fix invalid <base href=/> tag
Matt Wells
2016-02-02 12:52:45 -08:00
ef8dc25b55
fix compiler warning
Matt
2016-01-30 14:31:01 -07:00
19de1d2eae
Merge branch 'diffbot-testing' into testing
Matt
2016-01-30 14:27:08 -07:00
1a0378b76e
do not report edocunchanged for bulk jobs ever. more skipped shards bug fixes. remove some log spam.
Matt Wells
2016-01-30 11:14:12 -08:00
3e9ee2f6d0
bulk robots hack fix
Matt Wells
2016-01-28 09:14:38 -08:00
04bdda20cf
fix for empty queries saying a shard is down
Matt Wells
2016-01-28 08:51:04 -08:00
4c7f969988
fix critical spider issue of an IP corking an entire spider priority. also exit faster on 'save & exit' if in evalIpLoop().
Matt Wells
2016-01-23 10:32:14 -08:00
7e13b147e4
updated dmoz docs
Matt
2016-01-23 08:54:35 -07:00
fb3b179666
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2016-01-21 09:58:38 -08:00
a652d60b2b
detect more corrupted spider records caused by saving memory to disk after a segv corrupts the memory.
Matt Wells
2016-01-21 09:57:43 -08:00
4d88ca5d28
improve spiderrequest::isCorrupt()
Matt Wells
2016-01-19 22:24:52 -08:00
1e248218da
added 4 more diffbot errors so hopefully no more 'unknown diffbot error' error codes in crawlbot.
Matt
2016-01-11 16:12:33 -08:00
032f597a16
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2016-01-11 15:30:53 -08:00
33a40480b4
try to fix Unknown Diffbot Error error.
Matt
2016-01-11 15:28:30 -08:00
422ffae8e3
fix a core from dumping doledb out to stdout
Matt Wells
2016-01-11 10:32:47 -08:00
4fb900da03
fix strange corruption in doledb core
Matt Wells
2016-01-11 09:40:55 -08:00
344732ac19
Don't try to match implicit non-required phrases when verifying doc has query terms.
Zak Betz
2016-01-08 10:09:34 -07:00
008b21ee6b
Fix query "the" and "the" not matching all of the terms.
Zak Betz
2016-01-07 15:30:45 -07:00
6d73b57243
fix core dump from bad langid of 99
Matt Wells
2016-01-05 14:32:02 -08:00
03aad9db54
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2016-01-05 13:50:18 -07:00
5804e28230
fix bug of losing the hopcount 0 spiderrequest because it gets overridden by a link to itself then it becomes hopcount 1.
Matt
2016-01-05 13:49:54 -07:00
bb316b0e88
respider all diffbot urls even if 404 or 500. that sometimes happens TEMPORARILY to a website.
Matt Wells
2016-01-05 12:47:04 -08:00
953d636448
added trivial link on cached page to gb root page
Matt Wells
2016-01-03 11:27:24 -08:00
60e0306d0d
fix xmldoc::getRootXmlDoc() to put an http:// or https:// in front of the root url before putting into spiderrequest so it is not perceived as corrupted.
Matt Wells
2015-12-29 08:22:14 -08:00
dd46a82bb7
fix again
Matt Wells
2015-12-28 14:50:11 -08:00
1091c5c4ea
fix core
Matt Wells
2015-12-28 14:48:03 -08:00
21c337be27
fix the spider status msg fix some more
Matt Wells
2015-12-28 12:03:50 -08:00
d6a9db35d2
instead of showing SP_ROUNDDONE, show SP_MAXROUNDS if necessary so we can pass the crawlbot nightly smoke tests.
Matt Wells
2015-12-28 11:50:56 -08:00
5d049213b9
try to fix host 31 from coring. possible corrupt spider request.
Matt
2015-12-28 12:22:39 -07:00
9b423690ee
fix compiler error for maxpp
Matt Wells
2015-12-28 11:21:36 -08:00
40ff79c8ff
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-12-23 22:18:49 -07:00
7be4224817
change try agains recvd to try agains sent in the hosts table.
Matt
2015-12-23 22:18:24 -07:00
da5f0c8d76
Merge branch 'testing' into diffbot-testing
Matt Wells
2015-12-23 14:23:05 -08:00
9147d6bb02
fix some diffbot crawls. do not spider pages at the hopcount limit when 'only spider urls if new' is enabled, meaning only spider each url once (unless there is a temporary error). fix malformed url bug some more. added some commented out code for indexing spider replies (gbss docs) for certain fatal/critical errors, in which case they are not being indexed.
Matt Wells
2015-12-23 13:49:21 -08:00
a5e6a12ff8
added support for TLS SNI (Server Name Indication)
Matt
2015-12-23 13:30:49 -07:00
6f9f177d7c
another fix for the parms not getting updated fix.
Matt
2015-12-17 14:30:35 -07:00
770e94b4cc
fix it so we don't call the page handler before all parms have been digested.
Matt Wells
2015-12-17 13:11:57 -08:00
a07c840a6a
join with threads when exiting -- to no avail exit status is still foobar.
Matt Wells
2015-12-17 10:15:39 -08:00
c5f21a721f
zero out crazy local spider stats. corrupted from saving after memory got corrupted.
Matt Wells
2015-12-17 09:43:41 -08:00
f2be319dcd
try to fix exiting w/ pthreads some more (part 2)
Matt
2015-12-17 08:38:12 -07:00
73e9ed0719
try to fix cores not being dropped.
Matt
2015-12-17 06:44:25 -07:00
ede7a78594
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-12-17 06:08:47 -07:00
83ade5c58d
fix spider req corruption
Matt Wells
2015-12-16 20:33:45 -08:00
becc244e12
add new link to page crawlbot to see spider attempt gbss docs.
Matt Wells
2015-12-15 16:22:53 -08:00
9a20a4387f
improve spiderreq corruption detection
Matt Wells
2015-12-15 14:23:21 -08:00
879ef32db0
fix for all urls getting malformed url (EBADURL) while spidering. had to add 'errorcode==' to urlfilters to just redo those pages otherwise having that error, a non-temporary error, would have barred them from being retried in the future.
Matt Wells
2015-12-15 10:06:06 -08:00
23a20ff639
more fixes for ebadurl bug
Matt Wells
2015-12-14 17:49:06 -08:00
e1cfeb4c82
have diffbot retry non tmp errors to make up for bug of calling valid urls malformed EBADURLs
Matt Wells
2015-12-14 17:30:41 -08:00
4b9240e42e
fix a fix
Matt Wells
2015-12-14 17:06:58 -08:00
6d30f21ad9
use small dgrams to avoid splitting at the kernel level down to the mtu. increase 0xc1 msg request delay from 3 to 20 secs. need to make it linear order.
Matt Wells
2015-12-14 10:47:07 -08:00
cb0d75b343
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-12-12 19:21:06 -08:00
c161f17196
fix punycode bug in firsturl of bad punycode.
Matt Wells
2015-12-12 19:18:28 -08:00
dc505aefa8
Merge branch 'testing' into diffbot-testing
Matt
2015-12-11 12:35:15 -07:00
f548b6a728
smaller dgram size works on more networks and over the internet.
Matt
2015-12-11 12:34:51 -07:00
b92763ebc7
Merge branch 'testing' into diffbot-testing
Matt
2015-12-09 23:11:37 -07:00
657c27a0ee
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-12-09 21:35:30 -08:00
2b60faf5df
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-12-09 21:34:54 -08:00
a5b8276299
Merge branch 'diffbot' into diffbot-testing
Matt
2015-12-09 17:56:44 -07:00
15d1ac7b5b
do not dedup if &links is in diffbot api url (or ?links)
Matt Wells
2015-12-09 16:52:11 -08:00
27e49df739
fix for www.gov.uk having iswwwdup bug. because .gov.uk is a tld and so is .uk. also added code to handle seg fault signals better and run the default handler after saving rather than calling abort(). hopefully a core will be dumped all the time now.
Matt Wells
2015-12-09 15:54:53 -08:00
256fadf294
use https for robots.txt if that was the protocol of the original url.
Matt
2015-12-09 16:51:29 -07:00
ea71deefea
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-12-08 10:31:03 -08:00
58eb71e4c1
added FIXBUG code to fix seg fault from deleting a collection while it is renaming/unlinking files after a merge.
Matt Wells
2015-12-08 10:30:16 -08:00
bea2c8ab00
try calling original sigsegv handler to guarantee we dump core.
Matt
2015-12-07 16:11:55 -07:00
9c3644cef2
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-12-05 15:59:01 -07:00
a3a7635dcf
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-12-04 09:03:16 -08:00
b049554531
added some more quickpolls. improved heartbeat log msg. timed pthread_join. brought back max heart beat delay parm.
Matt Wells
2015-12-04 09:02:03 -08:00
24dc98bd9c
do not do spider-time deduping if &links is in the diffbot api url.
Matt
2015-12-02 13:09:42 -07:00