6eb802054b
fix some corruption in spider data after deleting a collection.
Matt Wells
2017-02-24 09:59:28 -08:00
88120414b0
import more stuck job fixes.
Matt Wells
2016-11-16 10:59:55 -08:00
f1b6f73719
empty &seeds= fix to not reset seeds
Matt Wells
2016-11-16 10:34:25 -08:00
53ee1039b8
import final hung crawl fix into old gb. xor the firstip into the doledb key this time. seems to avoid all collisions now so we don't overwrite nodes in the doledb tree.
Matt Wells
2016-11-07 09:11:27 -08:00
3d248732d0
fix to shut up app checker.
Matt
2016-11-04 17:28:26 -06:00
c0b2cdb60a
hide the verify disk writes parm, seems to be causing cores when activated. and shouldn't really need to be used. is for debugging disk issues.
Matt
2016-11-04 17:09:15 -06:00
4b889e0ddd
quick fix
Matt Wells
2016-11-01 11:39:15 -07:00
93d5752ab7
import fix for jobs hanging from pro.
Matt Wells
2016-11-01 11:18:40 -07:00
1542f7c57f
do not save doledb on exit to prevent corruption being propagated and in case we change default spider priorities in the url filters code which could cause hung jobs.
Matt Wells
2016-10-24 14:23:13 -07:00
d5bc775ce5
fix errCount bug in url filters from errCount wrapping over to negative numbers.
Matt
2016-09-20 17:02:51 -06:00
f23daf3e5e
more fixes for 'zeroing out' code.
Matt Wells
2016-09-14 10:32:01 -07:00
93f878fbca
wow this thing is really being persnickety
Matt Wells
2016-09-14 10:18:48 -07:00
5d54fde09c
more zero-out fixes
Matt Wells
2016-09-14 10:10:03 -07:00
16ec2a6963
fix stack related bug.
Matt Wells
2016-09-14 10:00:18 -07:00
4ac26f9c9b
rotate log file at 1GB.
Matt Wells
2016-09-14 09:37:23 -07:00
8a5891f2ec
fix memory leak in Parms.cpp
Matt
2016-09-12 13:41:07 -06:00
e9f0d067be
zero out docs that we do not process to save disk space. record content hash in 'zeroed out' content so we preserve it for deduping.
Matt Wells
2016-08-30 11:06:59 -07:00
616bfeea86
show corrupt collection numbers for spiderdb corrupted recs.
Matt Wells
2016-08-15 12:55:22 -07:00
7e77b900a6
fix a core from corrupted spider request in doledb in rdb::reclaimMemFromDeletedTreeNodes()
Matt Wells
2016-06-21 11:54:09 -07:00
b03571c1a1
fix infinite loop from corrupt spider request
Matt Wells
2016-06-06 08:06:20 -07:00
14e2f1f579
make trash subdir in case missing.
Matt Wells
2016-05-30 19:37:03 -07:00
bd5618f1b7
fix spider request corruption detection in doledb
Matt Wells
2016-05-28 08:56:36 -07:00
fe21fb1cc6
fix spider detection of corrupted requests. fix deduping so it doesn't core on docid based spider requests.
Matt Wells
2016-05-28 08:40:59 -07:00
7bd2344f41
increase regular page download timeout from 30 seconds to 60 seconds to accommodate some slower websites.
Matt Wells
2016-05-18 10:05:02 -07:00
494f7ca645
reduce diffbot timeout from 18000 secs to 240 secs
Matt Wells
2016-05-18 09:55:15 -07:00
b0e015b97d
fix dns lookup bug that was causing us to get incorrect ips sometimes.
Matt Wells
2016-05-17 11:57:21 -07:00
77dc78d122
fix empty file bug again
Matt Wells
2016-05-13 14:49:54 -07:00
39e621f655
trash files of length 0 that are holding up a merge. if we can't merge files we end up stockpiling them and things get slow fast.
Matt Wells
2016-05-13 13:21:43 -07:00
d4dc85bf18
Fixes for new tlds. They can now contain '-' and numbers. Fix punycode url encoding: set max length before encoding each url chunk.
Zak Betz
2016-05-12 16:04:21 -06:00
0ee1e6164c
Add input validation to regexes before crawlbot collections are created. Add ./gb egrep command to test regexes.
Zak Betz
2016-05-11 14:03:45 -06:00
89f3344be5
Updated tld list with the most current list.
Zak Betz
2016-05-04 11:17:51 -06:00
8c3eacc338
fix cores from XmlDoc::getLinkInfo1() returning -1 because of its call to getFirstIp() presumably.
Matt Wells
2016-04-18 09:55:00 -07:00
c5de65a78a
more core dump fixes concerning -1 being returned for XmlDoc::getLinkInfo1()
Matt Wells
2016-04-17 18:50:23 -07:00
65856e3b6a
tell malloc to trim 100MB at a time to prevent kernel destroying the cpu by defragging/compacting memory. fix core in Title.cpp.
Matt Wells
2016-04-17 09:16:03 -07:00
95a3a261db
fix so host 8 doesn't jam things up so much. host 8 was too busy merging a large spiderdb for blouartinfo and unable to tend to other smaller merges and therefore the # of files was getting out of hand causing slowdowns. so merge spiderdb much less aggressively.
Matt
2016-04-15 13:34:41 -06:00
165f724fd7
thanks to isj for the punycode fixes
Matt
2016-04-06 11:10:42 -06:00
74cfde3e53
fix calling doneSendingNotification() with a just-freed memory ptr bug.
Matt
2016-04-06 10:53:08 -06:00
d2983747b8
fix sending back reply that has some stuff on the stack that it references when XmlDoc::getMsg20Reply() returns. thanks to isj for this fix.
Matt
2016-04-06 10:43:43 -06:00
8891100c2a
fix add url on root page to set collnum properly. fix Summary::getBestWindow() underrun bug.
Matt
2016-04-06 10:31:04 -06:00
70ca2fe48c
update ./gb -h desc for ./gb inject.
Matt
2016-04-05 21:06:38 -06:00
33e76af1d1
Merge branch 'testing'
Matt
2016-03-29 04:11:30 -06:00
816d69b34c
a lot of bug fixes thanks to isj.
Matt
2016-03-29 04:08:17 -06:00
5072e851b7
fix misspelling
Matt
2016-03-28 17:26:40 -06:00
5935619eb2
hack on parentUrlDocId to the json object dump of diffbot objects.
Matt
2016-03-28 12:39:48 -06:00
cab6d5c519
fix keysize==8 bug in keycmp
Matt
2016-03-28 09:17:01 -06:00
b65a16caee
Merge branch 'diffbot-testing' into testing
Matt
2016-03-22 16:25:21 -06:00
3c743a7d0e
allow more docids to be downloaded/served in search results.
Matt Wells
2016-03-22 15:24:33 -07:00
04a8433256
show gbssParentDocId in status doc for children docs, like diffbot object docs.
Matt Wells
2016-03-22 09:00:10 -07:00
483d69d7f7
added httprequest debug line
Matt Wells
2016-03-21 14:46:10 -07:00
136d23816c
fix hashbang properly
Matt Wells
2016-03-21 09:29:55 -07:00
48398d0cd7
Merge branch 'diffbot-testing' into testing
Matt
2016-03-20 23:14:26 -06:00
136b8842db
fix more data corruption bugs. hopefully will dump out all the collections this time and not leave any in the tree, otherwise, especially if there are a lot left behind, they get corrupted.
Matt Wells
2016-03-20 21:04:01 -07:00
61ef806dea
hash bang fix. detect more corruption. don't dump titledb and spiderdb at same time, seems to reduce corruption in rdbmem.
Matt Wells
2016-03-20 12:50:43 -07:00
fc495a5bf5
fix dump core when collection deleted while dumping
Matt Wells
2016-03-18 06:46:38 -07:00
8922b8e69c
Merge branch 'diffbot-testing' into testing
Matt
2016-03-17 14:31:22 -06:00
56bde4c3ef
fix the data corruption fix
Matt Wells
2016-03-17 13:22:56 -07:00
8bc653c31c
after dump completes scan tree to ensure all nodes reference secondary mem ptr so they don't get their data overwritten.
Matt Wells
2016-03-17 10:09:49 -07:00
0caf345850
if running ./gb start and another gb is already bound on the port then quickly exit(0) and have the bash keep alive loop exit the loop based on that return value. we can't use ./cleanexit file because it doesn't get removed and will mess up the main process that is running.
Matt Wells
2016-03-16 16:56:48 -07:00
36fdbf2f5a
rename log files in the gb main.cpp code not in the bash loop. do not rename the log file if failed to start gb because socket was already bound. prevents us from double starts moving the log file, which is annoying.
Matt Wells
2016-03-16 16:08:08 -07:00
a2e8a3a1fd
use ./cleanexit file to ensure gb doesn't restart after a graceful exit in the bash keep alive loop.
Matt Wells
2016-03-16 14:57:19 -07:00
7396e57660
show docids of corrupted title recs found. show key range of each dump to disk. fix 'sentToDiffbot' bug for unchanged docs in status docs. make sure firstKeyInQueue is set properly from current key, so reset list ptr before doing that in RdbDump.cpp.
Matt Wells
2016-03-16 13:53:08 -07:00
5e8c47adfd
Merge branch 'diffbot-testing' into testing
Matt
2016-03-16 01:14:37 -06:00
1faff50f5a
if msg22a never called to get docid, then error out.
Matt Wells
2016-03-16 00:14:02 -07:00
c7c8c9e5ad
Merge branch 'diffbot-testing' into testing
Matt
2016-03-16 00:54:49 -06:00
0b5f417349
if old title rec was corrupted we would get a random docid when re-spidering the url causing some chaos. now things should return to normal and we should overwrite the corrupted titlerec on the next spidering. also, no longer do robots.txt titlerec lookups. silly.
Matt Wells
2016-03-15 23:26:57 -07:00
58993dbbf9
do not allow crawlbot seeds to be deduped out
Matt Wells
2016-03-15 20:42:28 -07:00
bf45db6f48
Merge branch 'diffbot-testing' into testing
Matt Wells
2016-03-15 15:55:55 -07:00
8a65d21371
fix the source of lots of corruption in spiderdb and titledb. rdbmem.cpp was storing in secondary mem which got reset when dump completed. also do not add keys that are in collnum and key range of list currently being dumped, return ETRYAGAIN. added verify writes parm. clean out tree of titledb and spiderdb corruption on startup.
Matt Wells
2016-03-15 15:54:12 -07:00
0fdbaa4196
makefile optimizations
Matt Wells
2016-03-14 16:34:24 -07:00
0dbc304bbf
fix to allow us to gather ip-only url outlinks again
Matt
2016-03-14 10:56:33 -06:00
2c167aada7
fix redirect to self bug that requires setting cookie
Matt
2016-03-14 10:33:05 -06:00
d6fe684b99
fix another core caused by deleted coll
Matt Wells
2016-03-07 10:20:25 -08:00
d4e16a4dab
pass a crawlbotnightly smoke
Matt Wells
2016-03-04 13:14:28 -08:00
e75d80abbe
ignore meta redirect tags in html comment tags.
Matt Wells
2016-02-22 12:41:03 -08:00
412b04bbd4
fix neverending crawl rounds by only trying each url once per round. updated url filters.
Matt Wells
2016-02-22 09:28:46 -08:00
da9949f462
try to fix a couple more core dumps.
Matt Wells
2016-02-19 08:54:48 -08:00
c7696a69eb
fix core from a federated query and null msg20
Matt Wells
2016-02-18 10:53:20 -08:00
f649944573
if spidered time is in future, consider the spiderreply corrupt and ignore it. if you set back the OS clock then you might end up ignoring some spider replies but hopefully it won't be such a big deal.
Matt Wells
2016-02-16 12:25:49 -08:00