6eb802054b
fix some corruption in spider data after deleting a collection.
Matt Wells
2017-02-24 09:59:28 -08:00
88120414b0
import more stuck job fixes.
Matt Wells
2016-11-16 10:59:55 -08:00
f1b6f73719
empty &seeds= fix to not reset seeds
Matt Wells
2016-11-16 10:34:25 -08:00
53ee1039b8
import final hung crawl fix into old gb. xor the firstip into the doledb key this time. seems to avoid all collisions now so we don't overwrite nodes in the doledb tree.
Matt Wells
2016-11-07 09:11:27 -08:00
3d248732d0
fix to shut up app checker.
Matt
2016-11-04 17:28:26 -06:00
c0b2cdb60a
hide the verify disk writes parm, seems to be causing cores when activated. and shouldn't really need to be used. is for debugging disk issues.
Matt
2016-11-04 17:09:15 -06:00
4b889e0ddd
quick fix
Matt Wells
2016-11-01 11:39:15 -07:00
93d5752ab7
import fix for jobs hanging from pro.
Matt Wells
2016-11-01 11:18:40 -07:00
1542f7c57f
do not save doledb on exit to prevent corruption being propagated and in case we change default spider priorities in the url filters code which could cause hung jobs.
Matt Wells
2016-10-24 14:23:13 -07:00
d5bc775ce5
fix errCount bug in url filters from errCount wrapping over to negative numbers.
Matt
2016-09-20 17:02:51 -06:00
f23daf3e5e
more fixes for 'zeroing out' code.
Matt Wells
2016-09-14 10:32:01 -07:00
93f878fbca
wow this thing is really being persnickety
Matt Wells
2016-09-14 10:18:48 -07:00
5d54fde09c
more zero-out fixes
Matt Wells
2016-09-14 10:10:03 -07:00
16ec2a6963
fix stack related bug.
Matt Wells
2016-09-14 10:00:18 -07:00
4ac26f9c9b
rotate log file at 1GB.
Matt Wells
2016-09-14 09:37:23 -07:00
8a5891f2ec
fix memory leak in Parms.cpp
Matt
2016-09-12 13:41:07 -06:00
e9f0d067be
zero out docs that we do not process to save disk space. record content hash in 'zeroed out' content so we preserve it for deduping.
Matt Wells
2016-08-30 11:06:59 -07:00
616bfeea86
show corrupt collection numbers for spiderdb corrupted recs.
Matt Wells
2016-08-15 12:55:22 -07:00
7e77b900a6
fix a core from corrupted spider request in doledb in rdb::reclaimMemFromDeletedTreeNodes()
Matt Wells
2016-06-21 11:54:09 -07:00
b03571c1a1
fix infinite loop from corrupt spider request
Matt Wells
2016-06-06 08:06:20 -07:00
14e2f1f579
make trash subdir in case missing.
Matt Wells
2016-05-30 19:37:03 -07:00
bd5618f1b7
fix spider request corruption detection in doledb
Matt Wells
2016-05-28 08:56:36 -07:00
fe21fb1cc6
fix spider detection of corrupted requests. fix deduping so it doesn't core on docid based spider requests.
Matt Wells
2016-05-28 08:40:59 -07:00
7bd2344f41
increase regular page download timeout from 30 seconds to 60 seconds to accommodate some slower websites.
Matt Wells
2016-05-18 10:05:02 -07:00
494f7ca645
reduce diffbot timeout from 18000 secs to 240 secs
Matt Wells
2016-05-18 09:55:15 -07:00
b0e015b97d
fix dns lookup bug that was causing us to get incorrect ips sometimes.
Matt Wells
2016-05-17 11:57:21 -07:00
77dc78d122
fix empty file bug again
Matt Wells
2016-05-13 14:49:54 -07:00
39e621f655
trash files of length 0 that are holding up a merge. if we can't merge files we end up stockpiling them and things get slow fast.
Matt Wells
2016-05-13 13:21:43 -07:00
d4dc85bf18
Fixes for new tlds. They can now contain '-' and numbers. Fix punycode url encoding: set max length before encoding each url chunk.
Zak Betz
2016-05-12 16:04:21 -06:00
0ee1e6164c
Add input validation to regexes before crawlbot collections are created. Add ./gb egrep command to test regexes.
Zak Betz
2016-05-11 14:03:45 -06:00
89f3344be5
Updated tld list with the most current list.
Zak Betz
2016-05-04 11:17:51 -06:00
8c3eacc338
fix cores from XmlDoc::getLinkInfo1() returning -1 because of its call to getFirstIp() presumably.
Matt Wells
2016-04-18 09:55:00 -07:00
c5de65a78a
more core dump fixes concerning -1 being returned for XmlDoc::getLinkInfo1()
Matt Wells
2016-04-17 18:50:23 -07:00
65856e3b6a
tell malloc to trim 100MB at a time to prevent kernel destroying the cpu by defragging/compacting memory. fix core in Title.cpp.
Matt Wells
2016-04-17 09:16:03 -07:00
95a3a261db
fix so host 8 doesn't jam things up so much. host 8 was too busy merging a large spiderdb for blouartinfo and unable to tend to other smaller merges and therefore the # of files was getting out of hand causing slowdowns. so merge spiderdb much less aggressively.
Matt
2016-04-15 13:34:41 -06:00
165f724fd7
thanks to isj for the punycode fixes
Matt
2016-04-06 11:10:42 -06:00
74cfde3e53
fix calling doneSendingNotification() with a just-freed memory ptr bug.
Matt
2016-04-06 10:53:08 -06:00
d2983747b8
fix sending back reply that has some stuff on the stack that it references when XmlDoc::getMsg20Reply() returns. thanks to isj for this fix.
Matt
2016-04-06 10:43:43 -06:00
8891100c2a
fix add url on root page to set collnum properly. fix Summary::getBestWindow() underrun bug.
Matt
2016-04-06 10:31:04 -06:00
70ca2fe48c
update ./gb -h desc for ./gb inject.
Matt
2016-04-05 21:06:38 -06:00
33e76af1d1
Merge branch 'testing'
Matt
2016-03-29 04:11:30 -06:00
816d69b34c
a lot of bug fixes thanks to isj.
Matt
2016-03-29 04:08:17 -06:00
5072e851b7
fix misspelling
Matt
2016-03-28 17:26:40 -06:00
5935619eb2
hack parentUrlDocId onto the json object dump of diffbot objects.
Matt
2016-03-28 12:39:48 -06:00
cab6d5c519
fix keysize==8 bug in keycmp
Matt
2016-03-28 09:17:01 -06:00
b65a16caee
Merge branch 'diffbot-testing' into testing
Matt
2016-03-22 16:25:21 -06:00
3c743a7d0e
allow more docids to be downloaded/served in search results.
Matt Wells
2016-03-22 15:24:33 -07:00
04a8433256
show gbssParentDocId in status doc for children docs, like diffbot object docs.
Matt Wells
2016-03-22 09:00:10 -07:00
483d69d7f7
added httprequest debug line
Matt Wells
2016-03-21 14:46:10 -07:00
136d23816c
fix hashbang properly
Matt Wells
2016-03-21 09:29:55 -07:00
48398d0cd7
Merge branch 'diffbot-testing' into testing
Matt
2016-03-20 23:14:26 -06:00
136b8842db
fix more data corruption bugs. hopefully will dump out all the collections this time and not leave any in the tree, otherwise, especially if there are a lot left behind, they get corrupted.
Matt Wells
2016-03-20 21:04:01 -07:00
61ef806dea
hash bang fix. detect more corruption. don't dump titledb and spiderdb at same time, seems to reduce corruption in rdbmem.
Matt Wells
2016-03-20 12:50:43 -07:00
fc495a5bf5
fix dump core when collection deleted while dumping
Matt Wells
2016-03-18 06:46:38 -07:00
8922b8e69c
Merge branch 'diffbot-testing' into testing
Matt
2016-03-17 14:31:22 -06:00
56bde4c3ef
fix the data corruption fix
Matt Wells
2016-03-17 13:22:56 -07:00
8bc653c31c
after dump completes scan tree to ensure all nodes reference secondary mem ptr so they don't get their data overwritten.
Matt Wells
2016-03-17 10:09:49 -07:00
0caf345850
if running ./gb start and another gb is already bound on the port then quickly exit(0) and have the bash keep alive loop exit the loop based on that return value. we can't use ./cleanexit file because it doesn't get removed and will mess up the main process that is running.
Matt Wells
2016-03-16 16:56:48 -07:00
36fdbf2f5a
rename log files in the gb main.cpp code not in the bash loop. do not rename the log file if failed to start gb because socket was already bound. prevents us from double starts moving the log file, which is annoying.
Matt Wells
2016-03-16 16:08:08 -07:00
a2e8a3a1fd
use ./cleanexit file to ensure gb doesn't restart after a graceful exit in the bash keep alive loop.
Matt Wells
2016-03-16 14:57:19 -07:00
7396e57660
show docids of corrupted title recs found. show key range of each dump to disk. fix 'sentToDiffbot' bug for unchanged docs in status docs. make sure firstKeyInQueue is set properly from current key, so reset list ptr before doing that in RdbDump.cpp.
Matt Wells
2016-03-16 13:53:08 -07:00
5e8c47adfd
Merge branch 'diffbot-testing' into testing
Matt
2016-03-16 01:14:37 -06:00
1faff50f5a
if msg22a never called to get docid, then error out.
Matt Wells
2016-03-16 00:14:02 -07:00