6eb802054b fix some corruption in spider data after deleting a collection.
Matt Wells
2017-02-24 09:59:28 -08:00
88120414b0 import more stuck job fixes.
Matt Wells
2016-11-16 10:59:55 -08:00
f1b6f73719 empty &seeds= fix to not reset seeds
Matt Wells
2016-11-16 10:34:25 -08:00
53ee1039b8 import final hung crawl fix into old gb. xor the firstip into the doledb key this time. seems to avoid all collisions now so we don't overwrite nodes in the doledb tree.
Matt Wells
2016-11-07 09:11:27 -08:00
3d248732d0 fix to shut up app checker.
Matt
2016-11-04 17:28:26 -06:00
c0b2cdb60a hide the verify disk writes parm, seems to be causing cores when activated. and shouldn't really need to be used. is for debugging disk issues.
Matt
2016-11-04 17:09:15 -06:00
4b889e0ddd quick fix
Matt Wells
2016-11-01 11:39:15 -07:00
93d5752ab7 import fix for jobs hanging from pro.
Matt Wells
2016-11-01 11:18:40 -07:00
1542f7c57f do not save doledb on exit to prevent corruption being propagated and in case we change default spider priorities in the url filters code which could cause hung jobs.
Matt Wells
2016-10-24 14:23:13 -07:00
d5bc775ce5 fix errCount bug in url filters from errCount wrapping over to negative numbers.
Matt
2016-09-20 17:02:51 -06:00
f23daf3e5e more fixes for 'zeroing out' code.
Matt Wells
2016-09-14 10:32:01 -07:00
93f878fbca wow this thing is really being persnickety
Matt Wells
2016-09-14 10:18:48 -07:00
5d54fde09c more zero-out fixes
Matt Wells
2016-09-14 10:10:03 -07:00
16ec2a6963 fix stack related bug.
Matt Wells
2016-09-14 10:00:18 -07:00
4ac26f9c9b rotate log file at 1GB.
Matt Wells
2016-09-14 09:37:23 -07:00
8a5891f2ec fix memory leak in Parms.cpp
Matt
2016-09-12 13:41:07 -06:00
e9f0d067be zero out docs that we do not process to save disk space. record content hash in 'zeroed out' content so we preserve it for deduping.
Matt Wells
2016-08-30 11:06:59 -07:00
616bfeea86 show corrupt collection numbers for spiderdb corrupted recs.
Matt Wells
2016-08-15 12:55:22 -07:00
7e77b900a6 fix a core from corrupted spider request in doledb in rdb::reclaimMemFromDeletedTreeNodes()
Matt Wells
2016-06-21 11:54:09 -07:00
b03571c1a1 fix infinite loop from corrupt spider request
Matt Wells
2016-06-06 08:06:20 -07:00
14e2f1f579 make trash subdir in case missing.
Matt Wells
2016-05-30 19:37:03 -07:00
bd5618f1b7 fix spider request corruption detection in doledb
Matt Wells
2016-05-28 08:56:36 -07:00
fe21fb1cc6 fix spider detection of corrupted requests. fix deduping so it doesn't core on docid based spider requests.
Matt Wells
2016-05-28 08:40:59 -07:00
7bd2344f41 increase regular page download timeout from 30 seconds to 60 seconds to accommodate some slower websites.
Matt Wells
2016-05-18 10:05:02 -07:00
494f7ca645 reduce diffbot timeout from 18000 secs to 240 secs
Matt Wells
2016-05-18 09:55:15 -07:00
b0e015b97d fix dns lookup bug that was causing us to get incorrect ips sometimes.
Matt Wells
2016-05-17 11:57:21 -07:00
77dc78d122 fix empty file bug again
Matt Wells
2016-05-13 14:49:54 -07:00
39e621f655 trash files of length 0 that are holding up a merge. if we can't merge files we end up stockpiling them and things get slow fast.
Matt Wells
2016-05-13 13:21:43 -07:00
dfca68ec46 Merge pull request #99 from vonbetz/regexdebug
Gigablast
2016-05-13 10:06:24 -06:00
c89d963d46 Merge pull request #100 from vonbetz/tldfixes
Gigablast
2016-05-13 10:01:01 -06:00
d4dc85bf18 Fixes for new tlds. They can now contain '-' and numbers. Fix punycode url encoding: set max length before encoding each url chunk.
Zak Betz
2016-05-12 16:04:21 -06:00
0ee1e6164c Add input validation to regexes before crawlbot collections are created. Add ./gb egrep command to test regexes.
Zak Betz
2016-05-11 14:03:45 -06:00
a246289238 Merge pull request #96 from vonbetz/diffbot-testing
Gigablast
2016-05-11 09:32:21 -06:00
89f3344be5 Updated tld list with the most current list.
Zak Betz
2016-05-04 11:17:51 -06:00
8c3eacc338 fix cores from XmlDoc::getLinkInfo1() returning -1 because of its call to getFirstIp() presumably.
Matt Wells
2016-04-18 09:55:00 -07:00
c5de65a78a more core dump fixes concerning -1 being returned for XmlDoc::getLinkInfo1()
Matt Wells
2016-04-17 18:50:23 -07:00
65856e3b6a tell malloc to trim 100MB at a time to prevent kernel destroying the cpu by defragging/compacting memory. fix core in Title.cpp.
Matt Wells
2016-04-17 09:16:03 -07:00
95a3a261db fix so host 8 doesn't jam things up so much. host 8 was too busy merging a large spiderdb for blouartinfo and unable to tend to other smaller merges and therefore the # of files was getting out of hand causing slowdowns. so merge spiderdb much less aggressively.
Matt
2016-04-15 13:34:41 -06:00
165f724fd7 thanks to isj for the punycode fixes
Matt
2016-04-06 11:10:42 -06:00
74cfde3e53 fix calling doneSendingNotification() with a just-freed memory ptr bug.
Matt
2016-04-06 10:53:08 -06:00
d2983747b8 fix sending back reply that has some stuff on the stack that it references when XmlDoc::getMsg20Reply() returns. thanks to isj for this fix.
Matt
2016-04-06 10:43:43 -06:00
8891100c2a fix add url on root page to set collnum properly. fix Summary::getBestWindow() underrun bug.
Matt
2016-04-06 10:31:04 -06:00
70ca2fe48c update ./gb -h desc for ./gb inject.
Matt
2016-04-05 21:06:38 -06:00
f5d0045b43 Merge pull request #82 from vonbetz/testing
Gigablast
2016-03-29 13:12:56 -06:00
cf7ec13de6 Fix international domain printing bug.
Zak Betz
2016-03-29 12:41:34 -06:00
33e76af1d1 Merge branch 'testing'
Matt
2016-03-29 04:11:30 -06:00
816d69b34c a lot of bug fixes thanks to isj.
Matt
2016-03-29 04:08:17 -06:00
5072e851b7 fix misspelling
Matt
2016-03-28 17:26:40 -06:00
5935619eb2 hack on parentUrlDocId to the json object dump of diffbot objects.
Matt
2016-03-28 12:39:48 -06:00
cab6d5c519 fix keysize==8 bug in keycmp
Matt
2016-03-28 09:17:01 -06:00
b65a16caee Merge branch 'diffbot-testing' into testing
Matt
2016-03-22 16:25:21 -06:00
3c743a7d0e allow more docids to be downloaded/served in search results.
Matt Wells
2016-03-22 15:24:33 -07:00
04a8433256 show gbssParentDocId in status doc for children docs, like diffbot object docs.
Matt Wells
2016-03-22 09:00:10 -07:00
483d69d7f7 added httprequest debug line
Matt Wells
2016-03-21 14:46:10 -07:00
136d23816c fix hashbang properly
Matt Wells
2016-03-21 09:29:55 -07:00
48398d0cd7 Merge branch 'diffbot-testing' into testing
Matt
2016-03-20 23:14:26 -06:00
136b8842db fix more data corruption bugs. hopefully will dump out all the collections this time and not leave any in the tree, otherwise, especially if there are a lot left behind, they get corrupted.
Matt Wells
2016-03-20 21:04:01 -07:00
61ef806dea hash bang fix. detect more corruption. don't dump titledb and spiderdb at same time, seems to reduce corruption in rdbmem.
Matt Wells
2016-03-20 12:50:43 -07:00
fc495a5bf5 fix dump core when collection deleted while dumping
Matt Wells
2016-03-18 06:46:38 -07:00
8922b8e69c Merge branch 'diffbot-testing' into testing
Matt
2016-03-17 14:31:22 -06:00
56bde4c3ef fix the data corruption fix
Matt Wells
2016-03-17 13:22:56 -07:00
8bc653c31c after dump completes scan tree to ensure all nodes reference secondary mem ptr so they don't get their data overwritten.
Matt Wells
2016-03-17 10:09:49 -07:00
0caf345850 if running ./gb start and another gb is already bound on the port then quickly exit(0) and have the bash keep alive loop exit the loop based on that return value. we can't use ./cleanexit file because it doesn't get removed and will mess up the main process that is running.
Matt Wells
2016-03-16 16:56:48 -07:00
36fdbf2f5a rename log files in the gb main.cpp code not in the bash loop. do not rename the log file if failed to start gb because socket was already bound. prevents us from double starts moving the log file, which is annoying.
Matt Wells
2016-03-16 16:08:08 -07:00
a2e8a3a1fd use ./cleanexit file to ensure gb doesn't restart after a graceful exit in the bash keep alive loop.
Matt Wells
2016-03-16 14:57:19 -07:00
7396e57660 show docids of corrupted title recs found. show key range of each dump to disk. fix 'sentToDiffbot' bug for unchanged docs in status docs. make sure firstKeyInQueue is set properly from current key, so reset list ptr before doing that in RdbDump.cpp.
Matt Wells
2016-03-16 13:53:08 -07:00
5e8c47adfd Merge branch 'diffbot-testing' into testing
Matt
2016-03-16 01:14:37 -06:00
1faff50f5a if msg22a never called to get docid, then error out.
Matt Wells
2016-03-16 00:14:02 -07:00
c7c8c9e5ad Merge branch 'diffbot-testing' into testing
Matt
2016-03-16 00:54:49 -06:00
0b5f417349 if old title rec was corrupted we would get a random docid when re-spidering the url causing some chaos. now things should return to normal and we should overwrite the corrupted titlerec on the next spidering. also, no longer do robots.txt titlerec lookups. silly.
Matt Wells
2016-03-15 23:26:57 -07:00
58993dbbf9 do not allow crawlbot seeds to be deduped out
Matt Wells
2016-03-15 20:42:28 -07:00
bf45db6f48 Merge branch 'diffbot-testing' into testing
Matt Wells
2016-03-15 15:55:55 -07:00
8a65d21371 fix the source of lots of corruption in spiderdb and titledb. rdbmem.cpp was storing in secondary mem which got reset when dump completed. also do not add keys that are in collnum and key range of list currently being dumped, return ETRYAGAIN. added verify writes parm. clean out tree of titledb and spiderdb corruption on startup.
Matt Wells
2016-03-15 15:54:12 -07:00
0fdbaa4196 makefile optimizations
Matt Wells
2016-03-14 16:34:24 -07:00
0dbc304bbf fix to allow us to gather ip-only url outlinks again
Matt
2016-03-14 10:56:33 -06:00
2c167aada7 fix redirect to self bug that requires setting cookie
Matt
2016-03-14 10:33:05 -06:00
d6fe684b99 fix another core caused by deleted coll
Matt Wells
2016-03-07 10:20:25 -08:00
d4e16a4dab pass a crawlbotnightly smoke
Matt Wells
2016-03-04 13:14:28 -08:00
e75d80abbe ignore meta redirect tags in html comment tags.
Matt Wells
2016-02-22 12:41:03 -08:00
412b04bbd4 fix neverending crawl rounds by only trying each url once per round. updated url filters.
Matt Wells
2016-02-22 09:28:46 -08:00
da9949f462 try to fix a couple more core dumps.
Matt Wells
2016-02-19 08:54:48 -08:00
c7696a69eb fix core from a federated query and null msg20
Matt Wells
2016-02-18 10:53:20 -08:00
f649944573 if spidered time is in future, consider the spiderreply corrupt and ignore it. if you set back the OS clock then you might end up ignoring some spider replies but hopefully it won't be such a big deal.
Matt Wells
2016-02-16 12:25:49 -08:00
8748cd06ac Merge pull request #73 from AppChecker/master
Gigablast
2016-02-13 23:00:37 -07:00
f11595efc3 fix core dump from deleting an active/dumping collection
Matt Wells
2016-02-12 16:54:03 -08:00
bf4bdd6bfd Merge branch 'diffbot-testing' into testing
Matt Wells
2016-02-10 09:50:53 -08:00
e68406f073 fix core in posdbtable from docid of 0. no idea why docid was 0, but why core?
Matt Wells
2016-02-09 22:43:09 -08:00
e376b97814 let's generalize it. if a redirect sets cookies then follow it through, don't stop in the middle because we think it is 'simplified'.
Matt Wells
2016-02-09 13:47:12 -08:00