3832 Commits

Author SHA1 Message Date
f9d607665b forgot comma. 2017-02-24 10:01:26 -08:00
6eb802054b fix some corruption in spider data after deleting a collection. 2017-02-24 09:59:28 -08:00
88120414b0 import more stuck job fixes. 2016-11-16 10:59:55 -08:00
f1b6f73719 empty &seeds= fix to not reset seeds 2016-11-16 10:34:25 -08:00
53ee1039b8 import final hung crawl fix into old gb.
xor the firstip into the doledb key this time.
seems to avoid all collisions now so we don't
overwrite nodes in the doledb tree.
2016-11-07 09:11:27 -08:00
4b889e0ddd quick fix 2016-11-01 11:39:15 -07:00
93d5752ab7 import fix for jobs hanging from pro. 2016-11-01 11:18:40 -07:00
1542f7c57f do not save doledb on exit to prevent corruption
being propagated and in case we change default spider priorities
in the url filters code which could cause hung jobs.
2016-10-24 14:23:13 -07:00
d5bc775ce5 fix errCount bug in url filters from
errCount wrapping over to negative numbers.
2016-09-20 17:02:51 -06:00
f23daf3e5e more fixes for 'zeroing out' code. 2016-09-14 10:32:01 -07:00
93f878fbca wow this thing is really being persnickety 2016-09-14 10:18:48 -07:00
5d54fde09c more zero-out fixes 2016-09-14 10:10:03 -07:00
16ec2a6963 fix stack related bug. 2016-09-14 10:00:18 -07:00
4ac26f9c9b rotate log file at 1GB. 2016-09-14 09:37:23 -07:00
8a5891f2ec fix memory leak in Parms.cpp 2016-09-12 13:41:07 -06:00
e9f0d067be zero out docs that we do not process to save disk space.
record content hash in 'zeroed out' content so we
preserve it for deduping.
2016-08-30 11:06:59 -07:00
616bfeea86 show corrupt collection numbers for spiderdb corrupted recs. 2016-08-15 12:55:22 -07:00
7e77b900a6 fix a core from corrupted spider request
in doledb in rdb::reclaimMemFromDeletedTreeNodes()
2016-06-21 11:54:09 -07:00
b03571c1a1 fix infinite loop from corrupt spider request 2016-06-06 08:06:20 -07:00
14e2f1f579 make trash subdir in case missing. 2016-05-30 19:37:03 -07:00
bd5618f1b7 fix spider request corruption detection in doledb 2016-05-28 08:56:36 -07:00
fe21fb1cc6 fix spider detection of corrupted requests.
fix deduping so it doesn't core on docid based
spider requests.
2016-05-28 08:40:59 -07:00
7bd2344f41 increase regular page download timeout from 30
seconds to 60 seconds to accommodate some slower websites.
2016-05-18 10:05:02 -07:00
494f7ca645 reduce diffbot timeout from 18000 secs to 240 secs 2016-05-18 09:55:15 -07:00
b0e015b97d fix dns lookup bug that was causing us
to get incorrect ips sometimes.
2016-05-17 11:57:21 -07:00
77dc78d122 fix empty file bug again 2016-05-13 14:49:54 -07:00
39e621f655 trash files of length 0 that are holding up a merge.
if we can't merge files we end up stockpiling them
and things get slow fast.
2016-05-13 13:21:43 -07:00
dfca68ec46 Merge pull request from vonbetz/regexdebug
Regexdebug
2016-05-13 10:06:24 -06:00
c89d963d46 Merge pull request from vonbetz/tldfixes
Fixes for new tlds.
2016-05-13 10:01:01 -06:00
d4dc85bf18 Fixes for new tlds.
They can now contain '-' and numbers.
Fix punycode url encoding: set max length before encoding each url chunk.
2016-05-12 16:18:53 -06:00
0074f2ec73 Merge branch 'diffbot-testing' of https://github.com/gigablast/open-source-search-engine into regexdebug 2016-05-11 14:07:41 -06:00
0ee1e6164c Add input validation to regexes before crawlbot collections are created.
Add ./gb egrep command to test regexes.
2016-05-11 14:03:45 -06:00
a246289238 Merge pull request from vonbetz/diffbot-testing
Updated tld list with the most current list.
2016-05-11 09:32:21 -06:00
89f3344be5 Updated tld list with the most current list.
Should fix getDomainFast returning NULL and causing msg22 lookups to fail.
List taken from
https://data.iana.org/TLD/tlds-alpha-by-domain.txt
2016-05-04 11:17:51 -06:00
8c3eacc338 fix cores from XmlDoc::getLinkInfo1() returning -1
because of its call to getFirstIp() presumably.
2016-04-18 09:55:00 -07:00
c5de65a78a more core dump fixes concerning -1 being returned
for XmlDoc::getLinkInfo1()
2016-04-17 18:50:23 -07:00
65856e3b6a tell malloc to trim 100MB at a time to prevent kernel
destroying the cpu by defragging/compacting memory. fix
core in Title.cpp.
2016-04-17 09:16:03 -07:00
95a3a261db fix so host 8 doesn't jam things up so much.
host 8 was too busy merging a large spiderdb for
blouartinfo and unable to tend to other smaller merges
and therefore the # of files was getting out of hand
causing slowdowns. so merge spiderdb much less aggressively.
2016-04-15 13:34:41 -06:00
165f724fd7 thanks to isj for the punycode fixes 2016-04-06 11:10:42 -06:00
74cfde3e53 fix calling doneSendingNotification() with a
just-freed memory ptr bug.
2016-04-06 10:53:08 -06:00
d2983747b8 fix sending back reply that has some stuff on the stack
that it references when XmlDoc::getMsg20Reply() returns.
thanks to isj for this fix.
2016-04-06 10:43:43 -06:00
8891100c2a fix add url on root page to set collnum properly.
fix Summary::getBestWindow() underrun bug.
2016-04-06 10:31:04 -06:00
70ca2fe48c update ./gb -h desc for ./gb inject. 2016-04-05 21:06:38 -06:00
f5d0045b43 Merge pull request from vonbetz/testing
Fix for сацминэнергорф --> сацминэнерго.рф in getDisplayUrl(...)
2016-03-29 13:12:56 -06:00
3c140b87aa Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing 2016-03-29 12:42:05 -06:00
cf7ec13de6 Fix international domain printing bug. 2016-03-29 12:41:34 -06:00
33e76af1d1 Merge branch 'testing' 2016-03-29 04:11:30 -06:00
816d69b34c a lot of bug fixes thanks to isj. 2016-03-29 04:08:17 -06:00
5072e851b7 fix misspelling 2016-03-28 17:26:40 -06:00
5935619eb2 hack on parentUrlDocId to the json object dump
of diffbot objects.
2016-03-28 12:39:48 -06:00