f9d607665b
forgot comma.
2017-02-24 10:01:26 -08:00
6eb802054b
fix some corruption in spider data after deleting a collection.
2017-02-24 09:59:28 -08:00
88120414b0
import more stuck job fixes.
2016-11-16 10:59:55 -08:00
f1b6f73719
empty &seeds= fix to not reset seeds
2016-11-16 10:34:25 -08:00
53ee1039b8
import final hung crawl fix into old gb.
xor the firstip into the doledb key this time.
seems to avoid all collisions now so we don't
overwrite nodes in the doledb tree.
2016-11-07 09:11:27 -08:00
4b889e0ddd
quick fix
2016-11-01 11:39:15 -07:00
93d5752ab7
import fix for jobs hanging from pro.
2016-11-01 11:18:40 -07:00
1542f7c57f
do not save doledb on exit to prevent corruption
being propagated and in case we change default spider priorities
in the url filters code which could cause hung jobs.
2016-10-24 14:23:13 -07:00
d5bc775ce5
fix errCount bug in url filters from
errCount wrapping over to negative numbers.
2016-09-20 17:02:51 -06:00
f23daf3e5e
more fixes for 'zeroing out' code.
2016-09-14 10:32:01 -07:00
93f878fbca
wow this thing is really being persnickety
2016-09-14 10:18:48 -07:00
5d54fde09c
more zero-out fixes
2016-09-14 10:10:03 -07:00
16ec2a6963
fix stack related bug.
2016-09-14 10:00:18 -07:00
4ac26f9c9b
rotate log file at 1GB.
2016-09-14 09:37:23 -07:00
8a5891f2ec
fix memory leak in Parms.cpp
2016-09-12 13:41:07 -06:00
e9f0d067be
zero out docs that we do not process to save disk space.
record content hash in 'zeroed out' content so we
preserve it for deduping.
2016-08-30 11:06:59 -07:00
616bfeea86
show corrupt collection numbers for spiderdb corrupted recs.
2016-08-15 12:55:22 -07:00
7e77b900a6
fix a core from corrupted spider request
in doledb in rdb::reclaimMemFromDeletedTreeNodes()
2016-06-21 11:54:09 -07:00
b03571c1a1
fix infinite loop from corrupt spider request
2016-06-06 08:06:20 -07:00
14e2f1f579
make trash subdir in case missing.
2016-05-30 19:37:03 -07:00
bd5618f1b7
fix spider request corruption detection in doledb
2016-05-28 08:56:36 -07:00
fe21fb1cc6
fix spider detection of corrupted requests.
fix deduping so it doesn't core on docid based
spider requests.
2016-05-28 08:40:59 -07:00
7bd2344f41
increase regular page download timeout from 30
seconds to 60 seconds to accommodate some slower websites.
2016-05-18 10:05:02 -07:00
494f7ca645
reduce diffbot timeout from 18000 secs to 240 secs
2016-05-18 09:55:15 -07:00
b0e015b97d
fix dns lookup bug that was causing us
to get incorrect ips sometimes.
2016-05-17 11:57:21 -07:00
77dc78d122
fix empty file bug again
2016-05-13 14:49:54 -07:00
39e621f655
trash files of length 0 that are holding up a merge.
if we can't merge files we end up stockpiling them
and things get slow fast.
2016-05-13 13:21:43 -07:00
dfca68ec46
Merge pull request #99 from vonbetz/regexdebug
Regexdebug
2016-05-13 10:06:24 -06:00
c89d963d46
Merge pull request #100 from vonbetz/tldfixes
Fixes for new tlds.
2016-05-13 10:01:01 -06:00
d4dc85bf18
Fixes for new tlds.
They can now contain '-' and numbers.
Fix punycode url encoding: set max length before encoding each url chunk.
2016-05-12 16:18:53 -06:00
0074f2ec73
Merge branch 'diffbot-testing' of https://github.com/gigablast/open-source-search-engine into regexdebug
2016-05-11 14:07:41 -06:00
0ee1e6164c
Add input validation to regexes before crawlbot collections are created.
Add ./gb egrep command to test regexes.
2016-05-11 14:03:45 -06:00
a246289238
Merge pull request #96 from vonbetz/diffbot-testing
Updated tld list with the most current list.
2016-05-11 09:32:21 -06:00
89f3344be5
Updated tld list with the most current list.
Should fix getDomainFast returning NULL and causing msg22 lookups to fail.
List taken from
https://data.iana.org/TLD/tlds-alpha-by-domain.txt
2016-05-04 11:17:51 -06:00
8c3eacc338
fix cores from XmlDoc::getLinkInfo1() returning -1
because of its call to getFirstIp() presumably.
2016-04-18 09:55:00 -07:00
c5de65a78a
more core dump fixes concerning -1 being returned
for XmlDoc::getLinkInfo1()
2016-04-17 18:50:23 -07:00
65856e3b6a
tell malloc to trim 100MB at a time to prevent kernel
destroying the cpu by defragging/compacting memory. fix
core in Title.cpp.
2016-04-17 09:16:03 -07:00
95a3a261db
fix so host 8 doesn't jam things up so much.
host 8 was too busy merging a large spiderdb for
blouartinfo and unable to tend to other smaller merges
and therefore the # of files was getting out of hand
causing slowdowns. so merge spiderdb much less aggressively.
2016-04-15 13:34:41 -06:00
165f724fd7
thanks to isj for the puny code fixes
2016-04-06 11:10:42 -06:00
74cfde3e53
fix calling doneSendingNotification() with a
just-freed memory ptr bug.
2016-04-06 10:53:08 -06:00
d2983747b8
fix sending back reply that has some stuff on the stack
that it references when XmlDoc::getMsg20Reply() returns.
thanks to isj for this fix.
2016-04-06 10:43:43 -06:00
8891100c2a
fix add url on root page to set collnum properly.
fix Summary::getBestWindow() underrun bug.
2016-04-06 10:31:04 -06:00
70ca2fe48c
update ./gb -h desc for ./gb inject.
2016-04-05 21:06:38 -06:00
f5d0045b43
Merge pull request #82 from vonbetz/testing
Fix for сацминэнергорф --> сацминэнерго.рф in getDisplayUrl(...)
2016-03-29 13:12:56 -06:00
3c140b87aa
Merge branch 'testing' of https://github.com/gigablast/open-source-search-engine into testing
2016-03-29 12:42:05 -06:00
cf7ec13de6
Fix international domain printing bug.
2016-03-29 12:41:34 -06:00
33e76af1d1
Merge branch 'testing'
2016-03-29 04:11:30 -06:00
816d69b34c
a lot of bug fixes thanks to isj.
2016-03-29 04:08:17 -06:00
5072e851b7
fix misspelling
2016-03-28 17:26:40 -06:00
5935619eb2
hack on parentUrlDocId to the json object dump
of diffbot objects.
2016-03-28 12:39:48 -06:00