929ba33a5f
Revert "hash the normalized outlinks in the diffbot reply"
Matt
2015-12-02 13:04:56 -07:00
5f6872041a
Revert "fix possible core"
Matt
2015-12-02 13:04:54 -07:00
de0f995239
fix possible core
Matt
2015-12-02 12:38:52 -07:00
908dfa602a
hash the normalized outlinks in the diffbot reply as part of the page crc that we use for de-duping. that way if all pages are rendered they won't appear to be dups of one another.
Matt
2015-12-02 12:30:04 -07:00
a34bce91e5
Merge branch 'diffbot' into diffbot-testing
Matt
2015-12-02 11:48:37 -07:00
72017925fb
fix for RdbList::constrain() core
Matt Wells
2015-12-01 17:09:34 -07:00
58bb17d5ae
undo a change that didn't work.
Matt Wells
2015-12-01 09:03:59 -08:00
34b33f478a
added gb rwtest and exposed seektest and thrutest in gb -h. use -o sync when mounting ssds to avoid really slow and spiky linux file/page cache. allow launching of more than 1 non-disk thread again. should help with unlinking, intersects, etc.
Matt Wells
2015-11-30 21:29:17 -07:00
b92853ae50
update built-in gb cmd line tests for ssd performance.
Matt
2015-11-30 18:47:44 -07:00
de777e8d28
fix truncation of search results some more hopefully
Matt Wells
2015-11-30 16:33:03 -08:00
c779bdb70d
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-11-30 13:27:38 -08:00
fc4731b11c
fix a couple of cores happening on crawlbot. fix bug of a urls.csv or other streaming download being truncated because gb thinks a shard is down. even if it is down, wait for it to come back up.
Matt Wells
2015-11-30 13:26:43 -08:00
fc3ba95226
Fix host selection for downloading when nospider directives are present. It was always choosing the first host sequentially with spidering enabled. Now it looks at the other hosts in the shard selected.
Zak Betz
2015-11-29 21:36:19 -07:00
50d539ab85
make gb easier to compile by removing a dynamically sized array on the stack.
Matt
2015-11-28 23:40:20 -07:00
ada7bb8eb9
Merge branch 'diffbot-testing' into testing
Matt
2015-11-24 10:03:05 -07:00
6b2b0c7518
Merge branch 'diffbot' into diffbot-testing
Matt
2015-11-24 10:02:49 -07:00
ec5c38bab5
fix urgent merge mode bug some more? limit spiders to 5 per custom crawl coll per shard.
Matt Wells
2015-11-24 08:51:18 -08:00
add6f84b79
Merge branch 'diffbot-testing' into testing
Matt Wells
2015-11-21 10:44:14 -08:00
398225dde1
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-11-21 10:44:05 -08:00
d55932d0b6
fix spider proxy table bug that seemed to be the reason for the table getting so full. but in case it does get full again added a call to hashtablex::empty() so we don't freeze up any more.
Matt Wells
2015-11-21 10:43:23 -08:00
b3729ed214
tune spider proxy table flushing logic a bit
Matt Wells
2015-11-21 10:29:02 -08:00
425fc699f8
Merge branch 'diffbot-testing' into testing
Matt Wells
2015-11-21 10:21:04 -08:00
0964fb9715
Merge branch 'diffbot' into diffbot-testing
Matt Wells
2015-11-21 10:20:49 -08:00
3c766451d1
try to fix the proxy load balancing table logic some more. seems to not clean up after itself very well.
Matt Wells
2015-11-21 10:20:20 -08:00
a46c5b8f86
Fix anomalous link text detector to take into consideration the total number of inlinkers instead of just counting the matched link texts.
Zak Betz
2015-11-20 10:42:46 -07:00
aeaca04df3
fix bug of losing the line waiter header in linkdb.cpp for incoming msg25 requests. start to show more info in sockets table by parsing the request.
Matt
2015-11-19 19:40:30 -07:00
bdceec1796
Merge branch 'master' into testing
Matt
2015-11-19 16:24:45 -07:00
7bc27a521e
fix compiler error on 32bit arches
Matt
2015-11-19 16:24:29 -07:00
6e7e267cfb
Merge branch 'master' into testing
Matt
2015-11-19 16:14:24 -07:00
cd875f4ab9
fix empty url condition in add url.
Matt
2015-11-19 16:14:12 -07:00
b4ef9ca29f
Merge branch 'diffbot-testing' into testing
Matt
2015-11-19 16:11:54 -07:00
eb57f0a8c3
Merge branch 'master' into testing
Matt
2015-11-19 16:11:38 -07:00
0bc50deb42
Filter link text anomalies at query time. If a search result only has a few matches for a term in link text, then don't return it in search results for that query.
Zak Betz
2015-11-19 12:25:25 -07:00
68f41bd22a
debug why we don't dump core sometimes.
Matt
2015-11-18 16:11:27 -07:00
e0f4ba65c1
remove fixme log comment
Matt
2015-11-18 08:11:45 -07:00
feff30b6dc
Merge branch 'diffbot' into testing
Matt
2015-11-17 11:04:56 -07:00
b8d57dcd3a
fix bug of dumping too many files to disk and not being able to merge, and corrupting RdbBase::m_files[] array and associated arrays.
Matt Wells
2015-11-17 09:52:41 -08:00
690b4c5069
fix core from bogus url some more.
Matt
2015-11-16 12:51:18 -07:00
1a3c69af6b
fix core dump from empty url
Matt
2015-11-16 12:08:16 -07:00
296651d416
fix getLeastLoadedInShard() to only return the appropriate nospider/noquery hosts when using nospider/noquery in hosts.conf.
Matt
2015-11-16 09:53:40 -07:00
1b60cbd46e
fix core in Url.cpp
Matt
2015-11-16 09:29:08 -07:00
6e12f96aea
Merge branch 'testing'
Matt
2015-11-14 10:57:27 -07:00
9ff387a898
More fixes to prevent spider traffic from hitting hosts with nospider directive. Bug fix for msg20 lookups always being directed away from noquery hosts.
Zak Betz
2015-11-13 15:03:02 -07:00
8b84297392
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-11-11 08:27:25 -08:00
6cf6abf3d9
fix spider proxy clean up algo a little so it won't freeze up
Matt Wells
2015-11-11 08:27:09 -08:00
5f1695fab8
fix url.cpp
Matt
2015-11-10 00:29:42 -07:00
5061e5d7b5
normalize utf8 url paths into url encoded sequences.
Matt
2015-11-09 13:54:32 -07:00
80991c943f
complete merge of ia code into testing. make indexing warcs/arcs a switch in spider controls.
Matt
2015-11-09 12:46:06 -07:00
fe448173d5
Merge branch 'ia' into testing
Matt
2015-11-09 11:14:00 -07:00
37cc4f2ba8
Merge branch 'diffbot-testing' into testing
Matt
2015-11-09 11:13:42 -07:00
dbe93c2ccf
fix bug of not always dumping core?
Matt
2015-11-08 08:54:46 -07:00
3db9ae5d4d
rebuild fix
Matt Wells
2015-11-07 13:14:38 -08:00
93ec3138c5
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-11-06 13:31:02 -08:00
44e3b0ca19
try to fix spider proxy load table pruning bug.
Matt Wells
2015-11-06 13:30:42 -08:00
c1bbd0207d
Don't bias tagdb lookups to a single host; use the host with the lowest number of outstanding requests. The original reasoning was that one host would handle all lookups for a site and that lookup would remain in cache. Given that there are mega hubs like youtube and facebook, there should be as many hosts as possible handling requests for these sites, and the tagdb entries should stay in cache in all of the hosts that have the key.
Zak Betz
2015-11-04 15:37:49 -07:00
afbedba858
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt
2015-11-03 18:49:49 -08:00
c8c29db56b
fix core
Matt
2015-11-03 18:49:42 -08:00
baa817b51d
Fix load balance of msg22s to use the udp slots in pinginfo. Fix sigchild interrupting popen; when pdftohtml segfaulted, popen was hanging forever. Fix another bug when content length in http header was off by one.
Zak Betz
2015-11-03 11:51:19 -07:00
7608c5c29c
default I/O error detection to enabled so we see hosts with I/O errors in the hosts table.
Matt
2015-11-03 11:33:16 -07:00
95d70b110e
fix bug in rebuild pipeline. need to merge the files lest we max the # of files out.
Matt
2015-11-03 11:12:39 -07:00
6526accb50
Fix coredump when using add URL
Ai Lin Chia
2015-11-02 17:23:01 +01:00
08b6fa67d7
improve spider performance when we have lots of collections. fix core from corrupt titledb rec of some sort. automatically turn off profiler when you get data back for simplicity.
Matt Wells
2015-11-01 20:23:18 -08:00
ff6caf79a2
Increase time to mark item as stale in warc injector.
Zak Betz
2015-11-01 19:45:29 -07:00
cc305eb73a
fix so we can generate posdb map for headless data files.
Matt
2015-11-01 14:56:39 -08:00
23d376f6c7
fix core from a bad title rec fetch
Matt
2015-10-29 19:43:02 -07:00
aeca57e9f4
Pass in the buffer size of an injection request so that if the content length header field is bigger than the actual buffer we won't index random memory. Fixes bug with truncated warc captures.
Zak Betz
2015-10-28 00:38:08 -06:00
f7bb617b85
Fixes for bad content lengths when injecting warcs.
Zak Betz
2015-10-26 22:15:03 -06:00
18e0b9ea9c
Fix warc injection so that pdfs, xls, ps, docs work. Crank up max warc rec size to 5MB because pdfs are rarely < 1MB.
Zak Betz
2015-10-25 23:09:43 -06:00
66145e4396
fix core when exiting while merging
Matt Wells
2015-10-24 12:50:57 -07:00
776b94396e
a new ban msg for http status 503
Matt Wells
2015-10-22 13:23:02 -07:00
488db03f60
do not send summary requests to non queryable hosts
Matt
2015-10-22 11:46:13 -06:00
998c25e29b
spider proxy fixes for negative ports
Matt Wells
2015-10-21 15:32:58 -07:00
b2af4a00ae
remove old code preventing proxies from being passed to diffbot
Matt Wells
2015-10-21 14:46:38 -07:00
5f965c2c9a
reset proxy table every hour
Matt Wells
2015-10-21 13:30:48 -07:00
5241f2e1c7
Fix double call of gotSummary when computing facets in msg40. Fixes missing results on page > 1 when searching for facets.
Zak Betz
2015-10-20 17:21:37 -06:00
51d68c4b3d
pass proxy info back to diffbot
Matt
2015-10-20 15:53:16 -06:00
2d8c84b29c
fix bug of not shutting down right away
Matt Wells
2015-10-20 13:26:24 -07:00
b0b716010e
turn off proxyauth stuff for now
Matt Wells
2015-10-20 13:06:59 -07:00
771f4d7799
Merge branch 'diffbot-testing' into diffbot
Matt Wells
2015-10-20 11:48:44 -07:00
928511f036
Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing
Matt Wells
2015-10-20 09:45:15 -07:00
b9451a1f6f
fix token expired bug
Matt Wells
2015-10-20 09:44:50 -07:00
925fea29f4
Bug fix for search with facets with s=N | N > 0. Make warc injector more resilient to advancedsearch.php failure.
Zak Betz
2015-10-19 18:28:15 -06:00
e6d2cb5962
backwards compatible fix
Matt Wells
2015-10-19 16:12:41 -06:00
3afd768a32
make rel no follow a separate switch, but still just use the robots.txt switch for diffbot crawls.
Matt Wells
2015-10-19 15:34:57 -06:00
2df573acd8
enable diffbot proxyauth stuff for http urls only
Matt
2015-10-19 10:18:20 -06:00
667e65ce01
Progress bar for warc injector.
Zak Betz
2015-10-19 10:08:04 -06:00
5d07e24c01
use rel no follow switch support.
Matt
2015-10-19 10:05:46 -06:00
ea139a65e6
Warc stream busy loop fixes. Load balance msg22 to the one with the least outstanding requests.
Zak Betz
2015-10-15 22:30:07 -06:00
75b72cc233
fix add url seg fault
Matt
2015-10-14 13:57:47 -06:00
e57e3481b4
fix innerloop strangeness when counting keys in buckets
Matt
2015-10-14 13:52:42 -06:00
3e19d43aa5
fix core
Matt
2015-10-14 12:03:12 -06:00
a4901431be
a couple little fixes to pass smokes
Matt
2015-10-14 11:53:05 -06:00
c37ab2697e
Merge branch 'ia' into testing
Matt
2015-10-12 10:40:16 -06:00