40f373c9e0
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-30 13:11:48 -08:00
95a47a776e
image updates
Matt Wells
2014-01-30 13:11:26 -08:00
8bdb9d1a3e
doc updates per john on how we dedup
Matt Wells
2014-01-30 10:57:49 -08:00
8876dae984
added and fixed support for <link href=xxx rel=canonical>. treat those as simplified meta redirects. updated spider dedup documentation in developer.html file.
Matt Wells
2014-01-30 10:37:59 -08:00
6a45e42128
added ability to treat <link xyz.com rel=canonical> as meta redirects. should help us dedup. added a function to do looser deduping of spider pages although currently not enabled, we are still using the more strict one. added documentation on how we dedup to developer.html for jon to take a look at.
Matt Wells
2014-01-30 10:04:09 -08:00
6af9441818
change deduping logic to be first come first served, but site rank trumps. fixed bug from fix before.
Matt Wells
2014-01-29 16:14:42 -08:00
c92a9a4158
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-29 15:56:39 -08:00
b40f393f4c
fix a couple cores related to deleting collections in progress. support termlist dump with terms containing colons.
Matt Wells
2014-01-29 15:56:07 -08:00
1fb1e2af7e
fixed form input. fixed page parser submission. added ability to dump out termlist from posdb like type:json (with a colon in it) to try to debug msft seeing html in csv output.
Matt Wells
2014-01-29 14:10:08 -08:00
8aef2ba8a0
take out potentially bad robots.txt filter compression logic.
Matt Wells
2014-01-28 18:26:16 -08:00
57953f3b1b
ignore empty products (i.e. {}) when tokenizing diffbot reply
Matt Wells
2014-01-28 15:30:16 -08:00
53c2df1be1
fixed core
Matt Wells
2014-01-28 15:20:37 -08:00
7b424a6236
always use kstart. fixed restrictDomain bug of not saving parm. sped up csv download around 2x.
Matt Wells
2014-01-28 14:37:21 -08:00
239811b024
take out confusing function no longer used
Matt Wells
2014-01-28 11:10:59 -08:00
8f39c41962
just print out cached page straight, it is just the diffbot json reply pretty much verbatim, except for being tokenized. should no longer escape forward slashes.
Matt Wells
2014-01-28 11:04:53 -08:00
e9fcb9ad06
started adding redownload logic.
Matt Wells
2014-01-28 09:46:58 -08:00
a9909e189f
fix delete collection api
Matt Wells
2014-01-27 15:28:26 -08:00
726090be83
contains a hack fix to fix things at startup but now it is commented out.
Matt Wells
2014-01-25 15:07:47 -08:00
1a9a5e53a7
show if coll has urls ready to spider in html page
Matt Wells
2014-01-25 14:49:55 -08:00
268a244ee8
fix up round incrementing logic.
Matt Wells
2014-01-25 14:35:41 -08:00
3a6a271dd9
make crawl sync bug fixes. fix Puz crawl from dying out on host 9 because spider reply did not resuscitate waiting tree for its ip. fix mike's zola crawl with a repeat of 3 days from not incrementing the round because it had maxrounds 0, which means to ignore... assume 0 means to ignore now. send out 0xc1 crawl info requests to even dead hosts so we can at least use their last known good info.
Matt Wells
2014-01-25 13:47:03 -08:00
3bdbf23f13
fix core from double free
Matt Wells
2014-01-25 11:21:15 -08:00
e3f769dffe
fixes for sudden revitalization of dead crawls.
Matt Wells
2014-01-25 11:03:15 -08:00
c207c3c456
fixed core
Matt Wells
2014-01-25 08:36:09 -08:00
bc78b21dc6
for json docs only give them a single xmlnode in the Xml.cpp class. hopefully will not get "malformed sections" error anymore. i think that was a result of the json having html tags in it and making unnested html structures which the sections class did not like. TODO: probably do this for CT_TEXT etc. as well.
Matt Wells
2014-01-25 08:17:38 -08:00
4d0a09f1e4
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-25 07:02:42 -08:00
99c6390a69
fix a core. do not get sections of non-html or non-text documents. was causing EMALFORMED sections error on diffbot json.
Matt Wells
2014-01-25 07:02:14 -08:00
29a574d85a
if indexing diffbot url and it had error, do NOT add a spider reply.
Matt Wells
2014-01-25 07:01:26 -08:00
321fc90ff6
fix some cores. NOTE: emails disabled here... need to fix.
Matt Wells
2014-01-24 12:07:28 -08:00
27b6ceffa8
fix bug of sending notification email twice for really really tiny jobs.
Matt Wells
2014-01-23 21:22:39 -08:00
c4a6ad1145
update "this round" counts to at least the total counts if round # is 0 so we do not double spider everyone's jobs! put a check in rebalance loop to see if gb is exiting so we don't get into an infinite loop. this should be in redmine now...
Matt Wells
2014-01-23 18:22:13 -08:00
77ca55f712
fix send email notification bug. increase unlink threads from 1 to 30. seemed to be going too slow after doing a ddump with like 3000 collections. it was unlinking like 1 file per sec.
Matt Wells
2014-01-23 16:59:55 -08:00
dd663eb9f7
fix round based spidering some more
Matt Wells
2014-01-23 15:03:37 -08:00
edb01b0abb
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-23 13:24:20 -08:00
313cffc322
had to add per round page and process counts in case they had maxToCrawl and respider frequencies set. simplified round logic in Spider.cpp.
Matt Wells
2014-01-23 13:23:09 -08:00
4f7b00c6ce
fix core on broken pipe when calling sendChunk() and socket in streaming mode.
Matt Wells
2014-01-23 11:34:49 -08:00
9432ae870d
fix bug to pass jenkins.
Matt Wells
2014-01-23 09:38:15 -08:00
26c76a3240
fixed bug of waiting trees not saving.
Matt Wells
2014-01-23 01:04:24 -08:00
26b98a591a
fixed bug of not saving waiting trees! took out misleading Collectiondb::getNumRecs() func.! bad
Matt Wells
2014-01-23 01:02:11 -08:00
e351cb9939
free spidercolls on exit
Matt Wells
2014-01-22 23:52:23 -08:00
bc35b7d0ec
fix pagecrawlbot.cpp to support &c=token-name. cleanup mem at process exit better.
Matt Wells
2014-01-22 23:40:38 -08:00
df063dbdf2
fix a core
Matt Wells
2014-01-22 22:26:50 -08:00
488e8c8e2f
pause crawl if diffbot says token is expired.
Matt Wells
2014-01-22 20:56:52 -08:00
7cd746f567
fix msge0 msg0 overload in sockets table when all diffbot replies timed out at once and released thousands of spiders.
Matt Wells
2014-01-22 20:34:55 -08:00
8a9b1f7a19
added diffbot retry rules. added maxTotalSpiders parm for all colls to follow. tried to fix msg 0x00 socket jam up.
Matt Wells
2014-01-22 19:57:38 -08:00
061bf70a51
show EXACT diffbot url used in logs for easier replication
Matt Wells
2014-01-22 18:25:18 -08:00
5f890f5d4f
minor doc update
Matt Wells
2014-01-22 15:52:04 -08:00
25ec2b23cb
added gb.pem to required files list
Matt Wells
2014-01-22 14:47:20 -08:00
034de5039f
ignore tagdb corrupt tags in xmldoc.cpp. fix ip -1 bug when adding to waiting tree, which would prevent populateWaitingTreeFromSpiderdb() from continuing and freeze things up.
Matt Wells
2014-01-22 14:36:05 -08:00
a4be05d8d0
more shard rebalancer fixes
Matt Wells
2014-01-22 00:44:33 -08:00
066d910934
try to fix rebalancing some more.
Matt Wells
2014-01-21 22:39:01 -08:00
31cb71214c
more rdbtree fixes when invalid collections are in there
Matt Wells
2014-01-21 20:00:34 -08:00
443bb26f01
disk page cache back on
Matt Wells
2014-01-21 19:03:47 -08:00
33c5d9c07f
a lot of times rdb tree has invalid collection numbers in it so fix our counting algo in case the collection rec no longer exists!
Matt Wells
2014-01-21 19:01:44 -08:00
45cb5c9a0c
fix bugs to try to get sharding working on crawlbot today
Matt Wells
2014-01-21 13:58:21 -08:00
7065b0ae0c
fixed oops
Matt Wells
2014-01-21 13:13:16 -08:00
dba382f7f7
added max cpu merge threads parm and defaulted to 10 up from 2 for better disk reading latencies.
Matt Wells
2014-01-21 13:11:53 -08:00
9354d06493
menu updates.
Matt Wells
2014-01-21 13:01:37 -08:00
8d5e1cb547
added url download support
Matt Wells
2014-01-20 23:17:04 -08:00
41cdfcef96
inc spider limits in various places
Matt Wells
2014-01-20 18:51:15 -08:00
946a683e39
quite a few spider fixes
Matt Wells
2014-01-20 16:45:27 -08:00
5c86d8a122
simplified spiderdb.cpp scanSpiderdb() by breaking up into 4 functions. evalIpLoop(), readSpiderdbList(), ...
Matt Wells
2014-01-19 22:18:37 -08:00
e9bbc16a9f
took out pagecount table. just hafta scan twice i think because caching counts gets complicated because of adding duplicate injection requests!
Matt Wells
2014-01-19 20:34:38 -08:00
58d0c444ac
fixes for the global index quota system
Matt Wells
2014-01-19 19:38:23 -08:00
089d7f34a0
more spiderdb spider request fixes
Matt Wells
2014-01-19 18:00:56 -08:00
970d5b2488
formatting
Matt Wells
2014-01-19 16:40:22 -08:00
fa0e3f784f
formatting
Matt Wells
2014-01-19 15:06:02 -08:00
5c9b688f72
spiderdb fixes for injections
Matt Wells
2014-01-19 14:33:27 -08:00
99de2188e1
formatting
Matt Wells
2014-01-19 13:21:58 -08:00
04b0650301
formatting
Matt Wells
2014-01-19 12:37:37 -08:00
cd91130a6d
formatting
Matt Wells
2014-01-19 12:16:26 -08:00
ca816492b5
doc links
Matt Wells
2014-01-19 12:01:32 -08:00
b6c3ecc20e
more formatting
Matt Wells
2014-01-19 11:56:36 -08:00
471599e9e7
formatting
Matt Wells
2014-01-19 10:44:19 -08:00
e6eb9003b5
more formatting
Matt Wells
2014-01-19 01:09:38 -08:00
b755b4d581
formatting fixes
Matt Wells
2014-01-19 00:57:20 -08:00
fe3a879758
formatting changes
Matt Wells
2014-01-19 00:38:02 -08:00
36b93a1e92
minor cmdline fixes
Matt Wells
2014-01-18 21:26:59 -08:00
4606e88721
code cleanups. xmldoc::injectDoc(), and it'll add a SpiderRequest as well. better collectiondb init code.
Matt Wells
2014-01-18 21:19:26 -08:00
10f4443974
quite a few fixes to the quota system, cleanups etc.
Matt Wells
2014-01-18 16:23:13 -08:00
f3000e2763
set m_needsSave in collectionrec when parms updated
Matt Wells
2014-01-18 12:51:10 -08:00
8edfc2ce70
more collection fixes
Matt Wells
2014-01-18 12:09:33 -08:00
fa59c62264
more bug fixes associated with collections and site page counts in url filters.
Matt Wells
2014-01-18 11:54:58 -08:00
22aa13e34d
do not set indexcode to EFAKEFIRSTIP for INJECTED urls, just added urls. fix add url page to not always use 'main' collection. added reset/restart cmds to spider page.
Matt Wells
2014-01-18 11:09:30 -08:00
178af5f781
cleanup parms a bit. added diffbotApiUrl to all crawls whether custom or not, on spider controls page.
Matt Wells
2014-01-18 10:29:22 -08:00
9c1f6197eb
added indexbody control so i can turn it off for my special json global index.
Matt Wells
2014-01-18 10:04:33 -08:00
6fb602ae62
hash a little meta info still even if custom crawl
Matt Wells
2014-01-18 09:37:07 -08:00
f9d0a02dbe
test and get gbparenturl: query working.
Matt Wells
2014-01-18 09:28:58 -08:00
0be8a59e9e
hash content checksums for pages in custom crawls so we can do deduping.
Matt Wells
2014-01-17 21:42:02 -08:00
5b7170e8c6
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-17 21:07:08 -08:00
4e803210ee
tons of changes from live github on neo. lots of core fixes. took out ppthtml powerpoint convert, it hangs. dynamic rdbmap to save memory per coll. fixed disk page cache logic and brought it back.
Matt Wells
2014-01-17 21:01:43 -08:00
8c4ac3c514
Merge branch 'master' into diffbot
Matt Wells
2014-01-17 20:17:40 -08:00
bb51dd93c8
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-17 20:17:03 -08:00
403dca707c
do not hash body etc. into posdb if doing a custom diffbot crawl. saves a lot of disk space.
Matt Wells
2014-01-17 20:16:29 -08:00
116f90dba3
Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-17 18:39:34 -08:00
94740ed3a1
allow sleeps in main.cpp function
Matt Wells
2014-01-17 18:39:20 -08:00
3ec44c5b35
fix streaming mode for sending back json downloads/dumps.
Matt Wells
2014-01-17 18:28:17 -08:00
e09496e34e
fix parm updating logic.
Matt Wells
2014-01-17 17:48:45 -08:00