95a47a776eimage updates
Matt Wells
2014-01-30 13:11:26 -08:00
8bdb9d1a3edoc updates per john on how we dedup
Matt Wells
2014-01-30 10:57:49 -08:00
8876dae984added and fixed support for <link href=xxx rel=canonical>. treat those as simplified meta redirects. updated spider dedup documentation in developer.html file.
Matt Wells
2014-01-30 10:37:59 -08:00
6a45e42128added ability to treat <link xyz.com rel=canonical> as meta redirects. should help us dedup. added a function to do looser deduping of spider pages, although it is currently not enabled; we are still using the stricter one. added documentation on how we dedup to developer.html for jon to take a look at.
Matt Wells
2014-01-30 10:04:09 -08:00
6af9441818change deduping logic to be first come first served, but site rank trumps. fixed bug from fix before.
Matt Wells
2014-01-29 16:14:42 -08:00
c92a9a4158Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-29 15:56:39 -08:00
b40f393f4cfix a couple cores related to deleting collections in progress. support termlist dump with terms containing colons.
Matt Wells
2014-01-29 15:56:07 -08:00
1fb1e2af7efixed form input. fixed page parser submission. added ability to dump out termlist from posdb like type:json (with a colon in it) to try to debug msft seeing html in csv output.
Matt Wells
2014-01-29 14:10:08 -08:00
8aef2ba8a0take out potentially bad robots.txt filter compression logic.
Matt Wells
2014-01-28 18:26:16 -08:00
57953f3b1bignore empty products (i.e. {}) when tokenizing diffbot reply
Matt Wells
2014-01-28 15:30:16 -08:00
53c2df1be1fixed core
Matt Wells
2014-01-28 15:20:37 -08:00
7b424a6236always use kstart. fixed restrictDomain bug of not saving parm. sped up csv download around 2x.
Matt Wells
2014-01-28 14:37:21 -08:00
239811b024take out confusing function no longer used
Matt Wells
2014-01-28 11:10:59 -08:00
8f39c41962just print out cached page straight, it is just the diffbot json reply pretty much verbatim, except for being tokenized. should no longer escape forward slashes.
Matt Wells
2014-01-28 11:04:53 -08:00
e9fcb9ad06started adding redownload logic.
Matt Wells
2014-01-28 09:46:58 -08:00
a9909e189ffix delete collection api
Matt Wells
2014-01-27 15:28:26 -08:00
474676010cfix gb install 1-15 logic
Matt Wells
2014-01-27 14:28:48 -08:00
726090be83contains a hack fix to fix things at startup but now it is commented out.
Matt Wells
2014-01-25 15:07:47 -08:00
1a9a5e53a7show if coll has urls ready to spider in html page
Matt Wells
2014-01-25 14:49:55 -08:00
268a244ee8fix up round incrementing logic.
Matt Wells
2014-01-25 14:35:41 -08:00
3a6a271dd9make crawl sync bug fixes. fix Puz crawl from dying out on host 9 because spider reply did not resuscitate waiting tree for its ip. fix mike's zola crawl with a repeat of 3 days from not incrementing the round because it had maxrounds 0, which means to ignore... assume 0 means to ignore now. send out 0xc1 crawl info requests to even dead hosts so we can at least use their last known good info.
Matt Wells
2014-01-25 13:47:03 -08:00
3bdbf23f13fix core from double free
Matt Wells
2014-01-25 11:21:15 -08:00
e3f769dffefixes for sudden revitalization of dead crawls.
Matt Wells
2014-01-25 11:03:15 -08:00
c207c3c456fixed core
Matt Wells
2014-01-25 08:36:09 -08:00
bc78b21dc6for json docs only give them a single xmlnode in the Xml.cpp class. hopefully will not get "malformed sections" error anymore. i think that was a result of the json having html tags in it and making unnested html structures which the sections class did not like. TODO: probably do this for CT_TEXT etc. as well.
Matt Wells
2014-01-25 08:17:38 -08:00
4d0a09f1e4Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-25 07:02:42 -08:00
99c6390a69fix a core. do not get sections of non-html or non-text documents. was causing EMALFORMED sections error on diffbot json.
Matt Wells
2014-01-25 07:02:14 -08:00
29a574d85aif indexing diffbot url and it had error, do NOT add a spider reply.
Matt Wells
2014-01-25 07:01:26 -08:00
308106673cadded debug statements for email bug
mwells
2014-01-24 14:08:27 -08:00
321fc90ff6fix some cores. NOTE: emails disabled here... need to fix.
Matt Wells
2014-01-24 12:07:28 -08:00
27b6ceffa8fix bug of sending notification email twice for really really tiny jobs.
Matt Wells
2014-01-23 21:22:39 -08:00
c4a6ad1145update "this round" counts to at least the total counts if round # is 0 so we do not double spider everyone's jobs! put a check in rebalance loop to see if gb is exiting so we don't get into an infinite loop. this should be in redmine now...
Matt Wells
2014-01-23 18:22:13 -08:00
77ca55f712fix send email notification bug. increase unlink threads from 1 to 30. seemed to be going too slow after doing a ddump with like 3000 collections. it was unlinking about 1 file per sec.
Matt Wells
2014-01-23 16:59:55 -08:00
dd663eb9f7fix round based spidering some more
Matt Wells
2014-01-23 15:03:37 -08:00
edb01b0abbMerge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-23 13:24:20 -08:00
313cffc322had to add per round page and process counts in case they had maxToCrawl and respider frequencies set. simplified round logic in Spider.cpp.
Matt Wells
2014-01-23 13:23:09 -08:00
4f7b00c6cefix core on broken pipe when calling sendChunk() and socket in streaming mode.
Matt Wells
2014-01-23 11:34:49 -08:00
9432ae870dfix bug to pass jenkins.
Matt Wells
2014-01-23 09:38:15 -08:00
26c76a3240fixed bug of waiting trees not saving.
Matt Wells
2014-01-23 01:04:24 -08:00
26b98a591afixed bug of not saving waiting trees! took out misleading Collectiondb::getNumRecs() func.! bad
Matt Wells
2014-01-23 01:02:11 -08:00
e351cb9939free spidercolls on exit
Matt Wells
2014-01-22 23:52:23 -08:00
bc35b7d0ecfix pagecrawlbot.cpp to support &c=token-name. cleanup mem at process exit better.
Matt Wells
2014-01-22 23:40:38 -08:00
df063dbdf2fix a core
Matt Wells
2014-01-22 22:26:50 -08:00
488e8c8e2fpause crawl if diffbot says token is expired.
Matt Wells
2014-01-22 20:56:52 -08:00
7cd746f567fix msge0 msg0 overload in sockets table when all diffbot replies timed out at once and released thousands of spiders.
Matt Wells
2014-01-22 20:34:55 -08:00
8a9b1f7a19added diffbot retry rules. added maxTotalSpiders parm for all colls to follow. tried to fix msg 0x00 socket jam up.
Matt Wells
2014-01-22 19:57:38 -08:00
061bf70a51show EXACT diffbot url used in logs for easier replication
Matt Wells
2014-01-22 18:25:18 -08:00
5f890f5d4fminor doc update
Matt Wells
2014-01-22 15:52:04 -08:00
25ec2b23cbadded gb.pem to required files list
Matt Wells
2014-01-22 14:47:20 -08:00
034de5039fignore tagdb corrupt tags in xmldoc.cpp. fix ip -1 bug when adding to waiting tree and it would prevent populateWaitingTreeFromSpiderdb() from continuing and freeze things up.
Matt Wells
2014-01-22 14:36:05 -08:00
a4be05d8d0more shard rebalancer fixes
Matt Wells
2014-01-22 00:44:33 -08:00
066d910934try to fix rebalancing some more.
Matt Wells
2014-01-21 22:39:01 -08:00
31cb71214cmore rdbtree fixes when invalid collections are in there
Matt Wells
2014-01-21 20:00:34 -08:00
443bb26f01disk page cache back on
Matt Wells
2014-01-21 19:03:47 -08:00
33c5d9c07fa lot of times rdb tree has invalid collection numbers in it so fix our counting algo in case the collection rec no longer exists!
Matt Wells
2014-01-21 19:01:44 -08:00
45cb5c9a0cfix bugs to try to get sharding working on crawlbot today
Matt Wells
2014-01-21 13:58:21 -08:00
7065b0ae0cfixed oops
Matt Wells
2014-01-21 13:13:16 -08:00
dba382f7f7added max cpu merge threads parm and defaulted to 10 up from 2 for better disk reading latencies.
Matt Wells
2014-01-21 13:11:53 -08:00
9354d06493menu updates.
Matt Wells
2014-01-21 13:01:37 -08:00
8d5e1cb547added url download support
Matt Wells
2014-01-20 23:17:04 -08:00
41cdfcef96inc spider limits in various places
Matt Wells
2014-01-20 18:51:15 -08:00
946a683e39quite a few spider fixes
Matt Wells
2014-01-20 16:45:27 -08:00
5c86d8a122simplified spiderdb.cpp scanSpiderdb() by breaking up into 4 functions. evalIpLoop(), readSpiderdbList(), ...
Matt Wells
2014-01-19 22:18:37 -08:00
e9bbc16a9ftook out pagecount table. just hafta scan twice i think because caching counts gets complicated because of adding duplicate injection requests!
Matt Wells
2014-01-19 20:34:38 -08:00
58d0c444acfixes for the global index quota system
Matt Wells
2014-01-19 19:38:23 -08:00
089d7f34a0more spiderdb spider request fixes
Matt Wells
2014-01-19 18:00:56 -08:00
970d5b2488formatting
Matt Wells
2014-01-19 16:40:22 -08:00
fa0e3f784fformatting
Matt Wells
2014-01-19 15:06:02 -08:00
5c9b688f72spiderdb fixes for injections
Matt Wells
2014-01-19 14:33:27 -08:00
99de2188e1formatting
Matt Wells
2014-01-19 13:21:58 -08:00
04b0650301formatting
Matt Wells
2014-01-19 12:37:37 -08:00
cd91130a6dformatting
Matt Wells
2014-01-19 12:16:26 -08:00
ca816492b5doc links
Matt Wells
2014-01-19 12:01:32 -08:00
b6c3ecc20emore formatting
Matt Wells
2014-01-19 11:56:36 -08:00
471599e9e7formatting
Matt Wells
2014-01-19 10:44:19 -08:00
e6eb9003b5more formatting
Matt Wells
2014-01-19 01:09:38 -08:00
b755b4d581formatting fixes
Matt Wells
2014-01-19 00:57:20 -08:00
fe3a879758formatting changes
Matt Wells
2014-01-19 00:38:02 -08:00
36b93a1e92minor cmdline fixes
Matt Wells
2014-01-18 21:26:59 -08:00
4606e88721code cleanups. xmldoc::injectDoc(), and it'll add a SpiderRequest as well. better collectiondb init code.
Matt Wells
2014-01-18 21:19:26 -08:00
10f4443974quite a few fixes to the quota system, cleanups etc.
Matt Wells
2014-01-18 16:23:13 -08:00
f3000e2763set m_needsSave in collectionrec when parms updated
Matt Wells
2014-01-18 12:51:10 -08:00
8edfc2ce70more collection fixes
Matt Wells
2014-01-18 12:09:33 -08:00
fa59c62264more bug fixes associated with collections and site page counts in url filters.
Matt Wells
2014-01-18 11:54:58 -08:00
22aa13e34ddo not set indexcode to EFAKEFIRSTIP for INJECTED urls, just added urls. fix add url page to not always use 'main' collection. added reset/restart cmds to spider page.
Matt Wells
2014-01-18 11:09:30 -08:00
178af5f781cleanup parms a bit. added diffbotApiUrl to all crawls whether custom or not, on spider controls page.
Matt Wells
2014-01-18 10:29:22 -08:00
9c1f6197ebadded indexbody control so i can turn it off for my special json global index.
Matt Wells
2014-01-18 10:04:33 -08:00
6fb602ae62hash a little meta info still even if custom crawl
Matt Wells
2014-01-18 09:37:07 -08:00
f9d0a02dbetest and get gbparenturl: query working.
Matt Wells
2014-01-18 09:28:58 -08:00
0be8a59e9ehash content checksums for pages in custom crawls so we can do deduping.
Matt Wells
2014-01-17 21:42:02 -08:00
5b7170e8c6Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-17 21:07:08 -08:00
4e803210eetons of changes from live github on neo. lots of core fixes. took out ppthtml powerpoint convert, it hangs. dynamic rdbmap to save memory per coll. fixed disk page cache logic and brought it back.
Matt Wells
2014-01-17 21:01:43 -08:00
8c4ac3c514Merge branch 'master' into diffbot
Matt Wells
2014-01-17 20:17:40 -08:00
bb51dd93c8Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-17 20:17:03 -08:00
403dca707cdo not hash body etc. into posdb if doing a custom diffbot crawl. saves a lot of disk space.
Matt Wells
2014-01-17 20:16:29 -08:00
116f90dba3Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot
Matt Wells
2014-01-17 18:39:34 -08:00
94740ed3a1allow sleeps in main.cpp function
Matt Wells
2014-01-17 18:39:20 -08:00
3ec44c5b35fix streaming mode for sending back json downloads/dumps.
Matt Wells
2014-01-17 18:28:17 -08:00
e09496e34efix parm updating logic.
Matt Wells
2014-01-17 17:48:45 -08:00
2faba0efd1fix repeat rounds sticking bug by adding PF_REBUILDURLFILTERS flag to spiderroundastarttime parm
Matt Wells
2014-01-17 17:17:10 -08:00