Commit Graph

  • 20c31dcc78 Merge branch 'master' into diffbot-slicing Matt Wells 2014-02-04 12:28:43 -0800
  • d2cebad8e7 spidercoll deletion fixes. Matt Wells 2014-02-04 12:28:05 -0800
  • 9ded8fa091 faster spiders checkpoint Matt Wells 2014-02-04 12:26:42 -0800
  • 258e3cba0d fix maxtocrawl limit thing Matt Wells 2014-02-04 09:25:27 -0700
  • 17fff243f9 add connectips back. call them adminIps this time. if your ip is on the list then you have admin access. cookie tokens will come later/soon. Matt Wells 2014-02-03 20:47:48 -0700
  • d3b498a057 time slice checkpoint Matt Wells 2014-02-03 19:17:58 -0800
  • 5ea852dac3 fix core when thread fails to spawn. Matt Wells 2014-02-03 07:27:32 -0700
  • b46da4c192 prevent msg20/tagdb lookup socket jam up. throttle back max outstanding msg20s (summary generations) based on used udp sockets. Matt Wells 2014-02-03 07:09:29 -0700
  • 56adb2ee8c nomenclature. url filters -> spider scheduler Matt Wells 2014-02-02 17:00:11 -0700
  • 10235bb840 fix add url and cached page getting Matt Wells 2014-02-02 16:49:31 -0700
  • 7bf8a2ac49 do not let glibc do malloc checks, we do that. Matt Wells 2014-02-02 13:41:59 -0700
  • 4be68fdaa6 set safebuf::m_buf to null in destructor Matt Wells 2014-02-02 12:16:11 -0700
  • 0df697e56a fix keep alive loop code to bail out if fails to bind to socket as well as quick cores. Matt Wells 2014-02-02 12:11:18 -0700
  • f58a94a8cc fix diffbot url bug Matt Wells 2014-02-02 11:53:10 -0700
  • 93021b2f13 Merge branch 'diffbot' Matt Wells 2014-02-01 11:31:00 -0700
  • 095c47f181 Merge branch 'diffbot' Matt Wells 2014-02-01 11:28:31 -0700
  • 62802006f5 Merge 4346fcee29 into dde05446f5 DENI KUTA 2014-02-01 10:14:46 -0800
  • 4346fcee29 added recovery mode display in hosts table Matt Wells 2014-02-01 10:16:46 -0800
  • 4d2eafe39b added some repair logic for 0001.dat files. turn off spiderdb disk cache for now. Matt Wells 2014-02-01 10:14:25 -0800
  • 10d0e9f52b Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-31 14:54:23 -0800
  • 392d043bd8 undo canonical deduping. added dump round stats when uploading json files. Matt Wells 2014-01-31 14:53:49 -0800
  • 6e9b4f8ca2 fix core Matt Wells 2014-01-30 22:03:12 -0700
  • e8a6d8f345 fix another core from freeing wrong byte sized crawl info reply. Matt Wells 2014-01-30 20:16:41 -0800
  • 09fd98c95b Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-30 19:57:07 -0800
  • 7107f730d0 fix another core from deleting a coll and deleting a spidercoll in progress. Matt Wells 2014-01-30 19:56:43 -0800
  • 4a1ad74f79 test fix for keep alive infinite loop bug. Matt Wells 2014-01-30 14:16:16 -0800
  • 83e291f12b fix infinite keep alive restart bug some more Matt Wells 2014-01-30 14:12:32 -0800
  • 03aa7842d0 do not enter into an infinite keep alive restart loop. Matt Wells 2014-01-30 14:40:03 -0700
  • 40f373c9e0 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-30 13:11:48 -0800
  • 95a47a776e image updates Matt Wells 2014-01-30 13:11:26 -0800
  • 8bdb9d1a3e doc updates per john on how we dedup Matt Wells 2014-01-30 10:57:49 -0800
  • 8876dae984 added and fixed support for <link ahref=xxx rel=canonical>. treat those as simplified meta redirects. updated spider dedup documentation in developer.html file. Matt Wells 2014-01-30 10:37:59 -0800
  • 6a45e42128 added ability to treat <link xyz.com rel=canonical> as meta redirects. should help us dedup. added a function to do looser deduping of spider pages although currently not enabled, we are still using the more strict one. added documentation on how we dedup to developer.html for jon to take a look at. Matt Wells 2014-01-30 10:04:09 -0800
  • 6af9441818 change deduping logic to be first come first served, but site rank trumps. fixed bug from fix before. Matt Wells 2014-01-29 16:14:42 -0800
  • c92a9a4158 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-29 15:56:39 -0800
  • b40f393f4c fix a couple cores related to deleting collections in progress. support termlist dump with terms containing colons. Matt Wells 2014-01-29 15:56:07 -0800
  • 1fb1e2af7e fixed form input. fixed page parser submission. added ability to dump out termlist from posdb like type:json (with a colon in it) to try to debug msft seeing html in csv output. Matt Wells 2014-01-29 14:10:08 -0800
  • 8aef2ba8a0 take out potentially bad robots.txt filter compression logic. Matt Wells 2014-01-28 18:26:16 -0800
  • 57953f3b1b ignore empty products (i.e. {}) when tokenizing diffbot reply Matt Wells 2014-01-28 15:30:16 -0800
  • 53c2df1be1 fixed core Matt Wells 2014-01-28 15:20:37 -0800
  • 7b424a6236 always use kstart. fixed restrictDomain bug of not saving parm. sped up csv download around 2x. Matt Wells 2014-01-28 14:37:21 -0800
  • 239811b024 take out confusing function no longer used Matt Wells 2014-01-28 11:10:59 -0800
  • 8f39c41962 just print out cached page straight, it is just the diffbot json reply pretty much verbatim, except for being tokenized. should no longer escape forward slashes. Matt Wells 2014-01-28 11:04:53 -0800
  • e9fcb9ad06 started adding redownload logic. Matt Wells 2014-01-28 09:46:58 -0800
  • a9909e189f fix delete collection api Matt Wells 2014-01-27 15:28:26 -0800
  • 474676010c fix gb install 1-15 logic Matt Wells 2014-01-27 14:28:48 -0800
  • 726090be83 contains a hack fix to fix things at startup but now it is commented out. Matt Wells 2014-01-25 15:07:47 -0800
  • 1a9a5e53a7 show if coll has urls ready to spider in html page Matt Wells 2014-01-25 14:49:55 -0800
  • 268a244ee8 fix up round incrementing logic. Matt Wells 2014-01-25 14:35:41 -0800
  • 3a6a271dd9 make crawl sync bug fixes. fix Puz crawl from dying out on host 9 because spider reply did not resuscitate waiting tree for its ip. fix mike's zola crawl with a repeat of 3 days from not incrementing the round because it had maxrounds 0, which means to ignore... assume 0 means to ignore now. send out 0xc1 crawl info requests to even dead hosts so we can at least use their last known good info. Matt Wells 2014-01-25 13:47:03 -0800
  • 3bdbf23f13 fix core from double free Matt Wells 2014-01-25 11:21:15 -0800
  • e3f769dffe fixes for sudden revitalization of dead crawls. Matt Wells 2014-01-25 11:03:15 -0800
  • c207c3c456 fixed core Matt Wells 2014-01-25 08:36:09 -0800
  • bc78b21dc6 for json docs only give them a single xmlnode in the Xml.cpp class. hopefully will not get "malformed sections" error anymore. i think that was a result of the json having html tags in it and making unnested html structures which the sections class did not like. TODO: probably do this for CT_TEXT etc. as well. Matt Wells 2014-01-25 08:17:38 -0800
  • 4d0a09f1e4 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-25 07:02:42 -0800
  • 99c6390a69 fix a core. do not get sections of non-html or non-text documents. was causing EMALFORMED sections error on diffbot json. Matt Wells 2014-01-25 07:02:14 -0800
  • 29a574d85a if indexing diffbot url and it had error, do NOT add a spider reply. Matt Wells 2014-01-25 07:01:26 -0800
  • 308106673c added debug statements for email bug mwells 2014-01-24 14:08:27 -0800
  • 321fc90ff6 fix some cores. NOTE: emails disabled here... need to fix. Matt Wells 2014-01-24 12:07:28 -0800
  • 27b6ceffa8 fix bug of sending notification email twice for really really tiny jobs. Matt Wells 2014-01-23 21:22:39 -0800
  • c4a6ad1145 update "this round" counts to at least the total counts if round # is 0 so we do not double spider everyone's jobs! put a check in rebalance loop to see if gb is exiting so we don't get into an infinite loop. this should be in redmine now... Matt Wells 2014-01-23 18:22:13 -0800
  • 77ca55f712 fix send email notification bug. increase unlink threads from 1 to 30. seemed to be going too slow after doing a ddump with like 3000 collections. it was unlinking like 1 file per sec. Matt Wells 2014-01-23 16:59:55 -0800
  • dd663eb9f7 fix round based spidering some more Matt Wells 2014-01-23 15:03:37 -0800
  • edb01b0abb Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-23 13:24:20 -0800
  • 313cffc322 had to add per round page and process counts in case they had maxToCrawl and respider frequencies set. simplified round logic in Spider.cpp. Matt Wells 2014-01-23 13:23:09 -0800
  • 4f7b00c6ce fix core on broken pipe when calling sendChunk() and socket in streaming mode. Matt Wells 2014-01-23 11:34:49 -0800
  • 9432ae870d fix bug to pass jenkins. Matt Wells 2014-01-23 09:38:15 -0800
  • 26c76a3240 fixed bug of waiting trees not saving. Matt Wells 2014-01-23 01:04:24 -0800
  • 26b98a591a fixed bug of not saving waiting trees! took out misleading Collectiondb::getNumRecs() func.! bad Matt Wells 2014-01-23 01:02:11 -0800
  • e351cb9939 free spidercolls on exit Matt Wells 2014-01-22 23:52:23 -0800
  • bc35b7d0ec fix pagecrawlbot.cpp to support &c=token-name. cleanup mem at process exit better. Matt Wells 2014-01-22 23:40:38 -0800
  • df063dbdf2 fix a core Matt Wells 2014-01-22 22:26:50 -0800
  • 488e8c8e2f pause crawl if diffbot says token is expired. Matt Wells 2014-01-22 20:56:52 -0800
  • 7cd746f567 fix msge0 msg0 overload in sockets table when all diffbot replies timed out at once and released thousands of spiders. Matt Wells 2014-01-22 20:34:55 -0800
  • 8a9b1f7a19 added diffbot retry rules. added maxTotalSpiders parm for all colls to follow. tried to fix msg 0x00 socket jam up. Matt Wells 2014-01-22 19:57:38 -0800
  • 061bf70a51 show EXACT diffbot url used in logs for easier replication Matt Wells 2014-01-22 18:25:18 -0800
  • 5f890f5d4f minor doc update Matt Wells 2014-01-22 15:52:04 -0800
  • 25ec2b23cb added gb.pem to required files list Matt Wells 2014-01-22 14:47:20 -0800
  • 034de5039f ignore tagdb corrupt tags in xmldoc.cpp. fix ip -1 bug when adding to waiting tree and it would prevent populateWaitingTreeFromSpiderdb() from continuing and freeze things up. Matt Wells 2014-01-22 14:36:05 -0800
  • a4be05d8d0 more shard rebalancer fixes Matt Wells 2014-01-22 00:44:33 -0800
  • 066d910934 try to fix rebalancing some more. Matt Wells 2014-01-21 22:39:01 -0800
  • 31cb71214c more rdbtree fixes when invalid collections are in there Matt Wells 2014-01-21 20:00:34 -0800
  • 443bb26f01 disk page cache back on Matt Wells 2014-01-21 19:03:47 -0800
  • 33c5d9c07f a lot of times rdb tree has invalid collection numbers in it so fix our counting algo in case the collection rec no longer exists! Matt Wells 2014-01-21 19:01:44 -0800
  • 45cb5c9a0c fix bugs to try to get sharding working on crawlbot today Matt Wells 2014-01-21 13:58:21 -0800
  • 7065b0ae0c fixed oops Matt Wells 2014-01-21 13:13:16 -0800
  • dba382f7f7 added max cpu merge threads parm and defaulted to 10 up from 2 for better disk reading latencies. Matt Wells 2014-01-21 13:11:53 -0800
  • 9354d06493 menu updates. Matt Wells 2014-01-21 13:01:37 -0800
  • 8d5e1cb547 added url download support Matt Wells 2014-01-20 23:17:04 -0800
  • 41cdfcef96 inc spider limits in various places Matt Wells 2014-01-20 18:51:15 -0800
  • 946a683e39 quite a few spider fixes Matt Wells 2014-01-20 16:45:27 -0800
  • 5c86d8a122 simplified spiderdb.cpp scanSpiderdb() by breaking up into 4 functions. evalIpLoop(), readSpiderdbList(), ... Matt Wells 2014-01-19 22:18:37 -0800
  • e9bbc16a9f took out pagecount table. just hafta scan twice i think because caching counts gets complicated because of adding duplicate injection requests! Matt Wells 2014-01-19 20:34:38 -0800
  • 58d0c444ac fixes for the global index quota system Matt Wells 2014-01-19 19:38:23 -0800
  • 089d7f34a0 more spiderdb spider request fixes Matt Wells 2014-01-19 18:00:56 -0800
  • 970d5b2488 formatting Matt Wells 2014-01-19 16:40:22 -0800
  • fa0e3f784f formatting Matt Wells 2014-01-19 15:06:02 -0800
  • 5c9b688f72 spiderdb fixes for injections Matt Wells 2014-01-19 14:33:27 -0800
  • 99de2188e1 formatting Matt Wells 2014-01-19 13:21:58 -0800
  • 04b0650301 formatting Matt Wells 2014-01-19 12:37:37 -0800