Commit Graph

  • 95a47a776e image updates Matt Wells 2014-01-30 13:11:26 -08:00
  • 8bdb9d1a3e doc updates per john on how we dedup Matt Wells 2014-01-30 10:57:49 -08:00
  • 8876dae984 added and fixed support for <link ahref=xxx rel=canonical>. treat those as simplified meta redirects. updated spider dedup documentation in developer.html file. Matt Wells 2014-01-30 10:37:59 -08:00
  • 6a45e42128 added ability to treat <link xyz.com rel=canonical> as meta redirects. should help us dedup. added a function to do looser deduping of spider pages although currently not enabled, we are still using the more strict one. added documentation on how we dedup to developer.html for jon to take a look at. Matt Wells 2014-01-30 10:04:09 -08:00
  • 6af9441818 change deduping logic to be first come first served, but site rank trumps. fixed bug from fix before. Matt Wells 2014-01-29 16:14:42 -08:00
  • c92a9a4158 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-29 15:56:39 -08:00
  • b40f393f4c fix a couple cores related to deleting collections in progress. support termlist dump with terms containing colons. Matt Wells 2014-01-29 15:56:07 -08:00
  • 1fb1e2af7e fixed form input. fixed page parser submission. added ability to dump out termlist from posdb like type:json (with a colon in it) to try to debug msft seeing html in csv output. Matt Wells 2014-01-29 14:10:08 -08:00
  • 8aef2ba8a0 take out potentially bad robots.txt filter compression logic. Matt Wells 2014-01-28 18:26:16 -08:00
  • 57953f3b1b ignore empty products (i.e. {}) when tokenizing diffbot reply Matt Wells 2014-01-28 15:30:16 -08:00
  • 53c2df1be1 fixed core Matt Wells 2014-01-28 15:20:37 -08:00
  • 7b424a6236 always use kstart. fixed restrictDomain bug of not saving parm. sped up csv download around 2x. Matt Wells 2014-01-28 14:37:21 -08:00
  • 239811b024 take out confusing function no longer used Matt Wells 2014-01-28 11:10:59 -08:00
  • 8f39c41962 just print out cached page straight, it is just the diffbot json reply pretty much verbatim, except for being tokenized. should no longer escape forward slashes. Matt Wells 2014-01-28 11:04:53 -08:00
  • e9fcb9ad06 started adding redownload logic. Matt Wells 2014-01-28 09:46:58 -08:00
  • a9909e189f fix delete collection api Matt Wells 2014-01-27 15:28:26 -08:00
  • 474676010c fix gb install 1-15 logic Matt Wells 2014-01-27 14:28:48 -08:00
  • 726090be83 contains a hack fix to fix things at startup but now it is commented out. Matt Wells 2014-01-25 15:07:47 -08:00
  • 1a9a5e53a7 show if coll has urls ready to spider in html page Matt Wells 2014-01-25 14:49:55 -08:00
  • 268a244ee8 fix up round incrementing logic. Matt Wells 2014-01-25 14:35:41 -08:00
  • 3a6a271dd9 make crawl sync bug fixes. fix Puz crawl from dying out on host 9 because spider reply did not resuscitate waiting tree for its ip. fix mike's zola crawl with a repeat of 3 days from not incrementing the round because it had maxrounds 0, which means to ignore... assume 0 means to ignore now. send out 0xc1 crawl info requests to even dead hosts so we can at least use their last known good info. Matt Wells 2014-01-25 13:47:03 -08:00
  • 3bdbf23f13 fix core from double free Matt Wells 2014-01-25 11:21:15 -08:00
  • e3f769dffe fixes for sudden revitalization of dead crawls. Matt Wells 2014-01-25 11:03:15 -08:00
  • c207c3c456 fixed core Matt Wells 2014-01-25 08:36:09 -08:00
  • bc78b21dc6 for json docs only give them a single xmlnode in the Xml.cpp class. hopefully will not get "malformed sections" error anymore. i think that was a result of the json having html tags in it and making unnested html structures which the sections class did not like. TODO: probably do this for CT_TEXT etc. as well. Matt Wells 2014-01-25 08:17:38 -08:00
  • 4d0a09f1e4 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-25 07:02:42 -08:00
  • 99c6390a69 fix a core. do not get sections of non-html or non-text documents. was causing EMALFORMED sections error on diffbot json. Matt Wells 2014-01-25 07:02:14 -08:00
  • 29a574d85a if indexing diffbot url and it had error, do NOT add a spider reply. Matt Wells 2014-01-25 07:01:26 -08:00
  • 308106673c added debug statements for email bug mwells 2014-01-24 14:08:27 -08:00
  • 321fc90ff6 fix some cores. NOTE: emails disabled here... need to fix. Matt Wells 2014-01-24 12:07:28 -08:00
  • 27b6ceffa8 fix bug of sending notification email twice for really really tiny jobs. Matt Wells 2014-01-23 21:22:39 -08:00
  • c4a6ad1145 update "this round" counts to at least the total counts if round # is 0 so we do not double spider everyone's jobs! put a check in rebalance loop to see if gb is exiting so we don't get into an infinite loop. this should be in redmine now... Matt Wells 2014-01-23 18:22:13 -08:00
  • 77ca55f712 fix send email notification bug. increase unlink threads from 1 to 30. seemed to be going too slow after doing a ddump with like 3000 collections. it was unlinking about 1 file per sec. Matt Wells 2014-01-23 16:59:55 -08:00
  • dd663eb9f7 fix round based spidering some more Matt Wells 2014-01-23 15:03:37 -08:00
  • edb01b0abb Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-23 13:24:20 -08:00
  • 313cffc322 had to add per round page and process counts in case they had maxToCrawl and respider frequencies set. simplified round logic in Spider.cpp. Matt Wells 2014-01-23 13:23:09 -08:00
  • 4f7b00c6ce fix core on broken pipe when calling sendChunk() and socket in streaming mode. Matt Wells 2014-01-23 11:34:49 -08:00
  • 9432ae870d fix bug to pass jenkins. Matt Wells 2014-01-23 09:38:15 -08:00
  • 26c76a3240 fixed bug of waiting trees not saving. Matt Wells 2014-01-23 01:04:24 -08:00
  • 26b98a591a fixed bug of not saving waiting trees! took out misleading Collectiondb::getNumRecs() func.! bad Matt Wells 2014-01-23 01:02:11 -08:00
  • e351cb9939 free spidercolls on exit Matt Wells 2014-01-22 23:52:23 -08:00
  • bc35b7d0ec fix pagecrawlbot.cpp to support &c=token-name. cleanup mem at process exit better. Matt Wells 2014-01-22 23:40:38 -08:00
  • df063dbdf2 fix a core Matt Wells 2014-01-22 22:26:50 -08:00
  • 488e8c8e2f pause crawl if diffbot says token is expired. Matt Wells 2014-01-22 20:56:52 -08:00
  • 7cd746f567 fix msge0 msg0 overload in sockets table when all diffbot replies timed out at once and released thousands of spiders. Matt Wells 2014-01-22 20:34:55 -08:00
  • 8a9b1f7a19 added diffbot retry rules. added maxTotalSpiders parm for all colls to follow. tried to fix msg 0x00 socket jam up. Matt Wells 2014-01-22 19:57:38 -08:00
  • 061bf70a51 show EXACT diffbot url used in logs for easier replication Matt Wells 2014-01-22 18:25:18 -08:00
  • 5f890f5d4f minor doc update Matt Wells 2014-01-22 15:52:04 -08:00
  • 25ec2b23cb added gb.pem to required files list Matt Wells 2014-01-22 14:47:20 -08:00
  • 034de5039f ignore tagdb corrupt tags in xmldoc.cpp. fix ip -1 bug when adding to waiting tree and it would prevent populateWaitingTreeFromSpiderdb() from continuing and freeze things up. Matt Wells 2014-01-22 14:36:05 -08:00
  • a4be05d8d0 more shard rebalancer fixes Matt Wells 2014-01-22 00:44:33 -08:00
  • 066d910934 try to fix rebalancing some more. Matt Wells 2014-01-21 22:39:01 -08:00
  • 31cb71214c more rdbtree fixes when invalid collections are in there Matt Wells 2014-01-21 20:00:34 -08:00
  • 443bb26f01 disk page cache back on Matt Wells 2014-01-21 19:03:47 -08:00
  • 33c5d9c07f a lot of times rdb tree has invalid collection numbers in it so fix our counting algo in case the collection rec no longer exists! Matt Wells 2014-01-21 19:01:44 -08:00
  • 45cb5c9a0c fix bugs to try to get sharding working on crawlbot today Matt Wells 2014-01-21 13:58:21 -08:00
  • 7065b0ae0c fixed oops Matt Wells 2014-01-21 13:13:16 -08:00
  • dba382f7f7 added max cpu merge threads parm and defaulted to 10 up from 2 for better disk reading latencies. Matt Wells 2014-01-21 13:11:53 -08:00
  • 9354d06493 menu updates. Matt Wells 2014-01-21 13:01:37 -08:00
  • 8d5e1cb547 added url download support Matt Wells 2014-01-20 23:17:04 -08:00
  • 41cdfcef96 inc spider limits in various places Matt Wells 2014-01-20 18:51:15 -08:00
  • 946a683e39 quite a few spider fixes Matt Wells 2014-01-20 16:45:27 -08:00
  • 5c86d8a122 simplified spiderdb.cpp scanSpiderdb() by breaking up into 4 functions. evalIpLoop(), readSpiderdbList(), ... Matt Wells 2014-01-19 22:18:37 -08:00
  • e9bbc16a9f took out pagecount table. just hafta scan twice i think because caching counts gets complicated because of adding duplicate injection requests! Matt Wells 2014-01-19 20:34:38 -08:00
  • 58d0c444ac fixes for the global index quota system Matt Wells 2014-01-19 19:38:23 -08:00
  • 089d7f34a0 more spiderdb spider request fixes Matt Wells 2014-01-19 18:00:56 -08:00
  • 970d5b2488 formatting Matt Wells 2014-01-19 16:40:22 -08:00
  • fa0e3f784f formatting Matt Wells 2014-01-19 15:06:02 -08:00
  • 5c9b688f72 spiderdb fixes for injections Matt Wells 2014-01-19 14:33:27 -08:00
  • 99de2188e1 formatting Matt Wells 2014-01-19 13:21:58 -08:00
  • 04b0650301 formatting Matt Wells 2014-01-19 12:37:37 -08:00
  • cd91130a6d formatting Matt Wells 2014-01-19 12:16:26 -08:00
  • ca816492b5 doc links Matt Wells 2014-01-19 12:01:32 -08:00
  • b6c3ecc20e more formatting Matt Wells 2014-01-19 11:56:36 -08:00
  • 471599e9e7 formatting Matt Wells 2014-01-19 10:44:19 -08:00
  • e6eb9003b5 more formatting Matt Wells 2014-01-19 01:09:38 -08:00
  • b755b4d581 formatting fixes Matt Wells 2014-01-19 00:57:20 -08:00
  • fe3a879758 formatting changes Matt Wells 2014-01-19 00:38:02 -08:00
  • 36b93a1e92 minor cmdline fixes Matt Wells 2014-01-18 21:26:59 -08:00
  • 4606e88721 code cleanups. xmldoc::injectDoc(), and it'll add a SpiderRequest as well. better collectiondb init code. Matt Wells 2014-01-18 21:19:26 -08:00
  • 10f4443974 quite a few fixes to the quota system, cleanups etc. Matt Wells 2014-01-18 16:23:13 -08:00
  • f3000e2763 set m_needsSave in collectionrec when parms updated Matt Wells 2014-01-18 12:51:10 -08:00
  • 8edfc2ce70 more collection fixes Matt Wells 2014-01-18 12:09:33 -08:00
  • fa59c62264 more bug fixes associated with collections and site page counts in url filters. Matt Wells 2014-01-18 11:54:58 -08:00
  • 22aa13e34d do not set indexcode to EFAKEFIRSTIP for INJECTED urls, just added urls. fix add url page to not always use 'main' collection. added reset/restart cmds to spider page. Matt Wells 2014-01-18 11:09:30 -08:00
  • 178af5f781 cleanup parms a bit. added diffbotApiUrl to all crawls whether custom or not, on spider controls page. Matt Wells 2014-01-18 10:29:22 -08:00
  • 9c1f6197eb added indexbody control so i can turn it off for my special json global index. Matt Wells 2014-01-18 10:04:33 -08:00
  • 6fb602ae62 hash a little meta info still even if custom crawl Matt Wells 2014-01-18 09:37:07 -08:00
  • f9d0a02dbe test and get gbparenturl: query working. Matt Wells 2014-01-18 09:28:58 -08:00
  • 0be8a59e9e hash content checksums for pages in custom crawls so we can do deduping. Matt Wells 2014-01-17 21:42:02 -08:00
  • 5b7170e8c6 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-17 21:07:08 -08:00
  • 4e803210ee tons of changes from live github on neo. lots of core fixes. took out ppthtml powerpoint convert, it hangs. dynamic rdbmap to save memory per coll. fixed disk page cache logic and brought it back. Matt Wells 2014-01-17 21:01:43 -08:00
  • 8c4ac3c514 Merge branch 'master' into diffbot Matt Wells 2014-01-17 20:17:40 -08:00
  • bb51dd93c8 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-17 20:17:03 -08:00
  • 403dca707c do not hash body etc. into posdb if doing a custom diffbot crawl. saves a lot of disk space. Matt Wells 2014-01-17 20:16:29 -08:00
  • 116f90dba3 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-17 18:39:34 -08:00
  • 94740ed3a1 allow sleeps in main.cpp function Matt Wells 2014-01-17 18:39:20 -08:00
  • 3ec44c5b35 fix streaming mode for sending back json downloads/dumps. Matt Wells 2014-01-17 18:28:17 -08:00
  • e09496e34e fix parm updating logic. Matt Wells 2014-01-17 17:48:45 -08:00
  • 2faba0efd1 fix repeat rounds sticking bug by adding PF_REBUILDURLFILTERS flag to spiderroundastarttime parm Matt Wells 2014-01-17 17:17:10 -08:00