Commit Graph

  • 16f8af0d57 added awesome streaming mode support to tcpserver.cpp for sending back json objects as we get them from shards, in small pieces so we don't go oom. made that code much simpler and more reliable in the long run. Matt Wells 2014-01-17 16:26:17 -08:00
  • 0844dbf72a added url process pattern and regex to xmldoc.cpp. Matt Wells 2014-01-17 11:08:23 -08:00
  • 01a3282020 fix problem scanning spiderdb. move dedup spiderdb code to RdbMerge.cpp where it really should be. Matt Wells 2014-01-16 17:04:08 -08:00
  • 167d2dc99f nothing. Matt Wells 2014-01-16 13:40:27 -08:00
  • 980d63632a more msg5 re-read fixes. stop re-reading if increasing minrecsizes did nothing. fix tight merges so they work over all colls. fix merge counting to be fast and not loop over all rdbbases which could be thousands. add num mirrors to rebalance.txt. fix updateCrawlInfo to wait for all replies. critical error! Matt Wells 2014-01-16 13:38:22 -08:00
  • dde05446f5 sharding fixes for 3+ stripes. Matt Wells 2014-01-16 11:20:12 -07:00
  • ae3aa445e8 rebalancer working pretty well now Matt Wells 2014-01-15 19:08:47 -08:00
  • 4b27b22949 got rebalancing working right Matt Wells 2014-01-15 17:40:17 -08:00
  • 4a04542829 rebalancer is running now Matt Wells 2014-01-15 16:09:38 -08:00
  • f8c2329bd2 rebalancer fixes Matt Wells 2014-01-15 15:42:59 -08:00
  • b8058cc16c rebalance fix Matt Wells 2014-01-15 14:49:11 -08:00
  • 883487889d make gb install only have 10 outstanding per ip since ssh seems to close connections if you have more than 12 out. Matt Wells 2014-01-15 14:41:30 -08:00
  • d091c7e959 fix hostsinagreement bug Matt Wells 2014-01-14 11:24:32 -08:00
  • cb5b4af271 show reason spiders are not going above the spider queue page. Matt Wells 2014-01-11 21:40:45 -08:00
  • 9da106e7ca added emergency msg box on all admin pages Matt Wells 2014-01-11 20:35:13 -08:00
  • eed606601e added emergency msg box on all admin pages Matt Wells 2014-01-11 20:14:44 -08:00
  • 92047661ae fix annoying rdbtree pos/neg key counting issue Matt Wells 2014-01-11 18:04:28 -08:00
  • 6de7abf6ba display fixes. ./gb installgb and ./gb installgb2 now install 'gb' if 'gb.new' is not present. Matt Wells 2014-01-11 17:16:20 -08:00
  • 299a208253 reduce log spam Matt Wells 2014-01-11 16:49:43 -08:00
  • 444151f185 Merge branch 'master' into diffbot Matt Wells 2014-01-11 16:10:48 -08:00
  • 8a49e87a61 got code with shard rebalancing compiling. now we store a "sharded by termid" bit in posdb key for checksums, etc keys that are not sharded by docid. save having to do disk seeks on every host in the cluster to do a dup check, etc. Matt Wells 2014-01-11 16:08:42 -08:00
  • f5071e1d88 fix docid range splitting logic. Matt Wells 2014-01-10 20:39:13 -07:00
  • 4a4eac663f fix old bug. Matt Wells 2014-01-10 18:52:47 -07:00
  • 4378c922ac localhosts.conf file takes precedence over hosts.conf file so you can put your customizations in there and easily do a git pull without having to worry about hosts.conf being checked in. Matt Wells 2014-01-10 18:38:11 -07:00
  • 3a6f0d81e3 fix a few cores. assume any ip that matches the c-block of any host in hosts.conf file is "local". clarified specs in admin.html. Matt Wells 2014-01-10 18:34:47 -07:00
  • f64b53bfb3 almost done with rebalancing code Matt Wells 2014-01-10 14:12:58 -08:00
  • 8943106389 minor print updates Matt Wells 2014-01-09 21:23:51 -08:00
  • 1d6ba52dcd list collections in sidebar. Matt Wells 2014-01-09 21:13:41 -08:00
  • 6660dca57c default parm updates Matt Wells 2014-01-09 20:07:19 -08:00
  • c596b6c5a6 default gb.conf update Matt Wells 2014-01-09 19:59:02 -08:00
  • aa4c751de9 minor updates Matt Wells 2014-01-09 19:50:04 -08:00
  • d76e7a9c8e highlight non-default value parms. Matt Wells 2014-01-09 19:37:17 -08:00
  • 645360b730 parm simplifications Matt Wells 2014-01-09 19:00:21 -08:00
  • 501f49c81b gui and parm updates. simplifications. Matt Wells 2014-01-09 17:29:18 -08:00
  • 128e055120 take out datedb. no longer used. we store dates in posdb since it has larger keys than indexdb. Matt Wells 2014-01-09 13:39:28 -08:00
  • 4d7fa1eea9 pretty up url filters table Matt Wells 2014-01-09 13:34:43 -08:00
  • ebdf1f638a fix ./gb installgb2 to be semi-sequential Matt Wells 2014-01-09 13:25:45 -08:00
  • d8554bfb0f update default parm settings. Matt Wells 2014-01-09 13:22:51 -08:00
  • 47327a0c41 Merge branch 'master' into diffbot Matt Wells 2014-01-09 13:07:59 -08:00
  • 70f8c416de allow collections to be added when no colls exist. fixed gb start2 etc. to be sequential. Matt Wells 2014-01-09 13:07:16 -08:00
  • 161a5c5d6b logging cleanups Matt Wells 2014-01-09 12:38:38 -08:00
  • 482186e2af Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-09 11:35:46 -08:00
  • 6ba3936d0b various core fixes. need to fix json parser mem allocation right though. Added dynamic rdb map ptr allocation to save memory when you have thousands of collections. Matt Wells 2014-01-09 11:34:52 -08:00
  • 5007dc8e0c fix core in gb seektest Matt Wells 2014-01-09 11:17:05 -07:00
  • 4a842c1c68 fix occasional core in Mem.cpp Matt Wells 2014-01-08 01:32:24 -07:00
  • 65270c4063 fix spiders not going Matt Wells 2014-01-07 16:05:22 -08:00
  • 1bf827ed62 more spider not going fixes Matt Wells 2014-01-07 16:48:20 -07:00
  • 0dddd8fab1 fix spiders not going bug Matt Wells 2014-01-07 16:41:14 -07:00
  • 3db562f0f1 bug fixes for pages indexed and manual seeds counting. Matt Wells 2014-01-07 15:38:22 -08:00
  • f0c232803a fix page url filters editing Matt Wells 2014-01-07 14:27:58 -08:00
  • 3cc83e0c26 added .mov files to media list Matt Wells 2014-01-07 13:08:32 -08:00
  • 7bbecd4678 no more diffbot api url per url filter row. it's global now. Matt Wells 2014-01-07 12:43:02 -08:00
  • d69fc065ce just use a global diffbot api url for simplicity to avoid having to call XmlDoc::getUrlFilterNum() which is not good practice since url filters are for spiderdb reads really. Matt Wells 2014-01-07 12:31:25 -08:00
  • 909022642d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-07 12:10:59 -08:00
  • e186b5e31c getUrlFilterNum now uses fake spider reply with only langid set. to prevent recursive loops. Matt Wells 2014-01-07 12:10:29 -08:00
  • 0be59b4c4d Merge branch 'master' into diffbot Matt Wells 2014-01-07 12:09:35 -08:00
  • e366c12470 Merge branch 'master' into diffbot Matt Wells 2014-01-07 12:09:11 -08:00
  • c529eaf1c3 fix alloc in a thread bug Matt Wells 2014-01-07 11:32:12 -07:00
  • b0457f973d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-06 15:56:18 -08:00
  • 724af442d4 do not do simplified redirs if it's a bulk job Matt Wells 2014-01-06 15:55:41 -08:00
  • 7e5b9bc1e8 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-06 15:49:47 -08:00
  • ace49d0b16 fix sectiondb core Matt Wells 2014-01-06 15:49:35 -08:00
  • c472635660 do not do simplified redirs for custom crawls so clients see their original urls. Matt Wells 2014-01-06 15:46:39 -08:00
  • 50c99dd815 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-06 14:28:18 -08:00
  • 49c935cf6d added SpiderReply::m_wasIndexedValid so we know whether to count m_wasIndexed and m_isIndexed for page counting quota purposes. Matt Wells 2014-01-06 14:27:38 -08:00
  • 599be55b81 return {} if no crawls, and just specify token. Matt Wells 2014-01-06 13:55:58 -08:00
  • 4ed30a98ec bring back checkboxes. fix issue by putting an input hidden box with value=0 before the checkbox to transmit it even if unchecked. Matt Wells 2014-01-06 11:35:17 -08:00
  • 622790d0f8 radio button fixes. make them buttons now. Matt Wells 2014-01-06 10:23:44 -08:00
  • 4f64677b4f get new global preemptive cache logic compiling, with section voting stats. Matt Wells 2014-01-05 11:51:09 -08:00
  • 258c48bb98 increase hardware requirements to 4GB from 500MB until we adjust in-memory structure mem usage dynamically. Matt Wells 2014-01-04 18:05:50 -07:00
  • 9bf49884b9 fix compiler warning mwells 2014-01-02 01:35:52 -07:00
  • 7df2111ceb fixed 'gb inject titledb-DIR newhosts.conf' command for populating an index from titledb files in DIR and transmitting to appropriate host in newhosts.conf. also prettied up the gb -h output to use a formatting function. Matt Wells 2014-01-02 01:20:08 -07:00
  • 935a4faccf fixed './gb inject titledb newhosts.conf' You have to be in working directory of the instance whose cached pages (titlerecs) you want to inject into the new cluster defined by newhosts.conf. Matt Wells 2014-01-01 22:04:26 -07:00
  • b7e9b78c21 hash gbparenturl: for getting json objects for the specified url in the search results. Matt Wells 2013-12-31 10:21:08 -08:00
  • d77ddc19c3 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-12-31 09:46:47 -08:00
  • 5619d2a2c8 fix initializing status msg error Matt Wells 2013-12-31 09:46:35 -08:00
  • 1919ad7f95 gb.conf spiders enabled. Matt Wells 2013-12-31 09:22:46 -08:00
  • 471fc7a50a fixed core from deleting a non-existent crawl. it tried to add it ... Matt Wells 2013-12-30 10:53:45 -08:00
  • 71982c9919 fix bad csv output Matt Wells 2013-12-30 10:39:45 -08:00
  • f92f190176 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-12-29 14:51:33 -08:00
  • a70b280206 nothing Matt Wells 2013-12-29 14:51:24 -08:00
  • c0447de3a1 watch out for NULL "base" after a coll delete. Matt Wells 2013-12-29 01:32:40 -08:00
  • 70fc63985b nothing Matt Wells 2013-12-28 20:32:28 -08:00
  • 1c044235be count EFAKEFIRSTIP errors when spidering as page download attempts. should fix a couple smoke tests. Matt Wells 2013-12-27 19:25:51 -08:00
  • 6aac48e487 fix crawl delay wait queue logic. if the coll we are trying to add already exists, let it be. don't error out. Matt Wells 2013-12-27 14:35:51 -08:00
  • 5cdb73bc70 fix spider core Matt Wells 2013-12-27 15:28:44 -07:00
  • d8a9a3f4e3 fix parm sync code some more. added localhosts.conf to the 'gb install' dist. Matt Wells 2013-12-27 14:00:37 -08:00
  • bff0083538 ensured robots.txt redirects are cached as well Matt Wells 2013-12-27 13:01:01 -08:00
  • 534c9cf9db fix parm sync core Matt Wells 2013-12-27 12:09:46 -08:00
  • 958becbdf0 fix parm checksum for syncing parms. was not using gbstrlen() for strings. Matt Wells 2013-12-27 11:56:20 -08:00
  • 0181a32311 fix array count syncing. fix parms that were not syncing. Matt Wells 2013-12-26 11:51:20 -08:00
  • 100af585a6 parm sync fixes Matt Wells 2013-12-26 11:20:19 -08:00
  • 93d62a1f9e Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-12-26 09:34:47 -08:00
  • 9b5e3016df fix hosts.conf Matt Wells 2013-12-26 09:34:35 -08:00
  • 141a76c322 try localhosts.conf before hosts.conf Matt Wells 2013-12-26 09:32:22 -08:00
  • 7624a3db0a if url is manually added and it is simplifiedredirect then re-add with the same manually added bit set in the new spider request, otherwise seed url might not get spidered since it might not match the regex. Matt Wells 2013-12-26 08:58:56 -08:00
  • 048b715962 if coll is deleted or reset in the middle of a dump or merge then stop the dump/merge with ENOCOLLREC error. avoid calling "base->" functions since it could be NULL if deleted. Matt Wells 2013-12-25 17:12:09 -08:00
  • f9d7b9dbc7 fix core Matt Wells 2013-12-23 18:50:46 -08:00
  • 8537a02008 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-12-23 10:31:00 -08:00
  • 6cc69106c2 fix hosts.conf Matt Wells 2013-12-23 10:30:45 -08:00