Commit Graph

  • cd91130a6d formatting Matt Wells 2014-01-19 12:16:26 -0800
  • ca816492b5 doc links Matt Wells 2014-01-19 12:01:32 -0800
  • b6c3ecc20e more formatting Matt Wells 2014-01-19 11:56:36 -0800
  • 471599e9e7 formatting Matt Wells 2014-01-19 10:44:19 -0800
  • e6eb9003b5 more formatting Matt Wells 2014-01-19 01:09:38 -0800
  • b755b4d581 formatting fixes Matt Wells 2014-01-19 00:57:20 -0800
  • fe3a879758 formatting changes Matt Wells 2014-01-19 00:38:02 -0800
  • 36b93a1e92 minor cmdline fixes Matt Wells 2014-01-18 21:26:59 -0800
  • 4606e88721 code cleanups. xmldoc::injectDoc(), and it'll add a SpiderRequest as well. better collectiondb init code. Matt Wells 2014-01-18 21:19:26 -0800
  • 10f4443974 quite a few fixes to the quota system, cleanups etc. Matt Wells 2014-01-18 16:23:13 -0800
  • f3000e2763 set m_needsSave in collectionrec when parms updated Matt Wells 2014-01-18 12:51:10 -0800
  • 8edfc2ce70 more collection fixes Matt Wells 2014-01-18 12:09:33 -0800
  • fa59c62264 more bug fixes associated with collections and site page counts in url filters. Matt Wells 2014-01-18 11:54:58 -0800
  • 22aa13e34d do not set indexcode to EFAKEFIRSTIP for INJECTED urls, just added urls. fix add url page to not always use 'main' collection. added reset/restart cmds to spider page. Matt Wells 2014-01-18 11:09:30 -0800
  • 178af5f781 cleanup parms a bit. added diffbotApiUrl to all crawls whether custom or not, on spider controls page. Matt Wells 2014-01-18 10:29:22 -0800
  • 9c1f6197eb added indexbody control so i can turn it off for my special json global index. Matt Wells 2014-01-18 10:04:33 -0800
  • 6fb602ae62 hash a little meta info still even if custom crawl Matt Wells 2014-01-18 09:37:07 -0800
  • f9d0a02dbe test and get gbparenturl: query working. Matt Wells 2014-01-18 09:28:58 -0800
  • 0be8a59e9e hash content checksums for pages in custom crawls so we can do deduping. Matt Wells 2014-01-17 21:42:02 -0800
  • 5b7170e8c6 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-17 21:07:08 -0800
  • 4e803210ee tons of changes from live github on neo. lots of core fixes. took out ppthtml powerpoint convert, it hangs. dynamic rdbmap to save memory per coll. fixed disk page cache logic and brought it back. Matt Wells 2014-01-17 21:01:43 -0800
  • 8c4ac3c514 Merge branch 'master' into diffbot Matt Wells 2014-01-17 20:17:40 -0800
  • bb51dd93c8 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-17 20:17:03 -0800
  • 403dca707c do not hash body etc. into posdb if doing a custom diffbot crawl. saves a lot of disk space. Matt Wells 2014-01-17 20:16:29 -0800
  • 116f90dba3 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-17 18:39:34 -0800
  • 94740ed3a1 allow sleeps in main.cpp function Matt Wells 2014-01-17 18:39:20 -0800
  • 3ec44c5b35 fix streaming mode for sending back json downloads/dumps. Matt Wells 2014-01-17 18:28:17 -0800
  • e09496e34e fix parm updating logic. Matt Wells 2014-01-17 17:48:45 -0800
  • 2faba0efd1 fix repeat rounds sticking bug by adding PF_REBUILDURLFILTERS flag to spiderroundastarttime parm Matt Wells 2014-01-17 17:17:10 -0800
  • 16f8af0d57 added awesome streaming mode support to tcpserver.cpp for sending back json objects as we get them from shards. and as we get them in small pieces so we don't go oom. made that code much simpler and more reliable in the long run. Matt Wells 2014-01-17 16:26:17 -0800
  • 0844dbf72a added url process pattern and regex to xmldoc.cpp. Matt Wells 2014-01-17 11:08:23 -0800
  • 01a3282020 fix problem scanning spiderdb. move dedup spiderdb code to RdbMerge.cpp where it really should be. Matt Wells 2014-01-16 17:04:08 -0800
  • 167d2dc99f nothing. Matt Wells 2014-01-16 13:40:27 -0800
  • 980d63632a more msg5 re-read fixes. stop re-reading if increasing minrecsizes did nothing. fix tight merges so they work over all colls. fix merge counting to be fast and not loop over all rdbbases which could be thousands. add num mirrors to rebalance.txt. fix updateCrawlInfo to wait for all replies. critical error! Matt Wells 2014-01-16 13:38:22 -0800
  • dde05446f5 sharding fixes for 3+ stripes. Matt Wells 2014-01-16 11:20:12 -0700
  • ae3aa445e8 rebalancer working pretty well now Matt Wells 2014-01-15 19:08:47 -0800
  • 4b27b22949 git rebalancing working right Matt Wells 2014-01-15 17:40:17 -0800
  • 4a04542829 rebalancer is running now Matt Wells 2014-01-15 16:09:38 -0800
  • f8c2329bd2 rebalancer fixes Matt Wells 2014-01-15 15:42:59 -0800
  • b8058cc16c rebalance fix Matt Wells 2014-01-15 14:49:11 -0800
  • 883487889d make gb install only have 10 outstanding per an ip since ssh seems to close connections if you have more than 12 out. Matt Wells 2014-01-15 14:41:30 -0800
  • d091c7e959 fix hostsinagreement bug Matt Wells 2014-01-14 11:24:32 -0800
  • cb5b4af271 show reason spiders are not going above the spider queue page. Matt Wells 2014-01-11 21:40:45 -0800
  • 9da106e7ca added emergency msg box on all admin pages Matt Wells 2014-01-11 20:35:13 -0800
  • eed606601e added emergency msg box on all admin pages Matt Wells 2014-01-11 20:14:44 -0800
  • 92047661ae fix annoying rdbtree pos/neg key counting issue Matt Wells 2014-01-11 18:04:28 -0800
  • 6de7abf6ba display fixes. ./gb installgb and ./gb installgb2 now install 'gb' if 'gb.new' is not present. Matt Wells 2014-01-11 17:16:20 -0800
  • 299a208253 reduce log spam Matt Wells 2014-01-11 16:49:43 -0800
  • 444151f185 Merge branch 'master' into diffbot Matt Wells 2014-01-11 16:10:48 -0800
  • 8a49e87a61 got code with shard rebalancing compiling. now we store a "sharded by termid" bit in posdb key for checksums, etc keys that are not sharded by docid. save having to do disk seeks on every host in the cluster to do a dup check, etc. Matt Wells 2014-01-11 16:08:42 -0800
  • f5071e1d88 fix docid range splitting logic. Matt Wells 2014-01-10 20:39:13 -0700
  • 4a4eac663f fix old bug. Matt Wells 2014-01-10 18:52:47 -0700
  • 4378c922ac localhosts.conf file takes precedence over hosts.conf file so you can put your customizations into there and easily do a gitpull without having to worry about hosts.conf being checked in and stuff. Matt Wells 2014-01-10 18:38:11 -0700
  • 3a6f0d81e3 fix a few cores. assume any ip that matches the c-block of any host in hosts.conf file is "local". clarified specs in admin.html. Matt Wells 2014-01-10 18:34:47 -0700
  • f64b53bfb3 almost done with rebalancing code Matt Wells 2014-01-10 14:12:58 -0800
  • 8943106389 minor print updates Matt Wells 2014-01-09 21:23:51 -0800
  • 1d6ba52dcd list collections in sidebar. Matt Wells 2014-01-09 21:13:41 -0800
  • 6660dca57c default parm updates Matt Wells 2014-01-09 20:07:19 -0800
  • c596b6c5a6 default gb.conf update Matt Wells 2014-01-09 19:59:02 -0800
  • aa4c751de9 minor updates Matt Wells 2014-01-09 19:50:04 -0800
  • d76e7a9c8e highlight non-default value parms. Matt Wells 2014-01-09 19:37:17 -0800
  • 645360b730 parm simplifications Matt Wells 2014-01-09 19:00:21 -0800
  • 501f49c81b gui and parm updates. simplifications. Matt Wells 2014-01-09 17:29:18 -0800
  • 128e055120 take out datedb. no longer used. we store dates in posdb since it has larger keys than indexdb. Matt Wells 2014-01-09 13:39:28 -0800
  • 4d7fa1eea9 pretty up url filters table Matt Wells 2014-01-09 13:34:43 -0800
  • ebdf1f638a fix ./gb installgb2 to be semi-sequential Matt Wells 2014-01-09 13:25:45 -0800
  • d8554bfb0f update default parm settings. Matt Wells 2014-01-09 13:22:51 -0800
  • 47327a0c41 Merge branch 'master' into diffbot Matt Wells 2014-01-09 13:07:59 -0800
  • 70f8c416de allow collections to be added when no colls exist. fixed gb start2 etc. to be sequential. Matt Wells 2014-01-09 13:07:16 -0800
  • 161a5c5d6b logging cleanups Matt Wells 2014-01-09 12:38:38 -0800
  • 482186e2af Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-09 11:35:46 -0800
  • 6ba3936d0b various core fixes. need to fix json parser mem allocation right though. Added dynamic rdb map ptr allocation to save memory when you have thousands of collections. Matt Wells 2014-01-09 11:34:52 -0800
  • 5007dc8e0c fix core in gb seektest Matt Wells 2014-01-09 11:17:05 -0700
  • 4a842c1c68 fix occasional core in Mem.cpp Matt Wells 2014-01-08 01:32:24 -0700
  • 65270c4063 fix spiders not going Matt Wells 2014-01-07 16:05:22 -0800
  • 1bf827ed62 more spider not going fixes Matt Wells 2014-01-07 16:48:20 -0700
  • 0dddd8fab1 fix spiders not going bug Matt Wells 2014-01-07 16:41:14 -0700
  • 3db562f0f1 bug fixes for pages indexed and manual seeds counting. Matt Wells 2014-01-07 15:38:22 -0800
  • f0c232803a fix page url filters editing Matt Wells 2014-01-07 14:27:58 -0800
  • 3cc83e0c26 added .mov files to media list Matt Wells 2014-01-07 13:08:32 -0800
  • 7bbecd4678 no more diffbot api url per url filter row. it's global now. Matt Wells 2014-01-07 12:43:02 -0800
  • d69fc065ce just use a global diffbot api url for simplicity to avoid having to call XmlDoc::getUrlFilterNum() which is not good practice since url filters are for spiderdb reads really. Matt Wells 2014-01-07 12:31:25 -0800
  • 909022642d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-07 12:10:59 -0800
  • e186b5e31c getUrlFilterNum now uses fake spider reply with only langid set. to prevent recursive loops. Matt Wells 2014-01-07 12:10:29 -0800
  • 0be59b4c4d Merge branch 'master' into diffbot Matt Wells 2014-01-07 12:09:35 -0800
  • e366c12470 Merge branch 'master' into diffbot Matt Wells 2014-01-07 12:09:11 -0800
  • c529eaf1c3 fix alloc in a thread bug Matt Wells 2014-01-07 11:32:12 -0700
  • b0457f973d Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-06 15:56:18 -0800
  • 724af442d4 do not do simplified redirs if its a bulk job Matt Wells 2014-01-06 15:55:41 -0800
  • 7e5b9bc1e8 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-06 15:49:47 -0800
  • ace49d0b16 fix sectiondb core Matt Wells 2014-01-06 15:49:35 -0800
  • c472635660 do not do simplified redirs for custom crawls so clients see their original urls. Matt Wells 2014-01-06 15:46:39 -0800
  • 50c99dd815 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2014-01-06 14:28:18 -0800
  • 49c935cf6d added SpiderReply::m_wasIndexedValid so we know whether to count m_wasIndexed and m_isIndexed for page counting quota purposes. Matt Wells 2014-01-06 14:27:38 -0800
  • 599be55b81 return {} if no crawls, and just specify token. Matt Wells 2014-01-06 13:55:58 -0800
  • 4ed30a98ec bring back checkboxes. fix issue by putting an input hidden box with value=0 before the checkbox to transmit it even if unchecked. Matt Wells 2014-01-06 11:35:17 -0800
  • 622790d0f8 radio button fixes. make them buttons now. Matt Wells 2014-01-06 10:23:44 -0800
  • 4f64677b4f get new global preemptive cache logic compiling, with section voting stats. Matt Wells 2014-01-05 11:51:09 -0800
  • 258c48bb98 increase hardware requirements to 4GB from 500MB until we adjust in-memory structure mem usage dynamically. Matt Wells 2014-01-04 18:05:50 -0700
  • 9bf49884b9 fix compiler warning mwells 2014-01-02 01:35:52 -0700