Commit Graph

  • 1cce6d510e do not be so easy to say crawling is paused because a shard is down. Matt Wells 2015-09-02 13:39:40 -07:00
  • be1d81a0db Merge branch 'diffbot-testing' into diffbot Matt Wells 2015-09-02 09:27:47 -07:00
  • 129c9d65db fix default hosts.conf generation Matt Wells 2015-09-02 09:26:03 -07:00
  • b4a0668b21 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-02 09:11:22 -06:00
  • 1dc68de0fa more fixes to pausing spiders if too many incoming udp slots. raised limit from 200 to 300. Matt Wells 2015-09-02 07:43:18 -07:00
  • aeb5039470 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-09-02 07:21:44 -06:00
  • a1a38bd2b2 fix attempt merge some more Matt 2015-09-02 07:21:32 -06:00
  • 98b5c05c84 do not construct waiting tree if too many udp incoming requests Matt Wells 2015-09-01 13:22:27 -07:00
  • 88ddfb14da if 200+ incoming udp slots then pause spidering on that host. should make it so calling getUrlFilterNum() will not slow down the whole network as much. Matt Wells 2015-09-01 12:58:39 -07:00
  • 586fe15cb2 parm updates Matt 2015-09-01 09:42:18 -06:00
  • 4d9c35098d update new parm Matt 2015-09-01 09:32:53 -06:00
  • e45397d9ce Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-01 09:23:23 -06:00
  • a9854394ef attempt merge clusterdb forgotten Matt 2015-09-01 09:16:18 -06:00
  • 724d1ed5b9 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-01 09:12:06 -06:00
  • a7971a4aa9 fix core again Matt 2015-09-01 09:10:30 -06:00
  • ae58ab98fb Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-09-01 08:48:38 -06:00
  • e115323e60 Merge branch 'diffbot-testing' into testing Matt Wells 2015-09-01 01:17:36 -07:00
  • 34a4068ddb fix gb start script Matt Wells 2015-09-01 01:17:14 -07:00
  • 6e0dfd5a23 fix merge attempts Matt Wells 2015-09-01 01:07:43 -07:00
  • 3f0194913d fix core Matt 2015-09-01 01:00:04 -06:00
  • b199c67355 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-08-31 23:19:45 -06:00
  • 5a7b01585d Work on graph axis autoscaling. Zak Betz 2015-08-31 23:19:28 -06:00
  • de51769e5a add switch to turn off site num inlink computation and just use sitelinks.txt for speed Matt 2015-08-31 22:29:51 -06:00
  • 0d7b465f17 fix coll getting starved by other coll on max ips limitation. Matt Wells 2015-08-31 15:03:05 -07:00
  • 9de719b050 Merge branch 'testing' into diffbot-testing Matt 2015-08-31 14:07:16 -06:00
  • c803e0906e fix </script> tag detection stuff again. Matt 2015-08-31 14:06:44 -06:00
  • efa93aad18 prevent double ./gb start calls from messing things up. Matt 2015-08-31 11:13:33 -06:00
  • ddf4ae2240 More testing on nospider, noquery. Add flags to make the nospider and noquery visible on hosts page. Zak Betz 2015-08-31 10:47:19 -06:00
  • 994bdbdd54 fix logging deadlock bug. Matt 2015-08-31 09:56:34 -06:00
  • e373f28728 update hosts.conf generation. removed old stuff. Matt 2015-08-31 09:29:28 -06:00
  • 744cd54131 Merge branch 'ia' into ia-zak Matt 2015-08-31 09:14:27 -06:00
  • 792f12587e Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-08-29 14:28:25 -07:00
  • cbf01ab77c add download new urls.csv link to crawlbot page Matt Wells 2015-08-29 14:28:01 -07:00
  • e3526fdacb bring back flush disk writes parm for experimenting with. Matt 2015-08-28 22:43:45 -06:00
  • b6b31c7be0 fixes for comments in script tags. Matt 2015-08-28 18:07:50 -06:00
  • 8299197cca comments in <script> tags are convoluted. deal with all four types and their precedence issues. all of this is to find the proper end of the </script> and not a </script> or <script> that is being printed out in the javascript in the <script> tag. Matt 2015-08-28 16:31:22 -06:00
  • a7222dcf3f nothing really. Matt 2015-08-28 13:32:40 -06:00
  • 7418d6e9e2 better parsing of <script> tags. now we use single and double quotes and comments so we ignore '</script>' or '<script>' if in a writeln statement, or comment, etc. Matt 2015-08-28 11:45:43 -06:00
  • b7e4ab9848 fix having <script> tags in a <script> tag if it is in single or double quotes. ignore escaped quotes. Matt 2015-08-28 09:08:39 -06:00
  • 41268aeba7 Changes to script to copy to back twins. Zak Betz 2015-08-28 09:06:16 -06:00
  • 3e47276950 update makefile for 32 bit compilation Matt 2015-08-25 15:01:27 -06:00
  • 60c4c5c437 Add nospider and noquery options. Zak Betz 2015-08-25 13:48:20 -06:00
  • c766e40357 set g_errno to ENOCOLLREC if getRdbBase() returns null. Matt Wells 2015-08-25 11:41:17 -07:00
  • 35a3ce14ad fix infinite loop when coll rec is deleted during a merge. Matt Wells 2015-08-25 11:15:02 -07:00
  • 962f8672bf fix getting xml status msg Matt 2015-08-25 11:27:07 -06:00
  • 1badb8cd07 fix up hammer queue table print out on sockets pages. make crawl delay link to the robots.txt. Matt 2015-08-25 11:07:54 -06:00
  • ea2c2d7190 show read buf of http sockets as well as the send buf in the tool tip. Matt 2015-08-25 10:53:16 -06:00
  • 7fcc2ab4e1 in the sockets table page, show url download requests that are queued up to prevent hammering an ip. also show the first 500 bytes of the send buf in the http server sockets table. Matt Wells 2015-08-25 09:34:45 -07:00
  • c7546cc646 save a malloc in bigfile Matt Wells 2015-08-25 08:37:53 -07:00
  • 9180c97a2c by default do not make static gb any more, not even on debian/ubuntu. we were not always detecting redhat installs correctly like on aws. Matt Wells 2015-08-25 08:33:48 -07:00
  • e140b001d8 try merging 1000 collections per call to preserve cpu Matt Wells 2015-08-25 08:25:55 -07:00
  • 01d77ee220 make umsg00 electric fence code SPECIAL Matt 2015-08-24 18:37:30 -06:00
  • 289c6b90cf ssl connect calls malloc Matt 2015-08-24 18:12:28 -06:00
  • 3db2ce8e24 errnotest.cpp fix Matt 2015-08-24 16:22:11 -06:00
  • f12d3ffd01 use useElectricFence var for clarity Matt 2015-08-24 14:35:23 -06:00
  • 9c686a40d3 fix realloc core from special umsg00 electric fence code Matt 2015-08-24 14:25:04 -06:00
  • 76bfe4a8ba elec free is delayed. Matt 2015-08-24 14:08:56 -06:00
  • 49e9f5a827 fixes for umsg00 electric fence. take out catdb/statsdb merging attempts. Matt 2015-08-24 11:35:33 -06:00
  • 65f61351ee efence only on umsg00 buffers. fix BigFile so we dont realloc file buf which could change file ptrs that are engaged in outstanding reads, and go back to using file ptrs again. Matt Wells 2015-08-24 09:23:40 -07:00
  • 94871521e7 nothing. Zak Betz 2015-08-23 21:06:19 -06:00
  • a4bfbb31f8 fix save prevention when coring in malloc/free. Matt Wells 2015-08-23 11:51:46 -07:00
  • a5a9820441 ignore tagdb tag rec bad recsize core. do not save conf if crawlbot and not host id 0 and cored in mem function, otherwise it just hangs and gb can't restart. Matt Wells 2015-08-23 09:40:11 -07:00
  • 493a816be8 cut down on the mallocs for BigFile::m_baseFilename safebuf Matt Wells 2015-08-22 17:30:09 -07:00
  • 0d320acebf do not access BigFile::m_fileBuf when in thread, it might have been reallocated or closed, etc. just use getCloseCount_r(). Matt Wells 2015-08-22 16:32:25 -07:00
  • e252dfb088 Add docs per second stat. Fix auto update on statsdb graph. Add Stat toggles for statsdb graph. Add a unit test for indexing an array in metadata. Zak Betz 2015-08-22 12:05:20 -06:00
  • bb16341f51 try to fix core dumps. not sure how mem is getting corrupted. Matt Wells 2015-08-22 08:52:28 -07:00
  • 0f7910125b make it so we can still save coll.conf on malloc/free cores. do not call RdbMap::reduceMemFootprint() on maps that are from files being merged into and we're resuming the killed merge at startup. Matt Wells 2015-08-21 18:07:07 -07:00
  • 23c7862892 added more quickpolls. Matt Wells 2015-08-21 16:52:23 -07:00
  • 035d232673 fix another core in crawlbot Matt Wells 2015-08-21 14:30:13 -07:00
  • 74ec812959 try to fix core from adding a file that already exists. just return an error now. hopefully merge will try again later. also core if you try to write recs to an rdbmap that has already had its memory footprint reduced so we can find that overrun bug. Matt Wells 2015-08-21 14:00:40 -07:00
  • ad695c001a fix stolen fd bug. Matt Wells 2015-08-20 20:07:30 -07:00
  • d4468c8d66 File::m_filename as a safebuf was causing problems. reverted to char buf of fixed length. only autosave coll.confs if there are not too many or we are host #0. otherwise it blocks too long. Matt Wells 2015-08-19 14:30:18 -07:00
  • 1b19d53286 give safebuf buffer for File::m_filename[]. easier to save if core in malloc/free and less mallocs in general. Matt Wells 2015-08-19 10:14:11 -07:00
  • adbec58f41 fix core from asking for too many docids Matt 2015-08-19 08:53:39 -07:00
  • 6c14d659b8 move 2nd occurrence of same collnum_t collection id on the same shard to the trash/ subdir. put call to syncParmsWithHost0 in a sleep loop in case host #0 has error, although the timeout is really high. Matt Wells 2015-08-18 18:59:01 -07:00
  • 9642947136 fix so host #0 will delete then re-add collections that use the same collnum but have a different name. fixed some unlabelled safebufs. fix core when deleting collnum from tree/buckets that is higher than Collectiondb.m_numRecs. fix File::m_filename safebufs that were not freed on exit. Matt Wells 2015-08-18 14:09:16 -07:00
  • dd9b4e0ca2 fix little core Matt Wells 2015-08-17 15:04:16 -07:00
  • 30693c3cf7 use setBuf() func instead Matt Wells 2015-08-16 22:19:30 -07:00
  • 28644f127e fix problem of saving rdbmap when coring in a malloc/free. Matt Wells 2015-08-16 22:14:53 -07:00
  • be1ebfbcd0 do not execute backtrace function if core was in Mem.cpp basically otherwise we don't save state. Matt Wells 2015-08-16 20:29:14 -07:00
  • 3a67480b63 for BigFile::m_fileBuf array of Files make sure to clear it for files that do not exist so File::m_calledSet is false on them. so BigFile::getFile(j) returns a File ptr whose m_calledSet is false if the file does not exist on disk. and BigFile::removePart(j) sets ((File *)m_fileBuf.m_bufStart)[j].m_calledSet = false. Matt Wells 2015-08-16 19:40:08 -07:00
  • 63c7752734 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-08-16 17:14:33 -07:00
  • e671be17ca fix log msg Matt Wells 2015-08-16 17:14:21 -07:00
  • b709f736f4 show max mem alloc slots in pagestats.cpp Matt 2015-08-16 17:32:47 -06:00
  • ffa6c09c74 fix BigFile::addPart(n) when adding parts out of order. Matt Wells 2015-08-16 15:13:59 -07:00
  • f8fb266844 fix new merging algo. Matt Wells 2015-08-16 10:11:21 -07:00
  • 178721d35b speed up getFileSize() by using stat() func again. despam logs at startup. do not perm check every coll dir, only first 100, on startup to make things faster. Matt Wells 2015-08-15 22:21:15 -07:00
  • bff643b555 use a linked list of merge candidates to make attemptMergeAll() much much faster. Matt 2015-08-15 19:26:37 -06:00
  • d9422d8b0e get rid of limits on file sizes. dynamically allocate file names and fixed-size File array in BigFile class. should save gigabytes of memory in many-collection systems with 1+ million files or so. Matt 2015-08-14 20:14:50 -06:00
  • f7f577cf98 the new disk page cache. temporarily disabled. Matt 2015-08-14 15:52:24 -06:00
  • 3213858545 Merge branch 'diffbot-testing' into diffbot Matt 2015-08-14 13:08:48 -06:00
  • 0d2aa33afb undo #define thing Matt 2015-08-14 13:08:11 -06:00
  • a1ed368d82 bring back max mem control into master controls. it's useful to limit per process mem usage to prevent oom killer because we can't save if we get killed. overhaul diskpagecache to just use rdbcache. much simpler and faster, but disabled for now until debugged more. reduce min files to merge for crawlbot collections so they stay more tightly merged to conserve fds and mem. improved logDebugDisk msgs. overhauled File.cpp fd pool. now it is way faster and doesn't use any extra mem. much simpler too. although could be sped up a little by using a linked list, but probably is not significant enough to warrant doing right now. increase mem ptr table from 3M to 8M slots. should really make dynamic though. fix core from null msg20s[0]->m_r. only call attemptMergeAll once every 60 seconds really. do not attempt merge if already merging. Matt 2015-08-14 12:58:54 -06:00
  • f09a94fc4e Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-08-13 23:31:17 -06:00
  • 36b8d384bd Fixes to injector script. New colors and metrics on performance graph. Zak Betz 2015-08-13 23:29:20 -06:00
  • 5c67cbe65d undo Matt 2015-08-12 08:43:44 -07:00
  • 444ebeeb65 one scp install per host Matt 2015-08-12 08:39:01 -07:00
  • 5c2a2ce496 fix core Matt 2015-08-12 08:36:23 -07:00
  • 866e712322 clarify log msg Matt Wells 2015-08-10 11:28:51 -07:00
  • c4189c64e6 Merge branch 'diffbot-testing' into diffbot Matt Wells 2015-08-10 11:05:31 -07:00