Commit Graph

  • 3acd6a08d5 add the true spider request when retrying to spider a fake-ip spider request. add a EFAKEIP error reply for the fake ip request. prevents us double spidering the same url. Matt Wells 2013-12-23 10:27:42 -08:00
  • 11d6d5ad6a Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-12-23 09:30:52 -08:00
  • 2ac8ff2952 compile regex so it's case dependent Matt Wells 2013-12-23 09:30:35 -08:00
  • b0d77a834a do not spider fake ips requests, just re-add them with the right firstip Matt Wells 2013-12-20 12:22:02 -08:00
  • 6f2e552bcd fix core in linked list of msg13requests in case one gets freed Matt Wells 2013-12-20 11:26:46 -08:00
  • 5fcfff6729 fixes for spiders getting stuck Matt Wells 2013-12-19 20:04:06 -07:00
  • 4c7ce819b9 fix core dump Matt Wells 2013-12-19 18:39:29 -08:00
  • c2f8445a70 expand reg ex shortcuts like \d to [0-9] Matt Wells 2013-12-19 18:31:37 -08:00
  • 261f4feb9b fixed cdata parsing issue Matt Wells 2013-12-19 16:04:53 -08:00
  • 3092dcecaa rebuild url filters and regexes at startup Matt Wells 2013-12-19 15:56:27 -08:00
  • 99099505d8 call regfree before changing regex Matt Wells 2013-12-19 15:32:26 -08:00
  • 7f70e4e887 fix regex logic Matt Wells 2013-12-19 15:19:18 -08:00
  • aad12f9fe3 minor print format fix Matt Wells 2013-12-19 14:30:56 -08:00
  • ef5decb0b8 more fixing stuck spiders Matt Wells 2013-12-19 14:17:22 -08:00
  • 32db83ae47 try to fix spiders from petering out. reset doledb next keys and empty flags every 3 minutes. Matt Wells 2013-12-19 13:31:14 -08:00
  • cb111a1efa fix doledb empty logic Matt Wells 2013-12-19 13:06:35 -08:00
  • d2f9dcf8e0 revert last commit. not needed Matt Wells 2013-12-19 10:33:04 -07:00
  • 784b6900cd more spider fixes Matt Wells 2013-12-19 10:29:01 -07:00
  • d5f63888a3 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-12-19 10:15:05 -07:00
  • 56461ee795 fix spidering getting stuck bug Matt Wells 2013-12-19 10:14:50 -07:00
  • a440e1cbf5 update admin link on root page and documentation for url filters mwells 2013-12-18 19:51:50 -07:00
  • a0ceade641 fix oom doleiptable using too much mem so bulk job went oom Matt Wells 2013-12-18 17:20:53 -08:00
  • e93cfe8ac6 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-12-18 15:57:39 -08:00
  • 58ce15a7f3 fix big post of 70MB of urls Matt Wells 2013-12-18 15:57:10 -08:00
  • 894ced5b08 Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot Matt Wells 2013-12-18 15:29:54 -08:00
  • 356836812d wtf i did not modify these files. Matt Wells 2013-12-18 15:26:50 -08:00
  • 60dddfc669 final fixes for parms Matt Wells 2013-12-18 15:22:54 -08:00
  • 6f0137889b fixes for getUrlFilterNum so it looks at "hadReply" bit in SpiderRequest when getting diffbot api url. Matt Wells 2013-12-18 14:05:41 -08:00
  • 7170c4f0ea rebuild url filters was not getting called when some relevant parms were updated. Matt Wells 2013-12-18 13:24:38 -08:00
  • 1b5057ad42 log cleanups mostly. took out disk page cache, kinda buggy... need to fix at some point. Matt Wells 2013-12-18 10:57:18 -08:00
  • 2ffad5d835 fix cores Matt Wells 2013-12-17 17:35:07 -08:00
  • 2f2333abd1 parmdb fixes Matt Wells 2013-12-17 14:53:33 -08:00
  • 33ba8070b5 more bug fixes parmdb Matt Wells 2013-12-17 13:09:05 -08:00
  • 31e16c972d fix restart crawl Matt Wells 2013-12-17 11:17:33 -08:00
  • 39a0b7f85e parm updates Matt Wells 2013-12-17 10:53:12 -08:00
  • d03028ea93 bulk api post truncation fix Matt Wells 2013-12-17 10:03:46 -08:00
  • 2cd53386ad parm updates Matt Wells 2013-12-17 09:51:08 -08:00
  • 523d32a2ea parmdb updates Matt Wells 2013-12-16 18:13:38 -08:00
  • ad4a4415d0 fix pauseCrawl Matt Wells 2013-12-16 17:21:59 -08:00
  • 3f19ece776 parmdb updates Matt Wells 2013-12-16 17:07:15 -08:00
  • 617a0ff76e parmdb fixes Matt Wells 2013-12-16 16:04:43 -08:00
  • 6c652c1cc6 more parmdb fixes Matt Wells 2013-12-16 15:39:24 -08:00
  • 9a65febd9e parmdb updates Matt Wells 2013-12-16 14:35:27 -08:00
  • 1fe91cad2f parmdb updates Matt Wells 2013-12-16 14:10:39 -08:00
  • 9b080ff89c more parmdb bug fixes Matt Wells 2013-12-16 13:36:31 -08:00
  • 9be1ab6323 more parmdb fixes Matt Wells 2013-12-16 12:20:13 -08:00
  • a727fb10e6 parmdb fixes for checkboxes. use radio buttons. Matt Wells 2013-12-16 11:41:43 -08:00
  • 9cb99f7621 Merge branch 'diffbot' into diffbot-testing Matt Wells 2013-12-16 11:06:11 -08:00
  • 0615acff17 zero out url filters checkboxes on submit Matt Wells 2013-12-16 11:03:40 -08:00
  • 2b10a3327d Merge branch 'master' into diffbot Matt Wells 2013-12-16 10:49:40 -08:00
  • 22eb06e54d a few bugfixes imported from neo github subdir Matt Wells 2013-12-16 10:49:13 -08:00
  • 660f43cec7 fix bugs of pthreads junk not being async safe. we were calling fprintf from a signal handler (interrupt) while fprintf was currently in progress and the pthread junk did not like that. Matt Wells 2013-12-15 11:41:41 -07:00
  • 06f67db16b forgot to unlock thread lock Matt Wells 2013-12-15 10:43:34 -07:00
  • 0d976e5a7f include pthread.h Matt Wells 2013-12-15 10:40:50 -07:00
  • 7cad5df43e try to fix core from pthreads logging msgs. Matt Wells 2013-12-15 10:38:18 -07:00
  • 39c8a9a1d7 Merge branch 'master' of git@github.com:gigablast/open-source-search-engine Matt Wells 2013-12-14 15:20:04 -07:00
  • f9f73dae65 fixed core from null json Matt Wells 2013-12-14 15:19:52 -07:00
  • 4fdc781a27 fix spiders sticking when coll is immediately deleted after seeding a url. Matt Wells 2013-12-14 10:52:41 -07:00
  • 777dfb9713 fix round from incrementing while spiders out. Matt Wells 2013-12-13 20:14:34 -07:00
  • b7a96a0a1d minor update Matt Wells 2013-12-13 19:47:11 -07:00
  • a8d03b0634 more parmdb bug fixes Matt Wells 2013-12-12 13:57:19 -08:00
  • 463e6ced54 Merge branch 'diffbot' into diffbot-testing Matt Wells 2013-12-12 13:02:50 -08:00
  • abcbfa7a60 Merge branch 'master' into diffbot Matt Wells 2013-12-12 13:02:12 -08:00
  • 7b768d4b86 Merge branch 'diffbot' into diffbot-testing Matt Wells 2013-12-12 13:01:49 -08:00
  • 16e91375f4 bring in changes from live beta from ~/github. limit spiders to 50, not 500 to prevent oom. resume killed merges that had num files shrunk even if down to one file. show collnum in spider queue. remove back-to-back whitespace, and make all space a ' ' for getting the doc checksum for deduping. Matt Wells 2013-12-12 12:58:58 -08:00
  • 33d4b92544 Merge branch 'diffbot' into diffbot-testing Matt Wells 2013-12-12 12:51:43 -08:00
  • a13114605a more parm overhaul fixes Matt Wells 2013-12-12 12:44:54 -08:00
  • d85dbfb8e7 do not use safebuf in thread Matt Wells 2013-12-12 10:15:02 -07:00
  • 3f8c6378b3 parmdb fixes mwells 2013-12-10 17:45:34 -08:00
  • ead1112ea9 some parm overhaul bug fixes mwells 2013-12-10 17:06:27 -08:00
  • 76bb3d05e1 clean up logging so i can see what's going on mwells 2013-12-10 16:41:30 -08:00
  • db74af766b fix core in addExistingColl() mwells 2013-12-10 15:46:38 -08:00
  • 82494baa89 move CollectionRec stuff into Collectiondb files for simplicity. mwells 2013-12-10 15:28:04 -08:00
  • 14b0682d6b can't use safebuf in a thread. oops! Matt Wells 2013-12-10 14:20:44 -07:00
  • 22271c0bb2 do not accept msg4 add requests until in sync with host 0 mwells 2013-12-10 13:20:23 -08:00
  • f2d5661965 parmdb overhaul. support collection add/del sync when host comes back online. use udp not tcp. host #0 can now handle a new incoming request while a parm change is currently outstanding. all missed "command" parms will be received when a dead host comes back online, too, like a tight merge for instance. does not use msg4, uses msg3e and msg3f for syncing and sending parms. mwells 2013-12-10 13:09:55 -08:00
  • 0e47d48d8c test commit mwells 2013-12-10 13:02:52 -08:00
  • 1175478705 got this new parm shit compiling mwells 2013-12-10 12:54:19 -08:00
  • 9e1976a8e2 new parm stuff almost compiling. mwells 2013-12-10 11:13:43 -08:00
  • 6f6c4aed84 minor admin.html edit. Matt Wells 2013-12-10 10:39:38 -07:00
  • 1a7d5e389b very minor admin.html edit Matt Wells 2013-12-10 00:56:56 -07:00
  • ec2254d8ed added multi language support note to admin.html Matt Wells 2013-12-09 23:18:33 -07:00
  • f7e7acb398 minor log msg updates. updated admin.html to give some performance and storage capacity info. Matt Wells 2013-12-09 23:16:24 -07:00
  • 95bd6238d9 do not core when running filters when our gb home dir is really long. thanks bill! call XmlDoc::getSpiderPriority() with a SpiderReply so we can act on m_langId, like chinese, for instance, to filter those langs out from indexing. it was doing this before but got commented out for some reason. mwells 2013-12-09 22:55:02 -07:00
  • cc63fd048f Merge branch 'diffbot' of github.com:gigablast/open-source-search-engine into diffbot mwells 2013-12-09 13:46:08 -08:00
  • e04d596288 minor comments update. mwells 2013-12-09 13:42:33 -08:00
  • 2a5d4beec4 fix core from last push. Matt Wells 2013-12-09 14:21:46 -07:00
  • fa497de217 remove annoying log msg Matt Wells 2013-12-09 14:09:48 -07:00
  • 44ae7c4de6 mem labelling fixes. fixed bad alloc when generating gigabits. Matt Wells 2013-12-09 14:05:02 -07:00
  • 0dcd1211d3 new opensource icon. Matt Wells 2013-12-08 19:47:39 -07:00
  • 92ec3f1148 added open source icon to homepage Matt Wells 2013-12-08 19:45:49 -07:00
  • 92e3d841a6 minor update Matt Wells 2013-12-08 19:28:45 -07:00
  • 12404b4f85 doc updates Matt Wells 2013-12-08 19:26:48 -07:00
  • dd3b49faa9 collection name hell Matt Wells 2013-12-08 16:44:37 -07:00
  • 3353a90a85 fix resuming a killed merge condition. Matt Wells 2013-12-08 15:50:45 -07:00
  • ed79b67d2e core dump fixes Matt Wells 2013-12-08 15:36:23 -07:00
  • 144e2c898e save resources by not doing reads on an empty doledb priority. stop saving allSpidersOn and Off parms. Matt Wells 2013-12-08 14:07:31 -07:00
  • a2e52a5dc3 little fix Matt Wells 2013-12-08 10:15:54 -07:00
  • 020d7741b9 new coll.conf for main with ismedia filter. updated url filters docs some more for "isnew" and explained the errorcount stuff more. Matt Wells 2013-12-08 10:10:51 -07:00
  • 65e75167e3 limit posdb merging to 8 files max. added some more url filters documentation. Matt Wells 2013-12-08 09:41:05 -07:00
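
Commit 16e91375f4 above mentions removing back-to-back whitespace and turning every whitespace character into a single ' ' before computing the document checksum used for deduping. The following is a minimal sketch of that idea only, not the repository's code: the function names `normalize_whitespace` and `dedup_checksum` are hypothetical, and SHA-1 is an assumption chosen for illustration (the log does not say which checksum is actually used).

```python
import hashlib
import re

# Matches any run of one or more whitespace characters (spaces, tabs, newlines).
_WS_RUN = re.compile(r"\s+")


def normalize_whitespace(text: str) -> str:
    """Collapse every run of whitespace into a single ' ' character."""
    return _WS_RUN.sub(" ", text)


def dedup_checksum(text: str) -> str:
    """Hash the whitespace-normalized text.

    SHA-1 is an assumption made for this sketch; the commit log does not
    name the checksum function.
    """
    normalized = normalize_whitespace(text)
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    a = "hello   world\n\tagain"
    b = "hello world again"
    # Both documents differ only in whitespace, so their checksums match.
    print(dedup_checksum(a) == dedup_checksum(b))
```

With this normalization, two documents that differ only in spacing or line breaks hash to the same value, while documents that differ in their non-whitespace content still hash differently.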