Commit Graph

  • 0cb3a5d44e fix ptr_metadata issue Matt 2015-07-01 20:01:57 -0600
  • 1327301a8d Merge branch 'testing' into ia Matt 2015-07-01 19:03:56 -0600
  • 4e987f71d2 use gigabits json array, and update the json results output documentation to describe each thing. Matt 2015-07-01 18:59:52 -0600
  • 5c61495aea Merge branch 'diffbot-testing' into testing Matt 2015-07-01 18:56:56 -0600
  • b88079a2d4 Fix warc injector script. Zak Betz 2015-06-30 22:19:13 -0600
  • 7b507a70ef Set value length to 0 for something that does not return a string value in Json.cpp. Fix the '-' -> '_' when indexing generic fields. Add a StackBuf macro which is a Safebuf initialized with a small stack buffer for use in a local scope. Zak Betz 2015-06-30 14:09:57 -0600
  • f615ac9331 try to fix infinite loop bug again Matt 2015-06-25 17:30:27 -0700
  • 8c8c7eddf6 Merge branch 'diffbot-testing' into testing Matt 2015-06-23 17:45:35 -0600
  • 57020dcb24 added micro.html Matt 2015-06-23 17:45:25 -0600
  • 1e3c52a0ef fix infinite loop bug from performance enhancement using active list for spidering i put in a few days back. Matt 2015-06-23 13:52:02 -0700
  • 5aa8e6ba2a Merge branch 'diffbot-testing' into diffbot Matt 2015-06-22 14:31:39 -0700
  • dfdea910ad fix fix Matt 2015-06-19 08:50:59 -0700
  • 7a0ae294a2 reduce log spam when rebalancing Matt 2015-06-19 08:47:06 -0700
  • 5b104f1bb8 diffbt max ips from 1 back to 7 Matt Wells 2015-06-18 15:29:03 -0700
  • bdebd79f4f spiderloop active list bug fix. change diffbot ip max from 1 to 7 again. Matt Wells 2015-06-18 15:05:16 -0700
  • 5f5ce7d12c Merge branch 'diffbot-testing' into testing Matt 2015-06-18 11:02:21 -0600
  • e1aab778e9 fix errno miscount bug. fix infinite loop in active list logic. Matt 2015-06-18 10:55:07 -0600
  • 902a8fc61d fix errno mismatch bug Matt 2015-06-18 10:33:27 -0600
  • 0493e7a899 use linked lists for closing least used fds for speed. right now just log if it differs from current algo. Matt Wells 2015-06-18 09:19:13 -0700
  • 18dbaf89c9 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-06-18 08:40:53 -0700
  • e9f1ab1150 make donesleepingwrapper in spider.cpp faster using the active list of colls to save time. Matt Wells 2015-06-18 08:38:46 -0700
  • f490847eb2 Fix injector build script. Add IA's lib for getting metadata. Zak Betz 2015-06-18 01:23:13 -0600
  • 9f61636881 Change collection on inject script. Zak Betz 2015-06-18 00:24:36 -0600
  • 9ca0223cf1 Translate metadata field names with dashes to _. Add unit tests for searching for certain types of metadata. Zak Betz 2015-06-17 23:36:31 -0600
  • b8049aae58 added isfakeip url filter expression to help speed up bulk jobs Matt Wells 2015-06-17 13:59:13 -0700
  • f8b17e1dc9 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-06-17 10:18:34 -0700
  • 43130f3a8d exit if corruption detected at startup Matt Wells 2015-06-17 10:17:36 -0700
  • 68d04b239f auto move dat/map files we can't regen map for to trash subdir. later: try to repair them better. Matt Wells 2015-06-17 06:55:49 -0700
  • 2728c15d30 Fix gigabit bug. Include python environment for running on ia's server. Zak Betz 2015-06-17 00:57:01 -0600
  • 814fe60824 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-06-17 00:30:14 -0600
  • fab62fab3f Fix gigabit corruption. Add scaffolding to show json metadata in summaries. *WIP* Zak Betz 2015-06-17 00:27:23 -0600
  • d050fb81b5 fix rebuild code to rebuild spider status docs in index, and to remove them from titledb if user has disabled 'index spider replies' in the spider controls to save disk. made them off by default by now since they use some disk. Matt Wells 2015-06-16 16:29:26 -0600
  • 8ed2af53b1 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-06-15 19:24:40 -0700
  • b8f1cf9298 added a quickpoll to spider.cpp. reduce diffbot max spiders per ip from 7 to 1 to fix collection starvation at least temporarily until the proper fix is deployed. Matt Wells 2015-06-15 11:46:51 -0700
  • 32987e76ee Add json metadata field to page inject. Fix memory leak when spidering warc files. Add script to inject warcs from internet archives search results. Zak Betz 2015-06-14 20:58:41 -0600
  • 5f84ad2c5d raise mem table ptrs from 1.2M to 3M. shard 22 was suffering really slow mem ops because of it. Matt Wells 2015-06-14 11:34:13 -0700
  • f2a9e68998 Fixes #2920. Allow facet ranges to include asterisk Kevin Truong 2015-06-11 13:45:55 -0700
  • 1e50f211b5 Merge branch 'diffbot-testing' of https://github.com/gigablast/open-source-search-engine Kevin Truong 2015-06-11 13:37:23 -0700
  • a63eea0927 Merge branch 'diffbot-testing' into diffbot-sam Matt 2015-06-10 10:15:13 -0600
  • e399a8b0aa Add qa test for arc and warc files. Change XmlDoc to use timeaxis url when creating the titlerec key instead of the firsturl. Zak Betz 2015-05-21 15:19:33 -0600
  • 1114deeb29 Merge branch 'ia' into testing Matt 2015-05-20 10:36:24 -0600
  • 145c125abd Merge branch 'diffbot-testing' into diffbot-sam Matt 2015-05-19 19:39:32 -0700
  • 132a1940cc for urlage, use addedTime when discoveryTime is not available sam 2015-05-18 11:28:34 -0700
  • c54e9cd96e fix bug of printing results in csv when one summary had an error. Matt Wells 2015-05-16 09:28:13 -0700
  • 4b40845f94 put in place doxygen stuffs sam 2015-05-15 14:47:47 -0700
  • ca0b6f6ada a bit of documentation for RdbList, Msg2, SafeBuf sam 2015-05-15 11:55:40 -0700
  • 6c0407f257 Merge branch 'ia' into ia-zak Matt Wells 2015-05-13 10:32:26 -0600
  • 7f25fd884f Revert "always dedup if using time axis" Matt Wells 2015-05-13 10:32:09 -0600
  • e89eecf3dd always dedup if using time axis Matt 2015-05-12 20:13:09 -0600
  • 9b8065589a Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-05-12 15:20:20 -0600
  • 66a19d6e75 Merge branch 'master' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-05-12 15:20:11 -0600
  • 36037c23a1 Add a test for useTimeAxis. Zak Betz 2015-05-12 15:18:38 -0600
  • 284c44de02 fix fix Matt 2015-05-12 13:19:43 -0600
  • a5c11c666c try to fix double close bug while streaming Matt 2015-05-12 12:37:19 -0600
  • 315d5d8ea6 Merge branch 'ia' into ia-zak Matt 2015-05-11 08:26:43 -0600
  • f81521d1ce added html/test.arc.gz Matt 2015-05-11 08:26:17 -0600
  • 240d1e5fba Merge branch 'ia' into ia-zak Matt 2015-05-10 09:51:31 -0600
  • d7aa0e6de6 Merge branch 'diffbot-testing' into ia Matt 2015-05-10 09:51:21 -0600
  • 7c22d4770a Revert "added SpiderRequest::m_lastSuccessfulSpideredTime" Matt 2015-05-10 09:51:10 -0600
  • 29212bbe9c Merge branch 'ia' into ia-zak Matt 2015-05-10 09:41:55 -0600
  • 1a4bd55e0d Merge branch 'diffbot-testing' into ia Matt 2015-05-10 09:41:45 -0600
  • 29824085f1 added SpiderRequest::m_lastSuccessfulSpideredTime and a new url filter constraint for it. Matt 2015-05-10 09:41:12 -0600
  • 22ef7cb6a6 another test Matt Wells 2015-05-09 11:52:55 -0600
  • 21766acdfd nothing Matt 2015-05-09 11:42:47 -0600
  • 443135a7b5 Merge branch 'ia' into ia-zak Matt 2015-05-07 19:23:49 -0700
  • 420fc44ca5 added html/test.warc.gz for testing Matt 2015-05-07 19:23:17 -0700
  • 69e7b1165d do not truncate termlist reads. at least allow up to 2GB until we change minrecsizes from 32bit to 64bit. we were truncating at 90MB before. Matt Wells 2015-05-07 11:36:56 -0700
  • 4f2d3fe048 make query reindex (not query delete) distribute the spiderrequests based on the domain hash bits contained in the docid. that way the same domain is not clogging up all the spiders on all the hosts. Matt 2015-05-07 09:08:59 -0700
  • cd4ea14770 prevent mem leak Matt Wells 2015-05-06 12:32:01 -0700
  • 88dc6f0f06 do not add collection if crawlbot crawl name is > 30 chars. Matt Wells 2015-05-06 11:23:15 -0700
  • f90ebfd1d6 fix issue of not adding a spider status doc when the dns lookup failed on a fakefirstip spider request. also increment crawl attempts counter. Matt Wells 2015-05-06 10:47:27 -0700
  • 35e41d3615 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-05-06 09:58:51 -0700
  • a8a094a270 fix error reporting for regcomp() Matt Wells 2015-05-06 09:57:57 -0700
  • a821d8bc41 Merge branch 'ia' into ia-zak Matt 2015-05-05 23:46:16 -0700
  • 97238c577f log the discovery date of the url Matt 2015-05-05 21:46:32 -0700
  • c18ceb4eeb quick fix Matt 2015-05-05 21:38:06 -0700
  • bcae5de3b5 rename spiderage to urlage. Matt 2015-05-05 21:33:50 -0700
  • e66f096f6e Merge branch 'diffbot-testing' into diffbot-sam Matt 2015-05-05 21:26:38 -0700
  • ef604e6d5e introduce SpiderRequest::m_discoveryTime Matt 2015-05-05 21:23:30 -0700
  • 7a7dacc56d revive spiderwaited keyword for url filters Matt 2015-05-05 20:36:45 -0700
  • e8be058067 I have no idea of what I am doing Merge branch 'diffbot-sam' of https://github.com/gigablast/open-source-search-engine into diffbot-sam sam 2015-05-05 16:37:14 -0700
  • 47657c2ee8 allows to filter urls by spider age sam 2015-05-05 16:33:30 -0700
  • 894bd7fe8a count # of facets printed. Matt 2015-05-05 13:09:36 -0700
  • 8af9237c21 if we did not print any facet tables out then do not print "facets":[]\n, Matt 2015-05-05 13:05:00 -0700
  • 138611a97c Merge branch 'diffbot-testing' into diffbot-kevin Matt 2015-05-05 10:45:00 -0700
  • 07fceef452 Merge branch 'diffbot-testing' into diffbot-sam Matt 2015-05-05 10:13:25 -0700
  • 050206d5dc fix core from resetting the url filters when a url was about to be spidered. Matt Wells 2015-05-05 09:39:45 -0700
  • b35b178ce8 Fixes #2764 Kevin Truong 2015-05-05 02:18:40 -0700
  • 2eb106aaf5 merge from master branch to diffbot-kevin Kevin Truong 2015-05-05 01:53:17 -0700
  • ee5ffef834 fix core Matt Wells 2015-05-05 02:53:42 +0000
  • 6cf7471c8c should enable the regexes on the URLs for the diffbot processing when the collection is created via gigablast UI sam 2015-05-04 17:31:10 -0700
  • d6bb5e98f8 more ban detect fixes Matt Wells 2015-05-04 14:40:52 -0700
  • 86800a0656 if a root/seed url has no outlinks, assumed banned. Matt 2015-05-04 14:23:28 -0700
  • 2908079ce8 sleep dont exit on parsing inconsistency during qa test Matt 2015-05-03 22:39:17 -0700
  • 08e01b5ac8 fix more bugs. new injections seem somewhat stable now. Matt 2015-05-03 21:58:26 -0700
  • ff969d92bb can inject a single doc now Matt 2015-05-03 21:14:28 -0700
  • bc54282339 complete overhaul of injection pipeline now compiles. should distribute injection requests evenly over the cluster. uses new InjectionRequest class which sets from httprequest using parms in Parms.cpp. and easily serializes into a udp request. very nice. we should use this model going forward. Matt 2015-05-03 19:07:44 -0700
  • b39a065259 checkpoint #2 Matt 2015-05-03 17:51:47 -0700
  • 0df4abc759 checkpoint Matt Wells 2015-05-04 00:17:17 +0000
  • f63cccaf01 arc indexing works again Matt 2015-05-03 13:10:17 -0700