Commit Graph

  • b8049aae58 added isfakeip url filter expression to help speed up bulk jobs Matt Wells 2015-06-17 13:59:13 -07:00
  • f8b17e1dc9 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-06-17 10:18:34 -07:00
  • 43130f3a8d exit if corruption detected at startup Matt Wells 2015-06-17 10:17:36 -07:00
  • 68d04b239f auto move dat/map files we can't regen map for to trash subdir. later: try to repair them better. Matt Wells 2015-06-17 06:55:49 -07:00
  • 2728c15d30 Fix gigabit bug. Include python environment for running on ia's server. Zak Betz 2015-06-17 00:57:01 -06:00
  • 814fe60824 Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-06-17 00:30:14 -06:00
  • fab62fab3f Fix gigabit corruption. Add scaffolding to show json metadata in summaries. *WIP* Zak Betz 2015-06-17 00:27:23 -06:00
  • d050fb81b5 fix rebuild code to rebuild spider status docs in index, and to remove them from titledb if user has disabled 'index spider replies' in the spider controls to save disk. made them off by default by now since they use some disk. Matt Wells 2015-06-16 16:29:26 -06:00
  • 8ed2af53b1 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-06-15 19:24:40 -07:00
  • b8f1cf9298 added a quickpoll to spider.cpp. reduce diffbot max spiders per ip from 7 to 1 to fix collection starvation at least temporarily until the proper fix is deployed. Matt Wells 2015-06-15 11:46:51 -07:00
  • 32987e76ee Add json metadata field to page inject. Fix memory leak when spidering warc files. Add script to inject warcs from internet archives search results. Zak Betz 2015-06-14 20:58:41 -06:00
  • 5f84ad2c5d raise mem table ptrs from 1.2M to 3M. shard 22 was suffering really slow mem ops because of it. Matt Wells 2015-06-14 11:34:13 -07:00
  • f2a9e68998 Fixes #2920. Allow facet ranges to include asterisk Kevin Truong 2015-06-11 13:45:55 -07:00
  • 1e50f211b5 Merge branch 'diffbot-testing' of https://github.com/gigablast/open-source-search-engine Kevin Truong 2015-06-11 13:37:23 -07:00
  • a63eea0927 Merge branch 'diffbot-testing' into diffbot-sam Matt 2015-06-10 10:15:13 -06:00
  • e399a8b0aa Add qa test for arc and warc files. Change XmlDoc to use timeaxis url when creating the titlerec key instead of the firsturl. Zak Betz 2015-05-21 15:19:33 -06:00
  • 1114deeb29 Merge branch 'ia' into testing Matt 2015-05-20 10:36:24 -06:00
  • 145c125abd Merge branch 'diffbot-testing' into diffbot-sam Matt 2015-05-19 19:39:32 -07:00
  • 132a1940cc for urlage, use addedTime when discoveryTime is not available sam 2015-05-18 11:28:34 -07:00
  • c54e9cd96e fix bug of printing results in csv when one summary had an error. Matt Wells 2015-05-16 09:28:13 -07:00
  • 4b40845f94 put in place doxygen stuffs sam 2015-05-15 14:47:47 -07:00
  • ca0b6f6ada a bit of documentation for RdbList, Msg2, SafeBuf sam 2015-05-15 11:55:40 -07:00
  • 6c0407f257 Merge branch 'ia' into ia-zak Matt Wells 2015-05-13 10:32:26 -06:00
  • 7f25fd884f Revert "always dedup if using time axis" Matt Wells 2015-05-13 10:32:09 -06:00
  • e89eecf3dd always dedup if using time axis Matt 2015-05-12 20:13:09 -06:00
  • 9b8065589a Merge branch 'ia-zak' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-05-12 15:20:20 -06:00
  • 66a19d6e75 Merge branch 'master' of https://github.com/gigablast/open-source-search-engine into ia-zak Zak Betz 2015-05-12 15:20:11 -06:00
  • 36037c23a1 Add a test for useTimeAxis. Zak Betz 2015-05-12 15:18:38 -06:00
  • 284c44de02 fix fix Matt 2015-05-12 13:19:43 -06:00
  • a5c11c666c try to fix double close bug while streaming Matt 2015-05-12 12:37:19 -06:00
  • 315d5d8ea6 Merge branch 'ia' into ia-zak Matt 2015-05-11 08:26:43 -06:00
  • f81521d1ce added html/test.arc.gz Matt 2015-05-11 08:26:17 -06:00
  • 240d1e5fba Merge branch 'ia' into ia-zak Matt 2015-05-10 09:51:31 -06:00
  • d7aa0e6de6 Merge branch 'diffbot-testing' into ia Matt 2015-05-10 09:51:21 -06:00
  • 7c22d4770a Revert "added SpiderRequest::m_lastSuccessfulSpideredTime" Matt 2015-05-10 09:51:10 -06:00
  • 29212bbe9c Merge branch 'ia' into ia-zak Matt 2015-05-10 09:41:55 -06:00
  • 1a4bd55e0d Merge branch 'diffbot-testing' into ia Matt 2015-05-10 09:41:45 -06:00
  • 29824085f1 added SpiderRequest::m_lastSuccessfulSpideredTime and a new url filter constraint for it. Matt 2015-05-10 09:41:12 -06:00
  • 22ef7cb6a6 another test Matt Wells 2015-05-09 11:52:55 -06:00
  • 21766acdfd nothing Matt 2015-05-09 11:42:47 -06:00
  • 443135a7b5 Merge branch 'ia' into ia-zak Matt 2015-05-07 19:23:49 -07:00
  • 420fc44ca5 added html/test.warc.gz for testing Matt 2015-05-07 19:23:17 -07:00
  • 69e7b1165d do not truncate termlist reads. at least allow up to 2GB until we change minrecsizes from 32bit to 64bit. we were truncating at 90MB before. Matt Wells 2015-05-07 11:36:56 -07:00
  • 4f2d3fe048 make query reindex (not query delete) distribute the spiderrequests based on the domain hash bits contained in the docid. that way the same domain is not clogging up all the spiders on all the hosts. Matt 2015-05-07 09:08:59 -07:00
  • cd4ea14770 prevent mem leak Matt Wells 2015-05-06 12:32:01 -07:00
  • 88dc6f0f06 do not add collection if crawlbot crawl name is > 30 chars. Matt Wells 2015-05-06 11:23:15 -07:00
  • f90ebfd1d6 fix issue of not adding a spider status doc when the dns lookup failed on a fakefirstip spider request. also increment crawl attempts counter. Matt Wells 2015-05-06 10:47:27 -07:00
  • 35e41d3615 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-05-06 09:58:51 -07:00
  • a8a094a270 fix error reporting for regcomp() Matt Wells 2015-05-06 09:57:57 -07:00
  • a821d8bc41 Merge branch 'ia' into ia-zak Matt 2015-05-05 23:46:16 -07:00
  • 97238c577f log the discovery date of the url Matt 2015-05-05 21:46:32 -07:00
  • c18ceb4eeb quick fix Matt 2015-05-05 21:38:06 -07:00
  • bcae5de3b5 rename spiderage to urlage. Matt 2015-05-05 21:33:50 -07:00
  • e66f096f6e Merge branch 'diffbot-testing' into diffbot-sam Matt 2015-05-05 21:26:38 -07:00
  • ef604e6d5e introduce SpiderRequest::m_discoveryTime Matt 2015-05-05 21:23:30 -07:00
  • 7a7dacc56d revive spiderwaited keyword for url filters Matt 2015-05-05 20:36:45 -07:00
  • e8be058067 I have no idea of what I am doing Merge branch 'diffbot-sam' of https://github.com/gigablast/open-source-search-engine into diffbot-sam sam 2015-05-05 16:37:14 -07:00
  • 47657c2ee8 allows to filter urls by spider age sam 2015-05-05 16:33:30 -07:00
  • 894bd7fe8a count # of facets printed. Matt 2015-05-05 13:09:36 -07:00
  • 8af9237c21 if we did not print any facet tables out then do not print "facets":[]\n, Matt 2015-05-05 13:05:00 -07:00
  • 138611a97c Merge branch 'diffbot-testing' into diffbot-kevin Matt 2015-05-05 10:45:00 -07:00
  • 07fceef452 Merge branch 'diffbot-testing' into diffbot-sam Matt 2015-05-05 10:13:25 -07:00
  • 050206d5dc fix core from resetting the url filters when a url was about to be spidered. Matt Wells 2015-05-05 09:39:45 -07:00
  • b35b178ce8 Fixes #2764 Kevin Truong 2015-05-05 02:18:40 -07:00
  • 2eb106aaf5 merge from master branch to diffbot-kevin Kevin Truong 2015-05-05 01:53:17 -07:00
  • ee5ffef834 fix core Matt Wells 2015-05-05 02:53:42 +00:00
  • 6cf7471c8c should enable the regexes on the URLs for the diffbot processing when the collection is created via gigablast UI sam 2015-05-04 17:31:10 -07:00
  • d6bb5e98f8 more ban detect fixes Matt Wells 2015-05-04 14:40:52 -07:00
  • 86800a0656 if a root/seed url has no outlinks, assumed banned. Matt 2015-05-04 14:23:28 -07:00
  • 2908079ce8 sleep dont exit on parsing inconsistency during qa test Matt 2015-05-03 22:39:17 -07:00
  • 08e01b5ac8 fix more bugs. new injections seem somewhat stable now. Matt 2015-05-03 21:58:26 -07:00
  • ff969d92bb can inject a single doc now Matt 2015-05-03 21:14:28 -07:00
  • bc54282339 complete overhaul of injection pipeline now compiles. should distribute injection requests evenly over the cluster. uses new InjectionRequest class which sets from httprequest using parms in Parms.cpp. and easily serializes into a udp request. very nice. we should use this model going forward. Matt 2015-05-03 19:07:44 -07:00
  • b39a065259 checkpoint #2 Matt 2015-05-03 17:51:47 -07:00
  • 0df4abc759 checkpoint Matt Wells 2015-05-04 00:17:17 +00:00
  • f63cccaf01 arc indexing works again Matt 2015-05-03 13:10:17 -07:00
  • a07c94f85d checkpoint Matt 2015-05-03 12:55:19 -07:00
  • 91d9179e46 nominal changes to warc injecting Matt 2015-05-03 12:30:25 -07:00
  • 5578153449 warc injects from file in spider pipeline working now. Matt 2015-05-03 12:28:02 -07:00
  • a3672701f6 Merge branch 'diffbot-testing' into ia Matt 2015-05-03 12:08:25 -07:00
  • 79b81ede00 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-05-03 11:21:22 -07:00
  • e1dbaf81e9 count threads whose callback has not been called as 'outstanding' for purposes of shutting down so we can update the rdbmap after having written the rdblist to disk in rdbdump.cpp. Matt Wells 2015-05-03 10:58:23 -07:00
  • 31e54df0c8 fix core from ::read returning 0. do not do ban checks when autobackoff and autoproxies are disabled. Matt Wells 2015-05-03 10:17:36 -07:00
  • 52e7c75970 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-05-03 00:34:07 -07:00
  • 821c6cb424 fix core Matt Wells 2015-05-03 00:33:47 -07:00
  • a0192318c0 warcs with wget almost working right Matt 2015-05-02 23:50:49 -07:00
  • 599b33524f wget cookie support Matt 2015-05-02 21:52:58 -07:00
  • 7f75a5a5dc insert wget thread call Matt 2015-05-02 20:46:54 -07:00
  • 9f27a5c4d1 inject warcs from file on disk since they are so big Matt Wells 2015-05-03 03:07:18 +00:00
  • 2421bf3d1d ia checkpoint Matt Wells 2015-05-02 23:51:19 +00:00
  • ff9cb85327 do not consider .gz a 'media' url extension any more since we got .warc.gz and .arc.gz Matt 2015-05-02 14:52:17 -07:00
  • c54b1e429c Merge branch 'diffbot-testing' into ia Matt 2015-05-02 14:46:29 -07:00
  • b0abe597e7 more fixes from qa test. Matt 2015-05-02 14:34:07 -07:00
  • 14f5e4f97e now passing all qa tests Matt 2015-05-02 13:24:08 -07:00
  • 16b73a9bdd now we pass both injection tests in qa.cpp Matt 2015-05-02 12:32:13 -07:00
  • 6d39bb5df8 added hack to log controls to avoid sending msg13 (download requests) to host #0 until we fix the streaming bug better. Matt Wells 2015-05-02 10:32:13 -07:00
  • b55359a95d fix XmlDoc::indexWarc() Matt 2015-05-01 23:57:14 -07:00
  • ecb6d081d5 fix indexArc() Matt 2015-05-01 23:24:40 -07:00
  • 5c89bde956 now all container doc logic is in xmldoc and out of pageinject. compiles. needs testing. Matt 2015-05-01 20:32:54 -07:00
  • 0ca27638bc checkpoint. moved warc and arc looping into xmldoc. now will any container doc from pageinject into xmldoc. simplifies pageinject.cpp a lot. and sets up a framework for dealing with container docs. Matt 2015-05-01 19:11:13 -07:00