Commit Graph

  • a07c94f85d checkpoint Matt 2015-05-03 12:55:19 -0700
  • 91d9179e46 nominal changes to warc injecting Matt 2015-05-03 12:30:25 -0700
  • 5578153449 warc injects from file in spider pipeline working now. Matt 2015-05-03 12:28:02 -0700
  • a3672701f6 Merge branch 'diffbot-testing' into ia Matt 2015-05-03 12:08:25 -0700
  • 79b81ede00 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-05-03 11:21:22 -0700
  • e1dbaf81e9 count threads whose callback has not been called as 'outstanding' for purposes of shutting down so we can update the rdbmap after having written the rdblist to disk in rdbdump.cpp. Matt Wells 2015-05-03 10:58:23 -0700
  • 31e54df0c8 fix core from ::read returning 0. do not do ban checks when autobackoff and autoproxies are disabled. Matt Wells 2015-05-03 10:17:36 -0700
  • 52e7c75970 Merge branch 'diffbot' into diffbot-testing Matt Wells 2015-05-03 00:34:07 -0700
  • 821c6cb424 fix core Matt Wells 2015-05-03 00:33:47 -0700
  • a0192318c0 warcs with wget almost working right Matt 2015-05-02 23:50:49 -0700
  • 599b33524f wget cookie support Matt 2015-05-02 21:52:58 -0700
  • 7f75a5a5dc insert wget thread call Matt 2015-05-02 20:46:54 -0700
  • 9f27a5c4d1 inject warcs from file on disk since they are so big Matt Wells 2015-05-03 03:07:18 +0000
  • 2421bf3d1d ia checkpoint Matt Wells 2015-05-02 23:51:19 +0000
  • ff9cb85327 do not consider .gz a 'media' url extension any more since we got .warc.gz and .arc.gz Matt 2015-05-02 14:52:17 -0700
  • c54b1e429c Merge branch 'diffbot-testing' into ia Matt 2015-05-02 14:46:29 -0700
  • b0abe597e7 more fixes from qa test. Matt 2015-05-02 14:34:07 -0700
  • 14f5e4f97e now passing all qa tests Matt 2015-05-02 13:24:08 -0700
  • 16b73a9bdd now we pass both injection tests in qa.cpp Matt 2015-05-02 12:32:13 -0700
  • 6d39bb5df8 added hack to log controls to avoid sending msg13 (download requests) to host #0 until we fix the streaming bug better. Matt Wells 2015-05-02 10:32:13 -0700
  • b55359a95d fix XmlDoc::indexWarc() Matt 2015-05-01 23:57:14 -0700
  • ecb6d081d5 fix indexArc() Matt 2015-05-01 23:24:40 -0700
  • 5c89bde956 now all container doc logic is in xmldoc and out of pageinject. compiles. needs testing. Matt 2015-05-01 20:32:54 -0700
  • 0ca27638bc checkpoint. moved warc and arc looping into xmldoc. now will any container doc from pageinject into xmldoc. simplifies pageinject.cpp a lot. and sets up a framework for dealing with container docs. Matt 2015-05-01 19:11:13 -0700
  • b15b7cdb94 log a note to hint how to debug ssl library errors of some kind. Matt 2015-05-01 18:15:53 -0700
  • d3ca12ab0a fix corrupt spider replies from causing a url with error to be spidered over and over again. Matt Wells 2015-05-01 14:50:05 -0700
  • 6fe4c209fc notes Matt Wells 2015-05-01 13:53:34 -0700
  • dccc1667ec added logdebugmsg13 to find out why urls are getting stuck on host 0 in msg13 handler. turn crawldelay backoff logic off by default until we fix case of mis-detection on captchas and maybe some other things. fix core when loading twitchy table on startup. Matt Wells 2015-05-01 13:19:45 -0700
  • 03db6a0970 fix infinite spidering of a url because of titlerec uncompress error. Matt 2015-05-01 10:04:45 -0700
  • 4760462db9 be more consistent with cgi parms Matt 2015-05-01 09:53:37 -0700
  • d3c071e4c0 fix gbiaitem page Matt 2015-04-30 21:27:11 -0700
  • 85737739a8 move time axis parm down Matt 2015-04-30 20:30:20 -0700
  • ce030fcfb0 now .arc and .arc.gz injections work Matt 2015-04-30 20:25:26 -0700
  • 697b8307b2 fix qa test to make it easier to see the real diffs Matt 2015-04-30 19:38:27 -0700
  • b4d0c53904 fix single url injects Matt 2015-04-30 19:09:07 -0700
  • fbfdde5195 fix for old delimiterized injects. was coring in gb smokes. Matt 2015-04-30 19:07:12 -0700
  • e387c0f154 yay test warc injecting working Matt 2015-04-30 18:45:46 -0700
  • f1663402d9 compiles again now Matt 2015-04-30 18:23:46 -0700
  • df7dec9c74 Merge branch 'diffbot-testing' into ia Matt 2015-04-30 17:51:14 -0700
  • 67fad3e237 clarification Matt 2015-04-30 17:35:40 -0700
  • 16bf1cf063 make auto back off a parm. i could see you'd want to disable that if the ban/throttle detection is wrong. Matt 2015-04-30 17:32:51 -0700
  • 656d89d98d spider proxy simplifications. remove global parms and just make coll specific for now. Matt 2015-04-30 17:18:57 -0700
  • 20912541af remove global spider proxy parms for simplicity Matt Wells 2015-04-30 17:06:35 -0700
  • 3051310d16 fix core from bulk job ip ban detection in msg13 Matt Wells 2015-04-30 16:59:02 -0700
  • a15c9fd4c6 more fixes for auto proxies Matt 2015-04-30 16:52:46 -0700
  • 7f1ac7460f fixes for auto backoff Matt 2015-04-30 16:34:11 -0700
  • 1825f6bd27 retry download if was in the twitchy table at start of download, and not using proxies at all. Matt Wells 2015-04-30 16:06:13 -0700
  • 0970975a57 tested auto proxy use and auto spider (non-proxy) backoff to 3 second crawldelay successfully on the stamps site. Matt 2015-04-30 15:31:09 -0700
  • e1a1fd001a fix oopsy Matt 2015-04-30 14:20:58 -0700
  • 75c05ef9a9 twitchy updates Matt 2015-04-30 14:18:23 -0700
  • 66db73f494 if we can't use any proxies and we detected a url as banned then just use a crawldelay of 3 seconds. Matt 2015-04-30 14:16:17 -0700
  • 6d8bb19962 checkpoint for auto proxy logic Matt 2015-04-30 13:28:57 -0700
  • f3d9b016ce Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-04-30 11:11:42 -0700
  • ad88ea8ba9 fix gbss related cores. fix bn.com crawling redir bug. Matt Wells 2015-04-30 11:11:27 -0700
  • 184b157365 Merge branch 'diffbot-testing' into ia Matt 2015-04-29 21:43:00 -0700
  • 2479dd330d ok, move all the warc/arc parsing/indexing logic into pageinject.cpp and out of xmldoc.cpp. it makes more sense there. since really all we need to do is download the warc's content and it is like injecting a delimiterized document in the loop already in pageinject.cpp. Matt 2015-04-29 21:39:18 -0700
  • 45c0909cb7 injecting warc files nicely now Matt 2015-04-29 19:55:06 -0700
  • d60444360f Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-04-29 18:16:11 -0700
  • bdff012152 checkpoint for auto-proxy logic. Matt 2015-04-29 18:15:54 -0700
  • 13f5ef0b2c Merge branch 'testing' into diffbot-testing Matt Wells 2015-04-29 18:44:33 -0600
  • 6cca99ad6f Merge branch 'master' into testing Matt Wells 2015-04-29 18:44:15 -0600
  • b65fd6a0cb comments Matt Wells 2015-04-29 18:43:16 -0600
  • 9177f6c75d ok, fix for debian jessie. no more -O3 flags because it breaks! :( Matt Wells 2015-04-29 18:41:16 -0600
  • 29a6d7a085 fix core Matt Wells 2015-04-29 13:34:28 -0700
  • dd289e54da if a seed (hopcount 0) url redirects to a different domain then treat all outlinks from EITHER domain as being "on the same domain" for purposes of matching the url filter "isonsamedomain". so we can spider bn.com which redirects to barnesandnoble.com even though only bn.com may be in the seed list of the crawlbot crawl. Matt Wells 2015-04-29 13:02:22 -0700
  • 26ecd0dcef Merge branch 'diffbot' into diffbot-testing Matt 2015-04-29 10:56:03 -0700
  • fd0cd4b3db 4th time is a charm for links with spaces Matt 2015-04-29 10:50:18 -0700
  • f92ba87ba6 fix bn.com -> barnesandnoble.com redirect issue so if bn.com is the seed we can still spider the barnesandnoble.com links as if they are in the same seed domain. Matt 2015-04-29 10:13:12 -0700
  • a57a3dac4e fix and test the links with spaces thing in it again. Matt 2015-04-29 08:56:37 -0700
  • 21948e15f6 more fixes Matt 2015-04-28 23:30:14 -0700
  • 9370c8f52e more fixes Matt 2015-04-28 23:20:16 -0700
  • faf2c06d29 some fixes for indexing warcs/arcs. Matt 2015-04-28 22:30:58 -0700
  • a3f1802b26 prevent core when injecting when not in sync with host #0 as far as collections and clock. Matt 2015-04-28 15:29:26 -0700
  • 33985734ba only show gbssMatchesUrl** things if they are non-empty patterns/regexes. Matt Wells 2015-04-28 14:47:33 -0700
  • 71de09a299 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-04-28 14:42:32 -0700
  • 09a79d230c check for .css?* better as media extensions. do it when adding outlinks in xmldoc.cpp. Matt 2015-04-28 14:42:04 -0700
  • 0b43390c35 fix core from making status doc. Matt Wells 2015-04-28 14:35:13 -0700
  • e615ff72d4 Merge branch 'diffbot-testing' into diffbot Matt Wells 2015-04-28 14:08:08 -0700
  • e2eba10068 qa test fix Matt 2015-04-28 13:48:29 -0700
  • 63c1ac7628 add all the gbssMatches*Regex/Pattern stuff to the status doc so we know what patterns/regexes are being matched or not. Matt 2015-04-28 10:27:14 -0700
  • cb6b2b6cd4 show gbssMatchesPageProcessPattern in status doc if page was downloaded during a custom crawl. Matt 2015-04-28 10:14:07 -0700
  • b6ff0b0173 Merge branch 'diffbot-testing' into ia Matt 2015-04-27 21:42:01 -0600
  • 0eb415d408 added preliminary support for spidering .warc.gz and .arc.gz files Matt 2015-04-27 21:41:22 -0600
  • 5c926fa217 fix spaces in link urls some more. hack. Matt 2015-04-27 18:32:50 -0600
  • 109fe29afe fix mem leak some more Matt Wells 2015-04-27 16:07:46 -0700
  • ee8652e55f fix the mem leak bug of xmldoc1/TitleRecu1 Matt Wells 2015-04-27 15:53:53 -0700
  • ccb53eb4e7 use http://127.0.0.1:8000/iagbcoll/<itemname> as a url whose content will be the arc/warc files as urls. Matt 2015-04-25 17:50:22 -0600
  • eb10c62303 fix support for _html.json Matt Wells 2015-04-25 14:37:16 -0700
  • 891edf479c fix csv when &stream=1 Matt Wells 2015-04-25 09:07:03 -0700
  • 109aaef36d show diffbot uri in csv output Matt Wells 2015-04-25 08:48:03 -0700
  • 4d2e04959e Merge branch 'diffbot-testing' into diffbot Matt Wells 2015-04-25 08:17:31 -0700
  • 17f6f77721 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-04-25 08:23:20 -0700
  • 42bdd551d2 fix core Matt 2015-04-25 08:22:35 -0700
  • 71fbdf6518 time axis support Matt 2015-04-24 22:09:10 -0600
  • fc6d9631c5 fix for _html.json download Matt Wells 2015-04-24 18:02:46 -0600
  • 18877e51ac pretty printing Matt Wells 2015-04-24 11:41:06 -0600
  • 57fd26e289 expose pattern matching stuff to gb ui so diffbot can use the gi directly for crawl jobs. Matt Wells 2015-04-24 11:32:17 -0600
  • 95a70bc13c fix for indexing diffbot uri in gbss doc Matt Wells 2015-04-24 11:16:58 -0600
  • 824e664ff1 fix printing of new facet count stat Matt Wells 2015-04-24 11:04:39 -0600
  • b5e092d047 now return the total # of docs that have the facet/value pair, not just the # in the search results. Matt Wells 2015-04-24 10:59:34 -0600