Commit Graph

  • b15b7cdb94 log a note to hint how to debug ssl library errors of some kind. Matt 2015-05-01 18:15:53 -07:00
  • d3ca12ab0a fix corrupt spider replies from causing a url with error to be spidered over and over again. Matt Wells 2015-05-01 14:50:05 -07:00
  • 6fe4c209fc notes Matt Wells 2015-05-01 13:53:34 -07:00
  • dccc1667ec added logdebugmsg13 to find out why urls are getting stuck on host 0 in msg13 handler. turn crawldelay backoff logic off by default until we fix case of mis-detection on captchas and maybe some other things. fix core when loading twitchy table on startup. Matt Wells 2015-05-01 13:19:45 -07:00
  • 03db6a0970 fix infinite spidering of a url because of titlerec uncompress error. Matt 2015-05-01 10:04:45 -07:00
  • 4760462db9 be more consistent with cgi parms Matt 2015-05-01 09:53:37 -07:00
  • d3c071e4c0 fix gbiaitem page Matt 2015-04-30 21:27:11 -07:00
  • 85737739a8 move time axis parm down Matt 2015-04-30 20:30:20 -07:00
  • ce030fcfb0 now .arc and .arc.gz injections work Matt 2015-04-30 20:25:26 -07:00
  • 697b8307b2 fix qa test to make it easier to see the real diffs Matt 2015-04-30 19:38:27 -07:00
  • b4d0c53904 fix single url injects Matt 2015-04-30 19:09:07 -07:00
  • fbfdde5195 fix for old delimited injects. was coring in gb smokes. Matt 2015-04-30 19:07:12 -07:00
  • e387c0f154 yay test warc injecting working Matt 2015-04-30 18:45:46 -07:00
  • f1663402d9 compiles again now Matt 2015-04-30 18:23:46 -07:00
  • df7dec9c74 Merge branch 'diffbot-testing' into ia Matt 2015-04-30 17:51:14 -07:00
  • 67fad3e237 clarification Matt 2015-04-30 17:35:40 -07:00
  • 16bf1cf063 make auto back off a parm. i could see you'd want to disable that if the ban/throttle detection is wrong. Matt 2015-04-30 17:32:51 -07:00
  • 656d89d98d spider proxy simplifications. remove global parms and just make coll specific for now. Matt 2015-04-30 17:18:57 -07:00
  • 20912541af remove global spider proxy parms for simplicity Matt Wells 2015-04-30 17:06:35 -07:00
  • 3051310d16 fix core from bulk job ip ban detection in msg13 Matt Wells 2015-04-30 16:59:02 -07:00
  • a15c9fd4c6 more fixes for auto proxies Matt 2015-04-30 16:52:46 -07:00
  • 7f1ac7460f fixes for auto backoff Matt 2015-04-30 16:34:11 -07:00
  • 1825f6bd27 retry download if was in the twitchy table at start of download, and not using proxies at all. Matt Wells 2015-04-30 16:06:13 -07:00
  • 0970975a57 tested auto proxy use and auto spider (non-proxy) backoff to 3 second crawldelay successfully on the stamps site. Matt 2015-04-30 15:31:09 -07:00
  • e1a1fd001a fix oopsy Matt 2015-04-30 14:20:58 -07:00
  • 75c05ef9a9 twitchy updates Matt 2015-04-30 14:18:23 -07:00
  • 66db73f494 if we can't use any proxies and we detected a url as banned then just use a crawldelay of 3 seconds. Matt 2015-04-30 14:16:17 -07:00
  • 6d8bb19962 checkpoint for auto proxy logic Matt 2015-04-30 13:28:57 -07:00
  • f3d9b016ce Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt Wells 2015-04-30 11:11:42 -07:00
  • ad88ea8ba9 fix gbss related cores. fix bn.com crawling redir bug. Matt Wells 2015-04-30 11:11:27 -07:00
  • 184b157365 Merge branch 'diffbot-testing' into ia Matt 2015-04-29 21:43:00 -07:00
  • 2479dd330d ok, move all the warc/arc parsing/indexing logic into pageinject.cpp and out of xmldoc.cpp. it makes more sense there, since really all we need to do is download the warc's content and it is like injecting a delimited document in the loop already in pageinject.cpp. Matt 2015-04-29 21:39:18 -07:00
  • 45c0909cb7 injecting warc files nicely now Matt 2015-04-29 19:55:06 -07:00
  • d60444360f Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-04-29 18:16:11 -07:00
  • bdff012152 checkpoint for auto-proxy logic. Matt 2015-04-29 18:15:54 -07:00
  • 13f5ef0b2c Merge branch 'testing' into diffbot-testing Matt Wells 2015-04-29 18:44:33 -06:00
  • 6cca99ad6f Merge branch 'master' into testing Matt Wells 2015-04-29 18:44:15 -06:00
  • b65fd6a0cb comments Matt Wells 2015-04-29 18:43:16 -06:00
  • 9177f6c75d ok, fix for debian jessie. no more -O3 flags because it breaks! :( Matt Wells 2015-04-29 18:41:16 -06:00
  • 29a6d7a085 fix core Matt Wells 2015-04-29 13:34:28 -07:00
  • dd289e54da if a seed (hopcount 0) url redirects to a different domain then treat all outlinks from EITHER domain as being "on the same domain" for purposes of matching the url filter "isonsamedomain". so we can spider bn.com which redirects to barnesandnoble.com even though only bn.com may be in the seed list of the crawlbot crawl. Matt Wells 2015-04-29 13:02:22 -07:00
  • 26ecd0dcef Merge branch 'diffbot' into diffbot-testing Matt 2015-04-29 10:56:03 -07:00
  • fd0cd4b3db 4th time is a charm for links with spaces Matt 2015-04-29 10:50:18 -07:00
  • f92ba87ba6 fix bn.com -> barnesandnoble.com redirect issue so if bn.com is the seed we can still spider the barnesandnoble.com links as if they are in the same seed domain. Matt 2015-04-29 10:13:12 -07:00
  • a57a3dac4e fix and test the links with spaces thing in it again. Matt 2015-04-29 08:56:37 -07:00
  • 21948e15f6 more fixes Matt 2015-04-28 23:30:14 -07:00
  • 9370c8f52e more fixes Matt 2015-04-28 23:20:16 -07:00
  • faf2c06d29 some fixes for indexing warcs/arcs. Matt 2015-04-28 22:30:58 -07:00
  • a3f1802b26 prevent core when injecting when not in sync with host #0 as far as collections and clock. Matt 2015-04-28 15:29:26 -07:00
  • 33985734ba only show gbssMatchesUrl** things if they are non-empty patterns/regexes. Matt Wells 2015-04-28 14:47:33 -07:00
  • 71de09a299 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-04-28 14:42:32 -07:00
  • 09a79d230c check for .css?* better as media extensions. do it when adding outlinks in xmldoc.cpp. Matt 2015-04-28 14:42:04 -07:00
  • 0b43390c35 fix core from making status doc. Matt Wells 2015-04-28 14:35:13 -07:00
  • e615ff72d4 Merge branch 'diffbot-testing' into diffbot Matt Wells 2015-04-28 14:08:08 -07:00
  • e2eba10068 qa test fix Matt 2015-04-28 13:48:29 -07:00
  • 63c1ac7628 add all the gbssMatches*Regex/Pattern stuff to the status doc so we know what patterns/regexes are being matched or not. Matt 2015-04-28 10:27:14 -07:00
  • cb6b2b6cd4 show gbssMatchesPageProcessPattern in status doc if page was downloaded during a custom crawl. Matt 2015-04-28 10:14:07 -07:00
  • b6ff0b0173 Merge branch 'diffbot-testing' into ia Matt 2015-04-27 21:42:01 -06:00
  • 0eb415d408 added preliminary support for spidering .warc.gz and .arc.gz files Matt 2015-04-27 21:41:22 -06:00
  • 5c926fa217 fix spaces in link urls some more. hack. Matt 2015-04-27 18:32:50 -06:00
  • 109fe29afe fix mem leak some more Matt Wells 2015-04-27 16:07:46 -07:00
  • ee8652e55f fix the mem leak bug of xmldoc1/TitleRecu1 Matt Wells 2015-04-27 15:53:53 -07:00
  • ccb53eb4e7 use http://127.0.0.1:8000/iagbcoll/<itemname> as a url whose content will be the arc/warc files as urls. Matt 2015-04-25 17:50:22 -06:00
  • eb10c62303 fix support for _html.json Matt Wells 2015-04-25 14:37:16 -07:00
  • 891edf479c fix csv when &stream=1 Matt Wells 2015-04-25 09:07:03 -07:00
  • 109aaef36d show diffbot uri in csv output Matt Wells 2015-04-25 08:48:03 -07:00
  • 4d2e04959e Merge branch 'diffbot-testing' into diffbot Matt Wells 2015-04-25 08:17:31 -07:00
  • 17f6f77721 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-04-25 08:23:20 -07:00
  • 42bdd551d2 fix core Matt 2015-04-25 08:22:35 -07:00
  • 71fbdf6518 time axis support Matt 2015-04-24 22:09:10 -06:00
  • fc6d9631c5 fix for _html.json download Matt Wells 2015-04-24 18:02:46 -06:00
  • 18877e51ac pretty printing Matt Wells 2015-04-24 11:41:06 -06:00
  • 57fd26e289 expose pattern matching stuff to gb ui so diffbot can use the ui directly for crawl jobs. Matt Wells 2015-04-24 11:32:17 -06:00
  • 95a70bc13c fix for indexing diffbot uri in gbss doc Matt Wells 2015-04-24 11:16:58 -06:00
  • 824e664ff1 fix printing of new facet count stat Matt Wells 2015-04-24 11:04:39 -06:00
  • b5e092d047 now return the total # of docs that have the facet/value pair, not just the # in the search results. Matt Wells 2015-04-24 10:59:34 -06:00
  • e254118b5f print numbers as strings when printing the csv Matt Wells 2015-04-24 10:26:53 -06:00
  • 0a48930ba3 spaces in links fix. added gbssDiffbotUri to gbss docs. Matt Wells 2015-04-24 10:23:07 -06:00
  • e6a914d882 Merge branch 'diffbot-testing' of github.com:gigablast/open-source-search-engine into diffbot-testing Matt 2015-04-23 21:18:19 -06:00
  • a81dcb6442 fix slow spider proxy loop Matt 2015-04-23 21:17:55 -06:00
  • 2aeb88e19b update search api doc Matt Wells 2015-04-22 18:55:25 -06:00
  • c3c7e757fa Merge remote-tracking branch 'origin/diffbot' into diffbot-kevin Matt 2015-04-22 17:45:19 -06:00
  • 4e9a8f351b try to fix a core from restarting a collection that was in the middle of dumping to disk. Matt Wells 2015-04-22 16:07:16 -07:00
  • 0656cc4c72 fix a core on seraph host #6 Matt Wells 2015-04-22 15:46:35 -07:00
  • b0b26126a5 fix parens bug for gbsortbyint:gbspiderdate) do not include ( or ) as part of the field value since they are associated with boolean syntax. Matt Wells 2015-04-22 14:02:28 -06:00
  • 7462b0cd84 gb -h fix Matt Wells 2015-04-22 12:51:32 -06:00
  • 00661287da mysyn fixes Matt 2015-04-22 08:34:29 -06:00
  • 05fc660ef2 fix love<->like syn mapping from wiktionary. Matt 2015-04-21 20:58:33 -06:00
  • a2feab9a4a tap in some fixes for running the newly updated smokes for dealing with the new urls.csv format Matt Wells 2015-04-21 15:20:57 -07:00
  • 1dd3912ca0 default isr back on mwells 2015-04-21 08:19:32 -06:00
  • 8e5f57d677 take comments out mwells 2015-04-21 08:14:54 -06:00
  • a7640dadc1 hop count bug fix when merging spiderdb lists and doing deduping. do not change hopcounts in spider request records. mwells 2015-04-20 15:17:36 -06:00
  • e05dde5934 show the path depth of spidered urls in the logs mwells 2015-04-19 16:17:30 -06:00
  • 644ad28912 debugging the hopcount bug Matt Wells 2015-04-19 15:51:29 -06:00
  • 80f2584b5d more new urls.csv fixes Matt Wells 2015-04-15 18:38:29 -07:00
  • 25aab18870 add crawl try # to urls.csv Matt 2015-04-15 19:31:44 -06:00
  • 11ea50935d use new urls.csv only for GET /v3/crawl/download/token-collname_urls.csv version 3 Matt 2015-04-15 17:48:55 -06:00
  • ef42a9cf28 new urls.csv polish. moved columns around. added some new gbss fields, like spidered time. Matt 2015-04-15 17:42:56 -06:00
  • fec347a7df fix bug of partial facet counting. Matt 2015-04-15 14:54:49 -06:00
  • 496124da39 fix new urls.csv output Matt 2015-04-15 12:53:43 -06:00