Commit Graph

  • d4313ac193 valgrind: Suppress a bunch and memcheck warnings originating from cld3->protobuf. Possibly inappropriate suppressions but easily deleted when full-run leak diagnostics are wanted Ivan Skytte Jørgensen 2017-09-12 15:17:33 +02:00
  • 8d09aae731 fix comment Ivan Skytte Jørgensen 2017-09-12 15:03:15 +02:00
  • 378df7e0e7 WantedCheck: renamed the content check to multi-content-check in preparation for single-content-check Ivan Skytte Jørgensen 2017-09-12 15:01:07 +02:00
  • ae8694a20c Made HttpMime::getContentTypeFromExtension() static Ivan Skytte Jørgensen 2017-09-12 14:04:23 +02:00
  • 239334e239 HttpMime: more const Ivan Skytte Jørgensen 2017-09-12 14:01:55 +02:00
  • 9c274b656d Add unit test for RobotsCheckList Ai Lin Chia 2017-09-12 12:21:53 +02:00
  • baf96a3f7f Add RobotsCheckList feature so that we don't spider pages we're not allowed to Ai Lin Chia 2017-09-12 12:05:48 +02:00
  • 7e10ea8fa0 Handle host containing port Ai Lin Chia 2017-09-11 17:47:27 +02:00
  • ccfbe494f7 Slightly better log in SpiderColl::populateWaitingTreeFromSpiderdb() Ivan Skytte Jørgensen 2017-09-11 15:56:07 +02:00
  • 19313bce13 Replaced SpiderColl::m_waitingTreeEndKey with KEYMAX() (it always had that value) Ivan Skytte Jørgensen 2017-09-11 15:41:52 +02:00
  • 435b3b13ab Fix timing issue with interrupted docdelete Ai Lin Chia 2017-09-09 20:46:54 +02:00
  • 9f3532ed7c Strip cart_id Ai Lin Chia 2017-09-08 17:31:27 +02:00
  • 6a5a1b4f9e Detect Wordfence captcha Ivan Skytte Jørgensen 2017-09-08 14:12:06 +02:00
  • 01e9448244 Remove query parameter random, rand, _random, _rand Ai Lin Chia 2017-09-08 14:08:46 +02:00
  • 72a58f1b2b Keep statistics on crawl bans Ivan Skytte Jørgensen 2017-09-08 13:18:45 +02:00
  • d1e8fededa Append detected blocks/captchas to crawlban.* files Ivan Skytte Jørgensen 2017-09-08 12:28:02 +02:00
  • 7a979593ab Explain crawl-ban better Ivan Skytte Jørgensen 2017-09-08 12:18:28 +02:00
  • 31136a6346 Block urls ending with %00 to %1F Ai Lin Chia 2017-09-08 12:00:30 +02:00
  • 2fd07495a3 Run Url test through multiple versions Ai Lin Chia 2017-09-07 12:57:32 +02:00
  • 45a8885b98 Make sure we check full path value when validating. Run UrlParserTest through multiple versions of titlerec to make sure we don't break anything through the versions Ai Lin Chia 2017-09-07 12:43:47 +02:00
  • 1bbe5e6d77 Detect Distil networks captcha-blocks Ivan Skytte Jørgensen 2017-09-07 16:20:06 +02:00
  • 9f1c2f80ac Detect blocked-by-cloudflare and avoid deleting the already-indexed document Ivan Skytte Jørgensen 2017-09-07 14:21:37 +02:00
  • 548f964fbb Enable block of IP based urls Ai Lin Chia 2017-09-06 16:52:18 +02:00
  • 27efc0c476 Document that parameter value is case sensitive Ai Lin Chia 2017-09-06 15:53:24 +02:00
  • fe659a27a0 Add UrlMatch by parameter & parameter value Ai Lin Chia 2017-09-06 15:50:01 +02:00
  • 10b56037db Simplify code for UrlMatch/UrlMatchList Ai Lin Chia 2017-09-06 15:11:36 +02:00
  • 1741bfe385 Add urlblacklist file example Ai Lin Chia 2017-09-06 12:14:57 +02:00
  • 17f843c273 Add block by file for UrlMatch Ai Lin Chia 2017-09-06 12:03:25 +02:00
  • 1f7b82c384 Fix unit test Ai Lin Chia 2017-09-06 11:27:32 +02:00
  • cc979dbbc4 Add more url parameters to clean. (mostly on sid, refererurl, affiliate ids) Ai Lin Chia 2017-09-06 10:44:01 +02:00
  • 648f332fb8 Fix tools help input arg Ai Lin Chia 2017-08-31 14:31:59 +02:00
  • b27ea27252 Extended HttpMime class to also track Server: header Ivan Skytte Jørgensen 2017-09-05 16:41:30 +02:00
  • 99e6b7bdd9 Msg13.cpp: move #includes to top of file Ivan Skytte Jørgensen 2017-09-05 15:18:13 +02:00
  • 14d4f03862 typo in comment Ivan Skytte Jørgensen 2017-09-04 16:46:23 +02:00
  • aeb9cf3e27 Don't treat econnreset when fetching /robots.txt as a permission to crawl Ivan Skytte Jørgensen 2017-09-04 16:23:08 +02:00
  • 77400c4137 More const Ivan Skytte Jørgensen 2017-09-04 12:47:06 +02:00
  • 6cd4ae0950 added option to dump wanted docids (excluding those that get blocked by blocklists and parameter stripping) Brian Rasmusson 2017-09-03 19:56:20 +02:00
  • 8a72b10b33 Made bigram weight configurable Ivan Skytte Jørgensen 2017-09-01 13:57:46 +02:00
  • f6ee72a925 Posdbtable: minimergebuffer: rename methods ...TermInfoINdex... to ...TermIndex.. Ivan Skytte Jørgensen 2017-08-31 17:07:07 +02:00
  • 212d2170b9 posdbtable: allow multiple sublists per wordpos Ivan Skytte Jørgensen 2017-08-31 16:07:11 +02:00
  • 6d6d45e655 Removed constant-value 'currTermDone' Ivan Skytte Jørgensen 2017-08-31 14:24:32 +02:00
  • a61cf61656 Add matchPrefix & matchSuffix to MatchCriteria Ai Lin Chia 2017-08-31 14:22:33 +02:00
  • a016d2053f Whitespace changes Ai Lin Chia 2017-08-31 12:08:23 +02:00
  • 73f00cbf0e Add constructor for GigablastRequest instead of memset which will break a preallocated vector in Msg4 Ai Lin Chia 2017-08-30 16:41:37 +02:00
  • b787e2374e Add dump unwanted spiderdb records. Clean up unwanted spiderdb records during merge Ai Lin Chia 2017-08-30 13:18:45 +02:00
  • 69958b70b7 A bit more const in msg25/20/linkdb (getLinkInfo() does not modify url/site parameters) Ivan Skytte Jørgensen 2017-08-29 16:40:35 +02:00
  • 32e6072855 Removed superfluous intermediate local variables Ivan Skytte Jørgensen 2017-08-29 16:33:19 +02:00
  • 2900913e80 Merge branch 'master' into staging Ai Lin Chia 2017-08-29 16:23:08 +02:00
  • 3a8f2caaf3 Log redir url as well Ai Lin Chia 2017-08-29 15:44:56 +02:00
  • 0b335bca75 We don't want the document when redirected url is blocked Ai Lin Chia 2017-08-29 15:08:17 +02:00
  • a5dc8cc517 Merge branch 'master' into staging Ai Lin Chia 2017-08-29 14:15:14 +02:00
  • 02876dd164 Msg4::m_inUse fix Ivan Skytte Jørgensen 2017-08-29 14:14:15 +02:00
  • 9d3a92d447 minor fix to json output: better indentation so humans can parse it more easily Ivan Skytte Jørgensen 2017-08-29 13:51:53 +02:00
  • 681a03e2ce Merge branch 'master' into staging Ai Lin Chia 2017-08-29 13:44:39 +02:00
  • 1c88a8b370 Handle queries with only highfreq and qstopwords better Ivan Skytte Jørgensen 2017-08-29 13:33:55 +02:00
  • 47f42ce3de Add hasDeadHost check to DocDelete Ai Lin Chia 2017-08-29 13:28:59 +02:00
  • 3aa10d1cf8 Stop processing of docdelete when spidering is enabled midway Ai Lin Chia 2017-08-29 12:21:38 +02:00
  • 6d82724e80 Don't process docdelete when spidering is enabled Ai Lin Chia 2017-08-29 12:15:22 +02:00
  • 1f46f8100f Fix error where we continue processing even when there are msg4 being queued Ai Lin Chia 2017-08-29 11:29:00 +02:00
  • 1569926807 Remove no longer true comment Ai Lin Chia 2017-08-29 10:52:45 +02:00
  • ff1cfa0bb3 Extend dump titledb unwanted to use shlib Ai Lin Chia 2017-08-28 22:40:49 +02:00
  • 984d62f1d7 Fix bug in Msg4Out where when we need to retry due to lack of buffer space/multicast object, we'll lose rdb records Ai Lin Chia 2017-08-28 18:00:30 +02:00
  • 79ac584806 Fix hanging of gb when gb is shutdown while docdelete is processing file Ai Lin Chia 2017-08-28 16:32:34 +02:00
  • f0cf4010d7 Merge branch 'master' into dev-docdelete Ai Lin Chia 2017-08-28 13:13:24 +02:00
  • ffceff1e4c Fix use-after-scope in GbDns (ares_timeout() returns the 3rd argument) Ivan Skytte Jørgensen 2017-08-28 12:09:42 +02:00
  • d7bbc0e7bf Use logTrace instead of log Ai Lin Chia 2017-08-28 11:27:55 +02:00
  • 711afdf728 Merge branch 'master' into dev-docdelete Ai Lin Chia 2017-08-25 17:08:50 +02:00
  • 0dc337a6e8 Fix typo in log Ivan Skytte Jørgensen 2017-08-25 16:46:33 +02:00
  • 976cb4aa37 More const in main.cpp Ivan Skytte Jørgensen 2017-08-25 16:45:13 +02:00
  • 048f9fa753 Fix buffer overrun of static strings (so probably not harmful) Ivan Skytte Jørgensen 2017-08-25 16:26:07 +02:00
  • dd6db193ba Print nice message for new error code. Add newly added shlib error code to be part of 'deleted' statistics Ai Lin Chia 2017-08-25 15:35:21 +02:00
  • 5699981dde Add dump_rdbtree Ai Lin Chia 2017-08-25 15:26:57 +02:00
  • 19b09f12e7 Count detailed statistics for isUrlBlocked() Ivan Skytte Jørgensen 2017-08-24 16:52:17 +02:00
  • 319a564ae3 Merge branch 'master' into dev-docdelete Ai Lin Chia 2017-08-24 16:42:08 +02:00
  • f80036ce78 Use distinct error numbers for blocked-by-shlib Ivan Skytte Jørgensen 2017-08-24 16:25:54 +02:00
  • c1407dfe31 Implemented whitelist feature. Ivan Skytte Jørgensen 2017-08-24 15:50:56 +02:00
  • 8501e6a8bc Moved is-url-blocked check to separate file Ivan Skytte Jørgensen 2017-08-24 15:12:04 +02:00
  • 55972b7e40 Add docdelete last processed docid. Modify docdelete to namespace instead of class. Add DocDelete::finalize method Ai Lin Chia 2017-08-24 14:48:17 +02:00
  • a0b00870bc Renamed UrlBlock* to UrlMatch* Ivan Skytte Jørgensen 2017-08-24 14:46:02 +02:00
  • e50f24c160 Use dlerror to display error instead of printing errno Ai Lin Chia 2017-08-23 12:31:33 +02:00
  • 3203777bbd Initial implementation of DocDelete Ai Lin Chia 2017-08-23 11:49:06 +02:00
  • 9717298c58 Changed Query::m_...Bits from members to plain local variables Ivan Skytte Jørgensen 2017-08-22 15:51:32 +02:00
  • f642f81729 Removed unused Query::m_numRequired Ivan Skytte Jørgensen 2017-08-22 14:39:23 +02:00
  • dc8e25d097 Fix problem when all terms are high-freq-terms (and there are more than 2 of them) Ivan Skytte Jørgensen 2017-08-22 14:06:05 +02:00
  • 6628a4f7b4 Fix problem when all terms are high-freq-terms (and there are more than 2 of them) Ivan Skytte Jørgensen 2017-08-22 14:06:05 +02:00
  • 9fcf540d3f New logtrace for DocDelete Ai Lin Chia 2017-08-21 18:29:31 +02:00
  • e27f3330f9 Add log line when ignoring urlblocklist line Ai Lin Chia 2017-08-21 17:05:40 +02:00
  • 4ff90bedf2 Remove immediate flag for registerSleepCallback for UrlBlockList::reload and DnsBlockList::reload Ai Lin Chia 2017-08-21 11:20:56 +02:00
  • 4154a0e3c3 added option to dump unwanted documents (gb dump u) Brian Rasmusson 2017-08-21 09:54:10 +02:00
  • 9729c64ae7 Load blocklist immediately on init instead of waiting a minute Brian Rasmusson 2017-08-21 09:51:14 +02:00
  • d5b7891a12 Use strcmp instead of memcmp Ai Lin Chia 2017-08-18 15:50:59 +02:00
  • 220e250a53 Use strcmp instead of memcmp Ai Lin Chia 2017-08-18 15:50:59 +02:00
  • 3b24effa72 Temporarily skip spider req url that are different after stripping parameters Ai Lin Chia 2017-08-18 15:08:16 +02:00
  • c950451f7c Revert "We should stripped the url before inserting/comparing to spidered url cache" Ai Lin Chia 2017-08-18 14:54:20 +02:00
  • fefdd3395b Fix log statement (incorrect arguments) in SpiderLoop.cpp Ivan Skytte Jørgensen 2017-08-18 14:48:55 +02:00
  • 766bb53dc5 We should stripped the url before inserting/comparing to spidered url cache Ai Lin Chia 2017-08-18 14:06:14 +02:00
  • 304bdc4272 Temporarily skip spider req url that are different after stripping parameters Ai Lin Chia 2017-08-18 15:08:16 +02:00
  • e9ad22fc7a Revert "We should stripped the url before inserting/comparing to spidered url cache" Ai Lin Chia 2017-08-18 14:54:20 +02:00
  • e9cd1ae670 Fix log statement (incorrect arguments) in SpiderLoop.cpp Ivan Skytte Jørgensen 2017-08-18 14:48:55 +02:00
  • 4836fe48a0 We should stripped the url before inserting/comparing to spidered url cache Ai Lin Chia 2017-08-18 14:06:14 +02:00