Commit Graph

  • 74d1a4b457 Made QueryTerm::m_term const Ivan Skytte Jørgensen 2017-11-23 13:59:06 +01:00
  • 646a2e792e Changed SafeBuf::jsonEncode() to no longer modify the source string for temporary NUL termination Ivan Skytte Jørgensen 2017-11-23 13:56:48 +01:00
  • e3021a1c9c more constness (and removal of unhelpful comments and commented-out code) Ivan Skytte Jørgensen 2017-11-23 13:37:27 +01:00
  • cad25a18e2 Dropped unused member QueryTerm::m_numAlnumWordsInBase Ivan Skytte Jørgensen 2017-11-23 12:41:19 +01:00
  • 2ce7d59931 Revert "Run test sequentially" Ai Lin Chia 2017-11-22 11:45:23 +01:00
  • 1abbd86fd3 Merge branch 'dev-robotstxt' Ai Lin Chia 2017-11-22 10:49:33 +01:00
  • b6d69fc3ab Fix error description slightly since we're using the same error code for disallowed by robots.txt & meta tag Ai Lin Chia 2017-11-22 10:35:32 +01:00
  • 2436f36ce9 Check meta robots noindex Ai Lin Chia 2017-11-22 10:17:03 +01:00
  • b6413659cf Merge branch 'master' into sto Ivan Skytte Jørgensen 2017-11-21 16:50:21 +01:00
  • 56694dddc4 cleanup/simplify msg39 term debug log Ivan Skytte Jørgensen 2017-11-21 16:50:11 +01:00
  • 5c41095b4f Merge branch 'master' into sto Ivan Skytte Jørgensen 2017-11-21 16:18:54 +01:00
  • 128790084b Use lang_t enum more than just plain uint8_t Ivan Skytte Jørgensen 2017-11-21 16:18:45 +01:00
  • c51437e3ee Merge branch 'master' into sto Ivan Skytte Jørgensen 2017-11-21 15:57:38 +01:00
  • ba5c30d7e1 main.cpp: split out printHelp() to separate function Ivan Skytte Jørgensen 2017-11-21 15:48:28 +01:00
  • 0ea0d7b4d4 Better clean target in sto Ivan Skytte Jørgensen 2017-11-21 13:09:35 +01:00
  • c908c61884 Added STO subdir with tools and classes Ivan Skytte Jørgensen 2017-11-21 13:08:12 +01:00
  • 7d99d0f347 Parse more meta robots keyword. Change getIsNoArchive to use new implementation Ai Lin Chia 2017-11-20 22:58:19 +01:00
  • 3058981dae Make sure EDOCBADCONTENTTYPE doesn't return as EDOCTOOBIG Ai Lin Chia 2017-11-20 15:22:47 +01:00
  • fbda6fffd9 Check for blocked content type in dump unwanted Ai Lin Chia 2017-11-20 14:23:16 +01:00
  • e7dc626c0c Check download status for robots.txt and set error to EDOCDISALLOWEDERROR when there is a download error Ai Lin Chia 2017-11-20 14:04:29 +01:00
  • c3fe374641 Initial bug fix of too big non html doc Ai Lin Chia 2017-11-19 22:32:19 +01:00
  • 563670d76f Remove commented out code Ai Lin Chia 2017-11-19 16:09:40 +01:00
  • 867db37d46 Add more HttpMime unit test Ai Lin Chia 2017-11-19 10:54:28 +01:00
  • b179b8ddac Strip ending whitespace Ai Lin Chia 2017-11-19 10:51:14 +01:00
  • 74d0bb29cf Remove getContentTypePrivate and move logic into parseContentType function Ai Lin Chia 2017-11-19 10:24:23 +01:00
  • adabb04bd6 Move input file to error when other conversion error occurs as well Ai Lin Chia 2017-11-17 15:40:33 +01:00
  • 4bd6860e31 xmldoc filter (pdf): no error handling from rename() in error-handling. Is fine Ivan Skytte Jørgensen 2017-11-17 14:02:13 +01:00
  • e035d6a872 Fix bug in writing contenttypeallowed.txt Ai Lin Chia 2017-11-17 11:09:47 +01:00
  • 36dc58cb86 Remove commented out code Ai Lin Chia 2017-11-16 17:19:43 +01:00
  • 4b96a8fefe Rename conversion of input file with error to in.error.${thread_id} Ai Lin Chia 2017-11-16 14:07:08 +01:00
  • a9cd5aa048 Convert to lower case before adding to allowed list Ai Lin Chia 2017-11-15 17:34:33 +01:00
  • e870c31deb pass any url to 'gb dump rtc' and it will form the correct robots.txt url for you Brian Rasmusson 2017-11-15 17:08:12 +01:00
  • 43943035ce fix for getting robots.txt shard num in PageSpiderdbLookup.cpp Brian Rasmusson 2017-11-15 16:41:41 +01:00
  • c89bb6f9ad Use user-agent instead of bot name Ai Lin Chia 2017-11-15 16:17:55 +01:00
  • 112c91c7b8 added info about robots.txt host in PageSpiderdbLookup.cpp Brian Rasmusson 2017-11-15 15:20:30 +01:00
  • 9a3f87b56e Add ContentTypeBlockList to block by http content-type Ai Lin Chia 2017-11-15 12:36:00 +01:00
  • 09bf57e147 Add host to dsh & dsh2 command Ai Lin Chia 2017-11-14 11:43:56 +01:00
  • 558c408d89 added option to dump robots.txt.cache content for a url Brian Rasmusson 2017-11-14 21:08:37 +01:00
  • f177a9e04a Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-11-14 15:19:47 +01:00
  • d36a6a4e39 disable site clustering when doing domain-like searches (configurable) Ivan Skytte Jørgensen 2017-11-14 14:50:30 +01:00
  • 710142bde4 #include cleanup in Pages.h Ivan Skytte Jørgensen 2017-11-14 14:23:54 +01:00
  • 3e9419dc2e #include cleanup in PageResults.h and PageRoot.h Ivan Skytte Jørgensen 2017-11-14 14:11:36 +01:00
  • 9ed523f04f More const in PageResult.cpp and PageRoot.cpp Ivan Skytte Jørgensen 2017-11-14 14:06:00 +01:00
  • 3626dc6e89 Bugfix domain-like searches Ivan Skytte Jørgensen 2017-11-14 13:45:46 +01:00
  • 3ca5da2ca9 Factored out log-qterms-to-log so we can also trace them after query::modifyQuery() Ivan Skytte Jørgensen 2017-11-14 12:52:41 +01:00
  • b7ca158b9a Add host to dsh & dsh2 command Ai Lin Chia 2017-11-14 11:43:56 +01:00
  • d12ae26586 Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-11-13 16:42:46 +01:00
  • 16db6b56a5 Merge branch 'master' of github.com:privacore/open-source-search-engine Ivan Skytte Jørgensen 2017-11-13 16:19:34 +01:00
  • 88d2f176ef Bugfix domain-like search modification Ivan Skytte Jørgensen 2017-11-13 16:18:23 +01:00
  • 4f9cb72732 More comments Ivan Skytte Jørgensen 2017-11-13 15:46:36 +01:00
  • 47f97ff214 Removed always-0 log element in Msg39 Ivan Skytte Jørgensen 2017-11-13 15:00:23 +01:00
  • 3f755a088a Don't overflow dns wildcard check Ai Lin Chia 2017-11-13 13:21:59 +01:00
  • da2e53078e typo fix in comment Ivan Skytte Jørgensen 2017-11-13 12:30:53 +01:00
  • 74dfc5f9f0 Removed some of the co-branding parameters Ivan Skytte Jørgensen 2017-11-10 17:06:22 +01:00
  • 07edeea716 Revert "Added temporary debugging aid for finding out why msg4in-thread sometimes takes a long time" Ivan Skytte Jørgensen 2017-11-10 16:08:53 +01:00
  • 12441e8dcf Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-11-10 16:05:40 +01:00
  • 45aad17d8e indexing: handle URL components better Ivan Skytte Jørgensen 2017-11-10 15:57:23 +01:00
  • 92ca1b77d3 Add tools/dump_unwanted Ai Lin Chia 2017-11-10 15:03:54 +01:00
  • 569e6f0ada Add unwanted ext to unwanted url check Ai Lin Chia 2017-11-10 15:01:50 +01:00
  • df9d840afc Improve domain-like search detection Ivan Skytte Jørgensen 2017-11-10 13:53:22 +01:00
  • 89840c3126 Make sqlite synchronous level configurable (defaulting to 1) Ivan Skytte Jørgensen 2017-11-09 17:16:46 +01:00
  • e4c6595ab8 Add code to run through external filter when the html document have low word count (configurable) Ai Lin Chia 2017-11-09 17:14:17 +01:00
  • dbcf668ccc Add EDOCCONVERTFAILED for conversion error Ai Lin Chia 2017-11-08 15:36:22 +01:00
  • 4f94cd5609 Code style changes. Replace assignment of boolean to 1 with true Ai Lin Chia 2017-11-08 15:16:09 +01:00
  • 1931bca0c7 Remove commented out code Ai Lin Chia 2017-11-07 16:52:39 +01:00
  • 8d92d38f92 Check style for non-visible iframe Ai Lin Chia 2017-11-07 16:45:16 +01:00
  • 08b1b24546 Use g_conf.m_logTraceXmlDoc instead of g_conf.m_logTraceSpider Ai Lin Chia 2017-11-07 16:20:34 +01:00
  • f8fce43216 Use NOINDEXFLAGS as badFlags in Summary Ai Lin Chia 2017-11-07 16:19:33 +01:00
  • da064f6e45 Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-11-09 16:44:58 +01:00
  • 2ddbe2cf38 Query: domain-like searches: ignore last term Ivan Skytte Jørgensen 2017-11-09 16:28:14 +01:00
  • d0640c1094 Query::modifyQuery(): domain detection: check TLD list Ivan Skytte Jørgensen 2017-11-09 16:16:03 +01:00
  • d04f4d65f3 Added temporary debugging aid for finding out why msg4in-thread sometimes takes a long time Ivan Skytte Jørgensen 2017-11-09 14:05:56 +01:00
  • 97ba7bc72d renamed new term checklist classes and added option to check for potential spam docs Brian Rasmusson 2017-11-08 14:14:24 +01:00
  • e5616cbbb2 Run test sequentially Ai Lin Chia 2017-11-07 15:50:34 +01:00
  • 891e774b22 sqlite: log record count in addrecords() timing log Ivan Skytte Jørgensen 2017-11-07 14:11:24 +01:00
  • e325366ca2 Merge branch 'sqlite' of github.com:privacore/open-source-search-engine into sqlite Ivan Skytte Jørgensen 2017-11-07 13:45:46 +01:00
  • 2b476deeef Added timing log to sqlite operations Ivan Skytte Jørgensen 2017-11-07 13:45:37 +01:00
  • c8d166f2f8 Check for javascript tags Ai Lin Chia 2017-11-07 11:56:05 +01:00
  • 326f25cbb4 Don't dump redirection pages/non txt/html pages Ai Lin Chia 2017-11-07 11:46:24 +01:00
  • ebd242a8c1 Merge branch 'master' into sqlite Ai Lin Chia 2017-11-06 16:25:51 +01:00
  • f0fe189340 Fix segfault due to storing m_indexCode in TitleRec causing wrong logic to be run when getting inlink text for Msg20 Ai Lin Chia 2017-11-06 16:24:48 +01:00
  • a2972c763c More trace log in SpiderdbRdbSqliteBridge Ivan Skytte Jørgensen 2017-11-06 14:07:29 +01:00
  • 1c370bf55b Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-11-06 13:35:16 +01:00
  • ad7c18e72c cleandb target: remove rebalance.txt too Ivan Skytte Jørgensen 2017-11-06 13:34:51 +01:00
  • ab50b1cddf Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-11-06 13:29:47 +01:00
  • 1b40342dd1 We should use slot which is passed into Msg51::gotClusterRec instead of always using the first slot Ai Lin Chia 2017-11-06 12:27:07 +01:00
  • de35c3a7e9 Initialize m_loading Ai Lin Chia 2017-11-03 23:22:50 +01:00
  • 11c328ebb1 Check content when dumping unwanted as well Ai Lin Chia 2017-11-03 18:00:20 +01:00
  • c2747a07c7 Make sure non modified files are added to the list to load as well Ai Lin Chia 2017-11-03 16:34:47 +01:00
  • 6f2f1ff242 Merge branch 'master' into sqlite Ivan Skytte Jørgensen 2017-11-03 16:22:11 +01:00
  • 9685695002 Fix bug where only modified urlmatchlist will be reloaded, causing content of the non-modified file not to be in the list Ai Lin Chia 2017-11-03 16:20:24 +01:00
  • a0d2ee230a Encapsulate Msg0 better Ivan Skytte Jørgensen 2017-11-03 15:30:32 +01:00
  • 809d7c9b0d Removed the msg5* parameter to Msg0::getList() Ivan Skytte Jørgensen 2017-11-03 15:13:08 +01:00
  • 5d7e3450da Added ScopedSqlitedbLock Ivan Skytte Jørgensen 2017-11-03 13:58:42 +01:00
  • 4d7f2494c1 catch more errors in sqlite-bridge Ivan Skytte Jørgensen 2017-11-03 13:37:52 +01:00
  • c3e7d050a3 Handle errors in spiderdbsqlite conversion Ivan Skytte Jørgensen 2017-11-03 12:37:21 +01:00
  • 9be65ab54e Fix g_hostdb.init call Ai Lin Chia 2017-11-03 10:34:12 +01:00
  • 5795c94fc0 Add dump_wordcount tool Ai Lin Chia 2017-11-03 10:26:28 +01:00
  • 7508e500e8 Fix g_hostdb.init calls Ai Lin Chia 2017-11-02 17:29:50 +01:00
  • 79231c121f Let's try to shutdown always (even if it's not a failure) Ai Lin Chia 2017-11-02 17:10:01 +01:00