Commit Graph

  • 4b7b40ac96 #include cleanup in HashTableX.h Ivan Skytte Jørgensen 2018-07-20 16:29:06 +02:00
  • 84e7a4153c Avoid using gbmemcpy() in HashTableX.h Ivan Skytte Jørgensen 2018-07-20 16:27:04 +02:00
  • 28228e2393 Removed SafeBuf::print() by moving 1-liner content to XmlDoc::printMetaList() where it was called Ivan Skytte Jørgensen 2018-07-20 16:18:09 +02:00
  • 2b802fc83d #include cleanup in linkspam.h Ivan Skytte Jørgensen 2018-07-20 16:12:23 +02:00
  • 17049b4276 #include cleanup in SiteGetter.h Ivan Skytte Jørgensen 2018-07-20 16:08:27 +02:00
  • 7d1fb1e992 #include cleanup in Rebalance.* Ivan Skytte Jørgensen 2018-07-20 16:05:57 +02:00
  • ef46e84e13 #include cleanup in Json.* Ivan Skytte Jørgensen 2018-07-20 16:03:46 +02:00
  • 115985f4b6 Renamed MemoryMappedFile.cc to MemoryMappedFile.cpp (for consistent extensions) Ivan Skytte Jørgensen 2018-07-20 15:59:06 +02:00
  • 46b720027a #include cleanup in Summary.h Ivan Skytte Jørgensen 2018-07-20 15:57:26 +02:00
  • 9233558fad Removed non-const version of Title::getTitle() Ivan Skytte Jørgensen 2018-07-20 15:55:29 +02:00
  • 44b12577e9 Merge branch 'master' into dev-urlmatchlist Ai Lin Chia 2018-07-17 17:06:14 +02:00
  • 4db48c3c76 wordvariations: avoid suffixes that are individual letters but multiple bytes in utf-8 Ivan Skytte Jørgensen 2018-07-17 15:29:51 +02:00
  • 8f94d483e0 support older version of unique_ptr<> Ivan Skytte Jørgensen 2018-07-17 14:32:41 +02:00
  • f64f44e9f7 Made lexicon_da.sto non-required Ivan Skytte Jørgensen 2018-07-17 14:07:14 +02:00
  • 68f6e60069 Only load sto lexicon once Ivan Skytte Jørgensen 2018-07-17 13:49:53 +02:00
  • 65636c5600 Merge branch 'master' into dev-urlmatchlist Ai Lin Chia 2018-07-17 13:08:26 +02:00
  • 373c575b6f Modifications to tld list Ai Lin Chia 2018-07-17 13:07:35 +02:00
  • 01466009d5 Merge branch 'master' into lemma Ivan Skytte Jørgensen 2018-07-16 12:36:37 +02:00
  • 18ab4f415c Default to match suffix for host Ai Lin Chia 2018-07-13 18:02:47 +02:00
  • 4eefda5576 use mmap'ping for page temperatures, and exploit that it is sorted Ivan Skytte Jørgensen 2018-07-13 13:41:45 +02:00
  • 3a7f156683 Assume docid2flagsandsitemap.dat is sorted Ivan Skytte Jørgensen 2018-07-12 17:05:28 +02:00
  • 6a9a988d7e Revert "Revert back to original TLD list" Ai Lin Chia 2018-07-12 15:31:52 +02:00
  • aad58132a4 Optimized page-temperature file loading Ivan Skytte Jørgensen 2018-07-12 14:58:24 +02:00
  • 40e154bb22 Merge branch 'master' into dev-urlmatchlist Ai Lin Chia 2018-07-12 13:41:50 +02:00
  • 9bdafc231e Add logic for matchpartial, matchsuffix, matchprefix to host, domain, path Ai Lin Chia 2018-07-12 13:40:43 +02:00
  • 2fb8ca2337 Add test for subdomain Ai Lin Chia 2018-07-11 22:30:14 +02:00
  • 4cc156c011 Add test for middomain Ai Lin Chia 2018-07-11 14:59:50 +02:00
  • 651e74fe08 More types/test to UrlMatchList Ai Lin Chia 2018-07-10 16:57:16 +02:00
  • 1842422b8c Fix split function Ai Lin Chia 2018-07-10 16:27:28 +02:00
  • e8a44b5dca bugfix query field detection (bug introduced in tokenizer branch) Ivan Skytte Jørgensen 2018-07-10 15:11:18 +02:00
  • 0061ddf52c language: handle query-language-server results for no/nn/no-bk correctly by taking the maximum Ivan Skytte Jørgensen 2018-07-09 17:10:12 +02:00
  • e730413c5e json 'languageWeights': list top-7 languages instead of top-5 Ivan Skytte Jørgensen 2018-07-09 16:44:51 +02:00
  • a0625a5eab summary: decode double-encoded html entities Ivan Skytte Jørgensen 2018-07-09 15:16:46 +02:00
  • 46899cd178 lemma: capitalized/uppercase for query lemma/syn generation Ivan Skytte Jørgensen 2018-07-09 14:38:14 +02:00
  • 616ce55370 lemma: lookup capitalized and uppercase forms too Ivan Skytte Jørgensen 2018-07-09 14:33:57 +02:00
  • 506335a402 Don't make lemmas for capitalized words that are already in baseform Ivan Skytte Jørgensen 2018-07-09 13:06:29 +02:00
  • 8a6efce462 fix searching for colon-separated words where none of the words are regular operations like url/site/inurl/ip/... Ivan Skytte Jørgensen 2018-07-06 15:27:26 +02:00
  • 75e5a9d719 Add to urlmatchlist as well Ai Lin Chia 2018-07-06 15:24:21 +02:00
  • 942971b5d2 Initial commit for multiple criteria url match list (not working yet) Ai Lin Chia 2018-07-06 14:23:43 +02:00
  • 3c5c58378d Split string by string delimiter Ai Lin Chia 2018-07-06 14:23:16 +02:00
  • 10665cc11b a bit of #include cleanup in XmlDoc.cpp Ivan Skytte Jørgensen 2018-07-06 14:20:10 +02:00
  • 7dc97ee4a0 lemma: generate lemmas for proper nouns too (normally genitive-case -> unmarked-case) Ivan Skytte Jørgensen 2018-07-06 14:06:28 +02:00
  • 777802de1d lemma: handle capitalized/uppercase words while processing query words Ivan Skytte Jørgensen 2018-07-06 13:49:16 +02:00
  • 477bee1fb1 lemma: handle capitalized/uppercase words while indexing Ivan Skytte Jørgensen 2018-07-06 13:45:19 +02:00
  • 8f0c558c25 Merge branch 'master' into lemma Ivan Skytte Jørgensen 2018-07-06 13:26:04 +02:00
  • 3f05dc5b28 Fix compilation of unit test Ai Lin Chia 2018-07-05 14:38:09 +02:00
  • f432af18ac #include cleanup in GbCache.h Ivan Skytte Jørgensen 2018-07-05 14:10:25 +02:00
  • 54a7ede0a3 Made code lookup up in clusterdb quick cache less clumsy Ivan Skytte Jørgensen 2018-07-05 14:02:41 +02:00
  • 1f6c6b5c78 Change memory balance in clusterdb quick cache so it is actually active Ivan Skytte Jørgensen 2018-07-05 14:01:52 +02:00
  • 95d3042e26 parms: better description for clusterdb quick cache Ivan Skytte Jørgensen 2018-07-05 13:53:56 +02:00
  • 727f4b0c35 Made size of s_clusterdbQuickCache configurable Ivan Skytte Jørgensen 2018-07-05 13:47:32 +02:00
  • e449522e05 Made cache statistics for s_clusterdbQuickCache visible in PageStats Ivan Skytte Jørgensen 2018-07-05 13:35:20 +02:00
  • 42ae36432e Remove commented out code Ai Lin Chia 2018-07-04 15:17:08 +02:00
  • 38a981fe47 Use whitelist instead of blacklist. We probably only want to allow a limited type of links when it's in link tag Ai Lin Chia 2018-07-04 13:22:55 +02:00
  • 1a10816623 Remove unused Msg25::m_adBanTable Ai Lin Chia 2018-07-04 11:48:24 +02:00
  • fecebb4def Filter out some unwanted outlinks (eg: rel=icon, rel=dns-prefetch) Ai Lin Chia 2018-07-04 11:35:17 +02:00
  • 1b82a3fcf8 Use const char, simplify code, remove commented out codes Ai Lin Chia 2018-07-04 11:32:32 +02:00
  • add0f8d700 bugfix clustering: it was not clustering docids < 2^29, and thus letting through more than 2 results per site in some cases Ivan Skytte Jørgensen 2018-07-03 16:36:05 +02:00
  • 8feac777e0 Removed forgotten tracelog in PosdbTable Ivan Skytte Jørgensen 2018-07-03 15:57:05 +02:00
  • 31b566b6b9 Made ClusterDb::get*() take properly typed key96_t* Ivan Skytte Jørgensen 2018-07-02 16:46:48 +02:00
  • a7bdaace66 Improved log in Msg51::gotClusterRec() Ivan Skytte Jørgensen 2018-07-02 16:12:50 +02:00
  • b25a4a5807 Improved log in Msg51::gotClusterRec() Ivan Skytte Jørgensen 2018-07-02 14:49:27 +02:00
  • 4247c4a1b8 use correct log prefix in Msg51 (was 'build' should be 'query') Ivan Skytte Jørgensen 2018-07-02 14:37:57 +02:00
  • a17b646f5f Merge branch 'master' of github.com:privacore/open-source-search-engine Ivan Skytte Jørgensen 2018-06-29 14:18:07 +02:00
  • dbdc26d966 Made Title.h self-sufficient (makes kdevelop happier) Ivan Skytte Jørgensen 2018-06-29 14:17:55 +02:00
  • 564bde6b56 Merge branch 'master' into lemma Ivan Skytte Jørgensen 2018-06-29 13:58:06 +02:00
  • 6546953115 bugfix hashLemmas() when wordpos exceeds limit Ivan Skytte Jørgensen 2018-06-29 13:55:37 +02:00
  • 243e3c1b2b Log in hex for sitehash instead of base 10 Ai Lin Chia 2018-06-29 13:48:33 +02:00
  • 7ce1816257 Bugfix Title::setTitle() Ivan Skytte Jørgensen 2018-06-28 17:02:12 +02:00
  • 0e8aadcf66 bugfix Title::setTitle(): check of http:// prefix properly in token array Ivan Skytte Jørgensen 2018-06-28 15:36:04 +02:00
  • 3b847c18aa Title.cpp: if 'http://' title candidates should be avoided then so should 'https://' Ivan Skytte Jørgensen 2018-06-28 14:24:34 +02:00
  • 6f353a7995 Removed weird test in Title.cpp Ivan Skytte Jørgensen 2018-06-28 14:20:52 +02:00
  • c894a1b037 Index lemmas only once per document Ivan Skytte Jørgensen 2018-06-25 16:10:24 +02:00
  • c39955b4a3 Merge branch 'master' into lemma Ivan Skytte Jørgensen 2018-06-25 15:36:39 +02:00
  • c89d59d8a0 Fix comment on HASHGROUP_INLIST+HASHGROUP_INLINKTEXT. Wrong comment position in commit 13fb07bde2 Ivan Skytte Jørgensen 2018-06-25 15:20:08 +02:00
  • 03406a8725 Merge branch 'master' into lemma Ivan Skytte Jørgensen 2018-06-25 14:22:41 +02:00
  • 44530a7ee5 bugfix middomain hashing: coredumped on hosts with more than 3 labels Ivan Skytte Jørgensen 2018-06-25 14:22:25 +02:00
  • b5bf73377a Eliminate more duplicated lemmas; better lemma-indexing-log Ivan Skytte Jørgensen 2018-06-25 13:48:12 +02:00
  • eac027eae6 Fix merge error (d2e31e96ba) in lemmatizer branch - lemmas were not indexed Ivan Skytte Jørgensen 2018-06-25 12:46:08 +02:00
  • 04de89ae17 More detailed trace log for lemma indexing Ivan Skytte Jørgensen 2018-06-22 16:51:41 +02:00
  • b56095b653 bugfix sto::Lexicon::lookup(). Would find non-matching entries Ivan Skytte Jørgensen 2018-06-22 16:46:57 +02:00
  • 9e10c173a5 Merge branch 'master' into lemma Ivan Skytte Jørgensen 2018-06-22 16:01:03 +02:00
  • 7de71ae59e Support middomain hashgroup Ivan Skytte Jørgensen 2018-06-22 14:49:35 +02:00
  • afb97e7e7b bugfix Posdb::getHashGroup to use HASHGROUP_END instead of a hardcoded 10 Ivan Skytte Jørgensen 2018-06-22 14:33:41 +02:00
  • c9a062e262 More explicit comment in slightly confusing indexing code Ivan Skytte Jørgensen 2018-06-22 13:51:12 +02:00
  • dcecaa71bd debuglog for m_hashGroupWeightExplicitKeywords Ivan Skytte Jørgensen 2018-06-22 13:31:01 +02:00
  • d930c6ac65 Merge branch 'master' into lemma Ivan Skytte Jørgensen 2018-06-22 13:05:59 +02:00
  • fd67efe213 Missed in previous commit Ai Lin Chia 2018-06-21 11:08:53 +02:00
  • c930fc9afb Fix bug where wrong function name is used to overload parent function Ai Lin Chia 2018-06-21 11:02:59 +02:00
  • 4b17111b07 Don't return default key size for invalid rdbid Ai Lin Chia 2018-06-21 10:43:21 +02:00
  • ba48322847 Fix bug where datasize is stored even when it's fixed sized Ai Lin Chia 2018-06-20 23:55:30 +02:00
  • 91335e875d Merge branch 'master' into dev-sitetemp Ai Lin Chia 2018-06-20 17:50:24 +02:00
  • 859e093102 Don't generate site default page temperature 'rdb' entry when it's for delete Ai Lin Chia 2018-06-20 17:39:13 +02:00
  • e6b4c455fe We shouldn't check site median page temperature while rebuilding Ai Lin Chia 2018-06-20 17:38:34 +02:00
  • 2659c4e49d added extra check for attempted use of uninitialized TLD table Brian Rasmusson 2018-06-20 12:27:36 +02:00
  • b754b5b960 Bugfix: Make sure to always initialize TLD list before calling any URL handling function Brian Rasmusson 2018-06-20 12:19:58 +02:00
  • 867cd5e559 shut down if we cannot parse a domain on a blocklist Brian Rasmusson 2018-06-20 12:19:33 +02:00
  • 8666888749 wordvariations: handle compound words better Ivan Skytte Jørgensen 2018-06-19 16:40:49 +02:00
  • 390b8d6639 Fix respider url Ai Lin Chia 2018-06-19 15:10:27 +02:00
  • a5da0fbd95 Add trace log for Docid2FlagsAndSiteMap & some more comments Ai Lin Chia 2018-06-19 15:10:03 +02:00