Commit Graph

  • f20c2d1d85 Log timing if tokenization-phase-2 took more than 15ms Ivan Skytte Jørgensen 2018-03-23 17:27:52 +01:00
  • a5b6771ac8 query: less assumptions about tokens being contiguous in memory Ivan Skytte Jørgensen 2018-03-23 16:15:00 +01:00
  • 8f9708d038 tokenizer: added some limited support for Norwegian Ivan Skytte Jørgensen 2018-03-23 15:58:36 +01:00
  • 10e0980440 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-03-23 14:42:02 +01:00
  • ef62bd490a No debuglog in SafeBuf::reserve() - it just adds noise in the log Ivan Skytte Jørgensen 2018-03-23 14:35:58 +01:00
  • 79ecc2594f bugfix debug log in safebuf: use correct prefix Ivan Skytte Jørgensen 2018-03-22 18:31:44 +01:00
  • 8efdca607d Fix merge error in Sections.cpp Ivan Skytte Jørgensen 2018-03-22 17:36:21 +01:00
  • 7981ba23af Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-03-22 17:25:47 +01:00
  • a6100282c6 Fix non-ending iframes (incomplete fix) Ivan Skytte Jørgensen 2018-03-22 17:19:06 +01:00
  • 305b4bbc9e tokenizer: bugfix wiktionary-synonyms Ivan Skytte Jørgensen 2018-03-20 15:28:38 +01:00
  • d0e37e46c4 tokenizer: valgrinded Ivan Skytte Jørgensen 2018-03-20 14:46:21 +01:00
  • 126b9b8d87 tokenizer: keep track of whether a token is from phase 1 or phase 2 Ivan Skytte Jørgensen 2018-03-20 14:37:56 +01:00
  • 746e4ec9cf tokenizer: makefile dependencies Ivan Skytte Jørgensen 2018-03-20 14:37:00 +01:00
  • 9373a3485c Load unicode_is_alphabetic.dat (copy-pasta error) Ivan Skytte Jørgensen 2018-03-20 14:32:01 +01:00
  • 505440b8ee gitignore: Ignore libtokenizer.a Ivan Skytte Jørgensen 2018-03-20 14:20:08 +01:00
  • 9f93708519 Added missing ucdata/ files and updated .gitignore to not ignore ucdata/*.dat Ivan Skytte Jørgensen 2018-03-20 14:19:25 +01:00
  • 1e61245d23 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-03-20 14:09:39 +01:00
  • 5ba94f28e8 tokenizer: fix clean target Ivan Skytte Jørgensen 2018-03-20 14:06:30 +01:00
  • 4c999181ca Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-03-20 13:52:56 +01:00
  • 814d820f3b Merge branch 'tokenizer' of github.com:privacore/open-source-search-engine into tokenizer Ivan Skytte Jørgensen 2018-03-20 13:52:44 +01:00
  • 4005309220 tokenizer: recognize C++ and other oddballs Ivan Skytte Jørgensen 2018-03-20 00:56:37 +01:00
  • 7885afd127 Use utf8 copyright symbol in source code instead of hardcoded utf8 byte sequence Ivan Skytte Jørgensen 2018-03-19 17:42:06 +01:00
  • b94a8431e1 Merge branch 'master' of github.com:privacore/open-source-search-engine Ivan Skytte Jørgensen 2018-03-19 17:36:53 +01:00
  • 0a6c6199db Use ✓ html entity instead of hardcoded utf-8 bytes Ivan Skytte Jørgensen 2018-03-19 17:36:46 +01:00
  • 28fce75432 Removed obsolete commented-out code in hash.h Ivan Skytte Jørgensen 2018-03-19 17:30:51 +01:00
  • 9f9c9a4be0 Add hint to unittest about hash calculation in tokenizer Ivan Skytte Jørgensen 2018-03-19 17:26:36 +01:00
  • d6b7ff4b4d Minor change in Sections so unittests run unchanged Ivan Skytte Jørgensen 2018-03-19 17:25:36 +01:00
  • 1959f36105 made summary test less fragile Ivan Skytte Jørgensen 2018-03-19 16:43:29 +01:00
  • 91c6919c02 Made unittest compile&link again Ivan Skytte Jørgensen 2018-03-19 16:03:06 +01:00
  • a009153233 Fixed signed/unsigned comparison resulting from tokenizer knowing there cannot be a negative number of tokens Ivan Skytte Jørgensen 2018-03-19 15:33:26 +01:00
  • 392ece1033 Fixed soft-hyphen unittest in Xml tokenizer Ivan Skytte Jørgensen 2018-03-19 15:32:30 +01:00
  • 3523bfe2be Remove tokens duplicated by phase-2 tokenizer Ivan Skytte Jørgensen 2018-03-19 14:50:37 +01:00
  • a1a89b271d Merge branch 'tokenizer' of github.com:privacore/open-source-search-engine into tokenizer Ivan Skytte Jørgensen 2018-03-19 12:40:20 +01:00
  • 5ed14bcaa5 phase-2 tokens don't have to be NUL terminated Ivan Skytte Jørgensen 2018-03-19 12:40:17 +01:00
  • a20900fa14 Merge branch 'master' into dev-language Ai Lin Chia 2018-03-19 10:16:09 +01:00
  • 8eb70cb98e Use key instead of SpiderReq data to get/store firstIp Ai Lin Chia 2018-03-16 10:07:22 +01:00
  • 8ac4f23b71 Fix bug where data size is not set hence we're not able to start rebuilding waiting tree correctly Ai Lin Chia 2018-03-15 17:35:27 +01:00
  • 115b9b08f2 Run SpiderdbRdbSqliteBridge::getList in a thread Ai Lin Chia 2018-03-15 16:49:01 +01:00
  • 6fd9ba02e4 Run SpiderdbRdbSqliteBridge::getFirstIps in a thread Ai Lin Chia 2018-03-15 16:25:59 +01:00
  • 626d4afa08 Fix incorrect use of cp-1257 with utf8 content Ivan Skytte Jørgensen 2018-03-15 15:34:52 +01:00
  • b95017ef4d Simplified logic in getDensityRanks() a bit Ivan Skytte Jørgensen 2018-03-12 17:54:49 +01:00
  • 1f774c3891 Removed obsolete, incorrect or unuseful comments from Phrases.h Ivan Skytte Jørgensen 2018-03-12 15:33:35 +01:00
  • bde0320848 Changed Phrases::getPhraseIds2() to a simpler and looser-coupled getPhraseId() Ivan Skytte Jørgensen 2018-03-12 15:29:38 +01:00
  • 0471bafb61 Removed unused XmlDoc::m_unchanged Ivan Skytte Jørgensen 2018-03-12 13:57:22 +01:00
  • 47233aa140 Removed unused XmlDoc::m_filteredContentMaxSize Ivan Skytte Jørgensen 2018-03-12 13:52:35 +01:00
  • bc84f22565 Removed unused XmlDoc::m_synBuf Ivan Skytte Jørgensen 2018-03-12 13:16:54 +01:00
  • 8d94c87521 Title: flags&0x01 was not used. Ivan Skytte Jørgensen 2018-03-09 14:58:21 +01:00
  • fb899383b9 Use StackBuf<> in Title.cpp Ivan Skytte Jørgensen 2018-03-09 14:50:07 +01:00
  • 44155164f5 More trace info in Phrases.cpp Ivan Skytte Jørgensen 2018-03-09 13:53:34 +01:00
  • 07559c2c7a Fix clang++ warning: ISO C++11 does not allow conversion from string literal to 'char *' Ai Lin Chia 2018-03-09 10:30:42 +01:00
  • ec70308ba5 Recognize German telephone numbers Ivan Skytte Jørgensen 2018-03-18 19:18:11 +01:00
  • a5d3db42a9 Added stub+comment for German ligatures (which there aren't) Ivan Skytte Jørgensen 2018-03-18 00:47:19 +01:00
  • c654f55429 Merge branch 'tokenizer' of github.com:privacore/open-source-search-engine into tokenizer Ivan Skytte Jørgensen 2018-03-18 00:03:19 +01:00
  • cfafbc303e Bugfix bigram generation when tokenizer phase took several subtokens and combined them (eg phone number) Ivan Skytte Jørgensen 2018-03-16 18:04:22 +01:00
  • 48de9b4453 Removed forgotten debug-print Ivan Skytte Jørgensen 2018-03-16 17:39:44 +01:00
  • 26f0edb1e7 Handle œ/æ as ligatures in French and English Ivan Skytte Jørgensen 2018-03-16 17:33:15 +01:00
  • 086872b6da Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-03-16 16:29:37 +01:00
  • f0c53d975f Don't generate bigrams with non-alfanum tokens Ivan Skytte Jørgensen 2018-03-16 16:29:09 +01:00
  • b7ebfc641b bugfix term-debug info for bigrams Ivan Skytte Jørgensen 2018-03-16 16:23:26 +01:00
  • f1e13e2d88 constness in XmlDoc::hashString_ct() Ivan Skytte Jørgensen 2018-03-16 16:08:50 +01:00
  • 6ac786b525 Fix buffer overrun in XmlDoc::getCountTable() Ivan Skytte Jørgensen 2018-03-16 16:04:59 +01:00
  • 848e8fa54d Use ellipses in Section::print() Ivan Skytte Jørgensen 2018-03-16 15:53:30 +01:00
  • e5f3f80f35 Investigated TODO (which was a false alarm) Ivan Skytte Jørgensen 2018-03-16 15:47:10 +01:00
  • 9b5513824c Clarified todo-comment Ivan Skytte Jørgensen 2018-03-16 15:38:07 +01:00
  • f106472637 Added Conf::m_logTraceTokenIndexing and used it in XmlDoc::hashWords3() Ivan Skytte Jørgensen 2018-03-16 15:35:41 +01:00
  • d67291291e Cleaned up some TODOs Ivan Skytte Jørgensen 2018-03-16 15:28:30 +01:00
  • f00322e7a2 Rewrite ampersands to the language's 'and' word Ivan Skytte Jørgensen 2018-03-16 15:11:04 +01:00
  • 21409fc904 Updated tokenizer unittest to reflect possessive-s change Ivan Skytte Jørgensen 2018-03-16 14:52:36 +01:00
  • 6a1e5169d5 tokenizer: possessive-s and hyphens verified Ivan Skytte Jørgensen 2018-03-16 14:48:20 +01:00
  • d5aafde56c Added todo to tokenizer phase 2 Ivan Skytte Jørgensen 2018-03-16 13:37:45 +01:00
  • 1ce9ae75bf Use key instead of SpiderReq data to get/store firstIp Ai Lin Chia 2018-03-16 10:07:22 +01:00
  • 8efc115f53 Fix comment Ivan Skytte Jørgensen 2018-03-15 23:49:12 +01:00
  • 62a64aee36 Fix bug where data size is not set hence we're not able to start rebuilding waiting tree correctly Ai Lin Chia 2018-03-15 17:35:27 +01:00
  • dfe96a8afe tokenizer: bigram and title fixes Ivan Skytte Jørgensen 2018-03-15 17:02:34 +01:00
  • 77526cdc18 Run SpiderdbRdbSqliteBridge::getList in a thread Ai Lin Chia 2018-03-15 16:49:01 +01:00
  • deabd17354 Run SpiderdbRdbSqliteBridge::getFirstIps in a thread Ai Lin Chia 2018-03-15 16:25:59 +01:00
  • b615de9b8e Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-03-15 15:36:29 +01:00
  • 0f18fb4332 Merge branch 'tokenizer' of github.com:privacore/open-source-search-engine into tokenizer Ivan Skytte Jørgensen 2018-03-15 15:36:26 +01:00
  • f7a8c424cc Fix incorrect use of cp-1257 with utf8 content Ivan Skytte Jørgensen 2018-03-15 15:34:52 +01:00
  • a1eea75349 Better comments in tokenizer2.cpp Ivan Skytte Jørgensen 2018-03-15 00:25:03 +01:00
  • cfda843e47 tokenizer: phase2: detect Norwegian phone numbers Ivan Skytte Jørgensen 2018-03-15 00:13:09 +01:00
  • 42e05ba7db load unicode maps - it makes the unittests work much better Ivan Skytte Jørgensen 2018-03-13 18:42:28 +01:00
  • 57588d97aa tokenizer: more diacritics unittests Ivan Skytte Jørgensen 2018-03-13 18:10:40 +01:00
  • ce26026076 tokenizer: use phase-2 tokens Ivan Skytte Jørgensen 2018-03-13 17:12:42 +01:00
  • 467e74cbb0 tokenizer unittest: explicit (but fragile) test on soft hyphens Ivan Skytte Jørgensen 2018-03-13 17:07:59 +01:00
  • 45e9a2019c More test in xml-tokenizer Ivan Skytte Jørgensen 2018-03-13 17:07:20 +01:00
  • d19cea420b debug log while testing out new tokenizer bigrams Ivan Skytte Jørgensen 2018-03-13 16:11:34 +01:00
  • b61bf2c48d Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-03-12 17:55:47 +01:00
  • c1c4262590 Simplified logic in getDensityRanks() a bit Ivan Skytte Jørgensen 2018-03-12 17:54:49 +01:00
  • df235ef70d Simplified Phrases class a bit Ivan Skytte Jørgensen 2018-03-12 16:18:17 +01:00
  • 1dd5130905 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-03-12 15:38:45 +01:00
  • 3b9a89c615 Removed obsolete, incorrect or unuseful comments from Phrases.h Ivan Skytte Jørgensen 2018-03-12 15:33:35 +01:00
  • e2448c4c8b Changed Phrases::getPhraseIds2() to a simpler and looser-coupled getPhraseId() Ivan Skytte Jørgensen 2018-03-12 15:29:38 +01:00
  • 7a67b6855f Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-03-12 13:58:10 +01:00
  • 61c8aad50b Removed unused XmlDoc::m_unchanged Ivan Skytte Jørgensen 2018-03-12 13:57:22 +01:00
  • d3d5b5ca14 Removed unused XmlDoc::m_filteredContentMaxSize Ivan Skytte Jørgensen 2018-03-12 13:52:35 +01:00
  • a7ec48da44 Removed unused XmlDoc::m_synBuf Ivan Skytte Jørgensen 2018-03-12 13:16:54 +01:00
  • eb54b401be Don't re-tokenize <title> content for indexing. Reuse existing tokenization results Ivan Skytte Jørgensen 2018-03-09 16:49:46 +01:00
  • be58489e83 Removed inclusion of Words.h Ivan Skytte Jørgensen 2018-03-09 16:40:47 +01:00
  • 036b76c64d Removed Words.* Ivan Skytte Jørgensen 2018-03-09 16:36:21 +01:00