Commit Graph

  • 2a7515812b tokenizer: improve code comments Ivan Skytte Jørgensen 2018-04-12 14:41:15 +02:00
  • 606a89d852 Fix compilation error Ai Lin Chia 2018-04-12 11:45:25 +02:00
  • cb4cc51992 Add utf8 decoded as csWindows1257 Ai Lin Chia 2018-04-12 11:44:01 +02:00
  • c381e7ce0d tokenizer: recognize Swedish telephone numbers Ivan Skytte Jørgensen 2018-04-12 01:58:52 +02:00
  • cf7d301969 tokenizer: do ampersand rewrite in Swedish Ivan Skytte Jørgensen 2018-04-11 22:14:15 +02:00
  • 133e55de56 tokenizer: fix diacritics in Swedish Ivan Skytte Jørgensen 2018-04-11 22:09:56 +02:00
  • 774d7d06d9 Add German to dump wrong encoding Ai Lin Chia 2018-04-11 15:32:21 +02:00
  • 6eeeebe9b6 Add Swedish to dump wrong encoding Ai Lin Chia 2018-04-11 15:24:24 +02:00
  • 05721bcb9f Fix dump_wrong_encoding tool Ai Lin Chia 2018-04-11 15:14:21 +02:00
  • e2e99f7be4 Add new tool to dump wrongly detected encoding Ai Lin Chia 2018-04-11 14:51:40 +02:00
  • 4c73bd6e64 Fix compilation error in tools Ai Lin Chia 2018-04-11 13:13:25 +02:00
  • 4dec4689f2 Assume that first url is not canonical when doing a docprocess with docid Ai Lin Chia 2018-04-11 12:26:31 +02:00
  • 3c7637232d Revert "Make sure we clear canonical url when we redirect as well" Ai Lin Chia 2018-04-11 12:21:51 +02:00
  • bd803d3479 tokenizer: Handle combining marks for German Ivan Skytte Jørgensen 2018-04-10 15:38:24 +02:00
  • f75ad48446 Make sure we clear canonical url when we redirect as well Ai Lin Chia 2018-04-10 15:27:38 +02:00
  • 5c7b978f0a Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-04-10 12:59:49 +02:00
  • a60ee4b038 Html tag attributes are case insensitive Ai Lin Chia 2018-04-10 12:34:49 +02:00
  • 213bd036c0 Use currentUrl instead of firstUrl Ai Lin Chia 2018-04-09 14:43:12 +02:00
  • 0543d1a7b7 Use separate is_reliable variable, log encoding name instead of int Ai Lin Chia 2018-04-09 09:36:18 +02:00
  • 7d358bd066 Don't ignore summary when summary starts with title (only ignore it when it's exactly the same) Ai Lin Chia 2018-04-06 23:20:22 +02:00
  • cef3771f1a Add summary generation from meta property='description' as well Ai Lin Chia 2018-04-06 23:13:56 +02:00
  • c2d8c8c4d1 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-04-06 13:25:34 +02:00
  • 4c1b85e7c0 Silence coverity about non-happening buffer overflow Ivan Skytte Jørgensen 2018-04-06 13:24:54 +02:00
  • 9634a38605 bugfix SiteMedianPageTemperatureRegistry::add(), incorrect condition in if / wrong use of iterator Ivan Skytte Jørgensen 2018-04-06 13:22:01 +02:00
  • 4e0d920b1d gitignore unicode/*.dat Ivan Skytte Jørgensen 2018-04-06 13:19:30 +02:00
  • 6004f6650f Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-04-06 13:18:48 +02:00
  • 9ad6e27e2b Simplified error checking in Msg4In::addMetaList() Ivan Skytte Jørgensen 2018-04-06 13:16:47 +02:00
  • a0b6d6a30a Removed logically dead code Ivan Skytte Jørgensen 2018-04-06 13:15:20 +02:00
  • 16998d7045 Removed logically dead code Ivan Skytte Jørgensen 2018-04-06 13:11:53 +02:00
  • d7d7504860 Fix coverity warning of array compared against 0 Ai Lin Chia 2018-04-06 09:09:56 +02:00
  • 78391f8235 Fix coverity warning of uninitialized scalar field Ai Lin Chia 2018-04-06 09:07:14 +02:00
  • 3a834862ba Merge branch 'master' into dev-language Ai Lin Chia 2018-04-05 17:19:15 +02:00
  • 9e8cc43a25 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-04-05 15:59:22 +02:00
  • 0f0ee77584 Removed effectively unused local variable in Title::setTitle() Ivan Skytte Jørgensen 2018-04-05 15:51:07 +02:00
  • 56de787bad Title.cpp: use enum instead of #defines Ivan Skytte Jørgensen 2018-04-05 15:36:49 +02:00
  • 4ec514a9c9 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-04-05 14:59:17 +02:00
  • 8e15884b8c fix Makefile so libunicode.a is rebuilt Ivan Skytte Jørgensen 2018-04-05 14:59:08 +02:00
  • f05edcaf85 Add reinitializeSettings to QueryLanguage Ai Lin Chia 2018-04-05 14:01:09 +02:00
  • 5177eb1050 output termid when dumping posdb with string-term instead of number-term Ivan Skytte Jørgensen 2018-04-05 13:50:30 +02:00
  • aa68b9433c Merge branch 'master' into tokenizer Ai Lin Chia 2018-04-04 12:18:30 +02:00
  • 15105bde61 Merge branch 'master' into dev-language Ai Lin Chia 2018-04-04 12:18:20 +02:00
  • a6dca3a407 Add x bit on gbclean.sh Ai Lin Chia 2018-04-04 12:18:03 +02:00
  • 7fb1a168e9 Archive gbclean.sh script as well Ai Lin Chia 2018-04-04 12:06:33 +02:00
  • 884077f970 Merge branch 'master' into tokenizer Ai Lin Chia 2018-04-04 11:55:47 +02:00
  • 5758ae0564 Merge branch 'master' into dev-language Ai Lin Chia 2018-04-04 11:17:40 +02:00
  • 7f5763d465 tokenizer: a few more unittests for slash-abbreviations Ivan Skytte Jørgensen 2018-04-04 01:07:10 +02:00
  • d77f25be29 tokenizer: added a few more slash-abbreviations Ivan Skytte Jørgensen 2018-04-03 19:15:48 +02:00
  • 3e3f116546 Add gbclean.sh file for test purposes (clear gb data) Ai Lin Chia 2018-04-03 17:18:28 +02:00
  • 367591e2f4 Revert "Use gbclean.sh instead of Makefile" Ai Lin Chia 2018-04-03 17:16:42 +02:00
  • bf72a7d80b Use gbclean.sh instead of Makefile Ai Lin Chia 2018-04-03 17:15:44 +02:00
  • 78eeccfb0b bugfix: fx_blang handling with the new longer content Ivan Skytte Jørgensen 2018-04-03 17:14:33 +02:00
  • 045c9bcce8 Bugfix: Don't index what is inside <iframe>...</iframe> tags Ivan Skytte Jørgensen 2018-04-03 16:19:31 +02:00
  • 2f0d5f9c4c tokenizer (query part): Handle slash-abbreviations as a single token Ivan Skytte Jørgensen 2018-04-03 16:08:58 +02:00
  • f68528d554 tokenizer: Handle slash-abbreviations as a single token Ivan Skytte Jørgensen 2018-04-03 16:00:55 +02:00
  • d6769f032f Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-04-03 14:25:46 +02:00
  • e6171bf9ad Removed one unused variant of XmlDoc::printTermList() Ivan Skytte Jørgensen 2018-04-03 14:25:35 +02:00
  • 9c0712c10d Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-04-03 14:20:16 +02:00
  • ed75e821d1 Simplify XmlDoc::printTermList() a bit Ivan Skytte Jørgensen 2018-04-03 14:20:10 +02:00
  • 9498b2ce5e Add libcares.so symlink to dist file Ai Lin Chia 2018-04-03 11:41:33 +02:00
  • 93700bf815 Add unicode files into libgb.a Ai Lin Chia 2018-04-03 10:45:55 +02:00
  • 7dc709e77e Revert "Don't archive files anymore" Ai Lin Chia 2018-03-28 17:19:48 +02:00
  • 9c31931be7 Don't archive files anymore Ai Lin Chia 2018-03-28 16:41:44 +02:00
  • ea5d33d38f Fix exclusion Ai Lin Chia 2018-03-28 16:32:03 +02:00
  • 2706bda99c More header files Ai Lin Chia 2018-03-28 16:12:51 +02:00
  • b221e068cf Back to GitSCM Ai Lin Chia 2018-03-28 15:52:07 +02:00
  • cffa6fbdec Archive header files Ai Lin Chia 2018-03-28 15:08:52 +02:00
  • df10d1ced4 Add extensions Ai Lin Chia 2018-03-28 14:33:00 +02:00
  • 6dd09f5f9c Archive WantedCheckerApi.h as well Ai Lin Chia 2018-03-28 14:26:59 +02:00
  • 0005023861 Add more SubmoduleOptionTrait options Ai Lin Chia 2018-03-28 13:55:33 +02:00
  • b18e4e38b5 Merge branch 'master' into dev-language Ai Lin Chia 2018-03-28 11:15:49 +02:00
  • 335d83a4c0 Force symlink Ai Lin Chia 2018-03-28 10:40:34 +02:00
  • 4c6808e3ff Fix dependency error Ai Lin Chia 2018-03-28 10:28:53 +02:00
  • 304dd05700 Don't distribute *.a files Ai Lin Chia 2018-03-28 00:34:37 +02:00
  • 01204bd359 Fix archive separator Ai Lin Chia 2018-03-28 00:19:39 +02:00
  • 0d3aaf465e Archive libgb.a as well Ai Lin Chia 2018-03-27 23:51:48 +02:00
  • ac79d9c076 Fix dependency Ai Lin Chia 2018-03-27 23:42:51 +02:00
  • 6e7891d840 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-03-27 17:16:41 +02:00
  • 5b81487eb4 Give CED URL hint too Ivan Skytte Jørgensen 2018-03-27 16:50:59 +02:00
  • 93b5fafd33 Merge branch 'master' into dev-language Ai Lin Chia 2018-03-27 16:48:31 +02:00
  • 582a58a945 More jenkinsfile changes Ai Lin Chia 2018-03-27 16:38:52 +02:00
  • fab6b3b4e6 Give charset hints to CED Ivan Skytte Jørgensen 2018-03-27 16:26:57 +02:00
  • b49ddadc91 More jenkinsfile changes Ai Lin Chia 2018-03-27 16:25:33 +02:00
  • f281b3ba77 More test Ai Lin Chia 2018-03-27 16:20:54 +02:00
  • b05f143f93 More Jenkinsfile test Ai Lin Chia 2018-03-27 16:17:13 +02:00
  • 17593b6a97 Jenkinsfile test changes Ai Lin Chia 2018-03-27 16:09:32 +02:00
  • 31be2f7213 Tweak charset detection: also prefer utf8 if charset is so far unknown Ivan Skytte Jørgensen 2018-03-27 15:44:00 +02:00
  • 5cfcfe77c2 Fix archive path Ai Lin Chia 2018-03-27 11:26:59 +02:00
  • 3aa5763217 Combine archive to build step Ai Lin Chia 2018-03-27 10:52:13 +02:00
  • 411215be2a Add archive artifacts Ai Lin Chia 2018-03-27 10:21:11 +02:00
  • 2a3a305f4e Fix make dist Ai Lin Chia 2018-03-27 10:20:49 +02:00
  • ded1e24119 Merge branch 'master' into dev-language Ai Lin Chia 2018-03-27 10:20:28 +02:00
  • 312a5b9a43 Use utf8 copyright symbol in source code instead of hardcoded utf8 byte sequence Ivan Skytte Jørgensen 2018-03-19 17:42:06 +01:00
  • 59748796b7 Use &check; html entity instead of hardcoded utf-8 bytes Ivan Skytte Jørgensen 2018-03-19 17:36:46 +01:00
  • 9833c1593d bugfix summary debug Ivan Skytte Jørgensen 2018-03-26 18:01:04 +02:00
  • c6bc01eea5 Use same optimization in unicode+tokenizer as in the main executable Ivan Skytte Jørgensen 2018-03-26 16:58:25 +02:00
  • de4ff8373a tokenizer: hacked support for phase-2 tokens C++ and C# in Query Ivan Skytte Jørgensen 2018-03-26 15:37:46 +02:00
  • aaaa03d58a tokenizer: tweak possessive-apostrophe for English vs. non-English Ivan Skytte Jørgensen 2018-03-26 15:26:55 +02:00
  • d921db3ba0 tokenizer: possessive-s: handle uppercase too Ivan Skytte Jørgensen 2018-03-26 14:56:35 +02:00
  • 50740e4a00 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-03-26 12:59:24 +02:00
  • 229ff8e036 Renamed Query*::phrase* to ...bigram... Ivan Skytte Jørgensen 2018-03-26 12:58:02 +02:00