Commit Graph

  • a6753aebb6 Use better default weight for unsupported languages Ivan Skytte Jørgensen 2018-05-22 16:10:51 +02:00
  • eaa5137582 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-05-22 15:53:18 +02:00
  • 9ac7d9324c Fix double->int conversion for language weights Ivan Skytte Jørgensen 2018-05-22 15:30:54 +02:00
  • 66e66e5474 Fix race condition in Msg3a Ivan Skytte Jørgensen 2018-05-22 14:28:46 +02:00
  • 383ae325c2 Use more descriptive memory label for Msg0 State00 Ivan Skytte Jørgensen 2018-05-22 14:04:24 +02:00
  • 4ca9aaacb4 Mem.cpp: use size_t instead of unsigned long Ivan Skytte Jørgensen 2018-05-22 14:02:50 +02:00
  • 4d08365ba5 Remove undefined function Ai Lin Chia 2018-05-22 12:35:40 +02:00
  • 96be2e4a9c Remove read only variable & useless function Ai Lin Chia 2018-05-22 12:27:54 +02:00
  • 4cbf1b2639 By default disable query language & url classification Ai Lin Chia 2018-05-19 00:21:02 +02:00
  • 9dee894cae bugfix: Use mfree() instead of delete for copied-out data from RdbCache Ivan Skytte Jørgensen 2018-05-18 16:38:51 +02:00
  • a45d903cc6 Merge branch 'master' into dev-inject Ai Lin Chia 2018-05-18 14:53:08 +02:00
  • bbcfa50719 Remove read only XmlDoc::m_isImporting Ai Lin Chia 2018-05-18 13:06:07 +02:00
  • 3594824604 Make sure we don't insert into spiderdb when spideringLinks is not set while injecting Ai Lin Chia 2018-05-18 12:55:51 +02:00
  • c5ff7e48e0 If language weights for ranking were non-identical then include the weights in the JSON output (makes life easier while debugging) Ivan Skytte Jørgensen 2018-05-17 16:10:36 +02:00
  • 7d08fb9791 Make sure we pass in the parameters as well Ai Lin Chia 2018-05-17 15:26:28 +02:00
  • e7337a6ebd Log unicode data filename that failed to load Ivan Skytte Jørgensen 2018-05-17 15:19:23 +02:00
  • a30865179a Set ptr_redirUrl as well Ai Lin Chia 2018-05-17 14:50:43 +02:00
  • 1ad46b1ee7 Add 3 new parameters for injecting document (httpstatus, indexcode, redirurl) Ai Lin Chia 2018-05-17 12:36:59 +02:00
  • 75d81c5469 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-05-15 17:04:55 +02:00
  • 56ca8155c9 Some refinement of query-lang-server use (fallback, langUnknown handling, etc) Ivan Skytte Jørgensen 2018-05-15 16:51:51 +02:00
  • 74e9036c47 Use multiple/all language weights returned by the external query-language-server Ivan Skytte Jørgensen 2018-05-15 16:30:52 +02:00
  • 1455be5f97 qlangserver: return weights as doubles instead of ints Ivan Skytte Jørgensen 2018-05-15 15:09:51 +02:00
  • 996470ade0 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-05-15 14:41:11 +02:00
  • 0e08a15346 Eliminate <br> tags and similar in summaries Ivan Skytte Jørgensen 2018-05-15 13:56:14 +02:00
  • b30f860fdd Detect (some) html formatting in <meta> content and ignore it for indexing purposes Ivan Skytte Jørgensen 2018-05-15 12:45:57 +02:00
  • 1f3f327916 try to detect double-encoded entities in <title> and index it the expected way Ivan Skytte Jørgensen 2018-05-14 16:45:31 +02:00
  • 484badd223 Index double-encoded html entities in meta description/summary too Ivan Skytte Jørgensen 2018-05-14 16:26:32 +02:00
  • cbbe51692b Work around double-encoded html entities in og:description and similar Ivan Skytte Jørgensen 2018-05-14 16:13:37 +02:00
  • c1491a846f Don't escape/encode < and > in meta tags for indexing Ivan Skytte Jørgensen 2018-05-14 15:28:26 +02:00
  • a62e42b773 bugfix indexing of some <meta> tags Ivan Skytte Jørgensen 2018-05-14 15:20:09 +02:00
  • c4ce78a3c6 bugfix timeout for msg20 generation (int32 overflow) Ivan Skytte Jørgensen 2018-05-14 15:19:23 +02:00
  • 4b36241dd4 Don't do retries for error scenario in SiteGetter Ai Lin Chia 2018-05-11 19:20:28 +02:00
  • 415e4c315d Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-05-11 16:23:38 +02:00
  • 7807295773 Expanded punycode-safe ccTLD list Ivan Skytte Jørgensen 2018-05-11 16:06:44 +02:00
  • 44e6e99946 Handle punycode/IDN for middle-domain indexing Ivan Skytte Jørgensen 2018-05-11 15:46:32 +02:00
  • 606567c6b2 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-05-11 15:21:56 +02:00
  • dc276bf0e8 Index punycode-encoded labels for inurl: Ivan Skytte Jørgensen 2018-05-11 14:56:47 +02:00
  • 3dbf510e41 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-05-11 12:00:47 +02:00
  • 39786cba91 Overload delete with size as well (address sanitizer dies at protobuf if this is not there) Ai Lin Chia 2018-05-09 15:49:20 +02:00
  • 5463567d68 Add FX-Index-Code & HTTP-Status to FXARC Ai Lin Chia 2018-05-08 11:52:01 +02:00
  • 0318fed04d Only dump redirected titlerecs and add field to store redirected url Ai Lin Chia 2018-05-08 11:23:27 +02:00
  • 0702af3408 Don't do strlen on null string Ai Lin Chia 2018-05-07 11:18:00 +02:00
  • 7f9ee3eae8 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-05-03 14:12:34 +02:00
  • 635c8a2a71 Better comments about the 'lang' field value in posdb Ivan Skytte Jørgensen 2018-05-03 14:05:13 +02:00
  • 1b2181bace Changed posdbtable to use enum lang_t Ivan Skytte Jørgensen 2018-05-03 12:00:25 +02:00
  • ee27183e70 Made s_variantLikeSubDomainTable more complete by adding most of the two-letter iso 639-1 language codes Ivan Skytte Jørgensen 2018-05-01 16:25:47 +02:00
  • 7087c124a3 Parameter for 'dedup URLs by default' (ddud) so we can control it per collection Ivan Skytte Jørgensen 2018-05-01 16:13:42 +02:00
  • 8ced12dd83 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-05-01 14:13:44 +02:00
  • f83f88d9e4 Use language-weight array instead of a single language in msg39 Ivan Skytte Jørgensen 2018-04-30 15:31:44 +02:00
  • 5b7fabe5be Removed Msg39Request::m_stripe Ivan Skytte Jørgensen 2018-04-26 16:16:10 +02:00
  • fc2317bc89 Fix dump name and memory leak Ai Lin Chia 2018-04-26 15:00:09 +02:00
  • 4616b66046 Add initial implementation of dumping titledb to customized archive file Ai Lin Chia 2018-04-26 14:40:26 +02:00
  • 9595e595e5 Remove unused InjectionRequest variable Ai Lin Chia 2018-04-25 10:56:25 +02:00
  • 1269bfdbea Add print version to tools Ai Lin Chia 2018-04-24 15:39:40 +02:00
  • abb30bc51e Initialize domains for tools Ai Lin Chia 2018-04-24 14:42:58 +02:00
  • 4879092936 Fix DomainsTest & make sure we initializeDomains for unit test Ai Lin Chia 2018-04-24 14:35:52 +02:00
  • 8a40f04f3e Fixed print_urlinfo.cpp Ivan Skytte Jørgensen 2018-04-24 14:17:20 +02:00
  • d13c73fe48 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-04-24 13:41:47 +02:00
  • 937967970b Fix comments. DocProcess must not use niceness of 0 (this is used for queries) Ai Lin Chia 2018-04-24 12:37:50 +02:00
  • 8bc88fd846 Removed check for approved checksum in Speller.cpp and Wiktionary.cpp Ivan Skytte Jørgensen 2018-04-24 12:28:54 +02:00
  • dd1796a3f1 Revert back to original TLD list Ivan Skytte Jørgensen 2018-04-24 12:24:06 +02:00
  • ac7fd6293d fix .au and .us 2nd-level TLDs Ivan Skytte Jørgensen 2018-04-24 12:20:00 +02:00
  • 99048df12d document where to get Mozilla's 2nd-level TLD list from Ivan Skytte Jørgensen 2018-04-23 15:35:40 +02:00
  • f955370464 bugfix tlds.txt Ivan Skytte Jørgensen 2018-04-23 15:33:51 +02:00
  • a7be88039f Merge branch 'master' of github.com:privacore/open-source-search-engine Ivan Skytte Jørgensen 2018-04-23 15:16:39 +02:00
  • ad4f73c85e tlds.txt should be lowercase Ivan Skytte Jørgensen 2018-04-23 15:16:30 +02:00
  • 869546ad13 Use const& instead Ai Lin Chia 2018-04-23 15:10:02 +02:00
  • 9cf1f7e5bb const for ::getDomain() Ivan Skytte Jørgensen 2018-04-23 13:23:26 +02:00
  • 2af5abfadf More unittests for Domains.cpp Ivan Skytte Jørgensen 2018-04-23 13:21:10 +02:00
  • c93ca32eac Unittest for Domains.cpp; added initialize/finalize for Domains Test Ivan Skytte Jørgensen 2018-04-23 13:11:46 +02:00
  • 518f60803d bugfix cornercase of processing of tlds.txt Ivan Skytte Jørgensen 2018-04-22 17:33:29 +02:00
  • 9674e723e4 Change list size Ai Lin Chia 2018-04-20 16:23:30 +02:00
  • 3872fe9ec9 Add code to verify shardnum Ai Lin Chia 2018-04-20 16:13:40 +02:00
  • 400e417b95 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-04-20 14:28:47 +02:00
  • e8f6314d3e Optionally load tlds from tlds.txt instead of using built-in defaults Ivan Skytte Jørgensen 2018-04-20 14:03:17 +02:00
  • d2e1e6e5eb Moved hardcoded TLD list from Domains.cpp to external text files Ivan Skytte Jørgensen 2018-04-20 13:43:49 +02:00
  • a437ba4f96 Check for out of range docId based on url Ai Lin Chia 2018-04-20 12:08:38 +02:00
  • 20d00433e7 Remove check for url with wrongly encoded sequence Ai Lin Chia 2018-04-20 11:45:59 +02:00
  • 41135ba0ea Only use blang with max weight as content language hint to CLD2 Ai Lin Chia 2018-04-19 11:37:56 +02:00
  • 3797e3b3fb Don't use fx_fetld as hint when it's not a country tld Ai Lin Chia 2018-04-19 11:15:27 +02:00
  • 3fa85873cd Log ip in string instead of int Ai Lin Chia 2018-04-18 16:06:36 +02:00
  • ac4be4814a Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-04-17 12:02:11 +02:00
  • 5566483a8a Try to find url which may be encoded wrongly Ai Lin Chia 2018-04-16 17:01:00 +02:00
  • 650c3c51b1 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-04-16 15:47:27 +02:00
  • 2307a59dc2 More checks for detecting wrong ascii encoding Ai Lin Chia 2018-04-16 15:14:38 +02:00
  • 8750337040 Fix dump_wrong_encoding with ASCII charset Ai Lin Chia 2018-04-16 13:58:00 +02:00
  • b670b56a23 Check for ascii as well (to detect probable wrong encoding) Ai Lin Chia 2018-04-16 12:51:06 +02:00
  • 8cfe1aa813 Don't check for more wrong encodings once we have established the fact Ai Lin Chia 2018-04-16 11:39:20 +02:00
  • ce92434b62 Add more encodings to dump_wrong_encoding Ai Lin Chia 2018-04-16 11:34:04 +02:00
  • 4c9ce2a1d2 Add dump_charset to tools Ai Lin Chia 2018-04-13 15:35:13 +02:00
  • 8324b5f337 We should be able to skip contenthash check even when we're not injecting content Ai Lin Chia 2018-04-13 13:27:12 +02:00
  • 8b3036b4c9 Add a tmperr file so we can keep track of tmp errors that happen during docprocess Ai Lin Chia 2018-04-13 13:19:17 +02:00
  • 70bce30c3b skipContentHashCheck for DocReindex Ai Lin Chia 2018-04-13 13:18:08 +02:00
  • c888ff1a21 Split up parameter for DocProcess delay as well Ai Lin Chia 2018-04-13 11:54:31 +02:00
  • 108626e9fe Restart processing of url/docid when we get udp timeout Ai Lin Chia 2018-04-13 11:39:35 +02:00
  • 53f5ebb85c Fix compilation error (missed file in previous commit) Ai Lin Chia 2018-04-12 22:50:58 +02:00
  • ee9baa56ef Split up configuration for max pending (DocDelete, DocReindex, DocRebuild) Ai Lin Chia 2018-04-12 22:48:31 +02:00
  • 0ff7b161ed Make sure we strip session id even if query parameter is separated by '?' Ai Lin Chia 2018-04-12 15:24:55 +02:00
  • 3a2d56849f Merge branch 'tokenizer' of github.com:privacore/open-source-search-engine into tokenizer Ivan Skytte Jørgensen 2018-04-12 14:41:44 +02:00
  • 48d674fb46 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-04-12 14:41:36 +02:00