Commit Graph

  • 312b8c197e Merge branch 'master' into dev-sitetemp Ai Lin Chia 2018-06-19 11:54:16 +02:00
  • 27b3b6c199 Collect warnings at end Ivan Skytte Jørgensen 2018-06-18 00:49:49 +02:00
  • 4cadd5549e More sto changes for tree-of-files Ivan Skytte Jørgensen 2018-06-15 17:48:12 +02:00
  • 88a4f9067f Changed sto_convert.py to be able to handle whole tree of files Ivan Skytte Jørgensen 2018-06-15 17:02:54 +02:00
  • 7449f84afb Initialize & finalize SiteMedianPageTemperature Ai Lin Chia 2018-06-15 15:24:11 +02:00
  • 927b5eb8d8 Use the right trace log conf Ai Lin Chia 2018-06-15 14:58:26 +02:00
  • cce058770d Replace SiteDefaultPageTemperatureRemoteRegistry with new SiteMedianPageTemperature Ai Lin Chia 2018-06-15 14:43:30 +02:00
  • 889e54066a Don't use local site median page temperature registry while spidering anymore Ai Lin Chia 2018-06-15 12:24:54 +02:00
  • 3f8736e12c New site median page temperature client & settings Ai Lin Chia 2018-06-15 11:38:32 +02:00
  • 7512be9752 Merge branch 'master' into dev-siteinfo Ai Lin Chia 2018-06-14 17:29:59 +02:00
  • 743755a542 Only call site num inlinks server once Ai Lin Chia 2018-06-14 14:14:47 +02:00
  • 26876ed149 word variations: danish: adjective grammatical-gender simplification Ivan Skytte Jørgensen 2018-06-14 13:23:24 +02:00
  • f9daa253b0 indentation fix Ivan Skytte Jørgensen 2018-06-14 12:51:38 +02:00
  • 8f0dfe203e Extended cgi parameter name length limit from ~54 to 128 Ivan Skytte Jørgensen 2018-06-14 12:50:44 +02:00
  • d8a9f394df Merge branch 'master' into dev-proxy Ai Lin Chia 2018-06-13 17:38:33 +02:00
  • d8e03ccfb2 tokenizer: combining mark removal for Italian Ivan Skytte Jørgensen 2018-06-13 16:56:59 +02:00
  • 6cc5bb157f log error message when lexicon for lemmatizer doesn't load Ivan Skytte Jørgensen 2018-06-11 12:18:26 +02:00
  • af1dc0700c Merge branch 'master' into lemma Ivan Skytte Jørgensen 2018-06-08 15:54:16 +02:00
  • c4be13a0bf Handle greater-than in attribute values more correctly Ivan Skytte Jørgensen 2018-06-08 15:46:10 +02:00
  • 303f327a9e Don't rely on nul-termination in PageResults.cpp:sendReply() Ivan Skytte Jørgensen 2018-06-08 13:56:22 +02:00
  • 68abfbaf04 Sometimes size_linkText includes the terminating NUL, causing NULs to appear in search results, truncating it Ivan Skytte Jørgensen 2018-06-08 13:49:11 +02:00
  • 2375762a3e bugfix msg3a: when shards are down it didn't keep track of request/reply correctly Ivan Skytte Jørgensen 2018-06-07 16:40:57 +02:00
  • ee80ea98d0 Get site num inlinks with g_siteNumInlinks.getSiteNumInlinks if sitenuminlinks server is up Ai Lin Chia 2018-06-06 21:58:26 +02:00
  • 4a1f05f9eb Disable m_max_outstanding when it's 0 (as described in configuration) Ai Lin Chia 2018-06-06 16:05:45 +02:00
  • 1c6ccacd7a Merge branch 'master' into dev-siteinfo Ai Lin Chia 2018-06-06 13:30:32 +02:00
  • aa736a8314 Merge branch 'master' into dev-proxy Ai Lin Chia 2018-06-06 12:14:33 +02:00
  • 2f88aa7d9b It looks like whitespace is part of utf8 punctuation, but we handle whitespace differently. Ai Lin Chia 2018-06-06 12:12:47 +02:00
  • 3f83c78291 Include tokenizer/*.o in libgb.a Ivan Skytte Jørgensen 2018-06-04 15:02:59 +02:00
  • ffa15c0ab4 Optimize combine_possessive_s_tokens() Ivan Skytte Jørgensen 2018-06-04 13:42:42 +02:00
  • c84d652216 Merge branch 'master' into dev-proxy Ai Lin Chia 2018-06-04 13:22:30 +02:00
  • 6af730d745 Merge branch 'master' of github.com:privacore/open-source-search-engine Ivan Skytte Jørgensen 2018-06-04 13:18:27 +02:00
  • 0862cdc818 More unittests on collapse_slash_abbreviations() Ivan Skytte Jørgensen 2018-06-04 13:15:15 +02:00
  • cefc616ec4 bugfix collapse_slash_abbreviations(): handling last 2 tokens Ivan Skytte Jørgensen 2018-06-04 13:15:01 +02:00
  • 68caf85944 Fix collapse_slash_abbreviations(): too slow and inefficient Ivan Skytte Jørgensen 2018-06-04 13:07:46 +02:00
  • beac6599bd Answer is not required when it's just for printing in the UI Ai Lin Chia 2018-06-04 13:07:28 +02:00
  • 81cd4ac0be Don't retry again if we've already tried it once Ai Lin Chia 2018-06-03 21:34:49 +02:00
  • 822ef732c9 Log both original url & redirected url Ai Lin Chia 2018-06-03 21:28:04 +02:00
  • c4a4fe47b2 Fix segfault when ts is null Ai Lin Chia 2018-06-01 20:26:03 +02:00
  • cfd43973c4 Merge branch 'master' into dev-proxy Ai Lin Chia 2018-06-01 20:25:45 +02:00
  • d2e31e96ba Merge branch 'master' into lemma Ivan Skytte Jørgensen 2018-06-01 14:17:34 +02:00
  • fb95cf1461 bugfix handling Swedish phone numbers when they are at the end of the document/string Ivan Skytte Jørgensen 2018-06-01 13:38:37 +02:00
  • fe9239d574 bugfix coredump when handling ampersands Ivan Skytte Jørgensen 2018-06-01 13:30:39 +02:00
  • 8f45a6f7b9 Do phase-2 tokens on inlink text and geo-placename too Ivan Skytte Jørgensen 2018-06-01 12:19:37 +02:00
  • 2de6cf352a fix buffer overflow when handling possessive-s with inter-word non-words of byte length 3..4 Ivan Skytte Jørgensen 2018-05-31 20:02:19 +02:00
  • 1c59db3d7f Merge branch 'master' into dev-siteinfo Ai Lin Chia 2018-05-31 18:07:47 +02:00
  • d204364290 Use phase-2 tokens for indexing keywords tag Ivan Skytte Jørgensen 2018-05-31 17:53:35 +02:00
  • 80bc62d4d2 Use phase-2 tokens for indexing description and summary Ivan Skytte Jørgensen 2018-05-31 17:52:01 +02:00
  • 31634c482e Generate phase-2 tokens for indexing og:description and other meta tags Ivan Skytte Jørgensen 2018-05-31 17:40:38 +02:00
  • 9f7af52927 bugfix some corner cases of bigrams involving phase-2 tokens Ivan Skytte Jørgensen 2018-05-31 17:39:42 +02:00
  • 850b7962f8 Log url that is being 'detected' to be retried to proxy Ai Lin Chia 2018-05-31 16:55:09 +02:00
  • 322bae20d1 XmlDoc::hashTitle() now also generates tokenizer phase 2 tokens for indexing Ivan Skytte Jørgensen 2018-05-31 16:49:46 +02:00
  • 07e41027da More fixes for redirects proxy Ai Lin Chia 2018-05-31 15:57:46 +02:00
  • c206e0c191 When using getLocation base url needs to be set Ai Lin Chia 2018-05-31 15:29:00 +02:00
  • 8a68b5c619 Fix off-by-one in hashTitle() Ivan Skytte Jørgensen 2018-05-31 14:25:56 +02:00
  • b2edf251e4 Update example file Ai Lin Chia 2018-05-31 13:40:24 +02:00
  • 1e6293fd7a bugfix XmlDoc::hashTitle() (off-by-one error when there is no ending </title> tag) Ivan Skytte Jørgensen 2018-05-31 13:00:40 +02:00
  • 15c7713497 Rename urlblacklist.txt.example to urlmatchlist.txt.example Ai Lin Chia 2018-05-31 12:59:55 +02:00
  • 253a7cf736 Add msg when returning true Ai Lin Chia 2018-05-31 12:46:44 +02:00
  • dcb7aa46ee Add g_contentRetryProxyList & rename BlockList to MatchList Ai Lin Chia 2018-05-31 12:44:20 +02:00
  • a990000fe4 Retry when redirected url match urlretryproxylist Ai Lin Chia 2018-05-31 11:04:42 +02:00
  • c6c4c2ba2a Add urlproxylist.txt to decide if a url should be spidered using a proxy or not Ai Lin Chia 2018-05-30 12:21:24 +02:00
  • c1f5a18546 Fix coredump in getCountTable()->hashString_ct()->plain_tokenizer_phase_1() Ivan Skytte Jørgensen 2018-05-29 17:57:21 +02:00
  • d41a5c14ca bugfix hashTitle() -> possiblyDecodeHtmlEntitiesAgain() Ivan Skytte Jørgensen 2018-05-29 16:50:22 +02:00
  • 1f7bdfff37 Bits: Handle question marks/exclamation marks and non-latin punctuation for detecting end of sentences Ivan Skytte Jørgensen 2018-05-29 15:31:06 +02:00
  • 901677e634 Added some unittests for Bits::setForSummary() Ivan Skytte Jørgensen 2018-05-29 14:59:13 +02:00
  • de1625723e Renamed D_IN_PARENS -> D_IN_PARENTHESES Ivan Skytte Jørgensen 2018-05-29 14:52:41 +02:00
  • 21b76d9581 Renamed D_STARTS_FRAG->D_STARTS_FRAGMENT etc Ivan Skytte Jørgensen 2018-05-29 14:22:49 +02:00
  • 1bddf21b67 Removed dead code in Bits::setForSummary() Ivan Skytte Jørgensen 2018-05-29 14:18:40 +02:00
  • 97c7ffd23f Merge branch 'master' of github.com:privacore/open-source-search-engine Ivan Skytte Jørgensen 2018-05-28 17:02:37 +02:00
  • f1916431e7 Forgotten const on Url::getMidDomain() Ivan Skytte Jørgensen 2018-05-28 17:02:09 +02:00
  • 71629398ef Handle HTTP 1.0 & HTTP 1.1 response Ai Lin Chia 2018-05-28 15:23:01 +02:00
  • 98df7f0ad9 Merge branch 'master' into dev-siteinfo Ai Lin Chia 2018-05-28 15:21:56 +02:00
  • 1c464c94f5 Fix cgi name for sni parameters (too long) Ai Lin Chia 2018-05-28 12:29:24 +02:00
  • c506fd6155 Initial implementation of SiteNumInlinks client Ai Lin Chia 2018-05-25 16:45:38 +02:00
  • 7e9571e208 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-05-25 15:18:38 +02:00
  • 403f3463db hackish implementation for lexicon-based lemmatization Ivan Skytte Jørgensen 2018-05-25 14:51:56 +02:00
  • dc6faca09a Remove unused input parameter. Combine 2 if conditions to use if/else instead Ai Lin Chia 2018-05-25 14:42:00 +02:00
  • 8bcee9ecff STO: Added LexicalEntry::find_base_wordform() method Ivan Skytte Jørgensen 2018-05-25 12:33:38 +02:00
  • fe1f08f656 Merge branch 'master' into dev-siteinfo Ai Lin Chia 2018-05-25 12:24:39 +02:00
  • f2944d2152 bugfix Msg3a: hang when submitting additional round of Msg39 (eg. if clustering eliminates too many results) Ivan Skytte Jørgensen 2018-05-25 12:19:35 +02:00
  • d1e1a25112 Merge branch 'master' into dev-siteinfo Ai Lin Chia 2018-05-25 11:34:47 +02:00
  • 0a7dbf0839 Merge branch 'master' into tokenizer Ivan Skytte Jørgensen 2018-05-24 16:10:06 +02:00
  • 28b7595295 Allow user to specify language weights Ivan Skytte Jørgensen 2018-05-24 15:39:59 +02:00
  • d8d8bc1b42 Merge branch 'master' into dev-siteinfo Ai Lin Chia 2018-05-24 14:42:16 +02:00
  • a6d04c5b59 html decoding in meta tags and other places: replace <p> and <br> with spaces instead of empty strings (while indexing) Ivan Skytte Jørgensen 2018-05-24 14:27:35 +02:00
  • d2eeb2f56b html decoding in meta tags and other places: replace <p> and <br> with spaces instead of empty strings Ivan Skytte Jørgensen 2018-05-24 14:23:32 +02:00
  • 67e597cead Fix buffer overflow that happens when we're using SafeBuf::safeReplace2 to delete string Ai Lin Chia 2018-05-24 00:50:49 +02:00
  • 2cd16445d2 Remove write only m_rtq Ai Lin Chia 2018-05-23 14:59:44 +02:00
  • 64679c53a3 Remove commented out code. Remove unused variables Ai Lin Chia 2018-05-23 14:41:59 +02:00
  • ae139b0d71 Remove unused variables Ai Lin Chia 2018-05-23 14:39:41 +02:00
  • 9c22431026 We're no longer using m_pbuf in set2 (will never be not null) Ai Lin Chia 2018-05-23 14:36:10 +02:00
  • d04dc9aecf Code style changes. Simplify logic. Group same criteria together Ai Lin Chia 2018-05-23 14:32:33 +02:00
  • af0be95f54 Remove redundant code Ai Lin Chia 2018-05-23 14:29:46 +02:00
  • bb90b71e3d Remove commented out code Ai Lin Chia 2018-05-23 14:29:34 +02:00
  • 296432c3e6 Remove always null pbuf from XmlDoc::set2 input parameter Ai Lin Chia 2018-05-23 14:28:51 +02:00
  • 330798a16a Remove write only m_spamWeight & simplify some log lines Ai Lin Chia 2018-05-23 14:20:40 +02:00
  • 08683280f2 Remove unused canBeCanceled parameter Ai Lin Chia 2018-05-23 14:19:10 +02:00
  • 7d05dd7d28 Remove unused /print api & cleanup PageParser Ai Lin Chia 2018-05-23 14:15:36 +02:00
  • ba129ef36a Code style changes/Remove commented out code Ai Lin Chia 2018-05-23 11:25:12 +02:00
  • 3b51b3a079 Remove unused function/variable Ai Lin Chia 2018-05-23 11:23:26 +02:00