bugfix/workaround for bigram hashes

If a bigram contained one of the 109 English stopwords then the hash was XORed with 0x768867, in an attempt to distinguish bigrams that were split compound words from bigrams that couldn't form a compound word. E.g.:
  Not a compound word: "the rapist" versus "therapist"
  Possibly a compound word: "light footed" vs. "lightfooted"
However, the code was English-specific and could hurt results for other languages. And you need a POS-tagger to do it correctly.
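The old behaviour can be sketched roughly like this (the hash function, the stopword list, and the names `tokenHash`/`bigramHash` are illustrative stand-ins, not the actual Phrases.cpp code):

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>

// Illustrative stand-in for the 109-entry English stopword table.
static const std::unordered_set<std::string> kStopwords = {
    "the", "and", "of", "to", "in" /* ...rest of the stopwords... */
};

// Toy FNV-1a hash; the real code uses Gigablast's own hashing.
static uint64_t tokenHash(const std::string &tok) {
    uint64_t h = 1469598103934665603ULL;
    for (char c : tok) {
        h ^= static_cast<unsigned char>(c);
        h *= 1099511628211ULL;
    }
    return h;
}

// Hash a bigram as if the two words were joined; if either word is a
// stopword, flip the hash with 0x768867 so that e.g. "the rapist" can
// never collide with the single-word hash of "therapist".
static uint64_t bigramHash(const std::string &w1, const std::string &w2) {
    uint64_t h = tokenHash(w1 + w2);
    if (kStopwords.count(w1) || kStopwords.count(w2))
        h ^= 0x768867;
    return h;
}
```

Under this scheme `bigramHash("the", "rapist")` differs from `tokenHash("therapist")`, while a stopword-free bigram like "light footed" keeps the plain concatenation hash and can still match the compound form "lightfooted".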
This commit is contained in:
Ivan Skytte Jørgensen 2018-08-02 13:14:44 +02:00
parent c1de250ddc
commit 8853a156ac
2 changed files with 10 additions and 2 deletions

@@ -246,7 +246,15 @@ void Phrases::setPhrase(unsigned i, const TokenizerResult &tr, const Bits &bits)
 // . "st. and" !-> stand
 // . "the rapist" !-> therapist
 else {
-m_phraseIds2[i] = h2 ^ 0x768867;
+m_phraseIds2[i] = h2;
+//The XORing with 0x768867 was lost in the code that uses the new tokenizer instead of the old "Words" class.
+//So the above detection of stopwords doesn't have any effect currently. I'm not even sure if it has any
+//benefits for languages that compound words more often than English (german/danish/norwegian/greek/…)
+//We keep the code around if we ever want to do something similar, but possibly for more than just English,
+//and hopefully with detection of part-of-speech.
+//Old code:
+//m_phraseIds2[i] = h2 ^ 0x768867;
+//note: Synonyms.cpp also does this
 logTrace( g_conf.m_logTracePhrases, "i=%3" PRId32 ", wids[i]=%20" PRIu64". END. either no hyphen or a stopword. m_phraseIds2[i]=%" PRIu64 "", i, token1.token_hash, m_phraseIds2[i]);
 }
 }

@@ -490,7 +490,7 @@ bool Synonyms::addAmpPhrase(const TokenizerResult *tr, unsigned wordNum, class H
 // logic in Phrases.cpp will xor it with 0x768867
 // because it contains a stop word. this prevents "st.
 // and" from matching "stand".
-h ^= 0x768867;
+h ^= 0x768867; //keep in sync with Phrases
 // do not add dups
 if ( dt->isInTable ( &h ) ) return true;