privacore-open-source-searc…/Synonyms.cpp
Ivan Skytte Jørgensen 8853a156ac bugfix/workaround for bigram hashes
If a bigram contained one of the 109 English stopwords, the hash was XORed with 0x768867 in an attempt to distinguish bigrams that were split compound words from bigrams that couldn't form a compound word, e.g.:
  Not a compound word: "the rapist" vs. "therapist"
  Possibly a compound word: "light footed" vs. "lightfooted"
However, the code was English-specific and could hurt results for other languages. And you need a POS-tagger to do it correctly.
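
The behaviour described above can be sketched roughly as follows. This is only an illustration, not the code from Synonyms.cpp: the stopword set, `hashWord`, `bigramHash`, and the `stopwordWorkaround` flag are hypothetical names, the word hash is a plain FNV-1a stand-in, and it is assumed that a bigram is hashed as the concatenation of its two words. Only the constant 0x768867 comes from the commit message.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_set>

// Hypothetical stand-in for the 109-entry English stopword list.
static const std::unordered_set<std::string> kStopwords = {"the", "a", "of", "and"};

// Hypothetical word hash (64-bit FNV-1a); the real hash function may differ.
static uint64_t hashWord(const std::string &word) {
    uint64_t h = 14695981039346656037ULL;
    for (unsigned char c : word) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

// Assumption: a bigram hashes like the compound word formed by joining its parts.
// With the workaround enabled, a bigram containing a stopword gets its hash
// XORed with 0x768867 so that it no longer collides with the joined word.
static uint64_t bigramHash(const std::string &w1, const std::string &w2,
                           bool stopwordWorkaround) {
    uint64_t h = hashWord(w1 + w2);
    if (stopwordWorkaround && (kStopwords.count(w1) || kStopwords.count(w2))) {
        h ^= 0x768867;
    }
    return h;
}
```

Under these assumptions, "light footed" keeps the same hash as "lightfooted" either way, while "the rapist" matches "therapist" only when the workaround is disabled.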
2018-08-02 13:14:44 +02:00

14 KiB