mirror of
https://github.com/privacore/open-source-search-engine.git
synced 2025-12-13 21:54:35 -05:00
If a bigram contained one of the 109 English stopwords, the hash was XORed with 0x768867 in an attempt to distinguish bigrams that were split compound words from bigrams that couldn't form a compound word, e.g.:

  Not a compound word: "the rapist" vs. "therapist"
  Possibly a compound word: "light footed" vs. "lightfooted"

However, the code was English-specific and could hurt results for other languages. And you need a POS tagger to do it correctly.
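For illustration, the stopword marking described above can be sketched roughly as follows. This is not the repository's code: the function names, the FNV-1a hash, and the short stopword list are all stand-ins, and the sketch assumes that an unmarked bigram hash is meant to equal the hash of the concatenated word.

```cpp
#include <cstdint>
#include <string>
#include <unordered_set>

// Hypothetical subset; the list described in the commit had 109 English stopwords.
static const std::unordered_set<std::string> kStopwords = {
    "the", "a", "an", "of", "and", "to", "in"
};

// Stand-in hash (64-bit FNV-1a). The engine's real hash function is an
// assumption here. Passing a previous hash as the seed continues hashing,
// so hashWord("rapist", hashWord("the")) == hashWord("therapist").
static uint64_t hashWord(const std::string &w,
                         uint64_t h = 14695981039346656037ULL) {
    for (unsigned char c : w) {
        h ^= c;
        h *= 1099511628211ULL;
    }
    return h;
}

static bool isStopword(const std::string &w) {
    return kStopwords.count(w) != 0;
}

// Bigram hash without any marking: identical to the hash of the
// two words written together as one compound word.
static uint64_t plainBigramHash(const std::string &w1, const std::string &w2) {
    return hashWord(w2, hashWord(w1));
}

// The behaviour the commit message describes: if either word is a stopword,
// XOR the hash with 0x768867 so it no longer collides with the compound form.
static uint64_t stopwordMarkedBigramHash(const std::string &w1,
                                         const std::string &w2) {
    uint64_t h = plainBigramHash(w1, w2);
    if (isStopword(w1) || isStopword(w2)) {
        h ^= 0x768867ULL;
    }
    return h;
}
```

Under these assumptions, stopwordMarkedBigramHash("light", "footed") matches hashWord("lightfooted"), while stopwordMarkedBigramHash("the", "rapist") differs from hashWord("therapist"). The check depends entirely on an English word list, which is the language-specific limitation the message points out.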
14 KiB