privacore-open-source-searc.../unicode
2018-07-24 17:21:54 +02:00
..
.gitignore gitignore unicode/*.dat 2018-04-06 13:19:30 +02:00
generate_tables.py tokenizer: handle word hyphenation 2018-03-02 14:04:11 +01:00
Makefile Use same optimization in unicode+tokenizer as in the main executable 2018-03-26 16:58:25 +02:00
README.txt Spelling fix unicode readme 2018-07-24 17:21:54 +02:00
UCCMDecompose_unittest.cpp work-in-progress: new tokenizer 2018-03-01 16:38:19 +01:00
UCCMDecompose.cpp minor stuff 2018-03-01 18:17:30 +01:00
UCCMDecompose.h combining-mark decomposition+composition 2018-03-01 16:18:51 +01:00
UCDecompose_unittest.cpp Aded Unicode::recursive_canonical_decompose() 2018-02-06 14:54:01 +01:00
UCDecompose.cpp combining-mark decomposition+composition 2018-03-01 16:18:51 +01:00
UCDecompose.h Aded Unicode::recursive_canonical_decompose() 2018-02-06 14:54:01 +01:00
UCEnums.h Make gcc 7.x happy 2018-02-09 16:01:47 +01:00
UCMap_unittest.cpp Make unicode_is_ignorable.dat 2018-02-26 17:44:30 +01:00
UCMap.cpp Added unicode decomposition map for combining-mark removal 2018-02-26 15:46:22 +01:00
UCMap.h Added unicode decomposition map for combining-mark removal 2018-02-26 15:46:22 +01:00
UCMaps.cpp Merge branch 'master' into tokenizer 2018-05-22 15:53:18 +02:00
UCMaps.h Make unicode_is_ignorable.dat 2018-02-26 17:44:30 +01:00
UnicodeData.py Preliminary commit on unicode update 2018-02-02 18:01:59 +01:00

File maps and how they are used.

The data comes from unicode.com and are in text format.
To speed up loading the files are converted into binary format. The format depends on the data and how it is used.

==== Lookups
We are only interested in following lookups:

=== codepoint -> script (Scripts.txt)
All codepoints map to a script. The script can be latin/greek/cyrillic/... but also "common" and "inherited".

=== codepoint -> category (UnicodeData.txt)
A codepoint belongs to a general category such as uppercase_letter, number, punctuation, separator, ...
UnicodeData.txt uses 30+ categories.

=== ? -> lowercase
For hashing consistently we need to map from uppercase/titlecase to lowercase.

==== Optimal map types
Most of the codepoints we process are in the 0..128 range, so optimizing for that can be worth it. Some properties are in long ranges, so that may be worth optimizing for too. Some properties are sparse i.e. applies only to a small subset of the available codepoints. So the optimal map type isn't clear-cut.

=== codepoint->script
All codepoints belong to a script (almost, there are some without in the private-use blocks that doesn't have them).
Optimal map type: full map, 1 byte per codepoint. Lookups will mostly hit the first 128 bytes. It May be worth adding hardcoded checks for the latin letters a-z and numbers, but tthose conditional branches may end up slower than just fetching from memory.

=== codepoint->properties
Codepoints can have a multiple of properties. UnicodeData.txt uses 34 distinct properties.
We ignore 2 of those properties and map the rest into a 32-bit bitmask. One 32-bit bitmask per codepoint.
Optimal map type: full map, 32 bit per entry.

Additional, testing if a codepoint can be in a word/identifier (a-z, 0-9, æ, greek letters, ...) is the most used lookup so we make a specialized map for that: full map, 1-byte boolean per entry. It may be worth squeezing down to 1 bit per codepoint.

=== upper/lowercase
We only need a mapping to lowercase. Lowercase only applies to a tiny fraction (~1300) of the codepoints, We currently only need the mapping to lowercase so we only dump that.
Optimal map type: sparse map, format:
	<codepoint>
	<lowercase codepoint count>
	[<codepoint>...]

=== Decomposition
A subset of the codepoints can be decomposed into (generally) <normal grapheme> + <diacritic...>
The subset is relatively small (~5700)
Optimal map type: sparse map, format:
	<codepoint>
	<decomposition codepoint count>
	[<decomposition codepoint>...]