privacore-open-source-searc.../sto
2018-07-06 14:06:28 +02:00
.gitignore Refined .gitignore 2018-02-12 13:31:03 +01:00
convert_sto.sh More sto changes for tree-of-files 2018-06-15 17:48:12 +02:00
dump_sto.cpp word variations: look for other entries with same morphological unit id (kafeteater/cafétater) 2017-12-06 16:45:27 +01:00
load_sto.cpp Improved performance of STO loading (approx 30% faster now) 2017-12-15 14:28:51 +01:00
Makefile Improved performance of STO loading (approx 30% faster now) 2017-12-15 14:28:51 +01:00
README Added STO subdir with tools and classes 2017-11-21 13:08:12 +01:00
sto_convert.py Collect warnings at end 2018-06-18 00:49:49 +02:00
sto_structure.txt word variations: look for other entries with same morphological unit id (kafeteater/cafétater) 2017-12-06 16:45:27 +01:00
sto.cpp lemma: generate lemmas for proper nouns too (normally genetive-case -> unmarked-case) 2018-07-06 14:06:28 +02:00
sto.h STO: Added LexicalEntry::find_base_wordform() method 2018-05-25 12:33:38 +02:00

This directory contains GB-specific tools and libraries to process the STO database.

The STO database is a lexicon made by the Center for Sprogteknologi (CST), part
of the University of Copenhagen. Web site: http://cst.ku.dk/

The GB distribution does not contain the STO database, and none of the files in
this directory were made by CST. You have to obtain the STO files yourself.

The source lexicon is too big (320MB) to use as-is in GB, so it is processed into
a more compact binary format that keeps the details we care about. The resulting
file can be accessed using the classes in sto.h.


To generate the binary lexicon:

1: Get hold of the STO XML files.
	The download link is difficult to find. You may have to email CST.

2: Run ./convert_sto.sh <directory where you unpacked the STO files> <target file>
	The tool runs through the XML files (STO_LMF_morphology_noun_a_jan2013.xml, STO_LMF_morphology_pronoun_jan2013.xml, ...). It warns about entries that seem incomplete (no recognized "feat" attribute values).

3: You now have a binary lexicon, approximately 14MB. You can use ./dump_sto to check the contents.
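
Before running step 2 it can help to check that the directory really holds the unpacked STO XML files. The following is a minimal sketch in Python, not part of the shipped tools. The helper names are hypothetical, and the file pattern is an assumption based on the two example file names in step 2. It only builds the convert_sto.sh command line and does not run it.

```python
import glob
import os

# Assumption: the input files follow the naming seen in step 2,
# e.g. STO_LMF_morphology_noun_a_jan2013.xml.
STO_XML_PATTERN = "STO_LMF_morphology_*.xml"


def find_sto_xml_files(sto_dir):
    """Return the sorted STO morphology XML files found directly in sto_dir."""
    return sorted(glob.glob(os.path.join(sto_dir, STO_XML_PATTERN)))


def build_convert_command(sto_dir, target_file):
    """Build the step-2 command line; raise if no STO XML files are found."""
    if not find_sto_xml_files(sto_dir):
        raise FileNotFoundError(
            "no files matching %s in %s" % (STO_XML_PATTERN, sto_dir))
    # Same argument order as in step 2: <STO directory> <target file>.
    return ["./convert_sto.sh", sto_dir, target_file]
```

The returned list can be passed to subprocess.run() from this directory, or printed and run by hand.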