To build this software from the source code, you will need libLBFGS, just as for the original CRFSuite. (You don't need libLBFGS to use the Windows binaries, as they are statically linked; skip ahead to the usage section.) Here are the typical steps on a Unix-compatible platform.
First, download the source code package from the libLBFGS page and unpack it to a temporary directory. Run 'configure' with the '--prefix' option to designate the install path, then 'make' and 'make install':
[/path/where/you/extracted/libLBFGS/] ./configure --prefix=/install/path/of/libLBFGS/
[/path/where/you/extracted/libLBFGS/] make
[/path/where/you/extracted/libLBFGS/] make install
Next, download the CRFSuite for variable-order Markov models package (crfsuite-variableorder) from the GitHub repository and unpack it to a temporary directory. Run 'configure' with the '--with-liblbfgs' option pointing to the libLBFGS install path above and the '--prefix' option designating the install path for crfsuite-variableorder, then 'make' and 'make install':
[/path/where/you/extracted/crfsuite-variableorder/] ./configure --with-liblbfgs=/install/path/of/libLBFGS/ --prefix=/install/path/of/crfsuite-variableorder/
[/path/where/you/extracted/crfsuite-variableorder/] make
[/path/where/you/extracted/crfsuite-variableorder/] make install
The basic usage is similar to the original CRFSuite. The main difference is that crfsuite-variableorder does not auto-generate features and instead reads them from a file specified by the -f or --features option. Here is the format used to describe a feature of order (n-1).
ATTR[tab]LABEL0[tab]LABEL1[tab]...[tab]LABELn-1[newline]
The labels are ordered as follows: LABEL0 is the label at the current position, LABEL1 the label at the previous position, and so on. For example, a feature that is activated when the label sequence up to the current position is NN-VBZ-IN and the word at the current position is "like" is described as follows:
W0_like[tab]IN[tab]VBZ[tab]NN[newline]
where "W0_" is an arbitrary prefix.
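As an illustration, here is a minimal Python sketch that writes feature lines in this format (the helper name write_feature and the output file name are arbitrary choices for this example, not part of crfsuite-variableorder):

# Minimal sketch: write feature lines as ATTR[tab]LABEL0[tab]...[tab]LABELn-1.
def write_feature(out, attr, labels):
    # labels run backwards from the current position: LABEL0, LABEL1, ...
    out.write("\t".join([attr] + list(labels)) + "\n")

with open("features.txt", "w") as out:
    # The NN-VBZ-IN / "like" example above; IN is the current label.
    write_feature(out, "W0_like", ["IN", "VBZ", "NN"])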
The format of the training/evaluation data files is the same as in the original CRFSuite. Please refer to the CRFSuite manual.
It is not realistic to write the features, training, and evaluation files by hand, so a script is usually needed to generate them. The source and binary packages contain an "example" directory with a Python script named "conv.py". The script was originally written for English POS tagging and should be customized for other kinds of tasks.
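As a rough illustration of what such a generator does (the helper names below are hypothetical, and conv.py's actual attribute set is much richer), here is a sketch that writes data lines in the original CRFSuite format, one token per line with a blank line between sequences:

# Hypothetical sketch of a data-file generator; conv.py's real attribute set
# (context words, word pairs, CONTAIN_UPPER, etc.) is not reproduced here.
def token_attributes(words, i):
    # Current word plus prefixes/suffixes up to length 3, for illustration only.
    w = words[i]
    attrs = ["W0_" + w]
    attrs += ["pre%d_%s" % (k, w[:k]) for k in range(1, 4) if k <= len(w)]
    attrs += ["suf%d_%s" % (k, w[-k:]) for k in range(1, 4) if k <= len(w)]
    return attrs

def write_sequence(out, words, labels):
    # One token per line: LABEL followed by its attributes, tab-separated;
    # an empty line marks the end of the sequence (original CRFSuite format).
    for i, label in enumerate(labels):
        out.write("\t".join([label] + token_attributes(words, i)) + "\n")
    out.write("\n")

with open("train_data.txt", "w") as out:
    write_sequence(out, ["Pierre", "Vinken", ","], ["NNP", "NNP", ","])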
If you have Penn Treebank 3, you can reproduce the experiments described in Hiroshi Manabe's thesis:
Open mrg_to_pos.py under "example" and modify the part that contains the path to the "mrg" directory. Run the script to convert the .mrg files into text files, which should look like:
NNP Pierre NNP Vinken , , CD 61 NNS years JJ old , , ...
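A rough sketch of such a conversion (this is not the actual contents of mrg_to_pos.py) might look like the following, which pulls the (POS, word) leaf pairs out of the bracketed .mrg parses and prints one sentence per line:

# Hypothetical sketch: extract "POS word POS word ..." lines from .mrg files
# given on the command line; -NONE- traces are skipped here by assumption.
import re
import sys

LEAF = re.compile(r"\(([^()\s]+) ([^()\s]+)\)")

def split_trees(text):
    # Split a .mrg file into top-level bracketed trees by tracking depth.
    trees, depth, start = [], 0, 0
    for i, ch in enumerate(text):
        if ch == "(":
            if depth == 0:
                start = i
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0:
                trees.append(text[start:i + 1])
    return trees

for path in sys.argv[1:]:
    with open(path) as f:
        for tree in split_trees(f.read()):
            pairs = [(p, w) for p, w in LEAF.findall(tree) if p != "-NONE-"]
            print(" ".join("%s %s" % (p, w) for p, w in pairs))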
train.txt and test.txt contain the data from sections 0-18 and 22-24, respectively. Execute "conv.py" to generate the same dataset as the "maximum 2nd order" setting in the thesis (the number of features will not be exactly the same because of changes in end-of-sequence processing). The features, training data, and evaluation data are written to "features.txt", "train_data.txt", and "test_data.txt", respectively. features.txt should look as follows:
W-2-1_of_crop .
W-2-1_of_crop NN
W0+1_still-raging_bidding VBG
W-2-10_did_a_story NN
W-2-10_David_Boren_-LRB- (
W-3-2-1__``_Five VBD
train_data.txt/test_data.txt should look as follows:
NNP LABEL W0_Pierre W-1_ W+1_Vinken W-10__Pierre W0+1_Pierre_Vinken W-2-1__ W-2-10___Pierre W-3-2-1___ suf1_e pre1_P suf2_re pre2_Pi suf3_rre pre3_Pie suf4_erre pre4_Pier suf5_ierre pre5_Pierr suf6_Pierre pre6_Pierre CONTAIN_UPPER
NNP LABEL W0_Vinken W-1_Pierre W+1_, W-10_Pierre_Vinken W0+1_Vinken_, W-2-1__Pierre W-2-10__Pierre_Vinken W-3-2-1___Pierre suf1_n pre1_V suf2_en pre2_Vi suf3_ken pre3_Vin suf4_nken pre4_Vink suf5_inken pre5_Vinke suf6_Vinken pre6_Vinken CONTAIN_UPPER
, LABEL W0_, W-1_Vinken W+1_61 W-10_Vinken_, W0+1_,_61 W-2-1_Pierre_Vinken W-2-10_Pierre_Vinken_, W-3-2-1__Pierre_Vinken suf1_, pre1_,
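Assuming the tab-separated layouts described above, a small sanity check like the following (the helper name summarize is arbitrary) can be run on the generated files before training:

# Quick check of a generated data file: count sequences, tokens, and distinct
# attributes, assuming "LABEL<tab>attr1<tab>attr2..." lines with blank lines
# between sequences.
def summarize(path):
    seqs, tokens, attrs = 0, 0, set()
    in_seq = False
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                in_seq = False
                continue
            if not in_seq:
                seqs += 1
                in_seq = True
            tokens += 1
            attrs.update(line.split("\t")[1:])
    return seqs, tokens, len(attrs)

for name in ("train_data.txt", "test_data.txt"):
    print(name, summarize(name))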
To train the model, input the following line:
$ crfsuite learn -m wsj.model -t test_data.txt -f features.txt train_data.txt
The program consumes 20 to 30 gigabytes of memory. The implementation keeps all of the data in memory rather than swapping it out during sequential access, so substantial memory is required. The model will be written to wsj.model.
To tag sentences using the model, input the following line:
$ crfsuite tag -m wsj.model test_data.txt
To evaluate the tagging performance, input:
$ crfsuite tag -m wsj.model -qt test_data.txt
CRFSuite for variable-order Markov models shares most of its options with the original CRFSuite. Please refer to the CRFSuite manual.
Here is a list of the differences.