In order to build this software from the source code, you will need libLBFGS, as is also the case for the original CRFSuite (You don't need libLBFGS to use the Windows binary as they are statically linked. Skip to the usage). Here are typical steps to take on a Unix-compatible platforms.
First, download the source code package from the libLBFGS page and unpack it to a temporary directory. Execute 'configure' with a '--prefix' option to designate the install path. 'make' & 'make install'.
[/path/where/you/extracted/libLBFGS/] ./configure --prefix=/install/path/of/libLBFGS/ [/path/where/you/extracted/libLBFGS/] make [/path/where/you/extracted/libLBFGS/] make install
Next, download the package of CRFSuite for variable-order Markov models (crfsuite-variableorder) from the github repository and unpack it to a temporary directory. Execute 'configure' with a '--with-liblbfgs' to designate the libLBFGS install path above and a '--prefix' option to designate the install path for crfsuite-variableorder. 'make' & 'make install'.
[/path/where/you/extracted/crfsuite-variableorder/] ./configure --with-liblbfgs=/install/path/of/libLBFGS/ --prefix=/install/path/of/crfsuite-variableorder/ [/path/where/you/extracted/crfsuite-variableorder/] make [/path/where/you/extracted/crfsuite-variableorder/] make install
The basic usage is common as the original CRFSuite. The difference lies that crfsuite-variableorder does not auto-generate features and instead read them from a file, which should be specified by -f or --features option. Here is the format to describe a feature of (n-1)th order.
ATTR[tab]LABEL0[tab]LABEL1[tab]...LABELn[newline]The order is : LABEL0 for the current position, LABEL1 for the previous position etc. For example, a feature activated when the label sequence to the current position is NN-VBZ-IN and the word at the current position is "like" should be described as follows:
W0_like[tab]IN[tab]VBZ[tab]NN[newline]where "W0_" is an arbitrary prefix.
The format of training/evaluation data file is the same as the original CRFSuite. Please refer to the CRFSuite manual.
It is not realistic to manually write the features/training/evaluation files, so you will need a script to generate them. The source and binary packages contain an "example" directory, under which you can find a python script "conv.py". I wrote this script for the English POS-tagging task. You should customize the script in order to apply to other kind of tasks.
If you have Penn Treebank 3, you can reproduce the experiences in my thesis:
Open mrg_to_pos.py under "example" and modify the part that contains the path to the "mrg" directory. Run the script to convert the .mrg files into text files, which should look like:
NNP Pierre NNP Vinken , , CD 61 NNS years JJ old , , ...
train.txt and test.txt respectively contain data in 0-18 and 22-24. Execute "conv.py" to generate the same dataset as in "maximum 2nd order" in my thesis (the number of the features won't be exactly the same because of the changes in end-of-sequence processing. Features/training/evaluation data should be output respectively as "features.txt", "train_data.txt", "test_data.txt". features.txt should look like as follows:
W-2-1_of_crop . W-2-1_of_crop NN W0+1_still-raging_bidding VBG W-2-10_did_a_story NN W-2-10_David_Boren_-LRB- ( W-3-2-1__``_Five VBDtrain_data.txt/test_data.txt should look like as follows:
NNP LABEL W0_Pierre W-1_ W+1_Vinken W-10__Pierre W0+1_Pierre_Vinken W-2-1__ W-2-10___Pierre W-3-2-1___ suf1_e pre1_P suf2_re pre2_Pi suf3_rre pre3_Pie suf4_erre pre4_Pier suf5_ierre pre5_Pierr suf6_Pierre pre6_Pierre CONTAIN_UPPER NNP LABEL W0_Vinken W-1_Pierre W+1_, W-10_Pierre_Vinken W0+1_Vinken_, W-2-1__Pierre W-2-10__Pierre_Vinken W-3-2-1___Pierre suf1_n pre1_V suf2_en pre2_Vi suf3_ken pre3_Vin suf4_nken pre4_Vink suf5_inken pre5_Vinke suf6_Vinken pre6_Vinken CONTAIN_UPPER , LABEL W0_, W-1_Vinken W+1_61 W-10_Vinken_, W0+1_,_61 W-2-1_Pierre_Vinken W-2-10_Pierre_Vinken_, W-3-2-1__Pierre_Vinken suf1_, pre1_,
To train the model, input the following line:
The program consumes 20 to 30 gigabytes (it could have been programmed to swap out the data as it is accessed only sequentially, but I chose not to, because of the lack of time). The model will be output as wsj.model.
$ crfsuite learn -m wsj.model -t test_data.txt -f features.txt train_data.txt
To tag sentences using the model, input the following line:
$ crfsuite tag -m wsj.model test.txt
To evaluate the tagging performance, input to
$ crfsuite tag -m wsj.model -qt test.txt
CRFSuite for variable-order Markov model shares most of the options with the original CRFSuite. Please refer to the CRFSuite manual.
Here is the list of the differences.