
Installation and usage of CRFSuite for variable-order Markov models

Installation

In order to build this software from source, you will need libLBFGS, as is also the case for the original CRFSuite. (You don't need libLBFGS to use the Windows binaries, as they are statically linked; skip to the Usage section.) Here are typical steps on a Unix-compatible platform.

First, download the source code package from the libLBFGS page and unpack it to a temporary directory. Execute 'configure' with a '--prefix' option to designate the install path, then run 'make' and 'make install'.


[/path/where/you/extracted/libLBFGS/] ./configure --prefix=/install/path/of/libLBFGS/
[/path/where/you/extracted/libLBFGS/] make
[/path/where/you/extracted/libLBFGS/] make install

Next, download the package of CRFSuite for variable-order Markov models (crfsuite-variableorder) from the GitHub repository and unpack it to a temporary directory. Execute 'configure' with a '--with-liblbfgs' option designating the libLBFGS install path above and a '--prefix' option designating the install path for crfsuite-variableorder, then run 'make' and 'make install'.


[/path/where/you/extracted/crfsuite-variableorder/] ./configure --with-liblbfgs=/install/path/of/libLBFGS/ --prefix=/install/path/of/crfsuite-variableorder/
[/path/where/you/extracted/crfsuite-variableorder/] make
[/path/where/you/extracted/crfsuite-variableorder/] make install

Usage

The basic usage is similar to the original CRFSuite. The main difference is that crfsuite-variableorder does not auto-generate features and instead reads them from a file specified by the -f or --features option. A feature of order n is described in the following format:

ATTR[tab]LABEL0[tab]LABEL1[tab]...[tab]LABELn[newline]

LABEL0 is the label at the current position, LABEL1 the label at the previous position, and so on. For example, a feature activated when the label sequence up to the current position is NN-VBZ-IN and the word at the current position is "like" is described as:

W0_like[tab]IN[tab]VBZ[tab]NN[newline]

where "W0_" is an arbitrary prefix.
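The tab-separated layout above can be produced with a small helper. This is an illustrative sketch; the function name is hypothetical and not part of the package:

```python
def feature_line(attr, labels):
    """Build one feature line: the attribute, then the labels from the
    current position backwards, joined by tabs (hypothetical helper)."""
    return "\t".join([attr] + list(labels))

# A 2nd-order feature: label sequence NN-VBZ-IN ending at the current
# position, where the current word is "like".
line = feature_line("W0_like", ["IN", "VBZ", "NN"])
print(line)  # attribute and three labels, separated by tab characters
```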

The format of the training/evaluation data files is the same as in the original CRFSuite; please refer to the CRFSuite manual.

It is not realistic to write the feature, training, and evaluation files by hand, so a script is usually needed to generate them. The source and binary packages contain an "example" directory with a Python script named "conv.py". The script was originally written for English POS tagging and should be customized for other tasks.

If you have Penn Treebank 3, you can reproduce the experiments described in Hiroshi Manabe's thesis:

Open mrg_to_pos.py under "example" and modify the part that contains the path to the "mrg" directory. Run the script to convert the .mrg files into text files, which should look like:

NNP	Pierre
NNP	Vinken
,	,
CD	61
NNS	years
JJ	old
,	,
...
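A file in this TAG-tab-WORD layout can be read back with a few lines of Python. This is a sketch, not part of the package, and it assumes sentences are separated by blank lines; adjust it if your converted files differ:

```python
def read_tagged(lines):
    """Parse TAG<tab>WORD lines into a list of sentences, where each
    sentence is a list of (tag, word) pairs. Assumes sentences are
    separated by blank lines (an assumption about the converted files,
    not a documented guarantee)."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        tag, word = line.split("\t", 1)
        current.append((tag, word))
    if current:
        sentences.append(current)
    return sentences

# Typical use: read_tagged(open("train.txt", encoding="utf-8"))
```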

train.txt and test.txt contain the data from sections 00-18 and 22-24, respectively. Execute "conv.py" to generate the same dataset as the "maximum 2nd order" setting in the thesis (the number of features will not be exactly the same because of changes in end-of-sequence processing). The features, training data, and evaluation data are written to "features.txt", "train_data.txt", and "test_data.txt", respectively. features.txt should look as follows:

W-2-1_of_crop	.
W-2-1_of_crop	NN
W0+1_still-raging_bidding	VBG
W-2-10_did_a_story	NN
W-2-10_David_Boren_-LRB-	(
W-3-2-1__``_Five	VBD

train_data.txt/test_data.txt should look as follows:

NNP	LABEL	W0_Pierre	W-1_	W+1_Vinken	W-10__Pierre	W0+1_Pierre_Vinken	W-2-1__	W-2-10___Pierre	W-3-2-1___	suf1_e	pre1_P	suf2_re	pre2_Pi	suf3_rre	pre3_Pie	suf4_erre	pre4_Pier	suf5_ierre	pre5_Pierr	suf6_Pierre	pre6_Pierre	CONTAIN_UPPER
NNP	LABEL	W0_Vinken	W-1_Pierre	W+1_,	W-10_Pierre_Vinken	W0+1_Vinken_,	W-2-1__Pierre	W-2-10__Pierre_Vinken	W-3-2-1___Pierre	suf1_n	pre1_V	suf2_en	pre2_Vi	suf3_ken	pre3_Vin	suf4_nken	pre4_Vink	suf5_inken	pre5_Vinke	suf6_Vinken	pre6_Vinken	CONTAIN_UPPER
,	LABEL	W0_,	W-1_Vinken	W+1_61	W-10_Vinken_,	W0+1_,_61	W-2-1_Pierre_Vinken	W-2-10_Pierre_Vinken_,	W-3-2-1__Pierre_Vinken	suf1_,	pre1_,
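As a rough illustration of how the per-token attribute columns above might be generated (conv.py is the actual implementation; this sketch is an approximation that omits the multi-word context attributes such as W-2-1_ and W-3-2-1_):

```python
def token_attrs(words, i):
    """Sketch of per-token attributes in the style shown above:
    surrounding words, suffixes/prefixes up to length 6, and an
    uppercase indicator. Illustrative only, not the conv.py logic."""
    w = words[i]
    prev_w = words[i - 1] if i > 0 else ""
    next_w = words[i + 1] if i + 1 < len(words) else ""
    attrs = [
        "W0_" + w,
        "W-1_" + prev_w,
        "W+1_" + next_w,
        "W-10_%s_%s" % (prev_w, w),
        "W0+1_%s_%s" % (w, next_w),
    ]
    # Suffixes and prefixes up to length 6 (capped at the word length).
    for n in range(1, min(len(w), 6) + 1):
        attrs.append("suf%d_%s" % (n, w[-n:]))
        attrs.append("pre%d_%s" % (n, w[:n]))
    if any(c.isupper() for c in w):
        attrs.append("CONTAIN_UPPER")
    return attrs
```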

To train the model, input the following line:


$ crfsuite learn -m wsj.model -t test_data.txt -f features.txt train_data.txt

The program consumes 20 to 30 gigabytes of memory: the implementation keeps the data in memory rather than swapping it out, even though access is sequential, so substantial memory is required. The model will be written to wsj.model.

To tag sentences using the model, input the following line:


$ crfsuite tag -m wsj.model test_data.txt

To evaluate the tagging performance, input:


$ crfsuite tag -m wsj.model -qt test_data.txt

Detailed usage

CRFSuite for variable-order Markov models shares most of its options with the original CRFSuite. Please refer to the CRFSuite manual.

Here is a list of the differences.