
Installation and usage of CRFSuite for variable-order Markov models

Installation

In order to build this software from source, you will need libLBFGS, as is also the case for the original CRFSuite. (You don't need libLBFGS to use the Windows binaries, as they are statically linked; skip to the Usage section.) Here are typical steps on a Unix-compatible platform.

First, download the source code package from the libLBFGS page and unpack it to a temporary directory. Execute 'configure' with a '--prefix' option to designate the install path, then run 'make' and 'make install'.


[/path/where/you/extracted/libLBFGS/] ./configure --prefix=/install/path/of/libLBFGS/
[/path/where/you/extracted/libLBFGS/] make
[/path/where/you/extracted/libLBFGS/] make install

Next, download the package of CRFSuite for variable-order Markov models (crfsuite-variableorder) from the GitHub repository and unpack it to a temporary directory. Execute 'configure' with a '--with-liblbfgs' option designating the libLBFGS install path above and a '--prefix' option designating the install path for crfsuite-variableorder, then run 'make' and 'make install'.


[/path/where/you/extracted/crfsuite-variableorder/] ./configure --with-liblbfgs=/install/path/of/libLBFGS/ --prefix=/install/path/of/crfsuite-variableorder/
[/path/where/you/extracted/crfsuite-variableorder/] make
[/path/where/you/extracted/crfsuite-variableorder/] make install

Usage

The basic usage is similar to the original CRFSuite. The main difference is that crfsuite-variableorder does not auto-generate features and instead reads them from a file specified by the -f or --features option. A feature of order n is described in the following format:

ATTR[tab]LABEL0[tab]LABEL1[tab]...[tab]LABELn[newline]

LABEL0 is the label at the current position, LABEL1 the label at the previous position, and so on. For example, a feature activated when the label sequence up to the current position is NN-VBZ-IN and the word at the current position is "like" is described as:

W0_like[tab]IN[tab]VBZ[tab]NN[newline]

where "W0_" is an arbitrary prefix.
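The tab-separated layout above can be produced with a small helper. This is an illustrative sketch; the function name is hypothetical and not part of the package:

```python
def feature_line(attr, labels):
    """Build one feature line: the attribute, then the labels from the
    current position backwards, joined by tabs (hypothetical helper)."""
    return "\t".join([attr] + list(labels))

# A 2nd-order feature: label sequence NN-VBZ-IN ending at the current
# position, where the current word is "like".
line = feature_line("W0_like", ["IN", "VBZ", "NN"])
print(line)  # attribute and three labels, separated by tab characters
```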

The format of the training/evaluation data files is the same as in the original CRFSuite; please refer to the CRFSuite manual.

It is not realistic to write the feature, training, and evaluation files by hand, so a script is usually needed to generate them. The source and binary packages contain an "example" directory with a Python script named "conv.py". The script was originally written for English POS tagging and should be customized for other tasks.

If you have Penn Treebank 3, you can reproduce the experiments described in Hiroshi Manabe's thesis:

Open mrg_to_pos.py under "example" and modify the part that contains the path to the "mrg" directory. Run the script to convert the .mrg files into text files, which should look like:

NNP	Pierre
NNP	Vinken
,	,
CD	61
NNS	years
JJ	old
,	,
...
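A file in this TAG-tab-WORD layout can be read back with a few lines of Python. This is a sketch, not part of the package, and it assumes sentences are separated by blank lines; adjust it if your converted files differ:

```python
def read_tagged(lines):
    """Parse TAG<tab>WORD lines into a list of sentences, where each
    sentence is a list of (tag, word) pairs. Assumes sentences are
    separated by blank lines (an assumption about the converted files,
    not a documented guarantee)."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        tag, word = line.split("\t", 1)
        current.append((tag, word))
    if current:
        sentences.append(current)
    return sentences

# Typical use: read_tagged(open("train.txt", encoding="utf-8"))
```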

train.txt and test.txt contain the data from sections 00-18 and 22-24, respectively. Execute "conv.py" to generate the same dataset as the "maximum 2nd order" setting in the thesis (the number of features will not be exactly the same because of changes in end-of-sequence processing). The features, training data, and evaluation data are written to "features.txt", "train_data.txt", and "test_data.txt", respectively. features.txt should look as follows:

W-2-1_of_crop	.
W-2-1_of_crop	NN
W0+1_still-raging_bidding	VBG
W-2-10_did_a_story	NN
W-2-10_David_Boren_-LRB-	(
W-3-2-1__``_Five	VBD

train_data.txt/test_data.txt should look as follows:

NNP	LABEL	W0_Pierre	W-1_	W+1_Vinken	W-10__Pierre	W0+1_Pierre_Vinken	W-2-1__	W-2-10___Pierre	W-3-2-1___	suf1_e	pre1_P	suf2_re	pre2_Pi	suf3_rre	pre3_Pie	suf4_erre	pre4_Pier	suf5_ierre	pre5_Pierr	suf6_Pierre	pre6_Pierre	CONTAIN_UPPER
NNP	LABEL	W0_Vinken	W-1_Pierre	W+1_,	W-10_Pierre_Vinken	W0+1_Vinken_,	W-2-1__Pierre	W-2-10__Pierre_Vinken	W-3-2-1___Pierre	suf1_n	pre1_V	suf2_en	pre2_Vi	suf3_ken	pre3_Vin	suf4_nken	pre4_Vink	suf5_inken	pre5_Vinke	suf6_Vinken	pre6_Vinken	CONTAIN_UPPER
,	LABEL	W0_,	W-1_Vinken	W+1_61	W-10_Vinken_,	W0+1_,_61	W-2-1_Pierre_Vinken	W-2-10_Pierre_Vinken_,	W-3-2-1__Pierre_Vinken	suf1_,	pre1_,
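As a rough illustration of how the per-token attribute columns above might be generated (conv.py is the actual implementation; this sketch is an approximation that omits the multi-word context attributes such as W-2-1_ and W-3-2-1_):

```python
def token_attrs(words, i):
    """Sketch of per-token attributes in the style shown above:
    surrounding words, suffixes/prefixes up to length 6, and an
    uppercase indicator. Illustrative only, not the conv.py logic."""
    w = words[i]
    prev_w = words[i - 1] if i > 0 else ""
    next_w = words[i + 1] if i + 1 < len(words) else ""
    attrs = [
        "W0_" + w,
        "W-1_" + prev_w,
        "W+1_" + next_w,
        "W-10_%s_%s" % (prev_w, w),
        "W0+1_%s_%s" % (w, next_w),
    ]
    # Suffixes and prefixes up to length 6 (capped at the word length).
    for n in range(1, min(len(w), 6) + 1):
        attrs.append("suf%d_%s" % (n, w[-n:]))
        attrs.append("pre%d_%s" % (n, w[:n]))
    if any(c.isupper() for c in w):
        attrs.append("CONTAIN_UPPER")
    return attrs
```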

To train the model, input the following line:


$ crfsuite learn -m wsj.model -t test_data.txt -f features.txt train_data.txt

The program consumes 20 to 30 gigabytes of memory: the implementation keeps the data in memory rather than swapping it out, even though access is sequential, so substantial memory is required. The model will be written to wsj.model.

To tag sentences using the model, input the following line:


$ crfsuite tag -m wsj.model test_data.txt

To evaluate the tagging performance, input:


$ crfsuite tag -m wsj.model -qt test_data.txt

Detailed usage

CRFSuite for variable-order Markov models shares most of its options with the original CRFSuite. Please refer to the CRFSuite manual.

Here is a list of the differences.