gapweightkernel {stringkernels}    R Documentation

Gap-weighted string kernel

Description

Creates a kernel object for the gap-weighted string kernel. This kernel uses words as tokens by default. Use in conjunction with kernlab.
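As a rough illustration of use with kernlab (a sketch under assumptions: the documents are passed as a list, ksvm accepts the kernel object via its kernel argument, and the toy corpus and labels below are made up for illustration):

    library(kernlab)
    library(stringkernels)
    # hypothetical toy corpus and labels, for illustration only
    docs <- list("good movie", "great film", "bad movie", "awful film")
    labels <- factor(c("pos", "pos", "neg", "neg"))
    # train an SVM using the gap-weighted string kernel
    model <- ksvm(docs, labels, kernel = gapweightkernel())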

Usage

gapweightkernel(length = 2, lambda = 0.75, normalized = TRUE, 
    tokenizer = openNLP::tokenize, use_characters = FALSE)

Arguments

length Match length (excluding gaps)
lambda Gap length penalty factor
normalized Normalize kernel values (default: TRUE)
tokenizer String tokenizer function. By default, this uses openNLP's tokenize to split the text into words, but users may specify their own function (see the sketch after this list). Ignored if use_characters is TRUE.
use_characters Split texts by character rather than by word.
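
A custom tokenizer can be any function that takes a single character string and returns a character vector of tokens (an assumption about the tokenizer contract; the whitespace-only tokenizer below is hypothetical and not part of the package):

    # hypothetical tokenizer: split on runs of whitespace only
    ws_tokenize <- function(x) unlist(strsplit(x, "\\s+"))
    gwk_ws <- gapweightkernel(tokenizer = ws_tokenize)
    gwk_ws("the fat cat", "the fat dog")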

Details

This kernel generation function returns a kernel that counts the gapped (non-contiguous) matches of length tokens between two strings. Gaps are penalized by the factor lambda: each match is assigned a weight of lambda^(L - l), where L is the total span of the match (including gaps) and l is length.
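
To make the weighting concrete, a small worked example of the stated formula (illustrative arithmetic only, not the package's internal algorithm):

    lambda <- 0.75
    l <- 2            # match length in tokens (the length argument)
    L <- 5            # total span of the match, including gaps
    lambda^(L - l)    # 0.421875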

By default, this kernel uses words (and punctuation marks) rather than characters as atomic tokens. This usually yields better results than gapped character matching.
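
As an informal comparison of the two tokenization modes (a sketch; the exact values depend on the tokenizer, length, and normalization):

    s <- "The cat was chased by the fat dog"
    t <- "The fat cat bit the dog"
    gapweightkernel()(s, t)                        # word tokens (default)
    gapweightkernel(use_characters = TRUE)(s, t)   # character tokens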

This implementation is based on the gapped substring kernel of Rousu and Shawe-Taylor (2005). Note that the algorithm is optimized for large alphabets, such as word vocabularies.

Value

An S4 object of class stringKernelEx.

Author(s)

Martin Kober
martin.kober@gmail.com

References

Juho Rousu and John Shawe-Taylor. Efficient computation of gapped substring kernels on large alphabets. Journal of Machine Learning Research, 6:1323-1344, 2005.

See Also

multigapweightkernel

Examples


s = "The cat was chased by the fat dog"
t = "The fat cat bit the dog"
gwk = gapweightkernel()
gwk(s,t)

gwk2 = gapweightkernel(length=4, normalized=FALSE)
gwk2(s,t)

gwk3 = gapweightkernel(lambda=1, normalized=FALSE)
gwk3(s,t)

