gapweightkernel {stringkernels}		R Documentation

Gap-Weighted String Kernel

Description:

Creates a kernel object for the gap-weighted string kernel. This kernel uses words as tokens by default. Use in conjunction with kernlab.

Usage:

gapweightkernel(length = 2, lambda = 0.75, normalized = TRUE,
                tokenizer = openNLP::tokenize, use_characters = FALSE)
Arguments:

length
    Match length (excluding gaps).

lambda
    Gap length penalty factor.

normalized
    Normalize kernel values (default: TRUE).

tokenizer
    String tokenizer function. By default, this uses openNLP's tokenize to
    split the text into words, but users may specify their own function.
    Ignored if use_characters is TRUE.

use_characters
    Split texts by character rather than by word.
Details:

This kernel generation function returns a kernel that computes the number of
gapped (non-contiguous) matches of length matching tokens between two
strings. Gaps are penalized by the factor lambda, i.e., each match is
assigned a weight of lambda ^ (L - l), where L is the total span of the
match (including gaps) and l is length.
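As a minimal sketch of the weighting (plain arithmetic, not package code): the two tokens "the dog" appear contiguously in one string but as "the fat dog" in the other, so the match of l = 2 tokens spans L = 3 positions and receives weight lambda ^ (3 - 2):

```r
## Illustrative weight computation only; values chosen to match the
## default lambda of gapweightkernel().
lambda <- 0.75
L <- 3  # total span of the match, including the gap token "fat"
l <- 2  # number of matching tokens
lambda^(L - l)  # 0.75; a gap-free match (L == l) would weigh 1
```

With lambda = 1 gaps are not penalized at all, and smaller lambda values discount widely spread matches more aggressively.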
By default, this kernel uses words (and punctuation marks) rather than
characters as atomic tokens. This usually yields better results than gapped
character matching.

This implementation is based on the gapped substring kernel by Rousu and
Shawe-Taylor. Note that this algorithm is optimized for large alphabets,
usually consisting of words.
Value:

An S4 object of class stringKernelEx.
Author(s):

Martin Kober <martin.kober@gmail.com>
References:

Juho Rousu and John Shawe-Taylor. Efficient computation of gapped substring
kernels on large alphabets. Journal of Machine Learning Research,
6:1323-1344, 2005.
Examples:

s <- "The cat was chased by the fat dog"
t <- "The fat cat bit the dog"

gwk <- gapweightkernel()
gwk(s, t)

gwk2 <- gapweightkernel(length = 4, normalized = FALSE)
gwk2(s, t)

gwk3 <- gapweightkernel(lambda = 1, normalized = FALSE)
gwk3(s, t)
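Since the description recommends using this kernel with kernlab, the following hedged sketch shows one plausible way to do so. The documents and labels are invented for illustration, and it assumes the returned stringKernelEx object is accepted by kernlab's kernelMatrix() and that the resulting Gram matrix can be passed to ksvm(), as with kernlab's built-in kernels:

```r
## Illustrative only: assumes stringkernels and kernlab are installed and
## that kernelMatrix() accepts the kernel object returned here.
library(kernlab)
library(stringkernels)

docs <- list("The cat was chased by the fat dog",
             "The fat cat bit the dog",
             "Stock markets fell sharply today",
             "Investors sold shares as markets fell")
labels <- factor(c("pets", "pets", "finance", "finance"))

gwk <- gapweightkernel(length = 2, lambda = 0.5)
K <- kernelMatrix(gwk, docs)              # pairwise Gram matrix
model <- ksvm(K, labels, type = "C-svc")  # train on the precomputed matrix
```

Training on a precomputed kernel matrix avoids recomputing string matches inside the SVM solver, which matters when the corpus is large.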