textcat_options {textcat}    R Documentation

Textcat Options

Description

Get and set options used for n-gram based text categorization.

Usage

textcat_options(option, value)

Arguments

option    A character string indicating the option to get or set (see Details). If missing, all options are returned as a list.
value     The value to be set. If omitted, the current value of the given option is returned.

Details

Currently, the following options are available:

n:
The maximum number of characters in the n-grams used for the profiles.

Default: 5L.

split:
The regular expression pattern to be used for word splitting.

Default: "[[:space:][:punct:][:digit:]]+".

tolower:
A logical indicating whether to transform texts to lower case (after word splitting).

Default: TRUE.

reduce:
A logical indicating whether a representation of n-grams more efficient than the one used by Cavnar and Trenkle should be employed.

Default: TRUE.

useBytes:
A logical indicating whether to use byte n-grams rather than character n-grams.

Default: FALSE.

ignore:
A character vector of n-grams to be ignored when computing n-gram profiles.

Default: "_" (corresponding to a word boundary).

size:
The maximum number of n-grams used for a profile.

Default: 1000L.

method:
A character string or function specifying a method for computing distances between n-gram profiles (see textcat).

Default: "CT", giving the Cavnar-Trenkle out-of-place measure.
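
To make the effect of split, tolower and useBytes more concrete, the following sketch applies the documented default values by hand using base R only. It is an illustration under stated assumptions, not the code path used internally by textcat or tau; the sample strings and variable names are made up for this example.

```r
# Default word-splitting pattern as documented for the 'split' option.
split_pattern <- "[[:space:][:punct:][:digit:]]+"

# Hypothetical sample text, chosen only for illustration.
text <- "Hello, World! 42 n-grams"

# Split into words first, then lower-case (the documented order for 'tolower').
words <- strsplit(text, split_pattern)[[1]]
words <- tolower(words)
print(words)

# 'useBytes' chooses between byte n-grams and character n-grams.
# For non-ASCII text the two units differ: a UTF-8 string can have
# more bytes than characters.
x <- "na\u00efve"
n_chars <- nchar(x, type = "chars")
n_bytes <- nchar(x, type = "bytes")
print(c(chars = n_chars, bytes = n_bytes))
```

Whitespace, punctuation and digits all act as separators under the default pattern, so "n-grams" is split into two words and "42" disappears entirely.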

See Also

textcat_profile_db for how the first six options (n, split, tolower, reduce, useBytes and ignore) are used when computing n-gram profiles.

textcnt in package tau, which provides the term and pattern counting functionality for text documents that textcat employs.
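
Examples

A minimal sketch of getting and setting options, based only on the Usage and Arguments sections above. It assumes the textcat package is installed; the value 3L is an arbitrary choice for illustration.

```r
library(textcat)  # assumption: the textcat package is installed

# With no arguments, all options are returned as a list.
all_opts <- textcat_options()
names(all_opts)

# With only 'option' given, the current value of that option is returned.
old_n <- textcat_options("n")
old_n

# With both 'option' and 'value' given, the option is set.
textcat_options("n", 3L)
textcat_options("n")

# Restore the previous value so later calls are unaffected.
textcat_options("n", old_n)
```

Saving the old value before changing an option and restoring it afterwards keeps the change local to the code that needs it.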


[Package textcat version 0.0-1 Index]