termFreq {tm} | R Documentation |
Term Frequency Vector
Description
Generate a term frequency vector from a text document.
Usage
termFreq(doc, control = list())
Arguments
doc
An object inheriting from TextDocument.
control
A list of control options. Possible settings are
tolower : A function converting characters to lower
case. Defaults to base::tolower.
tokenize : A function splitting a document into single
tokens. Defaults to function(x) unlist(strsplit(gsub("[^[:alnum:]]+", " ", x), " ", fixed = TRUE)).
removeNumbers : A logical value indicating whether
numbers should be removed from doc. Defaults to FALSE.
stemming : A logical value indicating whether tokens
should be stemmed. Defaults to FALSE.
stopwords : Either a logical value indicating whether
stopwords are removed using the default language-specific stopword
lists shipped with this package, or a character vector holding
custom stopwords.
dictionary : A character vector to be tabulated
against. No other terms will be listed in the result. Defaults to
no action (i.e., all terms are considered).
minDocFreq : An integer value. Words that occur fewer
times in doc than this number are discarded. Defaults to
1 (i.e., every token will be used).
minWordLength : An integer value. Words with fewer
characters than this number are discarded. Defaults to 3.
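The interplay of some of these options can be sketched in base R. The helper sketch_term_freq and its arguments are hypothetical and not part of tm. The order of the steps is an assumption, and only lower-casing, the default tokenizer, word length filtering, and custom stopwords are covered.

```r
# Hypothetical helper, NOT the tm implementation: mimics a few control
# options with base R only. The order of the steps is an assumption.
sketch_term_freq <- function(text, min_word_length = 3,
                             stopwords = character(0)) {
  # Lower-case the text, as the default `tolower` setting would.
  text <- tolower(text)
  # Default tokenizer from the documentation: replace runs of
  # non-alphanumeric characters by a space, then split on spaces.
  tokens <- unlist(strsplit(gsub("[^[:alnum:]]+", " ", text), " ",
                            fixed = TRUE))
  # Discard tokens shorter than min_word_length (this also drops "").
  tokens <- tokens[nchar(tokens) >= min_word_length]
  # Discard custom stopwords.
  tokens <- tokens[!tokens %in% stopwords]
  # Tabulate and return a named integer vector.
  counts <- table(tokens)
  structure(as.integer(counts), names = names(counts))
}

tf <- sketch_term_freq("Crude oil prices rose; oil output fell.",
                       stopwords = "fell")
tf
```

Here "oil" is counted twice, "fell" is dropped as a stopword, and every remaining token passes the length filter.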
Value
A named integer vector with term frequencies as values and tokens as
names.
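Because the result is an ordinary named integer vector, the usual base R subsetting and sorting tools apply. The following sketch uses a hand-built vector as a stand-in for a termFreq result; the terms and counts are made up for illustration.

```r
# Hand-built stand-in for a termFreq() result (illustrative values only).
tf <- c(crude = 2L, oil = 5L, opec = 3L, prices = 1L)

# Frequency of a single term, looked up by name.
oil_count <- tf[["oil"]]

# Terms occurring at least three times.
frequent <- names(tf)[tf >= 3L]

# Most frequent terms first.
ranked <- sort(tf, decreasing = TRUE)
ranked
```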
Examples
data("crude")
termFreq(crude[[1]])
termFreq(crude[[1]], control = list(stemming = TRUE, minWordLength = 4))
[Package tm version 0.3-1 Index]