termFreq {tm}    R Documentation
Term Frequency Vector
Description
Generate a term frequency vector from a text document.
Usage
termFreq(doc, control = list())
Arguments
doc
An object inheriting from TextDocument.
control
A list of control options. Possible settings are
tolower : A function converting characters to lower
case. Defaults to base::tolower.
tokenize : A function splitting a document into single
tokens. Defaults to function(x) unlist(strsplit(gsub("[^[:alnum:]]+", " ", x), " ", fixed = TRUE)).
removeNumbers : A logical value indicating whether
numbers should be removed from doc. Defaults to FALSE.
stemming : A logical value indicating whether tokens
should be stemmed. Defaults to FALSE.
stopwords : Either a logical value indicating stopword
removal using the default language-specific stopword lists shipped
with this package, or a character vector holding custom
stopwords. Defaults to FALSE.
dictionary : A character vector of terms to tabulate
against. Only these terms are listed in the result; dictionary
terms that do not occur in the document at all are omitted
for performance reasons. Defaults to no action (i.e., all
terms are considered).
minDocFreq : An integer value. Words that appear fewer
times in doc than this number are discarded. Defaults to
1 (i.e., every token will be used).
minWordLength : An integer value. Words with fewer characters
than this number are discarded. Defaults to 3.
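To illustrate how the control options above interact, here is a minimal base-R sketch of such a pipeline. It is not the package's implementation: the function sketch_term_freq and its argument names are hypothetical, the order of the steps is an assumption, and stemming is left out because it needs an external stemmer. Only the tokenizer expression is taken from the documented default.

```r
# Illustrative sketch only; NOT tm's implementation of termFreq.
# `sketch_term_freq` and its argument names are hypothetical.
sketch_term_freq <- function(text,
                             remove_numbers = FALSE,
                             stopwords = character(0),
                             dictionary = NULL,
                             min_doc_freq = 1L,
                             min_word_length = 3L) {
  text <- tolower(text)                          # cf. the tolower option
  if (remove_numbers) {
    text <- gsub("[[:digit:]]+", "", text)       # cf. removeNumbers
  }
  # Documented default tokenizer: collapse runs of non-alphanumeric
  # characters into one space, then split on single spaces.
  tokens <- unlist(strsplit(gsub("[^[:alnum:]]+", " ", text), " ", fixed = TRUE))
  tokens <- tokens[nchar(tokens) >= min_word_length]  # cf. minWordLength
  tokens <- tokens[!tokens %in% stopwords]            # custom stopword vector
  if (!is.null(dictionary)) {
    tokens <- tokens[tokens %in% dictionary]          # cf. dictionary
  }
  tab <- table(tokens)
  tf <- structure(as.integer(tab), names = names(tab))
  tf[tf >= min_doc_freq]                              # cf. minDocFreq
}

tf <- sketch_term_freq("Oil prices rose. OPEC said oil output fell in 1987.")
tf[["oil"]]   # "oil" occurs twice once the text is lower-cased
```

In this sketch "in" is dropped because it is shorter than three characters, which mirrors the documented default for minWordLength.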
Value
A named integer vector with term frequencies as values and tokens as
names.
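Because the result is an ordinary named integer vector, it can be handled with base R. The sketch below uses made-up counts (not output from any real document) shaped like the documented return value.

```r
# Hypothetical counts, shaped like the documented return value.
tf <- c(oil = 5L, opec = 2L, prices = 3L)

by_freq  <- sort(tf, decreasing = TRUE)  # terms ordered by frequency
oil_n    <- tf[["oil"]]                  # count for a single term
frequent <- names(tf)[tf >= 3L]          # terms occurring at least 3 times
total    <- sum(tf)                      # total number of counted tokens
```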
Examples
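library("tm")
data("crude")
# Term frequencies for the first document, using the default control options
termFreq(crude[[1]])
# Stem tokens and discard words shorter than four characters
termFreq(crude[[1]], control = list(stemming = TRUE, minWordLength = 4))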
[Package tm version 0.3-3 Index]