termFreq {tm}                R Documentation

Term Frequency Vector

Description

Generate a term frequency vector from a text document.

Usage

termFreq(doc, control = list())

Arguments

doc      an object inheriting from TextDocument.
control  a list of control options. Possible settings are:
  • tolower: a function converting characters to lower case. Defaults to base::tolower.
  • tokenize: a function splitting a document into single tokens. Defaults to function(x) unlist(strsplit(gsub("[^[:alnum:]]+", " ", x), " ", fixed = TRUE)).
  • removeNumbers: a Boolean value indicating whether numbers should be removed from doc.
  • stemming: a Boolean value indicating whether tokens should be stemmed. Defaults to FALSE.
  • stopwords: either a Boolean value indicating stopword removal using the default language-specific stopword lists shipped with this package, or a character vector holding custom stopwords.
  • dictionary: a character vector to be tabulated against. No other terms will be listed in the result. Defaults to no action (i.e., all terms are considered).
  • minDocFreq: an integer value. Words that appear less often in doc than this number are discarded. Defaults to 1 (i.e., every token will be used).
  • minWordLength: an integer value. Words with fewer characters than this number are discarded. Defaults to 3.
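
To make the tokenize and minWordLength options more concrete, the following base R sketch applies the default tokenizer listed above (with its parentheses balanced) to a sample string and then drops short tokens. The sample text, the variable names, and the order of the steps are assumptions made for illustration; this is not the internal implementation of termFreq.

```r
# Default tokenizer as given in the 'tokenize' entry, with balanced parentheses
tokenize <- function(x) unlist(strsplit(gsub("[^[:alnum:]]+", " ", x), " ", fixed = TRUE))

# Hypothetical sample text, not taken from any tm corpus
txt <- "Crude oil prices rose 3% on Monday; oil traders were surprised."

# Lower-case first, then tokenize (this ordering is an assumption)
tokens <- tokenize(tolower(txt))

# Keep only tokens with at least min_word_length characters,
# mirroring the documented default of minWordLength = 3
min_word_length <- 3
kept <- tokens[nchar(tokens) >= min_word_length]

print(tokens)
print(kept)
```

Note that the default tokenizer splits on a single space, so a string that begins with a non-alphanumeric character would yield an empty first token; a custom tokenize function could filter such tokens out.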

Value

A named integer vector with term frequencies as values and tokens as names.
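
As a rough illustration of the shape of the return value, the base R sketch below builds a named integer vector from a hypothetical token vector. It only mirrors the structure described above and assumes nothing about how termFreq computes its result internally.

```r
# Hypothetical tokens standing in for a tokenized document
tokens <- c("oil", "prices", "oil", "crude", "oil", "prices")

# Tabulate the tokens, then convert the table to a named integer vector
counts <- table(tokens)
tf <- structure(as.integer(counts), names = names(counts))

print(tf)

# Individual frequencies can be looked up by token name
tf[["oil"]]
```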

Examples

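data("crude")
# Term frequencies of the first document with the default control options
termFreq(crude[[1]])
# Stem tokens and discard words shorter than four characters
termFreq(crude[[1]], control = list(stemming = TRUE, minWordLength = 4))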

[Package tm version 0.3 Index]