termFreq {tm}    R Documentation

Term Frequency Vector

Description

Generate a term frequency vector from a text document.

Usage

termFreq(doc, control = list())

Arguments

doc An object inheriting from TextDocument.
control A list of control options. Possible settings are:
  • tolower: A function converting characters to lower case. Defaults to base::tolower.
  • tokenize: A function splitting a document into single tokens. Defaults to function(x) unlist(strsplit(gsub("[^[:alnum:]]+", " ", x), " ", fixed = TRUE)).
  • removeNumbers: A logical value indicating whether numbers should be removed from doc. Defaults to FALSE.
  • stemming: A logical value indicating whether tokens should be stemmed. Defaults to FALSE.
  • stopwords: Either a logical value indicating whether stopwords should be removed using the default language-specific stopword lists shipped with this package, or a character vector holding custom stopwords.
  • dictionary: A character vector to be tabulated against. No other terms will be listed in the result. Defaults to no action (i.e., all terms are considered).
  • minDocFreq: An integer value. Words that appear fewer times in doc than this number are discarded. Defaults to 1 (i.e., every token is used).
  • minWordLength: An integer value. Words with fewer characters than this number are discarded. Defaults to 3.
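
The effect of several of these options can be approximated in base R without loading tm. The following is a minimal sketch, not the package's implementation: the input text is hypothetical, the order in which the steps are applied is an assumption, and only the tokenizer is taken from the documented default.

```r
# Hypothetical input text; it is not taken from the crude corpus.
txt <- "Oil prices rose. OPEC said oil output fell, prices up 3 percent."

# Default tokenizer as documented for the tokenize option.
tokenize <- function(x) {
  unlist(strsplit(gsub("[^[:alnum:]]+", " ", x), " ", fixed = TRUE))
}

# Assumed order of steps: lower-case, tokenize, then filter by length.
tokens <- tokenize(tolower(txt))
tokens <- tokens[nchar(tokens) >= 3]   # mimics minWordLength = 3

# Tabulate into a named integer vector, the shape described under Value.
counts <- table(tokens)
tf <- structure(as.integer(counts), names = names(counts))

# Mimics minDocFreq = 2: keep only terms that occur at least twice.
tf_frequent <- tf[tf >= 2]
print(tf_frequent)
```

In this sketch the tokens "up" and "3" are dropped by the length filter, and only "oil" and "prices" survive the frequency filter.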

Value

A named integer vector with term frequencies as values and tokens as names.
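
Because the result is an ordinary named integer vector, standard base R operations apply to it. The sketch below uses a hypothetical vector of that shape rather than actual termFreq output; the terms and counts are invented for illustration.

```r
# Hypothetical result, shaped like the return value described above.
tf <- c(crude = 2L, oil = 5L, prices = 3L, said = 1L)

# Most frequent terms first.
ranked <- sort(tf, decreasing = TRUE)
print(ranked)

# Frequency of a single term, looked up by name.
oil_count <- tf[["oil"]]

# Terms occurring at least three times.
common_terms <- names(tf)[tf >= 3L]
```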

Examples

data("crude")
termFreq(crude[[1]])
termFreq(crude[[1]], control = list(stemming = TRUE, minWordLength = 4))

[Package tm version 0.3-1 Index]