termFreq {tm}    R Documentation

Term Frequency Vector

Description

Generate a term frequency vector from a text document.

Usage

termFreq(doc, control = list())

Arguments

doc An object inheriting from TextDocument.
control A list of control options. Possible settings are
  • tolower: A function converting characters to lower case. Defaults to base::tolower.
  • tokenize: A function splitting a document into single tokens. Defaults to function(x) unlist(strsplit(gsub("[^[:alnum:]]+", " ", x), " ", fixed = TRUE)).
  • removeNumbers: A logical value indicating whether numbers should be removed from doc. Defaults to FALSE.
  • stemming: A logical value indicating whether tokens should be stemmed. Defaults to FALSE.
  • stopwords: Either a logical value indicating whether stopwords should be removed using the default language-specific stopword lists shipped with this package, or a character vector holding custom stopwords. Defaults to FALSE.
  • dictionary: A character vector to be tabulated against. No other terms will be listed in the result. Terms from the dictionary not occurring in the document at all will be skipped for performance reasons. Defaults to no action (i.e., all terms are considered).
  • minDocFreq: An integer value. Words that appear fewer than this many times in doc are discarded. Defaults to 1 (i.e., every token will be used).
  • minWordLength: An integer value. Words with fewer characters than this number are discarded. Defaults to 3.
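
The effect of the default settings can be approximated in base R. The following is only a minimal sketch, not the tm implementation: the sample text is made up, the order in which the steps are applied is an assumption, and stemming, stopword removal and dictionary lookup are left out.

```r
# Illustrative sketch in base R; this is NOT how tm implements termFreq.
text <- "Crude oil prices rose; oil demand rose 3 percent."  # made-up sample text

# Default tokenizer as documented above: replace every run of
# non-alphanumeric characters by a space, then split on single spaces.
tokenize <- function(x)
  unlist(strsplit(gsub("[^[:alnum:]]+", " ", x), " ", fixed = TRUE))

tokens <- tokenize(tolower(text))      # lower-case first (assumed order), then tokenize
tokens <- tokens[nchar(tokens) >= 3]   # mimic minWordLength = 3
tf <- table(tokens)                    # count how often each token occurs
tf <- tf[tf >= 1]                      # mimic minDocFreq = 1 (keeps every token)

# Convert the table to a named integer vector.
tf_vec <- setNames(as.integer(tf), names(tf))
print(tf_vec)
```

Raising the thresholds in the two filter lines shows how minWordLength and minDocFreq shrink the result.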

Value

A named integer vector with term frequencies as values and tokens as names.
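
A named integer vector can be queried with ordinary base R indexing. The sketch below uses a hand-written, hypothetical vector in place of a real termFreq result; the terms and counts are invented for illustration.

```r
# Hypothetical stand-in for a termFreq result (values invented for illustration).
tf <- c(crude = 1L, oil = 2L, prices = 1L, rose = 2L)

oil_count <- tf[["oil"]]                  # frequency of a single term
frequent  <- names(tf)[tf >= 2L]          # terms occurring at least twice
ranked    <- sort(tf, decreasing = TRUE)  # most frequent terms first

print(oil_count)
print(frequent)
print(ranked)
```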

Examples

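library("tm")
data("crude")
# Term frequencies of the first document, using the default control settings
termFreq(crude[[1]])
# Stem the tokens and raise the minimum word length to 4
termFreq(crude[[1]], control = list(stemming = TRUE, minWordLength = 4))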

[Package tm version 0.3-3 Index]