Corpus {tm}    R Documentation

Corpus

Description

Constructs a text document collection (corpus).

Usage

## S4 method for signature 'Source':
Corpus(object,
       readerControl = list(reader = object@DefaultReader,
                            language = "en_US", load = TRUE),
       dbControl = list(useDb = FALSE, dbName = "", dbType = "DB1"),
       ...)

Arguments

object A Source object.
readerControl A list with the named components reader, language, and load. reader is a reading function capable of handling the file format found in object; language gives the text's language (preferably in ISO 639-1 format); load is a logical value indicating whether the documents should be loaded into memory immediately (load = TRUE) or only when needed (load = FALSE). Loading on demand reduces memory use for large document collections. If object does not support loading on demand, the documents are loaded immediately and this argument is overruled.
dbControl A list with the named components useDb, dbName, and dbType. useDb indicates whether database support should be activated; dbName gives the filename of the database that holds the objects moved out of memory; dbType is a valid database type as supported by package filehash. With database support activated, the tm package tries to keep as few resources in memory as possible by using the database.
... Optional arguments for the reader.
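
The two control lists can also be built once and reused across calls. The following is a minimal sketch, not part of tm: make_corpus_controls is a hypothetical helper, its defaults are copied from the Usage section, and identity is only a stand-in reader so that the sketch runs without tm being loaded.

```r
# Hypothetical helper (not part of tm): builds the readerControl and
# dbControl lists with the component names described in Arguments.
make_corpus_controls <- function(reader, language = "en_US", load = TRUE,
                                 use_db = FALSE, db_name = "",
                                 db_type = "DB1") {
  stopifnot(is.function(reader), is.logical(load), is.logical(use_db))
  list(
    # reader, language, and load as documented for readerControl
    readerControl = list(reader = reader, language = language, load = load),
    # useDb, dbName, and dbType as documented for dbControl
    dbControl = list(useDb = use_db, dbName = db_name, dbType = db_type)
  )
}

# identity is a stand-in reader (assumption); with tm loaded, pass a real
# reader such as readPlain instead.
ctl <- make_corpus_controls(reader = identity, use_db = TRUE,
                            db_name = "oviddb")
str(ctl$dbControl)

## Not run:
## txt <- system.file("texts", "txt", package = "tm")
## Corpus(DirSource(txt), readerControl = ctl$readerControl,
##        dbControl = ctl$dbControl)
## End(Not run)
```

Keeping the lists in one place makes it easier to switch between in-memory and database-backed corpora without retyping the component names.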

Value

An S4 object of class Corpus, which extends the class list and contains a collection of text documents.

Author(s)

Ingo Feinerer

Examples

txt <- system.file("texts", "txt", package = "tm")
## Not run: 
(Corpus(DirSource(txt),
        readerControl = list(reader = readPlain, language = "en_US",
                             load = TRUE),
        dbControl = list(useDb = TRUE, dbName = "oviddb", dbType = "DB1")))
## End(Not run)
reut21578 <- system.file("texts", "reut21578", package = "tm")
Corpus(DirSource(reut21578),
       readerControl = list(reader = readReut21578XML, language = "en_US",
                            load = FALSE))

[Package tm version 0.3-3 Index]