readPDF {tm}    R Documentation

Read In a PDF Document

Description

Returns a function that reads in a Portable Document Format (PDF) document, extracting both its text and its metadata.

Usage

readPDF(...)

Arguments

... Arguments for the generator function.

Details

Formally, this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature. The returned function can still access the arguments passed to the generator via lexical scoping. This is especially useful for reader functions for complex data structures which need a lot of configuration options.
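The generator pattern described above can be sketched in plain R. This is a minimal illustration, not tm's implementation: the name makeReader, the encoding option, and the returned list are all hypothetical and stand in for the real reader and its PlainTextDocument result.

```r
# Hypothetical generator (not part of tm): it only illustrates how a returned
# function keeps access to the generator's arguments via lexical scoping.
makeReader <- function(...) {
    opts <- list(...)  # captured in the enclosing environment of the reader
    # The returned function has the fixed signature elem, language, load, id.
    function(elem, language, load, id) {
        # A plain list stands in for a real document object here.
        list(content = elem$content, language = language, id = id,
             options = opts)
    }
}

reader <- makeReader(encoding = "UTF-8")
doc <- reader(elem = list(content = "some text", uri = NULL),
              language = "en_US", load = TRUE, id = "id1")
doc$options$encoding
```

The configuration is supplied once, when the reader is generated, so every reader can be called with the same four arguments regardless of how many options it needs.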

Note that this PDF reader needs both the tools pdftotext and pdfinfo installed and accessible on your system.
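One way to check for the two tools before calling the reader is base R's Sys.which, which looks up executables on the search path. This is a sketch of a pre-flight check, not something readPDF does itself. The variable names are illustrative, and the empty-string test assumes the usual "not found" indication, which can differ between operating systems.

```r
# Look up both command-line tools on the search path.
tools <- c("pdftotext", "pdfinfo")
found <- Sys.which(tools)

# Sys.which typically returns "" for executables it cannot find.
notFound <- tools[!nzchar(found)]

if (length(notFound) > 0) {
    message("Not found on the search path: ",
            paste(notFound, collapse = ", "))
} else {
    message("Both pdftotext and pdfinfo appear to be available.")
}
```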

Value

A function with the signature elem, language, load, id:

elem A list with the two named elements content and uri. The first element must hold the document to be read in; the second element must hold a call to extract this document. The call is evaluated upon a request for load on demand.
language A character vector giving the text's language.
load A logical value indicating whether the document corpus should be immediately loaded into memory.
id A character vector representing a unique identification string for the returned text document.


The function returns a PlainTextDocument representing the text and metadata in content.
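The idea of storing a call in uri and evaluating it only when loading is requested can be illustrated with base R alone. The sketch below is an assumption about the general mechanism, not a description of tm's internals. It uses a temporary text file rather than a PDF, and the variable names are illustrative.

```r
# Create a small text file to stand in for a document on disk.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)

# At top level, substitute() returns the call unevaluated: nothing is opened yet.
uri <- substitute(file(tmp))
is.call(uri)

# Evaluating the stored call later creates the connection on demand.
con <- eval(uri)
text <- readLines(con)
close(con)
text
```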

Author(s)

Ingo Feinerer

See Also

Use getReaders to list available reader functions.

Examples

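# Locate the sample PDF shipped with tm
f <- system.file("texts", "pdf", "pdfarchiving.pdf", package = "tm")
# Calling the generator returns the reader function
readPDF()
pdf <- readPDF()(elem = list(uri = substitute(file(f))),
                 language = "en_US", load = TRUE, id = "id1")
# Inspect the extracted metadata
meta(pdf)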

[Package tm version 0.3-3 Index]