readPDF {tm} | R Documentation |
Returns a function which reads in a Portable Document Format (PDF) document, extracting both its text and its meta data.
readPDF(PdfinfoOptions = "", PdftotextOptions = "", ...)
PdfinfoOptions |
options passed to pdfinfo. |
PdftotextOptions |
options passed to pdftotext. |
... |
arguments for the generator function. |
Formally this function is a function generator, i.e., it returns a
function (which reads in a text document) with a well-defined
signature, but which can access the arguments passed to the generator
(e.g., options to pdfinfo or pdftotext) via lexical scoping.
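The generator pattern described above can be illustrated with a minimal sketch in plain R. The names make_reader and reader are hypothetical and are not part of tm; the returned function only echoes its inputs so that the captured argument is visible, and it does not read any PDF.

```r
# Hypothetical generator, not part of tm: it only mirrors the pattern.
make_reader <- function(options = "") {
  # 'options' is captured by the returned function via lexical scoping.
  function(elem, language, load, id) {
    list(id = id,
         language = language,
         loaded = load,
         content = elem$content,
         options = options)
  }
}

# Create a reader with fixed options, then call it with the usual signature.
reader <- make_reader(options = "-layout")
doc <- reader(elem = list(content = "some text", uri = NULL),
              language = "en_US", load = TRUE, id = "id1")
doc$options
```

The returned function never receives options as an argument; it finds the value in the environment in which it was created.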
Note that this PDF reader needs both the tools pdftotext and pdfinfo
installed and accessible on your system.
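One way to check this requirement before calling the reader is to look the tools up on the search path with the base R function Sys.which. This is a sketch of a pre-flight check, not something readPDF is documented to do itself; the variable names are illustrative.

```r
# Look up both external tools; Sys.which returns "" for a tool it cannot find.
tools <- Sys.which(c("pdftotext", "pdfinfo"))
missing_tools <- names(tools)[tools == ""]

if (length(missing_tools) > 0) {
  message("Not found on the search path: ",
          paste(missing_tools, collapse = ", "))
} else {
  message("Both pdftotext and pdfinfo were found.")
}
```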
A function with the signature elem, language, load, id:
elem |
A list with the two named elements content and uri. The first element must hold the document to
be read in, the second element must hold a call to extract this
document. The call is evaluated upon a request for load on demand. |
language |
A character vector giving the text's language. |
load |
A logical value indicating whether the document
corpus should be immediately loaded into memory. |
id |
A character vector representing a unique identification
string for the returned text document. |
The function returns a PlainTextDocument representing the text
and meta data in content.
Ingo Feinerer
Use getReaders to list available reader functions.
f <- system.file("texts", "pdf", "pdfarchiving.pdf", package = "tm")
pdf <- readPDF(PdftotextOptions = "-layout")(elem = list(uri = substitute(file(f))),
                                             load = TRUE,
                                             language = "en_US",
                                             id = "id1")
meta(pdf)