readHTML {tm} | R Documentation
Returns a function which reads in a simple HTML document, extracting both its text and its metadata. The reader uses h1 headings as structure information, whereas text and tags between headings are treated as textual information. Metadata is extracted from meta tags in the HTML head.
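The following is a rough base R sketch of that division of labour, not the parser that tm itself uses. The sample lines and all variable names are made up for illustration, and the line-based regular expressions assume that each tag sits on its own line.

```r
# Illustrative input: one meta tag, two h1 headings, text in between.
lines <- c('<meta name="author" content="Jane Doe">',
           "<h1>Intro</h1>",
           "<p>First paragraph.</p>",
           "<h1>Methods</h1>",
           "<p>Second paragraph.</p>",
           "<p>Third.</p>")

is_heading <- grepl("^<h1>", lines)
is_meta    <- grepl("^<meta ", lines)

# h1 headings provide the structure.
headings <- sub("^<h1>(.*)</h1>$", "\\1", lines[is_heading])

# Everything between headings (tags included) is kept as text,
# grouped by the heading it follows.
section <- cumsum(is_heading)
keep    <- !is_heading & !is_meta
text    <- split(lines[keep], section[keep])
names(text) <- headings

# Metadata comes from the meta tag.
author <- sub('.*content="([^"]*)".*', "\\1", lines[is_meta])
```

Real HTML is rarely this regular, so a proper HTML parser is the safer choice outside of a toy example.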
readHTML(...)
...       Arguments for the generator function.
Formally, this function is a function generator: it returns a function (which reads in a text document) with a well-defined signature, and the returned function can still access the arguments passed to the generator via lexical scoping. This is especially useful for reader functions for complex data structures that need many configuration options.
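A minimal sketch of the generator pattern in plain R may help. The names make_reader and sep are hypothetical and not part of tm, and the returned list is only a stand-in for a real text document object.

```r
# Hypothetical generator, not part of tm.
make_reader <- function(sep = "\n", ...) {
  # 'sep' is a configuration option captured by the returned
  # function through lexical scoping.
  function(elem, language, load, id) {
    list(id = id,
         language = language,
         content = if (load) paste(elem$content, collapse = sep) else NULL)
  }
}

# Configure once, then call the reader with the fixed signature.
reader <- make_reader(sep = " ")
doc <- reader(list(content = c("Hello", "world"), uri = NULL),
              language = "en", load = TRUE, id = "doc1")
```

The caller only ever sees the four-argument signature; any configuration stays inside the closure.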
A function with the signature elem, language, load, id:
elem      A list with the two named elements content and uri. The first element must hold the document to be read in; the second element must hold a call to extract this document. The call is evaluated upon a request for load on demand.
language  A character vector giving the text's language.
load      A logical value indicating whether the document corpus should be immediately loaded into memory.
id        A character vector representing a unique identification string for the returned text document.
The function returns a StructuredTextDocument representing content.
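To illustrate how the content and uri elements of elem could interact with load, here is a small sketch in base R. It assumes one plausible reading of load on demand; resolve_content is a hypothetical helper, not a tm function, and the temporary file stands in for a real document.

```r
# Hypothetical helper, not part of tm.
resolve_content <- function(elem, load) {
  if (!is.null(elem$content)) return(elem$content)  # already in memory
  if (load) eval(elem$uri) else NULL                # evaluate the stored call on request
}

# Write a small stand-in document to a temporary file.
path <- tempfile(fileext = ".html")
writeLines(c("<h1>Title</h1>", "<p>Body text</p>"), path)

# bquote() embeds the path in the call so it can be evaluated later.
elem <- list(content = NULL, uri = bquote(readLines(.(path))))

deferred <- resolve_content(elem, load = FALSE)  # nothing is read yet
loaded   <- resolve_content(elem, load = TRUE)   # the call is evaluated now
```

Storing an unevaluated call rather than the file contents keeps the element cheap to create until the document is actually needed.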
Ingo Feinerer
Use getReaders to list available reader functions.
html <- system.file("texts", "html", package = "tm")
## Not run:
(Corpus(DirSource(html),
        readerControl = list(reader = readHTML, load = TRUE)))