readHTML {tm}    R Documentation

Read In a Simple HTML Document

Description

Returns a function that reads in a simple HTML document, extracting both its text and its metadata. The reader treats h1 headings as structural information, whereas the text and tags between headings are treated as textual content. Metadata is extracted from the meta tags in the HTML head.
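
To make the split between structure, text, and metadata concrete, here is a minimal base R sketch. It is not the code used by readHTML: the sample document, the regular expressions, and the variable names are assumptions chosen for illustration, and the patterns only cope with very simple, attribute-free h1 tags.

```r
# Illustrative only: a regex-based approximation, not tm's actual parser.
txt <- paste0(
  "<html><head>",
  "<meta name=\"author\" content=\"A. Writer\">",
  "<meta name=\"date\" content=\"2008-01-01\">",
  "</head><body>",
  "<h1>Intro</h1>First part.",
  "<h1>Methods</h1>Second part.",
  "</body></html>"
)

# Metadata: one name/content pair per meta tag in the head
meta  <- regmatches(txt, gregexpr("<meta[^>]*>", txt))[[1]]
name  <- sub(".*name=\"([^\"]*)\".*", "\\1", meta)
value <- sub(".*content=\"([^\"]*)\".*", "\\1", meta)
metadata <- setNames(value, name)

# Structure: the h1 headings themselves
headings <- regmatches(txt, gregexpr("<h1>[^<]*</h1>", txt))[[1]]
headings <- gsub("</?h1>", "", headings)

# Text: whatever lies between the headings inside the body
body <- sub("(?s).*<body>(.*)</body>.*", "\\1", txt, perl = TRUE)
sections <- strsplit(body, "<h1>[^<]*</h1>")[[1]][-1]  # drop the empty leading piece
```

Here `headings` pairs up with `sections` position by position, and `metadata` is a named character vector keyed by the meta tag names.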

Usage

readHTML(...)

Arguments

... arguments for the generator function.

Details

Formally, this function is a function generator: it returns a function (which reads in a text document) with a well-defined signature, and that function can still access the arguments passed to the generator via lexical scoping. This is especially useful for reader functions for complex data structures that need many configuration options.
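
The generator pattern can be sketched in a few lines of base R. The names `makeReader` and `stripWhitespace` below are hypothetical and not part of tm; the sketch only shows how a returned function keeps a fixed signature while still seeing the generator's arguments.

```r
# Hypothetical generator, not part of tm: it only mimics the pattern.
makeReader <- function(...) {
  opts <- list(...)  # configuration captured once, at generation time
  function(elem, language, load, id) {
    # `opts` is resolved via lexical scoping, not via the argument list
    list(id = id,
         language = language,
         content = if (load) elem$content else NULL,
         options = opts)
  }
}

# Generate a reader with one (made-up) configuration option
reader <- makeReader(stripWhitespace = TRUE)

# Every generated reader is called the same way, whatever its configuration
doc <- reader(list(content = "<h1>Title</h1>Text", uri = NULL),
              language = "en", load = TRUE, id = "doc1")
```

Because the configuration lives in the closure, code that calls the reader never needs to know which options were supplied to the generator.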

Value

A function with the signature elem, language, load, id:

elem A list with the two named elements content and uri. The first element must hold the document to be read in, the second element must hold a call to extract this document. The call is evaluated upon a request for load on demand.
language A character vector giving the text's language.
load A logical value indicating whether the document corpus should be immediately loaded into memory.
id A character vector representing a unique identification string for the returned text document.


The function returns a StructuredTextDocument representing content.
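
As a rough illustration of the elem argument, the base R sketch below builds a list with the two named elements and evaluates the stored call only when the content is requested. The temporary file, the use of `file()` as the stored call, and the empty `content` slot before loading are assumptions made for this example, not details taken from tm's sources.

```r
# Illustrative only: how a `uri` call might be stored and evaluated later.
tmp <- tempfile(fileext = ".html")
writeLines(c("<html><head><meta name=\"author\" content=\"A. Writer\"></head>",
             "<body><h1>Intro</h1>Some text.</body></html>"), tmp)

elem <- list(content = NULL,                            # nothing loaded yet (assumption)
             uri = substitute(file(f), list(f = tmp)))  # unevaluated call

# Load on demand: evaluate the stored call only when the content is needed
con <- eval(elem$uri)
elem$content <- readLines(con)
close(con)
```

Until `eval()` runs, `elem$uri` is just a call object, so building many such elements does not read any file.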

Author(s)

Ingo Feinerer

See Also

Use getReaders to list available reader functions.

Examples

html <- system.file("texts", "html", package = "tm")
## Not run: (Corpus(DirSource(html), readerControl = list(reader = readHTML, load = TRUE)))

[Package tm version 0.3-3 Index]