readHTML {tm}    R Documentation

Read In a Simple HTML Document

Description

Returns a function that reads in a simple HTML document, extracting both its text and its metadata. The reader treats h1 headings as structure information, whereas text and tags between headings are treated as textual content. Metadata is extracted from meta tags in the HTML head.
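The split between metadata, headings, and text described above can be illustrated with a rough base R sketch. It is not how tm parses HTML internally: the sample page, the variable names, and the regular expressions are all assumptions made for illustration, and they only handle very regular markup.

```r
# Hypothetical sample page; not taken from the tm package.
html <- c("<html><head>",
          "<meta name=\"author\" content=\"Jane Doe\">",
          "</head><body>",
          "<h1>Intro</h1><p>First part.</p>",
          "<h1>Methods</h1><p>Second part.</p>",
          "</body></html>")
page <- paste(html, collapse = "")

# Metadata: name/content pairs taken from the meta tags.
metaTags   <- regmatches(page, gregexpr("<meta[^>]*>", page))[[1]]
metaNames  <- sub(".*name=\"([^\"]*)\".*", "\\1", metaTags)
metaValues <- sub(".*content=\"([^\"]*)\".*", "\\1", metaTags)
meta <- setNames(metaValues, metaNames)

# Structure: the h1 headings found in the body.
body <- sub(".*<body>(.*)</body>.*", "\\1", page)
headings <- regmatches(body, gregexpr("<h1>[^<]*</h1>", body))[[1]]
headings <- gsub("</?h1>", "", headings)

# Text: whatever lies between the headings, with remaining tags stripped.
sections <- strsplit(body, "<h1>[^<]*</h1>")[[1]]
sections <- sections[nzchar(sections)]
sections <- gsub("<[^>]+>", "", sections)
```

Here meta holds the author entry, headings holds the two h1 titles, and sections holds the text that follows each heading.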

Usage

readHTML(...)

Arguments

... arguments for the generator function.

Details

Formally, this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but which can still access the arguments passed to the generator via lexical scoping. This is especially useful for reader functions for complex data structures that need many configuration options.
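The generator pattern can be sketched in plain R as follows. This is only an illustration of closures and lexical scoping: makeReader, its stripWhitespace option, and the returned list are hypothetical and are not part of tm.

```r
# Hypothetical generator; it mimics the pattern, not tm's actual code.
makeReader <- function(...) {
    opts <- list(...)  # configuration captured in the enclosing environment
    function(elem, language, load, id) {
        # The reader has a fixed signature but can still see 'opts'.
        list(id = id,
             language = language,
             content = if (load) elem$content else NULL,
             options = opts)
    }
}

reader <- makeReader(stripWhitespace = TRUE)
doc <- reader(list(content = "<h1>Title</h1>", uri = NULL),
              language = "en_US", load = TRUE, id = "doc1")
doc$options$stripWhitespace  # TRUE: the option reached the reader via the closure
```

The caller only ever sees the four-argument reader, while any number of configuration options can be fixed once when the generator is called.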

Value

A function with the signature elem, language, load, id:

elem A list with the two named elements content and uri. The first element must hold the document corpus to be read in, the second element must hold a call to the document corpus. The call is evaluated upon a request for load on demand.
language A character giving the text's language.
load A logical value indicating whether the document corpus should be immediately loaded into memory.
id A character representing a unique identification string for the returned text document.


The function returns a StructuredTextDocument representing content.

Author(s)

Ingo Feinerer

Examples

html <- system.file("texts", "html", package = "tm")
## Not run: 
(Corpus(DirSource(html), readerControl = list(reader = readHTML, load = TRUE)))
## End(Not run)

[Package tm version 0.3-1 Index]