readXML {tm}    R Documentation

Read In an XML Document

Description

Returns a function that reads in an XML document. The structure of the XML document is described by a specification.

Usage

readXML(spec, doc, ...)

Arguments

spec a named list of lists, each containing two elements. The constructed reader maps each list entry to the slot or metadata entry with the same name. Valid names include .Data to access the document's content, any valid slot name, and any other character string, which is mapped to a LocalMetaData entry.
Each list entry must consist of two elements: the first is a character string giving the type of the second, and the second is the specification entry itself. Valid combinations are:
type = "node", spec = "XPathExpression":
the XPath expression spec extracts information from an XML node.
type = "attribute", spec = "XPathExpression":
the XPath expression spec extracts information from an attribute of an XML node.
type = "function", spec = function(tree) ...:
the function spec is called with a tree representation (as delivered by xmlInternalTreeParse from package XML) of the XML document as its first argument.
type = "unevaluated", spec = "String":
the character vector spec is returned without modification.
doc an (empty) document of some subclass of TextDocument
... arguments for the generator function.
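
The four type/spec combinations above can be illustrated with a small dispatcher. This is only a sketch, not the package's actual implementation: the helper name extract_entry and the sample XML string are hypothetical, and the block assumes that package XML is installed (it does nothing otherwise).

```r
# Hypothetical helper (not part of tm): resolve one spec entry against a
# parsed XML tree, following the four type/spec combinations.
extract_entry <- function(tree, entry) {
  type <- entry[[1]]
  value <- entry[[2]]
  switch(type,
    # XPath selecting nodes: return their text content
    node = as.character(sapply(XML::getNodeSet(tree, value), XML::xmlValue)),
    # XPath selecting attributes: return the attribute values
    attribute = as.character(unlist(XML::getNodeSet(tree, value))),
    # user-supplied function: call it with the tree as first argument
    "function" = value(tree),
    # unevaluated: return the value without modification
    unevaluated = value,
    stop("unknown spec type: ", type))
}

if (requireNamespace("XML", quietly = TRUE)) {
  # Made-up sample document, loosely shaped like a Reuters-21578 entry
  xml_text <- '<REUTERS NEWID="1"><TEXT><TITLE>Sample title</TITLE></TEXT></REUTERS>'
  tree <- XML::xmlInternalTreeParse(xml_text, asText = TRUE)

  heading <- extract_entry(tree, list("node", "/REUTERS/TEXT/TITLE"))
  id <- extract_entry(tree, list("attribute", "/REUTERS/@NEWID"))
  origin <- extract_entry(tree, list("unevaluated", "Reuters-21578 XML"))
  n_nodes <- extract_entry(tree, list("function", function(tree)
    length(XML::getNodeSet(tree, "//*"))))
}
```

Keeping the type tag separate from the value lets one spec mix plain XPath lookups with arbitrary functions for fields that need post-processing, such as dates.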

Details

Formally, this function is a function generator, i.e., it returns a function (which reads in a text document) with a well-defined signature, but which can still access the arguments passed to the generator (e.g., the specification) via lexical scoping.
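
The generator pattern described here can be sketched in base R as follows. The names make_reader and the shape of the returned list are hypothetical and chosen for illustration only; they do not reproduce what readXML actually builds.

```r
# Hypothetical generator (illustration only, not tm's readXML)
make_reader <- function(spec, ...) {
  # The returned function has a fixed signature, yet it can still see
  # `spec` because it was created inside make_reader (lexical scoping).
  function(elem, language, load, id) {
    list(id = id,
         language = language,
         fields = names(spec),
         content = elem$content)
  }
}

reader <- make_reader(spec = list(Heading = list("unevaluated", "n/a")))
result <- reader(elem = list(content = "<doc/>", uri = NULL),
                 language = "en", load = TRUE, id = "doc-1")
str(result)
```

The specification is supplied once when the reader is created, so every later call to the reader only needs the per-document arguments.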

Value

A function with the signature elem, language, load, id:

elem A list with the two named elements content and uri. The first element must hold the document to be read in; the second element must hold a call that extracts this document. The call is evaluated when the document is loaded on demand.
language A character vector giving the text's language.
load A logical value indicating whether the document should be loaded into memory immediately.
id A character vector representing a unique identification string for the returned text document.


The returned function yields doc augmented with the information parsed from the XML file as described by spec.

Author(s)

Ingo Feinerer

See Also

Vignette 'Extensions: How to Handle Custom File Formats'.

Use getReaders to list available reader functions.

Examples

## Not run: 
readReut21578XML <- readXML(
  spec = list(Author = list("node", "/REUTERS/TEXT/AUTHOR"),
              DateTimeStamp = list("function", function(node)
                strptime(sapply(XML::getNodeSet(node, "/REUTERS/DATE"), XML::xmlValue),
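                         # NOTE: the format string was lost in this copy of the page;
                         # the value below is an assumption matching Reuters-21578 dates.
                         format = "%d-%B-%Y %H:%M:%S",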
                         tz = "GMT")),
              Description = list("unevaluated", ""),
              Heading = list("node", "/REUTERS/TEXT/TITLE"),
              ID = list("attribute", "/REUTERS/@NEWID"),
              Origin = list("unevaluated", "Reuters-21578 XML"),
              Topics = list("node", "/REUTERS/TOPICS/D")),
  doc = new("Reuters21578Document"))
## End(Not run)

[Package tm version 0.3-4.1 Index]