readXML {tm} | R Documentation |
Returns a function which reads in an XML document. The structure of the XML document can be described with a so-called specification.
readXML(spec, doc, ...)
spec |
a named list of lists, each containing two elements. The constructed reader maps each list entry to the slot or metadatum named by that entry. Valid names include .Data to access the document's content, any valid slot name, and character strings, which are mapped to LocalMetaData entries.
Each list entry must consist of two elements: the first is a character vector giving the type of the second, and the second is the specification entry itself. The types used in the example on this page are "node" and "attribute" (the specification entry is an XPath expression), "function" (the specification entry is a function that receives the parsed document), and "unevaluated" (the specification entry is a character vector used as-is). |
doc |
an (empty) document of some subclass of TextDocument |
... |
arguments for the generator function. |
Formally, this function is a function generator: it returns a function (which reads in a text document) with a well-defined signature, and the returned function can still access the arguments passed to the generator (e.g., the specification) via lexical scoping.
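The function-generator pattern described above can be illustrated in base R. This is a minimal sketch, not tm's actual implementation: make_reader, the toy spec, and the plain list standing in for a TextDocument are all hypothetical, and only the "unevaluated" and "function" entry types are mimicked.

```r
# Hypothetical generator: returns a reader that closes over `spec` and `doc`.
make_reader <- function(spec, doc, ...) {
  # The returned function has the fixed signature described on this page.
  # `load` is accepted but ignored in this sketch.
  function(elem, language, load, id) {
    result <- doc
    result$content <- elem$content
    result$language <- language
    result$id <- id
    # `spec` is not an argument of this inner function; it is found
    # in the enclosing environment via lexical scoping.
    for (name in names(spec)) {
      entry <- spec[[name]]
      result$meta[[name]] <- switch(entry[[1]],
        unevaluated = entry[[2]],
        "function" = entry[[2]](elem$content),
        stop("unsupported type in this sketch: ", entry[[1]]))
    }
    result
  }
}

# Toy specification (assumed for illustration only).
spec <- list(Origin = list("unevaluated", "toy source"),
             Length = list("function", function(x) nchar(x)))

reader <- make_reader(spec, doc = list(meta = list()))
out <- reader(list(content = "<a>hello</a>", uri = NULL),
              language = "en", load = TRUE, id = "doc1")
```

Here reader never receives spec directly, yet it fills out$meta from it, which is the behaviour the term "function generator" refers to.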
A function with the signature elem, language, load, id:
elem |
A list with the two named elements content and uri. The first element must hold the document to be read in; the second must hold a call that extracts this document. The call is evaluated when load on demand is requested. |
load |
A logical value indicating whether the document
corpus should be immediately loaded into memory. |
language |
A character vector giving the text's language. |
id |
A character vector representing a unique identification
string for the returned text document. |
The function returns doc augmented by the information parsed from the XML file as described by spec.
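How an elem with a deferred uri call might behave can be sketched in base R. This is an illustration under assumptions, not tm's loading code: the temporary file, the variable names, and the helper load_content are hypothetical.

```r
# Write a small XML file to a temporary location (illustrative data only).
path <- tempfile(fileext = ".xml")
writeLines("<doc><title>Hello</title></doc>", path)

# `uri` holds an unevaluated call; nothing is read from disk yet.
elem <- list(content = NULL,
             uri = bquote(readLines(.(path))))

# Hypothetical helper: evaluate the stored call only when content is requested.
load_content <- function(elem) {
  if (is.null(elem$content)) eval(elem$uri) else elem$content
}

loaded <- load_content(elem)
```

Storing a call rather than the text itself is what lets a reader postpone file access until the document is actually needed.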
Ingo Feinerer
Vignette 'Extensions: How to Handle Custom File Formats'.
Use getReaders to list available reader functions.
## Not run: 
# The strptime format string below is restored on the assumption that
# Reuters dates look like "26-FEB-1987 15:01:01.79".
readReut21578XML <- readXML(
    spec = list(Author = list("node", "/REUTERS/TEXT/AUTHOR"),
                DateTimeStamp = list("function", function(node)
                    strptime(sapply(XML::getNodeSet(node, "/REUTERS/DATE"),
                                    XML::xmlValue),
                             format = "%d-%B-%Y %H:%M:%S",
                             tz = "GMT")),
                Description = list("unevaluated", ""),
                Heading = list("node", "/REUTERS/TEXT/TITLE"),
                ID = list("attribute", "/REUTERS/@NEWID"),
                Origin = list("unevaluated", "Reuters-21578 XML"),
                Topics = list("node", "/REUTERS/TOPICS/D")),
    doc = new("Reuters21578Document"))
## End(Not run)