track.design {trackObjs}    R Documentation

Design of a tracking environment

Description

This document describes the layout of a tracking environment. Object tracking works by replacing a variable with an active binding, and keeping the actual value of the variable on disk and/or in another environment. Tracked objects are automatically resaved to disk when they are changed. Basic characteristics, such as class, size, extent, and creation and modification times are recorded in a summary of all tracked objects.

Details

Object tracking works by replacing a variable with an active binding, and keeping the actual value of the variable on disk and/or in another environment. Whenever the variable is fetched or assigned, the active binding is called, and it writes the object to disk if necessary, and records basic characteristics of the objects in a summary of all objects, including creation, modification and access times.
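The mechanism can be illustrated with makeActiveBinding() from base R. This is only a minimal sketch, not the trackObjs implementation: the environments visible and store, and the counter environment, are hypothetical stand-ins, and the step that writes to disk is left out.

```r
# Hypothetical stand-ins: 'visible' plays the tracked environment,
# 'store' plays the tracking environment that holds the real value.
visible <- new.env()
store <- new.env()
assign("x", 1:3, envir = store)

counter <- new.env()
counter$accesses <- 0

# The binding function is called with no argument when the variable is
# fetched, and with the new value when the variable is assigned.
makeActiveBinding("x", function(value) {
    if (missing(value)) {
        counter$accesses <- counter$accesses + 1
        get("x", envir = store)
    } else {
        # a real implementation would also write the value to disk here
        assign("x", value, envir = store)
        invisible(value)
    }
}, visible)

first <- get("x", envir = visible)    # fetch goes through the binding
assign("x", 10, envir = visible)      # assignment goes through the binding
stored <- get("x", envir = store)     # the real value lives in 'store'
```

After this runs, visible still contains only the active binding, while the value assigned through it is held in store.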

A tracking environment can be linked to one environment on the search path, but the tracking environment is not on the search path itself. An environment can only have one tracking environment linked to it. Variables cannot be tracked automatically: they must be registered with the tracking environment using the function track().

Any user-created environment on the search path, or the global environment, can be tracked.

The format used to store R objects in files is the one used by save()/load() – the objects in those files can be read using load() if desired.
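As an illustration of reading such a file outside of the tracking machinery, the following sketch saves an object with save() and reads it back with load(). The file location is hypothetical, and the sketch assumes the object is stored under its own name, which this page does not state.

```r
abc <- c(a = 1, b = 2, c = 3)
file <- file.path(tempdir(), "abc.rda")    # hypothetical location
save(list = "abc", file = file)

# load() restores the objects under their saved names and returns
# those names as a character vector.
restored <- new.env()
loaded <- load(file, envir = restored)
value <- get("abc", envir = restored)
```

Loading into a fresh environment, as here, avoids overwriting a variable of the same name in the global environment.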

The various variables and files involved in tracking are as follows (assuming the RData suffix being used is "rda"). Note that the default tracked visible environment is the global environment.


Tracked Visible Environment
(on search list)
attr(., "trackingEnv") -> Tracking Environment   +->  Tracking Directory (files)
+-----------------+       (not on search list)  /         |
|                 |       attr(., "trackingDir")          |
|                 |   +-------------------------+         +
|                 |   |        .trackingFileMap |         +- filemap.txt
|                 |   |        .trackingSummary |         +- .trackingSummary.rda
|                 |   |        .trackingUnsaved |         |
|                 |   | .trackingSummaryChanged |         |
|                 |   |        .trackingOptions |         |
|         x (*)   |   |                   x (@) |         +- x.rda
|       abc (*)   |   |                 abc (@) |         +- abc.rda
|         Y (*)   |   |                   Y (@) |         +- _1.rda
|        x1       |   +-------------------------+
|        x2       |        
+-----------------+

/tmp/trackdir1 
      |
      +- filemap.txt
      +- .trackingSummary.rda
      +- x.rda
      +- abc.rda
      +- _1.rda

Terminology

One could describe a tracking environment as "attached" to the tracked environment, but using that term would risk confusion with the role of the attach() function and the search path in R. Instead, the trackObjs package says that a tracking environment is "linked" to the tracked environment.

track:
The trackObjs package tracks variables by setting up a one-to-one relationship between R objects and files on disk, so that when an object in R is modified, the file on disk is automatically updated.
tracked environment:
A tracked environment contains user variables and is usually on the search path.
tracked object:
A tracked object is an object in a tracked environment that has an active binding, so that when it is modified, the corresponding file on disk is also modified.
untracked object:
An untracked object in a tracked environment is an ordinary object that is not tracked and has no corresponding file.
tracking environment:
A tracking environment is a special environment used by the trackObjs package to track objects in the tracked environment.
linked:
A tracking environment is linked to a tracked environment by the trackingEnv attribute on the tracked environment, which points to the tracking environment.
start tracking, stop tracking:
Tracking is started by creating a tracking environment, linking it to the tracked environment, and setting up bindings for tracked objects.
tracking database:
A tracking database is the collection of files and directories that stores the tracking information.
active tracking database:
A tracking database that is currently linked to an environment in a running R session.
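
The link between the environments can be sketched with plain attributes, following the diagram above. This is a sketch under assumptions: the environments and the directory name are hypothetical, and the package functions that actually start tracking are not used.

```r
tracked <- new.env()        # stands in for the tracked (visible) environment
trackingEnv <- new.env()    # stands in for the tracking environment
trackingDir <- file.path(tempdir(), "trackdir1")    # hypothetical directory

# Link the environments as in the diagram: the tracked environment
# points to the tracking environment, which points to the directory.
attr(tracked, "trackingEnv") <- trackingEnv
attr(trackingEnv, "trackingDir") <- trackingDir

# Follow the links from the tracked environment to the directory.
linkedEnv <- attr(tracked, "trackingEnv")
linkedDir <- attr(linkedEnv, "trackingDir")
```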

Untrackable variables

Only ordinary variables can be tracked – variables that are active bindings cannot be tracked.

Several variable names are reserved and cannot be tracked: .trackingEnv, .trackingFileMap, .trackingUnsaved, .trackingSummary, .trackingSummaryChanged, .trackingOptions. Additionally, any variable with a newline character ("\n") as part of its name cannot be tracked. The main reason for this is that the mapping from object names to file names is stored in a text file, in which the newline character delimits the names.
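
These rules can be expressed as a small check. The helper isTrackable() below is hypothetical and is not part of trackObjs; it only restates the three conditions given in this section.

```r
# Reserved names listed in this section.
reservedNames <- c(".trackingEnv", ".trackingFileMap", ".trackingUnsaved",
                   ".trackingSummary", ".trackingSummaryChanged",
                   ".trackingOptions")

# isTrackable() is a hypothetical helper, not a trackObjs function.
isTrackable <- function(name, envir) {
    if (name %in% reservedNames) return(FALSE)
    if (grepl("\n", name, fixed = TRUE)) return(FALSE)
    # bindingIsActive() signals an error for a name with no binding,
    # so check that the variable is present first
    if (!is.element(name, objects(envir = envir, all.names = TRUE)))
        return(FALSE)
    !bindingIsActive(name, envir)
}

e <- new.env()
assign("x", 1, envir = e)
assign("bad\nname", 2, envir = e)
makeActiveBinding("live", function() 42, e)
```

With these definitions, isTrackable("x", e) is TRUE, while the checks for "bad\nname", "live", and ".trackingSummary" all return FALSE.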

The file map

The mapping from object names to file names is stored in the file filemap.txt. This data is stored as an ordinary text file to make it easy for users to see the object-file mappings outside of R.
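
Because the map is plain text, it can be read with ordinary text tools. The sketch below assumes a layout of one "file:object" entry per line; that layout is an assumption made for illustration, since the actual format is not described on this page.

```r
# ASSUMPTION: one "file:object" entry per line. The actual layout of
# the file is not described on this page.
mapFile <- file.path(tempdir(), "filemap.txt")
writeLines(c("x:x", "abc:abc", "_1:Y"), mapFile)

lines <- readLines(mapFile)
sep <- regexpr(":", lines, fixed = TRUE)       # first colon on each line
fileMap <- substring(lines, 1, sep - 1)        # file base names
names(fileMap) <- substring(lines, sep + 1)    # object names

fileForY <- unname(fileMap["Y"])
```

Splitting at the first colon keeps the sketch working even if an object name itself contains a colon.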

Implementation considerations

The reason that objects must be explicitly registered for tracking is that there is currently no way of setting up a function to be called when a new object is created, so new objects are always created as ordinary R objects. Similarly, the R remove() function does not have any hooks, so if remove() is called on a tracked variable, it will just remove the active binding in the visible environment, but will not disturb the underlying tracking environment. The track.remove() function will completely remove a tracked variable from the visible environment and the underlying tracking environment (including deleting any associated disk file).
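
The behavior of remove() can be seen with base R alone. In this sketch the environments visible and store are hypothetical stand-ins for the visible and tracking environments; no trackObjs functions are involved.

```r
visible <- new.env()
store <- new.env()
assign("x", 1:3, envir = store)
makeActiveBinding("x", function(value) {
    if (missing(value)) {
        get("x", envir = store)
    } else {
        assign("x", value, envir = store)
    }
}, visible)

remove("x", envir = visible)    # removes only the active binding

# The underlying value is untouched and is now orphaned in 'store'.
stillStored <- exists("x", envir = store, inherits = FALSE)
stillVisible <- is.element("x", objects(envir = visible, all.names = TRUE))
```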

Object tracking was intended to be used in situations where large numbers of large objects must be manipulated. Consequently, there is a good chance of exhausting resources while using the trackObjs package. The trackObjs code tries to check return codes when creating objects or writing files, and in cases where it is unable to complete an operation it tries to leave the tracking environment in a state from which objects can be salvaged. The functions track.rebuild() and track.flush() are provided to help recover from situations where resource limitations prevented successful operation. Note that files are generally written in an "unsafe" manner (i.e., existing files can be overwritten with partial new files), but in these cases data is retained in memory and can be rewritten after resolving file system problems.

The R function exists() should be used with care on tracked objects, because it will actually fetch the object, possibly needing to read it from disk. In the trackObjs code, exists("x") is not used to check for the existence of a possibly tracked object x; instead, an idiom like is.element("x", objects(all=TRUE)) is used.
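
The idiom can be checked with an active binding that counts how often it is fetched. This is a minimal sketch: the environment e and the counter are hypothetical, and the argument is spelled out as all.names=TRUE rather than abbreviated to all=TRUE.

```r
e <- new.env()
fetches <- new.env()
fetches$n <- 0
makeActiveBinding("x", function() {
    fetches$n <- fetches$n + 1    # count every fetch of the value
    "value that would be read from disk"
}, e)

# Listing the names in the environment does not call the binding function.
present <- is.element("x", objects(envir = e, all.names = TRUE))
absent <- is.element("y", objects(envir = e, all.names = TRUE))
fetchesAfterCheck <- fetches$n
```

The counter is still zero after both checks, so no object would have been read from disk.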

These statements about the available facilities in R were true as of R-2.4.1 (released Dec 2006).

The rules for how variable names are mapped to file names aim to produce file names that work properly on all three operating systems R runs on (Linux, Windows, and Mac OS X). A somewhat obscure point that must be taken into account is the case-insensitivity of file names on Mac OS X and Windows. Modern versions of these operating systems appear to respect case in file names, but only because they are case-preserving; they are in fact still case-insensitive. This means that a file created with the name "X.rda" is the same file as "x.rda". Here is a short shell transcript showing this behavior in a bash shell running under Windows and Mac OS X (it is the same in both).

    $ echo 123 > X
    $ cat x
    123
    $ echo 456 > x
    $ cat x
    456
    $ cat X
    456
Thus, in order to work on these operating systems, file mapping must be used to create different file names for the R objects "x" and "X" (which are in fact different objects in R).
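
One way such a mapping could work is sketched below. The helper makeFileMap() is hypothetical, and its rule (names made only of lower-case letters, digits, periods and underscores keep their own name; all others get a generated name such as "_1") is an assumption chosen to reproduce the x, abc and Y example in the diagram above, not the package's documented rule.

```r
# makeFileMap() is a hypothetical helper; its rule is an assumption
# chosen to match the x, abc, Y example in the diagram.
makeFileMap <- function(objNames) {
    simple <- grepl("^[a-z][a-z0-9._]*$", objNames)
    fileNames <- objNames
    # every other name gets a generated file name: _1, _2, ...
    fileNames[!simple] <- paste("_", seq_len(sum(!simple)), sep = "")
    names(fileNames) <- objNames
    fileNames
}

fileMap <- makeFileMap(c("x", "abc", "Y", "X"))

# Zero means no two file names coincide when case is ignored.
collisions <- anyDuplicated(tolower(fileMap))
```

Here "x" and "X" end up in different files ("x" and "_2"), so they remain distinct on a case-insensitive file system. The sketch ignores other constraints, such as reserved device names on Windows.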

Portability

Tracking directories are intended to be operating-system independent and completely portable across different operating systems.

Author(s)

Tony Plate <tplate@acm.org>

References

Roger D. Peng. Interacting with data using the filehash package. R News, 6(4):19-24, October 2006. http://cran.r-project.org/doc/Rnews and http://sandybox.typepad.com/software

David E. Brahm. Delayed data packages. R News, 2(3):11-12, December 2002. http://cran.r-project.org/doc/Rnews

See Also

Overview of the trackObjs package.

Documentation for makeActiveBinding and related functions (in 'base' package).

Inspiration from the packages g.data and filehash.


[Package trackObjs version 0.8-3 Index]