ff {ff}                R Documentation

ff classes for representing (large) atomic data

Description

The ff package provides atomic data structures that are stored on disk but behave (almost) as if they were in RAM by mapping only a section (pagesize) into main memory (the effective main memory consumption per ff object). Several access optimization techniques such as Hybrid Index Preprocessing (as.hi, update.ff) and Virtualization (virtual, vt, vw) are implemented to achieve good performance even with large datasets. In addition to the basic access functions, the ff package also provides compatibility functions that facilitate writing code for ff and ram objects (clone, as.ff, as.ram) and very basic support for operating on ff objects (ffapply).
While the (possibly packed) raw data is stored in a flat file, meta information about the atomic data structure such as its dimension, virtual storage mode (vmode), factor level encoding, internal length etc. is stored as an ordinary R object (external pointer plus attributes) and can be saved in the workspace. The raw flat file data encoding is always in native machine format for optimal performance and provides several packing schemes for different data types such as logical, raw, integer and double (an extended version supports more tightly packed virtual data types).
Flat-file data files can be shared among ff objects in the same R process or even across different R processes thanks to memory mapping, although the caching effects have not been tested extensively.
Please do read and understand the limitations and warnings in LimWarn before you do anything serious with package ff.

Usage

ff( initdata  = NULL
, length      = NULL
, levels      = NULL
, dim         = NULL
, dimorder    = NULL
, bydim       = NULL
, symmetric   = FALSE
, fixdiag     = NULL
, names       = NULL
, dimnames    = NULL
, ramclass    = NULL
, ramattribs  = NULL
, vmode       = NULL
, pattern     = "ff"
, filename    = NULL
, overwrite   = FALSE
, readonly    = FALSE
, pagesize    = NULL  # getOption("ffpagesize")
, caching     = NULL  # getOption("ffcaching")
, finalizer   = NULL
, finonexit   = NULL  # getOption("fffinonexit")
, FF_RETURN   = TRUE
, BATCHSIZE   = .Machine$integer.max
, BATCHBYTES  = getOption("ffbatchbytes")
, VERBOSE     = FALSE
)

Arguments

initdata scalar or vector of the .vimplemented vmodes, recycled if needed, default 0, see also as.vmode and vector.vmode
length optional vector length of the object (default: derive from 'initdata' or 'dim'), see length.ff
levels optional character vector of levels for a factor (in this case initdata must be composed of these) (default: derive from initdata)
names NOT taken from initdata, see names
dim optional array dim, see dim.ff and array
dimorder physical layout (default 1:length(dim)), see dimorder and aperm
bydim dimorder by which to interpret the 'initdata', generalization of the 'byrow' parameter in matrix
symmetric extended feature: TRUE creates symmetric matrix (default FALSE)
fixdiag extended feature: non-NULL scalar requires fixed diagonal for symmetric matrix (default NULL is free diagonal)
dimnames NOT taken from initdata, see dimnames
ramclass class attribute attached when moving all or parts of this ff into ram, see ramclass
ramattribs additional attributes attached when moving all or parts of this ff into ram, see ramattribs
vmode virtual storage mode (default: derive from 'initdata'), see vmode and as.vmode
pattern root pattern for automatic ff filename creation (default "ff"), see also physical
filename ff filename (default tmpfile with 'pattern' prefix), see also physical
overwrite set to TRUE to allow overwriting existing files (default FALSE)
readonly set to TRUE to forbid writing to existing files
pagesize pagesize in bytes for the memory mapping (default from getOption("ffpagesize") initialized by getdefaultpagesize), see also physical
caching caching scheme for the backend, currently 'mmnoflush' or 'mmeachflush' (flush mmpages at each swap, default from getOption("ffcaching") initialized with 'mmeachflush'), see also physical
finalizer name of finalizer function called when ff object is removed, one of "close", "delete", "deleteIfOpen" (defaults to "close" when filename is given outside getOption("fftempdir"), otherwise "delete" for temporary ff objects); standard finalizers are close.ff, delete.ff and deleteIfOpen.ff, see also reg.finalizer
finonexit logical scalar determining whether finalizer is also called when R is closed via q (default TRUE from getOption("fffinonexit"))
FF_RETURN logical scalar or ff object to be used. The default TRUE creates a new ff file. FALSE returns a ram object. Handing over an ff object here uses this or stops if not ffsuitable
BATCHSIZE integer scalar limiting the number of elements to be processed in update.ff when length(initdata)>1, default from .Machine$integer.max
BATCHBYTES integer scalar limiting the number of bytes to be processed in update.ff when length(initdata)>1, default from getOption("ffbatchbytes"), see also .rambytes
VERBOSE set to TRUE for verbose output in update.ff when length(initdata)>1, default FALSE

Details

The atomic data is stored in filename as a native encoded raw flat file on disk; OS specific limitations of the file system apply. The number of elements per ff object is limited by integer indexing, i.e. .Machine$integer.max. Atomic objects created with ff are open (see is.open): a C++ object is ready to access the file via memory mapping. Currently the C++ backend provides two caching schemes: 'mmnoflush' lets the OS decide when to flush memory mapped pages and 'mmeachflush' flushes memory mapped pages at each page swap per ff file. These minimal memory resources can be released by closing or deleting the ff file (see close and delete).
ff objects can be saved and loaded across R sessions. If the ff file still exists in the same location, it will be opened automatically at the first attempt to access its data. Raw data files can be made accessible as an ff object by explicitly giving the filename and vmode but no size information (length or dim). The ff object will open the file and handle the data with respect to the given vmode.
If the ff object is removed, the ff object's finalizer is invoked at the next garbage collection (see gc). The close finalizer will close the ff file, the delete finalizer will delete the ff file. The default finalizer deleteIfOpen will delete open files and do nothing for closed files. If the default finalizer is used, two actions are needed to protect the ff file against deletion: create the file outside the standard 'fftempdir' and close the ff object before removing it or before quitting R. When R is exited through q, the finalizer will be invoked depending on the 'fffinonexit' option; furthermore the 'fftempdir' is unlinked.
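
The life cycle described above can be sketched as follows. This is a minimal illustration, not part of the original examples: the file name is a hypothetical throwaway path, and the "close" finalizer is chosen explicitly so that the file survives closing and can be re-attached from filename and vmode alone.

```r
library(ff)

# Hypothetical file name, used only for this sketch
fn <- tempfile(pattern = "ffdemo", fileext = ".ff")

# Create a double vector backed by 'fn'; with the "close" finalizer the file
# is closed, not deleted, when the finalizer runs
x <- ff(vmode = "double", length = 10, filename = fn, finalizer = "close")
x[1:3] <- c(1.5, 2.5, 3.5)
is.open(x)           # open right after creation

close(x)             # release the memory mapping; the file stays on disk
is.open(x)
file.exists(fn)

# Re-attach the raw file: filename and vmode are given, but no length or dim
y <- ff(vmode = "double", filename = fn, finalizer = "close")
n    <- length(y)    # length is derived from the file
vals <- y[1:3]       # data written through x is read back through y

# Clean up explicitly: delete removes the file
delete(y)
file.exists(fn)
```

Choosing the finalizer explicitly keeps the sketch independent of the session's 'fffinalizer' and 'fftempdir' settings.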

Value

If (!FF_RETURN) then a ram object like those generated by vector, matrix, array but with attributes 'vmode', 'physical' and 'virtual' accessible via vmode, physical and virtual
If (FF_RETURN) an object of class 'ff' which is a list with two components:

physical an external pointer of class 'ff_pointer' which carries attributes with copy by reference semantics: changing a physical attribute of a copy changes the original
virtual an empty list which carries attributes with copy by value semantics: changing a virtual attribute of a copy does not change the original
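
The difference between the two components can be illustrated with a plain R assignment. This is a small sketch using only functions named on this page; the variable names are arbitrary.

```r
library(ff)

x <- ff(1:12, dim = c(3, 4))
y <- x                       # copies the R-side object, not the file

# Virtual attribute (copy by value): changing dim on the copy
# leaves the original untouched
dim(y) <- c(4, 3)
dx <- dim(x)                 # still 3 x 4
dy <- dim(y)                 # now 4 x 3

# Physical side (copy by reference): both objects share one file,
# so a value written through y is visible through x
y[1, 1] <- 99L
shared <- x[1, 1]

# The two components are inspected through their accessors
names(physical(x))
names(virtual(x))

# FF_RETURN = FALSE yields an in-RAM object instead of an ff object
r <- ff(1:12, dim = c(3, 4), FF_RETURN = FALSE)
is.ff(r)

delete(x)                    # remove the temporary file
```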

Physical object component

The 'ff_pointer' carries the following 'physical' or readonly attributes, which are accessible via physical:
vmode see vmode
maxlength see maxlength
pattern see parameter 'pattern'
filename see filename
pagesize see parameter 'pagesize'
caching see parameter 'caching'
finalizer see parameter 'finalizer'
finonexit see parameter 'finonexit'
readonly see is.readonly
class The external pointer needs class 'ff_pointer' to allow method dispatch of finalizers

Virtual object component

The 'virtual' component carries the following attributes (some of which might be NULL):
Length see length.ff
Levels see levels.ff
Names see names.ff
VW see vw.ff
Dim see dim.ff
Dimorder see dimorder
Symmetric see symmetric.ff
Fixdiag see fixdiag.ff
ramclass see ramclass
ramattribs see ramattribs

Class

You should not rely on the internal structure of ff objects or their ram versions. Instead use the accessor functions like vmode, physical and virtual. Still it would be wise to avoid attributes AND classes 'vmode', 'physical' and 'virtual' in any other packages. Note that the 'ff' object's class attribute also has copy-by-value semantics ('virtual'). For the 'ff' object the following class attributes are known:
vector c("ff_vector","ff")
matrix c("ff_matrix","ff_array","ff")
array c("ff_array","ff")
symmetric matrix c("ff_symm","ff")
distance matrix c("ff_dist","ff_symm","ff")
reserved for future use c("ff_mixed","ff")
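
The class vectors listed above can be checked directly, and inherits is the safer test in user code. A brief sketch with arbitrary object names:

```r
library(ff)

v <- ff(1:12)                      # vector
m <- ff(1:12, dim = c(3, 4))       # matrix
a <- ff(1:24, dim = c(2, 3, 4))    # three-dimensional array

cv <- class(v)
cm <- class(m)
ca <- class(a)

# Prefer inherits() over comparing full class vectors
inherits(m, "ff_array")
inherits(v, "ff_array")

# Remove the temporary files
delete(v); delete(m); delete(a)
```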

Methods

The following methods and functions are available for ff objects:
Type      Name           Assign  Comment
Basic functions
function  ff                     constructor for ff and ram objects
generic   update                 updates one ff object with the content of another
generic   clone                  clones an ff object optionally changing some of its features
method    print                  print ff
method    str                    ff object structure
Class test and coercion
function  is.ff                  check if inherits from ff
generic   as.ff                  coerce to ff, if not yet
generic   as.ram                 coerce to ram retaining some of the ff information
Virtual storage mode
generic   vmode          <-      get and set virtual mode (setting only for ram, not for ff objects)
generic   as.vmode               coerce to vmode (only for ram, not for ff objects)
Physical attributes
function  physical       <-      set and get physical attributes
generic   filename               get filename
generic   maxlength              get maxlength
generic   is.sorted      <-      set and get if is marked as sorted
generic   na.count       <-      set and get NA count, if set to non-NA only swap methods can change and na.count is maintained automatically
generic   is.readonly            get if is readonly
Virtual attributes
function  virtual        <-      set and get virtual attributes
method    length         <-      set and get length
method    dim            <-      set and get dim
generic   dimorder       <-      set and get the order of dimension interpretation
generic   vw             <-      set and get virtual windows
method    names          <-      set and get names
method    dimnames       <-      set and get dimnames
generic   symmetric              get if is symmetric
generic   fixdiag        <-      set and get fixed diagonal of symmetric matrix
method    levels         <-      levels of factor
method    is.factor              if is factor
method    is.ordered             if is ordered (factor)
Access functions
function  get.ff                 get single ff element (currently [[ is a shortcut)
function  set.ff                 set single ff element (currently [[<- is a shortcut)
function  getset.ff              set single ff element and get old value in one access operation
function  read.ff                get vector of contiguous elements
function  write.ff               set vector of contiguous elements
function  readwrite.ff           set vector of contiguous elements and get old values in one access operation
method    [                      get vector of indexed elements, uses HIP, see hi
method    [<-                    set vector of indexed elements, uses HIP, see hi
generic   swap                   set vector of indexed elements and get old values in one access operation
generic   add                    (almost) unifies '+=' operation for ff and ram objects
Opening/Closing/Deleting
generic   is.open                check if ff is open
method    open                   open ff object (is done automatically on access)
method    close                  close ff object (releases C++ memory and protects against file deletion if the deleteIfOpen finalizer is used)
generic   delete                 deletes ff file (unconditionally)
generic   deleteIfOpen           deletes ff file if ff object is opened
Other
function  geterror.ff            get error code
function  geterrstr.ff           get error message
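
A few of the access functions from the table can be tried on a small vector. This is a hedged sketch: the argument order used for read.ff(x, i, n), write.ff(x, i, value) and swap(x, value, i) is assumed from the package's own help pages and should be checked against the installed version.

```r
library(ff)

x <- ff(1:10)

# Single-element access; [[ and [[<- are shortcuts for get.ff and set.ff
x[[2]]
x[[2]] <- 20L

# Indexed access through [ and [<-
x[c(1, 5, 10)]
x[c(1, 5, 10)] <- c(0L, 0L, 0L)

# Contiguous access: read 4 elements starting at position 3,
# then overwrite positions 3 and 4
read.ff(x, 3, 4)
write.ff(x, 3, c(-3L, -4L))

# swap writes new values and returns the old ones in one operation
old <- swap(x, c(100L, 200L), 1:2)

final <- x[]        # whole content as a ram vector
delete(x)           # remove the temporary file
```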

ff options

Through options or getOption one can change and query global features of the ff package:
option        description                                                default
fftempdir     default directory for creating ff files                    tempdir
fffinalizer   name of default finalizer                                  deleteIfOpen
fffinonexit   default for invoking finalizer on exit of R                TRUE
ffpagesize    default pagesize                                           getdefaultpagesize
ffcaching     caching scheme for the C++ backend                         'mmnoflush'
ffdrop        default for the drop parameter in the ff subscript methods TRUE
ffbatchbytes  default for the byte limit in batched/chunked processing   memory.limit() %/% 100
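
These options are ordinary R options, so the usual query-and-restore idiom applies. In this sketch the 16 MB batch limit is an arbitrary illustrative value, not a recommendation.

```r
library(ff)   # loading the package initializes the ff options

# Query current settings
getOption("fftempdir")
getOption("ffcaching")
getOption("ffpagesize")

# Change a setting temporarily; options() returns the previous values
old <- options(ffbatchbytes = 16 * 2^20)   # arbitrary value for illustration
changed <- getOption("ffbatchbytes")

# Restore the previous setting
options(old)
restored <- getOption("ffbatchbytes")
```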

OS specific

The following table gives an overview of file size limits for common file systems (see http://en.wikipedia.org/wiki/Comparison_of_file_systems for further details):
File System File size limit
FAT16 2GB
FAT32 4GB
NTFS 16EB
ext2/3/4 16GB to 2TB
ReiserFS 4GB (up to version 3.4) / 8TB (from version 3.5)
XFS 8EB
JFS 4PB
HFS 2GB
HFS Plus 8EB
UFS1 4GB to 256TB
UFS2 512GB to 32PB
UDF 16EB

Credits

Package Version 1.0
Daniel Adler dadler@uni-goettingen.de
R package design, C++ generic file vectors, Memory-Mapping, 64-bit Multi-Indexing adapter and Documentation, Platform ports
Oleg Nenadic onenadi@uni-goettingen.de
Index sequence packing, Documentation
Walter Zucchini wzucchi@uni-goettingen.de
Array Indexing, Sampling, Documentation
Christian Gläser christian_glaeser@gmx.de
Wrapper for biglm package

Package Version 2.0
Jens Oehlschlägel Jens.Oehlschlaegel@truecluster.com
R package redesign; Hybrid Index Preprocessing; transparent object creation and finalization; vmode design; virtualization and hybrid copying; arrays with dimorder and bydim; symmetric matrices; factors and POSIXct; virtual windows and transpose; new generics update, clone, swap, add, as.ff and as.ram; ffapply and collapsing functions. R-coding, C-coding and Rd-documentation.
Daniel Adler dadler@uni-goettingen.de
C++ generic file vectors, vmode implementation and low-level bit-packing/unpacking, arithmetic operations and NA handling, Memory-Mapping and backend caching. C++ coding and platform ports. R-code extensions for opening existing flat files readonly and shared.

Licence

Package under GPL-2, included C++ code released by Daniel Adler under the less restrictive ISCL

Note

Note that the standard finalizers are generic functions: their dispatch to the 'ff_pointer' method happens at finalization time, while their 'ff' methods exist for direct calling.

See Also

vector, matrix, array, as.ff, as.ram

Examples

  cat("make sure you understand the following ff options before you start using the ff package!!\n")
  oldoptions <- options(fffinalizer="deleteIfOpen", fffinonexit=TRUE, fftempdir=tempdir())
  ff(1:12)                        # an integer vector
  ff(0, 12)                       # a double vector of length 12
  ff(vmode="logical", length=12)  # a logical vector of length 12 (due to NA using 2 bit per cell on disk, vmode="boolean" uses 1 bit)
  ff(1:12, dim=c(3,4))            # an integer matrix 3x4 (standard colwise physical layout)
  ff(1:12, dim=c(3,4), dimorder=c(2,1)) # an integer matrix 3x4 (rowwise physical layout, but filled in standard colwise order)
  ff(1:12, dim=c(3,4), bydim=c(2,1)) # an integer matrix 3x4 (standard colwise physical layout, but filled in rowwise order aka matrix(, byrow=TRUE))
  options(oldoptions)

  if (ffxtensions()){
     a <- ff(vmode="boolean", dim=rep(2, 26)) # a 26-dimensional boolean array using 1-bit representation (file size 8 MB compared to 256 MB int in ram)
     dimnames(a) <- dummy.dimnames(a)
  }

  ## Not run: 

     cat("This 2GB biglm example can take long, you might want to change the size in order to define a size appropriate for your computer\n")
     require(biglm)

     b <- 1000
     n <- 100000
     k <- 3
     memory.size(max = TRUE)
     system.time(
     x <- ff(vmode="double", dim=c(b*n,k), dimnames=list(NULL, LETTERS[1:k]))
     )
     memory.size(max = TRUE)
     system.time(
     ffrowapply({
        l <- i2 - i1 + 1
        z <- rnorm(l)
        for (i in 1:k)
          x[i1:i2,i] <- z + rnorm(l)
     }, X=x, VERBOSE=TRUE, BATCHSIZE=n)
     )
     memory.size(max = TRUE)

     form <- A ~ B + C
     first <- TRUE
     system.time(
     ffrowapply({
        if (first){
          first <- FALSE
          fit <- biglm(form, as.data.frame(x[i1:i2,,drop=FALSE]))
        }else
          fit <- update(fit, as.data.frame(x[i1:i2,,drop=FALSE]))
     }, X=x, VERBOSE=TRUE, BATCHSIZE=n)
     )
     memory.size(max = TRUE)
     first
     fit
     summary(fit)
  ## End(Not run)

[Package ff version 2.0.0 Index]