data.table {data.table}                                    R Documentation

Just like a data.frame, but without row names

Description

Same as data.frame(), but the result has no row names. In some cases (see Examples), row names alone account for 90% of the memory used by a data.frame. Removing them can therefore use up to 10 times less memory and make the object up to 10 times faster to create and to copy. Storing 1:nrow in character form is inefficient, since rows can already be indexed by their integer position. For example, DF[6,] and DF["6",] both work, but the former is more efficient.
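The point about character row names can be checked with base R alone. The following is a minimal sketch, not taken from the package: the object names auto and named are illustrative, and the size difference it reports depends on the R version, since current versions of R store automatic row names compactly rather than as character strings.

```r
# Illustrative sketch using base R only; 'auto' and 'named' are made-up names.
n <- 100000
auto <- data.frame(colA = seq_len(n), colB = seq_len(n))

# Same data, but with the row names 1..n stored explicitly as character strings.
named <- auto
row.names(named) <- as.character(seq_len(n))

# Rows can be selected by integer position or by row name.
by_position <- named[6, ]
by_name <- named["6", ]
print(by_position$colA == by_name$colA)

# Character row names add to the memory footprint of the object.
print(object.size(named) > object.size(auto))
```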

Usage

data.table(..., keep.rownames = FALSE, check.names = TRUE)

Arguments

...             Just as ... in data.frame().
keep.rownames   If ... is itself a data.frame, TRUE retains its row names as the first column of the result.
check.names     Just as in data.frame().
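What "retain the row names in the first column" amounts to can be sketched in base R. This is only an illustration of the idea, not the package's code: the column name rn and the object names are assumptions made for the example.

```r
# Illustrative sketch in base R; the column name 'rn' is an assumption.
DF <- data.frame(x = c(10, 20, 30), row.names = c("a", "b", "c"))

kept <- DF
kept$rn <- row.names(DF)                          # copy the row names into a column
kept <- kept[, c("rn", names(DF)), drop = FALSE]  # move that column to the front
row.names(kept) <- NULL                           # discard the original row names

print(kept)
```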

Details

This class really does very little. The only reason for its existence is that the white book specifies that data.frame must have rownames.

Most of the code is copied from base functions with the code manipulating row.names removed.

A data.table is identical to a data.frame except that:
it has no row names
drop defaults to FALSE in [], so selecting a single row always returns a one-row data.table, not a vector
the comma is optional inside [], so DT[3] returns the 3rd row as a one-row data.table
[] is like a call to subset()
[,...] is like a call to with() (not yet implemented)
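For comparison, the base R data.frame behaviour that the list above departs from can be seen in a short sketch. It uses base R only, and the object names are illustrative.

```r
# Base R data.frame behaviour, for contrast; object names are illustrative.
one_col <- data.frame(colA = c(10, 20, 30))

dropped <- one_col[2, ]                # default drop: the result is a plain vector
kept <- one_col[2, , drop = FALSE]     # explicit drop = FALSE: a one-row data.frame

print(is.data.frame(dropped))
print(is.data.frame(kept))

# Without a comma, a data.frame treats the index as a column selector,
# so two_col[2] picks the second column rather than the second row.
two_col <- data.frame(colA = 1:3, colB = 4:6)
print(names(two_col[2]))
```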

Motivation:
up to 10 times less memory
up to 10 times faster to create, and copy
simpler R code by allowing column name expressions within []
the white book defines row names, so data.frame itself cannot be changed; hence a new class
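One way column name expressions can be made to work inside a call is the approach subset() takes: evaluate the unevaluated expression with the columns in scope, falling back to the calling frame for anything else. The sketch below shows that technique in base R. The helper filter_rows is hypothetical and is not the package's implementation.

```r
# Hypothetical helper illustrating the technique; not part of data.table.
filter_rows <- function(x, expr) {
  # Evaluate 'expr' with the columns of x visible, then the caller's objects.
  cond <- eval(substitute(expr), x, parent.frame())
  x[cond & !is.na(cond), , drop = FALSE]
}

DF <- data.frame(colA = 1:5, colB = 5:1)
Lkp <- 1:3

# colA and colB are found in DF; Lkp is found in the calling frame.
res <- filter_rows(DF, colA %in% Lkp & colA + colB > 5)
print(res)
```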

Value

Identical to the result of data.frame, but without the row.names attribute.

Author(s)

Matt Dowle

References

http://tolstoy.newcastle.edu.au/R/devel/05/12/3439.html

See Also

data.frame

Examples

nr = 1000000
D = rep(1:5,nr/5)
system.time(DF <<- data.frame(colA=D, colB=D))  # 2.08
system.time(DT <<- data.table(colA=D, colB=D))  # 0.15  (over 10 times faster to create)
identical(as.data.table(DF), DT)
identical(dim(DT),dim(DF))
object.size(DF)/object.size(DT)                 # 10 times less memory

tt = subset(DF,colA>3)
ss = DT[colA>3]
identical(as.data.table(tt), ss)

mean(subset(DF,colA+colB>5,"colB")$colB)        # extract the column; mean() of a whole data.frame is not supported in current R
mean(DT[colA+colB>5]$colB)

tt = with(subset(DF,colA>3),colA+colB)
ss = with(DT[colA>3],colA+colB)                 # but could be:  DT[colA>3,colA+colB]  (not yet implemented)
identical(tt, ss)

tt = DF[with(DF,tapply(1:nrow(DF),colB,last)),] # select last row grouping by colB
ss = DT[tapply(1:nrow(DT),colB,last)]           # but could be:  DT[last,group=colB]  (not yet implemented)
identical(as.data.table(tt), ss)

Lkp = 1:3
tt = DF[with(DF,colA %in% Lkp),]
ss = DT[colA %in% Lkp]                          # expressions inside the [] can see objects in the calling frame
identical(as.data.table(tt), ss)


[Package data.table version 1.0 Index]