filematrix package
The filematrix package was originally conceived as an
alternative to the bigmemory package for three reasons.
First, matrices created with bigmemory on NFS (network
file system) have often been corrupted (contained all
zeros). This is most likely a fault of memory-mapped files on NFS.
Second, bigmemory was not available for
Windows for a long period of time. It is now fully
cross-platform.
Finally, the bigmemory package uses a memory-mapped file
interface to work with data files. This delivers great performance for
matrices smaller than the amount of computer memory, but we were
experiencing major slowdowns for larger matrices.
filematrix and bigmemory packages
The packages use different libraries to read from and write to their
big files. The filematrix package uses the readBin
and writeBin R functions. The bigmemory
package uses memory-mapped file access via the BH R package
interface (Boost C++).
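To make the readBin/writeBin approach concrete, here is a minimal base R sketch. It is not filematrix's actual implementation: it assumes a flat file of 8-byte doubles stored column by column, and the helper read_column and the temporary file are hypothetical names introduced only for illustration.

```r
# Sketch: store a small matrix in a flat binary file, column by column,
# then read back one column without loading the rest of the file.
path <- tempfile(fileext = ".bin")   # hypothetical data file
nr <- 5L
nc <- 4L
m <- matrix(as.numeric(seq_len(nr * nc)), nrow = nr, ncol = nc)

con <- file(path, open = "wb")
writeBin(as.vector(m), con, size = 8L)   # as.vector() yields column-major order
close(con)

# Hypothetical helper: read column j of a matrix with nr rows.
read_column <- function(path, j, nr, size = 8L) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # Column j starts (j - 1) * nr values into the file.
  seek(con, where = (j - 1) * nr * size, origin = "start")
  readBin(con, what = "double", n = nr, size = size)
}

col3 <- read_column(path, 3L, nr)
```

Only nr values are read from disk per call, so the memory needed does not depend on the total size of the file.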
Note that filematrix can store real values in
a short 4-byte format. This feature is not available in
bigmemory.
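The trade-off of a 4-byte format can be illustrated with base R alone, since writeBin and readBin accept a size argument. The sketch below is an illustration of the storage format, not of the filematrix API; the file names and test values are arbitrary.

```r
# Sketch: write the same values as 8-byte and as 4-byte reals and compare.
x <- c(0.1, 1/3, 12345.678)
f8 <- tempfile()   # file with 8-byte doubles
f4 <- tempfile()   # file with 4-byte floats

con <- file(f8, open = "wb"); writeBin(x, con, size = 8L); close(con)
con <- file(f4, open = "wb"); writeBin(x, con, size = 4L); close(con)

con <- file(f4, open = "rb")
x4 <- readBin(con, what = "double", n = length(x), size = 4L)
close(con)

# The 4-byte file is half the size, at the cost of reduced precision.
size_ratio <- file.size(f4) / file.size(f8)
rel_error <- abs(x4 - x) / abs(x)
```

The values read back from the 4-byte file are close to, but not exactly equal to, the originals, which is acceptable for many large data sets and halves the disk and I/O cost.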
Due to the different file access approaches:
bigmemory accumulates changes to the matrix in memory
and writes them to the file upon a call to flush or file
closure.
filematrix writes the changes to the file upon
request, without delay.
Consequently:
bigmemory works well for matrices smaller than the
system memory. Writing to larger matrices is much slower because the system
tries to keep as much of the matrix in the system memory (cache) as
possible.
filematrix’s performance does not deteriorate on
matrices many times larger than the system memory.
bigmemory is better for random access of small file
matrices.
filematrix is equally good or better for block and
column-wise access of the file matrices.
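One way to see why column-wise access suits a file-backed matrix is to compute byte offsets. The sketch below assumes a column-major file layout of 8-byte values (the layout R uses for matrices in memory); the text above does not state the on-disk layout, so treat this as an assumption. The offset function is a hypothetical helper.

```r
# Sketch: byte offsets of matrix elements in an assumed column-major file.
nr <- 1e5
nc <- 1e5
size <- 8   # bytes per value

# Hypothetical helper: zero-based byte offset of element (i, j).
offset <- function(i, j) ((j - 1) * nr + (i - 1)) * size

# A whole column occupies one contiguous byte range: a single read or write.
col_range <- c(offset(1, 3), offset(nr, 3) + size)
col_bytes <- diff(col_range)

# Consecutive elements of a row are a full column apart on disk,
# so reading one row needs nc separate small reads.
row_offsets <- offset(3, 1:3)
row_stride <- diff(row_offsets)
```

Under this assumption a column is one sequential transfer of nr * size bytes, while a row is scattered across the file with a stride of the same length.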
filematrix is much more efficient than bigmemory
Let us consider a simple task of filling in a large matrix (twice
the memory size). Below is the code using filematrix. It
finishes in 10 minutes and does not interfere with other programs.
library(filematrix)
fm = fm.create(
    filenamebase = "big_fm",
    nrow = 1e5,
    ncol = 1e5)
tic = proc.time()
for( i in seq_len(ncol(fm)) ) {
    message(i, " of ", ncol(fm))
    fm[,i] = i + 1:nrow(fm)
}
toc = proc.time()
show(toc-tic)
# Cleanup
closeAndDeleteFiles(fm)
Filling the same-sized big matrix with bigmemory can be
very slow (2.5 times slower in this experiment). The
bigmemory package uses a memory-mapped file technique to
access the file. When the matrix is written to, the memory-mapped file
occupies all available RAM and the computer slows to a
halt.
Please exercise caution when running the code below.
library(bigmemory)
fm = filebacked.big.matrix(
    nrow = 1e5,
    ncol = 1e5,
    type = "double",
    backingfile = "big_bm.bmat",
    backingpath = "./",
    descriptorfile = "big_bm.desc.txt")
tic = proc.time()
for( i in seq_len(ncol(fm)) ) {
    message(i, " of ", ncol(fm))
    fm[,i] = i + 1:nrow(fm)
}
flush(fm)
toc = proc.time()
show(toc-tic)
# Cleanup
rm(fm)
gc()
unlink("big_bm.bmat")
unlink("big_bm.desc.txt")
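Before running either script, it may help to estimate how much disk space the matrix needs. The sketch below is simple arithmetic for the dimensions used above; the 4-byte figure assumes the short real format mentioned earlier, and the variable names are illustrative.

```r
# Sketch: on-disk size of a 1e5 x 1e5 matrix of real values.
nr <- 1e5
nc <- 1e5

bytes_double <- nr * nc * 8   # 8e10 bytes, i.e. 80 GB with 8-byte doubles
bytes_float  <- nr * nc * 4   # 4e10 bytes, i.e. 40 GB with 4-byte reals

gb <- function(bytes) bytes / 1e9   # decimal gigabytes
message("8-byte storage: ", gb(bytes_double), " GB")
message("4-byte storage: ", gb(bytes_float), " GB")
```

To reproduce the experiment on a machine with a different amount of RAM, nrow and ncol can be scaled so that the result is roughly twice the installed memory.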