filematrix package

The filematrix package was originally conceived as an alternative to the bigmemory package for three reasons.
First, matrices created with bigmemory on NFS (Network File System) were often corrupted (contained all zeros). This is most likely a fault of memory-mapped files on NFS.
Second, bigmemory was not available for Windows for a long period of time. It is now fully cross-platform.
Finally, the bigmemory package uses a memory-mapped file interface to work with data files. This delivers great performance for matrices smaller than the amount of computer memory, but we experienced major slowdowns for larger matrices.
filematrix and bigmemory packages

The two packages use different libraries to read from and write to their big files. The filematrix package uses the readBin and writeBin R functions. The bigmemory package uses memory-mapped file access via the BH R package interface (Boost C++).
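To make the readBin/writeBin approach concrete, here is a minimal base-R sketch of reading one column of a column-major matrix stored in a flat binary file. This is an illustration of the mechanism only, not the package's actual code; the file name and the readColumn helper are assumptions made for the example.

```r
# Illustration only: column i of an nrow-by-ncol matrix of doubles
# stored column-major in a flat binary file, read with seek + readBin.
fname <- tempfile(fileext = ".bin")
nr <- 1000; nc <- 10
con <- file(fname, "wb")
writeBin(as.double(seq_len(nr * nc)), con, size = 8)
close(con)

# Hypothetical helper, not part of filematrix's API
readColumn <- function(fname, i, nr) {
    con <- file(fname, "rb")
    on.exit(close(con))
    seek(con, where = (i - 1) * nr * 8)    # 8 bytes per double
    readBin(con, what = "double", n = nr, size = 8)
}

col3 <- readColumn(fname, 3, nr)
# Column 3 holds values 2001..3000
stopifnot(identical(col3, as.double(2001:3000)))
unlink(fname)
```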
Note that filematrix can store real values in a short 4-byte format. This feature is not available in bigmemory.
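The effect of the 4-byte format can be sketched with base R's writeBin(size = 4), the same mechanism in miniature: the file is half the size, at the cost of single-precision rounding. The file names and values below are illustrative only.

```r
# Sketch: 8-byte vs 4-byte storage of the same doubles with base R.
x <- c(1/3, pi, 1e7 + 0.1)
f8 <- tempfile(); f4 <- tempfile()
writeBin(x, f8, size = 8)   # full double precision
writeBin(x, f4, size = 4)   # truncated to single precision
sizes <- c(file.size(f8), file.size(f4))  # 24 bytes vs 12 bytes
y <- readBin(f4, what = "double", n = 3, size = 4)
err <- max(abs(y - x))      # small but nonzero rounding error
unlink(c(f8, f4))
```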
Due to the different file-access approaches:

- bigmemory accumulates changes to the matrix in memory and writes them to the file upon a call to flush or upon file closure.
- filematrix writes changes to the file upon request, without delay.

Consequently:

- bigmemory works well for matrices smaller than the system memory. Writing to larger matrices is much slower because the system tries to keep as much of the matrix as possible in memory (cache).
- filematrix's performance does not deteriorate on matrices many times larger than the system memory.
- bigmemory is better for random access to small file matrices.
- filematrix is equally good or better for block and column-wise access of file matrices.
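The access-pattern trade-off above can be sketched with base R alone; this illustrates the I/O pattern behind readBin-based storage, not either package's code. One large sequential read is far cheaper per element than many single-element reads of the same file.

```r
# Block read vs random single-element reads of a flat binary file.
fname <- tempfile()
n <- 1e6
writeBin(as.double(1:n), fname, size = 8)

con <- file(fname, "rb")
t_block <- system.time({
    block <- readBin(con, what = "double", n = n, size = 8)
})["elapsed"]
idx <- sample.int(n, 1e4)
t_random <- system.time({
    vals <- vapply(idx, function(i) {
        seek(con, where = (i - 1) * 8)
        readBin(con, what = "double", n = 1, size = 8)
    }, numeric(1))
})["elapsed"]
close(con)
unlink(fname)
# The block read covers 100x more elements, typically in less time
# than the 10,000 seek-and-read operations.
```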
filematrix is much more efficient than bigmemory

Let us consider the simple task of filling in a large matrix (twice the size of the system memory). Below is the code using filematrix. It finishes in 10 minutes and does not interfere with other programs.
library(filematrix)

fm = fm.create(
    filenamebase = "big_fm",
    nrow = 1e5,
    ncol = 1e5)

tic = proc.time()
for( i in seq_len(ncol(fm)) ) {
    message(i, " of ", ncol(fm))
    fm[,i] = i + 1:nrow(fm)
}
toc = proc.time()
show(toc - tic)

# Cleanup
closeAndDeleteFiles(fm)
Filling a big matrix of the same size with bigmemory can be very slow (2.5 times slower in this experiment). The bigmemory package uses a memory-mapped file to access the data. When the matrix is written to, the memory-mapped file occupies all available RAM and the computer slows to a halt.

Please exercise caution when running the code below.
library(bigmemory)

fm = filebacked.big.matrix(
    nrow = 1e5,
    ncol = 1e5,
    type = "double",
    backingfile = "big_bm.bmat",
    backingpath = "./",
    descriptorfile = "big_bm.desc.txt")

tic = proc.time()
for( i in seq_len(ncol(fm)) ) {
    message(i, " of ", ncol(fm))
    fm[,i] = i + 1:nrow(fm)
}
flush(fm)
toc = proc.time()
show(toc - tic)

# Cleanup
rm(fm)
gc()
unlink("big_bm.bmat")
unlink("big_bm.desc.txt")