The filematrix package provides functions to create and access large
matrices stored in files rather than in computer memory. A filematrix
can be as large as the storage allows. The package has been tested on
matrices multiple terabytes in size.
A filematrix can be created using the functions fm.create, fm.create.from.matrix, or fm.create.from.text.file:
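A creation call consistent with the description file shown further down might look like the following. This is a minimal sketch, assuming the filematrix package is installed. Placing the files under tempdir() is an assumption made here to keep the working directory clean; the base name "fmatrix" and the 200 x 200 double layout come from the text.

```r
library(filematrix)

# Assumption: files are placed under tempdir(); the base name "fmatrix"
# matches the file names mentioned in the text.
base = file.path(tempdir(), "fmatrix")

# Create a 200 x 200 filematrix of doubles
fm = fm.create(filenamebase = base, nrow = 200, ncol = 200, type = "double")

# Check that the data file and the description file exist
created = file.exists(paste0(base, c(".bmat", ".desc.txt")))
print(created)

# Show the description file
desc = readLines(paste0(base, ".desc.txt"))
cat(desc, sep = "\n")

# Release the file handle; the files stay on disk
close(fm)
```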
The code above creates two files: fmatrix.bmat, which stores the
filematrix values, and fmatrix.desc.txt, which stores the filematrix
description, such as dimensions, data type, and data type size.
Here is the content of the description file:
# Information file for R filematrix object
nrow=200
ncol=200
type=double
size=8
The elements of a filematrix can be read and written using the same
syntax as for regular R matrices. Any changes to a filematrix are
written to the .bmat file without extra buffering.
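The output below is consistent with a chunk like the following. It is a sketch rather than the original code, and the scratch location under tempdir() is an assumption.

```r
library(filematrix)

# Assumption: scratch files under tempdir() instead of the working directory
base = file.path(tempdir(), "fmatrix")
fm = fm.create(filenamebase = base, nrow = 200, ncol = 200)

# Write six values into the top-left 3 x 2 corner, filled column by column
fm[1:3, 1:2] = 1:6

# Read a 4 x 3 block back; the result is a regular R matrix
block = fm[1:4, 1:3]
print(block)

# Subsets are ordinary matrices, so base functions such as colSums apply
sums = colSums(fm[, 1:4])
print(sums)

# Remove the scratch files
closeAndDeleteFiles(fm)
```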
##      [,1] [,2] [,3]
## [1,]    1    4    0
## [2,]    2    5    0
## [3,]    3    6    0
## [4,]    0    0    0
## [1]  6 15  0  0
Elements of a filematrix can also be accessed as elements of a vector,
with the elements numbered sequentially down the columns. Thus, as fm
has nrow(fm) rows, fm[1,2] accesses the same element as fm[nrow(fm)+1].
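A short sketch of vector-style indexing follows, reusing the values written in the earlier example. The tempdir() location is an assumption.

```r
library(filematrix)

# Assumption: scratch files under tempdir()
base = file.path(tempdir(), "fmatrix")
fm = fm.create(filenamebase = base, nrow = 200, ncol = 200)
fm[1:3, 1:2] = 1:6

# The first four elements of column 1
first = fm[1:4]
print(first)

# The first four elements of column 2 start at index nrow(fm) + 1
second = fm[(nrow(fm) + 1):(nrow(fm) + 4)]
print(second)

# Matrix-style and vector-style indices address the same element
same = fm[1, 2] == fm[nrow(fm) + 1]

closeAndDeleteFiles(fm)
```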
## [1] 1 2 3 0
## [1] 4 5 6 0
A filematrix can also have row and column names, like a regular R matrix.
The row and column names of the filematrix fm are stored in
fmatrix.nmsrow.txt and fmatrix.nmscol.txt, respectively.
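Names can be assigned with the usual rownames and colnames replacement functions. The sketch below uses illustrative labels ("Row1", "Col1", and so on) and a tempdir() location, both of which are assumptions.

```r
library(filematrix)

# Assumption: scratch files under tempdir()
base = file.path(tempdir(), "fmatrix")
fm = fm.create(filenamebase = base, nrow = 200, ncol = 200)

# Illustrative labels; any character vectors of matching length work
rownames(fm) = paste0("Row", 1:nrow(fm))
colnames(fm) = paste0("Col", 1:ncol(fm))

# Read the names back from the filematrix object
rn = rownames(fm)
cn = colnames(fm)
print(head(rn, 3))
print(head(cn, 3))

closeAndDeleteFiles(fm)
```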
An open filematrix object can be closed with the close function. This
closes the internal file handle (connection). Closing filematrix objects
is optional; changes are not lost if the object is left open.
An existing filematrix can be opened with fm.open.
To prevent any changes to the values of the filematrix, set readonly = TRUE.
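The close and reopen cycle might look like this. It is a sketch under the same assumption of a tempdir() location, with the scratch files removed by unlink at the end.

```r
library(filematrix)

# Assumption: scratch files under tempdir()
base = file.path(tempdir(), "fmatrix")
fm = fm.create(filenamebase = base, nrow = 200, ncol = 200)
fm[1:3, 1:2] = 1:6

# Close the file handle; the values are already on disk
close(fm)

# Reopen the same files in read-only mode
fm = fm.open(filenamebase = base, readonly = TRUE)
kept = fm[1:3, 1:2]
print(kept)
close(fm)

# Remove the scratch files
unlink(paste0(base, c(".bmat", ".desc.txt")))
```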
An existing filematrix that fits in memory can be loaded in full
with fm.load.
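As a sketch, a small filematrix can be written with fm.create.from.matrix and then read back into memory with fm.load. The 3 x 4 example matrix and the tempdir() location are assumptions.

```r
library(filematrix)

# Assumption: scratch files under tempdir()
base = file.path(tempdir(), "fmatrix")

# Store a small in-memory matrix as a filematrix, then close it
fm = fm.create.from.matrix(filenamebase = base, mat = matrix(1:12, 3, 4))
close(fm)

# Load the whole filematrix into memory as a regular R matrix
mat = fm.load(filenamebase = base)
print(dim(mat))

# Remove the scratch files
unlink(paste0(base, c(".bmat", ".desc.txt")))
```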
The values of a filematrix are stored by columns, as with regular R matrices:
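The column-major layout can be illustrated with a 3 x 4 matrix filled with 1 to 12, which is what the printed matrix below shows. In this sketch the same matrix is stored as a filematrix and read back through vector indexing; the tempdir() location is an assumption.

```r
library(filematrix)

# A regular R matrix is filled column by column
mat = matrix(1:12, nrow = 3, ncol = 4)
print(mat)

# Assumption: scratch files under tempdir()
base = file.path(tempdir(), "fmatrix")
fm = fm.create.from.matrix(filenamebase = base, mat = mat)

# Vector indexing walks down column 1, then column 2, and so on
stored = fm[1:12]
print(stored)

closeAndDeleteFiles(fm)
```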
##      [,1] [,2] [,3] [,4]
## [1,]    1    4    7   10
## [2,]    2    5    8   11
## [3,]    3    6    9   12
Thus, accessing filematrix values by columns is much faster than accessing them by rows:
timerow = system.time( { sum(fm[1:10, ]) } )[3]
timecol = system.time( { sum(fm[ ,1:10]) } )[3]
cat("Reading ", nrow(fm)*10, " values from 10 columns takes ", timecol, " seconds", "\n",
"Reading ", ncol(fm)*10, " values from 10 rows takes ", timerow, " seconds", "\n", sep = "")
## Reading 2000 values from 10 columns takes 0.001 seconds
## Reading 2000 values from 10 rows takes 0.01 seconds
The performance difference may not be noticeable on a matrix as small as the one in this example. Change the size from 200 x 200 to 10,000 x 10,000 to see the difference (it is at least a hundredfold).
Unlike with regular R matrices, columns can be appended to the right side of a filematrix with very little computational overhead.
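Columns are appended with the appendColumns method of the filematrix object. The sketch below adds two columns whose values are chosen to agree with the output that follows; the tempdir() location is an assumption.

```r
library(filematrix)

# Assumption: scratch files under tempdir()
base = file.path(tempdir(), "fmatrix")
fm = fm.create(filenamebase = base, nrow = 200, ncol = 200)

before = dim(fm)
print(before)

# Append two columns of 200 values each to the right side
newcols = cbind(as.numeric(200:1), as.numeric(1:200))
fm$appendColumns(newcols)

after = dim(fm)
print(after)

# Last two rows of the two new columns
last = fm[199:200, 201:202]
print(last)

closeAndDeleteFiles(fm)
```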
## [1] 200 200
## [1] 200 202
##      [,1] [,2]
## [1,]    2  199
## [2,]    1  200
If you no longer need a filematrix and want to close the object and
delete its files from the hard drive, use closeAndDeleteFiles().
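A brief sketch of the cleanup step, checking that the files are gone afterwards. The 10 x 10 size and the tempdir() location are assumptions.

```r
library(filematrix)

# Assumption: scratch files under tempdir()
base = file.path(tempdir(), "fmatrix")
fm = fm.create(filenamebase = base, nrow = 10, ncol = 10)

# The data file exists while the filematrix is in use
existed = file.exists(paste0(base, ".bmat"))

# Close the object and delete its files
closeAndDeleteFiles(fm)

# Neither the data file nor the description file should remain
remaining = file.exists(paste0(base, c(".bmat", ".desc.txt")))
print(remaining)
```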
## R version 4.4.1 (2024-06-14)
## Platform: x86_64-pc-linux-gnu
## Running under: Ubuntu 24.04.1 LTS
##
## Matrix products: default
## BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
## LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.26.so; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=C
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Etc/UTC
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] filematrix_1.3 knitr_1.48
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.37 R6_2.5.1 fastmap_1.2.0 xfun_0.49
## [5] maketools_1.3.1 cachem_1.1.0 htmltools_0.5.8.1 rmarkdown_2.28
## [9] buildtools_1.0.0 lifecycle_1.0.4 cli_3.6.3 sass_0.4.9
## [13] jquerylib_0.1.4 compiler_4.4.1 sys_3.4.3 tools_4.4.1
## [17] evaluate_1.0.1 bslib_0.8.0 yaml_2.3.10 jsonlite_1.8.9
## [21] rlang_1.1.4