Title: | Fast Enrichment Analysis via Circular Permutations |
---|---|
Description: | Fast enrichment analysis for locally correlated statistics via circular permutations. The analysis can be performed at multiple significance thresholds for both primary and auxiliary data sets with efficient correction for multiple testing. |
Authors: | Andrey A Shabalin [aut, cre] , Edwin J C G van den Oord [aut] |
Maintainer: | Andrey A Shabalin <[email protected]> |
License: | LGPL-3 |
Version: | 1.5 |
Built: | 2024-11-04 21:43:20 UTC |
Source: | https://github.com/andreyshabalin/shiftr |
Fast enrichment analysis for locally correlated statistics via circular permutations. The analysis can be performed at multiple significance thresholds for both primary and auxiliary data sets with with efficient correction for multiple testing.
Package: | shiftR |
Type: | Package |
License: | LGPL-3 |
Depends: | methods |
Andrey A Shabalin [email protected]
See the main function shiftrPermBinary
for more info.
Run browseVignettes("shiftR")
for the vignette.
This functions calculates Cramer's V coefficient for overlap of two binary data sets.
cramerV(sum12, sum1, sum2, len)
cramerV(sum12, sum1, sum2, len)
len |
Total number of elements in each data set. |
sum1 |
Number of active features in data set 1. |
sum2 |
Number of active features in data set 1. |
sum12 |
Number of simultaneously active features in the data sets. |
Returns the Cramer's V coefficient.
The parameters can be single values or vectors.
Andrey A Shabalin [email protected]
# Zero score for perfect independence cramerV(100,10000,10000,1000000) # Positive score for increased overlap cramerV(150,10000,10000,1000000) # Negative score for decreased overlap cramerV( 50,10000,10000,1000000) # We can input a vector for sum12 cramerV(99:101,10000,10000,1000000)
# Zero score for perfect independence cramerV(100,10000,10000,1000000) # Positive score for increased overlap cramerV(150,10000,10000,1000000) # Negative score for decreased overlap cramerV( 50,10000,10000,1000000) # We can input a vector for sum12 cramerV(99:101,10000,10000,1000000)
This function performs enrichment analysis on two sets of matching test statistics. The circular permutation scheme accounts for possible local correlation of test statistcs. The testing is performed using the quantile thresholds provided for each data set.
For every permutation the enrichment is measure with Cramer's V coefficient. The maximum/minimum coefficient across all considered thresholds is recorded. It is then compared with the maximum/minimum coefficient observed without permuting the data.
For matching data sets calculated at different genomic locations
please use matchDatasets
.
enrichmentAnalysis( pvstats1, pvstats2, percentiles1 = NULL, percentiles2 = NULL, npermute, margin = 0.05, threads = 1)
enrichmentAnalysis( pvstats1, pvstats2, percentiles1 = NULL, percentiles2 = NULL, npermute, margin = 0.05, threads = 1)
pvstats1 |
The vector of statistics for primary data set. |
pvstats2 |
The vector of statistics for auxiliary data set. |
percentiles1 |
These quantile thresholds are used to cut off top results
in the primary data set
for matching with the top results in the auxiliary. |
percentiles2 |
Same as |
npermute |
Number of permutations to perform. |
margin |
The minimum offset in the circular permutation to consider. |
threads |
The number of CPU cores to use for calculations. |
Returns a list with:
overallPV |
The p-values for the overall test across all thresholds. |
byThresholdPV |
The p-values for tests for each individual threshold. |
Andrey A Shabalin [email protected]
### Data size n = 1e5 ### Generate vectors of test statistics with local correlation window = 1000 pvstats1 = diff(cumsum(runif(n+window)), lag = window) pvstats2 = diff(cumsum(runif(n+window)), lag = window) # Add a bit of dependence pvstats1 = pvstats1 + 0.5 * pvstats2 # test top 0.1, 1, 3, 5, and 10 percent percentiles1 = c(0.001, 0.01, 0.03, 0.05, 0.1) percentiles2 = c(0.001, 0.01, 0.03, 0.05, 0.1) # The offset margin margin = 0.05 # Set the number of permutations # to the maximum npermute = 1e3 enr = enrichmentAnalysis( pvstats1, pvstats2, percentiles1, percentiles2, npermute, margin , threads = 2) # View the results enr
### Data size n = 1e5 ### Generate vectors of test statistics with local correlation window = 1000 pvstats1 = diff(cumsum(runif(n+window)), lag = window) pvstats2 = diff(cumsum(runif(n+window)), lag = window) # Add a bit of dependence pvstats1 = pvstats1 + 0.5 * pvstats2 # test top 0.1, 1, 3, 5, and 10 percent percentiles1 = c(0.001, 0.01, 0.03, 0.05, 0.1) percentiles2 = c(0.001, 0.01, 0.03, 0.05, 0.1) # The offset margin margin = 0.05 # Set the number of permutations # to the maximum npermute = 1e3 enr = enrichmentAnalysis( pvstats1, pvstats2, percentiles1, percentiles2, npermute, margin , threads = 2) # View the results enr
This functions generate offsets for permutation analysis
with shiftrPermBinary
.
Random, uniformly spaced, and complete sets are available via
getOffsetsRandom
, getOffsetsUniform
, and
getOffsetsAll
functions respectively.
The function getNOffestsMax
calculates
the maximum number of permutations (given the margin).
getOffsetsRandom(n, npermute, margin = 0.05) getOffsetsUniform(n, npermute, margin = 0.05) getOffsetsAll(n, margin) getNOffsetsMax(n, margin)
getOffsetsRandom(n, npermute, margin = 0.05) getOffsetsUniform(n, npermute, margin = 0.05) getOffsetsAll(n, margin) getNOffsetsMax(n, margin)
n |
Number of features in the permuted sets. |
npermute |
The number of offsets to be generated (number of permutations). |
margin |
Offsets by less than |
Returns a set of permutation offsets for use in
shiftrPermBinary
function.
The set of offsets is
random for getOffsetsRandom
,
uniformly spaced for getOffsetsUniform
, or
all possible for getOffsetsAll
.
The function getNOffestsMax
returns the
maximum number of permutations (given the margin).
Andrey A Shabalin [email protected]
### Number of features, permutations, and margin n = 100 npermute = 20 margin = 0.1 ### Maximum number of permutations # Should be 81 (from 10 to 90) getNOffsetsMax(n, margin) ### Random offsets getOffsetsRandom(n, npermute, margin) ### Uniformly spaced offsets getOffsetsUniform(n, npermute, margin) ### All possible offsets getOffsetsAll(n, margin)
### Number of features, permutations, and margin n = 100 npermute = 20 margin = 0.1 ### Maximum number of permutations # Should be 81 (from 10 to 90) getNOffsetsMax(n, margin) ### Random offsets getOffsetsRandom(n, npermute, margin) ### Uniformly spaced offsets getOffsetsUniform(n, npermute, margin) ### All possible offsets getOffsetsAll(n, margin)
The goal of this function is to match records in the data sets for subsequent enrichment analysis.
For each record in the primary data set (data1
)
it finds the record in the auxiliary data set (data1
)
which overlap with it or lie within the flanking distance (flank
).
If multiple such auxiliary record are found,
we select the one with the center closest to
the center of the primary record.
If no such record is available, no matching is made for the primary record.
matchDatasets(data1, data2, flank = 0)
matchDatasets(data1, data2, flank = 0)
data1 |
A data frame with the primary data set, must have at least 4 columns:
|
data2 |
A data frame with the auxiliary data set. |
flank |
Allowed distance between matched records. |
Returns a list with matched data sets.
data1 |
The primary data sets without unmatched records. |
data2 |
The auxiliary data set records matching those in
|
For a technical reason, the chromosome positions are assumed to be
no greater than 1e9
.
Andrey A Shabalin [email protected]
data1 = read.csv(text = "chr,start,end,stat chr1,100,200,1 chr1,150,250,2 chr1,200,300,3 chr1,300,400,4 chr1,997,997,5 chr1,998,998,6 chr1,999,999,7") data2 = read.csv(text = "chr,start,end,stat chr1,130,130,1 chr1,140,140,2 chr1,165,165,3 chr1,200,200,4 chr1,240,240,5 chr1,340,340,6 chr1,350,350,7 chr1,360,360,8 chr1,900,900,9") # Match data sets exactly. matchDatasets(data1, data2, 0) # Match data sets with a flank. # The last records are now matched. matchDatasets(data1, data2, 100)
data1 = read.csv(text = "chr,start,end,stat chr1,100,200,1 chr1,150,250,2 chr1,200,300,3 chr1,300,400,4 chr1,997,997,5 chr1,998,998,6 chr1,999,999,7") data2 = read.csv(text = "chr,start,end,stat chr1,130,130,1 chr1,140,140,2 chr1,165,165,3 chr1,200,200,4 chr1,240,240,5 chr1,340,340,6 chr1,350,350,7 chr1,360,360,8 chr1,900,900,9") # Match data sets exactly. matchDatasets(data1, data2, 0) # Match data sets with a flank. # The last records are now matched. matchDatasets(data1, data2, 100)
This function performs very fast feature enrichment
permutation testing between two binary data sets.
Circular permutations are used instead of simple permutations to
preserve local dependence of test statistics.
The input data sets can be preprocessed with
shiftrPrepareLeft
and shiftrPrepareRight
functions.
shiftrPermBinary( left, right, offsets, alsoDoFisher = TRUE, returnPermOverlaps = FALSE)
shiftrPermBinary( left, right, offsets, alsoDoFisher = TRUE, returnPermOverlaps = FALSE)
left |
The first vector of binary (0/1) outcomes. |
right |
The second vector of binary (0/1) outcomes. |
offsets |
Vector of offsets, can be generated by |
alsoDoFisher |
If |
returnPermOverlaps |
If |
Returns a list with:
nfeatures |
Number of features in input data sets. |
lfeatures |
Number of active features in the left data set. |
rfeatures |
Number of active features in the right data set. |
overlap |
Number of features simultaneously active in both data sets. |
overlapUnderNull |
Expected value of |
enrichment |
Enrichment ratio, equal to |
permPVenrich |
Permutation p-value for enrichment (one-sided). |
permPVdeplete |
Permutation p-value for depletion (one-sided). |
permPV |
Permutation p-value for depletion (two-sided). |
permZ |
Permutation z-statistic,
calculated by fitting normal distribution to the |
fisherTest |
Fisher exact test, as output by |
fisherMat |
Input 2x2 matrix for Fisher exact test. |
overlapsPerm |
Vector of length |
Andrey A Shabalin [email protected]
This function essentially involves npermute
calls of
singlePermutation
function
and calculation of summary statistics and p-values.
### Number of features nf = 1e6 npermute = 10000 ### Generate data sets # The vector of a few common active feature to create dependence common = sample(c(0L,1L), size = nf, replace = TRUE, prob = c(0.999,0.001)) # Left and right data sets with the common active features lset = sample(c(0L,1L), size = nf, replace = TRUE, prob = c(0.8,0.2)) | common rset = sample(c(0L,1L), size = nf, replace = TRUE, prob = c(0.8,0.2)) | common offsets = getOffsetsUniform(n = nf, npermute = npermute) show(head(offsets)) show(tail(offsets)) z = shiftrPermBinary(lset, rset, offsets) show(z)
### Number of features nf = 1e6 npermute = 10000 ### Generate data sets # The vector of a few common active feature to create dependence common = sample(c(0L,1L), size = nf, replace = TRUE, prob = c(0.999,0.001)) # Left and right data sets with the common active features lset = sample(c(0L,1L), size = nf, replace = TRUE, prob = c(0.8,0.2)) | common rset = sample(c(0L,1L), size = nf, replace = TRUE, prob = c(0.8,0.2)) | common offsets = getOffsetsUniform(n = nf, npermute = npermute) show(head(offsets)) show(tail(offsets)) z = shiftrPermBinary(lset, rset, offsets) show(z)
The concept of circular permutations is symmetric with respect to the input data sets. The algorithm for circular permutation calculation is, however, not symmetric with respect to two datasets and thus the required data preprocessing is also different. For simplicity, we call the data sets 'left' and 'right'.
shiftrPrepareLeft(set) shiftrPrepareRight(set)
shiftrPrepareLeft(set) shiftrPrepareRight(set)
set |
A 0/1 vector defining selected (genomic) features.
The 'left' and 'right' sets must have equal length.
The enrichment of their overlap can be assessed w
ith |
Returns objects of class fcpLeft
and fcpRight
respectively.
The returned objects are used in singlePermutation
and
shiftrPermBinary
functions.
Andrey A Shabalin [email protected]
See codeshiftrPermBinary function and the respective example.
### Number of features nf = 1e6 ### Generate left and right sets lset = sample(c(0L,1L), size = nf, replace = TRUE) rset = sample(c(0L,1L), size = nf, replace = TRUE) # Prepare binary sets: lbin = shiftrPrepareLeft(lset) rbin = shiftrPrepareRight(rset) ### Check object sizes # Notice asymetry in binary object sizes object.size(lset) object.size(rset) object.size(lbin) object.size(rbin)
### Number of features nf = 1e6 ### Generate left and right sets lset = sample(c(0L,1L), size = nf, replace = TRUE) rset = sample(c(0L,1L), size = nf, replace = TRUE) # Prepare binary sets: lbin = shiftrPrepareLeft(lset) rbin = shiftrPrepareRight(rset) ### Check object sizes # Notice asymetry in binary object sizes object.size(lset) object.size(rset) object.size(lbin) object.size(rbin)
These functions generate two artificial data sets with local dependence of observations.
simulateNumeric(n, corWithin, corAcross = 0) simulateBinary(n, corWithin, corAcross = 0)
simulateNumeric(n, corWithin, corAcross = 0) simulateBinary(n, corWithin, corAcross = 0)
n |
Total number of elements in each data set. |
corWithin |
Correlation of adjacent observations within each data set. |
corAcross |
Correlation of observations across data sets. |
Returns the Cramer's V coefficient.
The simulateNumeric
function generates two data sets with elements
having standard normal distribution.
The simulateBinary
function generates data sets with 0/1 values
by thresholding the numeric data sets from simulateNumeric
.
The simulatePValues
function generates data sets of p-values
by applying pnorm
to the data sets
from simulateNumeric
.
Andrey A Shabalin [email protected]
n = 100000 sim = simulateNumeric(n, 0.5, 0.3) # Means should be close to 0 (zero) mean(sim$data1) mean(sim$data2) # Variances should be close to 1 var(sim$data1) var(sim$data2) # Correlation of adjacent observations # should be close to 0.5 cor(sim$data1[-1], sim$data1[-n]) cor(sim$data2[-1], sim$data2[-n]) # Correlation between data sets # should be close to 0.3 cor(sim$data1, sim$data2)
n = 100000 sim = simulateNumeric(n, 0.5, 0.3) # Means should be close to 0 (zero) mean(sim$data1) mean(sim$data2) # Variances should be close to 1 var(sim$data1) var(sim$data2) # Correlation of adjacent observations # should be close to 0.5 cor(sim$data1[-1], sim$data1[-n]) cor(sim$data2[-1], sim$data2[-n]) # Correlation between data sets # should be close to 0.3 cor(sim$data1, sim$data2)
This function performs fast feature overlap
count under a circular permutation.
The input data sets must be preprocessed with
shiftrPrepareLeft
and shiftrPrepareRight
functions.
singlePermutation(left, right, offset)
singlePermutation(left, right, offset)
left |
Feature set prepared with |
right |
Feature set prepared with |
offset |
Offset of one feature set relative to another.
See the example below for clarity. |
Returns count of feature overlap under a circular permutation.
Andrey A Shabalin [email protected]
### Number of features nf = 1e6 ### Generate left and right sets lset = sample(c(0L,1L), size = nf, replace = TRUE) rset = sample(c(0L,1L), size = nf, replace = TRUE) | lset # Prepare binary sets: lbin = shiftrPrepareLeft(lset) rbin = shiftrPrepareRight(rset) ### count feature overlap # R calculations overlapS = sum(lset & rset) # Binary calculations overlapF = singlePermutation(lbin, rbin, 0) message("Feature overlap: ", overlapS, " / ", overlapF, " (slow/fast count)") stopifnot( overlapS == overlapF ) ### Count overlap with offset offset = 2017 # R calculations overlapOS = sum(lset[ c((offset+1):nf, 1:offset)] & rset) # Binary calculations overlapOF = singlePermutation(lbin, rbin, offset) message("Feature overlap at offset: ", overlapOS, " / ", overlapOF, " (slow/fast count)") stopifnot( overlapOS == overlapOF )
### Number of features nf = 1e6 ### Generate left and right sets lset = sample(c(0L,1L), size = nf, replace = TRUE) rset = sample(c(0L,1L), size = nf, replace = TRUE) | lset # Prepare binary sets: lbin = shiftrPrepareLeft(lset) rbin = shiftrPrepareRight(rset) ### count feature overlap # R calculations overlapS = sum(lset & rset) # Binary calculations overlapF = singlePermutation(lbin, rbin, 0) message("Feature overlap: ", overlapS, " / ", overlapF, " (slow/fast count)") stopifnot( overlapS == overlapF ) ### Count overlap with offset offset = 2017 # R calculations overlapOS = sum(lset[ c((offset+1):nf, 1:offset)] & rset) # Binary calculations overlapOF = singlePermutation(lbin, rbin, offset) message("Feature overlap at offset: ", overlapOS, " / ", overlapOF, " (slow/fast count)") stopifnot( overlapOS == overlapOF )