Batch-robust CCA that operates directly in gene space instead of PCA space. For multi-slide data, uses average per-slide canonical correlation where each slide's contribution is normalized by its own score variance, preventing batch-level mean shifts from inflating the objective.
Usage
runGeneSpaceCCA(
object,
sigma,
nCC = 2,
clip = "quantile",
min_prevalence = 0.008,
min_cells = 20,
max_iter = 3000,
tol = 1e-06,
streaming = FALSE,
distanceArgs = list(),
kernelArgs = list(),
verbose = TRUE
)
# S4 method for class 'CoPro'
runGeneSpaceCCA(
object,
sigma,
nCC = 2,
clip = "quantile",
min_prevalence = 0.008,
min_cells = 20,
max_iter = 3000,
tol = 1e-06,
streaming = FALSE,
distanceArgs = list(),
kernelArgs = list(),
verbose = TRUE
)
# S4 method for class 'CoProMulti'
runGeneSpaceCCA(
object,
sigma,
nCC = 2,
clip = "quantile",
min_prevalence = 0.008,
min_cells = 20,
max_iter = 3000,
tol = 1e-06,
streaming = FALSE,
distanceArgs = list(),
kernelArgs = list(),
verbose = TRUE
)
Arguments
- object
A CoPro object with kernel matrices computed.
- sigma
Sigma value to use (single numeric value). Must match a sigma for which kernels were computed.
- nCC
Number of canonical components (default 2).
- clip
Clipping method for gene expression: "quantile" for the 98th percentile (default), or a numeric value for a fixed threshold (e.g., 1 for count data).
- min_prevalence
Minimum fraction of cells expressing a gene (default 0.008 = 0.8 percent).
- min_cells
Minimum number of cells expressing a gene (default 20).
- max_iter
Maximum iterations per component (default 3000).
- tol
Convergence tolerance (default 1e-6).
- streaming
Logical. If TRUE, fuse the distance, kernel, and covariance reduction steps into a per-slide loop and free the n x n matrices between slides. Bypasses computeDistance and computeKernelMatrix: object@distances and object@kernelMatrices are not populated. Default FALSE.
- distanceArgs
Named list of distance parameters passed through to the streaming path (e.g., distType, normalizeDistance, normalizeTarget, truncateLowDist, scaling factors, morphology-aware parameters). Ignored when streaming = FALSE.
- kernelArgs
Named list of kernel parameters passed through to the streaming path (e.g., lowerLimit, upperQuantile, normalizeKernel). Ignored when streaming = FALSE.
- verbose
Print progress messages (default TRUE).
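The gene filter and clipping arguments can be pictured with a small base-R sketch. It is not the package's implementation: the toy matrix is made up, and two details are assumptions (a gene must pass both min_prevalence and min_cells, and clip = "quantile" caps each gene at its own 98th percentile).

```r
# Toy cells x genes count matrix (hypothetical data, not produced by CoPro).
set.seed(1)
expr <- matrix(rpois(200 * 5, lambda = 0.5), nrow = 200, ncol = 5,
               dimnames = list(NULL, paste0("gene", 1:5)))
expr[, "gene5"] <- 0    # a gene that is never expressed
expr[1, "gene1"] <- 50  # one outlier count

min_prevalence <- 0.008
min_cells <- 20

# Number and fraction of cells expressing each gene.
n_expressing <- colSums(expr > 0)
prevalence <- n_expressing / nrow(expr)

# Assumption: a gene is kept only if it passes both thresholds.
keep <- prevalence >= min_prevalence & n_expressing >= min_cells
filtered <- expr[, keep, drop = FALSE]

# Assumption: clip = "quantile" caps each gene at its own 98th percentile.
clip_quantile <- function(x, p = 0.98) pmin(x, quantile(x, p, names = FALSE))
clipped_quantile <- apply(filtered, 2, clip_quantile)

# A numeric clip (here 1) caps every value at a fixed threshold.
clipped_fixed <- pmin(filtered, 1)
```

In this toy case gene5 is removed by the filter and the outlier in gene1 is pulled down by either clipping choice.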
Value
The CoPro object with gene weights in geneScores,
cell scores in cellScores, and weight vectors in skrCCAOut.
Details
Requires kernel matrices to already be computed via
computeKernelMatrix, except when streaming = TRUE, which bypasses
computeDistance and computeKernelMatrix. Does not require
computePCA.
The objective maximized is: $$f_{avg}(w) = \frac{1}{S} \sum_{s=1}^{S} \sum_{A<B} \frac{w_A^\top C_{AB}^{(s)} w_B}{\sigma_A^{(s)} \sigma_B^{(s)}}$$ where \(S\) is the number of slides, the inner sum runs over unordered pairs of cell types \(A < B\), \(C_{AB}^{(s)}\) is the cross-covariance for that pair on slide \(s\), and \(\sigma_A^{(s)} = \sqrt{w_A^\top C_{AA}^{(s)} w_A}\) is the per-slide score standard deviation.
Subsequent components use Gram-Schmidt deflation in weight space. The power iteration uses a frozen-sigma surrogate: the score standard deviations \(\sigma_A^{(s)}\) (not the kernel sigma argument) are held fixed at the previous iterate when computing each weight update. This makes the algorithm an ALS-style alternating maximization rather than exact coordinate ascent.
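To make the objective concrete, here is a minimal base-R sketch that evaluates \(f_{avg}(w)\) for given weights. The function name f_avg, the "A|B" key layout for the covariance blocks, and the toy covariances (plain sample covariances with no spatial kernel) are all assumptions for illustration and do not reflect the package's internal data structures.

```r
# Evaluate f_avg for a named list of weight vectors `w` (one per cell type).
# cov_list[[s]] holds the blocks of slide s, keyed "A|A" (self-covariance)
# and "A|B" (cross-covariance); this layout is hypothetical.
f_avg <- function(w, cov_list) {
  types <- names(w)
  pairs <- combn(types, 2, simplify = FALSE)
  per_slide <- vapply(cov_list, function(cv) {
    # Per-slide score standard deviation of each cell type.
    sd_s <- vapply(types, function(a) {
      sqrt(drop(crossprod(w[[a]], cv[[paste(a, a, sep = "|")]] %*% w[[a]])))
    }, numeric(1))
    # Sum of normalized cross terms over unordered pairs A < B.
    sum(vapply(pairs, function(p) {
      cross <- cv[[paste(p[1], p[2], sep = "|")]]
      num <- drop(crossprod(w[[p[1]]], cross %*% w[[p[2]]]))
      num / (sd_s[[p[1]]] * sd_s[[p[2]]])
    }, numeric(1)))
  }, numeric(1))
  mean(per_slide)
}

# Toy data: two slides, two cell types, four genes.
set.seed(42)
G <- 4
n <- 50
make_slide <- function() {
  xa <- scale(matrix(rnorm(n * G), n, G))
  xb <- scale(0.5 * xa + matrix(rnorm(n * G), n, G))
  list("A|A" = crossprod(xa) / n,
       "B|B" = crossprod(xb) / n,
       "A|B" = crossprod(xa, xb) / n)
}
covs <- list(make_slide(), make_slide())
w <- list(A = rnorm(G), B = rnorm(G))
val <- f_avg(w, covs)

# Rescaling every block of one slide leaves the objective unchanged,
# because each slide is normalized by its own score variance.
covs_scaled <- covs
covs_scaled[[2]] <- lapply(covs[[2]], function(m) 100 * m)
val_scaled <- f_avg(w, covs_scaled)
```

Here val and val_scaled agree up to floating-point error, which is the per-slide normalization property the objective is built around.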
Memory scales as \(O(G^2 \times S \times (C + C(C-1)/2))\) for precomputed covariance matrices: \(S\) slides, each storing \(C\) self-covariances and \(C(C-1)/2\) cross-covariances of size \(G \times G\), where \(G\) is the number of genes and \(C\) the number of cell types. For example, G = 5000, S = 10, C = 3 gives \(10 \times 6 = 60\) matrices of \(200\) MB each (double precision), approximately 12 GB.
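The estimate above can be reproduced with a few lines of arithmetic. The helper name below is hypothetical, and the sketch assumes dense double-precision storage (8 bytes per value) and decimal gigabytes.

```r
# Approximate memory for the precomputed G x G covariance matrices.
# Assumes dense doubles (8 bytes per value) and 1 GB = 1e9 bytes.
gscca_cov_memory_gb <- function(G, S, C, bytes_per_value = 8) {
  n_matrices <- S * (C + C * (C - 1) / 2)  # self + cross blocks per slide
  n_matrices * G^2 * bytes_per_value / 1e9
}

mem_gb <- gscca_cov_memory_gb(G = 5000, S = 10, C = 3)
mem_gb  # 12
```

Because the cost is quadratic in G, halving the number of genes that pass the filter cuts this figure by a factor of four.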
Per-slide gene handling: each (slide, cell type) expression matrix is independently centered and scaled. A gene with zero variance on a particular (slide, cell type) pair is set to zero on that slide and contributes nothing from it; the gene is still retained globally if it passes the prevalence filter, with its weight driven by other slides where it has variance. Slides where any requested cell type has fewer than 10 cells are dropped with a warning.
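The zero-variance rule can be illustrated with a small base-R helper. This is a sketch of the described behaviour, not the package's code; the function name scale_per_slide and the toy matrix are hypothetical.

```r
# Center and scale one (slide, cell type) cells x genes matrix.
# Genes with zero variance on this slide are set to zero instead of NaN.
scale_per_slide <- function(x) {
  centered <- sweep(x, 2, colMeans(x), "-")
  sds <- apply(x, 2, sd)
  ok <- is.finite(sds) & sds > 0
  out <- matrix(0, nrow(x), ncol(x), dimnames = dimnames(x))
  out[, ok] <- sweep(centered[, ok, drop = FALSE], 2, sds[ok], "/")
  out
}

set.seed(7)
x <- matrix(rnorm(30), nrow = 10, ncol = 3,
            dimnames = list(NULL, c("g1", "g2", "g3")))
x[, "g2"] <- 5  # constant on this slide, so zero variance
z <- scale_per_slide(x)
```

In z the column g2 is all zeros, so it adds nothing to this slide's covariance blocks, while g1 and g3 have mean 0 and standard deviation 1.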
Storage: gene-space CCA results live in @skrCCAOut under the
gscca_sigma_<value> key (distinct from runSkrCCA's
sigma_<value> keys) so the two CCA flavors do not collide.
computeGeneAndCellScores() only operates on runSkrCCA
(PCA-space) outputs; gene-space CCA already populates
@geneScores and @cellScores directly, so no further call
is needed.
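A quick way to see how the two key families stay separate is to filter names by prefix. The list below is a stand-in for object@skrCCAOut with placeholder contents, and the way the sigma value is rendered in the key ("0.1") is an assumption.

```r
# Stand-in for object@skrCCAOut; real entries hold CCA results, not strings.
skrCCAOut <- list(
  "sigma_0.1"       = "PCA-space result from runSkrCCA",
  "gscca_sigma_0.1" = "gene-space result from runGeneSpaceCCA"
)

# Anchored prefixes pick out each flavor without overlap.
gene_space_keys <- grep("^gscca_sigma_", names(skrCCAOut), value = TRUE)
pca_space_keys  <- grep("^sigma_", names(skrCCAOut), value = TRUE)
```

On a real object the same grep calls could be applied to names(object@skrCCAOut).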
Streaming mode (streaming = TRUE) reduces peak memory to roughly
one slide's pairwise n x n footprint at a time. Distance normalization
defaults to normalizationScope = "global" (a single factor across
all slides, matching the slot-based pipeline's semantics under
normalizeDistance = TRUE); under a fixed RNG seed the streaming
result is bit-identical to the slot-based path. Pass
distanceArgs = list(normalizationScope = "per_slide") to opt into
per-slide factors instead; this is useful when slides have different coordinate
scales, but be aware the per-slide percentile jitter can perturb
degenerate canonical components on heterogeneous datasets. Use
streaming = FALSE when downstream code reads
object@distances or object@kernelMatrices.
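A streaming call with per-slide normalization might look like the sketch below. It only uses arguments listed in Usage; the wrapper name run_streaming_gscca is hypothetical, and it assumes the package providing runGeneSpaceCCA is attached and that the object has been prepared as far as the streaming path needs. The wrapper is only defined here, not executed.

```r
# Hypothetical convenience wrapper around the documented streaming options.
run_streaming_gscca <- function(object, sigma) {
  runGeneSpaceCCA(
    object,
    sigma = sigma,
    nCC = 2,
    streaming = TRUE,  # object@distances / object@kernelMatrices stay empty
    distanceArgs = list(normalizationScope = "per_slide"),
    verbose = FALSE
  )
}
```

Dropping the distanceArgs entry keeps the default normalizationScope = "global", which is the setting described above as matching the slot-based path.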
See also
runSkrCCA(), computeKernelMatrix()
Other spatial-pipeline:
computeDistance(),
computeKernelMatrix(),
computePCA(),
runSkrCCA()