Function to obtain a SuperSig — get

Generates a tissue-specific SuperSig for a given dataset of mutations and an exposure factor. Returns the SuperSig together with a classification model trained on it.

get_signature(data, factor, wgs = FALSE)

Arguments

data	a data frame of mutations containing columns for `sample_id`, `age`, `IndVar`, and the 96 trinucleotide mutations (see vignette for details)
factor	the factor/exposure (e.g. "age", "smoking"). If factor = "age", the SuperSig is computed using mutation counts; otherwise, mutation rates (counts/age) are used.
wgs	logical value indicating whether sequencing data is whole-genome (wgs = `TRUE`) or whole-exome (wgs = `FALSE`)
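
To make the expected layout of `data` and the counts-versus-rates distinction concrete, here is a minimal base R sketch. The toy data frame `toy` (which carries only two of the 96 mutation columns) and the helper `to_rates` are hypothetical and are not part of the package. The sketch only illustrates what dividing counts by age means; it does not claim to reproduce what `get_signature` does internally.

```r
# Build the 96 trinucleotide labels in the "A[C>G]A" style:
# 4 upstream bases x 6 substitution types x 4 downstream bases.
bases <- c("A", "C", "G", "T")
subs  <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
grid  <- expand.grid(five = bases, sub = subs, three = bases,
                     stringsAsFactors = FALSE)
labels <- paste0(grid$five, "[", grid$sub, "]", grid$three)

# Hypothetical toy input with only two of the 96 mutation columns.
toy <- data.frame(
  sample_id = 1:3,
  age       = c(50, 55, 72),
  IndVar    = c(1, 0, 1),
  `A[C>G]A` = c(2, 0, 4),
  `T[C>T]G` = c(10, 11, 18),
  check.names = FALSE  # keep the bracketed column names unchanged
)

# Hypothetical helper: divide each mutation count by the sample's age.
to_rates <- function(df) {
  mut_cols <- setdiff(names(df), c("sample_id", "age", "IndVar"))
  df[mut_cols] <- lapply(df[mut_cols], function(counts) counts / df$age)
  df
}

rates <- to_rates(toy)
```

With `factor = "age"` the raw counts in `toy` would be the relevant quantity; for any other factor, the age-normalised values in `rates` correspond to the "counts/age" described above.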

Value

`get_signature` returns an object of class `SuperSig`.

Examples


head(example_dt) # use example data from package
#>   sample_id age chromosome  position ref alt
#> 1         1  50       chr1  94447621   G   C
#> 2         1  50       chr2 202005395   A   C
#> 3         1  50       chr7  20784978   T   A
#> 4         1  50       chr7  87179255   C   G
#> 5         1  50      chr19   1059712   G   T
#> 6         2  55       chr1  76226977   T   C
input_dt <- make_matrix(example_dt) # convert to correct format
input_dt$IndVar <- c(1, 1, 1, 0, 0) # add IndVar column
get_signature(data = input_dt, factor = "Age") # get SuperSig
#> Begin feature engineering...
#> Begin cross-validated selection over 4 features and 15 inner folds...
#> ...testing inner fold 1
#> ...testing inner fold 2
#> ...testing inner fold 3
#> ...testing inner fold 4
#> ...testing inner fold 5
#> ...testing inner fold 6
#> ...testing inner fold 7
#> ...testing inner fold 8
#> ...testing inner fold 9
#> ...testing inner fold 10
#> ...testing inner fold 11
#> ...testing inner fold 12
#> ...testing inner fold 13
#> ...testing inner fold 14
#> ...testing inner fold 15
#> Signature:
#> # A tibble: 1 x 1
#>       X1
#>    <dbl>
#> 1 0.0396
#> Features:
#> $X1
#>       F21       F22       F23      F216      F217      F218      F219      F232 
#> "A[C>G]A" "A[C>G]C" "A[C>G]G" "A[C>T]A" "A[C>T]C" "A[C>T]G" "A[C>T]T" "A[T>A]A" 
#>      F233      F234      F235      F247      F248      F249      F250      F263 
#> "A[T>A]C" "A[T>A]G" "A[T>A]T" "A[T>C]A" "A[T>C]C" "A[T>C]G" "A[T>C]T" "A[T>G]A" 
#>      F264      F265      F266       F24       F25       F26       F27      F220 
#> "A[T>G]C" "A[T>G]G" "A[T>G]T" "C[C>G]A" "C[C>G]C" "C[C>G]G" "C[C>G]T" "C[C>T]A" 
#>      F221      F222      F223      F236      F237      F238      F251      F252 
#> "C[C>T]C" "C[C>T]G" "C[C>T]T" "C[T>A]C" "C[T>A]G" "C[T>A]T" "C[T>C]A" "C[T>C]C" 
#>      F253      F254      F267      F268      F269      F270       F28       F29 
#> "C[T>C]G" "C[T>C]T" "C[T>G]A" "C[T>G]C" "C[T>G]G" "C[T>G]T" "G[C>G]A" "G[C>G]C" 
#>      F210      F211      F224      F225      F226      F227      F239      F240 
#> "G[C>G]G" "G[C>G]T" "G[C>T]A" "G[C>T]C" "G[C>T]G" "G[C>T]T" "G[T>A]A" "G[T>A]C" 
#>      F241      F242      F255      F256      F257      F258      F271      F272 
#> "G[T>A]G" "G[T>A]T" "G[T>C]A" "G[T>C]C" "G[T>C]G" "G[T>C]T" "G[T>G]A" "G[T>G]C" 
#>      F273      F274      F212      F213      F214      F215      F228      F229 
#> "G[T>G]G" "G[T>G]T" "T[C>G]A" "T[C>G]C" "T[C>G]G" "T[C>G]T" "T[C>T]A" "T[C>T]C" 
#>      F230      F231      F243      F244      F245      F246      F259      F260 
#> "T[C>T]G" "T[C>T]T" "T[T>A]A" "T[T>A]C" "T[T>A]G" "T[T>A]T" "T[T>C]A" "T[T>C]C" 
#>      F261      F262      F275      F276      F277      F278 
#> "T[T>C]G" "T[T>C]T" "T[T>G]A" "T[T>G]C" "T[T>G]G" "T[T>G]T" 
#> 
#> Model:
#> $Logit
#> 
#> Call:  glm(formula = IndVar ~ ., family = binomial(), data = x)
#> 
#> Coefficients:
#> (Intercept)           X1  
#>      -1.773        1.079  
#> 
#> Degrees of Freedom: 4 Total (i.e. Null);  3 Residual
#> Null Deviance:	    6.73 
#> Residual Deviance: 5.384 	AIC: 9.384
#>
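
The `$Logit` model printed above is a logistic regression of `IndVar` on the signature feature `X1`. As a rough sketch of how such a model maps a feature value to a predicted probability, the printed coefficients can be plugged into the inverse logit by hand. The values in `x1` are hypothetical and chosen only for illustration; with a fitted model object in hand, calling `predict()` with `type = "response"` would be the usual route.

```r
# Coefficients copied from the printed model output above.
intercept <- -1.773
slope     <- 1.079

# Hypothetical values of the signature feature X1 (illustrative only).
x1 <- c(0, 1, 2, 3)

# Inverse logit: P(IndVar = 1 | X1) = 1 / (1 + exp(-(intercept + slope * X1)))
prob <- plogis(intercept + slope * x1)
round(prob, 3)
```

Because the slope is positive, larger values of `X1` give a higher predicted probability that `IndVar` equals 1.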