gnocis¶
Gnocis is divided into a number of modules, each with a distinct function.
gnocis.biomarkers module¶
-
class
gnocis.biomarkers.
biomarkers
¶ Bases:
object
The biomarkers class represents a set of biomarker regions. Multiple region sets marking a phenomenon of interest can be registered in a biomarkers object.
Parameters: - name (str) – Name of the biomarker set.
- regionSets (list, optional) – Region sets to register as biomarkers. Defaults to [].
- positiveThreshold (int, optional) – Threshold for Highly BioMarker-Enriched (HBME) loci. Defaults to -1.
- negativeThreshold (int, optional) – Threshold for Lowly BioMarker-Enriched (LBME) loci. Defaults to -1.
After constructing a biomarkers object, for a set of biomarkers BM, len(BM) gives the number of markers, BM['FACTOR_NAME'] gives the merged set of regions for factor FACTOR_NAME, and [ x for x in BM ] gives the list of merged regions per biomarker contained in BM.
-
HBMEs
()¶ Gets highly biomarker-enriched regions of a set. This is determined either by an optional threshold argument, or, if unspecified, by the positiveThreshold member. The resulting set is the merged biomarker spectrum for enrichment levels greater than or equal to the threshold.
Parameters: rs (regions) – Region set to extract enriched subset of. Returns: Highly BioMarker-Enriched regions (HBMEs). Return type: regions
-
LBMEs
()¶ Gets lowly biomarker-enriched regions of a set. This is determined either by an optional threshold argument, or, if unspecified, by the negativeThreshold member. The resulting set is the merged biomarker spectrum for enrichment levels less than or equal to the threshold.
Parameters: rs (regions) – Region set to extract the lowly enriched subset of. Returns: Lowly BioMarker-Enriched regions (LBMEs). Return type: regions
-
add
()¶ Adds a biomarker to the set, by name and region set. The name of the region set is taken as the name of the biomarker.
Parameters: rs (regions) – Region set to add as biomarker.
-
biomarkers
¶
-
enrichmentSpectrum
()¶ Returns the biomarker spectrum. This is a dictionary with numerical keys, containing subsets enriched in N biomarkers, for every valid N as key.
Parameters: rs (regions) – Region set to compute the biomarker spectrum for. Returns: Biomarker spectrum Return type: dict
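The spectrum can be pictured as grouping query regions by the number of biomarkers that overlap them. The following is a minimal standalone sketch of that idea, not the Gnocis implementation: regions are plain (start, end) tuples on a single sequence with inclusive ends, and `overlaps` and `enrichment_spectrum` are hypothetical helper names.

```python
from collections import defaultdict

def overlaps(a, b):
    """True if inclusive intervals a and b share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def enrichment_spectrum(query, biomarkers):
    """Group query regions by the number of biomarkers overlapping them."""
    spectrum = defaultdict(list)
    for region in query:
        n = sum(
            1 for marker_regions in biomarkers.values()
            if any(overlaps(region, m) for m in marker_regions)
        )
        spectrum[n].append(region)
    return dict(spectrum)

# Hypothetical example data: two biomarkers and three query regions.
biomarkers = {
    "factorA": [(100, 200), (500, 600)],
    "factorB": [(150, 250)],
}
query = [(120, 180), (520, 540), (900, 950)]
spectrum = enrichment_spectrum(query, biomarkers)

# An HBME-style selection: regions enriched in at least 2 biomarkers.
highly_enriched = [r for n, rs in spectrum.items() if n >= 2 for r in rs]
```

Selecting keys at or above a positive threshold mirrors the HBMEs description, and selecting keys at or below a negative threshold mirrors LBMEs.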
-
name
¶
-
negativeThreshold
¶
-
positiveThreshold
¶
gnocis.common module¶
-
gnocis.common.
CI
()¶ Calculates a confidence interval of the mean for a set of values. By default, Gnocis calculates a 95% confidence interval with a normal distribution. It is recommended to replace this with an appropriate distribution depending on the analysis performed.
Parameters: X (list) – Values. Returns: Confidence interval difference from the mean. Return type: float
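As an illustration of a 95% confidence interval of the mean under a normal approximation, the half-width can be computed as 1.96 times the standard error. This is a sketch under that assumption and may differ in detail from what Gnocis does internally (for example in the choice of sample versus population standard deviation); `ci95_normal` is a hypothetical name.

```python
import math
import statistics

def ci95_normal(values):
    """Half-width of a 95% CI of the mean, using a normal approximation."""
    # Sample standard deviation (n - 1 denominator) is an assumption here.
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return 1.96 * standard_error

values = [0.81, 0.79, 0.84, 0.80, 0.82]
half_width = ci95_normal(values)
print(f"{statistics.mean(values):.3f} +/- {half_width:.3f}")
```

For small samples, a Student's t quantile is usually more appropriate than 1.96, which is one reason to replace the default.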
-
gnocis.common.
KLdiv
()¶
-
gnocis.common.
SE
()¶
-
gnocis.common.
getFloat
()¶ Helper function to get floating point values of the same precision as used for float by Cython.
Parameters: v (float) – Value.
-
gnocis.common.
getReverseComplementaryDNASequence
()¶ Returns the reverse complement of a DNA sequence.
Parameters: seq (str) – Sequence to get reverse complement of.
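For reference, a reverse complement can be computed in plain Python as below. This is a standalone sketch rather than the Gnocis function itself; `reverse_complement` is a hypothetical name, and the handling of N and lowercase letters is an assumption.

```python
# Translation table for complementing nucleotides; N and lowercase are kept.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")

def reverse_complement(seq):
    """Complement every nucleotide, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

print(reverse_complement("GATTACA"))  # TGTAATC
```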
-
gnocis.common.
mean
()¶
-
gnocis.common.
setConfidenceIntervalFunction
()¶ Sets the function for calculating confidence intervals.
Parameters: CIfunc (function) – Function that takes a set of values and outputs the confidence interval difference from the mean. A symmetric confidence interval is assumed.
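A replacement function only needs to accept a collection of values and return the symmetric half-width around the mean. The sketch below builds such a function for an arbitrary confidence level using the standard library; `make_ci_function` is a hypothetical helper, and the normal approximation is an assumption.

```python
import math
import statistics

def make_ci_function(level=0.95):
    """Return a function computing the CI half-width at the given level."""
    # Two-sided z quantile, e.g. about 1.96 for level=0.95.
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2.0)

    def ci(values):
        if len(values) < 2:
            return 0.0  # no spread can be estimated from fewer than 2 values
        return z * statistics.stdev(values) / math.sqrt(len(values))

    return ci

ci99 = make_ci_function(0.99)
print(ci99([0.81, 0.79, 0.84, 0.80, 0.82]))

# With Gnocis installed, a function like this would presumably be registered
# by passing it as the CIfunc argument:
#   gnocis.common.setConfidenceIntervalFunction(ci99)
```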
-
gnocis.common.
setSeed
()¶ Sets the random seed.
Parameters: seed (int) – Random seed.
-
gnocis.common.
std
()¶
-
gnocis.common.
useSciPyConfidenceIntervals
()¶ Sets the function for calculating confidence intervals.
Parameters: CIfunc (function) – Function that takes a set of values and outputs the confidence interval difference from the mean. A symmetric confidence interval is assumed.
gnocis.features module¶
-
class
gnocis.features.
feature
¶ Bases:
object
The feature class is a base class for sequence features.
-
cachedSequence
¶
-
cachedValue
¶
-
-
class
gnocis.features.
featureMotifOccurrenceFrequency
¶ Bases:
gnocis.features.feature
Occurrence frequency feature for a motif.
Parameters: _motif (motif) – The motif that the feature is for. -
get
()¶ Extracts the feature value from a sequence.
Parameters: seq (sequence) – The sequence to extract the feature value from. Returns: Feature value Return type: float
-
m
¶
-
-
class
gnocis.features.
featurePREdictorMotifPairOccurrenceFrequency
¶ Bases:
gnocis.features.feature
PREdictor pair occurrence frequency feature for two motifs and a distance cutoff.
Parameters: - motifA (motif) – The first motif.
- motifB (motif) – The second motif.
- distanceCutoff (int) – Pairing cutoff distance. The cutoff applies to the number of nucleotides between occurrences of motifA and motifB.
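To illustrate how a distance cutoff on the nucleotides between two occurrences can be applied, the sketch below counts motif pairs in plain Python. It is not the Gnocis implementation: it searches the forward strand only, uses literal patterns, skips overlapping occurrences, and returns a raw count without any frequency normalization. `find_occurrences` and `count_pairs` are hypothetical names.

```python
import re

def find_occurrences(pattern, seq):
    """Return (start, end) for every match, end inclusive, overlaps included."""
    return [(m.start(), m.start() + len(m.group(1)) - 1)
            for m in re.finditer(f"(?=({pattern}))", seq)]

def count_pairs(occ_a, occ_b, distance_cutoff):
    """Count A/B occurrence pairs separated by at most distance_cutoff nt."""
    pairs = 0
    for a_start, a_end in occ_a:
        for b_start, b_end in occ_b:
            if a_end < b_start:
                gap = b_start - a_end - 1   # nucleotides strictly between
            elif b_end < a_start:
                gap = a_start - b_end - 1
            else:
                continue  # overlapping occurrences are skipped in this sketch
            if gap <= distance_cutoff:
                pairs += 1
    return pairs

seq = "GAGAG" + "T" * 5 + "GCCAT" + "A" * 20 + "GAGAG"
occ_a = find_occurrences("GAGAG", seq)
occ_b = find_occurrences("GCCAT", seq)
print(count_pairs(occ_a, occ_b, distance_cutoff=10))
```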
-
distCut
¶
-
get
()¶ Extracts the feature value from a sequence.
Parameters: seq (sequence) – The sequence to extract the feature value from. Returns: Feature value Return type: float
-
mA
¶
-
mB
¶
-
class
gnocis.features.
featureScaler
¶ Bases:
gnocis.features.features
The featureScaler class scales an input feature set, based on a binary training set.
Parameters:
-
class
gnocis.features.
features
¶ Bases:
object
The features class represents a set of features.
Parameters: - name (str) – Name of the feature set
- f (list, optional) – Features to include. Defaults to [].
After constructing a features object, for a set of features FS, len(FS) gives the number of features, FS[i] gives feature with index i, and [ x for x in FS ] gives the list of feature objects. Two features objects A and B can be added together with A+B, yielding a concatenated set of features.
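The container behaviour described above maps onto Python's `__len__`, `__getitem__`, `__iter__` and `__add__` special methods. The class below is a simplified stand-in written for illustration, not the Gnocis class; `FeatureSet` and the naming of the concatenated set are assumptions.

```python
class FeatureSet:
    """Minimal stand-in for a named, ordered collection of features."""

    def __init__(self, name, features=None):
        self.name = name
        # Copy into a fresh list so instances never share a mutable default.
        self.features = list(features) if features is not None else []

    def __len__(self):
        return len(self.features)

    def __getitem__(self, i):
        return self.features[i]

    def __iter__(self):
        return iter(self.features)

    def __add__(self, other):
        # Concatenation keeps the order: features of self, then of other.
        return FeatureSet(f"{self.name}+{other.name}",
                          self.features + other.features)

A = FeatureSet("A", ["f1", "f2"])
B = FeatureSet("B", ["f3"])
C = A + B
print(len(C), C[0], [x for x in C])
```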
-
diffsummary
()¶
-
features
¶
-
getAll
()¶ Extracts the feature values from a sequence.
Parameters: seq (sequence) – The sequence to extract the feature value from. Returns: Feature vector Return type: list
-
static
kSpectrum
()¶ Constructs the k-mer spectrum kernel, the occurrence frequencies of all motifs of length k, with no ambiguous positions.
Parameters: k (int) – k-mer length. Returns: Feature set Return type: features
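The k-mer spectrum idea can be sketched without Gnocis as follows: enumerate all 4^k unambiguous k-mers and count how often each occurs in a sequence. Normalizing by the number of windows is an assumption made here for illustration, and `k_spectrum` is a hypothetical function name.

```python
from itertools import product

def k_spectrum(seq, k):
    """Occurrence frequency of every unambiguous k-mer, in a fixed order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    n_windows = max(len(seq) - k + 1, 0)
    for i in range(n_windows):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows with ambiguous letters such as N are skipped
            counts[kmer] += 1
    return [counts[kmer] / n_windows if n_windows else 0.0 for kmer in kmers]

vector = k_spectrum("ACGT", 2)
print(len(vector))  # 16 features for k = 2
```

The feature vector has 4^k entries, so the dimensionality grows quickly with k.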
-
static
kSpectrumGPS
()¶ Constructs the k-mer spectrum graded position stranded kernel, the occurrence frequencies of all motifs of length k, with position and strand represented.
Parameters: k (int) – k-mer length. Returns: Feature set Return type: features
-
static
kSpectrumMM
()¶ Constructs the k-mer spectrum mismatch kernel, the occurrence frequencies of all motifs of length k, with one allowed mismatch in an arbitrary position.
Parameters: k (int) – k-mer length. Returns: Feature set Return type: features
-
static
kSpectrumMMD
()¶ Constructs the k-mer spectrum mismatch kernel, the occurrence frequencies of all motifs of length k, with one allowed mismatch in an arbitrary position. This formulation counts the base k-mer k times, and mutated k-mers one time.
Parameters: k (int) – k-mer length. Returns: Feature set Return type: features
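One way to read the two mismatch formulations is sketched below. In the plain variant, each window contributes one count to its own k-mer and one to every k-mer that differs in exactly one position. In the weighted variant, every position is substituted with all four nucleotides, so the unmodified k-mer is produced once per position (k times in total) and each mutated k-mer once. This is an interpretation of the descriptions above, not the Gnocis code, and `neighbours` and `mismatch_spectrum` are hypothetical names.

```python
from collections import Counter

def neighbours(kmer, include_identity):
    """Yield k-mers obtained by substituting a single position."""
    for i in range(len(kmer)):
        for nt in "ACGT":
            if nt != kmer[i] or include_identity:
                yield kmer[:i] + nt + kmer[i + 1:]

def mismatch_spectrum(seq, k, weighted=False):
    """Count k-mers with one allowed mismatch in an arbitrary position."""
    counts = Counter()
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if set(window) - set("ACGT"):
            continue  # skip windows containing ambiguous nucleotides
        if weighted:
            # Identity substitution at each position: base k-mer counted k times.
            counts.update(neighbours(window, True))
        else:
            counts[window] += 1
            counts.update(neighbours(window, False))
    return counts

plain = mismatch_spectrum("ACG", 3)
weighted = mismatch_spectrum("ACG", 3, weighted=True)
print(plain["ACG"], weighted["ACG"])
```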
-
static
motifPairSpectrum
()¶ Constructs a feature set for the PREdictor motif pair occurrence spectrum, given a set of motifs.
Parameters: motifs (motifs) – The motifs to use for the feature space. Returns: Feature set Return type: features
-
static
motifSpectrum
()¶ Constructs a feature set for the motif occurrence spectrum, given a set of motifs.
Parameters: motifs (motifs) – The motifs to use for the feature space. Returns: Feature set Return type: features
-
name
¶
-
nvalues
¶
-
scale
()¶
-
summary
()¶
-
table
()¶
-
values
¶
-
class
gnocis.features.
kSpectrum
¶ Bases:
object
Feature set that extracts the occurrence frequencies of all unambiguous motifs of length k. Extraction of the entire spectrum is optimized by taking overlaps of motifs into account.
Parameters: nspectrum (int) – Length of motifs, k, to use. -
bitmask
¶
-
cachedSequence
¶
-
cachedSpectrum
¶
-
features
¶
-
kmerByIndex
¶
-
nFeatures
¶
-
nspectrum
¶
-
-
class
gnocis.features.
kSpectrumFeatureGPS
¶ Bases:
gnocis.features.feature
-
get
()¶ Extracts the feature value from a sequence.
Parameters: seq (sequence) – The sequence to extract the feature value from. Returns: Feature value Return type: float
-
index
¶
-
kmer
¶
-
parent
¶
-
section
¶
-
-
class
gnocis.features.
kSpectrumGPS
¶ Bases:
object
Feature set that extracts the occurrence frequencies of all motifs of length k, position and strand represented. To represent position and strandedness, four features are generated per k-mer: one for each combination of strandedness and sequence end for proximity. Graded distance to each end of the sequence is used in order to represent position. Extraction of the entire spectrum is optimized by taking overlaps of motifs into account.
Parameters: nspectrum (int) – Length of motifs, k, to use. -
bitmask
¶
-
cachedSequence
¶
-
cachedSpectrum
¶
-
features
¶
-
kmerByIndex
¶
-
nFeatures
¶
-
nkmers
¶
-
nspectrum
¶
-
-
class
gnocis.features.
kSpectrumMM
¶ Bases:
object
Feature set that extracts the occurrence frequencies of all motifs of length k, with one mismatch allowed in an arbitrary position. Extraction of the entire spectrum is optimized by taking overlaps of motifs into account.
Parameters: nspectrum (int) – Length of motifs, k, to use. -
bitmask
¶
-
cachedSequence
¶
-
cachedSpectrum
¶
-
features
¶
-
kmerByIndex
¶
-
nFeatures
¶
-
nspectrum
¶
-
-
class
gnocis.features.
kSpectrumMMD
¶ Bases:
object
Feature set that extracts the occurrence frequencies of all motifs of length k, with one mismatch allowed in an arbitrary position. This formulation counts the base k-mer k times, and mutated k-mers one time. Extraction of the entire spectrum is optimized by taking overlaps of motifs into account.
Parameters: nspectrum (int) – Length of motifs, k, to use. -
bitmask
¶
-
cachedSequence
¶
-
cachedSpectrum
¶
-
features
¶
-
kmerByIndex
¶
-
nFeatures
¶
-
nspectrum
¶
-
-
class
gnocis.features.
kSpectrumMMDFeature
¶ Bases:
gnocis.features.feature
-
get
()¶ Extracts the feature value from a sequence.
Parameters: seq (sequence) – The sequence to extract the feature value from. Returns: Feature value Return type: float
-
index
¶
-
kmer
¶
-
parent
¶
-
-
class
gnocis.features.
scaledFeature
¶ Bases:
gnocis.features.feature
The scaledFeature class scales and shifts an input feature.
Parameters: - _feature (feature) – The feature to scale.
- vScale (float) – Value to scale by.
- vSub (float) – Shifting.
-
feature
¶
-
get
()¶ Extracts the feature value from a sequence.
Parameters: seq (sequence) – The sequence to extract the feature value from. Returns: Feature value Return type: float
-
vScale
¶
-
vSub
¶
gnocis.models module¶
-
class
gnocis.models.
CVModelPredictions
¶ Bases:
object
Helper class that represents cross-validated model predictions.
Parameters: - model (sequenceModel) – Model.
- labelPredictionCurves (dict) – Labelled prediction curves.
-
getStatTable
¶
-
predictCore
¶
-
regions
¶
-
setCorePredictions
¶
-
gnocis.models.
createDummyPREdictorModel
()¶
-
gnocis.models.
crossvalidate
()¶
-
class
gnocis.models.
crossvalidation
¶ Bases:
object
Helper class for cross-validations. Accepts binary training and validation sets and constructs cross-validation sets for a desired number of repeats. If a separate validation set is not given, the training set is used. The cross-validation set for each repeat contains numbers of training and test sequences determined by a training-to-testing sequence ratio, as well as a negative-per-positive test sequence ratio. When constructing the validation set, identities are checked for against the training set, to avoid contamination (will not work if sequences are cloned). Holds models and validation statistics. Integrates with terminal and IPython for visualization of results. Stores training and testing data, so new models can be added.
Parameters: - models (list) – List of models to cross-validate.
- tpos (sequences) – Positive training set.
- tneg (sequences) – Negative training set.
- vpos (sequences, optional) – Positive validation set.
- vneg (sequences, optional) – Negative validation set.
- repeats (int, optional) – Number of experimental repeats. Default = 20.
- ratioTrainTest (float, optional) – Ratio of training to testing sequences. Default = 80%.
- ratioNegPos (float, optional) – Ratio of validation negatives to positives. Default = 100.
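The construction of per-repeat cross-validation sets can be sketched as follows. This is an illustration of the general scheme only: it interprets ratioTrainTest as the fraction of sequences used for training and ratioNegPos as the maximum number of test negatives per test positive, both of which are assumptions, and `make_cv_sets` is a hypothetical name.

```python
import random

def make_cv_sets(pos, neg, repeats=20, ratio_train_test=0.8,
                 ratio_neg_pos=100, seed=0):
    """Build one shuffled train/test split per repeat."""
    rng = random.Random(seed)
    sets = []
    for _ in range(repeats):
        p, n = pos[:], neg[:]
        rng.shuffle(p)
        rng.shuffle(n)
        n_train_pos = int(len(p) * ratio_train_test)
        n_train_neg = int(len(n) * ratio_train_test)
        test_pos = p[n_train_pos:]
        # Cap test negatives at ratio_neg_pos per test positive.
        n_test_neg = min(len(n) - n_train_neg,
                         int(len(test_pos) * ratio_neg_pos))
        sets.append({
            "train_pos": p[:n_train_pos],
            "train_neg": n[:n_train_neg],
            "test_pos": test_pos,
            "test_neg": n[n_train_neg:n_train_neg + n_test_neg],
        })
    return sets

pos = [f"pos{i}" for i in range(10)]
neg = [f"neg{i}" for i in range(100)]
cv = make_cv_sets(pos, neg, repeats=3, ratio_neg_pos=5)
print({key: len(value) for key, value in cv[0].items()})
```

Because training and test sequences are taken from disjoint slices of the same shuffled list, no sequence appears on both sides of a split.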
-
addModel
¶
-
calibrate
¶
-
getAUCTable
¶
-
getConfigurationTable
¶
-
plotPRC
¶
-
plotROC
¶
-
predict
¶
-
class
gnocis.models.
sequenceModel
¶ Bases:
object
The sequenceModel class is an abstract class for sequence models. A number of methods are implemented for machine learning and prediction with DNA sequences.
Parameters: - name (string) – Name of model.
- enableMultiprocessing (bool) – If True (default), multiprocessing is enabled.
-
calibrateGenomewidePrecision
¶ Calibrates the model threshold for an expected precision genome-wide. Returns self to facilitate chaining of operations. However, this operation does mutate the model object. A scaling factor can be applied to the genome with the ‘factor’ argument. If, for instance, the positive set has been divided in half for independent training and calibration, a factor of 0.5 can be used.
Parameters: - positives (sequences/sequenceStream) – Positive sequences.
- genome (sequences/sequenceStream/str) – The genome.
- factor (float, optional) – Scaling factor for positive set versus genome.
- precision (float, optional) – The precision to approximate.
Returns: The model itself (self), with its threshold calibrated
Return type: sequenceModel
-
getConfusionMatrix
¶ Calculates and returns a confusion matrix for sets of positive and negative sequences.
Parameters: - seqs (sequences) – Sequences.
- labelPositive (sequenceLabel) – Label of positives.
- labelNegative (sequenceLabel) – Label of negatives.
Returns: Confusion matrix
Return type: dict
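For orientation, a confusion matrix at a fixed score threshold can be computed as below. The dictionary keys and the function name `confusion_matrix` are assumptions made for this sketch and may not match the keys Gnocis returns.

```python
def confusion_matrix(scores, labels, threshold):
    """Count TP, FP, TN and FN when predicting positive for score >= threshold."""
    matrix = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for score, is_positive in zip(scores, labels):
        if score >= threshold:
            matrix["TP" if is_positive else "FP"] += 1
        else:
            matrix["FN" if is_positive else "TN"] += 1
    return matrix

scores = [0.9, 0.7, 0.4, 0.2]
labels = [True, False, True, False]
matrix = confusion_matrix(scores, labels, threshold=0.5)
print(matrix)
```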
-
getOptimalAccuracyThreshold
¶ Gets the threshold value that maximizes accuracy on a set of positive and negative sequences.
Parameters: - seqs (sequences) – Sequences.
- labelPositive (sequenceLabel) – Label of positives.
- labelNegative (sequenceLabel) – Label of negatives.
Returns: Threshold value optimized for accuracy
Return type: float
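A simple way to find an accuracy-optimal threshold is to try every observed score as a candidate and keep the best one. The sketch below does exactly that; it is an illustration rather than the Gnocis algorithm, and `optimal_accuracy_threshold` is a hypothetical name.

```python
def optimal_accuracy_threshold(scores, labels):
    """Return (threshold, accuracy) for the best candidate threshold."""
    best_threshold, best_accuracy = None, -1.0
    for candidate in sorted(set(scores)):
        # A sequence is predicted positive when its score >= candidate.
        correct = sum((s >= candidate) == bool(y)
                      for s, y in zip(scores, labels))
        accuracy = correct / len(scores)
        if accuracy > best_accuracy:  # ties keep the lowest threshold
            best_threshold, best_accuracy = candidate, accuracy
    return best_threshold, best_accuracy

scores = [0.1, 0.4, 0.35, 0.8]
labels = [0, 0, 1, 1]
threshold, accuracy = optimal_accuracy_threshold(scores, labels)
print(threshold, accuracy)
```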
-
getPRC
¶ Calculates and returns a Precision/Recall curve (PRC) for sets of positive and negative sequences.
Parameters: - seqs (sequences) – Sequences.
- labelPositive (sequenceLabel) – Label of positives.
- labelNegative (sequenceLabel) – Label of negatives.
Returns: Precision/Recall curve (PRC)
Return type: list
-
getPRCAUC
¶ Calculates and returns the area under a Precision/Recall curve (PRCAUC) for sets of positive and negative sequences.
Parameters: - seqs (sequences) – Sequences.
- labelPositive (sequenceLabel) – Label of positives.
- labelNegative (sequenceLabel) – Label of negatives.
Returns: Area under the Precision/Recall curve (PRCAUC)
Return type: float
-
getPrecisionThreshold
¶ Gets a threshold value for a desired precision to a set of positive and a set of negative sequences. Linear interpolation is used in order to achieve a close approximation.
Parameters: - seqs (sequences) – Sequences.
- labelPositive (sequenceLabel) – Label of positives.
- labelNegative (sequenceLabel) – Label of negatives.
- wantedPrecision (float) – The precision to approximate.
Returns: Threshold value approximating the desired precision
Return type: float
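The linear interpolation step can be illustrated as follows: scan candidate thresholds in increasing order, and when the precision first reaches the target, interpolate between that threshold and the previous one. This is a sketch of the idea under those assumptions, not the Gnocis code; `precision_at` and `precision_threshold` are hypothetical names.

```python
def precision_at(scores, labels, threshold):
    """Precision when predicting positive for score >= threshold."""
    predicted = [y for s, y in zip(scores, labels) if s >= threshold]
    return sum(predicted) / len(predicted) if predicted else 1.0

def precision_threshold(scores, labels, wanted_precision):
    """Approximate the lowest threshold reaching wanted_precision."""
    previous = None
    for t in sorted(set(scores)):
        p = precision_at(scores, labels, t)
        if p >= wanted_precision:
            if previous is None:
                return t
            t0, p0 = previous
            # Interpolate between the last point below the target (t0, p0)
            # and the first point at or above it (t, p).
            return t0 + (t - t0) * (wanted_precision - p0) / (p - p0)
        previous = (t, p)
    return None  # the target precision is never reached

scores = [0.2, 0.4, 0.6, 0.8]
labels = [0, 1, 0, 1]
print(precision_threshold(scores, labels, 0.6))
```

The interpolated value is an approximation: precision changes in steps at observed scores, so the result should be checked on held-out data.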
-
getROC
¶ Calculates and returns a Receiver Operating Characteristic (ROC) curve for sets of positive and negative sequences.
Parameters: - seqs (sequences) – Sequences.
- labelPositive (sequenceLabel) – Label of positives.
- labelNegative (sequenceLabel) – Label of negatives.
Returns: Receiver Operating Characteristic (ROC) curve
Return type: list
-
getROCAUC
¶ Calculates and returns the area under a Receiver Operating Characteristic (ROCAUC) curve for sets of positive and negative sequences.
Parameters: - seqs (sequences) – Sequences.
- labelPositive (sequenceLabel) – Label of positives.
- labelNegative (sequenceLabel) – Label of negatives.
Returns: Area under the Receiver Operating Characteristic (ROCAUC) curve
Return type: float
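As a reference for what these statistics measure, the sketch below builds a ROC curve from scores and binary labels and integrates it with the trapezoidal rule. It is a generic illustration, not the Gnocis implementation, and it assumes at least one positive and one negative; `roc_curve` and `auc` are hypothetical names.

```python
def roc_curve(scores, labels):
    """Return a list of (false positive rate, true positive rate) points."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    n_pos = sum(1 for _, y in ranked if y)
    n_neg = len(ranked) - n_pos
    tp = fp = 0
    curve = [(0.0, 0.0)]
    i = 0
    while i < len(ranked):
        score = ranked[i][0]
        # Tied scores are consumed together so they yield a single point.
        while i < len(ranked) and ranked[i][0] == score:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        curve.append((fp / n_neg, tp / n_pos))
    return curve

def auc(curve):
    """Area under a curve of (x, y) points, by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(curve, curve[1:]))

curve = roc_curve([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0])
print(auc(curve))
```

The same `auc` helper applies to a Precision/Recall curve if its points are given as (recall, precision) pairs.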
-
getSequenceScores
¶ Scores a set of sequences, returning the maximum window score for each. Multiprocessing is enabled by default, but can be disabled in the constructor.
Parameters: seqs (sequences/sequenceStream) – Sequences to score. Returns: List of the maximum window score per sequence Return type: list
-
getValidationStatistics
¶ Returns common model validation statistics (confusion matrix values; ROCAUC; PRCAUC).
Parameters: - seqs (sequences) – Sequences.
- labelPositive (sequenceLabel) – Label of positives.
- labelNegative (sequenceLabel) – Label of negatives.
Returns: Validation statistics
Return type: list
-
plotPRC
¶ Plots a Precision/Recall curve and either displays it in an IPython session or saves it to a file.
Parameters: - seqs (sequences) – Sequences.
- labelPositive (sequenceLabel) – Label of positives.
- labelNegative (sequenceLabel) – Label of negatives.
- figsize (tuple, optional) – Tuple of figure dimensions.
- outpath (str, optional) – Path to save generated plot to. If not set, the plot will be output to IPython.
- style (str, optional) – Matplotlib style to use.
-
predict
¶ Applies the model using a sliding window across an input sequence stream or sequence set. Windows with a score >= self.threshold are predicted, and predicted windows are merged into non-overlapping predictions.
Parameters: stream (sequences/sequenceStream/str) – Sequence material that the model is applied to for prediction. Returns: Set of non-overlapping (merged) predicted regions Return type: regions
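The sliding-window prediction scheme described above can be sketched as follows. This standalone example uses GC content as a stand-in scoring function and merges overlapping or adjacent predicted windows; it is not the Gnocis implementation, and `predict_regions` and `gc_content` are hypothetical names.

```python
def predict_regions(seq, score_window, window_size, window_step, threshold):
    """Return merged (start, end) regions, end inclusive, of windows scoring
    at or above the threshold."""
    predictions = []
    for start in range(0, len(seq) - window_size + 1, window_step):
        end = start + window_size - 1
        if score_window(seq[start:start + window_size]) >= threshold:
            if predictions and start <= predictions[-1][1] + 1:
                # Overlaps or touches the previous prediction: extend it.
                predictions[-1] = (predictions[-1][0], end)
            else:
                predictions.append((start, end))
    return predictions

def gc_content(window):
    """Stand-in scoring function: fraction of G and C in the window."""
    return (window.count("G") + window.count("C")) / len(window)

seq = "AT" * 4 + "GC" * 5 + "AT" * 4
regions = predict_regions(seq, gc_content, window_size=8, window_step=2,
                          threshold=0.75)
print(regions)
```

Taking the maximum of `score_window` over the same windows, instead of thresholding, gives a per-sequence score in the spirit of scoreSequence.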
-
predictCore
¶ Predicts cores. Not implemented by default. Models that use core prediction should implement this.
Parameters: genome (genome) – Genome. Returns: Predicted core regions if implemented, or None otherwise Return type: regions
-
predictSequenceStreamCurves
¶ Applies the model using a sliding window across an input sequence stream or sequence set. Windows with a score >= self.threshold are predicted, and predicted windows are merged into non-overlapping predictions.
Parameters: stream (sequences/sequenceStream/str) – Sequence material that the model is applied to for prediction. Returns: Set of non-overlapping (merged) predicted regions Return type: regions
-
predictSequenceStreamRegions
¶ Applies the model using a sliding window across an input sequence stream or sequence set. Windows with a score >= self.threshold are predicted, and predicted windows are merged into non-overlapping predictions.
Parameters: stream (sequences/sequenceStream/str) – Sequence material that the model is applied to for prediction. Returns: Set of non-overlapping (merged) predicted regions Return type: regions
-
printTestStatistics
¶
-
score
¶ Scores a sequence or set of sequences.
Parameters: - target (sequence/sequences/sequenceStream) – Sequence(s) to score.
- asTable (bool) – If True, a table will be output. Otherwise, a list.
Returns: Score or list of scores
Return type: float/list
-
scoreSequence
¶ Scores a single sequence. The score is determined by applying the model with a sliding window, and taking the maximum score.
Parameters: seq (sequences/sequenceStream) – The sequence. Returns: Maximal window score Return type: float
-
train
¶ Trains the model on a training set.
Parameters: ts (sequences) – Training set. Returns: Trained model Return type: sequenceModel
-
class
gnocis.models.
sequenceModelDummy
¶ Bases:
gnocis.models.sequenceModel
This model takes a feature set, and scores input sequences by summing feature values, without weighting.
Parameters: - name (str) – Model name.
- features (features) – The feature set.
- windowSize (int) – Window size to use.
- windowStep (int) – Window step size to use.
-
getTrainer
¶
-
scoreWindow
¶
-
class
gnocis.models.
sequenceModelLogOdds
¶ Bases:
gnocis.models.sequenceModel
Constructs a log-odds model based on an input feature set and binary training set, and scores input sequences by summing log-odds-weighted feature values.
Parameters: - name (str) – Model name.
- features (features) – The feature set.
- trainingSet (sequences) – Training sequences.
- windowSize (int) – Window size to use.
- windowStep (int) – Window step size to use.
- labelPositive (sequenceLabel) – Positive training class label.
- labelNegative (sequenceLabel) – Negative training class label.
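The log-odds weighting can be illustrated with a small standalone sketch: each feature gets a weight equal to the log of the ratio between its mean value in positives and in negatives, and a window is scored by the weighted sum of its feature values. The exact formula used by Gnocis is not given here, so the pseudocount and the use of means are assumptions, and `train_log_odds` and `score` are hypothetical names.

```python
import math

def train_log_odds(pos_vectors, neg_vectors, pseudocount=1e-6):
    """One log-odds weight per feature from positive and negative vectors."""
    weights = []
    for j in range(len(pos_vectors[0])):
        mean_pos = sum(v[j] for v in pos_vectors) / len(pos_vectors)
        mean_neg = sum(v[j] for v in neg_vectors) / len(neg_vectors)
        # The pseudocount avoids division by zero and log(0).
        weights.append(math.log((mean_pos + pseudocount) /
                                (mean_neg + pseudocount)))
    return weights

def score(weights, vector):
    """Sum of log-odds-weighted feature values."""
    return sum(w * x for w, x in zip(weights, vector))

pos_vectors = [[2.0, 1.0], [4.0, 1.0]]
neg_vectors = [[1.0, 1.0], [2.0, 1.0]]
weights = train_log_odds(pos_vectors, neg_vectors)
print(score(weights, [1.0, 5.0]))
```

Setting every weight to 1 reduces this to an unweighted sum of feature values, as in the dummy model.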
-
getTrainer
¶
-
scoreWindow
¶
-
gnocis.models.
setNCores
()¶ Sets the number of cores to use for multiprocessing.
Parameters: n (int) – The number of cores to use.
-
gnocis.models.
setNThreadFetch
()¶
-
gnocis.models.
trainPREdictorModel
()¶
-
gnocis.models.
trainSinglePREdictorModel
()¶
gnocis.motifs module¶
-
class
gnocis.motifs.
IUPACMotif
¶ Bases:
object
Represents an IUPAC motif.
Parameters: - name (str) – Name of the motif.
- motif (str) – Motif sequence, in IUPAC nucleotide codes (https://www.bioinformatics.org/sms/iupac.html).
-
c
¶
-
cRC
¶
-
cachedOcc
¶
-
cachedSequence
¶
-
find
()¶ Finds the occurrences of the motif in a sequence.
Parameters: - seq (sequence) – The sequence.
- cache (bool) – Whether or not to cache.
Returns: List of motif occurrences
Return type: list
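Searching for an IUPAC motif on both strands can be sketched with regular expressions, as below. This is a standalone illustration and not the Gnocis implementation: occurrences are returned as (start, end, strand) tuples with an inclusive end and True for the forward strand, and mismatches are not supported. `iupac_to_regex` and `find_motif` are hypothetical names.

```python
import re

# IUPAC nucleotide codes mapped to regular expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
IUPAC_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def iupac_to_regex(motif):
    return "".join(IUPAC[c] for c in motif.upper())

def find_motif(motif, seq):
    """Find occurrences of the motif and of its reverse complement."""
    rc_motif = motif.upper().translate(IUPAC_COMPLEMENT)[::-1]
    occurrences = []
    for strand, pattern in ((True, motif), (False, rc_motif)):
        # The lookahead lets overlapping occurrences be reported.
        regex = re.compile(f"(?=({iupac_to_regex(pattern)}))")
        for m in regex.finditer(seq.upper()):
            occurrences.append((m.start(), m.start() + len(motif) - 1, strand))
    return sorted(occurrences)

print(find_motif("GAGAG", "TTGAGAGTTCTCTCAA"))
```

Note that a palindromic motif is reported once per strand by this sketch.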
-
motif
¶
-
name
¶
-
nmismatches
¶
-
regexMotif
¶
-
regexMotifRC
¶
-
class
gnocis.motifs.
PWMMotif
¶ Bases:
object
Represents a Position Weight Matrix (PWM) motif.
Parameters: - name (str) – Name of the motif.
- pwm (str) – Position Weight Matrix.
- path (str) – Path to file that the PWM was loaded from.
-
PWMF
¶
-
PWMRC
¶
-
cachedOcc
¶
-
cachedSequence
¶
-
find
()¶ Finds the occurrences of the motif in a sequence.
Parameters: - seq (sequence) – The sequence.
- cache (bool) – Whether or not to cache.
Returns: List of motif occurrences
Return type: list
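Scanning a sequence with a PWM typically means scoring every window of the matrix width and reporting windows at or above a threshold. The sketch below shows that on the forward strand only, with the PWM represented as a list of per-position score dictionaries; this representation and the names `score_pwm` and `find_pwm` are assumptions for illustration.

```python
def score_pwm(pwm, window):
    """Sum the per-position scores of the nucleotides in the window."""
    return sum(column[nt] for column, nt in zip(pwm, window))

def find_pwm(pwm, seq, threshold):
    """Return (start, end) of windows scoring >= threshold, end inclusive."""
    width = len(pwm)
    hits = []
    for i in range(len(seq) - width + 1):
        window = seq[i:i + width]
        if all(nt in "ACGT" for nt in window) and \
                score_pwm(pwm, window) >= threshold:
            hits.append((i, i + width - 1))
    return hits

# Hypothetical 3-column matrix favouring A, then G, then A or T.
pwm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": -1.0, "G": 1.0, "T": -1.0},
    {"A": 0.5, "C": -1.0, "G": -1.0, "T": 0.5},
]
hits = find_pwm(pwm, "CAGTAGACC", threshold=2.0)
print(hits)
```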
-
getEValueThreshold
()¶ Calibrates the threshold for a desired E-value.
Parameters: - bgModel (object) – Background model to use for calibration.
- Evalue (float) – The desired E-value.
- EvalueUnit (float) – E-value unit.
- seedThreshold (float) – Seed threshold.
- iterations (int) – Number of iterations.
Returns: Threshold
Return type: float
-
name
¶
-
path
¶
-
setPWM
()¶ Sets the Position Weight Matrix.
Parameters: pwm (list) – Position Weight Matrix.
-
threshold
¶
-
gnocis.motifs.
loadMEMEPWMDatabase
()¶
-
class
gnocis.motifs.
motifOccurrence
¶ Bases:
object
Represents motif occurrences.
Parameters: - motif (object) – The motif.
- seq (str) – Name of the sequence the motif occurred on.
- start (int) – Occurrence start nucleotide.
- end (int) – Occurrence end nucleotide.
- strand (bool) – Occurrence strand. True is the +/forward strand, and False is the -/backward strand.
-
end
¶
-
motif
¶
-
seq
¶
-
start
¶
-
strand
¶
-
class
gnocis.motifs.
motifs
¶ Bases:
object
The motifs class represents a set of motifs.
Parameters: - name (str) – Name of the motif set.
- motifs (list) – List of motifs.
-
Ringrose2003
¶ Preset for generating the Ringrose et al. (2003) motif set.
-
Ringrose2003GTGT
¶ Preset for generating the Ringrose et al. (2003) motif set, with GTGT added (as in Bredesen et al. 2019).
-
motifs
¶
-
name
¶
-
occFreq
¶ Generates an occurrence frequency feature set
-
pairFreq
¶ Generates a pair occurrence frequency feature set
-
table
¶
gnocis.regions module¶
-
gnocis.regions.
loadBED
()¶ Loads regions from a BED (https://www.ensembl.org/info/website/upload/bed.html) file.
Parameters: path (str) – Path to the input file. Returns: Loaded regions. Return type: regions
-
gnocis.regions.
loadBEDGZ
()¶ Loads regions from a gzipped BED (https://www.ensembl.org/info/website/upload/bed.html) file.
Parameters: path (str) – Path to the input file. Returns: Loaded regions. Return type: regions
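For readers unfamiliar with the format, the sketch below parses the first BED columns from plain or gzipped input using only the standard library. It keeps the raw BED coordinates (which are 0-based with an exclusive end) and does not attempt to reproduce how Gnocis converts them; `parse_bed` and `load_bed_gz` are hypothetical names.

```python
import gzip

def parse_bed(lines):
    """Parse BED lines into dictionaries with seq, start, end and strand."""
    regions = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue  # skip blank lines, comments and header lines
        fields = line.split("\t")
        regions.append({
            "seq": fields[0],
            "start": int(fields[1]),
            "end": int(fields[2]),
            # Strand is the optional sixth column; default to forward.
            "strand": fields[5] if len(fields) > 5 else "+",
        })
    return regions

def load_bed_gz(path):
    """Read a gzipped BED file in text mode and parse it."""
    with gzip.open(path, "rt") as handle:
        return parse_bed(handle)

print(parse_bed(["chr2L\t100\t200\tpeak1\t0\t-"]))
```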
-
gnocis.regions.
loadCoordinateList
()¶ Loads regions from a coordinate list file (one region per line, with the format seq:start..end).
Parameters: path (str) – Path to the input file. Returns: Loaded regions. Return type: regions
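The seq:start..end format is simple enough to parse with one regular expression. The following sketch shows one way to do so outside Gnocis; `parse_coordinate` is a hypothetical name, and rejecting malformed lines with an exception is a design choice of this example.

```python
import re

COORDINATE = re.compile(r"^(?P<seq>[^:\s]+):(?P<start>\d+)\.\.(?P<end>\d+)$")

def parse_coordinate(line):
    """Parse 'seq:start..end' into a (seq, start, end) tuple."""
    match = COORDINATE.match(line.strip())
    if match is None:
        raise ValueError(f"Not a seq:start..end coordinate: {line!r}")
    return match["seq"], int(match["start"]), int(match["end"])

print(parse_coordinate("2L:1000..2500"))
```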
-
gnocis.regions.
loadGFF
()¶ Loads regions from a General Feature Format (https://www.ensembl.org/info/website/upload/gff.html) file.
Parameters: path (str) – Path to the input file. Returns: Loaded regions. Return type: regions
-
gnocis.regions.
loadGFFGZ
()¶ Loads regions from a gzipped General Feature Format (https://www.ensembl.org/info/website/upload/gff.html) file.
Parameters: path (str) – Path to the input file. Returns: Loaded regions. Return type: regions
-
gnocis.regions.
nucleotidePrecisionBarplot
()¶ Generates a prediction nucleotide precision barplot.
Parameters: - predictionSets (list) – List of prediction region sets.
- regionSets (list) – List of validation region sets.
- figsize (tuple, optional) – Figure size.
- outpath (str, optional) – Output path.
- returnHTML (bool, optional) – If True, an HTML node will be returned.
- fontsizeLabels (float, optional) – Size of label font.
- fontsizeLegend (float, optional) – Size of legend font.
- fontsizeAxis (float, optional) – Size of axis font.
- style (str, optional) – Plot style to use.
- showLegend (bool, optional) – Flag for whether or not to render legend.
- bboxAnchorTo (tuple, optional) – Legend anchor point.
- legendLoc (str, optional) – Legend location.
- showValues (bool, optional) – If True, values will be plotted.
-
gnocis.regions.
nucleotideRegionF1Barplot
()¶ Generates a barplot of F1-measures, where the recall is based on region overlap sensitivity and precision is based on nucleotide precision.
Parameters: - predictionSets (list) – List of prediction region sets.
- regionSets (list) – List of validation region sets.
- figsize (tuple, optional) – Figure size.
- outpath (str, optional) – Output path.
- returnHTML (bool, optional) – If True, an HTML node will be returned.
- fontsizeLabels (float, optional) – Size of label font.
- fontsizeLegend (float, optional) – Size of legend font.
- fontsizeAxis (float, optional) – Size of axis font.
- style (str, optional) – Plot style to use.
- showLegend (bool, optional) – Flag for whether or not to render legend.
- bboxAnchorTo (tuple, optional) – Legend anchor point.
- legendLoc (str, optional) – Legend location.
- showValues (bool, optional) – If True, values will be plotted.
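The quantity plotted here combines two differently defined measures. The sketch below computes it for regions on a single sequence, given as (start, end) tuples with inclusive ends: precision is the fraction of predicted nucleotides that fall inside validation regions, recall is the fraction of validation regions overlapped by any prediction, and F1 is their harmonic mean. The function names are hypothetical and this is not the Gnocis code.

```python
def covered(regions):
    """Set of all positions covered by the regions."""
    positions = set()
    for start, end in regions:
        positions.update(range(start, end + 1))
    return positions

def nucleotide_precision(predictions, validation):
    predicted = covered(predictions)
    if not predicted:
        return 0.0
    return len(predicted & covered(validation)) / len(predicted)

def overlap_sensitivity(predictions, validation):
    if not validation:
        return 0.0
    hit = sum(1 for v in validation
              if any(p[0] <= v[1] and v[0] <= p[1] for p in predictions))
    return hit / len(validation)

def f1(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0

predictions = [(0, 99), (500, 599)]
validation = [(50, 149), (1000, 1099)]
precision = nucleotide_precision(predictions, validation)
recall = overlap_sensitivity(predictions, validation)
print(precision, recall, f1(precision, recall))
```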
-
gnocis.regions.
overlapPrecisionBarplot
()¶ Generates a prediction overlap precision barplot.
Parameters: - predictionSets (list) – List of prediction region sets.
- regionSets (list) – List of validation region sets.
- figsize (tuple, optional) – Figure size.
- outpath (str, optional) – Output path.
- returnHTML (bool, optional) – If True, an HTML node will be returned.
- fontsizeLabels (float, optional) – Size of label font.
- fontsizeLegend (float, optional) – Size of legend font.
- fontsizeAxis (float, optional) – Size of axis font.
- style (str, optional) – Plot style to use.
- showLegend (bool, optional) – Flag for whether or not to render legend.
- bboxAnchorTo (tuple, optional) – Legend anchor point.
- legendLoc (str, optional) – Legend location.
- showValues (bool, optional) – If True, values will be plotted.
-
gnocis.regions.
overlapSensitivityBarplot
()¶ Generates a prediction overlap sensitivity barplot.
Parameters: - predictionSets (list) – List of prediction region sets.
- regionSets (list) – List of validation region sets.
- figsize (tuple, optional) – Figure size.
- outpath (str, optional) – Output path.
- returnHTML (bool, optional) – If True, an HTML node will be returned.
- fontsizeLabels (float, optional) – Size of label font.
- fontsizeLegend (float, optional) – Size of legend font.
- fontsizeAxis (float, optional) – Size of axis font.
- style (str, optional) – Plot style to use.
- showLegend (bool, optional) – Flag for whether or not to render legend.
- bboxAnchorTo (tuple, optional) – Legend anchor point.
- legendLoc (str, optional) – Legend location.
- showValues (bool, optional) – If True, values will be plotted.
-
gnocis.regions.
predictionBarplot
()¶ Generates a prediction overlap precision barplot.
Parameters: - predictionSets (list) – List of prediction region sets.
- figsize (tuple, optional) – Figure size.
- outpath (str, optional) – Output path.
- returnHTML (bool, optional) – If True, an HTML node will be returned.
- fontsizeLabels (float, optional) – Size of label font.
- fontsizeLegend (float, optional) – Size of legend font.
- fontsizeAxis (float, optional) – Size of axis font.
- style (str, optional) – Plot style to use.
- showLegend (bool, optional) – Flag for whether or not to render legend.
- bboxAnchorTo (tuple, optional) – Legend anchor point.
- legendLoc (str, optional) – Legend location.
- showValues (bool, optional) – If True, values will be plotted.
-
class
gnocis.regions.
region
¶ Bases:
object
The region class represents a region within a sequence. A region has a sequence name, a start and end, and a strandedness, with optional additional annotation.
Parameters: - seq (str) – Sequence name.
- start (int) – Start coordinate.
- end (int) – End coordinate (inclusive).
- strand (bool, optional) – Strandedness. True for the + (forward) strand (default), and False for the - (reverse) strand.
- score (float, optional) – Score.
- source (str, optional) – Source.
- feature (str, optional) – Feature.
- group (str, optional) – Group.
- dropChr (bool, optional) – True if a leading chr in seq is to be dropped (default).
The length of a region r can be calculated with len(r).
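With an inclusive end coordinate, the length of a region is end - start + 1. The dataclass below is a simplified stand-in that illustrates this and the seq:start..end (strand) string format; it is not the Gnocis class, and it omits the optional annotation fields.

```python
from dataclasses import dataclass

@dataclass
class Region:
    seq: str
    start: int
    end: int             # inclusive end coordinate
    strand: bool = True  # True for + (forward), False for - (reverse)

    def __len__(self):
        # Both start and end are part of the region, hence the + 1.
        return self.end - self.start + 1

    def bstr(self):
        sign = "+" if self.strand else "-"
        return f"{self.seq}:{self.start}..{self.end} ({sign})"

r = Region("2L", 1000, 1999, strand=False)
print(len(r), r.bstr())
```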
-
bstr
()¶ Returns: Returns the region formatted as a string: seq:start..end (strand). Return type: str
-
center
()¶
-
end
¶
-
ext
¶
-
feature
¶
-
group
¶
-
markers
¶
-
score
¶
-
seq
¶
-
singleton
()¶
-
source
¶
-
start
¶
-
strand
¶
-
gnocis.regions.
regionBarplot
()¶ Generates a barplot of a measure (lambda) for predicted regions.
Parameters: - mainRegions (list) – List of main region sets.
- measure (function) – Function to compute measure.
- measureName (str) – Name of measure.
- testRegions (list, optional) – List of test region sets, to check against (or None if not relevant).
- figsize (tuple, optional) – Figure size.
- outpath (str, optional) – Output path.
- returnHTML (bool, optional) – If True, an HTML node will be returned.
- fontsizeLabels (float, optional) – Size of label font.
- fontsizeLegend (float, optional) – Size of legend font.
- fontsizeAxis (float, optional) – Size of axis font.
- style (str, optional) – Plot style to use.
- showLegend (bool, optional) – Flag for whether or not to render legend.
- bboxAnchorTo (tuple, optional) – Legend anchor point.
- legendLoc (str, optional) – Legend location.
- showValues (bool, optional) – If True, values will be plotted.
- isPercent (bool, optional) – If True, values will be plotted as percentages.
-
class
gnocis.regions.
regions
¶ Bases:
object
The regions class represents a set of regions.
Parameters: - name (str) – Name for the region set.
- rgns (list) – List of regions to include.
- initialSort (bool) – Whether or not to sort the regions (default is True).
For a region set RS, len(RS) gives the number of regions, RS[i] gives the i-th region in the set, and [ r for r in RS ] lists the regions in the set. For sets A and B, A+B gives a sorted region set of regions from both sets, A|B gives the merged set, A&B the intersection, and A^B the regions of A with B excluded.
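The operator semantics can be illustrated with a toy class that works on (start, end) tuples on a single sequence. For brevity it converts intervals to sets of positions, which is simple but memory-hungry and would not scale to genomes, and it joins adjacent intervals when converting back. `RegionSet` is a hypothetical stand-in, not the Gnocis class.

```python
def to_positions(intervals):
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return positions

def to_intervals(positions):
    """Collapse a set of positions into sorted inclusive intervals."""
    intervals = []
    for p in sorted(positions):
        if intervals and p == intervals[-1][1] + 1:
            intervals[-1][1] = p
        else:
            intervals.append([p, p])
    return [tuple(i) for i in intervals]

class RegionSet:
    def __init__(self, intervals):
        self.intervals = sorted(intervals)

    def _combine(self, other, op):
        return RegionSet(to_intervals(
            op(to_positions(self.intervals), to_positions(other.intervals))))

    def __add__(self, other):   # sorted concatenation, no merging
        return RegionSet(self.intervals + other.intervals)

    def __or__(self, other):    # merged set
        return self._combine(other, set.union)

    def __and__(self, other):   # intersection
        return self._combine(other, set.intersection)

    def __xor__(self, other):   # exclusion: self minus other, not symmetric
        return self._combine(other, set.difference)

A = RegionSet([(1, 10), (20, 30)])
B = RegionSet([(5, 25)])
print((A | B).intervals, (A & B).intervals, (A ^ B).intervals)
```

Note that `^` here means exclusion, as described above, rather than the symmetric difference that Python's built-in sets use for the same operator.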
-
bstr
()¶ Returns: Returns a concatenated list of regions formatted as seq:start..end (strand). Return type: str
-
deltaResize
()¶ Gets a set of resized regions. The delta size is subtracted from the start and added to the end.
Parameters: sizeDelta – Desired region size change. Returns: Set of resized regions. Return type: regions
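The resizing rule can be written out in a few lines. In this sketch regions are (start, end) tuples; dropping regions that shrink to nothing and not clamping at the sequence start are assumptions of the example, and `delta_resize` is a hypothetical name.

```python
def delta_resize(regions, size_delta):
    """Subtract size_delta from each start and add it to each end."""
    resized = []
    for start, end in regions:
        new_start, new_end = start - size_delta, end + size_delta
        if new_start <= new_end:  # a negative delta can shrink a region away
            resized.append((new_start, new_end))
    return resized

print(delta_resize([(100, 200)], 50))
```

Each region therefore grows by twice the delta in total, or shrinks by that amount when the delta is negative.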
-
dummy
()¶ Generates a dummy region set, with the same region lengths but random positions.
Parameters: - genome (genome) – Genome.
- useSeq (list) – Chromosomes to focus on.
Returns: Dummy region set.
Return type: regions
-
exclusion
()¶ Gets the regions of this set with regions of another set excluded.
Parameters: other (regions) – Other region set. Returns: Excluded region set. Return type: regions
-
expected
()¶ Calculates an expected statistic by random dummy set creation and averaging.
Parameters: - genome (genome) – Genome.
- statfun (Function) – Function to apply to region sets that returns desired statistic.
- useSeq (list) – Chromosomes to focus on.
- repeats (int) – Number of repeats for calculating statistic.
Returns: Average statistic.
Return type: float
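The idea behind dummy and expected (place regions of the same lengths at random positions, apply a statistic, and average over repeats) can be sketched in plain Python. This is an illustration under stated assumptions, not the gnocis implementation: the genome is reduced to a single hypothetical sequence of length genome_size, regions are (start, end) tuples with inclusive coordinates, and the seed parameter is added here only for reproducibility.

```python
import random

def dummy(lengths, genome_size, rng):
    """Place one region per length at a uniformly random position."""
    regions = []
    for length in lengths:
        start = rng.randrange(genome_size - length + 1)
        regions.append((start, start + length - 1))
    return regions

def expected(lengths, genome_size, statfun, repeats=100, seed=0):
    """Average statfun over `repeats` independently drawn dummy sets."""
    rng = random.Random(seed)
    total = sum(statfun(dummy(lengths, genome_size, rng)) for _ in range(repeats))
    return total / repeats

# Example statistic: number of dummy regions that touch a fixed target interval.
target = (400, 600)
hits = lambda rs: sum(1 for s, e in rs if s <= target[1] and target[0] <= e)
print(expected([50, 100, 200], 10_000, hits, repeats=1000))
```

The printed value depends on the random draws, so no fixed output is given.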
-
extract
()¶ Extracts region sequences from a sequence set, sequence stream, or a file by path.
Parameters: src (sequences/sequenceStream/str) – Sequence set, sequence stream, or sequence file path. Returns: Extracted region sequences. Return type: sequences
-
extractSequences
()¶ Extracts region sequences from a sequence set or stream.
Parameters: seqs (sequences/sequenceStream) – Sequences/sequence stream to extract region sequences from. Returns: Extracted region sequences. Return type: sequences
-
extractSequencesFrom2bit
()¶ Extracts region sequences from a 2bit file.
Parameters: path (str) – Path to 2bit file. Returns: Extracted region sequences. Return type: sequences
-
extractSequencesFromFASTA
()¶ Extracts region sequences from a FASTA file.
Parameters: path (str) – Path to FASTA file. Returns: Extracted region sequences. Return type: sequences
-
filter
()¶ Returns a region set filtered with the input lambda function.
Parameters: - fltName (str) – Name of the filter. Appended to the region set name.
- flt (lambda) – Lambda function to be applied to every region, returning True if the region is to be included, or otherwise False.
Returns: Region set filtered with the lambda function.
Return type: regions
-
flatten
()¶ Returns a flattened set of regions, with internally overlapping regions merged.
Returns: Flattened region set. Return type: regions
-
intersection
()¶ Gets the intersection of regions in this set and another.
Parameters: other (regions) – Other region set. Returns: Intersection region set. Return type: regions
-
merge
()¶ Gets the merged set of regions in this set and another.
Parameters: other (regions) – Other region set. Returns: Merged region set. Return type: regions
-
name
¶
-
nonOverlap
()¶ Gets the regions in this set that do not overlap with another.
Parameters: other (regions) – Other region set. Returns: Set of non-overlapping regions. Return type: regions
-
nucleotidePrecision
()¶ Calculates the nucleotide precision to another set.
Parameters: other (regions) – Other region set. Returns: Nucleotide precision. Return type: float
-
overlap
()¶ Gets the regions in this set that overlap with another.
Parameters: other (regions) – Other region set. Returns: Set of overlapping regions. Return type: regions
-
overlapPrecision
()¶ Calculates the overlap precision to another set.
Parameters: other (regions) – Other region set. Returns: Overlap precision. Return type: float
-
overlapSensitivity
()¶ Calculates the overlap sensitivity to another set.
Parameters: other (regions) – Other region set. Returns: Overlap sensitivity. Return type: float
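The reference above does not define overlap precision and overlap sensitivity. A common reading, assumed here and not confirmed by the source, is that precision is the fraction of regions in this set that overlap the other set, and sensitivity is the fraction of regions in the other set that are overlapped by this set. A plain-Python sketch with hypothetical inclusive (start, end) tuples:

```python
def overlaps(region, others):
    """True if region overlaps at least one interval in others."""
    return any(region[0] <= o[1] and o[0] <= region[1] for o in others)

def overlap_precision(predicted, reference):
    """Assumed definition: share of predicted regions hitting the reference."""
    return sum(overlaps(r, reference) for r in predicted) / len(predicted)

def overlap_sensitivity(predicted, reference):
    """Assumed definition: share of reference regions hit by a prediction."""
    return sum(overlaps(r, predicted) for r in reference) / len(reference)

predicted = [(1, 10), (50, 60)]
reference = [(5, 8), (9, 12), (100, 110)]
print(overlap_precision(predicted, reference))    # 0.5
print(overlap_sensitivity(predicted, reference))  # 2 of 3 reference regions are hit
```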
-
printStatistics
()¶ Outputs basic statistics.
-
randomSplit
()¶ Parameters: ratio (float) – Ratio for the split, between 0 and 1. For the returned tuple (A, B), a ratio of 0 places all regions in B, and a ratio of 1 places all regions in A. Returns: Random split of the regions into two independent sets, with the given ratio. Return type: regions
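One way to obtain a split with the ratio semantics described above is to shuffle and cut. The sketch below is plain Python on a list and is an assumption about the approach. gnocis may implement the split differently, for example with a per-item random draw.

```python
import random

def random_split(items, ratio, rng=random):
    """Shuffle a copy of items and cut it so that A holds about ratio of them."""
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(len(shuffled) * ratio))
    return shuffled[:cut], shuffled[cut:]

A, B = random_split(range(10), 0.3, random.Random(0))
print(len(A), len(B))  # 3 7
```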
-
recenter
()¶ Gets a set of randomly recentered regions. If the target size is smaller than a given region, a region of the desired size is randomly placed within the region. If the desired size is larger, a region of the desired size is placed with equal center to the center of the region.
Parameters: size (int) – Desired region size. Returns: Set of randomly recentered regions. Return type: regions
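The two cases described for recenter can be sketched for a single region as follows. This is a plain-Python illustration with inclusive coordinates and a hypothetical function signature. It does not clamp the result to sequence bounds, and the handling of regions that already have the target size is an assumption.

```python
import random

def recenter(start, end, size, rng=random):
    """Return a (start, end) window of the given size for one region."""
    length = end - start + 1
    if size <= length:
        # Target fits inside the region: place it at a random offset.
        new_start = rng.randint(start, end - size + 1)
    else:
        # Target is larger: keep the region's center (up to integer rounding).
        center = (start + end) // 2
        new_start = center - size // 2
    return new_start, new_start + size - 1

print(recenter(1000, 1099, 300))  # (899, 1198)
```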
-
regions
¶
-
sample
()¶ Parameters: n (int) – Number of regions to pick. If 0, the same number will be selected as are in the full set. Returns: Returns a random sample of the regions of size n. Regions are selected with replacement, and the same region can occur multiple times. Return type: regions
-
saveBED
()¶ Saves the regions to a BED (https://www.ensembl.org/info/website/upload/bed.html) file.
Parameters: path (str) – Output file path.
-
saveCoordinateList
()¶ Saves the regions to a coordinate list file (a coordinate per line, with the format seq:start..end).
Parameters: path (str) – Output file path.
-
saveGFF
()¶ Saves the regions to a General Feature Format (https://www.ensembl.org/info/website/upload/gff.html) file.
Parameters: path (str) – Output file path.
-
sort
()¶ Sorts the set, first by start coordinate, then by sequence.
-
table
()¶
gnocis.sequences module¶
-
class
gnocis.sequences.
IID
¶ Bases:
gnocis.sequences.sequenceGenerator
The IID class trains an i.i.d. model for generating sequences.
Parameters: - trainingSequences (sequences/sequenceStream/str) – Training sequences.
- pseudoCounts (bool, optional) – Whether or not to use pseudocounts. Default is True.
- addComplements (bool, optional) – Whether or not to add complements. Default is True.
-
addComplements
¶
-
generate
()¶ Generates a sequence.
Parameters: length (int) – Length of the sequence to generate. Returns: Random sequence. Return type: sequence
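As a rough illustration of what an i.i.d. sequence model does, the sketch below estimates a nucleotide distribution from training sequences and samples each position independently. It is plain Python, not the gnocis IID class. The function names are hypothetical, and the exact meaning of the pseudocount and complement options in gnocis is assumed.

```python
import random

COMPLEMENT = {'A': 'T', 'C': 'G', 'G': 'C', 'T': 'A'}

def train_iid(seqs, pseudo_counts=True, add_complements=True):
    """Estimate nucleotide frequencies; unknown characters are skipped."""
    counts = {nt: (1 if pseudo_counts else 0) for nt in 'ACGT'}
    for seq in seqs:
        for nt in seq.upper():
            if nt in COMPLEMENT:
                counts[nt] += 1
                if add_complements:
                    # Also count the complementary base (assumed behaviour).
                    counts[COMPLEMENT[nt]] += 1
    total = sum(counts.values())
    return {nt: counts[nt] / total for nt in 'ACGT'}

def generate(dist, length, rng=random):
    """Draw each nucleotide independently from the trained distribution."""
    nts = list(dist)
    return ''.join(rng.choices(nts, weights=[dist[nt] for nt in nts], k=length))

dist = train_iid(['AAAC'], pseudo_counts=False, add_complements=False)
print(dist)  # {'A': 0.75, 'C': 0.25, 'G': 0.0, 'T': 0.0}
print(generate(dist, 20, random.Random(0)))
```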
-
nGenerated
¶
-
ntDistribution
¶
-
prepared
¶
-
pseudoCounts
¶
-
spectrum
¶
-
trainingSequences
¶
-
class
gnocis.sequences.
MarkovChain
¶ Bases:
gnocis.sequences.sequenceGenerator
The MarkovChain class trains a Markov chain for generating sequences.
Parameters: - trainingSequences (sequences/sequenceStream/str) – Training sequences.
- degree (int, optional) – Markov chain degree. Default is 4.
- pseudoCounts (bool, optional) – Whether or not to use pseudocounts. Default is True.
- addReverseComplements (bool, optional) – Whether or not to add reverse complements. Default is True.
-
addReverseComplements
¶
-
comparableSpectrum
¶
-
degree
¶
-
generate
()¶ Generates a sequence.
Parameters: length (int) – Length of the sequence to generate. Returns: Random sequence. Return type: sequence
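A Markov chain of a given degree conditions each nucleotide on the preceding degree nucleotides. The plain-Python sketch below shows the idea. It is not the gnocis MarkovChain class: function names are hypothetical, degree is assumed to be at least 1, reverse complements are omitted, and the initial context is drawn uniformly from observed contexts rather than from a trained initial distribution.

```python
import random
from collections import defaultdict

def train_markov(seqs, degree=4, pseudo_counts=True):
    """Count which nucleotide follows each context of `degree` nucleotides."""
    counts = defaultdict(lambda: {nt: (1 if pseudo_counts else 0) for nt in 'ACGT'})
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - degree):
            context, nxt = seq[i:i + degree], seq[i + degree]
            if nxt in 'ACGT':
                counts[context][nxt] += 1
    return dict(counts)

def generate(model, degree, length, rng=random):
    """Sample a sequence, extending one nucleotide at a time."""
    seq = rng.choice(sorted(model))  # simplification: uniform start context
    while len(seq) < length:
        weights = model.get(seq[-degree:])
        if weights is None or sum(weights.values()) == 0:
            weights = {nt: 1 for nt in 'ACGT'}  # unseen context: uniform fallback
        nts = list(weights)
        seq += rng.choices(nts, weights=[weights[nt] for nt in nts])[0]
    return seq[:length]

model = train_markov(['ACGTACGTACGT'], degree=2, pseudo_counts=False)
print(model['AC'])  # {'A': 0, 'C': 0, 'G': 3, 'T': 0}
print(generate(model, 2, 12, random.Random(0)))
```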
-
initialDistribution
¶
-
nGenerated
¶
-
prepared
¶
-
probspectrum
¶
-
pseudoCounts
¶
-
spectrum
¶
-
trainingSequences
¶
-
gnocis.sequences.
getSequenceStreamFromPath
()¶ Creates a sequence stream from a path, deducing the file type.
Parameters: - path (str) – Path to the input file.
- wantBlockSize (int, optional) – Desired block size.
- spacePrune (bool, optional) – Whether or not to prune sequence name spaces.
- dropChr (bool, optional) – Whether or not to prune sequence name “chr”-prefixes.
- restrictToSequences (list, optional) – List of sequence names to restrict to/focus on.
Returns: Generated sequence stream.
Return type: sequenceStream
-
gnocis.sequences.
getSequenceWindowRegions
()¶ Generates sequence window regions for a sequence set or stream.
Parameters: - stream – Input sequences.
- windowSize (int) – Window size.
- windowStep (int) – Window step size.
Returns: Sequence window regions.
Return type: regions
-
gnocis.sequences.
loadFASTA
()¶ Loads sequences from a FASTA file.
Parameters: path (str) – Path to the input file. Returns: Loaded sequences. Return type: sequences
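For readers unfamiliar with the format, a minimal generic FASTA reader looks roughly like this. The sketch is plain Python and independent of gnocis. It yields (name, sequence) tuples from any iterable of lines and does not reproduce options such as name pruning.

```python
import io

def parse_fasta(handle):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith('>'):
            if name is not None:
                yield name, ''.join(chunks)
            name, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if name is not None:
        yield name, ''.join(chunks)

example = io.StringIO('>chr1 test\nACGT\nAC\n>chr2\nGG\n')
print(list(parse_fasta(example)))  # [('chr1 test', 'ACGTAC'), ('chr2', 'GG')]
```

For a gzipped file, the same function can be fed a handle from gzip.open(path, 'rt').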
-
gnocis.sequences.
loadFASTAGZ
()¶ Loads sequences from a gzipped FASTA file.
Parameters: path (str) – Path to the input file. Returns: Loaded sequences. Return type: sequences
-
class
gnocis.sequences.
sequence
¶ Bases:
object
The sequence class represents a DNA sequence.
Parameters: - name (str) – Name of the sequence.
- seq (str) – Sequence.
- path (str, optional) – Source file path.
- sourceRegion (region, optional) – Source region.
- annotation (str, optional) – Annotation.
For a sequence instance S, len(S) gives the sequence length. [ nt for nt in S ] lists the nucleotides. For two sequences A and B, A+B gives the concatenated sequence.
-
annotation
¶
-
label
()¶ Parameters: label (sequenceLabel) – Label to add. Returns: Returns a labelled sequence. Return type: sequence
-
labels
¶
-
name
¶
-
path
¶
-
seq
¶
-
sourceRegion
¶
-
class
gnocis.sequences.
sequenceGenerator
¶ Bases:
object
The sequenceGenerator class is an abstract class for sequence generators.
-
generateFASTA
()¶ Generates sequences and saves them to a FASTA file.
Parameters: - path (str) – Output path.
- n (int) – Number of sequences to generate.
- length (int) – Length of each sequence to generate.
-
generateSet
()¶ Parameters: - n (int) – Number of sequences to generate.
- length (int) – Length of each sequence to generate.
Returns: Sequence set of n randomly generated sequences, each of length length.
Return type: sequences
-
generateStream
()¶ Parameters: - n (int) – Number of sequences to generate.
- length (int) – Length of each sequence to generate.
- wantBlockSize (int, optional) – Desired block size.
Returns: Sequence stream of n randomly generated sequences, each of length length.
Return type: sequenceStream
-
class
gnocis.sequences.
sequenceLabel
¶ Bases:
object
The sequenceLabel class represents a label that can be assigned to sequences to be used for model training or testing, such as “positive” or “negative”.
Parameters: - name (str) – Name of the sequence label.
- value (float) – Value to represent the label.
-
name
¶
-
value
¶
-
class
gnocis.sequences.
sequenceStream
¶ Bases:
object
The sequenceStream class represents sequence streams.
-
fetch
()¶ Fetches up to n sequences (fewer, or none, if the stream runs out).
Parameters: - n (int) – Number of sequences to fetch.
- maxnt (int) – Maximum number of nucleotides to fetch.
Returns: Yields sequences. Return type: sequences (yield)
-
sequenceLengths
()¶ Returns: Returns a dictionary of the lengths of all sequences in the stream. Return type: dict
-
streamFullSequences
()¶ Streams and yields whole sequences.
Returns: Yields whole sequences. Return type: sequence (yield)
-
windowRegions
()¶ Generates and returns a set of sliding window regions.
Parameters: - size (int) – Window size.
- step (int) – Window step size.
Returns: Region set of all sliding window regions across all sequences in the stream.
Return type: regions
-
class
gnocis.sequences.
sequenceStream2bit
¶ Bases:
gnocis.sequences.sequenceStream
Streams a 2bit file.
Parameters: - path (str) – Path to the input file.
- wantBlockSize (int) – Desired block size.
- spacePrune (bool) – Whether or not to prune sequence name spaces.
- dropChr (bool) – Whether or not to prune sequence name “chr”-prefixes.
- restrictToSequences (list) – List of sequence names to restrict to/focus on.
-
class
gnocis.sequences.
sequenceStreamFASTA
¶ Bases:
gnocis.sequences.sequenceStream
Streams a FASTA file.
Parameters: - path (str) – Path to the input file.
- wantBlockSize (int) – Desired block size.
- spacePrune (bool) – Whether or not to prune sequence name spaces.
- dropChr (bool) – Whether or not to prune sequence name “chr”-prefixes.
- restrictToSequences (list) – List of sequence names to restrict to/focus on.
- isGZ (bool) – Whether or not the input file is GZipped.
-
class
gnocis.sequences.
sequences
¶ Bases:
object
The sequences class represents a set of DNA sequences.
Parameters: - name (str) – Name of the sequence set.
- seq (list) – List of sequences to include.
For a sequence set S, len(S) gives the number of sequences, and [ s for s in S ] lists the sequences. For two sequence sets A and B, A+B gives the set containing all sequences from both sets.
-
label
()¶ Adds a label to sequences.
Parameters: label (sequenceLabel) – Label to add to sequences. Returns: Sequences with label added. Return type: sequences
-
labels
()¶ Gets labels for sequences.
Returns: Set of labels. Return type: set
-
name
¶
-
printStatistics
()¶ Outputs basic statistics.
-
randomSplit
()¶ Parameters: ratio (float) – Ratio for the split, between 0 and 1. For the returned tuple (A, B), a ratio of 0 places all sequences in B, and a ratio of 1 places all sequences in A. Returns: Random split of the sequences into two independent sets, with the given ratio. Return type: sequences
-
sample
()¶ Parameters: n (int) – Number of sequences to pick. If 0, the same number will be selected as are in the full set. Returns: Returns a random sample of the sequences of size n. Sequences are selected with replacement, and the same sequence can occur multiple times. Return type: sequences
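Sampling with replacement, as described above, means the same item can be drawn more than once. A plain-Python sketch on a list, using a hypothetical helper rather than the gnocis method:

```python
import random

def sample(items, n=0, rng=random):
    """Draw n items with replacement; n == 0 means as many as in the input."""
    if n == 0:
        n = len(items)
    return [rng.choice(items) for _ in range(n)]

names = ['seqA', 'seqB', 'seqC']
print(sample(names, 5, random.Random(0)))  # five names, repeats possible
```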
-
saveFASTA
()¶ Saves the sequences to a FASTA file.
Parameters: path (str) – Output file path.
-
sequenceLengths
()¶ Returns: Returns a dictionary of the lengths of all sequences in the set. Return type: dict
-
sequences
¶
-
sourceRegions
()¶ Returns: Region set of source regions for all sequences in the set. Return type: regions
-
split
()¶ Parameters: n (int) – Number of sets to split into. Returns: Returns a list of n independent splits of the sequence set, without shuffling. Return type: list
-
table
()¶
-
windowRegions
()¶ Generates and returns a set of sliding window regions.
Parameters: - size (int) – Window size.
- step (int) – Window step size.
Returns: Region set of all sliding window regions across all sequences in the set.
Return type: regions
-
windows
()¶ Generates and returns a set of sliding window sequences.
Parameters: - size (int) – Window size.
- step (int) – Window step size.
Returns: Sequence set of all sliding window sequences across all sequences in the set.
Return type: sequences
-
withLabel
()¶ Extracts sequences with the given label or labels.
Parameters: - labels (list, sequenceLabel) – List of labels, or a single label, to extract sequences with.
- ensureAllLabels (bool) – If True, an exception is raised if no sequences were found for a label. Defaults to True.
Returns: Sequences with the given label, or a list of sequence sets if multiple labels are given.
Return type: sequences, list
If multiple labels are given, a list of sequences instances is returned, in the same order as the labels.
-
gnocis.sequences.
stream2bit
()¶ Streams a 2bit file.
Parameters: - path (str) – Path to the input file.
- wantBlockSize (int, optional) – Desired block size.
- spacePrune (bool, optional) – Whether or not to prune sequence name spaces.
- dropChr (bool, optional) – Whether or not to prune sequence name “chr”-prefixes.
- restrictToSequences (list, optional) – List of sequence names to restrict to/focus on.
Returns: Generated sequence stream.
Return type: sequenceStream
-
gnocis.sequences.
streamFASTA
()¶ Streams a FASTA file.
Parameters: - path (str) – Path to the input file.
- wantBlockSize (int, optional) – Desired block size.
- spacePrune (bool, optional) – Whether or not to prune sequence name spaces.
- dropChr (bool, optional) – Whether or not to prune sequence name “chr”-prefixes.
- restrictToSequences (list, optional) – List of sequence names to restrict to/focus on.
Returns: Generated sequence stream.
Return type: sequenceStream
-
gnocis.sequences.
streamFASTAGZ
()¶ Streams a gzipped FASTA file.
Parameters: - path (str) – Path to the input file.
- wantBlockSize (int, optional) – Desired block size.
- spacePrune (bool, optional) – Whether or not to prune sequence name spaces.
- dropChr (bool, optional) – Whether or not to prune sequence name “chr”-prefixes.
- restrictToSequences (list, optional) – List of sequence names to restrict to/focus on.
Returns: Generated sequence stream.
Return type: sequenceStream
-
gnocis.sequences.
streamSequenceWindows
()¶ Yields sequence windows for a sequence set, sequence stream or a path to a sequence file.
Parameters: - src (sequences/sequenceStream/str) – Input sequences.
- windowSize (int) – Window size.
- windowStep (int) – Window step size.
Returns: Sequence windows.
Return type: sequence (yield)
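The windowSize and windowStep parameters used by the window functions in this module describe a sliding window. A plain-Python generator over a string shows the idea. It is a sketch that assumes only complete windows are produced. How gnocis treats a trailing partial window is not stated above.

```python
def sequence_windows(seq, window_size, window_step):
    """Yield (start, window) for each full window of seq."""
    for start in range(0, len(seq) - window_size + 1, window_step):
        yield start, seq[start:start + window_size]

for start, window in sequence_windows('ACGTACGTAC', 4, 3):
    print(start, window)
# 0 ACGT
# 3 TACG
# 6 GTAC
```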
gnocis.sklearnModels module¶
-
class
gnocis.sklearnModels.
RF
(model=None, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>, nTrees=100, maxDepth=None)[source]¶ Bases:
gnocis.featurenetwork.baseModel
-
class
gnocis.sklearnModels.
SVM
(model=None, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>, kDegree=1, C=4)[source]¶ Bases:
gnocis.featurenetwork.baseModel
-
class
gnocis.sklearnModels.
sequenceModelLasso
(name, features, trainingSet, windowSize, windowStep, alpha=1.0, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)[source]¶ Bases:
gnocis.models.sequenceModel
The sequenceModelLasso class trains a Lasso model using scikit-learn.
Parameters:
-
class
gnocis.sklearnModels.
sequenceModelRF
(name, features, trainingSet, windowSize, windowStep, nTrees=100, maxDepth=None, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)[source]¶ Bases:
gnocis.models.sequenceModel
The sequenceModelRF class trains a Random Forest (RF) model using scikit-learn.
Parameters: - name (str) – Model name.
- features (features) – Feature set.
- positives (sequences) – Positive training sequences.
- negatives (sequences) – Negative training sequences.
- windowSize (int) – Window size.
- windowStep (int) – Window step size.
- nTrees (int) – Number of trees.
- maxDepth (int) – Maximum tree depth.
-
class
gnocis.sklearnModels.
sequenceModelSVM
(name, features, trainingSet, windowSize, windowStep, kDegree, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)[source]¶ Bases:
gnocis.models.sequenceModel
The sequenceModelSVM class trains a Support Vector Machine (SVM) using scikit-learn.
Parameters:
gnocis.sklearnModelsOpt module¶
-
class
gnocis.sklearnModelsOpt.
sequenceModelSVMOptimizedQuadratic
(name, features, trainingSet, windowSize, windowStep, kDegree, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)[source]¶ Bases:
gnocis.models.sequenceModel
The sequenceModelSVMOptimizedQuadratic class trains a quadratic kernel Support Vector Machine (SVM) using scikit-learn. The kernel is applied using matrix multiplication.
Parameters:
-
class
gnocis.sklearnModelsOpt.
sequenceModelSVMOptimizedQuadraticAutoScale
(name, features, trainingSet, windowSize, windowStep, kDegree, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)[source]¶ Bases:
gnocis.models.sequenceModel
The sequenceModelSVMOptimizedQuadraticAutoScale class trains a quadratic kernel Support Vector Machine (SVM) using scikit-learn. The kernel is applied using matrix multiplication.
Parameters:
-
class
gnocis.sklearnModelsOpt.
sequenceModelSVMOptimizedQuadraticCUDA
(name, features, trainingSet, windowSize, windowStep, kDegree, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)[source]¶ Bases:
gnocis.models.sequenceModel
The sequenceModelSVMOptimizedQuadraticCUDA class trains a quadratic kernel Support Vector Machine (SVM) using scikit-learn. The kernel is applied using matrix multiplication with CUDA.
Parameters:
gnocis.validation module¶
-
gnocis.validation.
getAUC
()¶ Calculates the Area Under the Curve (AUC) for an input curve.
Parameters: curve (list) – Curve to calculate area under. Returns: Area Under the Curve (AUC). Return type: float
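The area under a curve given as a list of points is commonly computed with the trapezoidal rule. The sketch below uses plain (x, y) tuples instead of point2D objects, and the use of trapezoids is an assumption about the method, not a statement about how getAUC is implemented.

```python
def auc(curve):
    """Trapezoidal area under a curve given as (x, y) points."""
    pts = sorted(curve)
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(pts, pts[1:]))

print(auc([(0, 0), (1, 1)]))          # 0.5
print(auc([(0, 0), (0, 1), (1, 1)]))  # 1.0
```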
-
gnocis.validation.
getConfusionMatrix
()¶ Calculates confusion matrix based on input validation pairs, and an optional threshold.
Parameters: - vPos (list) – Positive validation pairs.
- vNeg (list) – Negative validation pairs.
- threshold (float) – Classification threshold.
Returns: Dictionary with confusion matrix entries.
Return type: dict
-
gnocis.validation.
getConfusionMatrixStatistics
()¶ Calculates confusion matrix statistics from an input confusion matrix dictionary.
Parameters: CM (dict) – Confusion matrix dictionary. Returns: Dictionary with confusion matrix statistics. Return type: dict
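The relationship between a classification threshold, the confusion matrix, and derived statistics can be sketched as follows. The structure of gnocis validation pairs and the keys of its dictionaries are not documented above, so this plain-Python example uses bare score lists and hypothetical key names.

```python
def confusion_matrix(pos_scores, neg_scores, threshold=0.0):
    """Count outcomes when scores above the threshold are called positive."""
    tp = sum(s > threshold for s in pos_scores)
    fp = sum(s > threshold for s in neg_scores)
    return {'TP': tp, 'FN': len(pos_scores) - tp,
            'FP': fp, 'TN': len(neg_scores) - fp}

def confusion_statistics(cm):
    """Derive common statistics; undefined ratios are returned as NaN."""
    def ratio(num, den):
        return num / den if den else float('nan')
    tp, fn, fp, tn = cm['TP'], cm['FN'], cm['FP'], cm['TN']
    return {
        'precision': ratio(tp, tp + fp),
        'recall': ratio(tp, tp + fn),
        'specificity': ratio(tn, tn + fp),
        'accuracy': ratio(tp + tn, tp + fn + fp + tn),
    }

cm = confusion_matrix([0.9, 0.8, 0.2], [0.7, 0.1, 0.05], threshold=0.5)
print(cm)  # {'TP': 2, 'FN': 1, 'FP': 1, 'TN': 2}
print(confusion_statistics(cm))
```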
-
gnocis.validation.
getPRC
()¶ Generates a Precision/Recall Curve (PRC).
Parameters: - vPos (list) – Positive validation pairs.
- vNeg (list) – Negative validation pairs.
Returns: List of 2D-points for curve
Return type: list
-
gnocis.validation.
getROC
()¶ Generates a Receiver Operating Characteristic (ROC) curve.
Parameters: - vPos (list) – Positive validation pairs.
- vNeg (list) – Negative validation pairs.
Returns: List of 2D-points for curve
Return type: list
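Both curve types are built by sweeping a threshold over the observed scores. The sketch below computes ROC and precision/recall points from plain score lists. The point layout (x, y) and the function names are assumptions made for illustration, and gnocis returns its own point objects rather than tuples.

```python
def roc_points(pos_scores, neg_scores):
    """(false positive rate, true positive rate) for each score threshold."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points

def prc_points(pos_scores, neg_scores):
    """(recall, precision) for each score threshold."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = []
    for t in thresholds:
        tp = sum(s >= t for s in pos_scores)
        fp = sum(s >= t for s in neg_scores)
        # tp + fp >= 1 because t is itself an observed score.
        points.append((tp / len(pos_scores), tp / (tp + fp)))
    return points

print(roc_points([0.9, 0.8], [0.7, 0.1]))
# [(0.0, 0.0), (0.0, 0.5), (0.0, 1.0), (0.5, 1.0), (1.0, 1.0)]
```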
-
class
gnocis.validation.
point2D
¶ Bases:
object
Two-dimensional point.
Parameters: - x (float) – X-coordinate.
- y (float) – Y-coordinate.
- rank (float, optional) – Rank.
-
rank
¶
-
x
¶
-
y
¶
-
gnocis.validation.
printValidationStatistics
()¶ Prints model confusion matrix statistics from a dictionary.
Parameters: stats (dict) – Confusion matrix statistics dictionary.