gnocis

Gnocis is divided into a number of modules, each with a distinct function.

gnocis.biomarkers module

class gnocis.biomarkers.biomarkers

Bases: object

The biomarkers class represents a set of biomarker regions. Multiple region sets marking a phenomenon of interest can be registered in a biomarkers object.

Parameters:
  • name (str) – Name of the biomarker set.
  • regionSets (list, optional) – List of region sets to register as biomarkers. Defaults to [].
  • positiveThreshold (int, optional) – Threshold for Highly BioMarker-Enriched (HBME) loci. Defaults to -1.
  • negativeThreshold (int, optional) – Threshold for Lowly BioMarker-Enriched (LBME) loci. Defaults to -1.

After constructing a biomarkers object, for a set of biomarkers BM, len(BM) gives the number of markers, BM['FACTOR_NAME'] gives the merged set of regions for factor FACTOR_NAME, and [ x for x in BM ] gives the list of merged regions per biomarker contained in BM.

HBMEs()

Gets highly biomarker-enriched regions of a set. This is determined either by an optional threshold argument, or, if unspecified, by the positiveThreshold member. The resulting set is the merged biomarker spectrum for enrichment levels greater than or equal to the threshold.

Parameters:rs (regions) – Region set to extract enriched subset of.
Returns:Highly BioMarker-Enriched regions (HBMEs).
Return type:regions
LBMEs()

Gets lowly biomarker-enriched regions of a set. This is determined either by an optional threshold argument, or, if unspecified, by the negativeThreshold member. The resulting set is the merged biomarker spectrum for enrichment levels less than or equal to the threshold.

Parameters:rs (regions) – Region set to extract enriched subset of.
Returns:Lowly BioMarker-Enriched regions (LBMEs).
Return type:regions
add()

Adds a biomarker to the set, by name and region set. The name of the region set is taken as the name of the biomarker.

Parameters:rs (regions) – Region set to add as biomarker.
biomarkers
enrichmentSpectrum()

Returns the biomarker spectrum. This is a dictionary with numerical keys, containing subsets enriched in N biomarkers, for every valid N as key.

Parameters:rs (regions) – Region set to get the biomarker spectrum for.
Returns:Biomarker spectrum
Return type:dict
name
negativeThreshold
positiveThreshold
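
The biomarker spectrum and the threshold-based extraction described above can be illustrated with a small standalone sketch. It is not the Gnocis implementation: regions are simplified to (start, end) tuples on a single sequence, and the helper names overlaps, enrichment_spectrum and highly_enriched are hypothetical.

```python
def overlaps(a, b):
    """True if two inclusive (start, end) intervals share at least one position."""
    return a[0] <= b[1] and b[0] <= a[1]

def enrichment_spectrum(query, markers):
    """Group query regions by the number of biomarkers that overlap them.

    query:   list of (start, end) tuples (simplified regions, an assumption).
    markers: dict mapping biomarker name -> list of (start, end) tuples.
    Returns a dict with numerical keys N -> regions overlapped by N biomarkers.
    """
    spectrum = {}
    for region in query:
        n = sum(
            any(overlaps(region, m) for m in marker_regions)
            for marker_regions in markers.values()
        )
        spectrum.setdefault(n, []).append(region)
    return spectrum

def highly_enriched(spectrum, threshold):
    """Merge the spectrum entries with enrichment level >= threshold."""
    return sorted(r for n, rs in spectrum.items() if n >= threshold for r in rs)
```

In this sketch, a lowly enriched subset would be obtained the same way with the comparison reversed (n <= threshold).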

gnocis.common module

gnocis.common.CI()

Calculates a confidence interval of the mean for a set of values. By default, Gnocis calculates a 95% confidence interval with a normal distribution. It is recommended to replace this with an appropriate distribution depending on the analysis performed.

Parameters:X (list) – Values.
Returns:Confidence interval difference from the mean.
Return type:float
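
As a rough illustration of the default described above, the half-width of a 95% normal-approximation confidence interval of the mean can be sketched as follows. This is a sketch under stated assumptions, not the Gnocis code: the name ci95_normal, the use of the sample standard deviation, and the constant 1.96 are all choices made for the example.

```python
import math

def ci95_normal(values):
    """Half-width of a 95% normal-approximation CI of the mean (illustrative)."""
    n = len(values)
    if n < 2:
        # A single value carries no spread information in this sketch.
        return 0.0
    mean = sum(values) / n
    # Sample variance with n - 1 in the denominator (an assumption).
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    # 1.96 is the two-sided 95% quantile of the standard normal distribution.
    return 1.96 * math.sqrt(variance / n)
```

The interval is then the mean plus or minus the returned value, which matches the "difference from the mean" convention of the return value documented above.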
gnocis.common.KLdiv()
gnocis.common.SE()
gnocis.common.getFloat()

Helper function to get floating point values of the same precision as used for float by Cython.

Parameters:v (float) – Value.
gnocis.common.getReverseComplementaryDNASequence()

Returns the reverse complement of a DNA sequence.

Parameters:seq (str) – Sequence to get reverse complement of.
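
For reference, the operation itself can be sketched in plain Python. The function name reverse_complement is hypothetical, and only unambiguous nucleotides are handled here; how Gnocis treats other characters is not stated above.

```python
# Translation table for the four unambiguous nucleotides, in both cases.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every nucleotide, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```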
gnocis.common.mean()
gnocis.common.setConfidenceIntervalFunction()

Sets the function for calculating confidence intervals.

Parameters:CIfunc (function) – Function that takes a set of values and outputs the confidence interval difference from the mean. A symmetric confidence interval is assumed.
gnocis.common.setSeed()

Sets the random seed.

Parameters:seed (int) – Random seed.
gnocis.common.std()
gnocis.common.useSciPyConfidenceIntervals()

Sets the function for calculating confidence intervals.

Parameters:CIfunc (function) – Function that takes a set of values and outputs the confidence interval difference from the mean. A symmetric confidence interval is assumed.

gnocis.features module

class gnocis.features.feature

Bases: object

The feature class is a base class for sequence features.

cachedSequence
cachedValue
get()

Extracts the feature value from a sequence.

Parameters:seq (sequence) – The sequence to extract the feature value from.
Returns:Feature value
Return type:float
class gnocis.features.featureMotifOccurrenceFrequency

Bases: gnocis.features.feature

Occurrence frequency feature for a motif.

Parameters:_motif (motif) – The motif that the feature is for.
get()

Extracts the feature value from a sequence.

Parameters:seq (sequence) – The sequence to extract the feature value from.
Returns:Feature value
Return type:float
m
class gnocis.features.featurePREdictorMotifPairOccurrenceFrequency

Bases: gnocis.features.feature

PREdictor pair occurrence frequency feature for two motifs and a distance cutoff.

Parameters:
  • motifA (motif) – The first motif.
  • motifB (motif) – The second motif.
  • distanceCutoff (int) – Pairing cutoff distance. The cutoff is for nucleotides in between occurrences of motifA and motifB.
distCut
get()

Extracts the feature value from a sequence.

Parameters:seq (sequence) – The sequence to extract the feature value from.
Returns:Feature value
Return type:float
mA
mB
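
The role of the distance cutoff can be illustrated with a standalone sketch that counts motif pairs from two lists of occurrences. It is an assumption-laden illustration, not the Gnocis code: occurrences are simplified to inclusive (start, end) tuples, overlapping occurrences are skipped, and the name count_pairs is hypothetical.

```python
def count_pairs(occ_a, occ_b, distance_cutoff):
    """Count pairs of occurrences separated by at most distance_cutoff nucleotides."""
    pairs = 0
    for a_start, a_end in occ_a:
        for b_start, b_end in occ_b:
            if a_end < b_start:
                gap = b_start - a_end - 1   # nucleotides in between, A first
            elif b_end < a_start:
                gap = a_start - b_end - 1   # nucleotides in between, B first
            else:
                continue                    # overlapping: skipped in this sketch
            if gap <= distance_cutoff:
                pairs += 1
    return pairs
```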
class gnocis.features.featureScaler

Bases: gnocis.features.features

The featureScaler class scales an input feature set, based on a binary training set.

Parameters:
  • _features (features) – The feature set to scale.
  • trainingSet (sequences) – Training sequences.
getAll()

Extracts the feature values from a sequence.

Parameters:seq (sequence) – The sequence to extract the feature value from.
Returns:Feature vector
Return type:list
class gnocis.features.features

Bases: object

The features class represents a set of features.

Parameters:
  • name (str) – Name of the feature set
  • f (list, optional) – Features to include. Defaults to [].

After constructing a features object, for a set of features FS, len(FS) gives the number of features, FS[i] gives feature with index i, and [ x for x in FS ] gives the list of feature objects. Two features objects A and B can be added together with A+B, yielding a concatenated set of features.

diffsummary()
features
getAll()

Extracts the feature values from a sequence.

Parameters:seq (sequence) – The sequence to extract the feature value from.
Returns:Feature vector
Return type:list
static kSpectrum()

Constructs the k-mer spectrum kernel, the occurrence frequencies of all motifs of length k, with no ambiguous positions.

Parameters:k (int) – k-mer length.
Returns:Feature set
Return type:features
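
A k-mer spectrum can be sketched independently of Gnocis as one count per unambiguous k-mer. The sketch below is illustrative only: the name k_spectrum is hypothetical, only the forward strand is scanned, and raw counts are returned, whereas the normalization and strand handling used by the library are not specified above.

```python
from itertools import product

def k_spectrum(seq, k):
    """Return occurrence counts for all 4**k unambiguous k-mers, in lexical order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing ambiguous characters are ignored
            counts[kmer] += 1
    return [counts[kmer] for kmer in kmers]
```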
static kSpectrumGPS()

Constructs the k-mer spectrum graded position stranded kernel, the occurrence frequencies of all motifs of length k, with position and strand represented.

Parameters:k (int) – k-mer length.
Returns:Feature set
Return type:features
static kSpectrumMM()

Constructs the k-mer spectrum mismatch kernel, the occurrence frequencies of all motifs of length k, with one allowed mismatch in an arbitrary position.

Parameters:k (int) – k-mer length.
Returns:Feature set
Return type:features
static kSpectrumMMD()

Constructs the k-mer spectrum mismatch kernel, the occurrence frequencies of all motifs of length k, with one allowed mismatch in an arbitrary position. This formulation counts the base k-mer k times, and mutated k-mers one time.

Parameters:k (int) – k-mer length.
Returns:Feature set
Return type:features
static motifPairSpectrum()

Constructs a feature set for the PREdictor motif pair occurrence spectrum, given a set of motifs.

Parameters:motifs (motifs) – The motifs to use for the feature space.
Returns:Feature set
Return type:features
static motifSpectrum()

Constructs a feature set for the motif occurrence spectrum, given a set of motifs.

Parameters:motifs (motifs) – The motifs to use for the feature space.
Returns:Feature set
Return type:features
name
nvalues
scale()
summary()
table()
values
class gnocis.features.kSpectrum

Bases: object

Feature set that extracts the occurrence frequencies of all unambiguous motifs of length k. Extraction of the entire spectrum is optimized by taking overlaps of motifs into account.

Parameters:nspectrum (int) – Length of motifs, k, to use.
bitmask
cachedSequence
cachedSpectrum
features
kmerByIndex
nFeatures
nspectrum
class gnocis.features.kSpectrumFeatureGPS

Bases: gnocis.features.feature

get()

Extracts the feature value from a sequence.

Parameters:seq (sequence) – The sequence to extract the feature value from.
Returns:Feature value
Return type:float
index
kmer
parent
section
class gnocis.features.kSpectrumGPS

Bases: object

Feature set that extracts the occurrence frequencies of all motifs of length k, position and strand represented. To represent position and strandedness, four features are generated per k-mer: one for each combination of strandedness and sequence end for proximity. Graded distance to each end of the sequence is used in order to represent position. Extraction of the entire spectrum is optimized by taking overlaps of motifs into account.

Parameters:nspectrum (int) – Length of motifs, k, to use.
bitmask
cachedSequence
cachedSpectrum
features
kmerByIndex
nFeatures
nkmers
nspectrum
class gnocis.features.kSpectrumMM

Bases: object

Feature set that extracts the occurrence frequencies of all motifs of length k, with one mismatch allowed in an arbitrary position. Extraction of the entire spectrum is optimized by taking overlaps of motifs into account.

Parameters:nspectrum (int) – Length of motifs, k, to use.
bitmask
cachedSequence
cachedSpectrum
features
kmerByIndex
nFeatures
nspectrum
class gnocis.features.kSpectrumMMD

Bases: object

Feature set that extracts the occurrence frequencies of all motifs of length k, with one mismatch allowed in an arbitrary position. This formulation counts the base k-mer k times, and mutated k-mers one time. Extraction of the entire spectrum is optimized by taking overlaps of motifs into account.

Parameters:nspectrum (int) – Length of motifs, k, to use.
bitmask
cachedSequence
cachedSpectrum
features
kmerByIndex
nFeatures
nspectrum
class gnocis.features.kSpectrumMMDFeature

Bases: gnocis.features.feature

get()

Extracts the feature value from a sequence.

Parameters:seq (sequence) – The sequence to extract the feature value from.
Returns:Feature value
Return type:float
index
kmer
parent
class gnocis.features.scaledFeature

Bases: gnocis.features.feature

The scaledFeature class scales and shifts an input feature.

Parameters:
  • _feature (feature) – The feature to scale.
  • vScale (float) – Value to scale by.
  • vSub (float) – Shifting.

feature
get()

Extracts the feature value from a sequence.

Parameters:seq (sequence) – The sequence to extract the feature value from.
Returns:Feature value
Return type:float
vScale
vSub

gnocis.models module

class gnocis.models.CVModelPredictions

Bases: object

Helper class that represents cross-validated model predictions.

Parameters:
  • model (sequenceModel) – Model.
  • labelPredictionCurves (dict) – Labelled prediction curves.
getStatTable
predictCore
regions
setCorePredictions
gnocis.models.createDummyPREdictorModel()
gnocis.models.crossvalidate()
class gnocis.models.crossvalidation

Bases: object

Helper class for cross-validations. Accepts binary training and validation sets and constructs cross-validation sets for a desired number of repeats. If a separate validation set is not given, the training set is used. The cross-validation set for each repeat contains numbers of training and test sequences determined by a training-to-testing sequence ratio, as well as a negative-per-positive test sequence ratio. When constructing the validation set, sequence identities are checked against the training set to avoid contamination (this check will not work if sequences are cloned). The class holds models and validation statistics, integrates with the terminal and IPython for visualization of results, and stores training and testing data so that new models can be added.

Parameters:
  • models (list) – List of models to cross-validate.
  • tpos (sequences) – Positive training set.
  • tneg (sequences) – Negative training set.
  • vpos (sequences, optional) – Positive validation set.
  • vneg (sequences, optional) – Negative validation set.
  • repeats (int, optional) – Number of experimental repeats. Default = 20.
  • ratioTrainTest (float, optional) – Ratio of training to testing sequences. Default = 80%.
  • ratioNegPos (float, optional) – Ratio of validation negatives to positives. Default = 100.
addModel
calibrate
getAUCTable
getConfigurationTable
plotPRC
plotROC
predict
class gnocis.models.sequenceModel

Bases: object

The sequenceModel class is an abstract class for sequence models. A number of methods are implemented for machine learning and prediction with DNA sequences.

Parameters:
  • name (string) – Name of model.
  • enableMultiprocessing (bool) – If True (default), multiprocessing is enabled.
calibrateGenomewidePrecision

Calibrates the model threshold for an expected precision genome-wide. Returns self to facilitate chaining of operations. However, this operation does mutate the model object. A scaling factor can be applied to the genome with the 'factor' argument. If, for instance, the positive set has been divided in half for independent training and calibration, a factor of 0.5 can be used.

Parameters:
  • positives (sequences/sequenceStream) – Positive sequences.
  • genome (sequences/sequenceStream/str) – The genome.
  • factor (float, optional) – Scaling factor for positive set versus genome.
  • precision (float, optional) – The precision to approximate.
Returns:

The calibrated model (self)

Return type:

sequenceModel

getConfusionMatrix

Calculates and returns a confusion matrix for sets of positive and negative sequences.

Parameters:
Returns:

Confusion matrix

Return type:

dict

getOptimalAccuracyThreshold

Gets a threshold value optimized for accuracy on a set of positive and a set of negative sequences.

Parameters:
Returns:

Threshold value optimized for accuracy

Return type:

float

getPRC

Calculates and returns a Precision/Recall curve (PRC) for sets of positive and negative sequences.

Parameters:
Returns:

Precision/Recall curve (PRC)

Return type:

list

getPRCAUC

Calculates and returns the area under a Precision/Recall curve (PRCAUC) for sets of positive and negative sequences.

Parameters:
Returns:

Area under the Precision/Recall curve (PRCAUC)

Return type:

float

getPrecisionThreshold

Gets a threshold value for a desired precision to a set of positive and a set of negative sequences. Linear interpolation is used in order to achieve a close approximation.

Parameters:
  • seqs (sequences) – Sequences.
  • labelPositive (sequenceLabel) – Label of positives.
  • labelNegative (sequenceLabel) – Label of negatives.
  • wantedPrecision (float) – The precision to approximate.
Returns:

Threshold value for the desired precision

Return type:

float

getROC

Calculates and returns a Receiver Operating Characteristic (ROC) curve for sets of positive and negative sequences.

Parameters:
Returns:

Receiver Operating Characteristic (ROC) curve

Return type:

list

getROCAUC

Calculates and returns the area under a Receiver Operating Characteristic curve (ROCAUC) for sets of positive and negative sequences.

Parameters:
Returns:

Area under the Receiver Operating Characteristic curve (ROCAUC)

Return type:

float
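
For intuition, the area under a ROC curve equals the probability that a randomly chosen positive scores higher than a randomly chosen negative, with ties counted as one half. The sketch below computes it directly from two lists of scores. It is a generic illustration, not the Gnocis code path, and the name roc_auc is hypothetical.

```python
def roc_auc(pos_scores, neg_scores):
    """Area under the ROC curve via pairwise comparison of scores."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0    # positive ranked above negative
            elif p == n:
                wins += 0.5    # ties count as half
    return wins / (len(pos_scores) * len(neg_scores))
```

This quadratic formulation is convenient for small sets; for large sets a rank-based computation gives the same value more efficiently.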

getSequenceScores

Scores a set of sequences, returning the maximum window score for each. Multiprocessing is enabled by default, but can be disabled in the constructor.

Parameters:seqs (sequences/sequenceStream) – Sequences to score.
Returns:List of the maximum window score per sequence
Return type:list
getValidationStatistics

Returns common model validation statistics (confusion matrix values; ROCAUC; PRCAUC).

Parameters:
Returns:

Validation statistics

Return type:

list

plotPRC

Plots a Precision/Recall curve and either displays it in an IPython session or saves it to a file.

Parameters:
  • seqs (sequences) – Sequences.
  • labelPositive (sequenceLabel) – Label of positives.
  • labelNegative (sequenceLabel) – Label of negatives.
  • figsize (tuple, optional) – Tuple of figure dimensions.
  • outpath (str, optional) – Path to save generated plot to. If not set, the plot will be output to IPython.
  • style (str, optional) – Matplotlib style to use.
predict

Applies the model using a sliding window across an input sequence stream or sequence set. Windows with a score >= self.threshold are predicted, and predicted windows are merged into non-overlapping predictions.

Parameters:stream (sequences/sequenceStream/str) – Sequence material that the model is applied to for prediction.
Returns:Set of non-overlapping (merged) predicted regions
Return type:regions
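
The sliding-window procedure described above can be sketched on a plain string. This is a simplified illustration rather than the Gnocis implementation: predict_regions and its arguments are hypothetical, coordinates are 0-based with inclusive ends, and score_window stands in for the model's window scoring.

```python
def predict_regions(seq, score_window, window_size, window_step, threshold):
    """Slide a window over seq and merge overlapping windows scoring >= threshold."""
    predicted = []
    last_start = max(len(seq) - window_size, 0)
    for start in range(0, last_start + 1, window_step):
        window = seq[start:start + window_size]
        if score_window(window) < threshold:
            continue
        end = start + len(window) - 1  # inclusive end coordinate
        if predicted and start <= predicted[-1][1]:
            # Overlaps the previous prediction: extend it instead of adding a new one.
            predicted[-1] = (predicted[-1][0], max(predicted[-1][1], end))
        else:
            predicted.append((start, end))
    return predicted
```

For example, with a toy scoring function that counts G nucleotides, predict_regions(seq, lambda w: w.count("G"), 4, 2, 2) returns one merged region per G-rich stretch of seq.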
predictCore

Predicts cores. Not implemented by default. Models that use core prediction should implement this.

Parameters:genome (genome) – Genome.
Returns:Predicted core regions if implemented, or None otherwise
Return type:regions
predictSequenceStreamCurves

Applies the model using a sliding window across an input sequence stream or sequence set. Windows with a score >= self.threshold are predicted, and predicted windows are merged into non-overlapping predictions.

Parameters:stream (sequences/sequenceStream/str) – Sequence material that the model is applied to for prediction.
Returns:Set of non-overlapping (merged) predicted regions
Return type:regions
predictSequenceStreamRegions

Applies the model using a sliding window across an input sequence stream or sequence set. Windows with a score >= self.threshold are predicted, and predicted windows are merged into non-overlapping predictions.

Parameters:stream (sequences/sequenceStream/str) – Sequence material that the model is applied to for prediction.
Returns:Set of non-overlapping (merged) predicted regions
Return type:regions
printTestStatistics
score

Scores a sequence or set of sequences.

Parameters:
  • target (sequence/sequences/sequenceStream) – Sequence(s) to score.
  • asTable (bool) – If True, a table will be output. Otherwise, a list.
Returns:

Score or list of scores

Return type:

float/list

scoreSequence

Scores a single sequence. The score is determined by applying the model with a sliding window, and taking the maximum score.

Parameters:seq (sequences/sequenceStream) – The sequence.
Returns:Maximal window score
Return type:float
train

Trains the model on a training set.

Parameters:ts (sequences) – Training set.
Returns:Trained model
Return type:sequenceModel
class gnocis.models.sequenceModelDummy

Bases: gnocis.models.sequenceModel

This model takes a feature set, and scores input sequences by summing feature values, without weighting.

Parameters:
  • name (str) – Model name.
  • features (features) – The feature set.
  • windowSize (int) – Window size to use.
  • windowStep (int) – Window step size to use.
getTrainer
scoreWindow
class gnocis.models.sequenceModelLogOdds

Bases: gnocis.models.sequenceModel

Constructs a log-odds model based on an input feature set and binary training set, and scores input sequences by summing log-odds-weighted feature values.

Parameters:
  • name (str) – Model name.
  • features (features) – The feature set.
  • trainingSet (sequences) – Training sequences.
  • windowSize (int) – Window size to use.
  • windowStep (int) – Window step size to use.
  • labelPositive (sequenceLabel) – Positive training class label.
  • labelNegative (sequenceLabel) – Negative training class label.
getTrainer
scoreWindow
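
The idea of summing log-odds-weighted feature values can be sketched as follows. This is one plausible formulation under stated assumptions, not the Gnocis implementation: weights are taken as the log ratio of mean feature values in the positive and negative training sets, a pseudocount guards against division by zero, and both function names are hypothetical.

```python
import math

def train_log_odds(pos_vectors, neg_vectors, pseudocount=1e-3):
    """Return one log-odds weight per feature from two lists of feature vectors."""
    n_features = len(pos_vectors[0])
    weights = []
    for i in range(n_features):
        mean_pos = sum(v[i] for v in pos_vectors) / len(pos_vectors)
        mean_neg = sum(v[i] for v in neg_vectors) / len(neg_vectors)
        # The pseudocount is an assumption of this sketch.
        weights.append(math.log((mean_pos + pseudocount) / (mean_neg + pseudocount)))
    return weights

def score_log_odds(weights, vector):
    """Score a feature vector as the weighted sum of its feature values."""
    return sum(w * x for w, x in zip(weights, vector))
```

Features that are more frequent in positives receive positive weights, and features that are equally frequent in both classes receive weights near zero.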
gnocis.models.setNCores()

Sets the number of cores to use for multiprocessing.

Parameters:n (int) – The number of cores to use.
gnocis.models.setNThreadFetch()
gnocis.models.trainPREdictorModel()
gnocis.models.trainSinglePREdictorModel()

gnocis.motifs module

class gnocis.motifs.IUPACMotif

Bases: object

Represents an IUPAC motif.

Parameters:
c
cRC
cachedOcc
cachedSequence
find()

Finds the occurrences of the motif in a sequence.

Parameters:
  • seq (sequence) – The sequence.
  • cache (bool) – Whether or not to cache.
Returns:

List of motif occurrences

Return type:

list

motif
name
nmismatches
regexMotif
regexMotifRC
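
Matching an IUPAC motif with a regular expression can be sketched as below. The sketch is illustrative and makes several assumptions: it scans only the forward strand, allows no mismatches, reports 0-based inclusive coordinates, and uses the hypothetical names iupac_to_regex and find_forward. The mapping table follows the standard IUPAC nucleotide codes.

```python
import re

# Standard IUPAC nucleotide codes mapped to regular expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate an IUPAC motif string into a regular expression."""
    return "".join(IUPAC[c] for c in motif.upper())

def find_forward(motif, seq):
    """Find all (possibly overlapping) forward-strand matches as (start, end)."""
    # A lookahead group lets finditer report overlapping matches.
    pattern = re.compile("(?=(" + iupac_to_regex(motif) + "))")
    return [(m.start(), m.start() + len(motif) - 1)
            for m in pattern.finditer(seq.upper())]
```

Reverse-strand occurrences could be found in the same way by searching for the reverse complement of the motif.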
class gnocis.motifs.PWMMotif

Bases: object

Represents a Position Weight Matrix (PWM) motif.

Parameters:
  • name (str) – Name of the motif.
  • pwm (str) – Position Weight Matrix.
  • path (str) – Path to file that the PWM was loaded from.
PWMF
PWMRC
cachedOcc
cachedSequence
find()

Finds the occurrences of the motif in a sequence.

Parameters:
  • seq (sequence) – The sequence.
  • cache (bool) – Whether or not to cache.
Returns:

List of motif occurrences

Return type:

list

getEValueThreshold()

Calibrates the threshold for a desired E-value.

Parameters:
  • bgModel (object) – Background model to use for calibration.
  • Evalue (float) – The desired E-value.
  • EvalueUnit (float) – E-value unit.
  • seedThreshold (float) – Seed threshold.
  • iterations (int) – Number of iterations.
Returns:

Threshold

Return type:

float

name
path
setPWM()

Sets the Position Weight Matrix.

Parameters:pwm (list) – Position Weight Matrix.
threshold
gnocis.motifs.loadMEMEPWMDatabase()
class gnocis.motifs.motifOccurrence

Bases: object

Represents motif occurrences.

Parameters:
  • motif (object) – The motif.
  • seq (str) – Name of the sequence the motif occurred on.
  • start (int) – Occurrence start nucleotide.
  • end (int) – Occurrence end nucleotide.
  • strand (bool) – Occurrence strand. True is the +/forward strand, and False is the -/backward strand.

end
motif
seq
start
strand
class gnocis.motifs.motifs

Bases: object

The motifs class represents a set of motifs.

Parameters:
  • name (str) – Name of the motif set.
  • motifs (list) – List of motifs.
Ringrose2003

Preset for generating the Ringrose et al. (2003) motif set.

Ringrose2003GTGT

Preset for generating the Ringrose et al. (2003) motif set, with GTGT added (as in Bredesen et al. 2019).

motifs
name
occFreq

Generates an occurrence frequency feature set

pairFreq

Generates a pair occurrence frequency feature set

table

gnocis.regions module

gnocis.regions.loadBED()

Loads regions from a BED (https://www.ensembl.org/info/website/upload/bed.html) file.

Parameters:path (str) – Path to the input file.
Returns:Loaded regions.
Return type:regions
gnocis.regions.loadBEDGZ()

Loads regions from a gzipped BED (https://www.ensembl.org/info/website/upload/bed.html) file.

Parameters:path (str) – Path to the input file.
Returns:Loaded regions.
Return type:regions
gnocis.regions.loadCoordinateList()

Loads regions from a coordinate list file (one region per line, with the format seq:start..end).

Parameters:path (str) – Path to the input file.
Returns:Loaded regions.
Return type:regions
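
The seq:start..end line format mentioned above can be parsed with a few string operations. The sketch below is illustrative only: parse_coordinate_line is a hypothetical helper that returns a plain tuple, and it does not reproduce any further processing the loader may apply.

```python
def parse_coordinate_line(line):
    """Parse one 'seq:start..end' line into (seq, start, end)."""
    # Split on the last colon so that sequence names containing ':' still parse.
    seq, _, span = line.strip().rpartition(":")
    start, _, end = span.partition("..")
    return seq, int(start), int(end)
```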
gnocis.regions.loadGFF()

Loads regions from a General Feature Format (https://www.ensembl.org/info/website/upload/gff.html) file.

Parameters:path (str) – Path to the input file.
Returns:Loaded regions.
Return type:regions
gnocis.regions.loadGFFGZ()

Loads regions from a gzipped General Feature Format (https://www.ensembl.org/info/website/upload/gff.html) file.

Parameters:path (str) – Path to the input file.
Returns:Loaded regions.
Return type:regions
gnocis.regions.nucleotidePrecisionBarplot()

Generates a prediction nucleotide precision barplot.

Parameters:
  • predictionSets (list) – List of prediction region sets.
  • regionSets (list) – List of validation region sets.
  • figsize (tuple, optional) – Figure size.
  • outpath (str, optional) – Output path.
  • returnHTML (bool, optional) – If True, an HTML node will be returned.
  • fontsizeLabels (float, optional) – Size of label font.
  • fontsizeLegend (float, optional) – Size of legend font.
  • fontsizeAxis (float, optional) – Size of axis font.
  • style (str, optional) – Plot style to use.
  • showLegend (bool, optional) – Flag for whether or not to render legend.
  • bboxAnchorTo (tuple, optional) – Legend anchor point.
  • legendLoc (str, optional) – Legend location.
  • showValues (bool, optional) – If True, values will be plotted.
gnocis.regions.nucleotideRegionF1Barplot()

Generates a barplot of F1-measures, where the recall is based on region overlap sensitivity and precision is based on nucleotide precision.

Parameters:
  • predictionSets (list) – List of prediction region sets.
  • regionSets (list) – List of validation region sets.
  • figsize (tuple, optional) – Figure size.
  • outpath (str, optional) – Output path.
  • returnHTML (bool, optional) – If True, an HTML node will be returned.
  • fontsizeLabels (float, optional) – Size of label font.
  • fontsizeLegend (float, optional) – Size of legend font.
  • fontsizeAxis (float, optional) – Size of axis font.
  • style (str, optional) – Plot style to use.
  • showLegend (bool, optional) – Flag for whether or not to render legend.
  • bboxAnchorTo (tuple, optional) – Legend anchor point.
  • legendLoc (str, optional) – Legend location.
  • showValues (bool, optional) – If True, values will be plotted.
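
The F1-measure is the harmonic mean of precision and recall. A minimal sketch, with the hypothetical name f1_measure and a zero return value chosen for the degenerate case where both inputs are zero:

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; 0.0 if both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

For this plot, the recall would be the region overlap sensitivity and the precision the nucleotide precision, as stated in the description above.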
gnocis.regions.overlapPrecisionBarplot()

Generates a prediction overlap precision barplot.

Parameters:
  • predictionSets (list) – List of prediction region sets.
  • regionSets (list) – List of validation region sets.
  • figsize (tuple, optional) – Figure size.
  • outpath (str, optional) – Output path.
  • returnHTML (bool, optional) – If True, an HTML node will be returned.
  • fontsizeLabels (float, optional) – Size of label font.
  • fontsizeLegend (float, optional) – Size of legend font.
  • fontsizeAxis (float, optional) – Size of axis font.
  • style (str, optional) – Plot style to use.
  • showLegend (bool, optional) – Flag for whether or not to render legend.
  • bboxAnchorTo (tuple, optional) – Legend anchor point.
  • legendLoc (str, optional) – Legend location.
  • showValues (bool, optional) – If True, values will be plotted.
gnocis.regions.overlapSensitivityBarplot()

Generates a prediction overlap sensitivity barplot.

Parameters:
  • predictionSets (list) – List of prediction region sets.
  • regionSets (list) – List of validation region sets.
  • figsize (tuple, optional) – Figure size.
  • outpath (str, optional) – Output path.
  • returnHTML (bool, optional) – If True, an HTML node will be returned.
  • fontsizeLabels (float, optional) – Size of label font.
  • fontsizeLegend (float, optional) – Size of legend font.
  • fontsizeAxis (float, optional) – Size of axis font.
  • style (str, optional) – Plot style to use.
  • showLegend (bool, optional) – Flag for whether or not to render legend.
  • bboxAnchorTo (tuple, optional) – Legend anchor point.
  • legendLoc (str, optional) – Legend location.
  • showValues (bool, optional) – If True, values will be plotted.
gnocis.regions.predictionBarplot()

Generates a prediction overlap precision barplot.

Parameters:
  • predictionSets (list) – List of prediction region sets.
  • figsize (tuple, optional) – Figure size.
  • outpath (str, optional) – Output path.
  • returnHTML (bool, optional) – If True, an HTML node will be returned.
  • fontsizeLabels (float, optional) – Size of label font.
  • fontsizeLegend (float, optional) – Size of legend font.
  • fontsizeAxis (float, optional) – Size of axis font.
  • style (str, optional) – Plot style to use.
  • showLegend (bool, optional) – Flag for whether or not to render legend.
  • bboxAnchorTo (tuple, optional) – Legend anchor point.
  • legendLoc (str, optional) – Legend location.
  • showValues (bool, optional) – If True, values will be plotted.
class gnocis.regions.region

Bases: object

The region class represents a region within a sequence. A region has a sequence name, a start and end, and a strandedness, with optional additional annotation.

Parameters:
  • seq (str) – Sequence name.
  • start (int) – Start coordinate.
  • end (int) – End coordinate (inclusive).
  • strand (bool, optional) – Strandedness. True for the + (forward) strand (default), and False for the - (reverse) strand.
  • score (float, optional) – Score.
  • source (str, optional) – Source.
  • feature (str, optional) – Feature.
  • group (str, optional) – Group.
  • dropChr (bool, optional) – True if a leading chr in seq is to be dropped (default).

The length of a region r can be calculated with len(r).

bstr()
Returns:Returns the region formatted as a string: seq:start..end (strand).
Return type:str
center()
end
ext
feature
group
markers
score
seq
singleton()
source
start
strand
gnocis.regions.regionBarplot()

Generates a barplot of a measure (lambda) for predicted regions.

Parameters:
  • mainRegions (list) – List of main region sets.
  • measure (function) – Function to compute measure.
  • measureName (str) – Name of measure.
  • testRegions (list, optional) – List of test region sets, to check against (or None if not relevant).
  • figsize (tuple, optional) – Figure size.
  • outpath (str, optional) – Output path.
  • returnHTML (bool, optional) – If True, an HTML node will be returned.
  • fontsizeLabels (float, optional) – Size of label font.
  • fontsizeLegend (float, optional) – Size of legend font.
  • fontsizeAxis (float, optional) – Size of axis font.
  • style (str, optional) – Plot style to use.
  • showLegend (bool, optional) – Flag for whether or not to render legend.
  • bboxAnchorTo (tuple, optional) – Legend anchor point.
  • legendLoc (str, optional) – Legend location.
  • showValues (bool, optional) – If True, values will be plotted.
  • isPercent (bool, optional) – If True, values will be plotted as percentages.
class gnocis.regions.regions

Bases: object

The regions class represents a set of regions.

Parameters:
  • name (str) – Name for the region set.
  • rgns (list) – List of regions to include.
  • initialSort (bool) – Whether or not to sort the regions (default is True).

For a region set RS, len(RS) gives the number of regions, RS[i] gives the i'th region in the set, and [ r for r in RS ] lists regions in the set. For sets A and B, A+B gives a sorted region set of regions from both sets, A|B gives the merged set, A&B the conjunction, and A^B the regions of A with B excluded.

bstr()
Returns:Returns a concatenated list of regions formatted as seq:start..end (strand).
Return type:str
deltaResize()

Gets a set of resized regions. The delta size is subtracted from the start and added to the end.

Parameters:sizeDelta – Desired region size change.
Returns:Set of resized regions.
Return type:regions
dummy()

Generates a dummy region set, with the same region lengths but random positions.

Parameters:
  • genome (genome) – Genome.
  • useSeq (list) – Chromosomes to focus on.
Returns:

Dummy region set.

Return type:

regions

exclusion()

Gets the regions of this set with regions of another set excluded.

Parameters:other (regions) – Other region set.
Returns:Excluded region set.
Return type:regions
expected()

Calculates an expected statistic by random dummy set creation and averaging.

Parameters:
  • genome (genome) – Genome.
  • statfun (Function) – Function to apply to region sets that returns desired statistic.
  • useSeq (list) – Chromosomes to focus on.
  • repeats (int) – Number of repeats for calculating statistic.
Returns:

Average statistic.

Return type:

float

extract()

Extracts region sequences from a sequence set, sequence stream, or a file by path.

Parameters:src (sequences/sequenceStream/str) – Sequence set, sequence stream, or sequence file path.
Returns:Extracted region sequences.
Return type:sequences
extractSequences()

Extracts region sequences from a sequence set or stream.

Parameters:seqs (sequences/sequenceStream) – Sequences/sequence stream to extract region sequences from.
Returns:Extracted region sequences.
Return type:sequences
extractSequencesFrom2bit()

Extracts region sequences from a 2bit file.

Parameters:path (str) – Path to 2bit file.
Returns:Extracted region sequences.
Return type:sequences
extractSequencesFromFASTA()

Extracts region sequences from a FASTA file.

Parameters:path (str) – Path to FASTA file.
Returns:Extracted region sequences.
Return type:sequences
filter()

Returns a region set filtered with the input lambda function.

Parameters:
  • fltName (str) – Name of the filter. Appended to the region set name.
  • flt (lambda) – Lambda function to be applied to every region, returning True if the region is to be included, or otherwise False.
Returns:

Region set filtered with the lambda function.

Return type:

regions

flatten()

Returns a flattened set of regions, with internally overlapping regions merged.

Returns:Flattened region set.
Return type:regions
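
Flattening amounts to merging overlapping intervals after sorting. The following is a minimal sketch on (seq, start, end) tuples, not the gnocis implementation; the tuple layout and the choice to treat a shared coordinate as an overlap are assumptions.

```python
def flatten(regions):
    """Merge overlapping (seq, start, end) regions within one set."""
    merged = []
    for seq, start, end in sorted(regions):
        if merged and merged[-1][0] == seq and start <= merged[-1][2]:
            # Overlaps the previous region on the same sequence: extend it.
            prev = merged[-1]
            merged[-1] = (seq, prev[1], max(prev[2], end))
        else:
            merged.append((seq, start, end))
    return merged


flat = flatten([
    ("chr1", 40, 80),
    ("chr1", 10, 50),
    ("chr1", 100, 120),
    ("chr2", 10, 20),
])
```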
intersection()

Gets the intersection of regions in this set and another.

Parameters:other (regions) – Other region set.
Returns:Intersection region set.
Return type:regions
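
The intersection of two region sets can be illustrated with a pairwise comparison. This sketch uses plain (seq, start, end) tuples with inclusive coordinates, which is an assumption, and a quadratic loop chosen for brevity; it is not how the library is necessarily implemented.

```python
def intersect(a, b):
    """Return the parts of regions in a that are also covered by regions in b."""
    out = []
    for seq_a, start_a, end_a in a:
        for seq_b, start_b, end_b in b:
            if seq_a != seq_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start <= end:
                out.append((seq_a, start, end))
    return sorted(out)


common = intersect(
    [("chr1", 10, 50), ("chr1", 100, 150)],
    [("chr1", 40, 120), ("chr2", 10, 50)],
)
```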
merge()

Gets the merged set of regions in this set and another.

Parameters:other (regions) – Other region set.
Returns:Merged region set.
Return type:regions
name
nonOverlap()

Gets the regions in this set that do not overlap with another.

Parameters:other (regions) – Other region set.
Returns:Set of non-overlapping regions.
Return type:regions
nucleotidePrecision()

Calculates the nucleotide precision to another set.

Parameters:other (regions) – Other region set.
Returns:Nucleotide precision.
Return type:float
overlap()

Gets the regions in this set that overlap with another.

Parameters:other (regions) – Other region set.
Returns:Set of overlapping regions.
Return type:regions
overlapPrecision()

Calculates the overlap precision to another set.

Parameters:other (regions) – Other region set.
Returns:Overlap precision.
Return type:float
overlapSensitivity()

Calculates the overlap sensitivity to another set.

Parameters:other (regions) – Other region set.
Returns:Overlap sensitivity.
Return type:float
printStatistics()

Outputs basic statistics.

randomSplit()
Parameters:ratio (float) – Ratio for the split, between 0 and 1. For the returned tuple (A, B), a ratio of 0 means all the regions are in B, and a ratio of 1 means all the regions are in A.
Returns:Returns a random split of the regions into two independent sets, with the given ratio.
Return type:regions
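
The meaning of the ratio can be sketched as follows. This is an illustration on a plain list, assuming each item is assigned independently with probability ratio; the library may instead shuffle and cut, and the helper name random_split is hypothetical.

```python
import random


def random_split(items, ratio, seed=None):
    """Split items into (A, B); each item goes to A with probability ratio."""
    rng = random.Random(seed)
    a, b = [], []
    for item in items:
        # rng.random() is in [0, 1), so ratio 0 sends everything to B
        # and ratio 1 sends everything to A.
        (a if rng.random() < ratio else b).append(item)
    return a, b


items = list(range(10))
all_b = random_split(items, 0.0)
all_a = random_split(items, 1.0)
half_a, half_b = random_split(items, 0.5, seed=1)
```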
recenter()

Gets a set of randomly recentered regions. If the target size is smaller than a given region, a region of the desired size is randomly placed within the region. If the desired size is larger, a region of the desired size is placed with equal center to the center of the region.

Parameters:size (int) – Desired region size.
Returns:Set of randomly recentered regions.
Return type:regions
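
The two cases described above can be sketched for a single region. The half-open (start, end) coordinates and the helper name recenter are assumptions of this example, which does not use gnocis.

```python
import random


def recenter(start, end, size, rng=random):
    """Return a (start, end) pair of the desired size for one region."""
    length = end - start
    if size < length:
        # Smaller target: place it at a random position inside the region.
        new_start = rng.randint(start, end - size)
    else:
        # Larger (or equal) target: keep the center of the region.
        center = (start + end) // 2
        new_start = center - size // 2
    return new_start, new_start + size


grown = recenter(100, 200, 300)
shrunk = recenter(100, 200, 20, random.Random(1))
```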
regions
rename()
Parameters:newname (str) – New name.
Returns:Renamed region set.
Return type:regions
sample()
Parameters:n (int) – Number of regions to pick. If 0, the same number will be selected as are in the full set.
Returns:Returns a random sample of the regions of size n. Regions are selected with replacement, and the same region can occur multiple times.
Return type:regions
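
Sampling with replacement, including the n = 0 convention, can be sketched with the standard library. The helper name sample_with_replacement is hypothetical, and the example operates on a plain list rather than a region set.

```python
import random


def sample_with_replacement(items, n=0, seed=None):
    """Draw n items with replacement; n == 0 means as many as in items."""
    rng = random.Random(seed)
    k = n if n > 0 else len(items)
    return rng.choices(items, k=k)


pool = ["r1", "r2", "r3"]
same_size = sample_with_replacement(pool, seed=0)
larger = sample_with_replacement(pool, 5, seed=0)
```

Because items are drawn with replacement, a sample can be larger than the original set, which is useful for bootstrapping.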
saveBED()

Saves the regions to a BED (https://www.ensembl.org/info/website/upload/bed.html) file.

Parameters:path (str) – Output file path.
saveCoordinateList()

Saves the regions to a coordinate list file (a coordinate per line, with the format seq:start..end).

Parameters:path (str) – Output file path.
saveGFF()

Saves the regions to a General Feature Format (https://www.ensembl.org/info/website/upload/gff.html) file.

Parameters:path (str) – Output file path.
sort()

Sorts the set, first by start coordinate, then by sequence.

table()

gnocis.sequences module

class gnocis.sequences.IID

Bases: gnocis.sequences.sequenceGenerator

The IID class trains an i.i.d. model for generating sequences.

Parameters:
  • trainingSequences (sequences/sequenceStream/str) – Training sequences.
  • pseudoCounts (bool, optional) – Whether or not to use pseudocounts. Default is True.
  • addComplements (bool, optional) – Whether or not to add complements. Default is True.
addComplements
generate()

Generates a sequence.

Parameters:length (int) – Length of the sequence to generate.
Returns:Random sequence.
Return type:sequence
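
To make the i.i.d. model concrete, the sketch below estimates a nucleotide distribution from training sequences and draws each position independently from it. It is a plain-Python illustration, not the gnocis implementation; the helper names are hypothetical, the pseudocount of 1 is an assumption, and the addition of complements is omitted.

```python
import random
from collections import Counter


def train_iid(training_sequences, pseudo_counts=True):
    """Estimate nucleotide frequencies, optionally with a pseudocount of 1."""
    counts = Counter({nt: 1 if pseudo_counts else 0 for nt in "ACGT"})
    for seq in training_sequences:
        counts.update(nt for nt in seq.upper() if nt in "ACGT")
    total = sum(counts.values())
    return {nt: counts[nt] / total for nt in "ACGT"}


def generate_iid(distribution, length, rng=random):
    """Draw every nucleotide independently from the same distribution."""
    nts = list(distribution)
    weights = [distribution[nt] for nt in nts]
    return "".join(rng.choices(nts, weights=weights, k=length))


dist = train_iid(["AACG"])
only_a = generate_iid(train_iid(["AAAA"], pseudo_counts=False), 8)
```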
nGenerated
ntDistribution
prepared
pseudoCounts
spectrum
trainingSequences
class gnocis.sequences.MarkovChain

Bases: gnocis.sequences.sequenceGenerator

The MarkovChain class trains a Markov chain for generating sequences.

Parameters:
  • trainingSequences (sequences/sequenceStream/str) – Training sequences.
  • degree (int, optional) – Markov chain degree. Default is 4.
  • pseudoCounts (bool, optional) – Whether or not to use pseudocounts. Default is True.
  • addReverseComplements (bool, optional) – Whether or not to add reverse complements. Default is True.
addReverseComplements
comparableSpectrum
degree
generate()

Generates a sequence.

Parameters:length (int) – Length of the sequence to generate.
Returns:Random sequence.
Return type:sequence
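
A Markov chain of a given degree conditions each nucleotide on the preceding degree nucleotides. The following sketch shows the idea in plain Python under several assumptions: the helper names are hypothetical, degree must be at least 1, training sequences contain only A, C, G and T, the first degree nucleotides are drawn uniformly, and reverse complements are not added.

```python
import random
from collections import defaultdict


def train_markov(training_sequences, degree=4, pseudo_count=1):
    """Count which nucleotide follows each context of length degree."""
    counts = defaultdict(lambda: {nt: pseudo_count for nt in "ACGT"})
    for seq in training_sequences:
        for i in range(degree, len(seq)):
            counts[seq[i - degree:i]][seq[i]] += 1
    return counts


def generate_markov(counts, degree, length, rng=random):
    """Generate a sequence by sampling from the context-specific counts."""
    seq = "".join(rng.choice("ACGT") for _ in range(degree))
    while len(seq) < length:
        # Unseen contexts fall back to a uniform distribution.
        table = counts.get(seq[-degree:], {nt: 1 for nt in "ACGT"})
        nts = list(table)
        seq += rng.choices(nts, weights=[table[nt] for nt in nts])[0]
    return seq[:length]


model = train_markov(["ACACACACAC"], degree=1)
generated = generate_markov(model, 1, 20, random.Random(0))
```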
initialDistribution
nGenerated
prepared
probspectrum
pseudoCounts
spectrum
trainingSequences
gnocis.sequences.getSequenceStreamFromPath()

Creates a sequence stream from a path, deducing the file type.

Parameters:
  • path (str) – Path to the input file.
  • wantBlockSize (int, optional) – Desired block size.
  • spacePrune (bool, optional) – Whether or not to prune sequence name spaces.
  • dropChr (bool, optional) – Whether or not to prune sequence name “chr”-prefixes.
  • restrictToSequences (list, optional) – List of sequence names to restrict to/focus on.
Returns:

Generated sequence stream.

Return type:

sequenceStream

gnocis.sequences.getSequenceWindowRegions()

Generates sequence window regions for a sequence set or stream.

Parameters:
  • stream – Input sequences.
  • windowSize (int) – Window size.
  • windowStep (int) – Window step size.
Returns:

Sequence window regions.

Return type:

regions

gnocis.sequences.loadFASTA()

Loads sequences from a FASTA file.

Parameters:path (str) – Path to the input file.
Returns:Loaded sequences.
Return type:sequences
gnocis.sequences.loadFASTAGZ()

Loads sequences from a gzipped FASTA file.

Parameters:path (str) – Path to the input file.
Returns:Loaded sequences.
Return type:sequences
class gnocis.sequences.sequence

Bases: object

The sequence class represents a DNA sequence.

Parameters:
  • name (str) – Name of the sequence.
  • seq (str) – Sequence.
  • path (str, optional) – Source file path.
  • sourceRegion (region, optional) – Source region.
  • annotation (str, optional) – Annotation.

For a sequence instance S, len(S) gives the sequence length. [ nt for nt in S ] lists the nucleotides. For two sequences A and B, A+B gives the concatenated sequence.

annotation
label()
Parameters:label (sequenceLabel) – Label to add.
Returns:Returns a labelled sequence.
Return type:sequence
labels
name
path
seq
sourceRegion
windows()
Parameters:
  • size (int) – Window size.
  • step (int) – Window step size.
  • includeCroppedEnds (bool, optional) – Whether or not to include cropped windows on ends of sequence. Default is False.
Returns:

Returns a sequence set of sliding window sequences over the sequence.

Return type:

sequences
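
Sliding windows with a size and step can be sketched on a plain string. The helper name sliding_windows is hypothetical, and the interpretation of cropped ends as shorter windows at the end of the sequence is an assumption of this example.

```python
def sliding_windows(seq, size, step, include_cropped_ends=False):
    """Return (start, window) pairs for windows of `size` taken every `step`."""
    out = []
    for start in range(0, len(seq), step):
        window = seq[start:start + size]
        if len(window) < size and not include_cropped_ends:
            break  # Remaining windows would all be shorter than size.
        out.append((start, window))
    return out


full = sliding_windows("ACGTACGTAC", 4, 3)
cropped = sliding_windows("ACGTACGTAC", 4, 3, include_cropped_ends=True)
```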

class gnocis.sequences.sequenceGenerator

Bases: object

The sequenceGenerator class is an abstract class for sequence generators.

generateFASTA()

Generates sequences and saves them to a FASTA file.

Parameters:
  • path (str) – Output path.
  • n (int) – Number of sequences to generate.
  • length (int) – Length of each sequence to generate.
generateSet()
Parameters:
  • n (int) – Number of sequences to generate.
  • length (int) – Length of each sequence to generate.
Returns:

Sequence set of n randomly generated sequences, each of length length.

Return type:

sequences

generateStream()
Parameters:
  • n (int) – Number of sequences to generate.
  • length (int) – Length of each sequence to generate.
  • wantBlockSize (int, optional) – Desired block size.
Returns:

Sequence stream of n randomly generated sequences, each of length length.

Return type:

sequenceStream

class gnocis.sequences.sequenceLabel

Bases: object

The sequenceLabel class represents a label that can be assigned to sequences to be used for model training or testing, such as “positive” or “negative”.

Parameters:
  • name (str) – Name of the sequence label.
  • value (float) – Value to represent the label.
name
value
class gnocis.sequences.sequenceStream

Bases: object

The sequenceStream class represents sequence streams.

fetch()

Fetches up to n sequences (fewer if fewer remain, none if the stream is exhausted).

Parameters:
  • n (int) – Number of sequences to fetch.
  • maxnt (int) – Maximum number of nucleotides to fetch.

Returns:Yields sequences.
Return type:sequences (yield)

sequenceLengths()
Returns:Returns a dictionary of the lengths of all sequences in the stream.
Return type:dict
streamFullSequences()

Streams and yields whole sequences.

Returns:Yields whole sequences.
Return type:sequence (yield)

windowRegions()

Generates and returns a set of sliding window regions.

Parameters:
  • size (int) – Window size.
  • step (int) – Window step size.
Returns:

Region set of all sliding window regions across all sequences in the stream.

Return type:

regions

windows()

Generates and returns a set of sliding window sequences.

Parameters:
  • size (int) – Window size.
  • step (int) – Window step size.
Returns:

Sequence set of all sliding window sequences across all sequences in the stream.

Return type:

sequences

class gnocis.sequences.sequenceStream2bit

Bases: gnocis.sequences.sequenceStream

Streams a 2bit file.

Parameters:
  • path (str) – Path to the input file.
  • wantBlockSize (int) – Desired block size.
  • spacePrune (bool) – Whether or not to prune sequence name spaces.
  • dropChr (bool) – Whether or not to prune sequence name “chr”-prefixes.
  • restrictToSequences (list) – List of sequence names to restrict to/focus on.
class gnocis.sequences.sequenceStreamFASTA

Bases: gnocis.sequences.sequenceStream

Streams a FASTA file.

Parameters:
  • path (str) – Path to the input file.
  • wantBlockSize (int) – Desired block size.
  • spacePrune (bool) – Whether or not to prune sequence name spaces.
  • dropChr (bool) – Whether or not to prune sequence name “chr”-prefixes.
  • restrictToSequences (list) – List of sequence names to restrict to/focus on.
  • isGZ (bool) – Whether or not the input file is GZipped.
class gnocis.sequences.sequences

Bases: object

The sequences class represents a set of DNA sequences.

Parameters:
  • name (str) – Name of the sequence set.
  • seq (list) – List of sequences to include.

For a sequence set S, len(S) gives the number of sequences, and [ s for s in S ] lists the sequences. For two sequence sets A and B, A+B gives the set containing all sequences from both sets.

label()

Adds a label to sequences.

Parameters:label (sequenceLabel) – Label to add to sequences.
Returns:Sequences with label added.
Return type:sequences
labels()

Gets labels for sequences.

Returns:Set of labels.
Return type:set
name
printStatistics()

Outputs basic statistics.

randomSplit()
Parameters:ratio (float) – Ratio for the split, between 0 and 1. For the returned tuple (A, B), a ratio of 0 means all the sequences are in B, and a ratio of 1 means all the sequences are in A.
Returns:Returns a random split of the sequences into two independent sets, with the given ratio.
Return type:sequences
rename()
Parameters:newname (str) – New name.
Returns:Renamed sequence set.
Return type:sequences
sample()
Parameters:n (int) – Number of sequences to pick. If 0, the same number will be selected as are in the full set.
Returns:Returns a random sample of the sequences of size n. Sequences are selected with replacement, and the same sequence can occur multiple times.
Return type:sequences
saveFASTA()

Saves the sequences to a FASTA file.

Parameters:path (str) – Output file path.
sequenceLengths()
Returns:Returns a dictionary of the lengths of all sequences in the set.
Return type:dict
sequences
sourceRegions()
Returns:Region set of source regions for all sequences in the set.
Return type:regions
split()
Parameters:n (int) – Number of sets to split into.
Returns:Returns a list of n independent splits of the sequence set, without shuffling.
Return type:list
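
Splitting without shuffling can be sketched on a plain list. Whether the library uses contiguous chunks, as assumed here, or another deterministic scheme is not stated above; the helper name split_in_order is hypothetical.

```python
def split_in_order(items, n):
    """Split items into n contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n)
    chunks, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


chunks = split_in_order(list(range(7)), 3)
```

Such splits are typically used as folds for cross-validation, with one chunk held out per round.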
table()
windowRegions()

Generates and returns a set of sliding window regions.

Parameters:
  • size (int) – Window size.
  • step (int) – Window step size.
Returns:

Region set of all sliding window regions across all sequences in the set.

Return type:

regions

windows()

Generates and returns a set of sliding window sequences.

Parameters:
  • size (int) – Window size.
  • step (int) – Window step size.
Returns:

Sequence set of all sliding window sequences across all sequences in the set.

Return type:

sequences

withLabel()

Extracts sequences with the given label or labels.

Parameters:
  • labels (list, sequenceLabel) – List of labels, or a single label, to extract sequences with.
  • ensureAllLabels (bool) – If True, an exception is raised if no sequences were found for a label. Default is True.
Returns:

Sequences with the given label, or a list of sequence sets, one per label.

Return type:

sequences, list

If multiple labels are given, a list of sequences instances is returned, in the same order as the labels.

gnocis.sequences.stream2bit()

Streams a 2bit file.

Parameters:
  • path (str) – Path to the input file.
  • wantBlockSize (int, optional) – Desired block size.
  • spacePrune (bool, optional) – Whether or not to prune sequence name spaces.
  • dropChr (bool, optional) – Whether or not to prune sequence name “chr”-prefixes.
  • restrictToSequences (list, optional) – List of sequence names to restrict to/focus on.
Returns:

Generated sequence stream.

Return type:

sequenceStream

gnocis.sequences.streamFASTA()

Streams a FASTA file.

Parameters:
  • path (str) – Path to the input file.
  • wantBlockSize (int, optional) – Desired block size.
  • spacePrune (bool, optional) – Whether or not to prune sequence name spaces.
  • dropChr (bool, optional) – Whether or not to prune sequence name “chr”-prefixes.
  • restrictToSequences (list, optional) – List of sequence names to restrict to/focus on.
Returns:

Generated sequence stream.

Return type:

sequenceStream

gnocis.sequences.streamFASTAGZ()

Streams a gzipped FASTA file.

Parameters:
  • path (str) – Path to the input file.
  • wantBlockSize (int, optional) – Desired block size.
  • spacePrune (bool, optional) – Whether or not to prune sequence name spaces.
  • dropChr (bool, optional) – Whether or not to prune sequence name “chr”-prefixes.
  • restrictToSequences (list, optional) – List of sequence names to restrict to/focus on.
Returns:

Generated sequence stream.

Return type:

sequenceStream

gnocis.sequences.streamSequenceWindows()

Yields sequence windows for a sequence set, sequence stream or a path to a sequence file.

Parameters:
  • src (sequences/sequenceStream/str) – Input sequences.
  • windowSize (int) – Window size.
  • windowStep (int) – Window step size.
Returns:

Sequence windows.

Return type:

sequence (yield)

gnocis.sklearnModels module

class gnocis.sklearnModels.RF(model=None, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>, nTrees=100, maxDepth=None)

Bases: gnocis.featurenetwork.baseModel

score(featureVectors)
train(trainingSet)
weights(featureNames)
class gnocis.sklearnModels.SVM(model=None, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>, kDegree=1, C=4)

Bases: gnocis.featurenetwork.baseModel

score(featureVectors)
train(trainingSet)
weights(featureNames)
class gnocis.sklearnModels.sequenceModelLasso(name, features, trainingSet, windowSize, windowStep, alpha=1.0, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)

Bases: gnocis.models.sequenceModel

The sequenceModelLasso class trains a Lasso model using scikit-learn.

Parameters:
  • name (str) – Model name.
  • features (features) – Feature set.
  • positives (sequences) – Positive training sequences.
  • negatives (sequences) – Negative training sequences.
  • windowSize (int) – Window size.
  • windowStep (int) – Window step size.
  • alpha (float) – Alpha parameter for Lasso.
getSequenceFeatureVector(seq)
getTrainer()
scoreWindow(seq)
class gnocis.sklearnModels.sequenceModelRF(name, features, trainingSet, windowSize, windowStep, nTrees=100, maxDepth=None, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)

Bases: gnocis.models.sequenceModel

The sequenceModelRF class trains a Random Forest (RF) model using scikit-learn.

Parameters:
  • name (str) – Model name.
  • features (features) – Feature set.
  • positives (sequences) – Positive training sequences.
  • negatives (sequences) – Negative training sequences.
  • windowSize (int) – Window size.
  • windowStep (int) – Window step size.
  • nTrees (int) – Number of trees.
  • maxDepth (int) – Maximum tree depth.
getSequenceFeatureVector(seq)
getTrainer()
scoreWindow(seq)
class gnocis.sklearnModels.sequenceModelSVM(name, features, trainingSet, windowSize, windowStep, kDegree, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)

Bases: gnocis.models.sequenceModel

The sequenceModelSVM class trains a Support Vector Machine (SVM) using scikit-learn.

Parameters:
  • name (str) – Model name.
  • features (features) – Feature set.
  • positives (sequences) – Positive training sequences.
  • negatives (sequences) – Negative training sequences.
  • windowSize (int) – Window size.
  • windowStep (int) – Window step size.
  • kDegree (float) – Kernel degree.
getSequenceFeatureVector(seq)
getTrainer()
scoreWindow(seq)

gnocis.sklearnModelsOpt module

class gnocis.sklearnModelsOpt.sequenceModelSVMOptimizedQuadratic(name, features, trainingSet, windowSize, windowStep, kDegree, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)

Bases: gnocis.models.sequenceModel

The sequenceModelSVMOptimizedQuadratic class trains a quadratic kernel Support Vector Machine (SVM) using scikit-learn. The kernel is applied using matrix multiplication.

Parameters:
  • name (str) – Model name.
  • features (features) – Feature set.
  • positives (sequences) – Positive training sequences.
  • negatives (sequences) – Negative training sequences.
  • windowSize (int) – Window size.
  • windowStep (int) – Window step size.
  • kDegree (float) – Kernel degree.
getSequenceFeatureVector(seq)
getTrainer()
scoreWindow(seq)
class gnocis.sklearnModelsOpt.sequenceModelSVMOptimizedQuadraticAutoScale(name, features, trainingSet, windowSize, windowStep, kDegree, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)

Bases: gnocis.models.sequenceModel

The sequenceModelSVMOptimizedQuadraticAutoScale class trains a quadratic kernel Support Vector Machine (SVM) using scikit-learn. The kernel is applied using matrix multiplication.

Parameters:
  • name (str) – Model name.
  • features (features) – Feature set.
  • positives (sequences) – Positive training sequences.
  • negatives (sequences) – Negative training sequences.
  • windowSize (int) – Window size.
  • windowStep (int) – Window step size.
  • kDegree (float) – Kernel degree.
getSequenceFeatureVector(seq)
getTrainer()
scoreWindow(seq)
class gnocis.sklearnModelsOpt.sequenceModelSVMOptimizedQuadraticCUDA(name, features, trainingSet, windowSize, windowStep, kDegree, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)

Bases: gnocis.models.sequenceModel

The sequenceModelSVMOptimizedQuadraticCUDA class trains a quadratic kernel Support Vector Machine (SVM) using scikit-learn. The kernel is applied using matrix multiplication with CUDA.

Parameters:
  • name (str) – Model name.
  • features (features) – Feature set.
  • positives (sequences) – Positive training sequences.
  • negatives (sequences) – Negative training sequences.
  • windowSize (int) – Window size.
  • windowStep (int) – Window step size.
  • kDegree (float) – Kernel degree.
getSequenceFeatureVector(seq)
getTrainer()
scoreWindow(seq)

gnocis.validation module

gnocis.validation.getAUC()

Calculates the Area Under the Curve (AUC) for an input curve.

Parameters:curve (list) – Curve to calculate area under.
Returns:Area Under the Curve (AUC).
Return type:float
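
One common way to compute the area under a curve given as a list of points is the trapezoidal rule, sketched below. That gnocis uses this rule is an assumption; the example represents points as (x, y) tuples rather than point2D objects, and the helper name trapezoid_auc is hypothetical.

```python
def trapezoid_auc(curve):
    """Area under a curve given as (x, y) points, by the trapezoidal rule."""
    pts = sorted(curve)
    return sum(
        (x1 - x0) * (y0 + y1) / 2.0
        for (x0, y0), (x1, y1) in zip(pts, pts[1:])
    )


diagonal = trapezoid_auc([(0.0, 0.0), (1.0, 1.0)])
perfect = trapezoid_auc([(0.0, 0.0), (0.0, 1.0), (1.0, 1.0)])
```

For a ROC curve, the diagonal corresponds to a random classifier and the step through (0, 1) to a perfect one.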
gnocis.validation.getConfusionMatrix()

Calculates confusion matrix based on input validation pairs, and an optional threshold.

Parameters:
  • vPos (list) – Positive validation pairs.
  • vNeg (list) – Negative validation pairs.
  • threshold (float) – Classification threshold.
Returns:

Dictionary with confusion matrix entries.

Return type:

dict
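
The thresholding step can be sketched with plain score lists. The dictionary keys, the strict comparison against the threshold, and the helper name confusion_matrix are assumptions of this example; the library operates on validation pairs instead.

```python
def confusion_matrix(pos_scores, neg_scores, threshold=0.0):
    """Count true/false positives and negatives at a score threshold."""
    tp = sum(1 for s in pos_scores if s > threshold)
    fp = sum(1 for s in neg_scores if s > threshold)
    return {
        "TP": tp,
        "FN": len(pos_scores) - tp,
        "FP": fp,
        "TN": len(neg_scores) - fp,
    }


cm = confusion_matrix([0.9, 0.4, 0.8], [0.1, 0.7], threshold=0.5)
```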

gnocis.validation.getConfusionMatrixStatistics()

Calculates confusion matrix statistics in an input confusion matrix dictionary.

Parameters:CM (dict) – Confusion matrix dictionary.
Returns:Dictionary with confusion matrix statistics.
Return type:dict
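
Typical statistics derived from a confusion matrix are sketched below. Which statistics gnocis reports, and under which keys, is not stated above; the key names and the helper name confusion_statistics are assumptions.

```python
def confusion_statistics(cm):
    """Derive common statistics from a dict with TP, FP, TN and FN counts."""
    tp, fp, tn, fn = cm["TP"], cm["FP"], cm["TN"], cm["FN"]

    def ratio(a, b):
        # Avoid division by zero for empty classes.
        return a / b if b else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


stats = confusion_statistics({"TP": 2, "FN": 1, "FP": 1, "TN": 1})
```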
gnocis.validation.getPRC()

Generates a Precision/Recall Curve (PRC).

Parameters:
  • vPos (list) – Positive validation pairs.
  • vNeg (list) – Negative validation pairs.
Returns:

List of 2D-points for curve

Return type:

list
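
A precision/recall curve can be built by ranking all scores and sweeping the threshold downwards. The sketch below uses plain score lists and (recall, precision) tuples, which are assumptions; it does not handle tied scores specially, and the helper name pr_curve is hypothetical.

```python
def pr_curve(pos_scores, neg_scores):
    """Return (recall, precision) points, one per ranked prediction."""
    ranked = sorted(
        [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores],
        key=lambda pair: -pair[0],
    )
    tp = fp = 0
    curve = []
    for _, label in ranked:
        tp += label
        fp += 1 - label
        curve.append((tp / len(pos_scores), tp / (tp + fp)))
    return curve


prc = pr_curve([0.9, 0.6], [0.7])
```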

gnocis.validation.getROC()

Generates a Receiver Operating Characteristic (ROC) curve.

Parameters:
  • vPos (list) – Positive validation pairs.
  • vNeg (list) – Negative validation pairs.
Returns:

List of 2D-points for curve

Return type:

list
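
A ROC curve follows from the same ranking idea, tracking the false positive rate against the true positive rate. This sketch uses plain score lists and (FPR, TPR) tuples, which are assumptions; tied scores are not grouped, and the helper name roc_curve is hypothetical.

```python
def roc_curve(pos_scores, neg_scores):
    """Return (false positive rate, true positive rate) points."""
    ranked = sorted(
        [(s, 1) for s in pos_scores] + [(s, 0) for s in neg_scores],
        key=lambda pair: -pair[0],
    )
    tp = fp = 0
    curve = [(0.0, 0.0)]
    for _, label in ranked:
        if label:
            tp += 1
        else:
            fp += 1
        curve.append((fp / len(neg_scores), tp / len(pos_scores)))
    return curve


roc = roc_curve([0.9, 0.8], [0.7, 0.1])
```

In this example every positive outranks every negative, so the curve rises to a true positive rate of 1 before the false positive rate leaves 0.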

class gnocis.validation.point2D

Bases: object

Two-dimensional point.

Parameters:
  • x (float) – X-coordinate.
  • y (float) – Y-coordinate.
  • rank (float, optional) – Rank.
rank
x
y
gnocis.validation.printValidationStatistics()

Prints model confusion matrix statistics from a dictionary.

Parameters:stats (dict) – Confusion matrix statistics dictionary.
class gnocis.validation.validationPair

Bases: object

Pair of score and binary class label.

Parameters:
  • score (float) – Score.
  • label – Binary class label.
label
name
score