gnocis¶

Gnocis is divided into a number of modules, with discrete functions.

gnocis.biomarkers module¶

class gnocis.biomarkers.biomarkers¶

Bases: object

The biomarkers class represents a set of biomarker regions. Multiple region sets, each marking a phenomenon of interest, can be registered in a biomarkers object.

Parameters:

name (str) – Name of the biomarker set.
regionSets (list, optional) – Region sets to register as biomarkers. Defaults to [].
positiveThreshold (int, optional) – Threshold for Highly BioMarker-Enriched (HBME) loci. Defaults to -1.
negativeThreshold (int, optional) – Threshold for Lowly BioMarker-Enriched (LBME) loci. Defaults to -1.

After constructing a biomarkers object, for a set of biomarkers BM, len(BM) gives the number of markers, BM['FACTOR_NAME'] gives the merged set of regions for factor FACTOR_NAME, and [ x for x in BM ] gives the list of merged regions per biomarker contained in BM.

HBMEs()¶

Gets highly biomarker-enriched regions of a set. This is determined either by an optional threshold argument, or, if unspecified, by the positiveThreshold member. The resulting set is the merged biomarker spectrum for enrichment levels greater than or equal to the threshold.

Parameters:	rs (regions) – Region set to extract enriched subset of.
Returns:	Highly BioMarker-Enriched regions (HBMEs).
Return type:	regions

LBMEs()¶

Gets lowly biomarker-enriched regions of a set. This is determined either by an optional threshold argument, or, if unspecified, by the negativeThreshold member. The resulting set is the merged biomarker spectrum for enrichment levels less than or equal to the threshold.

Parameters:	rs (regions) – Region set to extract enriched subset of.
Returns:	Lowly BioMarker-Enriched regions (LBMEs).
Return type:	regions

add()¶

Adds a biomarker to the set, by name and region set. The name of the region set is taken as the name of the biomarker.

Parameters:	rs (regions) – Region set to add as biomarker.

biomarkers¶

enrichmentSpectrum()¶

Returns the biomarker spectrum. This is a dictionary with numerical keys, containing subsets enriched in N biomarkers, for every valid N as key.

Parameters:	rs (regions) – Region set to compute the biomarker spectrum for.
Returns:	Biomarker spectrum
Return type:	dict
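
As a rough illustration of the spectrum described above, the sketch below groups regions by the number of biomarkers that overlap them. It works on plain tuples rather than Gnocis objects; the function names `overlaps` and `enrichment_spectrum` and the data layout are assumptions made for this example, not the library's implementation.

```python
from collections import defaultdict

def overlaps(a, b):
    """True if two (seq, start, end) regions with inclusive ends overlap."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def enrichment_spectrum(regions, biomarkers):
    """Group regions by the number of biomarkers that overlap them.

    regions: list of (seq, start, end) tuples.
    biomarkers: dict mapping a biomarker name to a list of such tuples.
    """
    spectrum = defaultdict(list)
    for r in regions:
        # Count each biomarker at most once per region.
        n = sum(any(overlaps(r, m) for m in marks) for marks in biomarkers.values())
        spectrum[n].append(r)
    return dict(spectrum)

biomarkers = {
    "factorA": [("chr1", 100, 200)],
    "factorB": [("chr1", 150, 300), ("chr2", 10, 50)],
}
regions = [("chr1", 120, 180), ("chr1", 250, 260), ("chr1", 500, 600)]
spectrum = enrichment_spectrum(regions, biomarkers)
```

Under this reading, selecting highly enriched regions amounts to taking the union of the entries whose key is at or above a threshold, and lowly enriched regions those at or below one.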

name¶

negativeThreshold¶

positiveThreshold¶

gnocis.common module¶

gnocis.common.CI()¶

Calculates a confidence interval of the mean for a set of values. By default, Gnocis calculates a 95% confidence interval with a normal distribution. It is recommended to replace this with an appropriate distribution depending on the analysis performed.

Parameters:	X (list) – Values.
Returns:	Confidence interval difference from the mean.
Return type:	float
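
A minimal sketch of a 95% confidence interval of the mean under a normal approximation is given below. The function name `ci95_normal` is hypothetical, and the use of the sample standard deviation is an assumption; the text above does not say which estimator Gnocis uses.

```python
import math
import statistics

def ci95_normal(values):
    """Half-width of a 95% confidence interval of the mean (normal approximation)."""
    n = len(values)
    # Standard error of the mean, using the sample standard deviation (assumption).
    standard_error = statistics.stdev(values) / math.sqrt(n)
    # 1.96 is the two-sided 95% quantile of the standard normal distribution.
    return 1.96 * standard_error

X = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
half_width = ci95_normal(X)
mean_value = statistics.mean(X)
interval = (mean_value - half_width, mean_value + half_width)
```

The returned value is the difference from the mean, so the interval is symmetric around it. For small samples, a t-distribution would give a wider and more appropriate interval.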

gnocis.common.KLdiv()¶

gnocis.common.SE()¶

gnocis.common.getFloat()¶

Helper function to get floating point values of the same precision as used for float by Cython.

Parameters:	v (float) – Value.

gnocis.common.getReverseComplementaryDNASequence()¶

Returns the reverse complement of a DNA sequence.

Parameters:	seq (str) – Sequence to get reverse complement of.
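
For reference, the operation can be sketched in plain Python as follows. This is an illustration of the concept rather than the Gnocis implementation, and the name `reverse_complement` is chosen for this example only.

```python
# Translation table for the four unambiguous nucleotides, in both cases.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement each nucleotide, then reverse the string."""
    return seq.translate(COMPLEMENT)[::-1]

rc = reverse_complement("GATTACA")
```

Characters outside A, C, G and T (such as N) pass through unchanged in this sketch.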

gnocis.common.mean()¶

gnocis.common.setConfidenceIntervalFunction()¶

Sets the function for calculating confidence intervals.

Parameters:	CIfunc (function) – Function that takes a set of values and outputs the confidence interval difference from the mean. A symmetric confidence interval is assumed.

gnocis.common.setSeed()¶

Sets the random seed.

Parameters:	seed (int) – Random seed.

gnocis.common.std()¶

gnocis.common.useSciPyConfidenceIntervals()¶

Sets the function for calculating confidence intervals.

Parameters:	CIfunc (function) – Function that takes a set of values and outputs the confidence interval difference from the mean. A symmetric confidence interval is assumed.

gnocis.features module¶

class gnocis.features.feature¶

Bases: object

The feature class is a base class for sequence features.

cachedSequence¶

cachedValue¶

get()¶

Extracts the feature value from a sequence.

Parameters:	seq (sequence) – The sequence to extract the feature value from.
Returns:	Feature value
Return type:	float

class gnocis.features.featureMotifOccurrenceFrequency¶

Bases: gnocis.features.feature

Occurrence frequency feature for a motif.

Parameters:	_motif (motif) – The motif that the feature is for.

get()¶

Extracts the feature value from a sequence.

Parameters:	seq (sequence) – The sequence to extract the feature value from.
Returns:	Feature value
Return type:	float

m¶

class gnocis.features.featurePREdictorMotifPairOccurrenceFrequency¶

Bases: gnocis.features.feature

PREdictor pair occurrence frequency feature for two motifs and a distance cutoff.

Parameters:

motifA (motif) – The first motif.
motifB (motif) – The second motif.
distanceCutoff (int) – Pairing cutoff distance. The cutoff applies to the number of nucleotides in between occurrences of motifA and motifB.
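
The pairing rule can be sketched as below, counting occurrence pairs whose gap is within the cutoff. The function `count_pairs`, the tuple layout, and the choice to skip overlapping occurrences are assumptions made for illustration; the feature itself may normalize or handle overlaps differently.

```python
def count_pairs(occ_a, occ_b, distance_cutoff):
    """Count (a, b) occurrence pairs separated by at most distance_cutoff nucleotides.

    Occurrences are (start, end) tuples with inclusive ends. Overlapping
    occurrences are skipped in this sketch (assumption).
    """
    pairs = 0
    for a_start, a_end in occ_a:
        for b_start, b_end in occ_b:
            if b_start > a_end:
                gap = b_start - a_end - 1  # nucleotides between a and b
            elif a_start > b_end:
                gap = a_start - b_end - 1  # nucleotides between b and a
            else:
                continue  # overlapping occurrences
            if gap <= distance_cutoff:
                pairs += 1
    return pairs

occ_a = [(10, 15), (300, 305)]
occ_b = [(40, 44), (100, 104)]
n_pairs = count_pairs(occ_a, occ_b, distance_cutoff=100)
```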

distCut¶

get()¶

Extracts the feature value from a sequence.

Parameters:	seq (sequence) – The sequence to extract the feature value from.
Returns:	Feature value
Return type:	float

mA¶

mB¶

class gnocis.features.featureScaler¶

Bases: gnocis.features.features

The featureScaler class scales an input feature set, based on a binary training set.

Parameters:	_features (features) – The feature set to scale. trainingSet (sequences) – Training sequences.

getAll()¶

Extracts the feature values from a sequence.

Parameters:	seq (sequence) – The sequence to extract the feature value from.
Returns:	Feature vector
Return type:	list

class gnocis.features.features¶

Bases: object

The features class represents a set of features.

Parameters:	name (str) – Name of the feature set. f (list, optional) – Features to include. Defaults to [].

After constructing a features object, for a set of features FS, len(FS) gives the number of features, FS[i] gives feature with index i, and [ x for x in FS ] gives the list of feature objects. Two features objects A and B can be added together with A+B, yielding a concatenated set of features.

diffsummary()¶

features¶

getAll()¶

Extracts the feature values from a sequence.

Parameters:	seq (sequence) – The sequence to extract the feature value from.
Returns:	Feature vector
Return type:	list

static kSpectrum()¶

Constructs the k-mer spectrum kernel, the occurrence frequencies of all motifs of length k, with no ambiguous positions.

Parameters:	k (int) – k-mer length.
Returns:	Feature set
Return type:	features
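
To make the spectrum concrete, the sketch below counts every unambiguous k-mer on the forward strand of a sequence. It is a plain-Python illustration with a hypothetical function name (`k_spectrum`), and it returns raw counts; how Gnocis normalizes counts into frequencies is not shown here.

```python
from itertools import product

def k_spectrum(seq, k):
    """Count occurrences of every unambiguous k-mer over A, C, G, T."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skips k-mers containing N or other ambiguity codes
            counts[kmer] += 1
    return counts

spectrum = k_spectrum("ACGTACGN", 2)
```

The feature space has 4^k dimensions, so it grows quickly with k.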

static kSpectrumGPS()¶

Constructs the k-mer spectrum graded position stranded kernel, the occurrence frequencies of all motifs of length k, with position and strand represented.

Parameters:	k (int) – k-mer length.
Returns:	Feature set
Return type:	features

static kSpectrumMM()¶

Constructs the k-mer spectrum mismatch kernel, the occurrence frequencies of all motifs of length k, with one allowed mismatch in an arbitrary position.

Parameters:	k (int) – k-mer length.
Returns:	Feature set
Return type:	features
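
The one-mismatch rule can be illustrated as follows: a window counts towards a k-mer if it equals that k-mer or differs from it in exactly one position. The helper names below are hypothetical, and the sketch only considers the forward strand.

```python
def mismatch_neighbours(kmer):
    """The k-mer itself plus every k-mer differing from it in exactly one position."""
    out = {kmer}
    for i, base in enumerate(kmer):
        for alt in "ACGT":
            if alt != base:
                out.add(kmer[:i] + alt + kmer[i + 1:])
    return out

def mismatch_count(seq, kmer):
    """Number of windows in seq matching kmer with at most one mismatch."""
    k = len(kmer)
    neighbours = mismatch_neighbours(kmer)
    return sum(seq[i:i + k] in neighbours for i in range(len(seq) - k + 1))

n = mismatch_count("ACGTTCGA", "ACG")
```

Here the windows ACG (exact) and TCG (one mismatch) both count towards the feature for ACG.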

static kSpectrumMMD()¶

Constructs the k-mer spectrum mismatch kernel, the occurrence frequencies of all motifs of length k, with one allowed mismatch in an arbitrary position. This formulation counts the base k-mer k times, and mutated k-mers one time.

Parameters:	k (int) – k-mer length.
Returns:	Feature set
Return type:	features

static motifPairSpectrum()¶

Constructs a feature set for the PREdictor motif pair occurrence spectrum, given a set of motifs.

Parameters:	motifs (motifs) – The motifs to use for the feature space.
Returns:	Feature set
Return type:	features

static motifSpectrum()¶

Constructs a feature set for the motif occurrence spectrum, given a set of motifs.

Parameters:	motifs (motifs) – The motifs to use for the feature space.
Returns:	Feature set
Return type:	features

name¶

nvalues¶

scale()¶

summary()¶

table()¶

values¶

class gnocis.features.kSpectrum¶

Bases: object

Feature set that extracts the occurrence frequencies of all unambiguous motifs of length k. Extraction of the entire spectrum is optimized by taking overlaps of motifs into account.

Parameters:	nspectrum (int) – Length of motifs, k, to use.

bitmask¶

cachedSequence¶

cachedSpectrum¶

features¶

kmerByIndex¶

nFeatures¶

nspectrum¶

class gnocis.features.kSpectrumFeatureGPS¶

Bases: gnocis.features.feature

get()¶

Extracts the feature value from a sequence.

Parameters:	seq (sequence) – The sequence to extract the feature value from.
Returns:	Feature value
Return type:	float

index¶

kmer¶

parent¶

section¶

class gnocis.features.kSpectrumGPS¶

Bases: object

Feature set that extracts the occurrence frequencies of all motifs of length k, position and strand represented. To represent position and strandedness, four features are generated per k-mer: one for each combination of strandedness and sequence end for proximity. Graded distance to each end of the sequence is used in order to represent position. Extraction of the entire spectrum is optimized by taking overlaps of motifs into account.

Parameters:	nspectrum (int) – Length of motifs, k, to use.

bitmask¶

cachedSequence¶

cachedSpectrum¶

features¶

kmerByIndex¶

nFeatures¶

nkmers¶

nspectrum¶

class gnocis.features.kSpectrumMM¶

Bases: object

Feature set that extracts the occurrence frequencies of all motifs of length k, with one mismatch allowed in an arbitrary position. Extraction of the entire spectrum is optimized by taking overlaps of motifs into account.

Parameters:	nspectrum (int) – Length of motifs, k, to use.

bitmask¶

cachedSequence¶

cachedSpectrum¶

features¶

kmerByIndex¶

nFeatures¶

nspectrum¶

class gnocis.features.kSpectrumMMD¶

Bases: object

Feature set that extracts the occurrence frequencies of all motifs of length k, with one mismatch allowed in an arbitrary position. Extraction of the entire spectrum is optimized by taking overlaps of motifs into account.

Parameters:	nspectrum (int) – Length of motifs, k, to use.

bitmask¶

cachedSequence¶

cachedSpectrum¶

features¶

kmerByIndex¶

nFeatures¶

nspectrum¶

class gnocis.features.kSpectrumMMDFeature¶

Bases: gnocis.features.feature

get()¶

Extracts the feature value from a sequence.

Parameters:	seq (sequence) – The sequence to extract the feature value from.
Returns:	Feature value
Return type:	float

index¶

kmer¶

parent¶

class gnocis.features.scaledFeature¶

Bases: gnocis.features.feature

The scaledFeature class scales and shifts an input feature.

Parameters:

_feature (feature) – The feature to scale.
vScale (float) – Value to scale by.
vSub (float) – Shifting.

feature¶

get()¶

Extracts the feature value from a sequence.

Parameters:	seq (sequence) – The sequence to extract the feature value from.
Returns:	Feature value
Return type:	float

vScale¶

vSub¶

gnocis.models module¶

class gnocis.models.CVModelPredictions¶

Bases: object

Helper class that represents cross-validated model predictions.

Parameters:	model (sequenceModel) – Model. labelPredictionCurves (dict) – Labelled prediction curves.

getStatTable¶

predictCore¶

regions¶

setCorePredictions¶

gnocis.models.createDummyPREdictorModel()¶

gnocis.models.crossvalidate()¶

class gnocis.models.crossvalidation¶

Bases: object

Helper class for cross-validations. Accepts binary training and validation sets and constructs cross-validation sets for a desired number of repeats. If a separate validation set is not given, the training set is used. The cross-validation set for each repeat contains numbers of training and test sequences determined by a training-to-testing sequence ratio, as well as a negative-per-positive test sequence ratio. When constructing the validation set, sequence identities are checked against the training set to avoid contamination (this will not work if sequences are cloned). The class holds models and validation statistics, integrates with the terminal and IPython for visualization of results, and stores training and testing data so that new models can be added.

Parameters:

models (list) – List of models to cross-validate.
tpos (sequences) – Positive training set.
tneg (sequences) – Negative training set.
vpos (sequences, optional) – Positive validation set.
vneg (sequences, optional) – Negative validation set.
repeats (int, optional) – Number of experimental repeats. Default = 20.
ratioTrainTest (float, optional) – Ratio of training to testing sequences. Default = 80%.
ratioNegPos (float, optional) – Ratio of validation negatives to positives. Default = 100.

addModel¶

calibrate¶

getAUCTable¶

getConfigurationTable¶

plotPRC¶

plotROC¶

predict¶

class gnocis.models.sequenceModel¶

Bases: object

The sequenceModel class is an abstract class for sequence models. A number of methods are implemented for machine learning and prediction with DNA sequences.

Parameters:	name (string) – Name of model. enableMultiprocessing (bool) – If True (default), multiprocessing is enabled.

calibrateGenomewidePrecision¶

Calibrates the model threshold for an expected precision genome-wide. Returns self to facilitate chaining of operations. Note, however, that this operation mutates the model object. A scaling factor can be applied to the genome with the 'factor' argument. If, for instance, the positive set has been divided in half for independent training and calibration, a factor of 0.5 can be used.

Parameters:

positives (sequences/sequenceStream) – Positive sequences.
genome (sequences/sequenceStream/str) – The genome.
factor (float, optional) – Scaling factor for positive set versus genome.
precision (float, optional) – The precision to approximate.

Returns:	The calibrated model (self)
Return type:	sequenceModel

getConfusionMatrix¶

Calculates and returns a confusion matrix for sets of positive and negative sequences.

Parameters:

seqs (sequences) – Sequences.
labelPositive (sequenceLabel) – Label of positives.
labelNegative (sequenceLabel) – Label of negatives.

Returns:	Confusion matrix
Return type:	dict
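
As an illustration of what such a matrix contains, the sketch below derives the four counts from per-sequence scores and a threshold. The function name and the dictionary keys (`TP`, `FP`, `TN`, `FN`) are assumptions for this example; the keys Gnocis returns are not stated above.

```python
def confusion_matrix(pos_scores, neg_scores, threshold):
    """Count true/false positives/negatives for scores classified at a threshold."""
    tp = sum(s >= threshold for s in pos_scores)  # positives predicted positive
    fn = len(pos_scores) - tp                     # positives missed
    fp = sum(s >= threshold for s in neg_scores)  # negatives predicted positive
    tn = len(neg_scores) - fp                     # negatives correctly rejected
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

cm = confusion_matrix([0.9, 0.7, 0.2], [0.8, 0.3, 0.1, 0.05], threshold=0.5)
precision = cm["TP"] / (cm["TP"] + cm["FP"])
recall = cm["TP"] / (cm["TP"] + cm["FN"])
```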

getOptimalAccuracyThreshold¶

Gets a threshold value optimized for accuracy on a set of positive and a set of negative sequences.

Parameters:

seqs (sequences) – Sequences.
labelPositive (sequenceLabel) – Label of positives.
labelNegative (sequenceLabel) – Label of negatives.

Returns:	Threshold value optimized for accuracy
Return type:	float

getPRC¶

Calculates and returns a Precision/Recall curve (PRC) for sets of positive and negative sequences.

Parameters:

seqs (sequences) – Sequences.
labelPositive (sequenceLabel) – Label of positives.
labelNegative (sequenceLabel) – Label of negatives.

Returns:	Precision/Recall curve (PRC)
Return type:	list

getPRCAUC¶

Calculates and returns the area under a Precision/Recall curve (PRCAUC) for sets of positive and negative sequences.

Parameters:

seqs (sequences) – Sequences.
labelPositive (sequenceLabel) – Label of positives.
labelNegative (sequenceLabel) – Label of negatives.

Returns:	Area under the Precision/Recall curve (PRCAUC)
Return type:	float

getPrecisionThreshold¶

Gets a threshold value for a desired precision on a set of positive and a set of negative sequences. Linear interpolation is used in order to achieve a close approximation.

Parameters:

seqs (sequences) – Sequences.
labelPositive (sequenceLabel) – Label of positives.
labelNegative (sequenceLabel) – Label of negatives.
wantedPrecision (float) – The precision to approximate.

Returns:	Threshold value for the desired precision
Return type:	float

getROC¶

Calculates and returns a Receiver Operating Characteristic (ROC) curve for sets of positive and negative sequences.

Parameters:

seqs (sequences) – Sequences.
labelPositive (sequenceLabel) – Label of positives.
labelNegative (sequenceLabel) – Label of negatives.

Returns:	Receiver Operating Characteristic (ROC) curve
Return type:	list

getROCAUC¶

Calculates and returns the area under a Receiver Operating Characteristic (ROCAUC) curve for sets of positive and negative sequences.

Parameters:

seqs (sequences) – Sequences.
labelPositive (sequenceLabel) – Label of positives.
labelNegative (sequenceLabel) – Label of negatives.

Returns:	Area under the Receiver Operating Characteristic (ROCAUC) curve
Return type:	float
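
The area under the ROC curve equals the probability that a randomly chosen positive scores higher than a randomly chosen negative. The sketch below computes it directly from that definition; it is a plain-Python illustration with a hypothetical function name, not the method Gnocis uses internally.

```python
def roc_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count half."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

auc = roc_auc([0.9, 0.7, 0.2], [0.8, 0.3, 0.1, 0.05])
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation. The pairwise loop is quadratic, so a rank-based computation would be preferable for large sets.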

getSequenceScores¶

Scores a set of sequences, returning the maximum window score for each. Multiprocessing is enabled by default, but can be disabled in the constructor.

Parameters:	seqs (sequences/sequenceStream) – Sequences to score.
Returns:	List of the maximum window score per sequence
Return type:	list

getValidationStatistics¶

Returns common model validation statistics (confusion matrix values; ROCAUC; PRCAUC).

Parameters:

seqs (sequences) – Sequences.
labelPositive (sequenceLabel) – Label of positives.
labelNegative (sequenceLabel) – Label of negatives.

Returns:	Validation statistics
Return type:	list

plotPRC¶

Plots a Precision/Recall curve and either displays it in an IPython session or saves it to a file.

Parameters:

seqs (sequences) – Sequences.
labelPositive (sequenceLabel) – Label of positives.
labelNegative (sequenceLabel) – Label of negatives.
figsize (tuple, optional) – Tuple of figure dimensions.
outpath (str, optional) – Path to save generated plot to. If not set, the plot will be output to IPython.
style (str, optional) – Matplotlib style to use.

predict¶

Applies the model using a sliding window across an input sequence stream or sequence set. Windows with a score >= self.threshold are predicted, and predicted windows are merged into non-overlapping predictions.

Parameters:	stream (sequences/sequenceStream/str) – Sequence material that the model is applied to for prediction.
Returns:	Set of non-overlapping (merged) predicted regions
Return type:	regions
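
The thresholding and merging step described above can be sketched as follows. The function `predict_regions`, the 0-based inclusive coordinates, and the rule that only overlapping windows are merged are assumptions made for this illustration.

```python
def predict_regions(scores_by_window, window_size, window_step, threshold):
    """Merge windows scoring >= threshold into non-overlapping (start, end) regions.

    scores_by_window[i] is the score of the window starting at i * window_step.
    Coordinates are 0-based with inclusive ends (assumption).
    """
    merged = []
    for i, score in enumerate(scores_by_window):
        if score < threshold:
            continue
        start = i * window_step
        end = start + window_size - 1
        if merged and start <= merged[-1][1]:
            # Overlaps the previous prediction: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

window_scores = [0.1, 0.8, 0.9, 0.2, 0.1, 0.1, 0.7, 0.3]
predicted = predict_regions(window_scores, window_size=500, window_step=250, threshold=0.5)
```

With a step smaller than the window size, consecutive predicted windows overlap and collapse into a single region, as the second and third windows do here.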

predictCore¶

Predicts cores. Not implemented by default. Models that use core prediction should implement this.

Parameters:	genome (genome) – Genome.
Returns:	Predicted core regions if implemented, or None otherwise
Return type:	regions

predictSequenceStreamCurves¶

Applies the model using a sliding window across an input sequence stream or sequence set. Windows with a score >= self.threshold are predicted, and predicted windows are merged into non-overlapping predictions.

Parameters:	stream (sequences/sequenceStream/str) – Sequence material that the model is applied to for prediction.
Returns:	Set of non-overlapping (merged) predicted regions
Return type:	regions

predictSequenceStreamRegions¶

Applies the model using a sliding window across an input sequence stream or sequence set. Windows with a score >= self.threshold are predicted, and predicted windows are merged into non-overlapping predictions.

Parameters:	stream (sequences/sequenceStream/str) – Sequence material that the model is applied to for prediction.
Returns:	Set of non-overlapping (merged) predicted regions
Return type:	regions

printTestStatistics¶

score¶

Scores a sequence or set of sequences.

Parameters:	target (sequence/sequences/sequenceStream) – Sequence(s) to score. asTable (bool) – If True, a table will be output. Otherwise, a list.
Returns:	Score or list of scores
Return type:	float/list

scoreSequence¶

Scores a single sequence. The score is determined by applying the model with a sliding window, and taking the maximum score.

Parameters:	seq (sequences/sequenceStream) – The sequence.
Returns:	Maximal window score
Return type:	float
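
A minimal sketch of maximum-window scoring is shown below, with a toy GC-count function standing in for a trained model. The names `score_sequence` and `gc_count`, and the handling of sequences shorter than one window, are assumptions for this example.

```python
def score_sequence(seq, score_window, window_size, window_step):
    """Slide a window across seq and return the highest window score."""
    # A sequence shorter than one window is scored as a single window (assumption).
    last_start = max(len(seq) - window_size, 0)
    starts = range(0, last_start + 1, window_step)
    return max(score_window(seq[s:s + window_size]) for s in starts)

def gc_count(window):
    """Toy scoring function: number of G and C nucleotides in the window."""
    return float(window.count("G") + window.count("C"))

best = score_sequence("ATATGCGCGCATAT", gc_count, window_size=6, window_step=2)
```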

train¶

Trains the model on a training set.

Parameters:	ts (sequences) – Training set.
Returns:	Trained model
Return type:	sequenceModel

class gnocis.models.sequenceModelDummy¶

Bases: gnocis.models.sequenceModel

This model takes a feature set, and scores input sequences by summing feature values, without weighting.

Parameters:

name (str) – Model name.
features (features) – The feature set.
windowSize (int) – Window size to use.
windowStep (int) – Window step size to use.

getTrainer¶

scoreWindow¶

class gnocis.models.sequenceModelLogOdds¶

Bases: gnocis.models.sequenceModel

Constructs a log-odds model based on an input feature set and binary training set, and scores input sequences by summing log-odds-weighted feature values.

Parameters:

name (str) – Model name.
features (features) – The feature set.
trainingSet (sequences) – Training sequences.
windowSize (int) – Window size to use.
windowStep (int) – Window step size to use.
labelPositive (sequenceLabel) – Positive training class label.
labelNegative (sequenceLabel) – Negative training class label.
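
One common way to build such a model is to weight each feature by the logarithm of the ratio of its mean value in the positive and negative training sets. The sketch below follows that idea; the function names and the pseudocount are assumptions, and the exact formula used by sequenceModelLogOdds is not given in this reference.

```python
import math

def log_odds_weights(pos_vectors, neg_vectors, pseudocount=1e-3):
    """One weight per feature: log of mean value in positives over negatives."""
    n_features = len(pos_vectors[0])
    weights = []
    for j in range(n_features):
        mean_pos = sum(v[j] for v in pos_vectors) / len(pos_vectors)
        mean_neg = sum(v[j] for v in neg_vectors) / len(neg_vectors)
        # The pseudocount avoids division by zero and log(0) (assumption).
        weights.append(math.log((mean_pos + pseudocount) / (mean_neg + pseudocount)))
    return weights

def score(vector, weights):
    """Sum of log-odds-weighted feature values."""
    return sum(w * x for w, x in zip(weights, vector))

pos = [[2.0, 0.0], [4.0, 1.0]]
neg = [[1.0, 1.0], [1.0, 3.0]]
weights = log_odds_weights(pos, neg)
s = score([3.0, 0.5], weights)
```

Features that are more frequent in positives receive positive weights, and features more frequent in negatives receive negative weights.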

getTrainer¶

scoreWindow¶

gnocis.models.setNCores()¶

Sets the number of cores to use for multiprocessing.

Parameters:	n (int) – The number of cores to use.

gnocis.models.setNThreadFetch()¶

gnocis.models.trainPREdictorModel()¶

gnocis.models.trainSinglePREdictorModel()¶

gnocis.motifs module¶

class gnocis.motifs.IUPACMotif¶

Bases: object

Represents an IUPAC motif.

Parameters:	name (str) – Name of the motif. motif (str) – Motif sequence, in IUPAC nucleotide codes (https://www.bioinformatics.org/sms/iupac.html).

c¶

cRC¶

cachedOcc¶

cachedSequence¶

find()¶

Finds the occurrences of the motif in a sequence.

Parameters:	seq (sequence) – The sequence. cache (bool) – Whether or not to cache.
Returns:	List of motif occurrences
Return type:	list
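
Matching an IUPAC motif on both strands can be sketched with regular expressions, as below. This is an illustrative stand-in rather than the Gnocis implementation: the function names, the 0-based inclusive coordinates, and the tuple layout are assumptions, and mismatches are not supported.

```python
import re

# Each IUPAC code mapped to the regex character class it stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}
# Complement of each IUPAC code (e.g. R = A/G complements to Y = T/C).
IUPAC_COMPLEMENT = str.maketrans("ACGTRYSWKMBDHVN", "TGCAYRSWMKVHDBN")

def iupac_to_regex(motif):
    return "".join(IUPAC[c] for c in motif)

def find_occurrences(motif, seq):
    """Return sorted (start, end, strand) tuples; strand is True for forward."""
    motif = motif.upper()
    rc_motif = motif.translate(IUPAC_COMPLEMENT)[::-1]
    hits = []
    for pattern, strand in ((motif, True), (rc_motif, False)):
        # A lookahead group lets overlapping occurrences be reported.
        regex = re.compile("(?=(" + iupac_to_regex(pattern) + "))")
        for m in regex.finditer(seq.upper()):
            hits.append((m.start(1), m.end(1) - 1, strand))
    return sorted(hits)

hits = find_occurrences("GCCAY", "AAGCCATTTATGGCAA")
```

A palindromic motif is reported once per strand at the same position in this sketch.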

motif¶

name¶

nmismatches¶

regexMotif¶

regexMotifRC¶

class gnocis.motifs.PWMMotif¶

Bases: object

Represents a Position Weight Matrix (PWM) motif.

Parameters:

name (str) – Name of the motif.
pwm (str) – Position Weight Matrix.
path (str) – Path to file that the PWM was loaded from.

PWMF¶

PWMRC¶

cachedOcc¶

cachedSequence¶

find()¶

Finds the occurrences of the motif in a sequence.

Parameters:	seq (sequence) – The sequence. cache (bool) – Whether or not to cache.
Returns:	List of motif occurrences
Return type:	list

getEValueThreshold()¶

Calibrates the threshold for a desired E-value.

Parameters:

bgModel (object) – Background model to use for calibration.
Evalue (float) – The desired E-value.
EvalueUnit (float) – E-value unit.
seedThreshold (float) – Seed threshold.
iterations (int) – Number of iterations.

Returns:	Threshold
Return type:	float

name¶

path¶

setPWM()¶

Sets the Position Weight Matrix.

Parameters:	pwm (list) – Position Weight Matrix.

threshold¶

gnocis.motifs.loadMEMEPWMDatabase()¶

class gnocis.motifs.motifOccurrence¶

Bases: object

Represents motif occurrences.

Parameters:

motif (object) – The motif.
seq (str) – Name of the sequence the motif occurred on.
start (int) – Occurrence start nucleotide.
end (int) – Occurrence end nucleotide.
strand (bool) – Occurrence strand. True is the +/forward strand, and False is the -/backward strand.

end¶

motif¶

seq¶

start¶

strand¶

class gnocis.motifs.motifs¶

Bases: object

The motifs class represents a set of motifs.

Parameters:	name (str) – Name of the motif set. motifs (list) – List of motifs.

Ringrose2003¶: Preset for generating the Ringrose et al. (2003) motif set.

Ringrose2003GTGT¶: Preset for generating the Ringrose et al. (2003) motif set, with GTGT added (as in Bredesen et al. 2019).

motifs¶

name¶

occFreq¶: Generates an occurrence frequency feature set

pairFreq¶: Generates a pair occurrence frequency feature set

table¶

gnocis.regions module¶

gnocis.regions.loadBED()¶

Loads regions from a BED (https://www.ensembl.org/info/website/upload/bed.html) file.

Parameters:	path (str) – Path to the input file.
Returns:	Loaded regions.
Return type:	regions

gnocis.regions.loadBEDGZ()¶

Loads regions from a gzipped BED (https://www.ensembl.org/info/website/upload/bed.html) file.

Parameters:	path (str) – Path to the input file.
Returns:	Loaded regions.
Return type:	regions

gnocis.regions.loadCoordinateList()¶

Loads regions from a coordinate list file (one region per line, with the format seq:start..end).

Parameters:	path (str) – Path to the input file.
Returns:	Loaded regions.
Return type:	regions
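
The seq:start..end line format can be parsed as in the sketch below. The regular expression and the function `parse_coordinate_line` are written for this illustration only; how Gnocis treats blank lines, comments, or strand annotations in such files is not specified here.

```python
import re

# seq name (no colon or whitespace), a colon, then start..end as integers.
COORD = re.compile(r"^(?P<seq>[^:\s]+):(?P<start>\d+)\.\.(?P<end>\d+)$")

def parse_coordinate_line(line):
    """Parse one 'seq:start..end' line into a (seq, start, end) tuple."""
    m = COORD.match(line.strip())
    if m is None:
        raise ValueError(f"Not a seq:start..end coordinate: {line!r}")
    return m.group("seq"), int(m.group("start")), int(m.group("end"))

lines = ["2L:1000..2500", "X:42..99\n"]
parsed = [parse_coordinate_line(line) for line in lines]
```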

gnocis.regions.loadGFF()¶

Loads regions from a General Feature Format (https://www.ensembl.org/info/website/upload/gff.html) file.

Parameters:	path (str) – Path to the input file.
Returns:	Loaded regions.
Return type:	regions

gnocis.regions.loadGFFGZ()¶

Loads regions from a gzipped General Feature Format (https://www.ensembl.org/info/website/upload/gff.html) file.

Parameters:	path (str) – Path to the input file.
Returns:	Loaded regions.
Return type:	regions

gnocis.regions.nucleotidePrecisionBarplot()¶

Generates a prediction nucleotide precision barplot.

Parameters:

predictionSets (list) – List of prediction region sets.
regionSets (list) – List of validation region sets.
figsize (tuple, optional) – Figure size.
outpath (str, optional) – Output path.
returnHTML (bool, optional) – If True, an HTML node will be returned.
fontsizeLabels (float, optional) – Size of label font.
fontsizeLegend (float, optional) – Size of legend font.
fontsizeAxis (float, optional) – Size of axis font.
style (str, optional) – Plot style to use.
showLegend (bool, optional) – Flag for whether or not to render legend.
bboxAnchorTo (tuple, optional) – Legend anchor point.
legendLoc (str, optional) – Legend location.
showValues (bool, optional) – If True, values will be plotted.

gnocis.regions.nucleotideRegionF1Barplot()¶

Generates a barplot of F1-measures, where the recall is based on region overlap sensitivity and precision is based on nucleotide precision.

Parameters:

predictionSets (list) – List of prediction region sets.
regionSets (list) – List of validation region sets.
figsize (tuple, optional) – Figure size.
outpath (str, optional) – Output path.
returnHTML (bool, optional) – If True, an HTML node will be returned.
fontsizeLabels (float, optional) – Size of label font.
fontsizeLegend (float, optional) – Size of legend font.
fontsizeAxis (float, optional) – Size of axis font.
style (str, optional) – Plot style to use.
showLegend (bool, optional) – Flag for whether or not to render legend.
bboxAnchorTo (tuple, optional) – Legend anchor point.
legendLoc (str, optional) – Legend location.
showValues (bool, optional) – If True, values will be plotted.
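
For reference, the F1-measure is the harmonic mean of precision and recall. The sketch below computes it from two given values; in the plot described above, the precision would be nucleotide-based and the recall overlap-based. The function name is chosen for this example.

```python
def f1_measure(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)

f1 = f1_measure(precision=0.5, recall=0.8)
```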

gnocis.regions.overlapPrecisionBarplot()¶

Generates a prediction overlap precision barplot.

Parameters:

predictionSets (list) – List of prediction region sets.
regionSets (list) – List of validation region sets.
figsize (tuple, optional) – Figure size.
outpath (str, optional) – Output path.
returnHTML (bool, optional) – If True, an HTML node will be returned.
fontsizeLabels (float, optional) – Size of label font.
fontsizeLegend (float, optional) – Size of legend font.
fontsizeAxis (float, optional) – Size of axis font.
style (str, optional) – Plot style to use.
showLegend (bool, optional) – Flag for whether or not to render legend.
bboxAnchorTo (tuple, optional) – Legend anchor point.
legendLoc (str, optional) – Legend location.
showValues (bool, optional) – If True, values will be plotted.

gnocis.regions.overlapSensitivityBarplot()¶

Generates a prediction overlap sensitivity barplot.

Parameters:

predictionSets (list) – List of prediction region sets.
regionSets (list) – List of validation region sets.
figsize (tuple, optional) – Figure size.
outpath (str, optional) – Output path.
returnHTML (bool, optional) – If True, an HTML node will be returned.
fontsizeLabels (float, optional) – Size of label font.
fontsizeLegend (float, optional) – Size of legend font.
fontsizeAxis (float, optional) – Size of axis font.
style (str, optional) – Plot style to use.
showLegend (bool, optional) – Flag for whether or not to render legend.
bboxAnchorTo (tuple, optional) – Legend anchor point.
legendLoc (str, optional) – Legend location.
showValues (bool, optional) – If True, values will be plotted.

gnocis.regions.predictionBarplot()¶

Generates a prediction overlap precision barplot.

Parameters:

predictionSets (list) – List of prediction region sets.
figsize (tuple, optional) – Figure size.
outpath (str, optional) – Output path.
returnHTML (bool, optional) – If True, an HTML node will be returned.
fontsizeLabels (float, optional) – Size of label font.
fontsizeLegend (float, optional) – Size of legend font.
fontsizeAxis (float, optional) – Size of axis font.
style (str, optional) – Plot style to use.
showLegend (bool, optional) – Flag for whether or not to render legend.
bboxAnchorTo (tuple, optional) – Legend anchor point.
legendLoc (str, optional) – Legend location.
showValues (bool, optional) – If True, values will be plotted.

class gnocis.regions.region¶

Bases: object

The region class represents a region within a sequence. A region has a sequence name, a start and end, and a strandedness, with optional additional annotation.

Parameters:

seq (str) – Sequence name.
start (int) – Start coordinate.
end (int) – End coordinate (inclusive).
strand (bool, optional) – Strandedness. True for the + (forward) strand (default), and False for the - (reverse) strand.
score (float, optional) – Score.
source (str, optional) – Source.
feature (str, optional) – Feature.
group (str, optional) – Group.
dropChr (bool, optional) – True if a leading chr in seq is to be dropped (default).

The length of a region r can be calculated with len(r).

bstr()¶

Returns:	Returns the region formatted as a string: seq:start..end (strand).
Return type:	str

center()¶

end¶

ext¶

feature¶

group¶

markers¶

score¶

seq¶

singleton()¶

source¶

start¶

strand¶

gnocis.regions.regionBarplot()¶

Generates a barplot of a measure (lambda) for predicted regions.

Parameters:

mainRegions (list) – List of main region sets.
measure (function) – Function to compute measure.
measureName (str) – Name of measure.
testRegions (list, optional) – List of test region sets, to check against (or None if not relevant).
figsize (tuple, optional) – Figure size.
outpath (str, optional) – Output path.
returnHTML (bool, optional) – If True, an HTML node will be returned.
fontsizeLabels (float, optional) – Size of label font.
fontsizeLegend (float, optional) – Size of legend font.
fontsizeAxis (float, optional) – Size of axis font.
style (str, optional) – Plot style to use.
showLegend (bool, optional) – Flag for whether or not to render legend.
bboxAnchorTo (tuple, optional) – Legend anchor point.
legendLoc (str, optional) – Legend location.
showValues (bool, optional) – If True, values will be plotted.
isPercent (bool, optional) – If True, values will be plotted as percentages.

class gnocis.regions.regions¶

Bases: object

The regions class represents a set of regions.

Parameters:

name (str) – Name for the region set.
rgns (list) – List of regions to include.
initialSort (bool) – Whether or not to sort the regions (default is True).

For a region set RS, len(RS) gives the number of regions, RS[i] gives the i'th region in the set, and [ r for r in RS ] lists the regions in the set. For sets A and B, A+B gives a sorted region set of regions from both sets, A|B gives the merged set, A&B gives the intersection, and A^B gives the regions of A with B excluded.

bstr()¶

Returns:	Returns a concatenated list of regions formatted as seq:start..end (strand).
Return type:	str

deltaResize()¶

Gets a set of resized regions. The delta size is subtracted from the start and added to the end.

Parameters:	sizeDelta – Desired region size change.
Returns:	Set of resized regions.
Return type:	regions

dummy()¶

Generates a dummy region set, with the same region lengths but random positions.

Parameters:	genome (genome) – Genome. useSeq (list) – Chromosomes to focus on.
Returns:	Dummy region set.
Return type:	regions

exclusion()¶

Gets the regions of this set with regions of another set excluded.

Parameters:	other (regions) – Other region set.
Returns:	Excluded region set.
Return type:	regions

expected()¶

Calculates an expected statistic by random dummy set creation and averaging.

Parameters:	genome (genome) – Genome. statfun (Function) – Function to apply to region sets that returns desired statistic. useSeq (list) – Chromosomes to focus on. repeats (int) – Number of repeats for calculating statistic.
Returns:	Average statistic.
Return type:	float
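
The idea behind dummy() and expected() can be sketched without the library: place every region at a random position while keeping its length, apply a statistic function, and average over repeats. All names below (dummy_intervals, expected_statistic, seq_lengths) are hypothetical, coordinates are assumed half-open, and the chromosome is picked uniformly, which is a simplification rather than a description of how gnocis samples.

```python
import random


def dummy_intervals(intervals, seq_lengths, rng):
    """Place each (seq, start, end) interval at a random position, keeping its length."""
    dummies = []
    for _seq, start, end in intervals:
        length = end - start
        # Simplification: pick the target sequence uniformly at random.
        seq = rng.choice(sorted(seq_lengths))
        new_start = rng.randint(0, seq_lengths[seq] - length)
        dummies.append((seq, new_start, new_start + length))
    return dummies


def expected_statistic(intervals, seq_lengths, statfun, repeats=100, seed=0):
    """Average statfun over `repeats` independently generated dummy sets."""
    rng = random.Random(seed)
    values = [statfun(dummy_intervals(intervals, seq_lengths, rng))
              for _ in range(repeats)]
    return sum(values) / len(values)


seq_lengths = {"chrA": 10_000, "chrB": 5_000}
regions_in = [("chrA", 100, 600), ("chrB", 50, 250)]
# Total covered length is preserved by construction, so its expectation is exact.
mean_total_length = expected_statistic(
    regions_in, seq_lengths, lambda rs: sum(e - s for _, s, e in rs), repeats=20)
```

In practice the statistic would be something that depends on position, such as overlap with an annotation.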

extract()¶

Extracts region sequences from a sequence set, sequence stream, or a file by path.

Parameters:	src (sequences/sequenceStream/str) – Sequence set, sequence stream, or sequence file path.
Returns:	Extracted region sequences.
Return type:	sequences

extractSequences()¶

Extracts region sequences from a sequence set or stream.

Parameters:	seqs (sequences/sequenceStream) – Sequences/sequence stream to extract region sequences from.
Returns:	Extracted region sequences.
Return type:	sequences

extractSequencesFrom2bit()¶

Extracts region sequences from a 2bit file.

Parameters:	path (str) – Path to 2bit file.
Returns:	Extracted region sequences.
Return type:	sequences

extractSequencesFromFASTA()¶

Extracts region sequences from a FASTA file.

Parameters:	path (str) – Path to FASTA file.
Returns:	Extracted region sequences.
Return type:	sequences

filter()¶

Returns a region set filtered with the input lambda function.

Parameters:	fltName (str) – Name of the filter. Appended to the region set name. flt (lambda) – Lambda function to be applied to every region, returning True if the region is to be included, or otherwise False.
Returns:	Region set filtered with the lambda function.
Return type:	regions

flatten()¶

Returns a flattened set of regions, with internally overlapping regions merged.

Returns:	Flattened region set.
Return type:	regions
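
As an illustration of what flattening means, the following library-independent sketch merges overlapping intervals that lie on the same sequence. The (seq, start, end) tuples with half-open coordinates and the name flatten_intervals are assumptions for this example, not the gnocis implementation.

```python
def flatten_intervals(intervals):
    """Merge overlapping (seq, start, end) intervals; coordinates are half-open."""
    merged = []
    # Sorting by sequence name and start makes overlapping intervals adjacent.
    for seq, start, end in sorted(intervals):
        if merged and merged[-1][0] == seq and start < merged[-1][2]:
            # Overlaps the previous interval: extend its end if needed.
            prev_seq, prev_start, prev_end = merged[-1]
            merged[-1] = (prev_seq, prev_start, max(prev_end, end))
        else:
            merged.append((seq, start, end))
    return merged


example = [("chr2L", 100, 200), ("chr2L", 150, 300),
           ("chr2L", 400, 500), ("chrX", 100, 200)]
flattened = flatten_intervals(example)
```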

intersection()¶

Gets the intersection of regions in this set and another.

Parameters:	other (regions) – Other region set.
Returns:	Intersection region set.
Return type:	regions
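
A minimal sketch of interval intersection, assuming (seq, start, end) tuples with half-open coordinates. The function name intersect is hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed; it does not describe how gnocis computes intersections.

```python
def intersect(a, b):
    """Return the parts of intervals in `a` that are also covered by intervals in `b`."""
    result = []
    for seq_a, start_a, end_a in a:
        for seq_b, start_b, end_b in b:
            if seq_a != seq_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # non-empty overlap
                result.append((seq_a, start, end))
    return sorted(result)


set_a = [("chr1", 0, 100), ("chr1", 200, 300)]
set_b = [("chr1", 50, 250), ("chr2", 0, 1000)]
common = intersect(set_a, set_b)
```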

merge()¶

Gets the merged set of regions in this set and another.

Parameters:	other (regions) – Other region set.
Returns:	Merged region set.
Return type:	regions

name¶

nonOverlap()¶

Gets the regions in this set that do not overlap with another.

Parameters:	other (regions) – Other region set.
Returns:	Set of non-overlapping regions.
Return type:	regions

nucleotidePrecision()¶

Calculates the nucleotide precision to another set.

Parameters:	other (regions) – Other region set.
Returns:	Nucleotide precision.
Return type:	float

overlap()¶

Gets the regions in this set that overlap with another.

Parameters:	other (regions) – Other region set.
Returns:	Set of overlapping regions.
Return type:	regions

overlapPrecision()¶

Calculates the overlap precision to another set.

Parameters:	other (regions) – Other region set.
Returns:	Overlap precision.
Return type:	float

overlapSensitivity()¶

Calculates the overlap sensitivity to another set.

Parameters:	other (regions) – Other region set.
Returns:	Overlap sensitivity.
Return type:	float

printStatistics()¶: Outputs basic statistics.

randomSplit()¶

Parameters:	ratio (float) – Ratio for the split, between 0 and 1. For the returned tuple, (A, B), a ratio of 0 means all the regions are in B, and a ratio of 1 means all the regions are in A.
Returns:	Returns a random split of the regions into two independent sets, with the given ratio.
Return type:	regions
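
The ratio semantics can be sketched as follows. This is an illustrative stand-in, assuming a shuffle followed by a cut; the name random_split and the seed argument are hypothetical, and gnocis may assign items to the two sets differently.

```python
import random


def random_split(items, ratio, seed=None):
    """Shuffle items and return (A, B), with roughly ratio * len(items) items in A."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)
    cut = int(round(ratio * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


a, b = random_split(range(10), 0.8, seed=1)
```

With a ratio of 0 the first set is empty, and with a ratio of 1 the second set is empty.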

recenter()¶

Gets a set of randomly recentered regions. If the target size is smaller than a given region, a region of the desired size is randomly placed within the region. If the desired size is larger, a region of the desired size is placed with equal center to the center of the region.

Parameters:	size (int) – Desired region size.
Returns:	Set of randomly recentered regions.
Return type:	regions
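
The two cases described above can be sketched for a single region. The function below is a hypothetical illustration with half-open coordinates and integer rounding of the center; it is not the gnocis implementation.

```python
import random


def recenter(start, end, size, rng):
    """Return a (start, end) pair of the given size, following the two cases above."""
    length = end - start
    if size <= length:
        # Target is smaller: place it at a random position inside the region.
        new_start = rng.randint(start, end - size)
    else:
        # Target is larger: keep the same center as the original region.
        center = (start + end) // 2
        new_start = center - size // 2
    return new_start, new_start + size


rng = random.Random(0)
inner = recenter(100, 200, 40, rng)    # lies somewhere within [100, 200)
outer = recenter(100, 200, 300, rng)   # centered on position 150
```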

regions¶

rename()¶

Parameters:	newname (str) – New name.
Returns:	Renamed region set.
Return type:	regions

sample()¶

Parameters:	n (int) – Number of regions to pick. If 0, the same number will be selected as are in the full set.
Returns:	Returns a random sample of the regions of size n. Regions are selected with replacement, and the same region can occur multiple times.
Return type:	regions
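
Sampling with replacement, including the n = 0 convention described above, can be sketched as follows. The name sample_with_replacement and the seed argument are assumptions for this example.

```python
import random


def sample_with_replacement(items, n=0, seed=None):
    """Draw n items with replacement; n == 0 means 'as many as in the full set'."""
    rng = random.Random(seed)
    items = list(items)
    if n == 0:
        n = len(items)
    return [rng.choice(items) for _ in range(n)]


picked = sample_with_replacement(["r1", "r2", "r3"], n=5, seed=3)
same_size = sample_with_replacement(["r1", "r2", "r3"], seed=3)
```

Because items are drawn independently, the result can be larger than the input and can contain duplicates.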

saveBED()¶

Saves the regions to a BED (https://www.ensembl.org/info/website/upload/bed.html) file.

Parameters:	path (str) – Output file path.

saveCoordinateList()¶

Saves the regions to a coordinate list file (a coordinate per line, with the format seq:start..end).

Parameters:	path (str) – Output file path.

saveGFF()¶

Saves the regions to a General Feature Format (https://www.ensembl.org/info/website/upload/gff.html) file.

Parameters:	path (str) – Output file path.

sort()¶: Sorts the set, first by start coordinate, then by sequence.

table()¶

gnocis.sequences module¶

class gnocis.sequences.IID¶

Bases: gnocis.sequences.sequenceGenerator

The IID class trains an i.i.d. model for generating sequences.

Parameters:	trainingSequences (sequences/sequenceStream/str) – Training sequences. pseudoCounts (bool, optional) – Whether or not to use pseudocounts. Default is True. addComplements (bool, optional) – Whether or not to add complements. Default is True.

addComplements¶

generate()¶

Generates a sequence.

Parameters:	length (int) – Length of the sequence to generate.
Returns:	Random sequence.
Return type:	sequence

nGenerated¶

ntDistribution¶

prepared¶

pseudoCounts¶

spectrum¶

trainingSequences¶

class gnocis.sequences.MarkovChain¶

Bases: gnocis.sequences.sequenceGenerator

The MarkovChain class trains a Markov chain for generating sequences.

Parameters:	trainingSequences (sequences/sequenceStream/str) – Training sequences. degree (int, optional) – Markov chain degree. Default is 4. pseudoCounts (bool, optional) – Whether or not to use pseudocounts. Default is True. addReverseComplements (bool, optional) – Whether or not to add reverse complements. Default is True.
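
To make the role of the degree and the pseudocounts concrete, here is a small, library-independent sketch of a Markov chain sequence generator. Everything in it is an assumption for illustration: the function names, the uniform choice of the initial nucleotides, and the omission of reverse complements. With degree 0 it reduces to an i.i.d. model of the kind trained by the IID class.

```python
import random
from collections import defaultdict

ALPHABET = "ACGT"


def train_markov_chain(training_sequences, degree=2, pseudocount=1):
    """Count which nucleotide follows each context of `degree` nucleotides."""
    # Every context starts from the pseudocount (must be positive), so that
    # unseen contexts can still be sampled from.
    counts = defaultdict(lambda: {nt: pseudocount for nt in ALPHABET})
    for seq in training_sequences:
        for i in range(len(seq) - degree):
            context, nxt = seq[i:i + degree], seq[i + degree]
            if nxt in ALPHABET:
                counts[context][nxt] += 1
    return counts


def generate(counts, degree, length, rng):
    """Generate a sequence by repeatedly sampling the next nucleotide."""
    # Simplification: the first `degree` nucleotides are drawn uniformly.
    seq = "".join(rng.choice(ALPHABET) for _ in range(min(degree, length)))
    while len(seq) < length:
        weights = counts[seq[len(seq) - degree:]]
        seq += rng.choices(ALPHABET, weights=[weights[nt] for nt in ALPHABET])[0]
    return seq


model = train_markov_chain(["ACGTACGTACGT"], degree=1)
generated = generate(model, 1, 50, random.Random(0))
iid_model = train_markov_chain(["AAAA"], degree=0)
```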

addReverseComplements¶

comparableSpectrum¶

degree¶

generate()¶

Generates a sequence.

Parameters:	length (int) – Length of the sequence to generate.
Returns:	Random sequence.
Return type:	sequence

initialDistribution¶

nGenerated¶

prepared¶

probspectrum¶

pseudoCounts¶

spectrum¶

trainingSequences¶

gnocis.sequences.getSequenceStreamFromPath()¶

Creates a sequence stream from a path, deducing the file type.

Parameters:	path (str) – Path to the input file. wantBlockSize (int, optional) – Desired block size. spacePrune (bool, optional) – Whether or not to prune sequence name spaces. dropChr (bool, optional) – Whether or not to prune sequence name “chr”-prefixes. restrictToSequences (list, optional) – List of sequence names to restrict to/focus on.
Returns:	Generated sequence stream.
Return type:	sequenceStream

gnocis.sequences.getSequenceWindowRegions()¶

Generates sequence window regions for a sequence set or stream.

Parameters:	stream – Input sequences. windowSize (int) – Window size. windowStep (int) – Window step size.
Returns:	Sequence window regions.
Return type:	regions

gnocis.sequences.loadFASTA()¶

Loads sequences from a FASTA file.

Parameters:	path (str) – Path to the input file.
Returns:	Loaded sequences.
Return type:	sequences

gnocis.sequences.loadFASTAGZ()¶

Loads sequences from a gzipped FASTA file.

Parameters:	path (str) – Path to the input file.
Returns:	Loaded sequences.
Return type:	sequences

class gnocis.sequences.sequence¶

Bases: object

The sequence class represents a DNA sequence.

Parameters:	name (str) – Name of the sequence. seq (str) – Sequence. path (str, optional) – Source file path. sourceRegion (region, optional) – Source region. annotation (str, optional) – Annotation.

For a sequence instance S, len(S) gives the sequence length. [ nt for nt in S ] lists the nucleotides. For two sequences A and B, A+B gives the concatenated sequence.

annotation¶

label()¶

Parameters:	label (sequenceLabel) – Label to add.
Returns:	Returns a labelled sequence.
Return type:	sequence

labels¶

name¶

path¶

seq¶

sourceRegion¶

windows()¶

Parameters:	size (int) – Window size. step (int) – Window step size. includeCroppedEnds (bool, optional) – Whether or not to include cropped windows on ends of sequence. Default is False.
Returns:	Returns a sequence set of sliding window sequences over the sequence.
Return type:	sequences
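
Sliding windows over a single sequence can be sketched as below. The function name and the interpretation of cropped ends (trailing windows shorter than the window size) are assumptions for this example and may differ from the behavior of windows().

```python
def sliding_windows(seq, size, step, include_cropped_ends=False):
    """Return (start, window) pairs for windows of `size`, advancing by `step`."""
    windows = []
    for start in range(0, len(seq), step):
        window = seq[start:start + size]
        if len(window) < size and not include_cropped_ends:
            break  # remaining windows would all be cropped
        windows.append((start, window))
    return windows


full_only = sliding_windows("ACGTACGTAC", size=4, step=3)
with_cropped = sliding_windows("ACGTACGTAC", size=4, step=3,
                               include_cropped_ends=True)
```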

class gnocis.sequences.sequenceGenerator¶

Bases: object

The sequenceGenerator class is an abstract class for sequence generators.

generateFASTA()¶

Generates sequences and saves them to a FASTA file.

Parameters:	path (str) – Output path. n (int) – Number of sequences to generate. length (int) – Length of each sequence to generate.

generateSet()¶

Parameters:	n (int) – Number of sequences to generate. length (int) – Length of each sequence to generate.
Returns:	Sequence set of n randomly generated sequences, each of length length.
Return type:	sequences

generateStream()¶

Parameters:	n (int) – Number of sequences to generate. length (int) – Length of each sequence to generate. wantBlockSize (int, optional) – Desired block size.
Returns:	Sequence stream of n randomly generated sequences, each of length length.
Return type:	sequenceStream

class gnocis.sequences.sequenceLabel¶

Bases: object

The sequenceLabel class represents a label that can be assigned to sequences to be used for model training or testing, such as “positive” or “negative”.

Parameters:	name (str) – Name of the sequence label. value (float) – Value to represent the label.

name¶

value¶

class gnocis.sequences.sequenceStream¶

Bases: object

The sequenceStream class represents sequence streams.

fetch()¶

Fetches n sequences (or fewer, if fewer than n remain).

Parameters:	n (int) – Number of sequences to fetch. maxnt (int) – Maximum number of nucleotides to fetch.

Returns:	Yields sequences.
Return type:	sequences (yield)

sequenceLengths()¶

Returns:	Returns a dictionary of the lengths of all sequences in the stream.
Return type:	dict

streamFullSequences()¶

Streams and yields whole sequences.

Returns:	Yields whole sequences.
Return type:	sequence (yield)

windowRegions()¶

Generates and returns a set of sliding window regions.

Parameters:	size (int) – Window size. step (int) – Window step size.
Returns:	Region set of all sliding window regions across all sequences in the stream.
Return type:	regions

windows()¶

Generates and returns a set of sliding window sequences.

Parameters:	size (int) – Window size. step (int) – Window step size.
Returns:	Sequence set of all sliding window sequences across all sequences in the stream.
Return type:	sequences

class gnocis.sequences.sequenceStream2bit¶

Bases: gnocis.sequences.sequenceStream

Streams a 2bit file.

Parameters:	path (str) – Path to the input file. wantBlockSize (int) – Desired block size. spacePrune (bool) – Whether or not to prune sequence name spaces. dropChr (bool) – Whether or not to prune sequence name “chr”-prefixes. restrictToSequences (list) – List of sequence names to restrict to/focus on.

class gnocis.sequences.sequenceStreamFASTA¶

Bases: gnocis.sequences.sequenceStream

Streams a FASTA file.

Parameters:

path (str) – Path to the input file.
wantBlockSize (int) – Desired block size.
spacePrune (bool) – Whether or not to prune sequence name spaces.
dropChr (bool) – Whether or not to prune sequence name “chr”-prefixes.
restrictToSequences (list) – List of sequence names to restrict to/focus on.
isGZ (bool) – Whether or not the input file is GZipped.

class gnocis.sequences.sequences¶

Bases: object

The sequences class represents a set of DNA sequences.

Parameters:	name (str) – Name of the sequence set. seq (list) – List of sequences to include.

For a sequence set S, len(S) gives the number of sequences, and [ s for s in S ] lists the sequences. For two sequence sets A and B, A+B gives the set containing all sequences from both sets.

label()¶

Adds a label to sequences.

Parameters:	label (sequenceLabel) – Label to add to sequences.
Returns:	Sequences with label added.
Return type:	sequences

labels()¶

Gets labels for sequences.

Returns:	Set of labels.
Return type:	set

name¶

printStatistics()¶: Outputs basic statistics.

randomSplit()¶

Parameters:	ratio (float) – Ratio for the split, between 0 and 1. For the returned tuple, (A, B), a ratio of 0 means all the sequences are in B, and a ratio of 1 means all the sequences are in A.
Returns:	Returns a random split of the sequences into two independent sets, with the given ratio.
Return type:	sequences

rename()¶

Parameters:	newname (str) – New name.
Returns:	Renamed sequence set.
Return type:	sequences

sample()¶

Parameters:	n (int) – Number of sequences to pick. If 0, the same number will be selected as are in the full set.
Returns:	Returns a random sample of the sequences of size n. Sequences are selected with replacement, and the same sequence can occur multiple times.
Return type:	sequences

saveFASTA()¶

Saves the sequences to a FASTA file.

Parameters:	path (str) – Output file path.

sequenceLengths()¶

Returns:	Returns a dictionary of the lengths of all sequences in the set.
Return type:	dict

sequences¶

sourceRegions()¶

Returns:	Region set of source regions for all sequences in the set.
Return type:	regions

split()¶

Parameters:	n (int) – Number of sets to split into.
Returns:	Returns a list of n independent splits of the sequence set, without shuffling.
Return type:	list

table()¶

windowRegions()¶

Generates and returns a set of sliding window regions.

Parameters:	size (int) – Window size. step (int) – Window step size.
Returns:	Region set of all sliding window regions across all sequences in the set.
Return type:	regions

windows()¶

Generates and returns a set of sliding window sequences.

Parameters:	size (int) – Window size. step (int) – Window step size.
Returns:	Sequence set of all sliding window sequences across all sequences in the set.
Return type:	sequences

withLabel()¶

Extracts sequences with the given label or labels.

Parameters:	labels (list, sequenceLabel) – List of labels, or a single label, to extract sequences with. ensureAllLabels (bool) – If True, an exception is raised if no sequences were found for a label. Default is True.
Returns:	Sequences with the given label, or a list of sequence sets if multiple labels are given.
Return type:	sequences, list

If multiple labels are given, a list of sequences instances is returned, in the same order as the labels.

gnocis.sequences.stream2bit()¶

Streams a 2bit file.

Parameters:	path (str) – Path to the input file. wantBlockSize (int, optional) – Desired block size. spacePrune (bool, optional) – Whether or not to prune sequence name spaces. dropChr (bool, optional) – Whether or not to prune sequence name “chr”-prefixes. restrictToSequences (list, optional) – List of sequence names to restrict to/focus on.
Returns:	Generated sequence stream.
Return type:	sequenceStream

gnocis.sequences.streamFASTA()¶

Streams a FASTA file.

Parameters:	path (str) – Path to the input file. wantBlockSize (int, optional) – Desired block size. spacePrune (bool, optional) – Whether or not to prune sequence name spaces. dropChr (bool, optional) – Whether or not to prune sequence name “chr”-prefixes. restrictToSequences (list, optional) – List of sequence names to restrict to/focus on.
Returns:	Generated sequence stream.
Return type:	sequenceStream

gnocis.sequences.streamFASTAGZ()¶

Streams a gzipped FASTA file.

Parameters:	path (str) – Path to the input file. wantBlockSize (int, optional) – Desired block size. spacePrune (bool, optional) – Whether or not to prune sequence name spaces. dropChr (bool, optional) – Whether or not to prune sequence name “chr”-prefixes. restrictToSequences (list, optional) – List of sequence names to restrict to/focus on.
Returns:	Generated sequence stream.
Return type:	sequenceStream

gnocis.sequences.streamSequenceWindows()¶

Yields sequence windows for a sequence set, sequence stream or a path to a sequence file.

Parameters:	src (sequences/sequenceStream/str) – Input sequences. windowSize (int) – Window size. windowStep (int) – Window step size.
Returns:	Sequence windows.
Return type:	sequence (yield)

gnocis.sklearnModels module¶

class gnocis.sklearnModels.RF(model=None, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>, nTrees=100, maxDepth=None)[source]¶

Bases: gnocis.featurenetwork.baseModel

score(featureVectors)[source]¶

train(trainingSet)[source]¶

weights(featureNames)[source]¶

class gnocis.sklearnModels.SVM(model=None, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>, kDegree=1, C=4)[source]¶

Bases: gnocis.featurenetwork.baseModel

score(featureVectors)[source]¶

train(trainingSet)[source]¶

weights(featureNames)[source]¶

class gnocis.sklearnModels.sequenceModelLasso(name, features, trainingSet, windowSize, windowStep, alpha=1.0, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)[source]¶

Bases: gnocis.models.sequenceModel

The sequenceModelLasso class trains a Lasso model using scikit-learn.

Parameters:	name (str) – Model name. features (features) – Feature set. positives (sequences) – Positive training sequences. negatives (sequences) – Negative training sequences. windowSize (int) – Window size. windowStep (int) – Window step size. alpha (float) – Alpha parameter for Lasso.

getSequenceFeatureVector(seq)[source]¶

getTrainer()[source]¶

scoreWindow(seq)[source]¶

class gnocis.sklearnModels.sequenceModelRF(name, features, trainingSet, windowSize, windowStep, nTrees=100, maxDepth=None, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)[source]¶

Bases: gnocis.models.sequenceModel

The sequenceModelRF class trains a Random Forest (RF) model using scikit-learn.

Parameters:	name (str) – Model name. features (features) – Feature set. positives (sequences) – Positive training sequences. negatives (sequences) – Negative training sequences. windowSize (int) – Window size. windowStep (int) – Window step size. nTrees (int) – Number of trees. maxDepth (int) – Maximum tree depth.

getSequenceFeatureVector(seq)[source]¶

getTrainer()[source]¶

scoreWindow(seq)[source]¶

class gnocis.sklearnModels.sequenceModelSVM(name, features, trainingSet, windowSize, windowStep, kDegree, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)[source]¶

Bases: gnocis.models.sequenceModel

The sequenceModelSVM class trains a Support Vector Machine (SVM) using scikit-learn.

Parameters:	name (str) – Model name. features (features) – Feature set. positives (sequences) – Positive training sequences. negatives (sequences) – Negative training sequences. windowSize (int) – Window size. windowStep (int) – Window step size. kDegree (float) – Kernel degree.

getSequenceFeatureVector(seq)[source]¶

getTrainer()[source]¶

scoreWindow(seq)[source]¶

gnocis.sklearnModelsOpt module¶

class gnocis.sklearnModelsOpt.sequenceModelSVMOptimizedQuadratic(name, features, trainingSet, windowSize, windowStep, kDegree, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)[source]¶

Bases: gnocis.models.sequenceModel

The sequenceModelSVMOptimizedQuadratic class trains a quadratic kernel Support Vector Machine (SVM) using scikit-learn. The kernel is applied using matrix multiplication.

Parameters:	name (str) – Model name. features (features) – Feature set. positives (sequences) – Positive training sequences. negatives (sequences) – Negative training sequences. windowSize (int) – Window size. windowStep (int) – Window step size. kDegree (float) – Kernel degree.

getSequenceFeatureVector(seq)[source]¶

getTrainer()[source]¶

scoreWindow(seq)[source]¶

class gnocis.sklearnModelsOpt.sequenceModelSVMOptimizedQuadraticAutoScale(name, features, trainingSet, windowSize, windowStep, kDegree, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)[source]¶

Bases: gnocis.models.sequenceModel

The sequenceModelSVMOptimizedQuadraticAutoScale class trains a quadratic kernel Support Vector Machine (SVM) using scikit-learn. The kernel is applied using matrix multiplication.

Parameters:	name (str) – Model name. features (features) – Feature set. positives (sequences) – Positive training sequences. negatives (sequences) – Negative training sequences. windowSize (int) – Window size. windowStep (int) – Window step size. kDegree (float) – Kernel degree.

getSequenceFeatureVector(seq)[source]¶

getTrainer()[source]¶

scoreWindow(seq)[source]¶

class gnocis.sklearnModelsOpt.sequenceModelSVMOptimizedQuadraticCUDA(name, features, trainingSet, windowSize, windowStep, kDegree, scale=True, labelPositive=Sequence label<Positive; value = 1.0>, labelNegative=Sequence label<Negative; value = -1.0>)[source]¶

Bases: gnocis.models.sequenceModel

The sequenceModelSVMOptimizedQuadraticCUDA class trains a quadratic kernel Support Vector Machine (SVM) using scikit-learn. The kernel is applied using matrix multiplication with CUDA.

Parameters:	name (str) – Model name. features (features) – Feature set. positives (sequences) – Positive training sequences. negatives (sequences) – Negative training sequences. windowSize (int) – Window size. windowStep (int) – Window step size. kDegree (float) – Kernel degree.

getSequenceFeatureVector(seq)[source]¶

getTrainer()[source]¶

scoreWindow(seq)[source]¶

gnocis.validation module¶

gnocis.validation.getAUC()¶

Calculates the Area Under the Curve (AUC) for an input curve.

Parameters:	curve (list) – Curve to calculate area under.
Returns:	Area Under the Curve (AUC).
Return type:	float

gnocis.validation.getConfusionMatrix()¶

Calculates confusion matrix based on input validation pairs, and an optional threshold.

Parameters:	vPos (list) – Positive validation pairs. vNeg (list) – Negative validation pairs. threshold (float) – Classification threshold.
Returns:	Dictionary with confusion matrix entries.
Return type:	dict

gnocis.validation.getConfusionMatrixStatistics()¶

Calculates confusion matrix statistics in an input confusion matrix dictionary.

Parameters:	CM (dict) – Confusion matrix dictionary.
Returns:	Dictionary with confusion matrix statistics.
Return type:	dict
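
The relationship between a confusion matrix and its derived statistics can be sketched as follows. The dictionary keys ("TP", "FP", "TN", "FN", "precision", and so on), the use of plain score lists instead of validationPair objects, and the strict "greater than threshold" rule are all assumptions for this example; the keys and comparison used by gnocis are not specified here.

```python
def confusion_matrix(pos_scores, neg_scores, threshold=0.0):
    """Count outcomes when scores above the threshold are predicted positive."""
    tp = sum(1 for s in pos_scores if s > threshold)
    fp = sum(1 for s in neg_scores if s > threshold)
    return {"TP": tp, "FP": fp,
            "TN": len(neg_scores) - fp, "FN": len(pos_scores) - tp}


def confusion_statistics(cm):
    """Derive common statistics; undefined ratios are reported as NaN."""
    def ratio(num, den):
        return num / den if den else float("nan")

    tp, fp, tn, fn = cm["TP"], cm["FP"], cm["TN"], cm["FN"]
    return {
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
    }


cm = confusion_matrix([0.9, 0.4, -0.2], [0.3, -0.5, -0.7, -0.1], threshold=0.0)
stats = confusion_statistics(cm)
```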

gnocis.validation.getPRC()¶

Generates a Precision/Recall Curve (PRC).

Parameters:	vPos (list) – Positive validation pairs. vNeg (list) – Negative validation pairs.
Returns:	List of 2D-points for curve
Return type:	list

gnocis.validation.getROC()¶

Generates a Receiver Operating Characteristic (ROC) curve.

Parameters:	vPos (list) – Positive validation pairs. vNeg (list) – Negative validation pairs.
Returns:	List of 2D-points for curve
Return type:	list
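
A ROC curve and its area can be sketched in a few lines. This is an illustrative stand-in that works on plain score lists and (x, y) tuples rather than validationPair and point2D objects; the function names and the trapezoidal rule are assumptions, not a description of getROC() or getAUC().

```python
def roc_curve(pos_scores, neg_scores):
    """Return (false positive rate, true positive rate) points, one per threshold."""
    thresholds = sorted(set(pos_scores) | set(neg_scores), reverse=True)
    points = [(0.0, 0.0)]
    for t in thresholds:
        tpr = sum(s >= t for s in pos_scores) / len(pos_scores)
        fpr = sum(s >= t for s in neg_scores) / len(neg_scores)
        points.append((fpr, tpr))
    return points


def area_under_curve(points):
    """Integrate a curve given as (x, y) points using the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


roc = roc_curve([0.9, 0.8, 0.4], [0.7, 0.3, 0.1])
auc = area_under_curve(roc)
```

In this example, 8 of the 9 positive/negative score pairs are ranked correctly, which matches the resulting area of 8/9.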

class gnocis.validation.point2D¶

Bases: object

Two-dimensional point.

Parameters:	x (float) – X-coordinate. y (float) – Y-coordinate. rank (float, optional) – Rank.

rank¶

x¶

y¶

gnocis.validation.printValidationStatistics()¶

Prints model confusion matrix statistics from a dictionary.

Parameters:	stats (dict) – Confusion matrix statistics dictionary.

class gnocis.validation.validationPair¶

Bases: object

Pair of score and binary class label.

Parameters:	score (float) – Score. label – Binary class label.

label¶

name¶

score¶
