The Python API

The following classes/modules are included tag’s Python API, which is under semantic versioning.

Range

class tag.range.Range(start, end)[source]

Represents a region of a biological sequence.

Start and end coordinates correspond to the 0-based half-open interval: start inclusive, end exclusive (start=0 and end=10 corresonds to the first 10 nucleotides).

>>> rng = Range(0, 1000)
>>> rng
[0, 1000)
>>> len(rng)
1000
>>> rng.contains(Range(10, 20))
True
>>> rng.contains_point(2345)
False
>>> rng.start = 500
>>> rng.start
500
>>> rng.end
1000
contains(other)[source]

Determine whether this range contains another.

contains_point(point)[source]

Determine whether this range contains the specified point.

intersect(other)[source]

Determine the interval of overlap between this range and another.

Returns:a new Range object representing the overlapping interval, or None if the ranges do not overlap.
merge(other)[source]

Merge this range object with another (ranges need not overlap or abut).

Returns:a new Range object representing the interval containing both ranges.
overlap(other)[source]

Determine whether this range overlaps with another.

overlap_atleast(other, minbp=1, minperc=0.0)[source]

Determine whether two ranges overlap by at least some threshold.

The thresholds can be specified using a minimum number of bases, or a minimum percentage of the length of the ranges, or both.

overlap_extent(other)[source]

Compute number of nucleotides of overlap between two ranges.

transform(offset)[source]

Shift this range by the specified offset.

Note: the resulting range must be a valid interval.

within(other)[source]

Determine whether this range is contained within another.

Comment

class tag.comment.Comment(data)[source]

Represents a comment in an annotation (GFF3) file.

Any GFF3 entry starting with >= 1 ‘#’ characters is treated as a comment, with two exceptions:

  • the separator directive, a line containing ‘###’ and nothing more
  • any entry beginning with just two ‘#’ characters is treated as a directive.

Directive

class tag.directive.Directive(data)[source]

Represents a directive from a GFF3 file.

This class is primarily for error checking and data access. Once created, Directive objects should be treated as read-only: modify at your peril! Also, separator directives (###) and the ##FASTA directive are handled directly by parsers and not by this class.

Directives not explicitly declared in the GFF3 spec are application specific: they will be parsed without complaint, but no guarantees can be made about accessing their attributes.

>>> sr = Directive('##sequence-region chr1 5000 10000')
>>> sr.type
'sequence-region'
>>> sr.seqid
'chr1'
>>> gb = Directive('##genome-build   BeeBase 4.5')
>>> gb.type
'genome-build'
>>> gb.source
'BeeBase'

Sequence

class tag.sequence.Sequence(defline, seq)[source]

Represents a sequence (almost always a nucleotide sequence).

We do not encourage putting FASTA sequences in your GFF3 files, but since the specification explicitly allows it we have to handle it. :-(

>>> s = Sequence('>contig1 description', 'ACGT')
>>> s.defline
'>contig1 description'
>>> s.seq
'ACGT'
>>> s.seqid
'contig1'
>>> len(s)
4
>>> s = Sequence('>gi|12345|gb|BOGUSSEQ', 'GATTACA')
>>> s.seq
'GATTACA'
>>> s.accession
'BOGUSSEQ'
accession

Parse accession number from commonly supported formats.

If the defline does not match one of the following formats, the entire description (sans leading caret) will be returned.

  • >gi|572257426|ref|XP_006607122.1|
  • >gnl|Tcas|XP_008191512.1
  • >lcl|PdomMRNAr1.2-10981.1
format_seq(outstream=None, linewidth=70)[source]

Print a sequence in a readable format.

Parameters:
  • outstream – if None, formatted sequence is returned as a string; otherwise, it is treated as a file-like object and the formatted sequence is printed to the outstream
  • linewidth – width for wrapping sequences over multiple lines; set to 0 for no wrapping

Feature

class tag.feature.Feature(seqid, ftype, start, end, source='tag', score=None, strand=None, phase=None, attrstr=None)[source]

Represents a feature entry from a GFF3 file.

>>> gene = Feature('contig1', 'gene', 999, 7500, source='snap', strand='+',
...                attrstr='ID=gene1')
>>> gene.seqid
'contig1'
>>> gene.source
'snap'
>>> gene.type
'gene'
>>> gene.range
[999, 7500)
>>> gene.score is None
True
>>> gene.strand
'+'
>>> gene.phase is None
True
>>> gene.get_attribute('ID')
'gene1'
>>> gene.get_attribute('NotARegisteredAttribute')
>>> gene.slug
'gene@contig1[1000, 7500]'
add_attribute(attrkey, attrvalue, append=False, oldvalue=None)[source]

Add an attribute to this feature.

Feature attributes are stored as nested dictionaries.

Each feature can only have one ID, so ID attribute mapping is ‘string’ to ‘string’. All other attributes can have multiple values, so mapping is ‘string’ to ‘dict of strings’.

By default, adding an attribute that already exists will cause the old value to be overwritten. If the append option is true, the new attribute value will not overwrite the old value, but will be appended as a second value. (Note: ID attributes can have only 1 value.)

If the oldvalue option is set, the new value will replace the old value. This is necessary for updating an attribute that has multiple values without completely overwriting all old values. (Note: The append option is ignored when oldvalue is set.)

add_child(child, rangecheck=False)[source]

Add a child feature to this feature.

add_sibling(sibling)[source]

Designate this a multi-feature representative and add a co-feature.

Some features exist discontinuously on the sequence, and therefore cannot be declared with a single GFF3 entry (which can encode only a single interval). The canonical encoding for these types of features is called a multi-feature, in which a single feature is declared on multiple lines with multiple entries all sharing the same feature type and ID attribute. This is commonly done with coding sequence (CDS) features.

In this package, each multi-feature has a single “representative” feature object, and all other objects/entries associated with that multi-feature are attached to it as “siblings”.

Invoking this method will designate the calling feature as the multi-feature representative and add the argument as a sibling.

attribute_crawl(key)[source]

Grab all attribute values associated with the given feature.

Traverse the given feature (and all of its descendants) to find all values associated with the given attribute key.

>>> reader = tag.GFF3Reader(
...     tag.tests.data_stream('otau-no-seqreg.gff3')
... )
>>> features = tag.select.features(reader)
>>> for feature in features:
...     names = feature.attribute_crawl('Name')
...     print(sorted(list(names)))
['Ot01g00060', 'XM_003074019.1', 'XP_003074065.1']
['Ot01g00070', 'XM_003074020.1', 'XP_003074066.1']
['Ot01g00080', 'XM_003074021.1', 'XP_003074067.1']
['Ot01g00090', 'XM_003074022.1', 'XP_003074068.1']
['Ot01g00100', 'XM_003074023.1', 'XP_003074069.1']
['Ot01g00110', 'XM_003074024.1', 'XP_003074070.1']
cdslen

Translated length of this feature.

Undefined for non-mRNA features.

contains(rng)[source]

Report whether this feature contains the specified range.

contains_point(point)[source]

Report whether this feature contains the specified point.

drop_attribute(attrkey)[source]

Drop the specified attribute from the feature.

get_attribute(attrkey, as_string=False, as_list=False)[source]

Get the value of an attribute.

By default, returns a string for ID and attributes with a single value, and a list of strings for attributes with multiple values. The as_string and as_list options can be used to force the function to return values as a string (comma-separated in case of multiple values) or a list.

get_attribute_keys()[source]

Return a list of all this feature’s attribute keys.

ncbi_geneid

Retrieve this feature’s NCBI GeneID if it’s present.

NCBI GFF3 files contain gene IDs encoded in Dbxref attributes (example: Dbxref=GeneID:103504972). This function locates and returns the GeneID if present, or returns None otherwise.

overlap(rng)[source]

Report whether this feature overlaps with the specified range.

parse_attributes(attrstring)[source]

Parse an attribute string.

Given a string with semicolon-separated key-value pairs, populate a dictionary with the given attributes.

pseudoify()[source]

Derive a pseudo-feature parent from the given multi-feature.

The provided multi-feature does not need to be the representative. The newly created pseudo-feature has the same seqid as the provided multi- feature, and spans its entire range. Otherwise, the pseudo-feature is empty. It is used only for convenience in sorting.

set_coord(start, end)[source]

Manually reset the feature’s coordinates.

slug

A concise slug for this feature.

Unlike the internal representation, which is 0-based half-open, the slug is a 1-based closed interval (a la GFF3).

transform(offset, newseqid=None)[source]

Transform the feature’s coordinates by the given offset.

Index

class tag.index.Index[source]

In-memory index for efficient interval-based queries for genome features.

Implemented as a dictionary, with sequence IDs as keys to interval trees of features for the corresponding scaffold or contig sequence.

>>> index = tag.index.Index()
>>> index.consume_file(tag.tests.data_file('pcan-123.gff3.gz'))
>>> list(index.seqids)
['scaffold_123', 'scaffold_124', 'scaffold_125']
>>> index.extent('scaffold_124')
(6020, 444332)
>>> for feature in index.query('scaffold_123', 5000, 6000):
...     print(feature.slug)
gene@scaffold_123[5583, 5894]
>>> for feature in index.query('scaffold_125', 19000, 87000, strict=False):
...     print(feature.slug)
gene@scaffold_125[18994, 19335]
gene@scaffold_125[57450, 57680]
gene@scaffold_125[86995, 87151]
>>> reader = tag.reader.GFF3Reader(
...     tag.tests.data_stream('grape-cpgat.gff3')
... )
>>> index = tag.index.Index()
>>> index.consume(reader)
>>> index.yield_inferred = False
>>> for entry in index:
...     print(entry.slug)
sequence chr8[1, 100000]
gene@chr8[72, 5081]
gene@chr8[10538, 11678]
gene@chr8[22053, 23448]
consume(entrystream)[source]

Load a stream of entries into memory.

Only Feature objects and sequence-region directives are loaded, all other entries are discarded.

consume_feature(feature)[source]

Load a Feature object into memory.

consume_file(infile)[source]

Load the specified GFF3 file into memory.

consume_seqreg(seqreg)[source]

Load a ##sequence-region directive into memory.

query(seqid, start, end=None, strict=True)[source]

Query the index for features in the specified range.

Parameters:
  • seqid – ID of the sequence to query
  • start – start of the query interval
  • end – end of the query interval
  • strict – indicates whether query is strict containment or overlap (True and False, respectively)
class tag.index.NamedIndex[source]

In-memory index for retrieving genome features by identifier.

>>> index = tag.index.NamedIndex()
>>> index.consume_file(tag.tests.data_file('pcan-123.gff3.gz'))
>>> gene = index['PCAN011a001813']
>>> print(gene.slug)
gene@scaffold_123[200029, 201298]
>>> exon = index['PCAN011a001859T1.exon1']
>>> print(exon.slug)
exon@scaffold_125[57450, 57680]

Readers

Currently the readers module contains only a single class, GFF3Reader, but may include others in the future.

exception tag.reader.AnnotationOutOfBoundsError[source]
exception tag.reader.AnnotationSortingError[source]
exception tag.reader.DuplicatedRegionError[source]
exception tag.reader.FeatureTypeDisagreementError[source]
class tag.reader.GFF3Reader(instream=None, infilename=None, assumesorted=False, strict=True, checkorder=True)[source]

Loads sequence features and other GFF3 entries into memory.

Features and other objects are obtained by iterating over the GFF3Reader object. See the tag.select module for available filtering functions.

>>> infile = tag.tests.data_file('pbar-withseq.gff3')
>>> reader = GFF3Reader(infilename=infile)
>>> for entry in reader:
...     print(type(entry))
<class 'tag.directive.Directive'>
<class 'tag.directive.Directive'>
<class 'tag.directive.Directive'>
<class 'tag.comment.Comment'>
<class 'tag.feature.Feature'>
<class 'tag.feature.Feature'>
<class 'tag.feature.Feature'>
<class 'tag.sequence.Sequence'>
<class 'tag.sequence.Sequence'>
>>> reader = GFF3Reader(infilename=infile)
>>> for feature in tag.select.features(reader, type='gene'):
...     print(feature.slug)
gene@NW_011929623.1[4557, 5749]
gene@NW_011929624.1[3725, 4229]

By default, the GFF3Reader assumes the input is not in sorted order and loads the entire annotation into memory to ensure sorting and resolution of ID/Parent relationships. If the input is sorted according to genomic position and independent features are separated by ### directives, memory consumption will be drastically reduced by setting the assumesorted attribute to True.

The strict attribute enforces some additional sanity checks, which in some exceptional cases may need to be relaxed.

tag.reader.parse_fasta(data)[source]

Load sequences in Fasta format.

This generator function yields a Sequence object for each sequence record in a GFF3 file. Implementation stolen shamelessly from http://stackoverflow.com/a/7655072/459780.

Writers

Currently the writers module contains only a single class, GFF3Writer, but may include others in the future.

class tag.writer.GFF3Writer(instream, outfile='-')[source]

Writes sequence features and other GFF3 entries to a file.

The instream is expected to be an iterable of sequence features and other related objects. Set outfile to - to write output to stdout.

>>> # Sort and tidy GFF3 file in 3 lines!
>>> reader = GFF3Reader(infilename=tag.tests.data_file('grape-cpgat.gff3'))
>>> writer = GFF3Writer(instream=reader, outfile='/dev/null')
>>> writer.retainids = True
>>> writer.write()
write(blockitvl=0)[source]

Pull features from the instream and write them to the output.

By default, separator tags are added at the end of complex features. To intersperse separators throughout blocks of simple features, specify a desired block size with blockitvl.

Transcript

tag.transcript.primary_mrna(entrystream, parenttype='gene')[source]

Select a single mRNA as a representative for each protein-coding gene.

The primary mRNA is the one with the longest translation product. In cases where multiple isoforms have the same translated length, the feature ID is used for sorting.

This function does not return only mRNA features, it returns all GFF3 entry types (pragmas, features, sequences, etc). The function does modify the gene features that pass through to ensure that they have at most a single mRNA feature.

>>> reader = tag.GFF3Reader(tag.tests.data_stream('pdom-withseq.gff3'))
>>> filter = tag.transcript.primary_mrna(reader)
>>> for gene in tag.select.features(filter, type='gene'):
...    assert gene.num_children == 1
tag.transcript.primary_transcript(entrystream, parenttype='gene', logstream=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>)[source]

Select a single transcript as a representative for each gene.

This function is a generalization of the primary_mrna function that attempts, under certain conditions, to select a single transcript as a representative for each gene. If a gene encodes multiple transcript types, one of those types must be mRNA or the function will complain loudly and fail.

For mRNAs, the primary transcript is selected according to translated length. For all other transcript types, the length of the transcript feature itself is used. I’d be eager to hear suggestions for alternative selection criteria.

Like the primary_mrna function, this function does not return only transcript features. It does modify gene features to ensure that each has at most one transcript feature.

>>> reader = tag.GFF3Reader(
...     tag.tests.data_stream('psyllid-mixed-gene.gff3.gz')
... )
>>> gene_filter = tag.select.features(reader, type='gene')
>>> trans_filter = tag.transcript.primary_transcript(gene_filter)
>>> for gene in trans_filter:
...     assert gene.num_children == 1

In cases where the direct children of a gene feature have heterogenous types, the primary_mrna function will only discard mRNA features. This function, however, will discard all direct children of the gene that are not the primary transcript, including non-transcript children. This is a retty subtle distinction, and anecdotal experience suggests that cases in which the distinction actually matters are extremely rare.

Selectors

tag.select.directives(entrystream, type=None)[source]

Pull directives out of the specified entry stream.

Parameters:
  • entrystream – a stream of entries
  • type – retrieve only directives of the specified type; set to None to retrieve all directives
tag.select.entry_type_filter(entrystream, entryclass)[source]

Generic entry filter.

Parameters:
  • entrystream – a stream of entries
  • entryclass – specify the type of entry upon which to filter
tag.select.features(entrystream, type=None, traverse=False)[source]

Pull features out of the specified entry stream.

Parameters:
  • entrystream – a stream of entries
  • type – retrieve only features of the specified type; set to None to retrieve all features
  • traverse – by default, only top-level features are selected; set to True to search each feature graph for the specified feature type
tag.select.merge(*sorted_streams)[source]

Efficiently merge sorted annotation streams.

tag.select.sequences(entrystream)[source]

Pull sequences out of the specified entry stream.

tag.select.window(featurestream, seqid, start=None, end=None, strict=True)[source]

Pull features out of the designated genomic interval.

This function uses 0-based half-open intervals, not the 1-based closed intervals used by GFF3.

Parameters:
  • featurestream – a stream of feature entries
  • seqid – ID of the sequence from which to select features
  • start – start of the genomic interval
  • end – end of the genomic interval
  • strict – when set to True, only features completely contained within the interval are selected; when set to False, any feature overlapping the interval is selected

Locus partitioning

tag.locus.loci(*sorted_streams, **kwargs)[source]

Determine feature loci from two or more sorted annotation streams.

Rather than simply relying on gene coordinates from a reference annotation, this function will determine coordinates for features of interest by considering an arbitrary number of annotation sources, whether they be references or predictions. This enables efficient and accurate assessment of both sensitivity and specificity at the level of individual nucleotides and entire features.

tag.locus.pocus(*sorted_streams, **kwargs)[source]

Feature stream for locus parsing.

From two or more sorted annotation streams, create a new stream that yields the features from the original streams, additional locus features, and separator directives (###) between loci.

Bacterial annotation evaluation

tag.bae.collapse_locus(features)[source]

Collapse redundant annotation information.

Given a locus, collapse any gene predictions with identical coordinates into a single prediction, noting the annotation sources that support the collapsed gene model.

tag.bae.collapse_stream(locusstream)[source]

Feature stream for collapsing bacterial annotation data.

tag.bae.eval_locus(features)[source]

Evaluate congruence between gene predictions from different sources.

Gene predictions are assumed to be microbial protein-coding genes, marked as type CDS. Any reference protein alignments can be included as features of type translated_nucleotide_match. Compute the agreement between different sources of start/stop codon position, entire ORF position, and coverage from reference proteins.

tag.bae.eval_stream(locusstream)[source]

Feature stream for bacterial annotation evaluation.