The Python API¶
The following classes/modules are included tag’s Python API, which is under semantic versioning.
Range¶
-
class
tag.range.
Range
(start, end)[source]¶ Represents a region of a biological sequence.
Start and end coordinates correspond to the 0-based half-open interval: start inclusive, end exclusive (start=0 and end=10 corresonds to the first 10 nucleotides).
>>> rng = Range(0, 1000) >>> rng [0, 1000) >>> len(rng) 1000 >>> rng.contains(Range(10, 20)) True >>> rng.contains_point(2345) False >>> rng.start = 500 >>> rng.start 500 >>> rng.end 1000
-
intersect
(other)[source]¶ Determine the interval of overlap between this range and another.
Returns: a new Range object representing the overlapping interval, or None if the ranges do not overlap.
-
merge
(other)[source]¶ Merge this range object with another (ranges need not overlap or abut).
Returns: a new Range object representing the interval containing both ranges.
-
overlap_atleast
(other, minbp=1, minperc=0.0)[source]¶ Determine whether two ranges overlap by at least some threshold.
The thresholds can be specified using a minimum number of bases, or a minimum percentage of the length of the ranges, or both.
-
Comment¶
-
class
tag.comment.
Comment
(data)[source]¶ Represents a comment in an annotation (GFF3) file.
Any GFF3 entry starting with >= 1 ‘#’ characters is treated as a comment, with two exceptions:
- the separator directive, a line containing ‘###’ and nothing more
- any entry beginning with just two ‘#’ characters is treated as a directive.
Directive¶
-
class
tag.directive.
Directive
(data)[source]¶ Represents a directive from a GFF3 file.
This class is primarily for error checking and data access. Once created, Directive objects should be treated as read-only: modify at your peril! Also, separator directives (###) and the ##FASTA directive are handled directly by parsers and not by this class.
Directives not explicitly declared in the GFF3 spec are application specific: they will be parsed without complaint, but no guarantees can be made about accessing their attributes.
>>> sr = Directive('##sequence-region chr1 5000 10000') >>> sr.type 'sequence-region' >>> sr.seqid 'chr1' >>> gb = Directive('##genome-build BeeBase 4.5') >>> gb.type 'genome-build' >>> gb.source 'BeeBase'
Sequence¶
-
class
tag.sequence.
Sequence
(defline, seq)[source]¶ Represents a sequence (almost always a nucleotide sequence).
We do not encourage putting FASTA sequences in your GFF3 files, but since the specification explicitly allows it we have to handle it. :-(
>>> s = Sequence('>contig1 description', 'ACGT') >>> s.defline '>contig1 description' >>> s.seq 'ACGT' >>> s.seqid 'contig1' >>> len(s) 4 >>> s = Sequence('>gi|12345|gb|BOGUSSEQ', 'GATTACA') >>> s.seq 'GATTACA' >>> s.accession 'BOGUSSEQ'
-
accession
¶ Parse accession number from commonly supported formats.
If the defline does not match one of the following formats, the entire description (sans leading caret) will be returned.
- >gi|572257426|ref|XP_006607122.1|
- >gnl|Tcas|XP_008191512.1
- >lcl|PdomMRNAr1.2-10981.1
-
format_seq
(outstream=None, linewidth=70)[source]¶ Print a sequence in a readable format.
Parameters: - outstream – if None, formatted sequence is returned as a string; otherwise, it is treated as a file-like object and the formatted sequence is printed to the outstream
- linewidth – width for wrapping sequences over multiple lines; set to 0 for no wrapping
-
Feature¶
-
class
tag.feature.
Feature
(seqid, ftype, start, end, source='tag', score=None, strand=None, phase=None, attrstr=None)[source]¶ Represents a feature entry from a GFF3 file.
>>> gene = Feature('contig1', 'gene', 999, 7500, source='snap', strand='+', ... attrstr='ID=gene1') >>> gene.seqid 'contig1' >>> gene.source 'snap' >>> gene.type 'gene' >>> gene.range [999, 7500) >>> gene.score is None True >>> gene.strand '+' >>> gene.phase is None True >>> gene.get_attribute('ID') 'gene1' >>> gene.get_attribute('NotARegisteredAttribute') >>> gene.slug 'gene@contig1[1000, 7500]'
-
add_attribute
(attrkey, attrvalue, append=False, oldvalue=None)[source]¶ Add an attribute to this feature.
Feature attributes are stored as nested dictionaries.
Each feature can only have one ID, so ID attribute mapping is ‘string’ to ‘string’. All other attributes can have multiple values, so mapping is ‘string’ to ‘dict of strings’.
By default, adding an attribute that already exists will cause the old value to be overwritten. If the append option is true, the new attribute value will not overwrite the old value, but will be appended as a second value. (Note: ID attributes can have only 1 value.)
If the oldvalue option is set, the new value will replace the old value. This is necessary for updating an attribute that has multiple values without completely overwriting all old values. (Note: The append option is ignored when oldvalue is set.)
-
add_sibling
(sibling)[source]¶ Designate this a multi-feature representative and add a co-feature.
Some features exist discontinuously on the sequence, and therefore cannot be declared with a single GFF3 entry (which can encode only a single interval). The canonical encoding for these types of features is called a multi-feature, in which a single feature is declared on multiple lines with multiple entries all sharing the same feature type and ID attribute. This is commonly done with coding sequence (CDS) features.
In this package, each multi-feature has a single “representative” feature object, and all other objects/entries associated with that multi-feature are attached to it as “siblings”.
Invoking this method will designate the calling feature as the multi-feature representative and add the argument as a sibling.
-
attribute_crawl
(key)[source]¶ Grab all attribute values associated with the given feature.
Traverse the given feature (and all of its descendants) to find all values associated with the given attribute key.
>>> reader = tag.GFF3Reader( ... tag.tests.data_stream('otau-no-seqreg.gff3') ... ) >>> features = tag.select.features(reader) >>> for feature in features: ... names = feature.attribute_crawl('Name') ... print(sorted(list(names))) ['Ot01g00060', 'XM_003074019.1', 'XP_003074065.1'] ['Ot01g00070', 'XM_003074020.1', 'XP_003074066.1'] ['Ot01g00080', 'XM_003074021.1', 'XP_003074067.1'] ['Ot01g00090', 'XM_003074022.1', 'XP_003074068.1'] ['Ot01g00100', 'XM_003074023.1', 'XP_003074069.1'] ['Ot01g00110', 'XM_003074024.1', 'XP_003074070.1']
-
cdslen
¶ Translated length of this feature.
Undefined for non-mRNA features.
-
get_attribute
(attrkey, as_string=False, as_list=False)[source]¶ Get the value of an attribute.
By default, returns a string for ID and attributes with a single value, and a list of strings for attributes with multiple values. The as_string and as_list options can be used to force the function to return values as a string (comma-separated in case of multiple values) or a list.
-
ncbi_geneid
¶ Retrieve this feature’s NCBI GeneID if it’s present.
NCBI GFF3 files contain gene IDs encoded in Dbxref attributes (example: Dbxref=GeneID:103504972). This function locates and returns the GeneID if present, or returns None otherwise.
-
parse_attributes
(attrstring)[source]¶ Parse an attribute string.
Given a string with semicolon-separated key-value pairs, populate a dictionary with the given attributes.
-
pseudoify
()[source]¶ Derive a pseudo-feature parent from the given multi-feature.
The provided multi-feature does not need to be the representative. The newly created pseudo-feature has the same seqid as the provided multi- feature, and spans its entire range. Otherwise, the pseudo-feature is empty. It is used only for convenience in sorting.
-
slug
¶ A concise slug for this feature.
Unlike the internal representation, which is 0-based half-open, the slug is a 1-based closed interval (a la GFF3).
-
Index¶
-
class
tag.index.
Index
[source]¶ In-memory index for efficient interval-based queries for genome features.
Implemented as a dictionary, with sequence IDs as keys to interval trees of features for the corresponding scaffold or contig sequence.
>>> index = tag.index.Index() >>> index.consume_file(tag.tests.data_file('pcan-123.gff3.gz')) >>> list(index.seqids) ['scaffold_123', 'scaffold_124', 'scaffold_125'] >>> index.extent('scaffold_124') (6020, 444332) >>> for feature in index.query('scaffold_123', 5000, 6000): ... print(feature.slug) gene@scaffold_123[5583, 5894] >>> for feature in index.query('scaffold_125', 19000, 87000, strict=False): ... print(feature.slug) gene@scaffold_125[18994, 19335] gene@scaffold_125[57450, 57680] gene@scaffold_125[86995, 87151] >>> reader = tag.reader.GFF3Reader( ... tag.tests.data_stream('grape-cpgat.gff3') ... ) >>> index = tag.index.Index() >>> index.consume(reader) >>> index.yield_inferred = False >>> for entry in index: ... print(entry.slug) sequence chr8[1, 100000] gene@chr8[72, 5081] gene@chr8[10538, 11678] gene@chr8[22053, 23448]
-
consume
(entrystream)[source]¶ Load a stream of entries into memory.
Only Feature objects and sequence-region directives are loaded, all other entries are discarded.
-
query
(seqid, start, end=None, strict=True)[source]¶ Query the index for features in the specified range.
Parameters: - seqid – ID of the sequence to query
- start – start of the query interval
- end – end of the query interval
- strict – indicates whether query is strict containment or overlap
(
True
andFalse
, respectively)
-
-
class
tag.index.
NamedIndex
[source]¶ In-memory index for retrieving genome features by identifier.
>>> index = tag.index.NamedIndex() >>> index.consume_file(tag.tests.data_file('pcan-123.gff3.gz')) >>> gene = index['PCAN011a001813'] >>> print(gene.slug) gene@scaffold_123[200029, 201298] >>> exon = index['PCAN011a001859T1.exon1'] >>> print(exon.slug) exon@scaffold_125[57450, 57680]
Readers¶
Currently the readers
module contains only a single class, GFF3Reader,
but may include others in the future.
-
class
tag.reader.
GFF3Reader
(instream=None, infilename=None, assumesorted=False, strict=True, checkorder=True)[source]¶ Loads sequence features and other GFF3 entries into memory.
Features and other objects are obtained by iterating over the GFF3Reader object. See the
tag.select
module for available filtering functions.>>> infile = tag.tests.data_file('pbar-withseq.gff3') >>> reader = GFF3Reader(infilename=infile) >>> for entry in reader: ... print(type(entry)) <class 'tag.directive.Directive'> <class 'tag.directive.Directive'> <class 'tag.directive.Directive'> <class 'tag.comment.Comment'> <class 'tag.feature.Feature'> <class 'tag.feature.Feature'> <class 'tag.feature.Feature'> <class 'tag.sequence.Sequence'> <class 'tag.sequence.Sequence'> >>> reader = GFF3Reader(infilename=infile) >>> for feature in tag.select.features(reader, type='gene'): ... print(feature.slug) gene@NW_011929623.1[4557, 5749] gene@NW_011929624.1[3725, 4229]
By default, the GFF3Reader assumes the input is not in sorted order and loads the entire annotation into memory to ensure sorting and resolution of ID/Parent relationships. If the input is sorted according to genomic position and independent features are separated by
###
directives, memory consumption will be drastically reduced by setting theassumesorted
attribute to True.The
strict
attribute enforces some additional sanity checks, which in some exceptional cases may need to be relaxed.
-
tag.reader.
parse_fasta
(data)[source]¶ Load sequences in Fasta format.
This generator function yields a Sequence object for each sequence record in a GFF3 file. Implementation stolen shamelessly from http://stackoverflow.com/a/7655072/459780.
Writers¶
Currently the writers
module contains only a single class, GFF3Writer,
but may include others in the future.
-
class
tag.writer.
GFF3Writer
(instream, outfile='-')[source]¶ Writes sequence features and other GFF3 entries to a file.
The
instream
is expected to be an iterable of sequence features and other related objects. Setoutfile
to-
to write output to stdout.>>> # Sort and tidy GFF3 file in 3 lines! >>> reader = GFF3Reader(infilename=tag.tests.data_file('grape-cpgat.gff3')) >>> writer = GFF3Writer(instream=reader, outfile='/dev/null') >>> writer.retainids = True >>> writer.write()
Transcript¶
-
tag.transcript.
primary_mrna
(entrystream, parenttype='gene')[source]¶ Select a single mRNA as a representative for each protein-coding gene.
The primary mRNA is the one with the longest translation product. In cases where multiple isoforms have the same translated length, the feature ID is used for sorting.
This function does not return only mRNA features, it returns all GFF3 entry types (pragmas, features, sequences, etc). The function does modify the gene features that pass through to ensure that they have at most a single mRNA feature.
>>> reader = tag.GFF3Reader(tag.tests.data_stream('pdom-withseq.gff3')) >>> filter = tag.transcript.primary_mrna(reader) >>> for gene in tag.select.features(filter, type='gene'): ... assert gene.num_children == 1
-
tag.transcript.
primary_transcript
(entrystream, parenttype='gene', logstream=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>)[source]¶ Select a single transcript as a representative for each gene.
This function is a generalization of the primary_mrna function that attempts, under certain conditions, to select a single transcript as a representative for each gene. If a gene encodes multiple transcript types, one of those types must be mRNA or the function will complain loudly and fail.
For mRNAs, the primary transcript is selected according to translated length. For all other transcript types, the length of the transcript feature itself is used. I’d be eager to hear suggestions for alternative selection criteria.
Like the primary_mrna function, this function does not return only transcript features. It does modify gene features to ensure that each has at most one transcript feature.
>>> reader = tag.GFF3Reader( ... tag.tests.data_stream('psyllid-mixed-gene.gff3.gz') ... ) >>> gene_filter = tag.select.features(reader, type='gene') >>> trans_filter = tag.transcript.primary_transcript(gene_filter) >>> for gene in trans_filter: ... assert gene.num_children == 1
In cases where the direct children of a gene feature have heterogenous types, the primary_mrna function will only discard mRNA features. This function, however, will discard all direct children of the gene that are not the primary transcript, including non-transcript children. This is a retty subtle distinction, and anecdotal experience suggests that cases in which the distinction actually matters are extremely rare.
Selectors¶
-
tag.select.
directives
(entrystream, type=None)[source]¶ Pull directives out of the specified entry stream.
Parameters: - entrystream – a stream of entries
- type – retrieve only directives of the specified type; set to
None
to retrieve all directives
-
tag.select.
entry_type_filter
(entrystream, entryclass)[source]¶ Generic entry filter.
Parameters: - entrystream – a stream of entries
- entryclass – specify the type of entry upon which to filter
-
tag.select.
features
(entrystream, type=None, traverse=False)[source]¶ Pull features out of the specified entry stream.
Parameters: - entrystream – a stream of entries
- type – retrieve only features of the specified type; set to
None
to retrieve all features - traverse – by default, only top-level features are selected; set
to
True
to search each feature graph for the specified feature type
-
tag.select.
window
(featurestream, seqid, start=None, end=None, strict=True)[source]¶ Pull features out of the designated genomic interval.
This function uses 0-based half-open intervals, not the 1-based closed intervals used by GFF3.
Parameters: - featurestream – a stream of feature entries
- seqid – ID of the sequence from which to select features
- start – start of the genomic interval
- end – end of the genomic interval
- strict – when set to
True
, only features completely contained within the interval are selected; when set toFalse
, any feature overlapping the interval is selected
Locus partitioning¶
-
tag.locus.
loci
(*sorted_streams, **kwargs)[source]¶ Determine feature loci from two or more sorted annotation streams.
Rather than simply relying on gene coordinates from a reference annotation, this function will determine coordinates for features of interest by considering an arbitrary number of annotation sources, whether they be references or predictions. This enables efficient and accurate assessment of both sensitivity and specificity at the level of individual nucleotides and entire features.
Bacterial annotation evaluation¶
-
tag.bae.
collapse_locus
(features)[source]¶ Collapse redundant annotation information.
Given a locus, collapse any gene predictions with identical coordinates into a single prediction, noting the annotation sources that support the collapsed gene model.
-
tag.bae.
collapse_stream
(locusstream)[source]¶ Feature stream for collapsing bacterial annotation data.
-
tag.bae.
eval_locus
(features)[source]¶ Evaluate congruence between gene predictions from different sources.
Gene predictions are assumed to be microbial protein-coding genes, marked as type CDS. Any reference protein alignments can be included as features of type translated_nucleotide_match. Compute the agreement between different sources of start/stop codon position, entire ORF position, and coverage from reference proteins.