tag: genome annotation analysis in Python!¶
tag is a free open-source software package for analyzing genome annotation data. It is developed as a reusable library with a focus on ease of use. tag is implemented in pure Python (no compiling required) with minimal dependencies!
Contributor Code of Conduct¶
As contributors and maintainers of this project, we pledge to respect all people who contribute through reporting issues, posting feature requests, updating documentation, submitting pull requests or patches, and other activities.
We are committed to making participation in this project a harassment-free experience for everyone, regardless of level of experience, gender, gender identity and expression, sexual orientation, disability, personal appearance, body size, race, age, or religion.
Examples of unacceptable behavior by participants include the use of sexual language or imagery, derogatory comments or personal attacks, trolling, public or private harassment, insults, or other unprofessional conduct.
Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct. Project maintainers or contributors who do not follow the Code of Conduct may be blocked from further contribution and engagement with the project.
This Code of Conduct is adapted from the Contributor Covenant, version 1.0.0, available at http://contributor-covenant.org/version/1/0/0/.
Installing and using tag¶
The easiest way to install tag is with the pip
command.
By default, pip
installs the most recent stable release from the Python Package Index (PyPI).
pip install tag
To install the latest unreleased code, install directly from the development repository at GitHub.
pip install git+https://github.com/standage/tag.git
Note
We recommend installing tag and its prerequisites in a virtual environment.
tag is implemented in pure Python and requires no compilation. It has only a single runtime dependency (the intervaltree library) which is also pure Python.
Using tag interactively¶
If you want to analyze or explore your data interactively, fire up the Python interpreter by invoking the python
command in your terminal.
Please see the API documentation for a description of the data structures and objects available to you.
>>> import tag
>>> reader = tag.reader.GFF3Reader(infilename='/data/genomes/mybug.gff3.gz')
>>> for entry in tag.select.features(reader, type='intron'):
... if len(entry) > 100000:
... print(entry.slug)
intron@scaffold3[37992, 149255]
intron@scaffold55[288477, 389001]
intron@scaffold192[1057, 196433]
Using the tag command-line interface¶
The tag package has a command-line interface for common processing workflows.
Execute tag -h
to see a list of available commands and tag <cmd> -h
for instructions on running a particular command.
A crash course in annotation formats¶
Data file formatting is the perennial topic of jokes and rants in the bioinformatics community. For many scientists, it’s easier to come up with a new ad hoc format than it is to take the time to fully understand and adhere to existing formats (due in large part to the poor state of training for data and computing literacy and poor support for standards development in the life sciences). There’s even a running joke that inventing your own file format is an important rite of passage in becoming a “true” bioinformatician.
Introducing the Bioinformatics Ironman: write an assembler, a short/long read aligner and a file format
— Pall Melsted (@pmelsted) December 26, 2015
For annotating genome features, there are several formats in wide use, the most common of which include GFF3, GTF, and BED. These are all plain-text tab-delimited file formats.
Although there are many richer ways of representing genomic features via XML and in relational database schemas, the stubborn persistence of a variety of ad-hoc tab-delimited flat file formats declares the bioinformatics community’s need for a simple format that can be modified with a text editor and processed with shell tools like grep.– Lincoln Stein, GFF3 specification
GFF3 vs GTF vs BED¶
There are a lot of similarities between the “big three” annotation formats, but there are also some important differences (this has been covered before). BED and GTF were designed for very specific use cases (visualization and gene prediction, respectively), whereas GFF3 was designed as a generalized solution for genome annotation. BED allows for a single level of feature decomposition (a feature can be broken up into blocks) and GTF supports two levels (exons can be grouped by transcript_id, and transcripts grouped by gene_id), while GFF3 supports an arbitrary number of levels (parent/child relationships defined by ID and Parent attributes specify a directed acyclic graph of features). And finally, GFF3 and GTF use 1-based closed interval notation, whereas BED uses 0-based half-open interval notation which makes interval arithmetic much simpler.
One common complaint about GFF3 is that all the “action” happens in the 9th column, which consists of free-form key/value pairs for storing metadata and relationships between annotated features. To a large extent this complaint applies to all of these formats.
Why GFF3?¶
GFF3 is the preferred annotation encoding for the tag package. Despite its use of 1-based closed interval notation (inferior to the notation used by BED), GFF3 is the most robust and flexible format for describing genomic features.
There may be some level of support for GTF and BED in the future, but it will likely be limited to scripts for converting data into GFF3. This is a difficult problem to solve generically, since each annotation format exists in so many subtly incompatible variants. Rather, this will likely be handled on a case-by-case basis: for example, conversion scripts for a small number of tools or databases that produce data in a consistent GTF or BED format.
Annotation graphs¶
The concept of an “annotation graph” was first introduced by Eilbeck et al. (citation) and then elaborated on by Gremme et al. (citation). The tag library relies on this concept heavily. In short, rather than processing a data file one entry at a time (line by line), the idea is to group related features into a directed acyclic graph structure.

This has some implications for the terminology we use to describe annotations.
- When we say feature we could be referring to a single node in the annotation graph, or we could be referring to an entire connected component.
- We use parent and child to refer to part_of relationships between features and related subfeatures. In GFF3 these are encoded using ID and Parent attributes. Feature types and part_of relationships can be validated using the Sequence Ontology.
- A feature is a top-level feature if it has no parents.
The tag package provides two primary means for processing the annotation graph.
- iterating over the top-level features in the entire graph
- for a particular feature (connected component) in the graph, iterating over all connected subfeatures
Multiple parents¶
A node in a feature graph can have more than one parent, and the GFF3 specification explicitly permits features to have multiple Parent attribute values. For example, in an alternatively spliced gene, different isoforms will often share exons, and the canonical way to reflect this is for each shared exon to refer to each of its parent mRNAs in its Parent attribute. However, in my experience it is much more common for shared exons to be duplicated in the GFF3 file, once for each isoform to which they belong. This isn’t wrong per se from a standards perspective, but it can be misleading when (for example) counting up different feature types or calculating sequence characteristics.
Currently the tag library does little to warn the user of duplicate entries. It is up to the user to inspect the data to determine how it is encoded and how statistics on the features should be calculated and interpreted.
Multi-features¶
Some genome features exist discontinuously on the sequence, and therefore cannot be declared with a single GFF3 entry (which can encode only a single interval). The canonical encoding for these types of features is called a multi-feature, in which a single feature is declared on multiple lines with multiple entries all sharing the same feature type and ID attribute. This is commonly done with coding sequence (CDS) features.
tag designates one entry for each multi-features as its representative, and all other components of that feature are designated siblings of the representative.
Developer documentation¶
Development environment quickstart¶
# Fork repo on GitHub then clone
git clone git@github.com:YourGithubUsername/tag.git
cd tag/
# Optional (but highly recommended): set up a virtual environment
virtualenv -p python3 env
echo "export PYTHONPATH=$(pwd)" >> env/bin/activate
source env/bin/activate # Re-run this any time you return after launching a new terminal.
# Install Python libraries needed for testing and run the test suite
make devenv
make test
echo "make style" > .git/hooks/pre-commit
# Test the development CLI; end users will invoke `tag` instead of `./tagcli`
./tagcli -h
Testing¶
Complete coverage from automated tests is a high priority. This doesn’t mean the software is completely bug free, but it forces us to consider all the branching behavior and make sure all code is at least executed. Our testing philosophy is “stupidity-driven development”.
- Test core features extensively, but don’t try to be clever. Simple tests are sufficient and on balance are superior to complicated tests.
- When bugs appear, write regression tests, fix the bugs, and get back to something more important.
- Don’t invest too much time or effort in edge features, refactoring, optimization, etc. If there is a simple and direct path to a substantial improvement, go for it and then move on. Priorities: accuracy, then performance, then user experience.
API¶
The tag package is under semantic versioning. Any changes to the command-line interface or the Python interface require a version bump: a minor version bump for backwards-compatible additions, and a major version bump for API-breaking changes.
The API documentation is pulled directly from docstrings in the class and module definitions. Any Python code in the API docs is executed and evaluated via autodoc when the test suite is invoked.
Change log¶
The CHANGELOG.md
file follows the conventions described at
http://keepachangelog.com. Minor changes that affect only the package internals
need not be documented. More substantial changes, or any bug fixes or changes to
the API, should be documented.
The Python API¶
The following classes/modules are included tag’s Python API, which is under semantic versioning.
Range¶
-
class
tag.range.
Range
(start, end)[source]¶ Represents a region of a biological sequence.
Start and end coordinates correspond to the 0-based half-open interval: start inclusive, end exclusive (start=0 and end=10 corresonds to the first 10 nucleotides).
>>> rng = Range(0, 1000) >>> rng [0, 1000) >>> len(rng) 1000 >>> rng.contains(Range(10, 20)) True >>> rng.contains_point(2345) False >>> rng.start = 500 >>> rng.start 500 >>> rng.end 1000
-
intersect
(other)[source]¶ Determine the interval of overlap between this range and another.
Returns: a new Range object representing the overlapping interval, or None if the ranges do not overlap.
-
merge
(other)[source]¶ Merge this range object with another (ranges need not overlap or abut).
Returns: a new Range object representing the interval containing both ranges.
-
overlap_atleast
(other, minbp=1, minperc=0.0)[source]¶ Determine whether two ranges overlap by at least some threshold.
The thresholds can be specified using a minimum number of bases, or a minimum percentage of the length of the ranges, or both.
-
Comment¶
-
class
tag.comment.
Comment
(data)[source]¶ Represents a comment in an annotation (GFF3) file.
Any GFF3 entry starting with >= 1 ‘#’ characters is treated as a comment, with two exceptions:
- the separator directive, a line containing ‘###’ and nothing more
- any entry beginning with just two ‘#’ characters is treated as a directive.
Directive¶
-
class
tag.directive.
Directive
(data)[source]¶ Represents a directive from a GFF3 file.
This class is primarily for error checking and data access. Once created, Directive objects should be treated as read-only: modify at your peril! Also, separator directives (###) and the ##FASTA directive are handled directly by parsers and not by this class.
Directives not explicitly declared in the GFF3 spec are application specific: they will be parsed without complaint, but no guarantees can be made about accessing their attributes.
>>> sr = Directive('##sequence-region chr1 5000 10000') >>> sr.type 'sequence-region' >>> sr.seqid 'chr1' >>> gb = Directive('##genome-build BeeBase 4.5') >>> gb.type 'genome-build' >>> gb.source 'BeeBase'
Sequence¶
-
class
tag.sequence.
Sequence
(defline, seq)[source]¶ Represents a sequence (almost always a nucleotide sequence).
We do not encourage putting FASTA sequences in your GFF3 files, but since the specification explicitly allows it we have to handle it. :-(
>>> s = Sequence('>contig1 description', 'ACGT') >>> s.defline '>contig1 description' >>> s.seq 'ACGT' >>> s.seqid 'contig1' >>> len(s) 4 >>> s = Sequence('>gi|12345|gb|BOGUSSEQ', 'GATTACA') >>> s.seq 'GATTACA' >>> s.accession 'BOGUSSEQ'
-
accession
¶ Parse accession number from commonly supported formats.
If the defline does not match one of the following formats, the entire description (sans leading caret) will be returned.
- >gi|572257426|ref|XP_006607122.1|
- >gnl|Tcas|XP_008191512.1
- >lcl|PdomMRNAr1.2-10981.1
-
format_seq
(outstream=None, linewidth=70)[source]¶ Print a sequence in a readable format.
Parameters: - outstream – if None, formatted sequence is returned as a string; otherwise, it is treated as a file-like object and the formatted sequence is printed to the outstream
- linewidth – width for wrapping sequences over multiple lines; set to 0 for no wrapping
-
Feature¶
-
class
tag.feature.
Feature
(seqid, ftype, start, end, source='tag', score=None, strand=None, phase=None, attrstr=None)[source]¶ Represents a feature entry from a GFF3 file.
>>> gene = Feature('contig1', 'gene', 999, 7500, source='snap', strand='+', ... attrstr='ID=gene1') >>> gene.seqid 'contig1' >>> gene.source 'snap' >>> gene.type 'gene' >>> gene.range [999, 7500) >>> gene.score is None True >>> gene.strand '+' >>> gene.phase is None True >>> gene.get_attribute('ID') 'gene1' >>> gene.get_attribute('NotARegisteredAttribute') >>> gene.slug 'gene@contig1[1000, 7500]'
-
add_attribute
(attrkey, attrvalue, append=False, oldvalue=None)[source]¶ Add an attribute to this feature.
Feature attributes are stored as nested dictionaries.
Each feature can only have one ID, so ID attribute mapping is ‘string’ to ‘string’. All other attributes can have multiple values, so mapping is ‘string’ to ‘dict of strings’.
By default, adding an attribute that already exists will cause the old value to be overwritten. If the append option is true, the new attribute value will not overwrite the old value, but will be appended as a second value. (Note: ID attributes can have only 1 value.)
If the oldvalue option is set, the new value will replace the old value. This is necessary for updating an attribute that has multiple values without completely overwriting all old values. (Note: The append option is ignored when oldvalue is set.)
-
add_sibling
(sibling)[source]¶ Designate this a multi-feature representative and add a co-feature.
Some features exist discontinuously on the sequence, and therefore cannot be declared with a single GFF3 entry (which can encode only a single interval). The canonical encoding for these types of features is called a multi-feature, in which a single feature is declared on multiple lines with multiple entries all sharing the same feature type and ID attribute. This is commonly done with coding sequence (CDS) features.
In this package, each multi-feature has a single “representative” feature object, and all other objects/entries associated with that multi-feature are attached to it as “siblings”.
Invoking this method will designate the calling feature as the multi-feature representative and add the argument as a sibling.
-
attribute_crawl
(key)[source]¶ Grab all attribute values associated with the given feature.
Traverse the given feature (and all of its descendants) to find all values associated with the given attribute key.
>>> reader = tag.GFF3Reader( ... tag.tests.data_stream('otau-no-seqreg.gff3') ... ) >>> features = tag.select.features(reader) >>> for feature in features: ... names = feature.attribute_crawl('Name') ... print(sorted(list(names))) ['Ot01g00060', 'XM_003074019.1', 'XP_003074065.1'] ['Ot01g00070', 'XM_003074020.1', 'XP_003074066.1'] ['Ot01g00080', 'XM_003074021.1', 'XP_003074067.1'] ['Ot01g00090', 'XM_003074022.1', 'XP_003074068.1'] ['Ot01g00100', 'XM_003074023.1', 'XP_003074069.1'] ['Ot01g00110', 'XM_003074024.1', 'XP_003074070.1']
-
cdslen
¶ Translated length of this feature.
Undefined for non-mRNA features.
-
get_attribute
(attrkey, as_string=False, as_list=False)[source]¶ Get the value of an attribute.
By default, returns a string for ID and attributes with a single value, and a list of strings for attributes with multiple values. The as_string and as_list options can be used to force the function to return values as a string (comma-separated in case of multiple values) or a list.
-
ncbi_geneid
¶ Retrieve this feature’s NCBI GeneID if it’s present.
NCBI GFF3 files contain gene IDs encoded in Dbxref attributes (example: Dbxref=GeneID:103504972). This function locates and returns the GeneID if present, or returns None otherwise.
-
parse_attributes
(attrstring)[source]¶ Parse an attribute string.
Given a string with semicolon-separated key-value pairs, populate a dictionary with the given attributes.
-
pseudoify
()[source]¶ Derive a pseudo-feature parent from the given multi-feature.
The provided multi-feature does not need to be the representative. The newly created pseudo-feature has the same seqid as the provided multi- feature, and spans its entire range. Otherwise, the pseudo-feature is empty. It is used only for convenience in sorting.
-
slug
¶ A concise slug for this feature.
Unlike the internal representation, which is 0-based half-open, the slug is a 1-based closed interval (a la GFF3).
-
Index¶
-
class
tag.index.
Index
[source]¶ In-memory index for efficient interval-based queries for genome features.
Implemented as a dictionary, with sequence IDs as keys to interval trees of features for the corresponding scaffold or contig sequence.
>>> index = tag.index.Index() >>> index.consume_file(tag.tests.data_file('pcan-123.gff3.gz')) >>> list(index.seqids) ['scaffold_123', 'scaffold_124', 'scaffold_125'] >>> index.extent('scaffold_124') (6020, 444332) >>> for feature in index.query('scaffold_123', 5000, 6000): ... print(feature.slug) gene@scaffold_123[5583, 5894] >>> for feature in index.query('scaffold_125', 19000, 87000, strict=False): ... print(feature.slug) gene@scaffold_125[18994, 19335] gene@scaffold_125[57450, 57680] gene@scaffold_125[86995, 87151] >>> reader = tag.reader.GFF3Reader( ... tag.tests.data_stream('grape-cpgat.gff3') ... ) >>> index = tag.index.Index() >>> index.consume(reader) >>> index.yield_inferred = False >>> for entry in index: ... print(entry.slug) sequence chr8[1, 100000] gene@chr8[72, 5081] gene@chr8[10538, 11678] gene@chr8[22053, 23448]
-
consume
(entrystream)[source]¶ Load a stream of entries into memory.
Only Feature objects and sequence-region directives are loaded, all other entries are discarded.
-
query
(seqid, start, end=None, strict=True)[source]¶ Query the index for features in the specified range.
Parameters: - seqid – ID of the sequence to query
- start – start of the query interval
- end – end of the query interval
- strict – indicates whether query is strict containment or overlap
(
True
andFalse
, respectively)
-
-
class
tag.index.
NamedIndex
[source]¶ In-memory index for retrieving genome features by identifier.
>>> index = tag.index.NamedIndex() >>> index.consume_file(tag.tests.data_file('pcan-123.gff3.gz')) >>> gene = index['PCAN011a001813'] >>> print(gene.slug) gene@scaffold_123[200029, 201298] >>> exon = index['PCAN011a001859T1.exon1'] >>> print(exon.slug) exon@scaffold_125[57450, 57680]
Readers¶
Currently the readers
module contains only a single class, GFF3Reader,
but may include others in the future.
-
class
tag.reader.
GFF3Reader
(instream=None, infilename=None, assumesorted=False, strict=True, checkorder=True)[source]¶ Loads sequence features and other GFF3 entries into memory.
Features and other objects are obtained by iterating over the GFF3Reader object. See the
tag.select
module for available filtering functions.>>> infile = tag.tests.data_file('pbar-withseq.gff3') >>> reader = GFF3Reader(infilename=infile) >>> for entry in reader: ... print(type(entry)) <class 'tag.directive.Directive'> <class 'tag.directive.Directive'> <class 'tag.directive.Directive'> <class 'tag.comment.Comment'> <class 'tag.feature.Feature'> <class 'tag.feature.Feature'> <class 'tag.feature.Feature'> <class 'tag.sequence.Sequence'> <class 'tag.sequence.Sequence'> >>> reader = GFF3Reader(infilename=infile) >>> for feature in tag.select.features(reader, type='gene'): ... print(feature.slug) gene@NW_011929623.1[4557, 5749] gene@NW_011929624.1[3725, 4229]
By default, the GFF3Reader assumes the input is not in sorted order and loads the entire annotation into memory to ensure sorting and resolution of ID/Parent relationships. If the input is sorted according to genomic position and independent features are separated by
###
directives, memory consumption will be drastically reduced by setting theassumesorted
attribute to True.The
strict
attribute enforces some additional sanity checks, which in some exceptional cases may need to be relaxed.
-
tag.reader.
parse_fasta
(data)[source]¶ Load sequences in Fasta format.
This generator function yields a Sequence object for each sequence record in a GFF3 file. Implementation stolen shamelessly from http://stackoverflow.com/a/7655072/459780.
Writers¶
Currently the writers
module contains only a single class, GFF3Writer,
but may include others in the future.
-
class
tag.writer.
GFF3Writer
(instream, outfile='-')[source]¶ Writes sequence features and other GFF3 entries to a file.
The
instream
is expected to be an iterable of sequence features and other related objects. Setoutfile
to-
to write output to stdout.>>> # Sort and tidy GFF3 file in 3 lines! >>> reader = GFF3Reader(infilename=tag.tests.data_file('grape-cpgat.gff3')) >>> writer = GFF3Writer(instream=reader, outfile='/dev/null') >>> writer.retainids = True >>> writer.write()
Transcript¶
-
tag.transcript.
primary_mrna
(entrystream, parenttype='gene')[source]¶ Select a single mRNA as a representative for each protein-coding gene.
The primary mRNA is the one with the longest translation product. In cases where multiple isoforms have the same translated length, the feature ID is used for sorting.
This function does not return only mRNA features, it returns all GFF3 entry types (pragmas, features, sequences, etc). The function does modify the gene features that pass through to ensure that they have at most a single mRNA feature.
>>> reader = tag.GFF3Reader(tag.tests.data_stream('pdom-withseq.gff3')) >>> filter = tag.transcript.primary_mrna(reader) >>> for gene in tag.select.features(filter, type='gene'): ... assert gene.num_children == 1
-
tag.transcript.
primary_transcript
(entrystream, parenttype='gene', logstream=<_io.TextIOWrapper name='<stderr>' mode='w' encoding='UTF-8'>)[source]¶ Select a single transcript as a representative for each gene.
This function is a generalization of the primary_mrna function that attempts, under certain conditions, to select a single transcript as a representative for each gene. If a gene encodes multiple transcript types, one of those types must be mRNA or the function will complain loudly and fail.
For mRNAs, the primary transcript is selected according to translated length. For all other transcript types, the length of the transcript feature itself is used. I’d be eager to hear suggestions for alternative selection criteria.
Like the primary_mrna function, this function does not return only transcript features. It does modify gene features to ensure that each has at most one transcript feature.
>>> reader = tag.GFF3Reader( ... tag.tests.data_stream('psyllid-mixed-gene.gff3.gz') ... ) >>> gene_filter = tag.select.features(reader, type='gene') >>> trans_filter = tag.transcript.primary_transcript(gene_filter) >>> for gene in trans_filter: ... assert gene.num_children == 1
In cases where the direct children of a gene feature have heterogenous types, the primary_mrna function will only discard mRNA features. This function, however, will discard all direct children of the gene that are not the primary transcript, including non-transcript children. This is a retty subtle distinction, and anecdotal experience suggests that cases in which the distinction actually matters are extremely rare.
Selectors¶
-
tag.select.
directives
(entrystream, type=None)[source]¶ Pull directives out of the specified entry stream.
Parameters: - entrystream – a stream of entries
- type – retrieve only directives of the specified type; set to
None
to retrieve all directives
-
tag.select.
entry_type_filter
(entrystream, entryclass)[source]¶ Generic entry filter.
Parameters: - entrystream – a stream of entries
- entryclass – specify the type of entry upon which to filter
-
tag.select.
features
(entrystream, type=None, traverse=False)[source]¶ Pull features out of the specified entry stream.
Parameters: - entrystream – a stream of entries
- type – retrieve only features of the specified type; set to
None
to retrieve all features - traverse – by default, only top-level features are selected; set
to
True
to search each feature graph for the specified feature type
-
tag.select.
window
(featurestream, seqid, start=None, end=None, strict=True)[source]¶ Pull features out of the designated genomic interval.
This function uses 0-based half-open intervals, not the 1-based closed intervals used by GFF3.
Parameters: - featurestream – a stream of feature entries
- seqid – ID of the sequence from which to select features
- start – start of the genomic interval
- end – end of the genomic interval
- strict – when set to
True
, only features completely contained within the interval are selected; when set toFalse
, any feature overlapping the interval is selected
Locus partitioning¶
-
tag.locus.
loci
(*sorted_streams, **kwargs)[source]¶ Determine feature loci from two or more sorted annotation streams.
Rather than simply relying on gene coordinates from a reference annotation, this function will determine coordinates for features of interest by considering an arbitrary number of annotation sources, whether they be references or predictions. This enables efficient and accurate assessment of both sensitivity and specificity at the level of individual nucleotides and entire features.
Bacterial annotation evaluation¶
-
tag.bae.
collapse_locus
(features)[source]¶ Collapse redundant annotation information.
Given a locus, collapse any gene predictions with identical coordinates into a single prediction, noting the annotation sources that support the collapsed gene model.
-
tag.bae.
collapse_stream
(locusstream)[source]¶ Feature stream for collapsing bacterial annotation data.
-
tag.bae.
eval_locus
(features)[source]¶ Evaluate congruence between gene predictions from different sources.
Gene predictions are assumed to be microbial protein-coding genes, marked as type CDS. Any reference protein alignments can be included as features of type translated_nucleotide_match. Compute the agreement between different sources of start/stop codon position, entire ORF position, and coverage from reference proteins.
Acknowledgements¶
The initial development of the tag package was influenced heavily by GenomeTools: the software, the paper, and the developers. Aside from being an extremely well engineered software library and fostering a welcoming development community, GenomeTools is exceptional in two key respects: 1) a focus on streaming processing; and 2) a focus on explicit grouping of related genome features into graphs, rather than entry-by-entry processing requiring the user to resolve relationships between features. For all of these reasons I use the GenomeTools library extensively in my research: I use the command-line interface, I’ve written programs (and entire libraries) that use and extend the C API, and I’ve contributed to the core library itself.
However, it’s no secret that development in C is no walk in the park. I’ve spent more hours troubleshooting memory management and chasing memory leaks than I ever care to again. Of course, the performance of bare-metal ANSI C is nigh unmatchable, but as a research scientist my ability to implement and evaluate prototypes quickly saves me much more time in the long run than a constant-factor speedup in my research code’s performance. On the other hand, consider maintenance burden: the initial release of tag weighs in at < 1000 lines of code (excluding ≈ 1000 lines of testing code), which could easily require 10000s or 100000s of lines of code to implement in C. Finally, implementing tag in Python makes for a seamless integration with an ever growing ecosystem of robust data science tools.
So I’d like to acknowledge the GenomeTools community for blazing the trail and (especially Gordon Gremme and Sascha Steinbiss) for their tireless support. I’d also like to thank the Python community for their support of a wide variety of tools that made implementing, testing, documenting, and distributing the tag package a pleasure.
What problem does tag solve?¶
Computational biology is 90% text formatting and ID cross-referencing!– discouraged graduate students everywhere
Most GFF parsers will load data into memory for you–the trivial bit–but will not group related features for you–the useful bit. tag represents related features as a feature graph (a directed acyclic graph) which can be easily traversed and inspected.
# Compute number of exons per gene
import tag
reader = tag.GFF3Reader(infilename='/data/genomes/mybug.gff3.gz')
for gene in tag.select.features(reader, type='gene'):
exons = [feat for feat in gene if feat.type == exon]
print('num exons:', len(exons))
See the primer on annotation formats for more information.
Summary¶
The tag library is built around the following features:
- parsers and writers for reading and printing annotation data in GFF3 format (with intelligent gzip support)
- data structures for convenient handling of various types of GFF3 entries: annotated sequence features, directives and other metadata, embedded sequences, and comments
- generator functions for a variety of common and useful annotation processing tasks, which can be easily composed to create streaming pipelines
- a unified command-line interface for executing common processing workflows
- a stable, documented Python API for interactive data analysis and building custom workflows
Development¶
Development of the tag library is currently a one-man show, but I would heartily welcome contributions. The development repository is at https://github.com/standage/tag. Please feel free to submit comments, questions, support requests to the GitHub issue tracker, or (even better) a pull request!