Discussion

Why Annot?

Our overarching goal was to create a database to support the collection and access of controlled, structured experimental metadata to meet the needs of both computational and experimental scientists.

The common solution in biological research labs is to use spreadsheets. While these are flexible and easily edited, they are prone to errors from manual entry, inadvertent auto-formatting, and version drift. Annot offers a robust way to annotate samples, reagents, and experimental details with controlled vocabulary for established assays where multiple staff are involved. While Annot was written with an end user without informatics training in mind, full system administration requires basic skills in Linux, Python 3, and Django, as well as basic knowledge of relational databases. Because of the cost required to populate Annot with detailed sample and reagent annotation, it is most appropriate for large-scale, high-throughput experiments.

However, a major benefit of our approach is that data generated in different experimental settings can be integrated through a detailed description of each experimental condition along the dimensions of sample, perturbation, and endpoint. Moreover, the high cost of large-scale screening efforts warrants the time and effort required to adequately annotate the resulting data. Ultimately, approaches such as this will allow data to be better leveraged for discoveries and biological insights.

About Controlled Vocabulary

Annot enforces controlled vocabulary for every sample and reagent. This means every term used can be traced back to controlled vocabulary from a specific ontology. Annot does this by creating an “annot id”: an ontology term and an ontology identifier joined by an underscore (_). For example, bovine insulin from UniProt becomes the annot id INS_P01317. This approach helps limit variability in the nomenclature.

Annot ids are further restricted to alphanumeric characters and the underscore. The official ontology identifier is the part after the last underscore.
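The structure described above can be sketched in Python. This is only an illustration, not Annot's own code: the helper name split_annot_id and the regular expression are assumptions derived from the two rules stated here (alphanumeric characters and underscores only, ontology identifier after the last underscore).

```python
import re

# Assumed pattern: only alphanumeric characters and underscores are allowed.
ANNOT_ID_PATTERN = re.compile(r"[A-Za-z0-9_]+")


def split_annot_id(annot_id):
    """Split an annot id into (term, ontology_identifier).

    Hypothetical helper: the ontology identifier is taken to be
    everything after the last underscore.
    """
    if not ANNOT_ID_PATTERN.fullmatch(annot_id):
        raise ValueError(f"invalid characters in annot id: {annot_id!r}")
    term, sep, identifier = annot_id.rpartition("_")
    if not sep or not term or not identifier:
        raise ValueError(f"annot id needs a term and an identifier: {annot_id!r}")
    return term, identifier


print(split_annot_id("INS_P01317"))  # ('INS', 'P01317')
```

Splitting at the last underscore, rather than the first, keeps terms that themselves contain underscores intact.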

If needed, the term part of the annot id can be changed to a term everyone in the lab is familiar with. The ontology identifier, on the other hand, should not be modified. For example, we changed the official term hyaluronic_acid_chebi16336 into HA_chebi16336. Such changes should always be made before the term is used for sample or reagent annotation.
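The renaming rule can be illustrated with a short Python sketch. The function relabel_annot_id is hypothetical and not part of Annot; it only shows that the term part changes while the ontology identifier after the last underscore is carried over untouched.

```python
import re


def relabel_annot_id(annot_id, new_term):
    """Return an annot id with a new term part and the original ontology identifier.

    Hypothetical helper, not part of Annot.
    """
    _, sep, identifier = annot_id.rpartition("_")
    if not sep or not identifier:
        raise ValueError(f"no ontology identifier found in {annot_id!r}")
    # The new term must respect the same character restriction as any annot id.
    if not re.fullmatch(r"[A-Za-z0-9_]+", new_term):
        raise ValueError(f"invalid characters in term: {new_term!r}")
    return f"{new_term}_{identifier}"


print(relabel_annot_id("hyaluronic_acid_chebi16336", "HA"))  # HA_chebi16336
```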

If needed terms are missing from a particular ontology, they can be borrowed from another ontology and added by clicking the Add button in the particular ontology (orange colored in the GUI).

We advocate the use of controlled vocabulary from existing, well-established ontologies whenever possible. However, some terms do not exist in established ontologies, for example all vocabulary in apponprovider_own. All such terms have “Own” as their term id, so they are easy to detect. For example: Boots_Own for the Boots pharmacy.

Should an ontology id ever become deprecated, the term will not be deleted. Instead, two fields will automatically be marked with a red cross: the ok_brick field in the Uploaded_endpoint_reagent_bricks, Uploaded_perturbation_reagent_bricks, or Uploaded_sample_bricks table in Appsabbrick (bright orange colored in the GUI), and the ontology_term_status field in the corresponding ontology (maroon colored in the GUI).

In Annot, the origin version of a controlled vocabulary contains the latest information pulled from the original source. Only the backup version stores adapted annot ids, our own added ontology terms, and deprecated ontology identifiers. Origin version and backup files can be found inside Annot at /usr/src/media/vocabulary/.

Further reading:

  • HowTo handle controlled vocabulary?
  • HowTo deal with huge ontologies?
  • HowTo get detailed information about the ontologies in use?
  • HowTo backup annot content?

About Proteins, Protein Isoforms, and Protein Complexes

Protein ids are a bit of a special case because some proteins have multiple known isoforms. For such cases we introduced an additional hierarchical separation character, the pipe symbol (|). For example, the canonical human insulin isoform is INS|1_P01308|1. Please note that in UniProt, the isoform identifier is officially separated from the protein identifier by a dash (e.g. P01308-1). Annot already uses the dash to separate the primary key parts in the annot brick identifiers, which is why we adopted the pipe here.
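As an illustration of the isoform convention, the following Python sketch converts a UniProt isoform accession into the pipe-separated form. The function isoform_annot_id is hypothetical; that the isoform number is appended to both the term and the accession is inferred from the single example INS|1_P01308|1, and the handling of accessions without a dash is likewise an assumption.

```python
def isoform_annot_id(term, uniprot_accession):
    """Build an annot id from a term and a UniProt accession such as 'P01308-1'.

    Hypothetical helper: the layout is inferred from the example
    INS|1_P01308|1 and is not taken from Annot's source code.
    """
    accession, sep, isoform = uniprot_accession.partition("-")
    if not sep:
        # Assumption: without an isoform suffix, fall back to the plain form.
        return f"{term}_{accession}"
    # The dash is reserved for brick identifiers, so the pipe is used instead.
    return f"{term}|{isoform}_{accession}|{isoform}"


print(isoform_annot_id("INS", "P01308-1"))  # INS|1_P01308|1
```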

If we could not identify an exact isoform, we always chose the canonical isoform as defined by UniProt. The boolean field isoform_explicit in the protein brick identifies which proteins have known isoforms and which simply use the canonical form.

Because UniProt doesn’t cover protein complexes (e.g. COL1, ITGA2B1, or Laminin3B32), we used Gene Ontology cellular component identifiers, which resulted in annot ids like COL1_go0005584, ITGA2B1_go0034666, and Laminin3B32_go0061801.
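The mapping from an official ontology identifier to the identifier part of an annot id might be sketched as below. The function complex_annot_id is hypothetical; the normalization (drop non-alphanumeric characters such as the colon in GO:0005584, then lowercase) is an assumption inferred from examples such as go0005584 and chebi16336, not a documented Annot rule.

```python
import re


def complex_annot_id(term, ontology_identifier):
    """Combine a term with a prefixed ontology identifier such as 'GO:0005584'.

    Hypothetical helper; the normalization is inferred from the examples
    in this document, not taken from Annot's source code.
    """
    # Remove every character that is not allowed in an annot id, then lowercase.
    identifier = re.sub(r"[^A-Za-z0-9]", "", ontology_identifier).lower()
    return f"{term}_{identifier}"


print(complex_annot_id("COL1", "GO:0005584"))  # COL1_go0005584
```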

The unique annot id naming conventions make it very easy to spot key details about a protein. Not all details are in the name, however. For example, the species the protein comes from (e.g. Human_9606 or Cow_9913) is not part of the id, but it is annotated in the bricks in the Protein_DNA_code_source field.

Further reading:

  • HowTo annotate protein complexes?

About not_yet_specified and not_available

Annot sets every empty field to not_yet_specified, regardless of whether information was not specified or the information was simply not available. This avoids the common problem of empty fields and confusion about how to handle missing data.

A sample or reagent brick that has a not_yet_specified field in the primary key block will in general not be uploadable. If, however, primary key fields are marked as not_available, then the brick can be uploaded. For example, if we do not have information about the provider, catalog number, or lot number of the reagent DMSO, then we would have the following descriptor: DMSO_chebi28262-notavailable_notavailable_notavailable.
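The difference between the two placeholder values can be sketched in Python. The function names and the strict error handling are assumptions for illustration; so is the rule that underscores are dropped inside a field value, which is inferred only from the notavailable spelling in the example descriptor.

```python
NOT_YET_SPECIFIED = "not_yet_specified"
NOT_AVAILABLE = "not_available"


def fill_empty(value):
    """Mimic the rule that every empty field becomes not_yet_specified."""
    return value if value else NOT_YET_SPECIFIED


def reagent_descriptor(annot_id, provider, catalog, lot):
    """Build a reagent brick descriptor from its primary key fields.

    Hypothetical helper: it refuses any not_yet_specified primary key
    field, which is stricter than the "in general" wording above.
    """
    keys = [fill_empty(value) for value in (provider, catalog, lot)]
    if NOT_YET_SPECIFIED in keys:
        raise ValueError("primary key field is not_yet_specified")
    # Assumption: underscores inside a value are dropped, because the
    # underscore already separates the primary key fields.
    suffix = "_".join(key.replace("_", "") for key in keys)
    return f"{annot_id}-{suffix}"


print(reagent_descriptor("DMSO_chebi28262", NOT_AVAILABLE, NOT_AVAILABLE, NOT_AVAILABLE))
# DMSO_chebi28262-notavailable_notavailable_notavailable
```

Leaving a primary key field empty, for example by passing None as the lot number, raises a ValueError in this sketch, whereas an explicit not_available is accepted.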

Programmer Contribution

  • Elmar Bucher: main programmer.
  • Cheryl Claunch: co-programmer on version 4.
  • Derrick He: cron job backup routine implementation.
  • Dave Kilburn and Laura Heiser: manual proofreading.

Contact Information

Contact Elmar Bucher at https://gitlab.com/biotransistor/annot or send an email to buchere at ohsu dot edu.