<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2005-6-5-r44</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Method</dochead>
      <bibl>
         <title>
            <p>The Sequence Ontology: a tool for the unification of genome annotations</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Eilbeck</snm>
               <fnm>Karen</fnm>
               <insr iid="I1"/>
               <email>keilbeck@fruitfly.org</email>
            </au>
            <au id="A2">
               <snm>Lewis</snm>
               <mi>E</mi>
               <fnm>Suzanna</fnm>
               <insr iid="I1"/>
               <email>suzi@fruitfly.org</email>
            </au>
            <au id="A3">
               <snm>Mungall</snm>
               <mi>J</mi>
               <fnm>Christopher</fnm>
               <insr iid="I2"/>
               <email>cjm@fruitfly.org</email>
            </au>
            <au id="A4">
               <snm>Yandell</snm>
               <fnm>Mark</fnm>
               <insr iid="I2"/>
               <email>myandell@fruitfly.org</email>
            </au>
            <au id="A5">
               <snm>Stein</snm>
               <fnm>Lincoln</fnm>
               <insr iid="I3"/>
               <email>lstein@cshl.org</email>
            </au>
            <au id="A6">
               <snm>Durbin</snm>
               <fnm>Richard</fnm>
               <insr iid="I4"/>
               <email>rd@sanger.ac.uk</email>
            </au>
            <au id="A7" ca="yes">
               <snm>Ashburner</snm>
               <fnm>Michael</fnm>
               <insr iid="I5"/>
               <email>ma11@gen.cam.ac.uk</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Department of Molecular and Cellular Biology, Life Sciences Addition, University of California, Berkeley, CA 94729-3200, USA</p>
            </ins>
            <ins id="I2">
               <p>Howard Hughes Medical Institute, Department of Molecular and Cellular Biology, Life Sciences Addition, University of California, Berkeley, CA 94729-3200, USA</p>
            </ins>
            <ins id="I3">
               <p>Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, New York 11724, USA</p>
            </ins>
            <ins id="I4">
               <p>Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK</p>
            </ins>
            <ins id="I5">
               <p>Department of Genetics, University of Cambridge, Downing Street, Cambridge, CB2 3EH, UK</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2005</pubdate>
         <volume>6</volume>
         <issue>5</issue>
         <fpage>R44</fpage>
         <url>http://genomebiology.com/2005/6/5/R44</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="pmpid">15892872</pubid>
               <pubid idtype="doi">10.1186/gb-2005-6-5-r44</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>4</day>
               <month>10</month>
               <year>2004</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>1</day>
               <month>2</month>
               <year>2005</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>30</day>
               <month>3</month>
               <year>2005</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>29</day>
               <month>4</month>
               <year>2005</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2005</year>
         <collab>Eilbeck et al.; licensee BioMed Central Ltd.</collab>
         <note>This is an Open Access article distributed under the terms of the Creative Commons Attribution License (<url>http://creativecommons.org/licenses/by/2.0</url>), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.</note>
      </cpyrt>
      <shorttitle>
         <p>The Sequence Ontology tool</p>
      </shorttitle>
      <shortabs>
         <p>The goal of the Sequence Ontology (SO) project is to produce a structured controlled vocabulary with a common set of terms and definitions for parts of a genomic annotation, and to describe the relationships among them. Details of SO construction, design and use, particularly with regard to part-whole relationships, are discussed, and the practical utility of SO is demonstrated for a set of genome annotations from <it>Drosophila melanogaster</it>.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>The Sequence Ontology (SO) is a structured controlled vocabulary for the parts of a genomic annotation. SO provides a common set of terms and definitions that will facilitate the exchange, analysis and management of genomic data. Because SO treats part-whole relationships rigorously, data described with it can become substrates for automated reasoning, and instances of sequence features described by the SO can be subjected to a group of logical operations termed extensional mereology operators.</p>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification id="ontology" subtype="theme_series_title" type="BMC">Ontologies</classification>
         <classification id="ontology" subtype="theme_series_editor" type="BMC"/>
         <classification type="BMC" subtype="man_spc_id" id="30010002">Bioinformatics</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010013">Methods</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010016">Molecular biology</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <sec>
            <st>
               <p>Why a sequence ontology is needed</p>
            </st>
            <p>Genomic annotations are the focal point of sequencing, bioinformatics analysis, and molecular biology. They are the means by which we attach what we know about a genome to its sequence. Unfortunately, biological terminology is notoriously ambiguous; the same word is often used to describe more than one thing and there are many dialects. For example, does a coding sequence (CDS) contain the stop codon or is the stop codon part of the 3'-untranslated region (3' UTR)? There really is no right or wrong answer to such questions, but consistency is crucial when attempting to compare annotations from different sources, or even when comparing annotations performed by the same group over an extended period of time.</p>
            <p>At present, GenBank <abbrgrp><abbr bid="B1">1</abbr></abbrgrp> houses 220 viral genomes, 152 bacterial genomes, 20 eukaryotic genomes and 18 archaeal genomes. Other centers such as The Institute for Genomic Research (TIGR) <abbrgrp><abbr bid="B2">2</abbr></abbrgrp> and the Joint Genome Institute (JGI) <abbrgrp><abbr bid="B3">3</abbr></abbrgrp> also maintain and distribute annotations, as do many model organism databases such as FlyBase <abbrgrp><abbr bid="B4">4</abbr></abbrgrp>, WormBase <abbrgrp><abbr bid="B5">5</abbr></abbrgrp>, The <it>Arabidopsis </it>Information Resource (TAIR) <abbrgrp><abbr bid="B6">6</abbr></abbrgrp> and the <it>Saccharomyces </it>Genome Database (SGD) <abbrgrp><abbr bid="B7">7</abbr></abbrgrp>. Each of these groups has its own databases, and many use their own data model to describe their annotations. There is no single place at which all sets of genome annotations can be found, and several sets are informally mirrored in multiple locations, leading to location-specific version differences. This can make it hazardous to exchange, combine and compare annotation data. Clearly, if genomic annotations were always described using the same language, then comparative analysis of the wealth of information distributed by these institutions would be enormously simplified; hence the Sequence Ontology (SO) project. SO began 2 years ago, when a group of scientists and developers from the model organism databases - FlyBase, WormBase, Ensembl, SGD and MGI - came together to collect and unify the terms they used in their sequence annotation.</p>
            <p>The goal of the SO is to provide a standardized set of terms and relationships with which to describe genomic annotations and provide the structure necessary for automated reasoning over their contents, thereby facilitating data exchange and comparative analyses of annotations. SO is a sister project to the Gene Ontology (GO) <abbrgrp><abbr bid="B8">8</abbr></abbrgrp> and is part of the Open Biomedical Ontologies (OBO) project <abbrgrp><abbr bid="B9">9</abbr></abbrgrp>. The scope of the SO project is the description of the features and properties of biological sequence. The features can be located in base coordinates, such as <it>gene </it>and <it>intron</it>, and the properties of these features describe an attribute of the feature; for example, a <it>gene </it>may be <it>maternally_imprinted</it>.</p>
         </sec>
         <sec>
            <st>
               <p>SO terminology and format</p>
            </st>
            <p>Like other ontologies, SO consists of a controlled vocabulary of terms or concepts and a restricted set of relationships between those terms. While the concepts and relationships of the sequence ontology make it possible to describe precisely the features of a genomic annotation, discussions of them can lead to much lexical confusion, as some of the terms used by SO are also common words; thus we begin our description of SO with a discussion of its naming conventions, and adhere to these rules throughout this document.</p>
            <p>Wherever possible, the terms used by SO to describe the parts of an annotation are those commonly used in the genomics community. In some cases, however, we have altered these terms in order to render them more computer-friendly so that users can create software classes and variables named after them. Thus, term names do not include spaces; instead, underscores are used to separate the words in phrases. Numbers are spelled out in full, for example <it>five_prime_UTR</it>, except in cases where the number is part of the accepted name. If the commonly used name begins with a number, such as 28S RNA, the stem is moved to the front - for example, <it>RNA_28S</it>. Symbols are spelled out in full where appropriate, for example, <it>prime</it>, <it>plus</it>, <it>minus</it>; as are Greek letters. Periods, points, slashes, hyphens, and brackets are not allowed. If there is a common abbreviation it is used as the term name, and case is always lower except when the term is an acronym, for example, <it>UTR </it>and <it>CDS</it>. Where there are differences in the accepted spelling between English and US usage, the US form is used.</p>
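            <p>The naming conventions described above can be sketched as a simple check. The following Python fragment is a minimal illustration, not part of SO or its tooling: the function names and the regular expression are assumptions made for this sketch, and the pattern covers only the rules on spaces, punctuation and leading numbers, not the rules on case or on spelling out symbols.</p>
            <p><![CDATA[
```python
import re

# Assumed rendering of the conventions described above: a letter first,
# then only letters, digits and underscores (no spaces or punctuation).
SO_NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def follows_so_naming(name: str) -> bool:
    """Return True if `name` matches the assumed SO-style pattern."""
    return SO_NAME_PATTERN.fullmatch(name) is not None


def stem_first(name: str) -> str:
    """Move a leading numeric word behind the stem and join with underscores."""
    words = name.split()
    if len(words) > 1 and words[0][0].isdigit():
        words = words[1:] + words[:1]
    return "_".join(words)
```
]]></p>
            <p>Under these assumptions, <it>five_prime_UTR </it>and <it>RNA_28S </it>pass the check, whereas a common name such as 28S RNA does not until its stem has been moved to the front.</p>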
            <p>Synonyms are used to record the variant term names that have the same meaning as the term. They are used to facilitate searching of the ontology. There is no limit to the number of synonyms a term can have, nor do they adhere to SO naming conventions. They are, however, still lowercase except when they are acronyms.</p>
            <p>Throughout the remainder of this document, the terms from SO are highlighted in italics and the names of relationships between the terms are shown in bold. The terms are always depicted exactly as they appear in the ontology. The names of EM operators are underlined.</p>
         </sec>
         <sec>
            <st>
               <p>SO, SOFA, and the feature table</p>
            </st>
            <p>To facilitate the use of SO for the markup of gene annotation data, a subset of terms from SO consisting of some of those terms that can be located onto sequence has been selected; this condensed version of SO is especially well suited for labeling the outputs of automated or semi-automated sequence annotation pipelines. This subset is known as the Sequence Ontology Feature Annotation, or SOFA.</p>
            <p>SO, like GO, is an 'open source' ontology. New terms, definitions, and their location within the ontology are proposed, debated, and approved or rejected by an open group of individuals via a mailing list. SO is maintained in OBO format and the current version can be downloaded from the CVS repository of the SO website <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>. For development purposes, SOFA was stabilized and released (in May 2004) for at least 12 months to allow development of software and formats. SO is a directed acyclic graph (DAG), and can be viewed using the editor for OBO files, OBO-Edit <abbrgrp><abbr bid="B11">11</abbr></abbrgrp>.</p>
            <p>The terms describing sequence features in SO and SOFA are richer than those of the Feature Table <abbrgrp><abbr bid="B12">12</abbr></abbrgrp> of the three large genome databanks: GenBank <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>, EMBL <abbrgrp><abbr bid="B13">13</abbr></abbrgrp> and the DNA Data Bank of Japan (DDBJ) <abbrgrp><abbr bid="B14">14</abbr></abbrgrp>. The Feature Table is a controlled vocabulary of terms describing sequence features and is used to describe the annotations distributed by these data banks. The Feature Table does provide a grouping of its terms for annotation purposes, based on the degree of specificity of the term. The relationships between the terms are not formalized; thus the interpretation of these relationships is left to the user to infer, and, more critically, must be hard-coded into software applications. Most of the terms in the Feature Table map directly to terms in SO, although the term names may have been changed to fit SO naming conventions. In general, SO contains a more extensive set of features for detailed annotation. There are currently 171 locatable sequence features in SOFA compared to 65 of the Feature Table. There are 11 terms in the Feature Table that are not included in SO. These terms fall into two categories: remarks and immunological features, both of which have been handled slightly differently in SO. A mapping between SO and the Feature Table is available from the SO website <abbrgrp><abbr bid="B10">10</abbr></abbrgrp>.</p>
         </sec>
         <sec>
            <st>
               <p>Database schemas, file formats and SO</p>
            </st>
            <p>SO is not a database schema, nor is it a file format; it is an ontology. As such, SO transcends any particular database schema or file format. This means it can be used equally well as an external data-exchange format or internally as an integral component of a database.</p>
            <p>The simplest way to use SO is to label data destined for redistribution with SO terms and to make sure that the data adhere to the SO definition of the data type. Accordingly, SO provides a human-readable definition for each term that concisely states its biological meaning. Usually the definitions are drawn from standard authoritative sources such as <it>The Molecular Biology of the Cell </it><abbrgrp><abbr bid="B15">15</abbr></abbrgrp>, and each definition contains a reference to its source. Defining each term in such a way is important as it aids communication and minimizes confusion and disputes as to just what data should consist of. For example, the term <it>CDS </it>is defined as <it>a contiguous RNA sequence which begins with, and includes, a start codon and ends with, and includes, a stop codon</it>. According to SO, the sequence of a <it>three_prime_UTR </it>does not contain the <it>stop_codon </it>- and files with such sequences are SO-compliant; files of <it>three_prime_UTR </it>containing <it>stop_codons </it>are not. This is a trivial example, illustrating one of the simplest use cases, but it does demonstrate the power of SO to put an end to needless negotiations between parties as to the details of a data exchange. This aspect of SO is especially well suited for use with the generic feature format (GFF) <abbrgrp><abbr bid="B16">16</abbr></abbrgrp>. Indeed, the latest version, GFF3, uses SO terms and definitions to standardize the feature type described in each row of a file and SO terms as optional attributes to a feature.</p>
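            <p>As a minimal sketch of how the feature type in a GFF3 row might be checked against SO term names, consider the following Python fragment. It assumes the standard nine tab-separated columns of a GFF3 feature row, with the type in the third column. The set of names is a small illustrative subset drawn from terms mentioned in this article, and the two example rows (sequence identifier, source and coordinates) are hypothetical.</p>
            <p><![CDATA[
```python
# Illustrative subset of SO term names mentioned in this article; a real
# check would load the full term list from the ontology file.
KNOWN_TYPES = {"gene", "exon", "CDS", "intron", "mRNA", "five_prime_UTR"}


def gff3_type(line: str) -> str:
    """Return the feature type (third column) of one GFF3 feature row."""
    columns = line.rstrip("\n").split("\t")
    if len(columns) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    return columns[2]


def has_known_type(line: str) -> bool:
    """True if the row's type is one of the illustrative term names."""
    return gff3_type(line) in KNOWN_TYPES


# Hypothetical rows: sequence ID, source and coordinates are made up.
good_row = "chr_example\tsketch\texon\t100\t200\t.\t+\t.\tID=exon_1"
bad_row = "chr_example\tsketch\t3'UTR\t201\t250\t.\t+\t.\tID=utr_1"
```
]]></p>
            <p>Here the first row would be accepted and the second rejected, because its type is a colloquial label rather than a term name. A check of this kind covers only the labels; whether the underlying sequence satisfies the SO definition of the term is a separate question.</p>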
            <p>SO can also be employed in a much more sophisticated manner within a database. CHADO <abbrgrp><abbr bid="B17">17</abbr></abbrgrp> is a modular relational database schema for integrating molecular and genetic data and is part of the Generic Model Organism Database project (GMOD) <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>, currently used by both FlyBase and TIGR. The CHADO relational schema is extremely flexible, and is centered on genomic features and their relationships, both of which are described using SO terms. This use of SO ensures that software that queries, populates and exports data from different CHADO databases is interoperable, and thus greatly facilitates large-scale comparisons of even very complex genomics data.</p>
            <p>Like GFF3, Chaos-XML <abbrgrp><abbr bid="B19">19</abbr></abbrgrp> is a file format that uses SO to label and structure data, but it is more intimately tied to the CHADO project than is GFF3. Chaos-XML is a hierarchical XML mapping of the CHADO relational schema. Annotations are represented as an ontology-typed feature graph. The central concept of Chaos-XML is the sequence-feature, which is any sequence entity typed by SO. The features are interconnected via feature relationship elements, whereby each relationship connects a subject feature and an object feature. Features are located via featureloc elements which use interbase (zero-based) coordinates. Chaos-XML and CHADO are richer models than GFF3 in that feature_relationships are typed, and a more sophisticated location model is used. Chaos-XML is the substrate of a suite of programs called Comparative Genomics Library (CGL), pronounced 'seagull' <abbrgrp><abbr bid="B20">20</abbr></abbrgrp>, which we have used for the analyses presented in our Results section.</p>
            <p>The basic types in SOFA, from which other types are defined, are <it>region </it>and <it>junction</it>, equivalent to the concepts of interiors and boundaries defined in the field of topological relationships <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>. A region is a length of sequence such as an <it>exon </it>or a <it>transposable_element</it>. A <it>junction </it>is the space between two bases, such as an <it>insertion_site</it>. Building on these basic data types, SOFA can be used to describe a wide range of sequence features. Raw sequence features such as assembly components are captured by terms like <it>contig </it>and <it>read</it>. Analysis features, defined by the results of sequence-analysis programs such as BLAST <abbrgrp><abbr bid="B22">22</abbr></abbrgrp> are captured by terms such as <it>nucleotide_match</it>. Gene models can be defined on the sequence using terms like <it>gene</it>, <it>exon </it>and <it>CDS</it>. Variation in sequence is captured by subtypes of the term <it>sequence_variant</it>. These terms have multiple parentages with either region or junction. SOFA (and SO) can also be used to describe many other sequence features, for example, <it>repeat</it>, <it>reagent</it>, <it>remark</it>. Thus, SOFA together with GFF3 or Chaos-XML provide an easy means by which parties can describe, standardize, and document the data they distribute and exchange.</p>
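            <p>The distinction between <it>region </it>and <it>junction </it>fits naturally with interbase coordinates, in which positions name the boundaries between bases. The Python sketch below is an illustration only: the class and function names are assumptions for this example, not part of Chaos-XML or CHADO. It treats a location as a pair of boundaries, so that a junction is simply a location of length zero, and it converts a 1-based inclusive range, such as GFF3 uses, by subtracting one from the start.</p>
            <p><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interbase:
    """A location given as boundaries between bases; 0 lies before the first base."""
    start: int
    end: int

    def __post_init__(self) -> None:
        if self.start < 0 or self.end < self.start:
            raise ValueError("require 0 <= start <= end")

    @property
    def length(self) -> int:
        """Number of bases spanned by the location."""
        return self.end - self.start

    @property
    def is_junction(self) -> bool:
        """A zero-length location sits between two bases."""
        return self.length == 0


def from_one_based(start: int, end: int) -> Interbase:
    """Convert a 1-based, inclusive range to interbase boundaries."""
    return Interbase(start - 1, end)
```
]]></p>
            <p>With this representation, a feature spanning bases 100 to 200 in 1-based coordinates becomes the interbase location (99, 200), and an <it>insertion_site </it>immediately after base 99 is the zero-length location (99, 99).</p>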
            <p>The SO and SOFA controlled vocabularies can be used for <it>de novo </it>annotation. Several groups including SGD and FlyBase now use either SO or SOFA terms in their annotation efforts. SO is not restricted to new annotations, however, and may be applied to existing annotations. For example, annotations from GenBank may be converted into SO-compliant formats using Bioperl <abbrgrp><abbr bid="B23">23</abbr></abbrgrp> (see Materials and methods).</p>
         </sec>
         <sec>
            <st>
               <p>SO relationships</p>
            </st>
            <p>One essential difference between a controlled vocabulary, such as the Feature Table, and an ontology is that an ontology is not merely a collection of predefined terms that are used to describe data. Ontologies also formally specify the relationships between their terms. Labeling data with terms from an ontology makes the data a substrate for software capable of logical inference. The information necessary for making logical inferences about data resides in the class designations of the relationships that unite terms within SO. We detail this aspect of the ontology below. For purposes of reference, a section of SO illustrating the various relationships between some of its terms is shown in Figure <figr fid="F1">1</figr>.</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>A section of the Sequence Ontology showing how terms and relationships are used together to describe knowledge about sequence</p>
               </caption>
               <text>
                  <p>A section of the Sequence Ontology showing how terms and relationships are used together to describe knowledge about sequence. The <b>kind_of </b>relationships are depicted using arrows labeled with 'i', the <b>part_of </b>relationships use arrows with 'P' and the <b>derives_from </b>relationships with 'd'. By tracing the arrows that connect the terms, different logical inferences can be made regarding what a term 'is' and what are its allowable parts. For example, an <it>exon </it>is a <b>part_of </b>a <it>transcript</it>, a <it>tRNA </it>is a <b>kind_of </b><it>ncRNA </it>which is a <b>kind_of </b><it>processed_transcript</it>.</p>
               </text>
               <graphic file="gb-2005-6-5-r44-1"/>
            </fig>
            <p>Currently, SO uses three basic kinds of relationship between its terms: <b>kind_of</b>, <b>derives_from</b>, and <b>part_of</b>. These relationships are defined in the OBO relationship types ontology <abbrgrp><abbr bid="B24">24</abbr></abbrgrp>. <b>kind_of </b>relationships specify what something 'is'. For example, an <it>mRNA </it>is a <b>kind_of </b><it>transcript</it>. Likewise an <it>enhancer </it>is a <b>kind_of </b><it>regulatory_region</it>. <b>kind_of </b>relationships are valid in only one direction. Hence, a <it>regulatory_region </it>is not a <b>kind_of </b><it>enhancer</it>. One consequence of the directional nature of <b>kind_of </b>relationships is that their transitivity is hierarchical - inferences as to what something 'is' proceed from the leaves towards the root of the ontology. For example, an <it>mRNA </it>is a <b>kind_of </b><it>processed_transcript </it>AND a <it>processed_transcript </it>is a <b>kind_of </b><it>transcript</it>. Thus, an <it>mRNA </it>is a <b>kind_of </b><it>transcript</it>. <b>kind_of </b>relationships are synonymous with <b>is_a </b>relationships. We adopted the '<b>kind_of</b>' notation to avoid the lexical confusion often encountered when describing relationships, as the phrase 'is a' is often used in conjunction with other relationships in English - for example 'is a part_of'.</p>
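            <p>The directional, transitive behavior of <b>kind_of </b>can be sketched in a few lines of Python. The table below contains only <b>kind_of </b>statements made in this article, and the function name is an assumption for this illustration rather than part of any SO software.</p>
            <p><![CDATA[
```python
# Child -> parents, using only kind_of statements made in this article.
KIND_OF = {
    "tRNA": ["ncRNA"],
    "ncRNA": ["processed_transcript"],
    "mRNA": ["processed_transcript"],
    "processed_transcript": ["transcript"],
    "enhancer": ["regulatory_region"],
}


def is_kind_of(term: str, ancestor: str) -> bool:
    """True if `ancestor` is reachable from `term` by following kind_of edges."""
    stack = list(KIND_OF.get(term, []))
    seen = set()
    while stack:
        current = stack.pop()
        if current == ancestor:
            return True
        if current not in seen:
            seen.add(current)
            stack.extend(KIND_OF.get(current, []))
    return False
```
]]></p>
            <p>Because the search only ever moves from child to parent, it infers that a <it>tRNA </it>is a <b>kind_of </b><it>transcript </it>but never that a <it>regulatory_region </it>is a <b>kind_of </b><it>enhancer</it>. In this sketch a term is not counted as a <b>kind_of </b>itself; that is a design choice of the example.</p>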
            <p>SO uses the term <b>derives_from </b>to denote relationships of process between two terms. For example, an <it>EST </it><b>derives_from </b>an <it>mRNA</it>. <b>derives_from </b>relationships imply an inverse relationship: <b>derives</b>. Note that although a <it>polypeptide </it><b>derives_from </b>an <it>mRNA</it>, a <it>polypeptide </it>cannot be derived from an <it>ncRNA </it>(non-coding RNA), because no <b>derives_from </b>relationship unites these two terms in the ontology. This fact illustrates another important aspect of how SO handles relationships: children always inherit from parents but never from siblings. An <it>ncRNA </it>is a <b>kind_of </b><it>transcript </it>as is an <it>mRNA</it>. Labeling something as a <it>transcript </it>implies that it could possibly produce a <it>polypeptide</it>; labeling that same entity with the more specific term <it>ncRNA </it>rules that possibility out. Thus, a file that contained ncRNAs and their polypeptides would be semantically invalid.</p>
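            <p>One possible reading of this rule can be sketched as follows. The Python fragment is an assumption-laden illustration, not the logic of any SO tool: it treats a derivation as possible when the labeled source is the declared source term, a more specific kind of it, or a more general term that could still turn out to be it. All function names are hypothetical.</p>
            <p><![CDATA[
```python
# kind_of and derives_from statements taken from this article.
KIND_OF = {
    "mRNA": ["processed_transcript"],
    "ncRNA": ["processed_transcript"],
    "processed_transcript": ["transcript"],
}
DERIVES_FROM = {"polypeptide": ["mRNA"], "EST": ["mRNA"]}


def lineage(term):
    """Return the term together with everything it is a kind_of."""
    found, stack = set(), [term]
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(KIND_OF.get(current, []))
    return found


def derivation_possible(subject, source):
    """Assumed rule: source and a declared source term lie on one kind_of path."""
    for target in DERIVES_FROM.get(subject, []):
        if target in lineage(source) or source in lineage(target):
            return True
    return False
```
]]></p>
            <p>Under this reading, a <it>polypeptide </it>may be recorded as derived from an <it>mRNA </it>or, less specifically, from a <it>transcript</it>, but not from an <it>ncRNA</it>, which is a sibling of <it>mRNA </it>rather than its ancestor or descendant.</p>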
            <p><b>part_of </b>relationships pertain to meronomies; that is to say 'part-whole' relationships. An <it>exon</it>, for example, is a <b>part_of </b>a <it>transcript</it>. <b>part_of </b>relationships are not valid in both directions. In other words, while an <it>exon </it>is a <b>part_of </b>a <it>transcript</it>, a <it>transcript </it>is not a <b>part_of </b>an <it>exon</it>. Instead, we say a <it>transcript </it><b>has_part </b><it>exon</it>. SO does not explicitly denote whole-part relationships, as every <b>part_of </b>relationship logically implies the inverse <b>has_part </b>relationship between the two terms.</p>
            <p>Transitivity is a more complicated issue with regard to part-whole relationships than it is for the other relationships in SO. In general, <b>part_of </b>relationships are transitive - an <it>exon </it>is a <b>part_of </b>a <it>gene</it>, because an <it>exon </it>is a <b>part_of </b>a <it>transcript</it>, and a <it>transcript </it>is a <b>part_of </b>a <it>gene</it>. Not every chain of part-whole relationships, however, obeys the principle of transitivity. This is because parts can be combined to make wholes according to different organizing principles. Winston <it>et al</it>. <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> have described six different subclasses of the part-whole relationship, based on the following three properties: <it>configuration</it>, whether the parts have a structural or functional role with respect to one another or the whole they form; <it>substance</it>, whether the part is made of the same stuff as the whole (homomerous or heteromerous); and <it>invariance</it>, whether the part can be separated from the whole. These six relations and their associated <b>part_of </b>subclasses are detailed in Table <tblr tid="T1">1</tblr>.</p>
            <tbl id="T1">
               <title>
                  <p>Table 1</p>
               </title>
               <caption>
                  <p>Six subclasses of part-whole relationships</p>
               </caption>
               <tblbdy cols="4">
                  <r>
                     <c ca="left">
                        <p>Part_of subtype</p>
                     </c>
                     <c ca="left">
                        <p>Whole</p>
                     </c>
                     <c ca="left">
                        <p>Properties of relationship</p>
                     </c>
                     <c ca="left">
                        <p>Example</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="4">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>component_part_of</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>integral object</p>
                     </c>
                     <c ca="left">
                        <p>Functional/heteromerous/separable</p>
                     </c>
                     <c ca="left">
                        <p>A leg is a <b>part_of </b>a body.</p>
                        <p>A <it>regulatory_region </it>is a <b>part_of </b>a <it>gene</it>.</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>portion_part_of</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>mass</p>
                     </c>
                     <c ca="left">
                        <p>Not functional/homomerous/separable</p>
                     </c>
                     <c ca="left">
                        <p>A slice is a <b>part_of </b>a cake.</p>
                        <p>A restriction_fragment is <b>part_of </b>a chromosome.</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>stuff_part_of</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>object</p>
                     </c>
                     <c ca="left">
                        <p>Not functional/heteromerous/not separable</p>
                     </c>
                     <c ca="left">
                        <p>Carbon is a <b>part_of </b>a chromosome.</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>member_part_of</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>collection</p>
                     </c>
                     <c ca="left">
                        <p>Not functional/heteromerous/separable</p>
                     </c>
                     <c ca="left">
                        <p>A sheep is a <b>part_of </b>a flock.</p>
                        <p>A <it>read </it>is a <b>part_of </b>a <it>contig</it>.</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>place_part_of</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>area</p>
                     </c>
                     <c ca="left">
                        <p>Not functional/homomerous/not separable</p>
                     </c>
                     <c ca="left">
                        <p>England is a <b>part_of </b>Britain.</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <b>feature_part_of</b>
                        </p>
                     </c>
                     <c ca="left">
                        <p>activity</p>
                     </c>
                     <c ca="left">
                        <p>Functional/heteromerous/not separable</p>
                     </c>
                     <c ca="left">
                        <p>Inhaling is a <b>part_of </b>breathing.</p>
                        <p>Translation is <b>part_of </b>protein synthesis.</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Column 1 gives the name of the subclass; column 2, the class or 'whole' to which such parts belong; column 3, the essential properties that define that particular part-whole relationship; and column 4 provides examples. Of the six classes only two - <b>component_part_of </b>and <b>member_part_of </b>- occur in SO.</p>
               </tblfn>
            </tbl>
            <p>Winston <it>et al</it>. <abbrgrp><abbr bid="B25">25</abbr></abbrgrp> argue that there is transitivity across a series of <b>part_of </b>relationships only if they all belong to the same subclass. In other words, an <it>exon </it>can only be <b>part_of </b>a <it>gene </it>if an <it>exon </it>is a <b>component_part_of </b>a <it>transcript</it>, and a <it>transcript </it>is <b>component_part_of </b>a <it>gene</it>. If, however, the two statements contain different types of <b>part_of </b>relationship, then transitivity does not hold.</p>
            <p>By addressing the vague English term 'part of' in this way, Winston <it>et al. </it>solve many of the problems associated with reasoning across <b>part_of </b>relationships; thus, we are adopting their approach with SO. The parts contained in the sequence ontology are mostly of the type <b>component_part_of </b>such as <it>exon </it>is a <b>part_of </b><it>transcript</it>, although there are a few occurrences of <b>member_part_of </b>such as <it>read </it>is a <b>part_of </b><it>contig</it>.</p>
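            <p>The transitivity rule can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary PART_OF is a toy fragment built from the examples in the text, each part is given a single whole for simplicity, and the function name inferred_wholes is hypothetical rather than part of any SO software.</p>

```python
# Toy typed part_of edges: part -> (whole, subclass).
# Hypothetical fragment based on the examples in the text, not the real SO file.
PART_OF = {
    "exon": ("transcript", "component_part_of"),
    "transcript": ("gene", "component_part_of"),
    "read": ("contig", "member_part_of"),
}


def inferred_wholes(part):
    """Follow part_of edges upward, stopping as soon as the subclass changes."""
    wholes = []
    subclass = None
    current = part
    while current in PART_OF:
        whole, edge_type = PART_OF[current]
        if subclass is None:
            subclass = edge_type      # subclass of the first link in the chain
        elif edge_type != subclass:
            break                     # mixed subclasses: transitivity does not hold
        wholes.append(whole)
        current = whole
    return wholes


print(inferred_wholes("exon"))  # ['transcript', 'gene']
print(inferred_wholes("read"))  # ['contig']
```

            <p>Because both links in the exon chain are <b>component_part_of</b>, the sketch infers that an <it>exon </it>is <b>part_of </b>a <it>gene</it>; a chain that mixed subclasses would be cut at the first change.</p>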
         </sec>
         <sec>
            <st>
               <p>SO's relationships facilitate software design and bioinformatics research</p>
            </st>
            <p>Genomic annotations are substrates for a multitude of software applications. Annotations, for example, are rendered by graphical viewers, or, as another example, their features are searched and queried for purposes of data validation and genomics research. Using an ontology for sequence annotation purposes offers many advantages over the traditional Feature Table approach. Because controlled vocabularies do not specify the relationships that obtain between their terms, using the Feature Table has meant that relationships between features have had to be hard-coded in software applications themselves; consequently, adding a new term to the Feature Table and/or changing the details of the relationships that obtain between its terms has meant revising every software application that made use of the Feature Table. Ontologies mitigate this problem as all of the knowledge about terms and their relationships to one another is contained in the ontology, not the software.</p>
            <p>SO-compliant software need only be provided with an updated version of the ontology, and everything else will follow automatically. This is because SO-compliant software need not hard-code the fact that a <it>tRNA </it>is a <b>kind_of </b><it>transcript</it>; it need merely know that <b>kind_of </b>relationships are transitive and hierarchical and be capable of internally navigating the network of relationships specified by the ontology (see Figure <figr fid="F1">1</figr>) in order to logically infer this fact. This means that every time a new form of <it>ncRNA </it>is discovered, and added to SO, all SO-compliant software applications will automatically be able to infer that any data labeled with that new term is a <b>kind_of </b>transcript. This means that existing graphical viewers will render those data with the appropriate transcript glyph, and validation and query tools will automatically deal with this new data-type in a coherent fashion. Placing the biological knowledge in the ontology rather than in the software means that the ontology and the software that uses it can be developed, revised, and extended independently of one another. Thus ontologies offer the bioinformatics programming community significant opportunities as regards software design and the speed of the development cycle. Using an ontology does, however, mean that software applications must meet certain professional standards; namely, they must be capable of parsing an OBO file and navigating the network of relationships that constitute the ontology, but these are minimal hurdles.</p>
            <p>SO facilitates bioinformatics research in ways that reach far beyond its utility as regards software design. For example, SO's <b>kind_of </b>relationships provide a subsumption hierarchy, or classification system for its terms. This added depth of knowledge greatly improves the searching and querying capabilities of software using SO. The ontology's higher-level terms may be used to query via inference, even if they are never used for annotation. We recommend that annotators label their data using terms corresponding to terminal nodes in the ontology. Transcripts, for example, might be annotated using terms such as <b>mRNA</b>, <b>tRNA</b>, and <b>rRNA </b>(see Figure <figr fid="F1">1</figr>). Note that doing so means that if, for example, non-coding RNA sequences are required for some subsequent analysis, then SO-compliant software tools can locate annotations labelled with the subtypes of ncRNA, and retrieve tRNAs and rRNAs to the exclusion of mRNAs, even though these data have not been explicitly labelled with the term <b>ncRNA</b>. Thus, many analyses become easy, for example, how many ncRNAs are annotated in <it>H. sapiens</it>? Of these what percent have more than one exon? Are any maternally imprinted? Moreover, using SO as part of a database schema ensures that such questions 'mean' the same thing in different databases.</p>
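            <p>A query by inference of this kind might look as follows. This is a minimal sketch, assuming a toy set of <b>kind_of </b>edges taken from the examples in the text; the names KIND_OF, ancestors and is_kind_of, and the feature identifiers, are hypothetical and do not correspond to any existing SO library.</p>

```python
# Toy kind_of edges, child -> parents. Hypothetical fragment, not the real SO file.
KIND_OF = {
    "mRNA": ["transcript"],
    "ncRNA": ["transcript"],
    "tRNA": ["ncRNA"],
    "rRNA": ["ncRNA"],
}


def ancestors(term):
    """Return every term reachable from `term` by following kind_of edges."""
    seen = set()
    stack = [term]
    while stack:
        for parent in KIND_OF.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def is_kind_of(term, query):
    """True if `term` is `query` or a transitive kind_of descendant of it."""
    return term == query or query in ancestors(term)


# Annotations labelled only with terminal terms (made-up identifiers).
annotations = [("f1", "mRNA"), ("f2", "tRNA"), ("f3", "rRNA")]

ncrnas = [fid for fid, so_type in annotations if is_kind_of(so_type, "ncRNA")]
print(ncrnas)  # ['f2', 'f3']
```

            <p>Adding a new subtype of <it>ncRNA </it>would then require only a new entry in the edge table; the query code itself would not change.</p>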
            <p>SO also greatly facilitates the automatic validation of annotation data, as the relationships implied by an annotation can be compared to the allowable relationships specified in the ontology. For example, an annotation that asserts an <it>intron </it>to be <b>part_of </b>an <it>mRNA </it>would be invalid, as this relationship is not specified in the ontology (Figure <figr fid="F1">1</figr>). On the other hand, an annotation that asserted that a <it>UTR </it>sequence was <b>part_of </b><it>mRNA </it>would be valid (Figure <figr fid="F1">1</figr>). This makes possible better quality control of annotation data, and makes it possible to check existing annotations for such errors when converting them to a SO-compliant format such as GFF3.</p>
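            <p>In its simplest form, such a validation step reduces to a lookup against the relationships the ontology permits. The sketch below assumes a toy table of permitted pairs drawn from the examples in the text; ALLOWED_PART_OF and invalid_assertions are hypothetical names, and a real validator would presumably also follow <b>kind_of </b>edges before rejecting an assertion.</p>

```python
# Toy table of permitted part_of pairs (part, whole); hypothetical, for illustration only.
ALLOWED_PART_OF = {
    ("exon", "transcript"),
    ("transcript", "gene"),
    ("UTR", "mRNA"),
}


def invalid_assertions(assertions):
    """Return the (part, whole) assertions that the toy table does not permit."""
    return [pair for pair in assertions if pair not in ALLOWED_PART_OF]


asserted = [("UTR", "mRNA"), ("intron", "mRNA")]
print(invalid_assertions(asserted))  # [('intron', 'mRNA')]
```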
            <p>To summarize, by identifying the set of relationships between terms that are possible, we are also specifying the inferences that can be drawn from these relationships: that is, the software operations that can be carried out over the data. As a consequence, software is easier to maintain, SO can easily be extended to embrace new biological knowledge, quality controls can be readily implemented, and software to mine data can be written so as to be very flexible.</p>
         </sec>
         <sec>
            <st>
               <p>EM operators and SO</p>
            </st>
            <p>SO also enables some modes of analysis of genomics data that are completely new to the field. One such class of analyses involves the use of extensional mereology (EM) operators to ask questions about gene parts. Although new to genomics, EM operators are well known in the field of ontology, where they provide a basis for asking and answering questions pertaining to how parts are distributed within and among different wholes (reviewed in <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr></abbrgrp>). These operators are usually applied to studies of how parts are shared between complex wholes - such as different models of automobiles or personal computers - for the purpose of optimizing manufacturing procedures. Below we explain how these same operators can be applied to the analysis of genomics data. Although some of these operators, such as <ul>difference</ul> and <ul>overlap</ul>, share their names with topological operators, they differ in that they operate on the parts of an object, not on its geometric coordinate space. The topological operators, regarding the coincidence of edges and interiors - equality, overlap, disjointedness, containment and coverage of spatial analysis <abbrgrp><abbr bid="B21">21</abbr></abbrgrp> - may also be applied to biological sequence.</p>
            <p>EM is a formal theory of parts: it defines the properties of the <b>part_of </b>relationship and then provides a set of operations (Table <tblr tid="T2">2</tblr>) that can be applied to those parts. These operators are akin to those of set theory, but whereas set theory makes use of an object's <b>kind_of </b>relationships, EM operators function on an object's <b>part_of </b>relationships. Only wholes and their 'proper parts' are legitimate substrates for EM operations. Proper parts are those parts that satisfy three self-evident criteria: first, nothing is a proper part of itself (a proper part is part of but not identical to the individual or whole); second, if <b>A </b>is a proper part of <b>B </b>then <b>B </b>is not a part of <b>A</b>; third, if <b>A </b>is a part of <b>B </b>and <b>B </b>is a part of <b>C </b>then <b>A </b>is a part of <b>C</b>.</p>
            <tbl id="T2">
               <title>
                  <p>Table 2</p>
               </title>
               <caption>
                  <p>The EM operators</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>EM operation</p>
                     </c>
                     <c ca="left">
                        <p>Definition</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Overlap (x &#9675; y)</p>
                     </c>
                     <c ca="left">
                        <p>x and y overlap if they have a part in common.</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Disjoint (x &#953; y)</p>
                     </c>
                     <c ca="left">
                        <p>x and y are disjoint if they share no parts in common.</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Binary product (x . y)</p>
                     </c>
                     <c ca="left">
                        <p>The parts that x and y share in common.</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Difference (x - y)</p>
                     </c>
                     <c ca="left">
                        <p>The largest portion of x which has no part in common with y.</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Binary sum (x + y)</p>
                     </c>
                     <c ca="left">
                        <p>The set consisting of individuals x and y.</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>In each case x and y refer to two wholes. The first two operators are Boolean and pertain to whether two wholes share any parts in common; whereas the remainder return either the parts, or, in the case of binary sum, the wholes, that satisfy the operation.</p>
               </tblfn>
            </tbl>
            <p>Note that the third criterion of proper parts is that they obey the rule of transitivity. As we discussed earlier, not all <b>part_of </b>relationships are transitive. Accordingly, we have restricted our analyses (see Results and discussion) to component parts (Table <tblr tid="T1">1</tblr>).</p>
            <p>Figure <figr fid="F2">2</figr> illustrates the effects of applying EM operations to analyze the relationships '<it>transcript </it>is a <b>part_of </b><it>gene' </it>and '<it>exon </it>is a <b>part_of </b><it>transcript</it>'. The EM operations <ul>overlap</ul> and <ul>disjoint</ul> pertain to relationships between transcripts, whereas <ul>difference</ul> and <ul>binary product</ul> pertain to exons. Two transcripts <ul>overlap</ul> if they share one or more exons in common. Two transcripts are <ul>disjoint</ul> if they do not share any exons in common. The exons shared between two overlapping transcripts are the <ul>binary product</ul> of the two transcripts, and the exons not shared in common comprise the <ul>difference</ul> between the two transcripts. The <ul>binary sum</ul> of two transcripts is simply the sum of their parts.</p>
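            <p>If each transcript is represented as a set of exon identifiers, the operations described above map onto ordinary set operations. The following Python sketch rests on that assumption; the function names and exon identifiers are hypothetical, and the binary sum is modelled as the union of parts, following the description in the text.</p>

```python
def overlap(x, y):
    """x and y overlap if they have at least one part in common."""
    return not x.isdisjoint(y)


def disjoint(x, y):
    """x and y are disjoint if they share no parts."""
    return x.isdisjoint(y)


def binary_product(x, y):
    """The parts that x and y share."""
    return x.intersection(y)


def difference(x, y):
    """The parts of x that are not parts of y."""
    return x.difference(y)


def binary_sum(x, y):
    """All parts of x together with all parts of y."""
    return x.union(y)


# Three transcripts of one gene, each a set of made-up exon identifiers.
t1 = frozenset({"e1", "e2", "e3"})
t2 = frozenset({"e1", "e2", "e4"})
t3 = frozenset({"e5"})

print(overlap(t1, t2))                 # True
print(disjoint(t1, t3))                # True
print(sorted(binary_product(t1, t2)))  # ['e1', 'e2']
print(sorted(difference(t1, t2)))      # ['e3']
print(sorted(binary_sum(t1, t3)))      # ['e1', 'e2', 'e3', 'e5']
```

            <p>The sketch compares identifiers only; the genomic coordinates of the exons play no role in any of the five functions.</p>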
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>Using EM operations to characterize alternatively spliced transcripts and their exons</p>
               </caption>
               <text>
                  <p>Using EM operations to characterize alternatively spliced transcripts and their exons. The EM operations <ul>overlap</ul> and <ul>disjoint</ul> can be used to characterize pair-wise relationships between alternative transcripts. <ul>Binary product</ul> and <ul>difference</ul>, on the other hand, pertain to exons shared, or not-shared between two alternative transcripts.</p>
               </text>
               <graphic file="gb-2005-6-5-r44-2"/>
            </fig>
            <p>One key feature of EM operations is that they operate in 'identifier space' rather than 'coordinate space'. Two transcripts <ul>overlap</ul> only if they share a part in common rather than if their genomic coordinates overlap. Thus, two transcripts may be <ul>disjoint</ul> even if their exons partially overlap one another. This is one way in which EM analyses differ from standard bioinformatics analyses, and it has some interesting repercussions. This is particularly so with regard to modes of alternative splicing, as each of the EM operations suggests a distinct category by means of which two alternatively spliced transcripts can be related to one another. We further explore the potential of these operations to classify alternative transcripts and their exons below.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Results and discussion</p>
         </st>
         <p>As part of a pilot project to evaluate the practical utility of SO as a tool for data management and analysis, we have used SO to name and enumerate the parts of every protein-coding annotation in the <it>D. melanogaster </it>genome. Doing so has allowed us to compare annotations with respect to their parts, for example, number of exons, amount of UTR sequence, and so on.</p>
         <p>These data afford many potential analyses, but as our motivation was primarily to demonstrate the practical utility of SO as a tool for data management, rather than comparative genomics <it>per se</it>, we have focused more on what exon-transcript-gene part-whole relationships have to say about the annotations themselves, than what the annotations have to say about the biology of the genome. Accordingly, we have used EM-operators to characterize the annotations with respect to their parts, especially with regard to alternative splicing. The version of FlyBase current at the time of analysis (5 August, 2004) contained 13,539 genes (of which 10,653 had a single transcript and 2,886 were alternatively spliced), 18,735 transcripts and 61,853 exons.</p>
         <sec>
            <st>
               <p>An EM-based scheme for classifying alternatively spliced genes</p>
            </st>
            <p>As we had characterized the parts of the annotations using SO, we were able to employ the EM operators over these parts. This proved to be a natural way to explore the relative complexity of alternative splicing, as the alternatively spliced transcripts have different combinations of parts: that is, exons. We grouped alternatively spliced transcripts into two classes. An alternatively spliced gene will contain <ul>overlapping</ul> transcripts if at least one of its exons is shared between two of its transcripts, and will have disjoint transcripts if one of its transcripts shares no exons in common with any other transcript of that gene. For the purposes of this analysis, we further classified <ul>disjoint</ul> transcripts as <ul>sequence-disjoint</ul> and <ul>parts-disjoint</ul>. We term two <ul>disjoint</ul> transcripts <ul>sequence-disjoint</ul> if none of their exons shares any sequence in common with one another; and <ul>parts-disjoint</ul> if one or more of their exons overlap on the chromosome but have different exon boundaries. Note that the three operations are pairwise, and thus not mutually exclusive. To see why this is, imagine a gene having three transcripts, A, B, and C. Obviously, transcript A can be <ul>disjoint</ul> with respect to B, but <ul>overlap</ul> with respect to C. Thus, we can speak of a gene as having both disjoint and overlapping transcripts.</p>
            <p>The relative numbers of <ul>disjoint</ul> and <ul>overlapping</ul> transcripts in a genome say something about the relative complexity of alternative splicing in that genome. A gene may have any combination of these types of <ul>disjoint</ul> and <ul>overlapping</ul> transcripts, so we created a labeling system consisting of the seven possible combinations. We did this by asking three EM-based questions about the relationships between pairs of a gene's transcripts: How many pairs are there of <ul>sequence-disjoint</ul> transcripts? How many pairs are there of <ul>parts-disjoint</ul> transcripts? How many pairs are there of <ul>overlapping</ul> transcripts? Doing so allowed us to place that gene into one of seven classes with regard to the properties of its alternatively spliced transcripts. We also kept track of the number of times each of the three relationships held true for each pair combination. For example, a gene having two transcripts that are <ul>parts-disjoint</ul> with respect to one another would be labeled 0:1:0. Keeping track of the number of transcript pairs falling into each class provides an easy means to prioritize them for manual review. These results are summarized in Figure <figr fid="F3">3</figr>.</p>
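            <p>The pairwise classification and the three-part label can be sketched as follows. This is an illustration only, under several assumptions that are not taken from the original analysis: each transcript is a dictionary mapping an exon identifier to inclusive (start, end) coordinates, two transcripts share an exon only if they carry the same exon identifier, and the function names are hypothetical.</p>

```python
from itertools import combinations


def exons_share_sequence(a, b):
    """True if two inclusive (start, end) intervals overlap on the chromosome."""
    return a[1] >= b[0] and b[1] >= a[0]


def classify_pair(t1, t2):
    """Classify a transcript pair as overlapping, parts-disjoint or sequence-disjoint."""
    if not set(t1).isdisjoint(t2):
        return "overlapping"              # at least one shared exon identifier
    for a in t1.values():
        for b in t2.values():
            if exons_share_sequence(a, b):
                return "parts-disjoint"   # no shared exon, but some sequence in common
    return "sequence-disjoint"


def gene_label(transcripts):
    """Count pairs per class; format as sequence-disjoint:parts-disjoint:overlapping."""
    counts = {"sequence-disjoint": 0, "parts-disjoint": 0, "overlapping": 0}
    for t1, t2 in combinations(transcripts, 2):
        counts[classify_pair(t1, t2)] += 1
    return "{}:{}:{}".format(
        counts["sequence-disjoint"], counts["parts-disjoint"], counts["overlapping"]
    )


# Made-up exon identifiers and coordinates.
tA = {"e1": (100, 200), "e2": (300, 400)}
tB = {"e3": (150, 200), "e4": (300, 450)}
tC = {"e1": (100, 200), "e5": (500, 600)}

print(gene_label([tA, tB]))      # 0:1:0
print(gene_label([tA, tB, tC]))  # 0:2:1
```

            <p>Replacing each non-zero count with N gives the class names used in the text, so that 0:2:1 would fall into class 0:N:N.</p>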
            <fig id="F3">
               <title>
                  <p>Figure 3</p>
               </title>
               <caption>
                  <p>Examples of alternatively spliced genes from Entrez Gene at the NCBI</p>
               </caption>
               <text>
                  <p>Examples of alternatively spliced genes from Entrez Gene at the NCBI. Of the seven classes of alternatively spliced genes, some classes are more likely to indicate annotation problems than others - particularly those genes having one or more <ul>sequence-disjoint</ul> transcripts. <ul>Parts-disjoint</ul> transcripts, on the other hand, are more suggestive of complex biology. Alternatively spliced genes having only overlapping transcripts (0:0:N) comprise the vast majority of instances.</p>
               </text>
               <graphic file="gb-2005-6-5-r44-3"/>
            </fig>
            <p>Of the alternatively spliced fly genes, none has a <ul>sequence-disjoint</ul> transcript, 275 have <ul>parts-disjoint</ul> transcripts, 2,664 have <ul>overlapping</ul> transcripts, and 53 have both <ul>parts-disjoint</ul> and <ul>overlapping</ul> transcripts. The percentage of <it>D. melanogaster </it>genes in each category is shown in Table <tblr tid="T3">3</tblr>. Most alternatively spliced genes contain at least one pair of overlapping transcripts. These data also have something to say about the ways in which research and management issues are intertwined with one another with respect to genome annotation, as some aspects of these data are clearly attributable to annotation practice. The lack of any <ul>sequence-disjoint</ul> transcripts in <it>D. melanogaster</it>, for example, is due to annotation practice; in fact, current FlyBase annotation practices forbid their creation, the reason being that any evidence for such transcripts is evidence for a new gene <abbrgrp><abbr bid="B28">28</abbr></abbrgrp>. This is not true for all genomic annotations. Annotations converted from the genomes division of GenBank to a SO-compliant form were subjected to EM analysis, and inspection of the corresponding gene-centric annotations provided by Entrez Gene <abbrgrp><abbr bid="B29">29</abbr></abbrgrp> revealed examples of genes that fall into each of the seven categories. Some of these annotations are shown in Figure <figr fid="F3">3</figr>.</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Percentage of each of the seven EM-based classes among the alternatively spliced genes in the <it>D. melanogaster </it>genome</p>
               </caption>
               <tblbdy cols="2">
                  <r>
                     <c ca="left">
                        <p>Class</p>
                     </c>
                     <c ca="center">
                        <p>
                           <it>D. melanogaster</it>
                        </p>
                     </c>
                  </r>
                  <r>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>N:0:0</p>
                     </c>
                     <c ca="center">
                        <p>0%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>N:N:0</p>
                     </c>
                     <c ca="center">
                        <p>0%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>N:0:N</p>
                     </c>
                     <c ca="center">
                        <p>0%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>N:N:N</p>
                     </c>
                     <c ca="center">
                        <p>0%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>0:N:0</p>
                     </c>
                     <c ca="center">
                        <p>7.70%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>0:N:N</p>
                     </c>
                     <c ca="center">
                        <p>1.83%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>0:0:N</p>
                     </c>
                     <c ca="center">
                        <p>90.47%</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>The percentage of genes with one or more pairs of <ul>sequence-disjoint</ul> transcripts, no pairs of <ul>parts-disjoint</ul> transcripts, and no pairs of overlapping transcripts - denoted as N:0:0 - is given in the first row. Row 2 gives the percentage of genes having both <ul>sequence-disjoint</ul> and <ul>parts-disjoint</ul> transcripts, but no overlapping transcripts - these are N:N:0 genes. Rows 3 to 7 give the percentages for each of the remaining possible classes.</p>
               </tblfn>
            </tbl>
            <p>The frequencies of genes that fall into each of the seven classes shown in Table <tblr tid="T3">3</tblr> provide a concise summary of genome-wide trends in alternative splicing in the fly. This EM-based classification schema, when applied to many model organisms, from many original sources, makes very apparent the magnitude of the practical challenges that surround decentralized annotation, and the distribution and redistribution of annotations. Certainly, these results highlight the need for data-management tools such as SO to assist the community in enforcing biological constraints and annotation standards. Only then will comparative genomic analyses show their full power.</p>
         </sec>
         <sec>
            <st>
               <p>Exons as alternative parts of transcripts</p>
            </st>
            <p>EM-operators can also be used to classify the exons of alternatively spliced genes. Exons shared between two transcripts comprise the <ul>binary product</ul> of the two transcripts; whereas those exons present in only one of the transcripts constitute their <ul>difference</ul> (see Table <tblr tid="T2">2</tblr> and Figure <figr fid="F2">2</figr> for more information). These basic facts suggest a very simple, three-part classification system. If an exon is the <ul>difference</ul> between all other transcripts, then it is only in one transcript; we term these UNIQUE exons. If an exon is the <ul>difference</ul> of some transcripts, and the <ul>binary product</ul> of others, it is in a fraction of transcripts; we term these SOMETIMES_FOUND exons. And, if an exon is the <ul>binary product</ul> of all combinations of transcripts, then it must be in all transcripts; we term such exons ALWAYS_FOUND exons. Classifying exons in this way allows us to look more closely at alternative splicing from the exon's perspective.</p>
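            <p>For a gene whose transcripts are represented as sets of exon identifiers, the three exon classes can be assigned by counting how many transcripts contain each exon, which is equivalent to asking whether the exon falls in the <ul>binary product</ul> or the <ul>difference</ul> of each transcript pair. The sketch below takes this counting shortcut; the function name and exon identifiers are hypothetical, and the code is not the procedure used to produce the results reported here.</p>

```python
def classify_exons(transcripts):
    """Map each exon identifier to UNIQUE, SOMETIMES_FOUND or ALWAYS_FOUND.

    `transcripts` is a list of sets of exon identifiers for one
    alternatively spliced gene (two or more transcripts).
    """
    n = len(transcripts)
    if 2 > n:
        raise ValueError("the scheme applies to genes with two or more transcripts")
    classes = {}
    for exon in set().union(*transcripts):
        count = sum(1 for t in transcripts if exon in t)
        if count == n:
            classes[exon] = "ALWAYS_FOUND"      # in the binary product of every pair
        elif count == 1:
            classes[exon] = "UNIQUE"            # in exactly one transcript
        else:
            classes[exon] = "SOMETIMES_FOUND"   # in some, but not all, transcripts
    return classes


# One gene with three alternative transcripts (made-up exon identifiers).
gene = [{"e1", "e2", "e3"}, {"e1", "e2", "e4"}, {"e1", "e5"}]
for exon, exon_class in sorted(classify_exons(gene).items()):
    print(exon, exon_class)
# e1 ALWAYS_FOUND
# e2 SOMETIMES_FOUND
# e3 UNIQUE
# e4 UNIQUE
# e5 UNIQUE
```

            <p>Note that a gene with only two transcripts cannot have SOMETIMES_FOUND exons under this scheme: each of its exons is either in one transcript or in both.</p>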
            <p>As can be seen from Table <tblr tid="T4">4</tblr>, despite the low frequency of alternatively spliced genes, a large fraction of all exons, almost 40%, are associated with alternatively spliced transcripts. A sizable proportion of SOMETIMES_FOUND and ALWAYS_FOUND exons are coding exons in some of the transcripts and entirely untranslated exons in others. In some cases, this is due to actual biology: some transcripts in <it>D. melanogaster </it>are known to produce more than one protein (see, for example <abbrgrp><abbr bid="B30">30</abbr></abbrgrp>). In other cases, this situation appears to be a result of best attempts on the part of annotators to interpret ambiguous supporting evidence; in yet others the supporting data sometimes unambiguously point to patterns of alternative splicing that would seem to produce transcripts destined for nonsense-mediated decay <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>. Whatever the underlying cause, these exons, like the N:0:0 class annotations, should be subjected to further investigation.</p>
            <tbl id="T4">
               <title>
                  <p>Table 4</p>
               </title>
               <caption>
                  <p>Summary of the types of exons present in each of the genomes and their functions</p>
               </caption>
               <tblbdy cols="5">
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>Exon part of gene with single transcript</p>
                     </c>
                     <c ca="center">
                        <p>Exon part of one transcript of alternatively spliced gene (UNIQUE)</p>
                     </c>
                     <c ca="center">
                        <p>Exon part of fraction of alternatively spliced transcripts (SOMETIMES_FOUND)</p>
                     </c>
                     <c ca="center">
                        <p>Exon part of all of the transcripts of alternatively spliced gene (ALWAYS_FOUND)</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="5">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Percentage of all exons</p>
                     </c>
                     <c ca="center">
                        <p>60.1%</p>
                     </c>
                     <c ca="center">
                        <p>16.1%</p>
                     </c>
                     <c ca="center">
                        <p>5.2%</p>
                     </c>
                     <c ca="center">
                        <p>18.6%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Coding</p>
                     </c>
                     <c ca="center">
                        <p>94.5%</p>
                     </c>
                     <c ca="center">
                        <p>68%</p>
                     </c>
                     <c ca="center">
                        <p>73%</p>
                     </c>
                     <c ca="center">
                        <p>93%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Non-coding</p>
                     </c>
                     <c ca="center">
                        <p>4.5%</p>
                     </c>
                     <c ca="center">
                        <p>32%</p>
                     </c>
                     <c ca="center">
                        <p>19%</p>
                     </c>
                     <c ca="center">
                        <p>3.5%</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>Coding/non-coding</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>-</p>
                     </c>
                     <c ca="center">
                        <p>8%</p>
                     </c>
                     <c ca="center">
                        <p>3.5%</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Exons of alternatively spliced genes were divided into three categories based on the <ul>binary product</ul> and <ul>difference</ul> operations. UNIQUE exons (column 2) occur in only a single transcript; SOMETIMES_FOUND exons (column 3) occur in some, but not all of a gene's alternatively spliced transcripts. ALWAYS_FOUND exons occur in every alternative transcript. The table rows show the breakdown of each exon class with respect to function, i.e., <b>coding exons </b>are those that consist at least partially of translated nucleotides, whereas <b>non-coding exons </b>consist entirely of UTR sequence. In some genes, an exon may be coding in one transcript and non-coding in another, depending on the annotated start and stop codons and the phase of the upstream intron; these exons are denoted as <b>coding/non-coding exons</b>. For reference purposes, the breakdown of exons in single-transcript genes is shown in column 1.</p>
               </tblfn>
            </tbl>
            <p>To investigate these conclusions in more detail, we further examined each exon with respect to its EM-based class and its coding and untranslated portions. These results are shown in Figure <figr fid="F4">4</figr>, and naturally extend the analyses presented in Table <tblr tid="T4">4</tblr>. First, regardless of exon class, most entirely untranslated exons are 5-prime exons; the lower frequency of 3-prime untranslated exons is perhaps due to nonsense-mediated decay <abbrgrp><abbr bid="B31">31</abbr></abbrgrp>, as the presence of splice junctions in a processed transcript downstream of its stop codon is believed to target that transcript for degradation. A second point made clear by the data in Table <tblr tid="T4">4</tblr> is that alternatively spliced genes of <it>D. melanogaster </it>are highly enriched for 5-prime untranslated exons compared with single-transcript genes. Most of these exons belong to the UNIQUE class; thus, there seems to be a strong tendency in <it>D. melanogaster </it>for alternative transcripts to begin with a unique 5' UTR region. This fact suggests that alternative transcription in the fly may, in many cases, be a consequence of alternative-promoter usage and perhaps tissue-specific transcription start sites. The high percentage of untranslated 5-prime UNIQUE exons in <it>D. melanogaster </it>may also be a consequence of the large numbers of 5' ESTs that have been sequenced in the fly <abbrgrp><abbr bid="B32">32</abbr></abbrgrp>.</p>
            <fig id="F4">
               <title>
                  <p>Figure 4</p>
               </title>
               <caption>
                  <p>A series of Venn diagrams showing the relationship between exon class and coding potential</p>
               </caption>
               <text>
                  <p>A series of Venn diagrams showing the relationship between exon class and coding potential. An exon may be fully protein coding, partially protein coding, or fully UTR. An exon may be <b>part_of </b>a single-transcript gene, or be <b>part_of </b>one (UNIQUE exons), all (ALWAYS_FOUND exons), or a fraction (SOMETIMES_FOUND exons) of the transcripts in an alternatively transcribed gene.</p>
               </text>
               <graphic file="gb-2005-6-5-r44-4"/>
            </fig>
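            <p>The coding-potential categories used in Figure 4 and Table 4 can be illustrated with a short Python sketch. It assumes that each transcript's translated region is given as one genomic span from annotated start to stop codon, and that coordinates are half-open (start, end) pairs on the same sequence; these conventions, the function names, and the returned labels are assumptions of the example rather than part of SO or CGL.</p>

```python
def coding_status(exon, cds):
    """Classify one exon against the translated span of one transcript.

    Both arguments are (start, end) tuples in half-open coordinates;
    this convention is an assumption of the sketch.
    """
    exon_start, exon_end = exon
    cds_start, cds_end = cds
    # Number of exon positions that fall inside the translated span.
    overlap = max(0, min(exon_end, cds_end) - max(exon_start, cds_start))
    if overlap == 0:
        return "fully_UTR"
    if overlap == exon_end - exon_start:
        return "fully_coding"
    return "partially_coding"


def exon_function(exon, cds_spans):
    """Summarize an exon over every transcript that contains it.

    `cds_spans` holds one translated span per containing transcript.
    """
    if not cds_spans:
        raise ValueError("at least one translated span is required")
    is_coding = {coding_status(exon, cds) != "fully_UTR" for cds in cds_spans}
    if is_coding == {True}:
        return "coding"
    if is_coding == {False}:
        return "non-coding"
    # Coding in one transcript and non-coding in another.
    return "coding/non-coding"


# Hypothetical coordinates for illustration only.
status_a = coding_status((100, 200), (150, 900))
status_b = coding_status((300, 400), (150, 900))
status_c = coding_status((950, 1000), (150, 900))
mixed = exon_function((100, 140), [(150, 900), (120, 900)])
```

            <p>The second function mirrors the table legend: an exon counts as coding if it is at least partially translated, and as coding/non-coding if its status differs between transcripts.</p>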
            <p>Figure <figr fid="F4">4</figr> also shows that most (> 95%) <it>D. melanogaster </it>ALWAYS_FOUND exons are coding. This makes sense, as it seems likely that one reason for an exon's inclusion in every one of a gene's alternative transcripts is that it encodes a portion of the protein essential for its function(s).</p>
            <p>As with our previous analyses of alternative transcripts, our analyses of alternatively transcribed exons also illustrate the ways in which basic biology and annotation-management issues intersect one another. The fact that most ALWAYS_FOUND exons are entirely coding, for example, may have something important to say about which parts of a protein are essential for its function(s), whereas the over-abundance of untranslated UNIQUE exons probably has more to say about the resources available to, and the protocols used by, the annotation project than it does about biology. Such considerations make it clear that the evidence used to produce an annotation is an essential part of the annotation. In this regard SO has much to offer, as it provides a rational means by which to manage annotation evidence in the context of gene-parts and the relations between those parts.</p>
         </sec>
      </sec>
      <sec>
         <st>
            <p>Conclusion</p>
         </st>
         <p>We have sought to provide an introduction to the SO and to explain why using it to unify genomic annotations benefits the model organism community. We have illustrated some of the ways in which SO can be used to analyze and manage annotations. Relationships are an essential component of SO, and understanding their role within the ontology is a basic prerequisite for using SO in an intelligent fashion. Much of this paper revolves around the <b>part_of </b>relationship because SO is largely a meronomy - a particular kind of ontology concerned with the relationships of parts to wholes. Extensional mereology (EM) is largely new to bioinformatics, but several excellent reference works are available <abbrgrp><abbr bid="B26">26</abbr><abbr bid="B27">27</abbr><abbr bid="B33">33</abbr></abbrgrp>, and even a cursory examination of these texts will make it clear that EM has much to offer bioinformatics.</p>
         <p>Using all of the relationships in SO allows us to automatically draw logical conclusions about data that has been labelled with SO terms and thereby provide useful insights into the underlying annotations. We have shown how SO, together with the EM-based operations it enables, can be used to standardize, analyze, and manage genome annotations.</p>
         <p>Given any standardized set of genome annotations described with SO, these annotations can then be rigorously characterized. For our pilot analyses, we focused on alternatively transcribed genes and their exons, and explored the potential of EM-operators to classify and characterize them. We believe that the results of these analyses support two principal conclusions. First, EM-based classification schemes are simple to implement, and second, they capture important trends in the data and provide a concise, natural, and meaningful overview of annotations in these genomes.</p>
         <p>One criticism that might be justifiably leveled against the SO- and EM-based analyses presented here is that they are too formal, and that simpler approaches could have accomplished the same ends. As our discussion of <b>part_of </b>relationships made clear, however, reasoning across diverse types of parts is a complicated process; <it>ad-hoc </it>approaches will not suffice where the data are complex. The more formal approach afforded by SO means that analyses can easily be extended beyond the domain of transcripts and exons to include many other gene parts and relationships as well - including evidence. It seems clear that over the next few years both the number and complexity of annotations will increase, especially with regard to the diversity of their parts. Drawing valid conclusions from comparisons of these annotations will prove challenging. That SO has much to offer such analyses is indisputable.</p>
         <p>SO and SOFA provide the model organism community with a means to unify the semantics of sequence annotation. This facilitates communication within a group and between different model organism groups. Adopting SO terminology to type the features and properties of sequence will provide both the group and the community the advantages of a common vocabulary, to use for sharing and querying data and for automated reasoning over large amounts of sequence data.</p>
      </sec>
      <sec>
         <st>
            <p>Materials and methods</p>
         </st>
         <p>SO and SOFA have been built and are maintained using the ontology-editing tool OBO-Edit. The ontologies are available at <abbrgrp><abbr bid="B34">34</abbr></abbrgrp>.</p>
         <p>The FlyBase <it>D. melanogaster </it><abbrgrp><abbr bid="B35">35</abbr></abbrgrp> data was derived from the GadFly <abbrgrp><abbr bid="B36">36</abbr></abbrgrp> relational database and converted to Chaos-XML using the Bio-chaos tools. The features were annotated to the deepest possible concept in the ontology, given the available information. For example, the annotations were detailed enough to type each transcript feature by its kind of RNA, such as <it>mRNA </it>or <it>tRNA</it>. It was therefore possible to restrict the analysis to given types of transcript. CGL tools were used to validate each of the annotations, iterate through the genes and query the features. EM-operators were applied to the part features of genes.</p>
         <p>Other organism data was derived from the <it>genomes </it>section of GenBank <abbrgrp><abbr bid="B37">37</abbr></abbrgrp>. GenBank flat files were converted to SO-compliant Chaos-XML using the script cx-genbank2chaos.pl (available from <abbrgrp><abbr bid="B19">19</abbr></abbrgrp>) and BioPerl <abbrgrp><abbr bid="B23">23</abbr></abbrgrp>. The BioPerl GenBank parser, Bio::SeqIO::genbank, was used to convert GenBank flat files to BioPerl SeqFeature objects. Feature_relationships between these objects were inferred from location information using the BioPerl Bio::SeqFeature::Tools::Unflattener code. GenBank Feature Table types were converted to SO terms using the Bio::SeqFeature::Tools::TypeMapper class, which contains a hardcoded mapping for the subset of the GenBank Feature Table that is currently used in the <it>genomes </it>section of GenBank. The same Perl class was used to type the feature_relationships according to SO relationship types. The EM analysis was performed over the Chaos-XML annotations using the CGL suite of modules to iterate over the parts of each gene.</p>
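         <p>The two conversion steps described above, typing features with SO terms and inferring relationships from location, can be illustrated with a simplified Python sketch. It does not reproduce the BioPerl TypeMapper or Unflattener logic: the mapping table is a small illustrative subset chosen for this example, the feature dictionaries and function names are hypothetical, and every contained feature is attached directly to its gene instead of to an intermediate transcript.</p>

```python
# Illustrative subset only; the mapping actually used is the one hardcoded
# in BioPerl's TypeMapper class and is not reproduced here.
GENBANK_TO_SO = {
    "gene": "gene",
    "mRNA": "mRNA",
    "tRNA": "tRNA",
    "CDS": "CDS",
    "exon": "exon",
    "5'UTR": "five_prime_UTR",
    "3'UTR": "three_prime_UTR",
}


def to_so(feature):
    """Return a copy of a feature dict with an added SO type."""
    so_type = GENBANK_TO_SO.get(feature["key"])
    if so_type is None:
        raise ValueError(f"no SO mapping for GenBank key {feature['key']!r}")
    return {**feature, "so_type": so_type}


def contains(outer, inner):
    """True if `inner` lies entirely within `outer` (same coordinate system)."""
    return inner["start"] >= outer["start"] and outer["end"] >= inner["end"]


def infer_part_of(features):
    """Infer (subject, 'part_of', object) triples from location alone.

    Simplification: each non-gene feature is linked to any gene on the
    same strand that contains it.
    """
    genes = [f for f in features if f["so_type"] == "gene"]
    relationships = []
    for feature in features:
        if feature["so_type"] == "gene":
            continue
        for gene in genes:
            if feature["strand"] == gene["strand"] and contains(gene, feature):
                relationships.append((feature["id"], "part_of", gene["id"]))
    return relationships


# Hypothetical features, as they might be read from a GenBank flat file.
features = [
    {"id": "g1", "key": "gene", "start": 100, "end": 900, "strand": 1},
    {"id": "m1", "key": "mRNA", "start": 100, "end": 900, "strand": 1},
    {"id": "x1", "key": "exon", "start": 100, "end": 300, "strand": 1},
    {"id": "x2", "key": "exon", "start": 950, "end": 990, "strand": 1},
]
typed = [to_so(f) for f in features]
relationships = infer_part_of(typed)
```

         <p>Here the exon x2 lies outside the gene and therefore receives no relationship; a production tool would also have to resolve nested and overlapping genes, which this sketch ignores.</p>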
      </sec>
   </bdy>
   <bm>
      <refgrp>
         <bibl id="B1">
            <title>
               <p>GenBank</p>
            </title>
            <url>http://www.ncbi.nlm.nih.gov/Genbank/index.html</url>
         </bibl>
         <bibl id="B2">
            <title>
               <p>The Institute for Genomic Research</p>
            </title>
            <url>http://www.tigr.org</url>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Joint Genome Institute</p>
            </title>
            <url>http://jgi.doe.gov</url>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Annotation of the <it>Drosophila melanogaster </it>euchromatic genome: a systematic review.</p>
            </title>
            <aug>
               <au>
                  <snm>Misra</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Crosby</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Mungall</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Matthews</snm>
                  <fnm>BB</fnm>
               </au>
               <au>
                  <snm>Campbell</snm>
                  <fnm>KS</fnm>
               </au>
               <au>
                  <snm>Hradecky</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Huang</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>Kaminker</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Millburn</snm>
                  <fnm>GH</fnm>
               </au>
               <au>
                  <snm>Prochnik</snm>
                  <fnm>SE</fnm>
               </au>
               <etal/>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <fpage>research0083.1</fpage>
            <lpage>0083.22</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1186/gb-2002-3-12-research0083</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>WormBase: network access to the genome and biology of <it>Caenorhabditis elegans</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Stein</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Sternberg</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Durbin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Thierry-Mieg</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Spieth</snm>
                  <fnm>J</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2001</pubdate>
            <volume>29</volume>
            <fpage>82</fpage>
            <lpage>86</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">29781</pubid>
                  <pubid idtype="pmpid" link="fulltext">11125056</pubid>
                  <pubid idtype="doi">10.1093/nar/29.1.82</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>The <it>Arabidopsis </it>Information Resource (TAIR): a model organism database providing a centralized, curated gateway to <it>Arabidopsis </it>biology, research materials and community.</p>
            </title>
            <aug>
               <au>
                  <snm>Rhee</snm>
                  <fnm>SY</fnm>
               </au>
               <au>
                  <snm>Beavis</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Berardini</snm>
                  <fnm>TZ</fnm>
               </au>
               <au>
                  <snm>Chen</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Dixon</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Doyle</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Garcia-Hernandez</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Huala</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Lander</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Montoya</snm>
                  <fnm>M</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2003</pubdate>
            <volume>31</volume>
            <fpage>224</fpage>
            <lpage>228</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">165523</pubid>
                  <pubid idtype="pmpid" link="fulltext">12519987</pubid>
                  <pubid idtype="doi">10.1093/nar/gkg076</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p><it>Saccharomyces </it>genome database: underlying principles and organization.</p>
            </title>
            <aug>
               <au>
                  <snm>Dwight</snm>
                  <fnm>SS</fnm>
               </au>
               <au>
                  <snm>Balakrishnan</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Christie</snm>
                  <fnm>KR</fnm>
               </au>
               <au>
                  <snm>Costanzo</snm>
                  <fnm>MC</fnm>
               </au>
               <au>
                  <snm>Dolinski</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Engel</snm>
                  <fnm>SR</fnm>
               </au>
               <au>
                  <snm>Feierbach</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Fisk</snm>
                  <fnm>DG</fnm>
               </au>
               <au>
                  <snm>Hirschman</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Hong</snm>
                  <fnm>EL</fnm>
               </au>
               <etal/>
            </aug>
            <source>Brief Bioinform</source>
            <pubdate>2004</pubdate>
            <volume>5</volume>
            <fpage>9</fpage>
            <lpage>22</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1186/1471-2105-5-9</pubid>
                  <pubid idtype="pmpid" link="fulltext">15153302</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Creating the gene ontology resource: design and implementation.</p>
            </title>
            <aug>
               <au>
                  <cnm>Gene Ontology Consortium</cnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2001</pubdate>
            <volume>11</volume>
            <fpage>1425</fpage>
            <lpage>1433</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">311077</pubid>
                  <pubid idtype="pmpid" link="fulltext">11483584</pubid>
                  <pubid idtype="doi">10.1101/gr.180801</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Open Biomedical Ontologies</p>
            </title>
            <url>http://obo.sourceforge.net</url>
         </bibl>
         <bibl id="B10">
            <title>
               <p>The Sequence Ontology</p>
            </title>
            <url>http://song.sourceforge.net</url>
         </bibl>
         <bibl id="B11">
            <title>
               <p>OBO-Edit</p>
            </title>
            <url>http://sourceforge.net/projects/geneontology</url>
         </bibl>
         <bibl id="B12">
            <title>
               <p>DDBJ/EMBL/GenBank Feature Table documentation</p>
            </title>
            <url>http://www.ebi.ac.uk/embl/Documentation/FT_definitions/feature_table.html</url>
         </bibl>
         <bibl id="B13">
            <title>
               <p>The EMBL Nucleotide Sequence Database.</p>
            </title>
            <aug>
               <au>
                  <snm>Kulikova</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Aldebert</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Althorpe</snm>
                  <fnm>A</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>D27</fpage>
            <lpage>D30</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">14681351</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh120</pubid>
                  <pubid idtype="pmcid">308854</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B14">
            <title>
               <p>DDBJ in the stream of various biological data.</p>
            </title>
            <aug>
               <au>
                  <snm>Miyazaki</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Sugawara</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Ikeo</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Gojobori</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Tateno</snm>
                  <fnm>Y</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2004</pubdate>
            <volume>32</volume>
            <fpage>D31</fpage>
            <lpage>D34</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmpid" link="fulltext">14681352</pubid>
                  <pubid idtype="doi">10.1093/nar/gkh127</pubid>
                  <pubid idtype="pmcid">308861</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <aug>
               <au>
                  <snm>Alberts</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Johnson</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lewis</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Raff</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Roberts</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Walter</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Molecular Biology of the Cell</source>
            <publisher>New York: Garland</publisher>
            <edition>4</edition>
            <pubdate>2002</pubdate>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Generic Feature Format</p>
            </title>
            <url>http://song.sourceforge.net/gff3.shtml</url>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Chado schema</p>
            </title>
            <url>http://www.gmod.org/schema</url>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Generic Model Organism Database</p>
            </title>
            <url>http://www.gmod.org</url>
         </bibl>
         <bibl id="B19">
            <title>
               <p>Chaos-XML</p>
            </title>
            <url>http://www.fruitfly.org/chaos-xml</url>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Comparative Genomics Library</p>
            </title>
            <url>http://www.yandell-lab.org</url>
         </bibl>
         <bibl id="B21">
            <title>
               <p>A formal definition of binary topological relationships.</p>
            </title>
            <aug>
               <au>
                  <snm>Egenhofer</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>Lecture Notes Comp Sci</source>
            <pubdate>1989</pubdate>
            <volume>367</volume>
            <fpage>457</fpage>
            <lpage>472</lpage>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Basic local alignment search tool.</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Gish</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>EW</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1990</pubdate>
            <volume>215</volume>
            <fpage>403</fpage>
            <lpage>410</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1990.9999</pubid>
                  <pubid idtype="pmpid" link="fulltext">2231712</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>The Bioperl toolkit: Perl modules for the life sciences.</p>
            </title>
            <aug>
               <au>
                  <snm>Stajich</snm>
                  <fnm>JE</fnm>
               </au>
               <au>
                  <snm>Block</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Boulez</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Chervitz</snm>
                  <fnm>SA</fnm>
               </au>
               <au>
                  <snm>Dagdigian</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Fuellen</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Gilbert</snm>
                  <fnm>JG</fnm>
               </au>
               <au>
                  <snm>Korf</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Lapp</snm>
                  <fnm>H</fnm>
               </au>
               <etal/>
            </aug>
            <source>Genome Res</source>
            <pubdate>2002</pubdate>
            <volume>12</volume>
            <fpage>1611</fpage>
            <lpage>1618</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">187536</pubid>
                  <pubid idtype="pmpid" link="fulltext">12368254</pubid>
                  <pubid idtype="doi">10.1101/gr.361602</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>Relations in biological ontologies.</p>
            </title>
            <aug>
               <au>
                  <snm>Smith</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Ceusters</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>K&#246;hler</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Kumar</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Lomax</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Mungall</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Neuhaus</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Rector</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Rosse</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2005</pubdate>
            <inpress/>
         </bibl>
         <bibl id="B25">
            <title>
               <p>A taxonomy of part-whole relations.</p>
            </title>
            <aug>
               <au>
                  <snm>Winston</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Chaffin</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Herrmann</snm>
                  <fnm/>
               </au>
            </aug>
            <source>Cog Sci</source>
            <pubdate>1987</pubdate>
            <volume>11</volume>
            <fpage>417</fpage>
            <lpage>444</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1016/S0364-0213(87)80015-0</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <aug>
               <au>
                  <snm>Simons</snm>
                  <fnm>P</fnm>
               </au>
            </aug>
            <source>Parts - A Study in Ontology</source>
            <publisher>Oxford: Clarendon Press</publisher>
            <pubdate>1987</pubdate>
         </bibl>
         <bibl id="B27">
            <aug>
               <au>
                  <snm>Husserl</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Logical Investigations</source>
            <publisher>London: Routledge &amp; Kegan Paul</publisher>
            <pubdate>1970</pubdate>
            <volume>II</volume>
         </bibl>
         <bibl id="B28">
            <title>
               <p>FlyBase Re-annotation guideline</p>
            </title>
            <url>http://www.fruitfly.org/annot/reannot-guidelines.html</url>
         </bibl>
         <bibl id="B29">
            <title>
               <p>Entrez Gene</p>
            </title>
            <url>http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=gene</url>
         </bibl>
         <bibl id="B30">
            <title>
               <p>The <it>Drosophila melanogaster </it>tropomyosin II gene produces multiple proteins by the use of alternate tissue specific promoters and alternate splicing.</p>
            </title>
            <aug>
               <au>
                  <snm>Hanke</snm>
                  <fnm>PD</fnm>
               </au>
               <au>
                  <snm>Storti</snm>
                  <fnm>RV</fnm>
               </au>
            </aug>
            <source>Mol Cell Biol</source>
            <pubdate>1988</pubdate>
            <volume>8</volume>
            <fpage>3591</fpage>
            <lpage>3602</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">365414</pubid>
                  <pubid idtype="pmpid" link="fulltext">2851721</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Evidence for the widespread coupling of alternative splicing and nonsense-mediated mRNA decay in humans.</p>
            </title>
            <aug>
               <au>
                  <snm>Lewis</snm>
                  <fnm>BP</fnm>
               </au>
               <au>
                  <snm>Green</snm>
                  <fnm>RE</fnm>
               </au>
               <au>
                  <snm>Brenner</snm>
                  <fnm>SE</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci USA</source>
            <pubdate>2003</pubdate>
            <volume>100</volume>
            <fpage>189</fpage>
            <lpage>192</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">140922</pubid>
                  <pubid idtype="pmpid" link="fulltext">12502788</pubid>
                  <pubid idtype="doi">10.1073/pnas.0136770100</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>The <it>Drosophila melanogaster </it>genome.</p>
            </title>
            <aug>
               <au>
                  <snm>Celniker</snm>
                  <fnm>SE</fnm>
               </au>
               <au>
                  <snm>Rubin</snm>
                  <fnm>GM</fnm>
               </au>
            </aug>
            <source>Annu Rev Genomics Hum Genet</source>
            <pubdate>2003</pubdate>
            <volume>4</volume>
            <fpage>89</fpage>
            <lpage>117</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1146/annurev.genom.4.070802.110323</pubid>
                  <pubid idtype="pmpid" link="fulltext">14527298</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <aug>
               <au>
                  <snm>Cruse</snm>
                  <fnm>DA</fnm>
               </au>
            </aug>
            <source>Lexical Semantics</source>
            <publisher>Cambridge, UK: Cambridge University Press</publisher>
            <pubdate>1986</pubdate>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Sequence Ontology</p>
            </title>
            <url>http://song.sourceforge.net</url>
         </bibl>
         <bibl id="B35">
            <title>
               <p>FlyBase release 3.2</p>
            </title>
            <url>http://www.fruitfly.org/annot/release3.html</url>
         </bibl>
         <bibl id="B36">
            <title>
               <p>An integrated computational pipeline and database to support whole-genome sequence annotation.</p>
            </title>
            <aug>
               <au>
                  <snm>Mungall</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Misra</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Berman</snm>
                  <fnm>BP</fnm>
               </au>
               <au>
                  <snm>Carlson</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Frise</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Harris</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Marshall</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Shu</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kaminker</snm>
                  <fnm>JS</fnm>
               </au>
               <au>
                  <snm>Prochnik</snm>
                  <fnm>SE</fnm>
               </au>
               <etal/>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2002</pubdate>
            <volume>3</volume>
            <fpage>research0081.1</fpage>
            <lpage>0081.11</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1186/gb-2002-3-12-research0081</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B37">
            <title>
               <p>Genomes Division of GenBank</p>
            </title>
            <url>http://ftp.ncbi.nlm.nih.gov/genomes</url>
         </bibl>
      </refgrp>
   </bm>
</art>
