Philip E. Bourne, Helen M. Berman, Brian McMahon, Keith D. Watenpaugh, John Westbrook and Paula M.D. Fitzgerald
Methods in Enzymology (1997) 277, 571-590.
The Protein Data Bank (PDB) format provides a standard representation for macromolecular structure data derived from X-ray diffraction and NMR studies. This representation has served the community well since its inception in the 1970s (Bernstein et al.1) and a large amount of software that uses this representation has been written. However, it is widely recognized that the current PDB format cannot adequately express the large amount of data (content) associated with a single macromolecular structure and the experiment from which it was derived in a way (context) that is consistent and permits direct comparison with other structure entries. Structure comparison, for such purposes as better understanding biological function, assisting in the solution of new structures, drug design, and structure prediction, becomes increasingly valuable as the number of macromolecular structures continues to grow at a near exponential rate. It could be argued that the required content of a structure submission could be described by adding further PDB record types. However, the PDB format does not permit automated maintenance of the level of consistency, accuracy, and reproducibility required for such a large body of data.
A variety of approaches for improved scientific data representation is being explored (IEEE2). The approach described here, which has been developed under the auspices of the International Union of Crystallography (IUCr), is to extend the Crystallographic Information File (CIF) data representation used for describing small molecule structures and associated diffraction experiments. This extension is referred to as the macromolecular Crystallographic Information File (mmCIF) and is the subject of this paper. The paper briefly covers the history of mmCIF, similarities to and differences from the PDB format, contents of the mmCIF dictionary, and how to represent structures using mmCIF. The mmCIF home page (mmCIF3) contains a historic description of the development of the dictionary, current versions of the dictionary in text and HTML formats, software tools, archives of the mmCIF discussion list, and a detailed on-line tutorial (Bourne4).
CIF was developed to describe small molecule organic structures and the crystallographic experiment by the International Union of Crystallography (IUCr) Working Party on Crystallographic Information at the behest of the IUCr Commission on Crystallographic Data and the IUCr Commission on Journals. The result of this effort was a core dictionary of data items sufficient for archiving the small molecule crystallographic experiment and its results (Hall et al.5, IUCr6). This core dictionary was adopted by the IUCr at its 1990 Congress in Bordeaux. The format of the small molecule CIF dictionary and the data files based upon that dictionary conform to a restricted version of the Self Defining Text Archive and Retrieval (STAR) representation developed by Hall (Cook and Hall7, Hall and Spadaccini8). STAR permits a data organization that may be understood by analogy with a spoken language (Fig. 1).
Figure 1 Components of the STAR/CIF data representation and their analogy to a natural language.
STAR defines a set of encoding rules similar to saying the English language is comprised of 26 letters. A Dictionary Definition Language (DDL) is defined which uses those rules and which provides a framework from which to define a dictionary of the terms needed by the discipline. Think of the DDL as a computer readable way of declaring that words are made up of arbitrary groups of letters and that words are organized into sentences and paragraphs. The DDL provides a convention for naming and defining data items within the dictionary, declaring specific attributes of those data items, for example, a range of values and the data type, and for declaring relationships between data items. In other words, the DDL defines the format of the dictionary and any new words that are added must conform to that format. Just as words are constantly being added to a language, data items will be added to the dictionaries as the discipline evolves. The STAR encoding rules and the DDL are being used to develop a variety of dictionaries and reference files, for example, the powder diffraction dictionary, the modulated structures dictionary, a file of ideal geometry for amino acids, and an NMR dictionary. This extensibility is attractive since the same basic reading and browsing software (context-based tools) can be used irrespective of the data content. Data files (this paper is an example in our language analogy) are composed of data items found in the dictionaries.
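As a rough illustration of what a dictionary definition provides, the sketch below checks a raw value against a declared data type and a permitted range. The ItemDefinition class, its fields, and the range chosen for the occupancy item are assumptions made for this example; they are not taken from the DDL or from the mmCIF dictionary itself.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ItemDefinition:
    """Hypothetical, simplified stand-in for one dictionary definition."""
    name: str
    convert: Callable[[str], float]    # declared data type, e.g. int or float
    minimum: Optional[float] = None    # lower bound of the permitted range
    maximum: Optional[float] = None    # upper bound of the permitted range


def validate(definition: ItemDefinition, raw: str):
    """Convert raw text to the declared type and check the declared range."""
    value = definition.convert(raw)    # raises ValueError on a type mismatch
    if definition.minimum is not None and value < definition.minimum:
        raise ValueError(f"{definition.name}: {value} is below {definition.minimum}")
    if definition.maximum is not None and value > definition.maximum:
        raise ValueError(f"{definition.name}: {value} is above {definition.maximum}")
    return value


# Assumed definition: an occupancy stored as a real number between 0 and 1.
occupancy = ItemDefinition("_atom_site.occupancy", float, 0.0, 1.0)
print(validate(occupancy, "1.00"))
```

Because the checks are driven by the definition rather than by the data name, the same validate function would serve any dictionary written in this style.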
In 1990, the IUCr formed a working group to expand the core dictionary to include data items relevant to the macromolecular crystallographic experiment. Version 1.0 of the mmCIF dictionary (Fitzgerald et al.9, mmCIF3), which encompasses many data items from the current core dictionary (IUCr10), is in the final stage of review by COMCIFS, the IUCr-appointed committee overseeing CIF developments. This dictionary has been written using DDL v2.1.1 (Westbrook and Hall11), which is significantly enhanced over, yet upwardly compatible with, DDL v1.4 (IUCr12), currently used for the small molecule dictionary.
In developing version 1.0 of the mmCIF dictionary we made the following
Based on the above, a mmCIF dictionary with approximately 1500 data items (including those data items taken from the small molecule dictionary) was developed. It is not expected that all relevant data items will be present in each mmCIF data file. What data items are mandatory to describe the structure and experiment adequately needs to be decided by community consensus.
The format of a mmCIF containing structural data can best be introduced through analogy with the existing PDB format. A PDB file consists of a series of records each identified by a keyword (e.g., HEADER, COMPND) of up to 6 characters. The format and content of fields within a record are dependent on the keyword. A mmCIF, on the other hand, always consists of a series of name-value pairs (a data item) defined by STAR, where the data name is preceded by a leading underscore (_) to distinguish it from the data value. Thus, every field in a PDB record is represented in mmCIF by a specific data name. The PDB HEADER record,
HEADER PLANT SEED PROTEIN 11-OCT-91 1CBN
is represented in mmCIF by data items such as
_struct.title 'PLANT SEED PROTEIN'
_struct_keywords.text 'plant seed protein'
The name-value pairing represents a major departure from the PDB file format and has the advantage of providing an explicit reference to each item of data within the data file, rather than having the interpretation left to the software reading the file. The name matches an entry in the mmCIF dictionary where characteristics of that data item are explicitly defined. Where multiple values for the same data item exist, the name of the data item or items concerned is declared in a header and the associated values follow in strict rotation. This is a STAR rule referred to as a loop_ construct. This loop_ construct is illustrated in the representation of atomic coordinates.
ATOM N N VAL A 11 . 25.369 30.691 11.795 1.00 17.93 . 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 11 2
ATOM C C VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 11 3
# [data omitted]
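The strict rotation of values in a loop_ can be sketched in a few lines of Python. The header of the loop_ is not reproduced with the rows above, so the fifteen data names in ASSUMED_HEADER are an assumption chosen to match the fifteen columns; the parser also assumes that no value is quoted or contains white space.

```python
# Assumed header: one data name per column of the rows shown above.
ASSUMED_HEADER = [
    "_atom_site.group_PDB", "_atom_site.type_symbol",
    "_atom_site.label_atom_id", "_atom_site.label_comp_id",
    "_atom_site.label_asym_id", "_atom_site.label_seq_id",
    "_atom_site.label_alt_id", "_atom_site.cartn_x",
    "_atom_site.cartn_y", "_atom_site.cartn_z",
    "_atom_site.occupancy", "_atom_site.B_iso_or_equiv",
    "_atom_site.footnote_id", "_atom_site.auth_seq_id",
    "_atom_site.id",
]


def parse_loop(text):
    """Parse one loop_ into a list of {data name: value} rows."""
    tokens = []
    for line in text.splitlines():
        tokens.extend(line.split("#", 1)[0].split())   # drop comments
    if not tokens or tokens[0] != "loop_":
        raise ValueError("expected the text to start with loop_")
    names = []
    i = 1
    while i < len(tokens) and tokens[i].startswith("_"):
        names.append(tokens[i])                         # header names
        i += 1
    values = tokens[i:]
    if not names or len(values) % len(names) != 0:
        raise ValueError("values do not fill a whole number of rows")
    # Values follow the names in strict rotation, one row at a time.
    return [dict(zip(names, values[j:j + len(names)]))
            for j in range(0, len(values), len(names))]


example = "loop_\n" + "\n".join(ASSUMED_HEADER) + """
ATOM N N VAL A 11 . 25.369 30.691 11.795 1.00 17.93 . 11 1
ATOM C CA VAL A 11 . 25.970 31.965 12.332 1.00 17.75 . 11 2
ATOM C C VAL A 11 . 25.569 32.010 13.881 1.00 17.83 . 11 3
# [data omitted]
"""
rows = parse_loop(example)
print(rows[1]["_atom_site.label_atom_id"], rows[1]["_atom_site.cartn_x"])
```

Each row is keyed by data name rather than by column position, which is the practical benefit of the name-value pairing over fixed-column PDB records.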
Note that the name construct is of the form _category.extension. The category explicitly defines a natural grouping of data items such that all data items of a single category are contained within a single loop_. There is no restriction on the length of name, beyond the record length limit of 80 characters mentioned below, and while there is no formal syntax within name beyond the category and extension separated by a period, by convention the category and extension are represented as an informal hierarchy of parts, with each part separated by an underscore (_). The components of _atom_site.label are examples.
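A minimal sketch of how software might separate a data name into its category and extension follows. The function name split_data_name is hypothetical, and the informal underscore-separated parts are returned only as a convenience, since the text notes they carry no formal syntax.

```python
def split_data_name(name):
    """Split '_category.extension' into (category, extension)."""
    if not name.startswith("_"):
        raise ValueError("a data name begins with an underscore")
    category, dot, extension = name[1:].partition(".")
    if not dot or not category or not extension:
        raise ValueError("expected the form _category.extension")
    return category, extension


category, extension = split_data_name("_struct_keywords.text")
# Informal hierarchy of parts within the category, by convention only.
parts = category.split("_")
print(category, extension, parts)
```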
Questions that arise concerning the separation of data names and data values are solved with some additional syntax. For example, what if the data value contains white space, an underscore, or runs over several lines? Similarly, what if a value in a loop_ is undefined or has no meaning in the context in which it is defined? The following syntax rules, which are a more restricted set of rules than permitted by STAR, complete the mmCIF description.
Comments are preceded by a hash (#) and terminated by a new line.
Data values on a single line may be delimited by pairs of single (') or double (") quotes.
Data values that extend beyond a single line are enclosed within semicolons (;) as the first character of the line that begins the text block and the first character of the line following the last line of text.
Data values which are unknown are represented by a question mark (?).
Data values which are undefined are represented by a period (.).
The length of a record in mmCIF is restricted to 80 characters.
Only printable ASCII characters are permitted.
Only a single level of loop_ is permissible.
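The rules above are simple enough to sketch as a small tokenizer. This is an illustration under stated assumptions, not a conforming parser: it does not handle quote characters embedded in quoted values, it does not check loop_ nesting, and the token kinds and the _example.* data names are invented for this example.

```python
def classify(word):
    """Assign a kind to an unquoted word."""
    if word == "?":
        return ("unknown", None)
    if word == ".":
        return ("undefined", None)
    if word.startswith("_"):
        return ("name", word)
    if word.lower().startswith("data_"):
        return ("data_block", word[5:])
    if word.lower() == "loop_":
        return ("loop", None)
    return ("value", word)


def tokenize(text):
    """Return a list of (kind, value) tokens for a small mmCIF fragment."""
    lines = text.split("\n")
    for number, line in enumerate(lines, start=1):
        if len(line) > 80:
            raise ValueError(f"line {number} is longer than 80 characters")
        if any(not (32 <= ord(c) <= 126) and c != "\t" for c in line):
            raise ValueError(f"line {number} has a non-printable character")
    tokens = []
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):
            # Text block: runs until the next line that starts with ';'.
            body = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                body.append(lines[i])
                i += 1
            if i == len(lines):
                raise ValueError("unterminated text block")
            tokens.append(("value", "\n".join(body).strip()))
            line = lines[i][1:]           # anything after the closing ';'
        pos = 0
        while pos < len(line):
            ch = line[pos]
            if ch.isspace():
                pos += 1
            elif ch == "#":
                break                     # comment runs to the end of the line
            elif ch in "'\"":
                end = line.find(ch, pos + 1)
                if end == -1:
                    raise ValueError("unterminated quoted value")
                tokens.append(("value", line[pos + 1:end]))
                pos = end + 1
            else:
                end = pos
                while end < len(line) and not line[end].isspace():
                    end += 1
                tokens.append(classify(line[pos:end]))
                pos = end
        i += 1
    return tokens


sample = """data_1CBN
# a comment
_struct.title 'PLANT SEED PROTEIN'
_example.text
; The function of this protein is unknown.
;
_example.unknown ?
_example.undefined .
"""
for token in tokenize(sample):
    print(token)
```

Quoted values are always reported as kind "value", so a quoted question mark or period is not mistaken for an unknown or undefined value.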
To complete the introductory picture of the appearance of a mmCIF data file consider the notion of scope. A PDB file has essentially one form of scope - the complete file. Thus, a single structure or an ensemble of structures is represented by a single file with each member of the ensemble separated by a PDB MODEL keyword record. There is no computer readable mechanism for associating components of say the REMARK records with a particular member of the ensemble. The mmCIF representation deals with this issue by using the STAR data block concept. Data blocks begin with data_ and have a scope that extends until the next data_ or an end-of-file is reached. A name may appear only once in a data block, but data items may appear in any order. A consequence of these STAR rules is that the combination of data block name and data name is always unique.
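The scope rules can be illustrated with a short reader that keys every value by data block name and data name. It is a sketch that assumes single-line, non-looped items only, relies on Python's shlex module to strip quotes and comments, and treats a repeated data block name as an error; the block names model_1 and model_2 are hypothetical.

```python
import shlex


def read_data_blocks(text):
    """Return {block name: {data name: value}} for simple name-value items."""
    blocks = {}
    items = None       # items of the data block currently in scope
    pending = None     # data name still waiting for its value
    for line in text.splitlines():
        for token in shlex.split(line, comments=True):
            if pending is not None:
                items[pending] = token
                pending = None
            elif token.lower().startswith("data_"):
                block_name = token[5:]
                if block_name in blocks:
                    raise ValueError(f"repeated data block {block_name}")
                items = blocks[block_name] = {}   # scope of a new block begins
            elif token.startswith("_"):
                if items is None:
                    raise ValueError("data item outside any data block")
                if token in items:
                    raise ValueError(f"{token} appears twice in one data block")
                pending = token
            else:
                raise ValueError(f"value {token!r} has no data name")
    if pending is not None:
        raise ValueError(f"{pending} has no value")
    return blocks


ensemble = """
data_model_1
_struct.title 'First member of the ensemble'
data_model_2
_struct.title 'Second member of the ensemble'
"""
blocks = read_data_blocks(ensemble)
print(blocks["model_2"]["_struct.title"])
```

The same data name may recur in different data blocks, so information can be tied to one member of an ensemble, which the PDB REMARK records cannot express.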
Table I summarizes the category groups, their associated individual categories and their definitions as found in the mmCIF dictionary version 0.8.02 dated March 18, 1996. This comprehensive hierarchy of categories follows closely the progress of the experiment and the subsequent structure description.
The categories describing the crystallographic experiment are relatively self explanatory and will not be detailed here. We will, however, outline the data model used to describe the resulting structure and its description.
The structural data model can most simply be described as containing three interrelated groups of categories: ATOM_SITE categories, which give the coordinates and related information for the structure; ENTITY categories, which describe the chemistry of the components of the structure; and STRUCT categories, which analyze and describe the structure.
The data items in the ATOM_SITE category record details about the atom sites, including the coordinates, the thermal displacement parameters, and the errors in the parameters, and specify the component of the asymmetric unit to which each atom belongs.
The ENTITY category categorizes the unique chemical components of the asymmetric unit as to whether they are polymer, non-polymer or water. The characteristics of a polymer are described by the ENTITY_POLY category and the sequence of the chemical components comprising the polymer by the ENTITY_POLY_SEQ category. The CHEM_COMP categories describe the standard geometries of the monomer units such as the amino acids and nucleotides as well as that of the ligands and solvent groups.
The STRUCT_BIOL category allows the author to describe the biologically relevant features of a structure and its component parts. The STRUCT_BIOL_GEN category provides the information about how to generate the biological unit from the components of the asymmetric unit, which are in turn specified by the STRUCT_ASYM category. Various features of the structure such as intermolecular hydrogen bonds, special sites and secondary structure are specified in STRUCT_CONN, STRUCT_SITE and STRUCT_CONF, respectively. Figure 2 illustrates the interrelationships among these categories.
Figure 2 a) The relationships between categories which describe biologically relevant structure. b) The relationships between categories describing polymer structure, the atomic coordinates, and those categories which describe structural features such as hydrogen bonding and secondary structure.
These and other major descriptive features of the mmCIF dictionary are best explored by example. A browsable dictionary can be found at the mmCIF WWW site (mmCIF3) as well as some complete examples. Complete examples for all nucleic acids can be found at the Nucleic Acid Database WWW site (NDB13). Partial mmCIFs for every structure in the PDB are available at two WWW sites (PDB14, SDSC15) having been generated with the program pdb2cif (Bernstein et al.16).
Starting simply, consider the protein crambin, a single polypeptide chain of 48 residues. The low temperature form at 0.83 Å resolution (Teeter et al.17; PDB code 1CBN) has nearly all the protein-bound solvent resolved, as well as a co-crystallized ethanol molecule. The protein shows recognizable sequence microheterogeneity at positions 22 (Pro/Ser) and 25 (Leu/Ile), and 24% of residues show discrete disorder. While these features are described using data items in the mmCIF dictionary, they are not detailed here for the sake of simplicity.
Since the biological function of this molecule is unknown, no biologically relevant structural components are justified. A single identifier (crambin_1) is used to identify the unknown biological function of this molecule.
; The function of this protein is unknown and therefore the
biological unit is assumed to be the single polypeptide
chain without co-crystallization factors i.e. ethanol.
The single biological descriptor, crambin_1, is generated from the single polypeptide chain found in the asymmetric unit without any symmetry transformations applied. The polypeptide chain is designated chain_a.
The chemical components of the asymmetric unit are three entities: a single polypeptide chain characterized as a polymer, ethanol characterized as non-polymer, and water. Whether each entity is a natural product or has been synthesized is also indicated.
A polymer 4716 'NATURAL'
ethanol non-polymer 52 'SYNTHETIC'
H20 water 18 .
It is then possible to expand upon this basic description of each entity using _entity.id as a reference. For example, the common and systematic names are specified as,
_entity_name_sys.name 'Crambe Abyssinica'
Similarly, the natural and synthetic description can be given in more detail, so for the natural product we have,
_entity_src_nat.common_name 'Abyssinian Cabbage'
Using the entities as building blocks the contents of the asymmetric unit are specified. Crambin is straightforward since each entity appears only once in the asymmetric unit.
chain_a A 'Single polypeptide chain'
ethanol ethanol 'Cocrystallized ethanol molecule'
H20 H20 .
Entities classified as polymer, in this instance only the entity identified as A, are further described. First, the overall features of the polypeptide chain are given,
_entity_poly.type_details 'Microheterogeneity at 22 and 25'
and then the component parts,
A 1 THR A 2 THR
# [data omitted]
A 22 PRO A 23 GLU
A 24 ALA A 25 LEU
# [data omitted]
A 47 ALA A 48 ASN
The entity may also exist in other databases and these references may be cited and described. For the entity designated A, which is defined in Genbank but without sequence microheterogeneity we have,
1 A crambin_1 'Genbank' '493916' 'entire' 'no' .
2 A crambin_1 'PDB' '1CBN' 'entire' 'no' .
Once each polymer entity is defined, the details of the secondary structure are defined using the STRUCT_CONF category.
H1 HELX_RH_AL_P ILE chain_a 7 PRO chain_a 19 'HELX-RH3T 17-19'
H2 HELX_RH_AL_P GLU chain_a 23 THR chain_a 30 'Alpha-N start'
S1 STRN_P CYS chain_a 32 ILE chain_a 35 .
S2 STRN_P THR chain_a 1 CYS chain_a 4 .
S3 STRN_P ASN chain_a 46 ASN chain_a 46 .
S4 STRN_P THR chain_a 39 PRO chain_a 41 .
T1 TURN_TY1_P ARG chain_a 17 GLY chain_a 20 .
T2 TURN_TY1_P PRO chain_a 41 TYR chain_a 44 .
These assignments are more finely enumerated than those made in a PDB file with the record types HELIX, TURN and SHEET. Moreover, the STRUCT_CONF_TYPE category (Table I) specifies the method of assignment which could, for example, be deduced by the crystallographer from the electron density maps or defined algorithmically.
HELX_RH_AL_P 'author judgement' .
STRN_P 'author judgement' .
TURN_TY1_P 'author judgement' .
# HELX_RH_P 'Kabsch and Sander' 'Biopolymers (1983) 22:2577'
The commented entry at the end is a hypothetical example for a calculated assignment. Data items also exist (Table I) for the description of beta sheets, but are not shown in this introductory example.
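Because each conformation refers to a type defined in STRUCT_CONF_TYPE, the reference can be checked mechanically. The sketch below is one way such a check might look. The function name is hypothetical, the rows are reduced to the identifiers needed, and the row X1 is invented here to show a reference to a type that is not defined.

```python
def undefined_conf_types(conf_rows, type_ids):
    """Return conformation types that are used but never defined."""
    defined = set(type_ids)
    used = {conf_type for _conf_id, conf_type in conf_rows}
    return sorted(used - defined)


# Types defined in STRUCT_CONF_TYPE in the crambin example.
type_ids = ["HELX_RH_AL_P", "STRN_P", "TURN_TY1_P"]

# (identifier, type) pairs; X1 is a made-up row using an undefined type.
conf_rows = [
    ("H1", "HELX_RH_AL_P"),
    ("S1", "STRN_P"),
    ("X1", "HELX_RH_P"),
]
print(undefined_conf_types(conf_rows, type_ids))
```

A check of this kind would also catch a type identifier that is spelled differently in the two categories.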
Interactions between various portions of the structure are described by the STRUCT_CONN and associated STRUCT_CONN_TYPE categories.
SS1 disulf CYS chain_a 3 S 1_555 CYS chain_a 40 S 1_555 .
SS2 disulf CYS chain_a 4 S 1_555 CYS chain_a 32 S 1_555 .
# [data omitted]
HB1 hydrog SER chain_a 6 OG positive 1_555 .
LEU chain_a 8 O negative 1_556 .
HB2 hydrog ARG chain_a 17 N positive 1_555 .
ASP chain_a 43 O negative 1_554 .
# [data omitted]
These intermolecular interactions are partially specified on PDB CONECT records. However, mmCIF provides an additional level of detail such that the criteria used to define an interaction may be given using the STRUCT_CONN_TYPE category. Here is a hypothetical example used to describe a salt bridge and a hydrogen bond.
saltbr 'negative to positive distance > 2.5 \%A and < 3.2 \%A ' .
hydrog 'N to O distance > 2.5 \%A, < 3.2 \%A, NOC angle < 120°' .
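Recording the criteria makes it possible to apply them in software. As a sketch, the hypothetical hydrogen bond criterion above (N to O distance between 2.5 and 3.2 Å, NOC angle below 120°) could be tested as follows. The function names and the coordinates are illustrative assumptions, and the NOC angle is taken to be the angle at the O atom.

```python
import math


def distance(a, b):
    """Euclidean distance between two points given as (x, y, z)."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def angle_degrees(a, vertex, b):
    """Angle a-vertex-b in degrees."""
    v1 = [p - q for p, q in zip(a, vertex)]
    v2 = [p - q for p, q in zip(b, vertex)]
    dot = sum(x * y for x, y in zip(v1, v2))
    norm = math.sqrt(sum(x * x for x in v1)) * math.sqrt(sum(x * x for x in v2))
    cosine = max(-1.0, min(1.0, dot / norm))   # guard against rounding
    return math.degrees(math.acos(cosine))


def meets_hydrog_criterion(n, o, c):
    """Apply the hypothetical 'hydrog' criterion quoted in the text."""
    d = distance(n, o)
    return 2.5 < d < 3.2 and angle_degrees(n, o, c) < 120.0


# Illustrative coordinates in angstroms, not taken from any structure.
o_atom = (0.0, 0.0, 0.0)
c_atom = (0.0, 1.2, 0.0)
print(meets_hydrog_criterion((3.0, 0.0, 0.0), o_atom, c_atom))
print(meets_hydrog_criterion((3.5, 0.0, 0.0), o_atom, c_atom))
```

The first call satisfies both conditions, while the second fails on distance alone.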
Consider a mmCIF representation for a more complex structure: the gene regulatory protein 434 CRO complexed with a 20 base pair DNA segment containing the operator (Mondragon and Harrison18; PDB code 3CRO).
; The complex consists of 2 protein domains bound to a
20 base pair DNA segment.
; Each of the 2 protein domains is a single homologous
polypeptide chain of 71 residues designated L and R.
; The two strands (A and B) are complementary given a one
The protein/DNA complex, the protein, and the DNA are considered as three separate biological components each generated from the contents of the asymmetric unit. No crystallographic symmetry need be applied to generate the biologically relevant components.
complex L 1_555
complex R 1_555
complex A 1_555
complex B 1_555
protein L 1_555
protein R 1_555
DNA A 1_555
DNA B 1_555
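The rows above can be regrouped by biological component in a few lines. The sketch assumes the three columns are the biological unit identifier, the asymmetric unit component, and a symmetry code, since the loop_ header is not reproduced here. The symmetry code is read in the usual crystallographic form n_klm, where n is the symmetry operator number and k, l and m are unit cell translations offset by 5, so that 1_555 means the identity operation with no translation.

```python
from collections import defaultdict


def parse_symmetry(code):
    """Split 'n_klm' into (operator number, (tx, ty, tz))."""
    operator, _, klm = code.partition("_")
    if not operator.isdigit() or len(klm) != 3 or not klm.isdigit():
        raise ValueError(f"unrecognized symmetry code {code!r}")
    return int(operator), tuple(int(digit) - 5 for digit in klm)


rows = """
complex L 1_555
complex R 1_555
complex A 1_555
complex B 1_555
protein L 1_555
protein R 1_555
DNA A 1_555
DNA B 1_555
"""

units = defaultdict(list)
for line in rows.split("\n"):
    if not line.strip():
        continue
    biol_id, asym_id, symmetry = line.split()   # assumed column order
    units[biol_id].append((asym_id, parse_symmetry(symmetry)))

for biol_id, members in units.items():
    print(biol_id, [asym_id for asym_id, _ in members])
```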
Since each protein domain is chemically identical they constitute a single entity which has been designated dimer. The complementary DNA strands are not chemically identical and therefore constitute two separate entities:
L dimer '71 residue polypeptide chain'
R dimer '71 residue polypeptide chain'
A DNA_A '20 base strand'
B DNA_B '20 base strand'
H2O water 'solvent'
Features of the 434 CRO secondary structure and intermolecular contacts can be described in the same way as for crambin and are not repeated here.
In preparing these examples of representing macromolecular structure using mmCIF it was necessary to return to the original papers, since not all the relevant information could be retrieved from the PDB entry. This is evidence that mmCIF provides additional information, which also has the advantage of being in a computer readable form. The consequence is that it places additional demands on the person preparing the mmCIF. It is anticipated that full use of the expressive power of mmCIF will only be made when existing structure solution and refinement programs are modified to maintain mmCIF data items and software tools exist to help prepare and use a mmCIF effectively. A variety of software tools have been developed for mmCIF (Bernstein et al.16; Westbrook et al.19). A description of a variety of other efforts can be found elsewhere (Bourne20). Code and documentation are available at the mmCIF WWW site (mmCIF3). A long term goal might be to maintain all aspects of the structure determination in an electronic laboratory notebook that uses mmCIF as its underlying data representation. The notebook would have a "journal" button that would be used at the appropriate time.
The development of the mmCIF dictionary has been a community effort.
2. IEEE Metadata. http://www.llnl.gov/liv_comp/metadata/ (1996).
3. mmCIF. http://ndbserver.rutgers.edu/NDB/mmcif/ (1996).
4. P.E. Bourne. http://www.sdsc.edu/pb/cif/overview.html (1996).
13. NDB. http://ndbserver.rutgers.edu/ (1996).
Table I The mmCIF category groups and associated categories taken from the mmCIF dictionary
|CATEGORY GROUPS AND MEMBERS
|All category groups
|Details of each atomic position
|Anisotropic thermal displacement
|Details pertaining to all atom sites
|Details pertaining to alternative atom sites as found in disorder etc.
|Details pertaining to alternative atom sites as found in ensembles e.g. from NMR and modeling experiments
|Generation of ensembles from multiple conformations
|Comments concerning one or more atom sites
|Properties of an atom at a particular atom site
|Detail on the creation and updating of the mmCIF
|Author(s) of the mmCIF including address information
|Author(s) to be contacted
|Unit cell parameters
|How the cell parameters were measured
|Details of the reflections used to determine the unit cell parameters
|Details of the chemical components
|Bond angles in a chemical component
|Atoms defining a chemical component
|Characteristics of bonds in a chemical component
|Details of the chiral centers in a chemical component
|Atoms comprising a chiral center in a chemical component
|Linkages between chemical groups
|Planes found in a chemical component
|Atoms comprising a plane in a chemical component
|Details of the torsion angles in a chemical component
|Target values for the torsion angles in a chemical component
|Details of the linkages between chemical components
|Details of the angles in the chemical component linkage
|Details of the bonds in the chemical component linkage
|Chiral centers in a link between two chemical components
|Atoms bonded to a chiral atom in a linkage between two chemical components
|Planes in a linkage between two chemical components
|Atoms in the plane forming a linkage between two chemical components
|Torsion angles in a linkage between two chemical components
|Target values for torsion angles enumerated in a linkage between two chemical components
|Composition and chemical properties
|Atom position for 2-D chemical diagrams
|Bond specifications for 2-D chemical diagrams
|Literature cited in reference to the data block
|Author(s) of the citations
|Editor(s) of citations where applicable
|Computer programs used in the structure analysis
|More detailed description of the software used in the structure analysis
|Superseded by DATABASE_2
|Codes assigned to mmCIFs by maintainers of recognized databases
|CAVEAT records originally found in the PDB version of the mmCIF data file
|MATRIX records originally found in the PDB version of the mmCIF data file
|REMARK records originally found in the PDB version of the mmCIF data file
|Taken from the PDB REVDAT records
|Taken from the PDB REVDAT records
|TVECT records originally found in the PDB version of the mmCIF data file
|Details of diffraction data and the diffraction experiment
|Diffraction attenuator scales
|Details on how the diffraction data were measured
|Orientation matrices used when measuring data
|Reflections that define the orientation matrix
|Details on the radiation and detector used to collect data
|Unprocessed reflection data
|Details pertaining to all reflection data
|Details of reflections used in scaling
|Details of the standard reflections used during data collection
|Details pertaining to all standard reflections
|Details pertaining to each unique chemical component of the structure
|Keywords describing each entity
|Details of the links between entities
|Common name for the entity
|Systematic name for the entity
|Characteristics of a polymer
|Sequence of monomers in a polymer
|Source of the entity
|Details of the natural source of the entity
|Identifier for the data block
|Experimental details relating to the physical properties of the material, particularly absorption
|Physical properties of the crystal
|Details pertaining to the crystal faces
|Conditions and methods used to grow the crystals
|Components of the solution from which the crystals were grown
|Derived geometry information
|Derived bond angles
|Derived intermolecular contacts
|Derived torsion angles
|Used by journals and not the mmCIF preparer
|General phasing information
|Phase averaging of multiple observations
|Phasing information from an isomorphous model
|Phasing via multiwavelength anomalous dispersion (MAD)
|Details of a cluster of MAD experiments
|Overall features of the MAD experiment
|Ratios between pairs of MAD datasets
|Details of individual MAD datasets
|Phasing via single and multiple isomorphous replacement
|Details of individual derivatives used in MIR
|Details of calculated structure factors
|As above but for shells of resolution
|Details of heavy atom sites
|Details of each shell used in MIR
|Details of data sets used in phasing
|Values of structure factors used in phasing
|Used when submitting a publication as a mmCIF
|Authors of the publication
|To include special data names in the processing of the manuscript
|Details of the structure refinement
|Details pertaining to the refinement of isotropic B values
|History of the refinement
|Details pertaining to the least squares restraints used in refinement
|Results of refinement broken down by resolution
|Details pertaining to the refinement of occupancy factors
|Details pertaining to the reflections used to derive the atom sites
|Details pertaining to all reflections
|Details pertaining to scaling factors used with respect to the structure factors
|As REFLNS, but by shells of resolution
|Details pertaining to a description of the structure
|Details pertaining to structure components within the asymmetric unit
|Details pertaining to components of the structure that have biological significance
|Details pertaining to generating biological components
|Keywords for describing biological components
|Description of views of the structure with biological significance
|Conformations of the backbone
|Details of each backbone conformation
|Details pertaining to intermolecular contacts
|Details of each type of intermolecular contact
|Description of the chemical structure
|Calculation summaries at the monomer level
|Calculation summaries specific to nucleic acid monomers
|Calculation summaries specific to protein monomers
|Calculation summaries specific to cis peptides
|Details of domains within an ensemble of domains
|Beginning and end points within polypeptide chains forming a specific domain
|Description of ensembles
|Description of domains related by non-crystallographic symmetry
|Operations required to superimpose individual members of an ensemble
|External database references to biological units within the structure
|Describes the alignment of the external database sequence with that found in the structure
|Describes differences in the external database sequence with that found in the structure
|Beta sheet description
|Hydrogen bond description in beta sheets
|Order of residue ranges in beta sheets
|Residue ranges in beta sheets
|Topology of residue ranges in beta sheets
|Details pertaining to specific sites within the structure
|Details pertaining to how the site is generated
|Keywords describing the site
|Description of views of the specified site
|Details pertaining to space group symmetry
|Equivalent positions for the specified space group