PDBx/mmCIF Syntax

The syntax used in mmCIF data files and dictionaries is derived from the STAR (Self-defining Text Archive and Retrieval) grammar. In its simplest form, an mmCIF file looks like a paired collection of data item names and values. In the following example of assigning values to cell constants, for instance, the interpretation of the syntax is straightforward.

# 
_cell.entry_id           4HHB 
_cell.length_a           63.150 
_cell.length_b           83.590 
_cell.length_c           53.800 
_cell.angle_alpha        90.00 
_cell.angle_beta         99.34 
_cell.angle_gamma        90.00 
_cell.Z_PDB              4

mmCIF data item names are identified by the leading underscore character. The underscore is followed by a text string which is interpreted in mmCIF as containing both a category name and a keyword name separated by a period. The keyword portion of the name is the unique identifier of the data item within the category. In the example above, all of the data items belong to the CELL category. The above example also illustrates the one-to-one correspondence required between item names and item values. Data category and data item names are not case sensitive.
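
The naming rule described above can be sketched in a few lines of Python. This is only an illustration: the helper name split_item_name is hypothetical and is not part of any mmCIF library. It assumes that lowercasing is an acceptable way to normalize names for case-insensitive comparison.

```python
from typing import Tuple


def split_item_name(item_name: str) -> Tuple[str, str]:
    """Split an mmCIF item name into (category, keyword).

    Hypothetical helper for illustration only.
    """
    if not item_name.startswith("_"):
        raise ValueError(f"not a data item name: {item_name!r}")
    # Everything after the leading underscore is "category.keyword".
    category, sep, keyword = item_name[1:].partition(".")
    if not sep or not category or not keyword:
        raise ValueError(f"expected _category.keyword: {item_name!r}")
    # Names are not case sensitive, so normalize before comparing.
    return category.lower(), keyword.lower()


print(split_item_name("_cell.length_a"))
print(split_item_name("_cell.Z_PDB"))
```

With this normalization, _cell.Z_PDB and _CELL.z_pdb compare as the same item.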

The next example illustrates how text strings are expressed in mmCIF. Short text strings may be enclosed in single or double quotation marks. Text strings that span multiple lines are enclosed by semicolons placed at the first character position of the line. Two special characters serve as placeholders for mmCIF item values that cannot be explicitly assigned. The question mark (?) marks an item value as missing. A period (.) indicates that there is no appropriate value for the item or that a value has been intentionally omitted.

_entity_src_gen.entity_id                          1 
_entity_src_gen.pdbx_gene_src_gene                 'MT3707, MTCY07H7B.20, panC, Rv3602c' 
_entity_src_gen.pdbx_gene_src_scientific_name      'Mycobacterium tuberculosis' 
_entity_src_gen.pdbx_gene_src_ncbi_taxonomy_id     1773 
_entity_src_gen.pdbx_host_org_scientific_name      'Escherichia coli' 
_entity_src_gen.pdbx_host_org_ncbi_taxonomy_id     562 
_entity_src_gen.pdbx_host_org_vector_type          plasmid 
_entity_src_gen.pdbx_host_org_tissue               ? 
_entity_src_gen.pdbx_host_org_vector               ? 
_entity_src_gen.plasmid_name                       pET30a 

_struct_ref.id                         1 
_struct_ref.db_name                    UNP 
_struct_ref.db_code                    PANC_MYCTU 
_struct_ref.pdbx_db_accession          P0A5R0 
_struct_ref.entity_id                  1 
_struct_ref.biol_id                    . 
_struct_ref.pdbx_seq_one_letter_code   
;MTIPAFHPGELNVYSAPGDVADVSRALRLTGRRVMLVPTMGALHEGHLALVRAAKRVPGS 
VVVVSIFVNPMQFGAGEDLDAYPRTPDDDLAQLRAEGVEIAFTPTTAAMYPDGLRTTVQP 
GPLAAELEGGPRPTHFAGVLTVVLKLLQIVRPDRVFFGEKDYQQLVLIRQLVADFNLDVA 
VVGVPTVREADGLAMSSRNRYLDPAQRAAAVALSAALTAAAHAATAGAQAALDAARAVLD 
AAPGVAVDYLELRDIGLGPMPLNGSGRLLVAARLGTTRLLDNIAIEIGTFAGTDRPDGYR 
;
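
The quoting and placeholder rules shown in the example above can be sketched as a small tokenizer. This is a simplified illustration, not a complete mmCIF parser: the names tokenize, MISSING and NOT_APPLICABLE are hypothetical. It assumes that a closing quote counts only when followed by whitespace or the end of the line, and that a quoted '?' or '.' is ordinary text rather than a placeholder.

```python
import re
from typing import List

# Hypothetical sentinels for the two placeholder values.
MISSING = object()          # unquoted ?
NOT_APPLICABLE = object()   # unquoted .

# A token is a quoted string or a run of non-whitespace characters.
_TOKEN = re.compile(r"""'(.*?)'(?=\s|$)|"(.*?)"(?=\s|$)|(\S+)""")


def tokenize(text: str) -> List[object]:
    """Return item names and values in file order (simplified sketch)."""
    tokens: List[object] = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):
            # Multi-line text: collect lines until the closing semicolon,
            # which must also sit at the first character position.
            parts = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                parts.append(lines[i])
                i += 1
            tokens.append("\n".join(parts))
            i += 1  # skip the closing semicolon line
            continue
        for match in _TOKEN.finditer(line):
            single, double, bare = match.groups()
            if bare is None:
                # Quoted text is kept literally, even if it is '?' or '.'.
                tokens.append(single if single is not None else double)
            elif bare.startswith("#"):
                break  # the rest of the line is a comment
            elif bare == "?":
                tokens.append(MISSING)
            elif bare == ".":
                tokens.append(NOT_APPLICABLE)
            else:
                tokens.append(bare)
        i += 1
    return tokens


sample = "\n".join([
    "_entity_src_gen.pdbx_gene_src_scientific_name 'Mycobacterium tuberculosis'",
    "_entity_src_gen.pdbx_host_org_tissue ?",
    "_struct_ref.biol_id .",
    "_struct_ref.pdbx_seq_one_letter_code",
    ";MTIPAF",
    "VVVVSI",
    ";",
])
for token in tokenize(sample):
    print(repr(token))
```

Returning sentinels instead of the raw characters keeps a missing value distinguishable from a text value that happens to be a literal question mark.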

Vectors and tables of data may be encoded in mmCIF using a loop_ directive. To build a table, the data item names corresponding to the table columns are preceded by the loop_ directive, and followed by the corresponding rows of data. The following example builds a table of author names.

# 
loop_
_citation_author.citation_id 
_citation_author.name 
_citation_author.ordinal 
primary 'Fermi, G.'     1  
primary 'Perutz, M.F.'  2  
primary 'Shaanan, B.'   3  
primary 'Fourme, R.'    4  
1       'Perutz, M.F.'  5  
1       'Hasnain, S.S.' 6  
1       'Duke, P.J.'    7  
1       'Sessler, J.L.' 8  
1       'Hahn, J.E.'    9  
2       'Fermi, G.'     10 
2       'Perutz, M.F.'  11 
3       'Perutz, M.F.'  12 
4       'Teneyck, L.F.' 13 
4       'Arnone, A.'    14 
5       'Fermi, G.'     15 
6       'Muirhead, H.'  16 
6       'Greer, J.'     17 
#

The use of the loop_ directive in mmCIF has a few restrictions. First, it is required that all of the data items within the loop belong to the same mmCIF category. Second, the number of data values following the loop must be an exact multiple of the number of data item names. Finally, mmCIF prohibits the nesting of loop_ directives.
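
The first two restrictions above can be checked mechanically. The following Python sketch assumes the item names and values of one loop have already been tokenized into flat lists. The function name loop_rows is hypothetical and used only for illustration.

```python
from typing import Dict, List


def loop_rows(names: List[str], values: List[str]) -> List[Dict[str, str]]:
    """Turn the flat value list of one loop_ into table rows (sketch)."""
    # Restriction 1: every item in the loop belongs to the same category.
    categories = {name[1:].partition(".")[0].lower() for name in names}
    if len(categories) != 1:
        raise ValueError(f"loop must use exactly one category: {sorted(categories)}")
    # Restriction 2: the value count is an exact multiple of the name count.
    width = len(names)
    if len(values) % width != 0:
        raise ValueError(f"{len(values)} values cannot fill rows of {width} items")
    return [
        dict(zip(names, values[start:start + width]))
        for start in range(0, len(values), width)
    ]


names = [
    "_citation_author.citation_id",
    "_citation_author.name",
    "_citation_author.ordinal",
]
values = ["primary", "Fermi, G.", "1", "primary", "Perutz, M.F.", "2"]
for row in loop_rows(names, values):
    print(row)
```

The third restriction, no nested loops, is a property of the file syntax itself and is not something this row-building step can observe.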

mmCIF uses data blocks to organize related information and data. A data block is a logical partition of a data file or dictionary created using a data_ directive. A data block may be named by appending a text string after the data_ directive. It is terminated either by another data_ directive or by the end of the file. The following is a very simple example of a pair of abbreviated data blocks.

#
# --- Lines beginning with # are treated as comments 
#
data_X987A
_entry.id                              X987A

_exptl_crystal.id                  'Crystal A'
_exptl_crystal.colour              'pale yellow'
_exptl_crystal.density_diffrn      1.113
_exptl_crystal.density_Matthews    1.01 

_cell.entry_id                         X987A
_cell.length_a                         95.39
_cell.length_a_esd                      0.05
_cell.length_b                         48.80
_cell.length_b_esd                      0.12
_cell.length_c                         56.27
_cell.length_c_esd                      0.06

# Second data block
data_T100A

_entry.id                           T100A

_exptl_crystal.id                  'Crystal B'
_exptl_crystal.colour              'orange'
_exptl_crystal.density_diffrn      1.156
_exptl_crystal.density_Matthews    1.06

_cell.entry_id                         T100A
_cell.length_a                         68.39
_cell.length_a_esd                      0.05
_cell.length_b                         88.70
_cell.length_b_esd                      0.12
_cell.length_c                         76.27
_cell.length_c_esd                      0.06

The above example illustrates how data blocks can be used to separate similar information pertaining to different structures. This separation is required because the mmCIF syntax prohibits the repetition of the same category at multiple places within the same data block. As a result, the simple concatenation of the contents of the above two data blocks into a single data block would be syntactically incorrect.
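
One way to read the rule above is that all items of a category must form a single contiguous run within a data block. The Python sketch below checks that reading against the item names of a block, listed in file order. The function name repeated_categories is hypothetical, and the contiguity interpretation is an assumption made for this illustration.

```python
from typing import List


def repeated_categories(item_names: List[str]) -> List[str]:
    """Return categories that reappear after another category intervened."""
    seen = set()
    repeated: List[str] = []
    previous = None
    for name in item_names:
        category = name[1:].partition(".")[0].lower()
        if category != previous:
            # A new run starts here; it is an error if this category
            # already had an earlier run in the same data block.
            if category in seen and category not in repeated:
                repeated.append(category)
            seen.add(category)
            previous = category
    return repeated


block_a = ["_entry.id", "_exptl_crystal.id", "_exptl_crystal.colour",
           "_cell.entry_id", "_cell.length_a"]
block_b = ["_entry.id", "_exptl_crystal.id", "_exptl_crystal.colour",
           "_cell.entry_id", "_cell.length_a"]

print(repeated_categories(block_a))            # each block alone is fine
print(repeated_categories(block_a + block_b))  # naive concatenation is not
```

Each block passes on its own, while the concatenation reports every category, which matches the statement that simply joining the two blocks would be syntactically incorrect.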

Merging the data blocks in the above example raises some additional issues associated with the mmCIF data model and the structure of these specific categories. In the above example, it would be possible to merge the information in the EXPTL_CRYSTAL category into a single data block by reorganizing this category using a loop_ directive. However, certain mmCIF categories like CELL and ENTRY may contain only a single value within the data block and therefore cannot be looped. The single-valued property of the data items in these categories is a consequence of the definition of the key items in these two categories. The category key for the CELL category, _cell.entry_id, is defined as a child of _entry.id. The item _entry.id is defined as the data block identifier and may therefore assume only a single value.

Definitions in the mmCIF dictionary are encapsulated in named save frames. A save frame begins with the save_ directive and is terminated by another save_ directive. Save frames are named by appending a text string to the save_ token. In mmCIF dictionaries, save frames are used to encapsulate item and category definitions. The mmCIF dictionary is composed of a data block containing thousands of save frames, where each save frame contains a different definition. Save frames may only appear in mmCIF dictionaries and they may not be nested. The following example shows the save frame containing the definition of the data item _exptl.details.

save__exptl.details
    _item_description.description
;              Any special information about the experimental work prior to the
               intensity measurement. See also _exptl_crystal.preparation.
;
    _item.name                  '_exptl.details'
    _item.category_id             exptl
    _item.mandatory_code          no
    _item_aliases.alias_name    '_exptl_special_details'
    _item_aliases.dictionary      cif_core.dic
    _item_aliases.version         2.0.1
    _item_type.code               text
save_
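
The save frame structure described above (a named save_ directive opens a frame, a bare save_ closes it, and frames may not be nested) can be sketched as a simple scan over dictionary lines. The function name save_frame_names is hypothetical. The sketch looks only at the first token of each line and does not account for multi-line text values that might themselves contain the string save_.

```python
from typing import Iterable, List, Optional


def save_frame_names(lines: Iterable[str]) -> List[str]:
    """Collect save frame names from dictionary text (simplified sketch)."""
    names: List[str] = []
    open_frame: Optional[str] = None
    for line in lines:
        parts = line.split()
        if not parts or not parts[0].lower().startswith("save_"):
            continue
        name = parts[0][len("save_"):]
        if name:
            # A named save_ directive opens a frame; nesting is not allowed.
            if open_frame is not None:
                raise ValueError(f"{name!r} opened inside {open_frame!r}")
            open_frame = name
            names.append(name)
        else:
            # A bare save_ directive terminates the current frame.
            if open_frame is None:
                raise ValueError("save_ terminator without an open frame")
            open_frame = None
    if open_frame is not None:
        raise ValueError(f"save frame {open_frame!r} is never terminated")
    return names


dictionary_text = [
    "save__exptl.details",
    "    _item.name                  '_exptl.details'",
    "    _item.category_id             exptl",
    "    _item.mandatory_code          no",
    "    _item_type.code               text",
    "     save_",
]
print(save_frame_names(dictionary_text))
```

Note that the frame name here begins with an underscore because the frame is named after the item it defines, so the directive reads save__exptl.details with two underscores.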

Save frames play a much more important role in STAR than in mmCIF. In a STAR application such as NMR-STAR, a save frame acts as a reusable cell of information which can be referenced and expanded within the file. Save frames are referenced in a STAR file by preceding the save frame name with a dollar sign. The use of save frames in mmCIF has been limited to the organization and scoping features that they provide. mmCIF does not support references to save frames or the use of save frames for purposes other than encapsulating dictionary definitions.