The syntax used in mmCIF data files and dictionaries is derived from the STAR (Self-defining Text Archive and Retrieval) grammar. In its simplest form, an mmCIF file looks like a paired collection of data item names and values. In the following example of assigning values to cell constants, for instance, the interpretation of the syntax is straightforward.
#
_cell.entry_id           4HHB
_cell.length_a           63.150
_cell.length_b           83.590
_cell.length_c           53.800
_cell.angle_alpha        90.00
_cell.angle_beta         99.34
_cell.angle_gamma        90.00
_cell.Z_PDB              4
mmCIF data item names are identified by the leading underscore character.
The underscore is followed by a text string which is interpreted in mmCIF
as containing both a category name and a keyword name separated by a
period. The keyword portion of the name is the unique identifier of the
data item within the category. In the example above, all of the data items
belong to the CELL category.
The above example also illustrates the one-to-one correspondence required between
item names and item values. Data category and data item names are not case sensitive.
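The naming rule described above can be sketched in a few lines of Python. This is only an illustration, not part of any mmCIF toolkit: the function name `split_item_name` is hypothetical, and lower-casing is just one possible way to honour the rule that names are not case sensitive.

```python
def split_item_name(name):
    """Split an mmCIF item name into (category, keyword), both lower-cased."""
    if not name.startswith("_"):
        raise ValueError(f"not a data item name: {name!r}")
    # The first period separates the category name from the keyword.
    category, sep, keyword = name[1:].partition(".")
    if not sep or not category or not keyword:
        raise ValueError(f"expected _category.keyword, got {name!r}")
    # Names are not case sensitive, so normalise before comparing.
    return category.lower(), keyword.lower()


print(split_item_name("_cell.length_a"))  # ('cell', 'length_a')
print(split_item_name("_cell.Z_PDB") == split_item_name("_CELL.z_pdb"))  # True
```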
The next example illustrates how text strings are expressed in mmCIF.
Short text strings may be enclosed in single or double quotation marks.
Text strings which span multiple lines are enclosed by semicolons that are placed
at the first character position of the line.
There are two special characters
used as placeholders for mmCIF item values which for some reason cannot be
explicitly assigned.
The question mark (?) is used to mark an item value as missing.
A period (.) may be used to indicate that there is no
appropriate value for the item or that a value has been intentionally omitted.
_entity_src_gen.entity_id                         1
_entity_src_gen.pdbx_gene_src_gene                'MT3707, MTCY07H7B.20, panC, Rv3602c'
_entity_src_gen.pdbx_gene_src_scientific_name     'Mycobacterium tuberculosis'
_entity_src_gen.pdbx_gene_src_ncbi_taxonomy_id    1773
_entity_src_gen.pdbx_host_org_scientific_name     'Escherichia coli'
_entity_src_gen.pdbx_host_org_ncbi_taxonomy_id    562
_entity_src_gen.pdbx_host_org_vector_type         plasmid
_entity_src_gen.pdbx_host_org_tissue              ?
_entity_src_gen.pdbx_host_org_vector              ?
_entity_src_gen.plasmid_name                      pET30a
_struct_ref.id                                    1
_struct_ref.db_name                               UNP
_struct_ref.db_code                               PANC_MYCTU
_struct_ref.pdbx_db_accession                     P0A5R0
_struct_ref.entity_id                             1
_struct_ref.biol_id                               .
_struct_ref.pdbx_seq_one_letter_code
;MTIPAFHPGELNVYSAPGDVADVSRALRLTGRRVMLVPTMGALHEGHLALVRAAKRVPGS
VVVVSIFVNPMQFGAGEDLDAYPRTPDDDLAQLRAEGVEIAFTPTTAAMYPDGLRTTVQP
GPLAAELEGGPRPTHFAGVLTVVLKLLQIVRPDRVFFGEKDYQQLVLIRQLVADFNLDVA
VVGVPTVREADGLAMSSRNRYLDPAQRAAAVALSAALTAAAHAATAGAQAALDAARAVLD
AAPGVAVDYLELRDIGLGPMPLNGSGRLLVAARLGTTRLLDNIAIEIGTFAGTDRPDGYR
;
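To make the quoting and placeholder rules concrete, here is a minimal tokenizer sketch in Python. It is not the reference mmCIF grammar: the function name `tokenize` and the token labels are assumptions made for illustration, and the sketch simplifies by treating a closing quotation mark as one followed by whitespace or the end of the line. The sequence in the sample is shortened.

```python
import re

_TOKEN = re.compile(
    r"'(.*?)'(?=\s|$)"    # single-quoted string
    r'|"(.*?)"(?=\s|$)'   # double-quoted string
    r"|(\S+)"             # bare word
)


def tokenize(text):
    """Return (kind, value) pairs for a fragment of mmCIF text."""
    tokens = []
    lines = text.splitlines()
    i = 0
    while i < len(lines):
        line = lines[i]
        if line.startswith(";"):
            # Multi-line text runs until the next line that starts with ';'.
            body = [line[1:]]
            i += 1
            while i < len(lines) and not lines[i].startswith(";"):
                body.append(lines[i])
                i += 1
            tokens.append(("text", "\n".join(body)))
            i += 1  # skip the closing ';' line
            continue
        for match in _TOKEN.finditer(line):
            single, double, bare = match.groups()
            if single is not None:
                tokens.append(("text", single))
            elif double is not None:
                tokens.append(("text", double))
            elif bare.startswith("#"):
                break  # the rest of the line is a comment
            elif bare == "?":
                tokens.append(("missing", None))
            elif bare == ".":
                tokens.append(("inapplicable", None))
            else:
                tokens.append(("bare", bare))
        i += 1
    return tokens


sample = """\
_entity_src_gen.pdbx_host_org_scientific_name 'Escherichia coli'
_entity_src_gen.pdbx_host_org_tissue ?
_struct_ref.biol_id .
_struct_ref.pdbx_seq_one_letter_code
;MTIPAFHPGELNVYSAPGDV
ADVSRALRLTGRRVMLVPTM
;
"""

for kind, value in tokenize(sample):
    print(kind, repr(value))
```

Note that a quoted `'?'` comes back as ordinary text rather than as a missing-value marker, since only the unquoted character acts as a placeholder.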
Vectors and tables of data may be encoded in mmCIF using a
loop_ directive. To build a table, the data item names
corresponding to the table columns are preceded by the loop_
directive and followed by the corresponding rows of data.
The following example builds a table of author names.
#
loop_
_citation_author.citation_id
_citation_author.name
_citation_author.ordinal
primary 'Fermi, G.'      1
primary 'Perutz, M.F.'   2
primary 'Shaanan, B.'    3
primary 'Fourme, R.'     4
1       'Perutz, M.F.'   5
1       'Hasnain, S.S.'  6
1       'Duke, P.J.'     7
1       'Sessler, J.L.'  8
1       'Hahn, J.E.'     9
2       'Fermi, G.'      10
2       'Perutz, M.F.'   11
3       'Perutz, M.F.'   12
4       'Teneyck, L.F.'  13
4       'Arnone, A.'     14
5       'Fermi, G.'      15
6       'Muirhead, H.'   16
6       'Greer, J.'      17
#
The use of the loop_ directive in mmCIF has a few restrictions.
First, it is required that all of the data items within the loop
belong to the same mmCIF category. Second, the number of data values
following the loop must be an exact multiple of the number of
data item names. Finally, mmCIF prohibits the nesting of
loop_ directives.
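The first two restrictions lend themselves to a simple check. The following Python sketch assumes the item names and values have already been tokenized into lists of strings; the function name `build_loop_rows` is hypothetical and is not taken from any existing mmCIF library.

```python
def build_loop_rows(names, values):
    """Check the loop_ restrictions and return one dict per table row."""
    if not names:
        raise ValueError("a loop needs at least one data item name")
    # Restriction 1: every item must belong to the same category.
    categories = {name[1:].split(".", 1)[0].lower() for name in names}
    if len(categories) != 1:
        raise ValueError(f"loop mixes categories: {sorted(categories)}")
    # Restriction 2: the value count must be an exact multiple of the
    # number of item names.
    width = len(names)
    if len(values) % width != 0:
        raise ValueError(
            f"{len(values)} values cannot fill rows of {width} columns"
        )
    return [
        dict(zip(names, values[start:start + width]))
        for start in range(0, len(values), width)
    ]


names = [
    "_citation_author.citation_id",
    "_citation_author.name",
    "_citation_author.ordinal",
]
values = ["primary", "Fermi, G.", "1", "primary", "Perutz, M.F.", "2"]
for row in build_loop_rows(names, values):
    print(row["_citation_author.ordinal"], row["_citation_author.name"])
```

The third restriction (no nested loops) is a property of the grammar itself, so it is not represented in this flat list-based sketch.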
mmCIF uses data blocks to organize related information and data. A data
block is a logical partition of a data file or dictionary created
using a data_ directive. A data block may be named by
appending a text string after the data_ directive, and
a data block is terminated by either another data_ directive
or by the end of the file. The following example shows
a pair of abbreviated data blocks.
#
# --- Lines beginning with # are treated as comments
#
data_X987A
_entry.id                         X987A
_exptl_crystal.id                 'Crystal A'
_exptl_crystal.colour             'pale yellow'
_exptl_crystal.density_diffrn     1.113
_exptl_crystal.density_Matthews   1.01
_cell.entry_id                    X987A
_cell.length_a                    95.39
_cell.length_a_esd                0.05
_cell.length_b                    48.80
_cell.length_b_esd                0.12
_cell.length_c                    56.27
_cell.length_c_esd                0.06
# Second data block
data_T100A
_entry.id                         T100A
_exptl_crystal.id                 'Crystal B'
_exptl_crystal.colour             'orange'
_exptl_crystal.density_diffrn     1.156
_exptl_crystal.density_Matthews   1.06
_cell.entry_id                    T100A
_cell.length_a                    68.39
_cell.length_a_esd                0.05
_cell.length_b                    88.70
_cell.length_b_esd                0.12
_cell.length_c                    76.27
_cell.length_c_esd                0.06
The above example illustrates how data blocks can be used to separate similar information pertaining to different structures. This separation is required because the mmCIF syntax prohibits the repetition of the same category at multiple places within the same data block. As a result, the simple concatenation of the contents of the above two data blocks into a single data block would be syntactically incorrect.
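As a rough illustration of how data_ directives partition a file, the Python sketch below groups lines by data block name. It is a deliberately simplified, line-based approach: the function name `split_data_blocks` is hypothetical, comment lines are dropped, and text beginning with data_ inside a multi-line text field would be misread as a new block.

```python
def split_data_blocks(text):
    """Map each data block name to the list of its non-comment lines."""
    blocks = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.lower().startswith("data_"):
            # A new data_ directive ends the previous block and names the next.
            current = stripped.split()[0][len("data_"):]
            blocks[current] = []
        elif current is not None and stripped and not stripped.startswith("#"):
            blocks[current].append(stripped)
    return blocks


sample = """\
# --- Lines beginning with # are treated as comments
data_X987A
_entry.id X987A
_cell.length_a 95.39
# Second data block
data_T100A
_entry.id T100A
_cell.length_a 68.39
"""

for name, lines in split_data_blocks(sample).items():
    print(name, lines)
```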
Merging the data blocks in the above example
raises some additional issues associated with the
mmCIF data model and the structure of these specific categories.
In the above example, it would be possible to merge
the information in the EXPTL_CRYSTAL category
into a single data block by reorganizing this category
using a loop_ directive. However,
certain mmCIF categories like CELL and ENTRY
may hold only a single value for each data item within
the data block and therefore cannot be looped.
The single-valued property of the data items in these
categories is a consequence of the definition of the key items in these two
categories. The category key for the CELL category,
_cell.entry_id, is defined as a child of _entry.id.
The parent item _entry.id is defined as the data block
identifier and may therefore assume only a single value.
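The reorganization of EXPTL_CRYSTAL into a loop can be sketched as follows, using the values from the two data blocks in the earlier example. The helper `format_loop` is hypothetical, and its quoting rule (single quotes around any value containing a space) is a simplification that does not cover values which themselves contain quotation marks or line breaks.

```python
def format_loop(names, rows):
    """Render item names and rows of string values as a loop_ table."""
    def quote(value):
        # Simplified quoting: only values with spaces are quoted.
        return f"'{value}'" if " " in value else value

    lines = ["loop_"] + list(names)
    for row in rows:
        if len(row) != len(names):
            raise ValueError("each row needs one value per item name")
        lines.append(" ".join(quote(value) for value in row))
    return "\n".join(lines)


names = [
    "_exptl_crystal.id",
    "_exptl_crystal.colour",
    "_exptl_crystal.density_diffrn",
    "_exptl_crystal.density_Matthews",
]
rows = [
    ["Crystal A", "pale yellow", "1.113", "1.01"],
    ["Crystal B", "orange", "1.156", "1.06"],
]
print(format_loop(names, rows))
```

No comparable table can be produced for CELL or ENTRY, because their key items are tied to the single-valued data block identifier.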
Definitions in the mmCIF dictionary are encapsulated in named save frames.
A save frame begins with the save_ directive and is terminated by
another save_ directive. Save frames are named by appending a text string
to the save_ token. In mmCIF dictionaries,
save frames are used to encapsulate item and category definitions.
The mmCIF dictionary is composed of a data block containing thousands
of save frames, where each save frame contains a different definition.
Save frames may only appear in mmCIF dictionaries and they may not be nested.
The following example shows the save frame containing the definition
of the data item _exptl.details.
save__exptl.details
_item_description.description
; Any special information about the experimental work prior to the
  intensity measurement. See also _exptl_crystal.preparation.
;
_item.name                    '_exptl.details'
_item.category_id             exptl
_item.mandatory_code          no
_item_aliases.alias_name      '_exptl_special_details'
_item_aliases.dictionary      cif_core.dic
_item_aliases.version         2.0.1
_item_type.code               text
save_
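A dictionary reader might collect save frames along the following lines. This Python sketch is an assumption-laden illustration, not the behaviour of any particular mmCIF dictionary parser: the function name `extract_save_frames` is hypothetical, each save_ directive is assumed to sit alone on its own line, and multi-line text fields are not inspected.

```python
def extract_save_frames(text):
    """Map each save frame name to the lines found inside that frame."""
    frames = {}
    current = None
    for line in text.splitlines():
        token = line.strip()
        if token.lower().startswith("save_"):
            frame_name = token[len("save_"):]
            if frame_name:
                # A named save_ token opens a frame; nesting is not allowed.
                if current is not None:
                    raise ValueError(f"save frame {frame_name!r} is nested")
                current = frame_name
                frames[current] = []
            else:
                # A bare save_ token terminates the open frame.
                current = None
        elif current is not None:
            frames[current].append(line)
    return frames


sample = """\
save__exptl.details
_item.name '_exptl.details'
_item.category_id exptl
_item.mandatory_code no
save_
"""

frames = extract_save_frames(sample)
print(list(frames))                 # ['_exptl.details']
print(len(frames["_exptl.details"]))  # 3
```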
Save frames play a much more important role in STAR than in mmCIF. In a STAR file application such as NMR-STAR, a save frame acts as a reusable cell of information which can be referenced and expanded within the file. Save frames are referenced in a STAR file by preceding the save frame name with a dollar sign. The use of save frames in mmCIF has been limited to the organization and scoping features that they provide. mmCIF does not support references to save frames or the use of save frames for purposes other than encapsulating dictionary definitions.