PDBx/mmCIF General FAQ

  1. PDB entries are distributed in the PDB File Format following the specification described in the Contents Guide Version 3.30 (Nov. 21, 2012). The PDB file format is no longer being modified or extended to support new content.
  2. Large structures (containing >62 chains and/or 99999 ATOM records) that cannot be fully represented in the PDB File Format are available in the PDB archive as single PDBx/mmCIF files. These large structures are also distributed in “bundled” TAR files containing a collection of best effort/minimal files in the PDB File Format.
  1. PDBx/mmCIF became the standard PDB archive format in 2014.
  2. All PDB data processing and annotation will be performed using PDBx/mmCIF at all wwPDB sites.
  3. PDBx/mmCIF consists of categories of information represented as tables and keyword value pairs.
  4. The categories in PDBx/mmCIF have explicit relationships with one another.
  5. PDBx/mmCIF imposes no limitations for the number of atoms, residues or chains that can be represented in a single PDB entry (no split entries!).
  6. Each data item in a PDBx/mmCIF file is precisely defined in the PDBx Exchange Data Dictionary. The content of the data dictionary is fully software accessible.
  7. All of the data items in the current PDB format have corresponding data items in the PDBx/mmCIF format.
  8. Chemical descriptions of all of the monomers and ligands in PDB entries are provided in the PDB Chemical Component Dictionary which is in PDBx/mmCIF format.
  9. PDBx/mmCIF is supported by visualization applications such as Jmol, Chimera, and OpenRasMol as well as structure determination systems such as CCP4 and Phenix.
  1. The format is based on a context-free grammar. PDBx/mmCIF has a simple grammar. Data are presented in either key-value or tabular form. It is much easier to parse than the record-oriented PDB format. Say good-bye to "exception" handling when reading old-style PDB flat files!
  2. There are no column width limitations.
  3. All relationships between common data items (e.g. atom and residue identifiers) are explicitly documented within the PDBx Exchange Dictionary. This permits software applications to evaluate and validate referential integrity with any PDB entry.
  4. The mmCIF/PDBx Exchange Dictionary provides metadata (e.g. data types, allowed ranges, controlled vocabularies) which can be used to generate a validating mmCIF parser or a database loader.
  5. Parsing tools are available in most popular languages (e.g. C/C++, Java, Python, Perl, FORTRAN) and toolkits (e.g. BioJava and BioPython).
  6. Mapping information between the residue sequences of the experimental sample and the model coordinates is included within each entry.
  7. PDB Chemical reference data are maintained and distributed in PDBx/mmCIF format.
Plans for more PDB-friendly PDBx/mmCIF ATOM records
  • All records on a single text line
  • Columns presented in standard column order.
  • Tabular presentation with leading record names (e.g. ATOM, CELL, REFINE)
  • Method independent features in left-most columns (e.g. identifiers & coordinates)
  • Method specific features in the right-most columns (e.g. ADPs, NMR order/disorder parameters)
  • Continue to support PDB nomenclature semantics (e.g. PDB style chains, residue numbering, and insertion codes)

The following examples show ATOM records in the current PDB format and in the proposed stylized PDBx/mmCIF format. In the PDBx/mmCIF example, the column order places the chain, residue and atom nomenclature items in the left-most columns. Data items that depend on the experimental method (e.g. occupancy, B-value) are placed in the columns to the right. All of the items of each atom record in the PDBx/mmCIF example are placed on a single text line and are white-space delimited.

Example of Record-oriented PDB Format ATOM Records
ATOM      1  N   GLN A  39      24.690 -27.754  24.275  1.00 60.76           N
ATOM      2  CA  GLN A  39      23.581 -26.768  24.416  1.00 60.98           C
ATOM      3  C   GLN A  39      23.990 -25.379  23.905  1.00 59.98           C
ATOM      4  O   GLN A  39      25.070 -25.209  23.330  1.00 60.25           O
ATOM      5  CB  GLN A  39      23.136 -26.685  25.878  1.00 60.69           C
ATOM      6  N   VAL A  40      23.115 -24.395  24.122  1.00 59.58           N
ATOM      7  CA  VAL A  40      23.342 -23.010  23.690  1.00 57.26           C
ATOM      8  C   VAL A  40      24.000 -22.152  24.778  1.00 56.00           C
ATOM      9  O   VAL A  40      23.992 -20.920  24.692  1.00 55.53           O
ATOM     10  CB  VAL A  40      22.015 -22.337  23.275  1.00 57.32           C
Example of PDBx/mmCIF ATOM Records (atom_site category)
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.auth_atom_id
_atom_site.type_symbol
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_seq_id
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.pdbx_PDB_model_num
_atom_site.occupancy
_atom_site.pdbx_auth_alt_id
_atom_site.B_iso_or_equiv
ATOM 1  N  N GLN A 39 24.690 -27.754 24.275 1 1.000 . 60.760
ATOM 2  CA C GLN A 39 23.581 -26.768 24.416 1 1.000 . 60.980
ATOM 3  C  C GLN A 39 23.990 -25.379 23.905 1 1.000 . 59.980
ATOM 4  O  O GLN A 39 25.070 -25.209 23.330 1 1.000 . 60.250
ATOM 5  CB C GLN A 39 23.136 -26.685 25.878 1 1.000 . 60.690
ATOM 6  N  N VAL A 40 23.115 -24.395 24.122 1 1.000 . 59.580
ATOM 7  CA C VAL A 40 23.342 -23.010 23.690 1 1.000 . 57.260
ATOM 8  C  C VAL A 40 24.000 -22.152 24.778 1 1.000 . 56.000
ATOM 9  O  O VAL A 40 23.992 -20.920 24.692 1 1.000 . 55.530
ATOM 10 CB C VAL A 40 22.015 -22.337 23.275 1 1.000 . 57.320
ATOM 11 N  N ALA A 41 24.560 -22.804 25.797 1 1.000 . 54.570
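
The relationship between the two layouts can be illustrated with a short Python sketch that reads one fixed-column PDB ATOM record and writes it as a white-space delimited row in the column order of the stylized example. The column positions are taken from the ATOM record definition in the PDB File Format Contents Guide. The function name pdb_atom_to_mmcif_row is hypothetical, and writing a model number of 1 and three decimal places are assumptions made only to mirror the example rows. This is not a wwPDB conversion tool.

```python
def pdb_atom_to_mmcif_row(line, model_num=1):
    """Convert one fixed-column PDB ATOM/HETATM record to a white-space delimited row."""

    def field(start, end, missing="."):
        # start and end are 1-based, inclusive column numbers.
        value = line[start - 1:end].strip()
        return value or missing

    def number(start, end):
        # Assumption: three decimal places, as in the stylized example rows.
        return f"{float(field(start, end)):.3f}"

    fields = [
        field(1, 6),      # _atom_site.group_PDB
        field(7, 11),     # _atom_site.id
        field(13, 16),    # _atom_site.auth_atom_id
        field(77, 78),    # _atom_site.type_symbol
        field(18, 20),    # _atom_site.auth_comp_id
        field(22, 22),    # _atom_site.auth_asym_id
        field(23, 26),    # _atom_site.auth_seq_id
        number(31, 38),   # _atom_site.Cartn_x
        number(39, 46),   # _atom_site.Cartn_y
        number(47, 54),   # _atom_site.Cartn_z
        str(model_num),   # _atom_site.pdbx_PDB_model_num (assumed, not in the record)
        number(55, 60),   # _atom_site.occupancy
        field(17, 17),    # _atom_site.pdbx_auth_alt_id ("." when blank)
        number(61, 66),   # _atom_site.B_iso_or_equiv
    ]
    return " ".join(fields)


# The first ATOM record of the PDB format example, assembled field by field
# so that the fixed column widths are visible.
PDB_LINE = (
    "ATOM  " "    1" " " " N  " " " "GLN" " " "A" "  39" " " "   "
    "  24.690" " -27.754" "  24.275" "  1.00" " 60.76" + " " * 10 + " N"
)

print(pdb_atom_to_mmcif_row(PDB_LINE))
```

A real converter would also need to quote values that contain special characters and to track MODEL records, both of which this sketch leaves out.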

PDB entries in PDBx/mmCIF format are stored on the ftp sites of the wwPDB partners at one of the locations:

Entries containing very large structures stored in PDBx/mmCIF format are currently kept separately at one of the locations:

The PDBx/mmCIF format files are named following the convention <PDB_4-LETTER-ID_CODE>.cif.gz (e.g. 1abc.cif.gz). Experimental data files containing X-ray structure factors are only distributed in PDBx/mmCIF format and are named following an older PDB naming convention r<PDB_ID_CODE>sf.ent.gz (e.g. r1abcsf.ent.gz).
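
As a small illustration of the naming convention, the following Python sketch builds both file names from a PDB ID code. The function names are hypothetical, and lower-casing the ID and checking for exactly four alphanumeric characters are assumptions based on the examples given (1abc.cif.gz and r1abcsf.ent.gz).

```python
import re


def _normalize_pdb_id(pdb_id):
    # Assumption: archive file names use the lower-case 4-character ID code.
    pdb_id = pdb_id.strip().lower()
    if not re.fullmatch(r"[0-9a-z]{4}", pdb_id):
        raise ValueError(f"not a 4-character PDB ID code: {pdb_id!r}")
    return pdb_id


def coordinate_file_name(pdb_id):
    """File name of the PDBx/mmCIF coordinate file, e.g. 1abc.cif.gz."""
    return f"{_normalize_pdb_id(pdb_id)}.cif.gz"


def structure_factor_file_name(pdb_id):
    """File name of the structure factor file, e.g. r1abcsf.ent.gz."""
    return f"r{_normalize_pdb_id(pdb_id)}sf.ent.gz"


print(coordinate_file_name("1ABC"))
print(structure_factor_file_name("1ABC"))
```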

A complete description of the download options for PDB data files is maintained here by the wwPDB. A description of the special handling of PDB entries containing very large structures is available here.

The PDBx/mmCIF format has a simple appearance with only a few syntax elements. All of the syntax elements used in PDBx data files are shown in the following snippet describing a polymer sequence.

The essential syntax features include:

  • All data items are identified by name and begin with the underscore character (e.g. _entity_poly.entity_id).
  • Data item names can be decomposed into a category name and an attribute name separated by a period: _category.attribute.
  • Data categories are presented in two styles: key-value and tabular. In the example, the categories entity_name_com and entity_poly both use the key-value style and the entity_poly_seq category uses the tabular style. In the tabular style, the data item names corresponding to the table columns follow a reserved loop_ token and are followed by rows of white-space delimited data values.
  • Any character data value may be quoted using encapsulating single or double quotes; however, character values containing internal whitespace (e.g. the value of _entity_name_com.name) must be quoted. Character values that extend over multiple lines are quoted using leading and trailing semi-colons positioned at the first character position of the records surrounding the multi-line character value (e.g. _entity_poly.pdbx_seq_one_letter_code).
  • Lines beginning with the hash symbol # are comments.

Look here for a more complete description of PDBx/mmCIF data file and dictionary syntax.

#  <-- a comment line
_entity_name_com.entity_id  1
_entity_name_com.name       "Pantoate--beta-alanine ligase, Pantoate-activating enzyme"

_entity_poly.entity_id                      1
_entity_poly.type                           'polypeptide(L)'
_entity_poly.nstd_linkage                   no
_entity_poly.nstd_monomer                   no
_entity_poly.pdbx_seq_one_letter_code
;AMAIPAFHPGELNVYSAPGDVADVSRALRLTGRRVMLVPTMGALHEGHLALVRAAKRVPGSVVVVSIFVNPMQFGAGGDL
DAYPRTPDDDLAQLRAEGVEIAFTPTTAAMYPDGLRTTVQPGPLAAELEGGPRPTHFAGVLTVVLKLLQIVRPDRVFFGE
KDYQQLVLIRQLVADFNLDVAVVGVPTVREADGLAMSSRNRYLDPAQRAAAVALSAALTAAAHAATAGAQAALDAARAVL
DAAPGVAVDYLELRDIGLGPMPLNGSGRLLVAARLGTTRLLDNIAIEIGTFAGTDRPDGYR
;

#
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
_entity_poly_seq.hetero
1 1   ALA n
1 2   MET n
1 3   ALA n
1 4   ILE n
1 5   PRO n
1 6   ALA n
1 7   PHE n
# ....  abbreviated ....
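
The syntax elements listed above are few enough that a reader can be sketched in a short Python example. The code below is a minimal illustration only: the function names tokenize and parse are hypothetical, the sample is a shortened copy of the snippet (the sequence is truncated to its first 20 residues), and features such as save frames and multiple data blocks are not handled. For real work one of the existing parsing toolkits is the better choice.

```python
import re

# A token is a quoted string or a run of non-whitespace characters.
# A closing quote only counts when followed by whitespace or end of line.
_TOKEN = re.compile(r"""'(.*?)'(?=\s|$)|"(.*?)"(?=\s|$)|(\S+)""")


def tokenize(text):
    """Yield (kind, value) pairs; kind is 'word' (unquoted) or 'text' (quoted)."""
    lines = iter(text.splitlines())
    for line in lines:
        if line.startswith(";"):
            # Multi-line value: runs until the next line starting with ';'.
            parts = [line[1:]]
            for more in lines:
                if more.startswith(";"):
                    break
                parts.append(more)
            yield "text", "\n".join(parts)
            continue
        for match in _TOKEN.finditer(line):
            single, double, word = match.groups()
            if word is not None:
                if word.startswith("#"):
                    break  # comment: ignore the rest of the line
                yield "word", word
            else:
                yield "text", single if single is not None else double


def _is_name(token):
    return token[0] == "word" and token[1].startswith("_")


def _ends_loop(token):
    kind, value = token
    return kind == "word" and (value.startswith(("_", "data_")) or value == "loop_")


def parse(text):
    """Return {category: [row, ...]} where each row maps attribute to value."""
    tokens = list(tokenize(text))
    data, i = {}, 0
    while i < len(tokens):
        if tokens[i] == ("word", "loop_"):
            i += 1
            names, values = [], []
            while i < len(tokens) and _is_name(tokens[i]):
                names.append(tokens[i][1])
                i += 1
            while i < len(tokens) and not _ends_loop(tokens[i]):
                values.append(tokens[i][1])
                i += 1
            category = names[0][1:].split(".", 1)[0]
            attrs = [name.split(".", 1)[1] for name in names]
            data[category] = [dict(zip(attrs, values[j:j + len(attrs)]))
                              for j in range(0, len(values), len(attrs))]
        elif _is_name(tokens[i]):
            category, attr = tokens[i][1][1:].split(".", 1)
            data.setdefault(category, [{}])[0][attr] = tokens[i + 1][1]
            i += 2
        else:
            i += 1  # anything else (e.g. a data_ header) is skipped
    return data


SAMPLE = """\
#  <-- a comment line
_entity_name_com.entity_id  1
_entity_name_com.name       "Pantoate--beta-alanine ligase, Pantoate-activating enzyme"

_entity_poly.entity_id                      1
_entity_poly.type                           'polypeptide(L)'
_entity_poly.pdbx_seq_one_letter_code
;AMAIPAFHPG
ELNVYSAPGD
;
#
loop_
_entity_poly_seq.entity_id
_entity_poly_seq.num
_entity_poly_seq.mon_id
_entity_poly_seq.hetero
1 1   ALA n
1 2   MET n
1 3   ALA n
"""

data = parse(SAMPLE)
print(data["entity_name_com"][0]["name"])
print([row["mon_id"] for row in data["entity_poly_seq"]])
```

Key-value categories are stored as a one-row table so that both presentation styles end up in the same structure, which reflects the point that the two styles carry the same kind of information.
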
The PDBx/mmCIF data files produced by the wwPDB conform to both the CIF 1.0 and 1.1 syntax specifications. The current syntax specification for CIF 1.1 is maintained at the IUCr CIF site.

Yes, the atom coordinate records in the PDBx/mmCIF data distributed by the wwPDB are stored on individual lines each beginning with either 'ATOM' or 'HETATM'. The elements of each coordinate record are white-space delimited. For example, PDBx/mmCIF coordinate records in PDB entries all have the following regular layout.

loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.Cartn_x_esd
_atom_site.Cartn_y_esd
_atom_site.Cartn_z_esd
_atom_site.occupancy_esd
_atom_site.B_iso_or_equiv_esd
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
ATOM   1    N  N   . VAL A 1 1   ? 6.204   16.869  4.854   1.00 49.05 ? ? ? ? ? ? 1   VAL A N   1
ATOM   2    C  CA  . VAL A 1 1   ? 6.913   17.759  4.607   1.00 43.14 ? ? ? ? ? ? 1   VAL A CA  1
ATOM   3    C  C   . VAL A 1 1   ? 8.504   17.378  4.797   1.00 24.80 ? ? ? ? ? ? 1   VAL A C   1
ATOM   4    O  O   . VAL A 1 1   ? 8.805   17.011  5.943   1.00 37.68 ? ? ? ? ? ? 1   VAL A O   1
ATOM   5    C  CB  . VAL A 1 1   ? 6.369   19.044  5.810   1.00 72.12 ? ? ? ? ? ? 1   VAL A CB  1
ATOM   6    C  CG1 . VAL A 1 1   ? 7.009   20.127  5.418   1.00 61.79 ? ? ? ? ? ? 1   VAL A CG1 1
ATOM   7    C  CG2 . VAL A 1 1   ? 5.246   18.533  5.681   1.00 80.12 ? ? ? ? ? ? 1   VAL A CG2 1
ATOM   8    N  N   . LEU A 1 2   ? 9.096   18.040  3.857   1.00 26.44 ? ? ? ? ? ? 2   LEU A N   1
ATOM   9    C  CA  . LEU A 1 2   ? 10.600  17.889  4.283   1.00 26.32 ? ? ? ? ? ? 2   LEU A CA  1
ATOM   10   C  C   . LEU A 1 2   ? 11.265  19.184  5.297   1.00 32.96 ? ? ? ? ? ? 2   LEU A C   1
ATOM   11   O  O   . LEU A 1 2   ? 10.813  20.177  4.647   1.00 31.90 ? ? ? ? ? ? 2   LEU A O   1
ATOM   12   C  CB  . LEU A 1 2   ? 11.099  18.007  2.815   1.00 29.23 ? ? ? ? ? ? 2   LEU A CB  1
ATOM   13   C  CG  . LEU A 1 2   ? 11.322  16.956  1.934   1.00 37.71 ? ? ? ? ? ? 2   LEU A CG  1
ATOM   14   C  CD1 . LEU A 1 2   ? 11.468  15.596  2.337   1.00 39.10 ? ? ? ? ? ? 2   LEU A CD1 1
ATOM   15   C  CD2 . LEU A 1 2   ? 11.423  17.268  0.300   1.00 37.47 ? ? ? ? ? ? 2   LEU A CD2 1

The following command will extract the PDB atom record name, atom name, residue name, chain identifier, residue number, Cartesian X, Y, and Z coordinates from the above snippet of PDBx/mmCIF coordinate data for PDB entry 4HHB.

                grep '^ATOM' 4HHB.cif | awk '{print $1, $25, $23, $24, $22, $11, $12, $13}'
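
The same extraction can be sketched in Python by looking the columns up by item name rather than by position, which also picks up HETATM records. The function name extract_atom_fields is hypothetical, and the sample holds only the first two rows of the snippet above. Like the awk command, this sketch splits rows on white space and so assumes that no value in a coordinate row is a quoted string containing spaces.

```python
def extract_atom_fields(cif_text, wanted):
    """Yield one tuple per coordinate row with the requested atom_site attributes."""
    names = []          # item names of the loop currently being read
    in_header = False   # True while reading the item names that follow loop_
    for line in cif_text.splitlines():
        stripped = line.strip()
        if stripped == "loop_":
            names, in_header = [], True
        elif in_header and stripped.startswith("_"):
            names.append(stripped)
        else:
            in_header = False
            is_atom_loop = bool(names) and names[0].startswith("_atom_site.")
            if is_atom_loop and stripped.startswith(("ATOM", "HETATM")):
                row = dict(zip(names, stripped.split()))
                yield tuple(row["_atom_site." + attr] for attr in wanted)


SAMPLE = """\
loop_
_atom_site.group_PDB
_atom_site.id
_atom_site.type_symbol
_atom_site.label_atom_id
_atom_site.label_alt_id
_atom_site.label_comp_id
_atom_site.label_asym_id
_atom_site.label_entity_id
_atom_site.label_seq_id
_atom_site.pdbx_PDB_ins_code
_atom_site.Cartn_x
_atom_site.Cartn_y
_atom_site.Cartn_z
_atom_site.occupancy
_atom_site.B_iso_or_equiv
_atom_site.Cartn_x_esd
_atom_site.Cartn_y_esd
_atom_site.Cartn_z_esd
_atom_site.occupancy_esd
_atom_site.B_iso_or_equiv_esd
_atom_site.pdbx_formal_charge
_atom_site.auth_seq_id
_atom_site.auth_comp_id
_atom_site.auth_asym_id
_atom_site.auth_atom_id
_atom_site.pdbx_PDB_model_num
ATOM   1    N  N   . VAL A 1 1   ? 6.204   16.869  4.854   1.00 49.05 ? ? ? ? ? ? 1   VAL A N   1
ATOM   2    C  CA  . VAL A 1 1   ? 6.913   17.759  4.607   1.00 43.14 ? ? ? ? ? ? 1   VAL A CA  1
"""

# Same fields, in the same order, as the awk command above.
WANTED = ["group_PDB", "auth_atom_id", "auth_comp_id", "auth_asym_id",
          "auth_seq_id", "Cartn_x", "Cartn_y", "Cartn_z"]

for fields in extract_atom_fields(SAMPLE, WANTED):
    print(" ".join(fields))
```

Selecting by item name keeps the script working if a file presents the atom_site columns in a different order.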
            
Coordinate data is recorded in the PDBx/mmCIF ATOM_SITE data category. This brief tutorial describes the PDBx/mmCIF representation of coordinate data and the relationship to PDB format coordinate data items.
This brief tutorial describes the PDBx/mmCIF representation of polymer and non-polymer molecular entities.
The collection of PDBx/mmCIF data categories used in the Chemical Component Dictionary are in the CHEM_COMP_DICTIONARY category group.
The collection of PDBx/mmCIF data categories and items used in the Biologically Interesting molecule Reference Dictionary (BIRD) are in the BIRD_DICTIONARY category group.