Designing a molecular structure data model
In this chapter we construct and implement an example data model which represents the
three-dimensional structures of large biological molecules. If you are unfamiliar with the
basic principles of biological molecules and their structures, see the introductions to
Chapters 11 and 15, which aim to be suitable for non-biologists. Specifically, the data
model will be for linear polymers, such as DNA, RNA and protein, where a longer
molecule is built of smaller components linked together into a chain. It is a relatively
simple data model, and it could certainly be extended, but we will avoid adding
complications and keep things as clear as possible for this book. As such, we will make
various simplifying assumptions about molecules and biology, but that is the case with all
data models; it is just a matter of degree. In particular, we will ignore issues such as
how the molecules might have a few extra or a few absent atoms (mostly hydrogen ions
and small modifications) or how the molecules might have extra links, which are not part
of the main linear chain, like the disulphide links found in some proteins. We will not use
any formal computer methods to describe the construction of the data model. Instead, we
will rely upon relatively plain English. There are formal modelling techniques, such as UML
(Unified Modeling Language), but these are well beyond the scope of
this book.
Our model will describe the identities and the relative three-dimensional positions of all
of the atoms that collectively can be considered a macromolecular structure: the precise
shape of large biological molecules. This structure may be composed of any number of
polymer molecules that come together, but is frequently used to describe just one
molecule. Each molecular chain will have a distinct biological type, i.e. DNA, RNA or
protein, and we can mix polymer types however we like. For example, we might want to
consider the structure of a protein bound to a section of DNA.
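As a rough illustration of the idea so far, a structure that holds any number of chains, each with a biological type, might be sketched as follows. The class and attribute names (Structure, Chain, mol_type, add_chain) are assumptions made for this sketch only, not necessarily the names used in the implementation later in the chapter.

```python
from dataclasses import dataclass, field

# Hypothetical names for illustration only.
MOLECULE_TYPES = ('DNA', 'RNA', 'protein')


@dataclass
class Chain:
    code: str      # a short label for the chain, e.g. 'A'
    mol_type: str  # must be one of MOLECULE_TYPES

    def __post_init__(self):
        # Reject anything that is not one of the supported polymer types
        if self.mol_type not in MOLECULE_TYPES:
            raise ValueError(f'Unknown molecule type: {self.mol_type!r}')


@dataclass
class Structure:
    # A structure may contain any number of chains, of mixed types
    chains: list = field(default_factory=list)

    def add_chain(self, code, mol_type):
        chain = Chain(code, mol_type)
        self.chains.append(chain)
        return chain


# A protein bound to a section of DNA: two chains of different types
complex_structure = Structure()
complex_structure.add_chain('A', 'protein')
complex_structure.add_chain('B', 'DNA')
```

Checking the type when the chain is created is one possible design choice; it keeps every chain in a structure restricted to the three polymer types the model covers.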
We will sometimes expect more than one set of three-dimensional coordinates for a
given molecule, which means that for the same set of atoms we can describe alternative
arrangements or conformations. Describing multiple conformations is useful to indicate
situations where the precise structure is uncertain and to describe the outcome of
dynamical simulations of the molecule, where each set of coordinates could represent a
different point in time or a different outcome. By allowing discrete collections of
coordinates for a given molecule, we generate what is sometimes referred to as a
structural ensemble. This term is used to emphasise the ‘togetherness’ of a bundle of
related conformations.
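One possible way to picture an ensemble is to let each atom carry a list of coordinates, one entry per conformation. This is only a sketch under that assumption: the names Atom, coords, count_conformations and get_conformation are hypothetical, and the coordinate values are arbitrary numbers chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Atom:
    name: str
    # One (x, y, z) tuple per conformation; position i in this list
    # belongs to conformation i of the ensemble.
    coords: list = field(default_factory=list)


def count_conformations(atoms):
    """Return the ensemble size, checking that all atoms agree on it."""
    counts = {len(atom.coords) for atom in atoms}
    if len(counts) > 1:
        raise ValueError('Atoms disagree on the number of conformations')
    return counts.pop() if counts else 0


def get_conformation(atoms, index):
    """Return the coordinates of every atom for one conformation."""
    return [atom.coords[index] for atom in atoms]


# Two atoms, each with two alternative positions (arbitrary values)
atoms = [
    Atom('N', [(0.0, 0.0, 0.0), (0.1, 0.0, 0.0)]),
    Atom('CA', [(1.5, 0.0, 0.0), (1.4, 0.2, 0.0)]),
]

n_models = count_conformations(atoms)
second_model = get_conformation(atoms, 1)
```

The same set of atoms is shared by every conformation; only the coordinates differ, which matches the description of an ensemble as alternative arrangements of one molecule.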
In our model we will identify a given structure by a name, which will be a textual
identifier, and we will also include a non-mandatory property, the Protein Data Bank
identifier, to indicate when the data has come from an entry in the main biological
coordinate database. The Worldwide Protein Data Bank is a publicly available database
that stores the structures of molecules. These were mostly determined by X-ray
crystallography but many have been determined by other techniques such as nuclear
magnetic resonance (NMR). Despite the name suggesting that the Protein Data Bank (PDB) is only
for proteins, these days it contains coordinate data for DNA and RNA too, although the
protein structures vastly outnumber the other types. The structures that we are modelling
might have been entered into this database, and we want to keep track of that.
Accordingly, we use the textual PDB identifier that is unique to each entry in the PDB.
Naturally, the PDB has its own data model to describe biological structures and their
associated data, and it is far more extensive and complicated than the one we are using
here. In their data model the PDB identifier is mandatory, but in our data model we will
make it optional; the data doesn’t have to come from this database in every case.
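The difference between the mandatory name and the optional PDB identifier can be expressed quite directly in Python by giving the optional attribute a default of None. The following is a minimal sketch that looks only at these two identifying attributes; the attribute names name and pdb_id are assumptions for illustration, and the identifier shown is just an example value.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Structure:
    name: str                     # mandatory textual identifier
    pdb_id: Optional[str] = None  # optional; None means "not from the PDB"


# A structure taken from a database entry records the PDB identifier
from_database = Structure('crambin', pdb_id='1CRN')

# A structure from elsewhere, e.g. a simulation, simply omits it
from_simulation = Structure('simulated peptide')
```

Leaving out the name raises a TypeError when the object is created, whereas leaving out the PDB identifier is always allowed.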
There are many design decisions in our example data model, about which things to
describe, which things we ignore and what rules we apply. We will discuss the aspects of
our particular model as we go through the example. However, which precise details we
have chosen is not the most important thing; the idea is to empower you to create your
own data models to do exactly what you want.