Python Programming for Biology: Bioinformatics and Beyond

Designing a molecular structure data model

Date: 30.12.2021
Related
[Tim J. Stevens, Wayne Boucher] Python Programming

In this chapter we construct and implement an example data model that represents the three-dimensional structures of large biological molecules. If you are unfamiliar with the basic principles of biological molecules and their structures, see the introductions to Chapters 11 and [...], which aim to be suitable for non-biologists. Specifically, the data model will be for linear polymers, such as DNA, RNA and protein, where a longer molecule is built of smaller components linked together into a chain. It is a relatively simple data model, and it could certainly be extended, but we will avoid adding complications and keep things as clear as possible for this book. As such, we will make various simplifying assumptions about molecules and biology, but that is the case with all data models; it is just a matter of degree. In particular, we will ignore how the molecules might have a few extra or a few absent atoms (mostly hydrogen ions and small modifications) and how the molecules might have extra links that are not part of the main linear chain, like the disulphide links found in some proteins. We will not use any formal computer methods to describe the construction of the data model. Instead, we will rely upon relatively plain English. There are formal modelling techniques, such as UML (Unified Modeling Language), but these are well beyond the scope of this book.

Our model will describe the identities and the relative three-dimensional positions of all of the atoms that collectively can be considered a macromolecular structure: the precise shape of large biological molecules. This structure may be composed of any number of polymer molecules that come together, but the term is frequently used to describe just one molecule. Each molecular chain will have a distinct biological type, i.e. DNA, RNA or protein, and we can mix polymer types however we like. For example, we might want to consider the structure of a protein bound to a section of DNA.
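
As a rough illustration of a structure made of typed chains, here is a minimal sketch. It is not the implementation developed in this book: the class names, the attribute names (code, mol_type, residues), the MOL_TYPES constant and the example chain contents are all assumptions made for this example.

```python
from dataclasses import dataclass, field

# Allowed polymer types, as listed in the text. The constant name is hypothetical.
MOL_TYPES = ('DNA', 'RNA', 'protein')


@dataclass
class Chain:
    """One linear polymer: an ordered run of smaller components."""
    code: str        # hypothetical chain label, e.g. 'A'
    mol_type: str    # must be one of MOL_TYPES
    residues: list = field(default_factory=list)  # component names, in chain order

    def __post_init__(self):
        # Reject anything that is not DNA, RNA or protein.
        if self.mol_type not in MOL_TYPES:
            raise ValueError(f'Unknown molecule type: {self.mol_type!r}')


@dataclass
class Structure:
    """A macromolecular structure holding any number of chains."""
    name: str
    chains: list = field(default_factory=list)

    def add_chain(self, code, mol_type, residues=()):
        chain = Chain(code, mol_type, list(residues))
        self.chains.append(chain)
        return chain


# A protein bound to a section of DNA: two chains of different types
# in one structure. The residue names are illustrative only.
complex_structure = Structure('protein-DNA complex')
complex_structure.add_chain('A', 'protein', ['MET', 'LYS', 'ARG'])
complex_structure.add_chain('B', 'DNA', ['DA', 'DT', 'DG', 'DC'])

mol_types = [chain.mol_type for chain in complex_structure.chains]
print(mol_types)
```

Checking the type when a chain is created keeps the rule "each chain has exactly one biological type" in a single place, while the structure itself stays free to mix types.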

We will sometimes expect more than one set of three-dimensional coordinates for a given molecule, which means that for the same set of atoms we can describe alternative arrangements, or conformations. Describing multiple conformations is useful to indicate situations where the precise structure is uncertain and to describe the outcome of dynamical simulations of the molecule, where each set of coordinates could represent a different point in time or a different outcome. By allowing discrete collections of coordinates for a given molecule, we generate what is sometimes referred to as a structural ensemble. This term is used to emphasise the ‘togetherness’ of a bundle of related conformations.
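
One way to picture an ensemble is as a fixed list of atoms paired with several coordinate sets of the same length. The sketch below assumes that representation; the Ensemble class, its method names and the coordinate values are invented for illustration and are not taken from the model built in this book.

```python
from dataclasses import dataclass, field


@dataclass
class Ensemble:
    """The same atoms described by several alternative coordinate sets."""
    atom_names: list                                   # atom identities, shared by all conformations
    conformations: list = field(default_factory=list)  # each entry: one (x, y, z) tuple per atom

    def add_conformation(self, coords):
        coords = [tuple(float(v) for v in xyz) for xyz in coords]
        # Every conformation must place exactly the same set of atoms.
        if len(coords) != len(self.atom_names):
            raise ValueError('Need exactly one position per atom')
        if any(len(xyz) != 3 for xyz in coords):
            raise ValueError('Each position must be an (x, y, z) triple')
        self.conformations.append(coords)

    def mean_position(self, atom_index):
        """Average position of one atom over all conformations."""
        if not self.conformations:
            raise ValueError('No conformations have been added')
        n = len(self.conformations)
        positions = [conf[atom_index] for conf in self.conformations]
        # zip(*positions) groups all x values, all y values and all z values.
        return tuple(sum(axis) / n for axis in zip(*positions))


# Three atoms and two conformations. The coordinates are made up.
ensemble = Ensemble(['N', 'CA', 'C'])
ensemble.add_conformation([(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (2.0, 1.4, 0.0)])
ensemble.add_conformation([(0.0, 0.0, 0.0), (1.5, 1.0, 0.0), (2.0, 2.0, 0.5)])

mean_ca = ensemble.mean_position(1)
print(mean_ca)
```

Storing the atom identities once and only the positions per conformation reflects the idea that the atoms stay the same while their arrangement varies.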

In our model we will identify a given structure by a name, which will be a textual identifier, and we will also include a non-mandatory property, the Protein Data Bank identifier, to indicate when the data has come from an entry in the main biological coordinate database. The Worldwide Protein Data Bank (PDB) is a publicly available database that stores the structures of molecules. These were mostly determined by X-ray crystallography, but many have been determined by other techniques such as nuclear magnetic resonance (NMR). Despite its name suggesting that the PDB is only for proteins, these days it contains coordinate data for DNA and RNA too, although the protein structures vastly outnumber the other types. The structures that we are modelling might have been entered into this database, and we want to keep track of that. Accordingly, we use the textual PDB identifier that is unique to each entry in the PDB. Naturally, the PDB has its own data model to describe biological structures and their associated data, and it is far more extensive and complicated than the one we are using here. In the PDB's data model the PDB identifier is mandatory, but in our data model we will make it optional; the data doesn't have to come from this database in every case.
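
The difference between a mandatory name and an optional PDB identifier can be sketched in isolation as follows. The attribute names (name, pdb_id), the is_from_pdb helper and the identifier '1ABC' are placeholders chosen for this example, not the names used later in the book and not a reference to a particular PDB entry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Structure:
    """A structure with a mandatory name and an optional PDB identifier."""
    name: str                      # mandatory textual identifier
    pdb_id: Optional[str] = None   # None means the data did not come from the PDB

    def __post_init__(self):
        # The name is required and must be non-empty text.
        if not isinstance(self.name, str) or not self.name:
            raise ValueError('A structure needs a non-empty textual name')
        # The PDB identifier may be absent, but if given it must be text.
        if self.pdb_id is not None and not isinstance(self.pdb_id, str):
            raise TypeError('pdb_id must be a string or None')

    def is_from_pdb(self):
        return self.pdb_id is not None


# A structure built locally, with no database entry behind it.
local_model = Structure('Test model')

# A structure that records which database entry it came from.
# '1ABC' is a made-up placeholder identifier.
database_entry = Structure('Example entry', pdb_id='1ABC')

print(local_model.is_from_pdb(), database_entry.is_from_pdb())
```

Using None as the default keeps "no PDB entry" distinct from any real identifier, so code that needs the link to the database can test for it explicitly.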

There are many design decisions in our example data model: which things to describe, which things to ignore and what rules to apply. We will discuss the aspects of our particular model as we go through the example. However, which precise details we have chosen is not the most important thing; the idea is to empower you to create your own data models to do exactly what you want.
