When data is stored in a plain text file, although interpreting the individual component
characters is trivial, if we are to understand what the contents of the file actually mean we
have to understand the way in which the data inside the file is structured. This is just like
written language where knowing the alphabet is not enough, we also need to understand
concepts like words, sentences and punctuation. Ultimately it is the decision of the
computer programmer as to how the data in saved files is structured. However, where it is
important that files should be understood by a variety of programs the data will be
represented in a standardised, and hopefully documented, way. The data standard for a file
is often referred to as its format. Most commonly, a text file is organised
according to lines; these are subdivided by special end-of-line control characters.
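The line division described above can be seen directly in Python: a text file is really one stream of characters, and the end-of-line control characters (which differ between operating systems) mark where each line ends. A minimal sketch:

```python
# A text file is a character stream divided by end-of-line control
# characters; splitlines() recognises '\n', '\r\n' and other variants.
text = "first line\nsecond line\r\nthird line"

lines = text.splitlines()
print(lines)  # ['first line', 'second line', 'third line']
```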
A very common file structure is to have one record of data per line, often with a single
header line at the top of the file, to describe the contents of the lines. One possibility is
that the file represents a table with each line describing one row. The fields (or cells) in
each row, one for each column of the table, may be demarcated in various ways: this could
be according to the position within the row, i.e. the position of a character relative to the
start, or specified by special separating characters, like commas or whitespace (tabs, blank
spaces etc.). An alternative to a fixed order of fields is that the lines consist of pairs of
named keys (identifiers) and corresponding values. Here the keys will generally come
from a fixed, allowed set of keys, and in some instances the data values that are addressed
by a given key will also follow a prescribed form.
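Both styles of line-based record are easy to take apart in Python. The sketch below parses a one-row comma-separated table and then a key-value variant of the same record; the field names (gene name, chromosome and so on) are invented for illustration:

```python
# Table style: a header line names the columns, a data line gives one row
header = "name,chromosome,start,end"
row    = "geneA,chr7,1200,3450"

fields = dict(zip(header.split(','), row.split(',')))
print(fields['chromosome'])  # 'chr7'

# Key-value style: each line is "key: value", keys drawn from a fixed set
recordLines = ["name: geneA", "chromosome: chr7"]
record = {}
for line in recordLines:
    key, value = line.split(':', 1)   # split only at the first colon
    record[key.strip()] = value.strip()

print(record)  # {'name': 'geneA', 'chromosome': 'chr7'}
```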
Another common file structure is to have tags that identify and specify the beginning
and end of a record. Often these tags can be nested, one inside the other, thus denoting
containment. For example, the XML (eXtensible Markup Language) data standard uses
tags where the record starts with text like ‘<NAME>’ and ends with the text ‘</NAME>’,
where here ‘NAME’ is the identifier for the element.
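Python’s standard library can parse such nested tags with the xml.etree.ElementTree module. In this sketch the element names (‘molecule’, ‘name’, ‘chains’) are invented purely to illustrate containment:

```python
from xml.etree.ElementTree import fromstring

# Nested tags denote containment: <name> and <chains> sit inside <molecule>
xmlText = "<molecule><name>Lysozyme</name><chains>1</chains></molecule>"

root = fromstring(xmlText)
print(root.tag)                # 'molecule'
print(root.find('name').text)  # 'Lysozyme'
```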
Sometimes a programming language will have its own, inbuilt formats for representing
data structures created in that language. This process is referred to as serialisation. In
general the serialisation format will be specific to the language in question, and may
require special modules to be installed. However, if data is only going to be used in a
single language environment, using serialisation can offer an efficient means to store the
active data, in terms of both speed and programming ease. Python has a serialisation
method which is referred to as pickling. Such ‘pickle’ files are usually textual, but they are
not so easy for a person to interpret. For example, the Python list data structure:
x = [1, 2, 'a', 'b', True, None]
is saved by the pickling method as:
(lp1
I1
aI2
aS'a'
aS'b'
aI01
aNa.
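In practice pickled data is produced and read back with the pickle module. A minimal round trip might look like the following (in modern Python 3 even the oldest, ASCII-based protocol 0, which produces output closest to that shown above, yields a bytes object):

```python
import pickle

x = [1, 2, 'a', 'b', True, None]

# Serialise to a byte string using protocol 0, the oldest ASCII protocol
data = pickle.dumps(x, protocol=0)

# Deserialise: an equal list is reconstructed
y = pickle.loads(data)
print(y == x)  # True
```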
For most file formats, it is normally easier to write files than it is to read them: it is
easier to extract data from a controlled and standardised in-memory representation (e.g.
Python structures) than it is to interpret someone else’s text, which may not be
standardised or even fully understood. So when writing a program to read a file you need
to parse the file (determine its syntactic structure) and also confirm that the content is
valid, however that may be defined. When writing a file you just have to make sure that
you are following the rules for the file layout. A common programming paradigm is to
first read one or more files, do some processing and then write out one or more files.
Although the processing is normally the major objective of the program, it is not
uncommon to have situations where most effort needs to be spent creating the code to do
the reading and writing of the files, especially for simple programs. If there is already an
existing piece of tested code, such as the importable BioPython module, which can be
used to read and write files, then this is often used in preference to spending time writing
something new. See Chapter 11 for examples that use BioPython to read and write files.
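The read-process-write paradigm can be sketched with only the standard library. Here the two-column tab-separated layout, the file names and the ‘keep only positive values’ processing step are all assumptions made for illustration; note that the reading side both parses and validates, while the writing side merely follows the layout rules:

```python
def process_file(in_path, out_path):
    # Read: pull the whole file in and split it into lines
    with open(in_path) as in_file:
        lines = in_file.read().splitlines()

    # Parse and validate: expect exactly two tab-separated columns,
    # the second of which must be a number
    records = []
    for line in lines:
        fields = line.split('\t')
        if len(fields) != 2:
            raise ValueError('Bad line: %r' % line)
        name, value = fields
        records.append((name, float(value)))

    # Process: keep only the records with positive values
    kept = [(name, value) for name, value in records if value > 0]

    # Write: just follow the layout rules for the output format
    with open(out_path, 'w') as out_file:
        for name, value in kept:
            out_file.write('%s\t%.3f\n' % (name, value))
```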