Extensible Markup Language (XML) is a way of storing information in files in a standard,
textual way. Although it is rather verbose, it is very popular and there are many tools for
parsing XML files, which makes it relatively easy to use. An XML file is organised as a
tree hierarchy. At the outermost level is the ‘root’ of the data tree. Each node in the tree is called
an XML element. Each element has a tag defining what kind of element it is, and may also
have any number of attributes, some text and can contain any number of other (child)
elements. Each element, except for the root element, has a unique parent element. The
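XML tools let you navigate this tree. An XML file must be well formed, meaning that it
follows the basic syntax rules of XML (for example, every opening tag has a matching
closing tag and elements are properly nested), and the parsing tools
automatically check for this. An XML file may also be required to be valid, in the sense of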
satisfying some ‘schema’, which defines what the hierarchy can be, including a
specification of the tags, attributes and parent/child relationships.
Validating parsers will also automatically check for validity, if the XML file specifies
a schema. Schemas can be
defined either with a DTD (Document Type Definition) or with an XML schema. We do
not consider this issue further here. Note that just because an XML file is well formed and
valid does not mean that the data it contains is correct or meaningful.
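To illustrate the well-formedness check, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The two XML snippets and the isWellFormed() helper are made up for this illustration. The sketch only tests whether the text is well formed; it does not validate against a schema.

```python
from xml.etree import ElementTree

# Hypothetical XML snippets, invented for this illustration
wellFormed = '<protein id="P1"><name>Lysozyme</name></protein>'
malformed = '<protein id="P1"><name>Lysozyme</protein>'  # <name> is never closed

def isWellFormed(xmlText):
    """Return True if the parser accepts the text, False otherwise."""
    try:
        ElementTree.fromstring(xmlText)
        return True
    except ElementTree.ParseError:
        # Raised by the parser when the XML is not well formed
        return False

print(isWellFormed(wellFormed))  # True
print(isWellFormed(malformed))   # False
```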
Python, from version 2.5, includes an XML parser called ElementTree, which provides
a very convenient way of reading XML files. It can also be used to write XML files.
Because ElementTree does all the tricky parsing and well-formedness checking, in some sense it is
easier to read XML files than it is to write them. So when you read an XML file you only
need to pay attention to the information you want, but when you write an XML file you
have to include all the information that the schema is expecting.
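As a rough sketch of the writing side, the following builds a small tree and serialises it to text. The tag names, the id attribute and the protein name are invented for this illustration and do not follow any particular schema.

```python
from xml.etree import ElementTree

# Build a small tree; tags and values are made up for illustration
root = ElementTree.Element('proteins')
protein = ElementTree.SubElement(root, 'protein', id='P1')
name = ElementTree.SubElement(protein, 'name')
name.text = 'Lysozyme'

# Serialise to a Python string rather than to a file
xmlText = ElementTree.tostring(root, encoding='unicode')
print(xmlText)
# <proteins><protein id="P1"><name>Lysozyme</name></protein></proteins>
```

To write to a file instead, the root can be wrapped in a tree object and saved with ElementTree.ElementTree(root).write(fileName), where fileName is whatever path you choose.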
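ElementTree includes a fast C-language implementation of the parser (hidden
underneath the Python). In older versions of Python this had to be imported explicitly
as cElementTree, but that module was deprecated in Python 3.3 and removed in
Python 3.9. The standard module now uses the C implementation automatically whenever it is
available, so the recommended import is simply:
from xml.etree import ElementTree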
The first step when using ElementTree to read an XML file is to parse it using the
module’s parse() function, which provides a handle to the XML tree object:
xmlFile = 'examples/protein.xml'
tree = ElementTree.parse(xmlFile)
root = tree.getroot()
The parse() function accepts either the path to the file or a file handle object. The
getroot() function on the tree handle then returns the root (top) object. From the root
object you can navigate down the tree hierarchy, extracting the information you need
using several functions that ElementTree provides for every node element.
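The following sketch shows parse() being given a file-like object instead of a path. An in-memory io.StringIO object stands in for a real file so that the example is self-contained, and the XML content is invented for this illustration.

```python
import io
from xml.etree import ElementTree

# Hypothetical XML content; StringIO stands in for an open file
xmlText = '''<proteins>
  <protein id="P1"><name>Lysozyme</name></protein>
  <protein id="P2"><name>Insulin</name></protein>
</proteins>'''

fileObj = io.StringIO(xmlText)
tree = ElementTree.parse(fileObj)
root = tree.getroot()

print(root.tag)  # proteins

# Iterating over a node visits its direct children in document order
for child in root:
    print(child.tag, child.get('id'))
```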
Given a node element, you can access any text that may be associated with it via
node.text. Access to the attributes is obtained by treating a node almost as if it were a
dictionary. So node.keys() returns the attribute names, and node.get(name) returns the
value of the attribute with the given name, or None if there is no attribute with that name.
However you cannot use the syntax node[name]. This is because node[n] instead returns
the nth child of the node.
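A short sketch of these access methods, using a made-up element with two attributes and two children. Note that the index in node[n] counts from zero, as with Python lists.

```python
from xml.etree import ElementTree

# Hypothetical element, invented for this illustration
node = ElementTree.fromstring(
    '<protein id="P1" organism="chicken">'
    '<name>Lysozyme</name><length>129</length>'
    '</protein>')

print(sorted(node.keys()))        # ['id', 'organism']
print(node.get('id'))             # P1
print(node.get('missing'))        # None, as there is no such attribute

# Integer indexing selects children, not attributes
print(node[0].tag, node[0].text)  # name Lysozyme
print(node[1].tag, node[1].text)  # length 129
```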
The node.find(pattern) function lets you find the first descendant of a node that matches
the pattern, or None if there are none matching. At its simplest, the pattern can just be a
tag, which would then find the first child of the node that has that tag. But you can get
further down the tree by using a Unix-style file-system path syntax, so, for example,
find('PubDate/Year') would find the first grandchild where the tag is ‘Year’ and its parent
(so the child of the original node) has tag ‘PubDate’.
You can even use wildcards for any of the tags on the path, so find('*/Year') would
match all children and find the first grandchild where the tag is ‘Year’, i.e. the
intermediate tag does not matter. However, you unfortunately cannot use wildcards to
match part of a tag, so find('Pub*/Year') would not work.
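The path patterns described above can be tried on a small example. The XML below is invented for this illustration; the ‘Revised’ element is placed first so that the wildcard and the explicit path give different results.

```python
from xml.etree import ElementTree

# Hypothetical record, invented for this illustration
node = ElementTree.fromstring('''<Article>
  <Revised><Year>2010</Year></Revised>
  <PubDate><Year>2009</Year><Month>Jun</Month></PubDate>
</Article>''')

# Explicit path: the Year child of the first PubDate child
print(node.find('PubDate/Year').text)  # 2009

# Wildcard for the intermediate tag: first Year grandchild of any child
print(node.find('*/Year').text)        # 2010

# A partial wildcard is treated as a literal tag name, so nothing matches
print(node.find('Pub*/Year'))          # None
```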
The findall(pattern) function works in the same way as find(pattern) except that it
returns all matching elements, instead of just the first one. Also, the findtext(pattern)
function returns the text of the first element that matches the pattern, or None if there is no
match. This is convenient shorthand, so
text = node.findtext(pattern)
is the same as
element = node.find(pattern)
# Compare with None explicitly: an element without children tests as false
if element is not None:
    text = element.text
else:
    text = None
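A brief sketch of findall() and findtext() on made-up data follows. One detail worth knowing is that findtext() gives an empty string, rather than None, when the element is found but contains no text, so it is not quite identical to taking the .text of the result of find().

```python
from xml.etree import ElementTree

# Hypothetical data, invented for this illustration
node = ElementTree.fromstring('''<proteins>
  <protein><name>Lysozyme</name></protein>
  <protein><name>Insulin</name></protein>
  <protein><name/></protein>
</proteins>''')

# findall() returns every match, in document order
names = [elem.text for elem in node.findall('protein/name')]
print(names)                                   # ['Lysozyme', 'Insulin', None]

# findtext() returns the text of the first match only
print(node.findtext('protein/name'))           # Lysozyme

# No match gives None, or the default value if one is supplied
print(node.findtext('protein/mass'))           # None
print(node.findtext('protein/mass', default='unknown'))  # unknown

# An element that exists but is empty gives '' from findtext()
emptyText = node[2].findtext('name')
print(repr(emptyText))                         # ''
```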