Python Programming for Biology: Bioinformatics and Beyond



[Tim J. Stevens, Wayne Boucher] Python Programming

Reading XML files

Extensible Markup Language (XML) is a way of storing information in files in a standard,
textual way. Although it is rather verbose, it is very popular and there are many tools for
parsing XML files, which makes it relatively easy to use. An XML file is organised like a
tree, with a containment hierarchy, so in some sense it is like the directory structure of a
file system. At the outermost level is the ‘root’ of the data tree. Each node in the tree is
called an XML element. Each element has a tag defining what kind of element it is, and may
also have any number of attributes, some text and any number of other (child) elements.
Each element, except for the root element, has a unique parent element. The XML tools let
you navigate this tree.
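
As a rough illustration of the tree described above, the following sketch parses a small XML record from a string and lists each element's tag, attributes and text, indented by its depth in the hierarchy. The record and the describe() helper are invented here purely for illustration; neither is part of ElementTree.

```python
from xml.etree import ElementTree

# Hypothetical record, made up for this illustration
xmlText = """<Article id="a1" lang="en">
  <Title>An example title</Title>
  <PubDate>
    <Year>2014</Year>
    <Month>Jun</Month>
  </PubDate>
</Article>"""

root = ElementTree.fromstring(xmlText)

def describe(element, depth=0):
  """Return one line per element: indented tag, attributes and text."""
  text = (element.text or '').strip()
  lines = ['%s%s %s %r' % ('  ' * depth, element.tag, element.attrib, text)]
  for child in element:  # iterating over an element yields its children
    lines.extend(describe(child, depth + 1))
  return lines

for line in describe(root):
  print(line)
```

Here Article is the root, Title and PubDate are its children, and Year and Month each have PubDate as their unique parent.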

An XML file needs to be syntactically well formed, and the parsing tools will
automatically check for this. An XML file may also be required to be valid, in the sense of
satisfying some ‘schema’, which defines what the hierarchy can be, including a
specification of the tags, attributes and parent/child relationships. Validating parsers can
also check for validity automatically if the XML file specifies a schema, although the
ElementTree parser described here only checks that a file is well formed. Schemas can be
defined either with a DTD (Document Type Definition) or with an XML schema. We do
not consider this issue further here. Note that just because an XML file is well formed and
valid does not mean that the data it contains is correct or meaningful.
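
The automatic check for well-formedness can be seen in a minimal sketch like the one below, which assumes Python's standard ElementTree parser. The two XML strings and the isWellFormed() helper are hypothetical, made up for this illustration.

```python
from xml.etree import ElementTree

goodText = '<Protein><Name>lysozyme</Name></Protein>'
badText = '<Protein><Name>lysozyme</Protein>'  # <Name> is never closed

def isWellFormed(xmlText):
  """Return True if the text parses, False if the parser rejects it."""
  try:
    ElementTree.fromstring(xmlText)
  except ElementTree.ParseError:
    return False
  return True

print(isWellFormed(goodText))
print(isWellFormed(badText))
```

A file that passes this check is only known to be syntactically correct; whether it satisfies a particular schema is a separate question.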

Python, from version 2.5, includes an XML parser called ElementTree, which provides
a very convenient way of reading XML files. It can also be used to write XML files.
Because ElementTree does all the tricky parsing work, in some sense it is easier to read
XML files than it is to write them. When you read an XML file you only need to pay
attention to the information you want, but when you write an XML file you have to
include all the information that the schema is expecting.
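
For the writing direction, a minimal sketch might look like the following. The tag names, the attribute and the text are hypothetical, and no schema is involved, so the code itself decides what goes into the tree.

```python
from xml.etree import ElementTree

# Build a tiny tree by hand: a root element with one child
root = ElementTree.Element('Protein', {'id': 'P1'})
name = ElementTree.SubElement(root, 'Name')
name.text = 'lysozyme'

# Serialise the tree to a text string
xmlText = ElementTree.tostring(root, encoding='unicode')
print(xmlText)
```

To save the result to disk instead, the root can be wrapped with ElementTree.ElementTree(root) and written out with that object's write() function.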

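ElementTree includes a fast C-language implementation of the parser (hidden
underneath the Python). In Python 2 and early versions of Python 3 this had to be requested
explicitly by importing cElementTree. From Python 3.3 the standard ElementTree module
uses the C implementation automatically whenever it is available, and cElementTree was
removed in Python 3.9, so in current versions the import is simply:

from xml.etree import ElementTree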
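
Because the cElementTree module was deprecated in Python 3.3 and removed in Python 3.9, code that has to run on both old and new Python versions could use a guarded import along these lines. This is only a sketch of one common pattern, not something the text above prescribes.

```python
try:
  # Older Python versions: ask for the C implementation explicitly
  from xml.etree import cElementTree as ElementTree
except ImportError:
  # Python 3.9 and later: cElementTree no longer exists, and the
  # standard module already uses the C accelerator when available
  from xml.etree import ElementTree

print(ElementTree.fromstring('<Protein/>').tag)
```

Either way the rest of the code can refer to the module as ElementTree.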

The first step when using ElementTree to read an XML file is to parse it using the
module’s parse() function, which provides a handle to the XML tree object:

xmlFile = 'examples/protein.xml'
tree = ElementTree.parse(xmlFile)
root = tree.getroot()

The parse() function accepts either the path to the file or a file handle object. The
getroot() function on the tree handle then returns the root (top) object. From the root
object you can navigate down the tree hierarchy, extracting the information you need
using several functions that ElementTree provides for every node element.
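
The two ways of calling parse() can be compared in a short sketch. So that it does not depend on examples/protein.xml being present, it first writes a small, made-up XML file to a temporary directory; the file contents and variable names are assumptions for illustration only.

```python
import os
import tempfile
from xml.etree import ElementTree

# Hypothetical file contents, written to a throwaway location
xmlText = '<Protein id="P1"><Name>lysozyme</Name></Protein>'
tempDir = tempfile.mkdtemp()
xmlFile = os.path.join(tempDir, 'protein.xml')
with open(xmlFile, 'w') as fileObj:
  fileObj.write(xmlText)

# Parse by giving the path to the file
rootA = ElementTree.parse(xmlFile).getroot()

# Parse by giving an already opened file handle
with open(xmlFile, 'rb') as fileObj:
  rootB = ElementTree.parse(fileObj).getroot()

print(rootA.tag, rootB.tag)
```

Both calls give a root element with the same tag and attributes.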

Given a node element, you can access any text that may be associated with it via
node.text. Access to the attributes is obtained by treating a node almost as if it were a
dictionary. So node.keys() returns the attribute names, and node.get(name) returns the
value of the attribute with the given name, or None if there is no attribute with that name.
However, you cannot use the syntax node[name]. This is because node[n] instead returns
the nth child of the node, counting from zero.
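
A short sketch of these access functions, using a made-up element (the tag and attribute names are hypothetical):

```python
from xml.etree import ElementTree

node = ElementTree.fromstring(
  '<Article id="a1" lang="en"><Title>First</Title><Year>2014</Year></Article>')

attrNames = sorted(node.keys())  # attribute names, sorted for display
print(attrNames)
print(node.get('id'))            # value of an existing attribute
print(node.get('missing'))       # None: there is no such attribute
print(node[0].tag, node[1].text) # children are accessed by position
print(len(node))                 # number of children

# Square brackets expect a position, not an attribute name
try:
  node['id']
  lookupFailed = False
except TypeError:
  lookupFailed = True
print(lookupFailed)
```

The last part shows why node.get(name) is needed: indexing with a string fails rather than looking up an attribute.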



The node.find(pattern) function lets you find the first descendant of a node that matches
the pattern, or None if there are none matching. At its simplest, the pattern can just be a
tag, which would then find the first child of the node that has that tag. But you can get
further down the tree by using a Unix-style file-system path syntax, so, for example,
find('PubDate/Year') would find the first grandchild where the tag is 'Year' and its parent
(so the child of the original node) has tag 'PubDate'.

You can even use wildcards for any of the tags on the path, so find('*/Year') would
match all children and find the first grandchild where the tag is 'Year', i.e. the
intermediate tag does not matter. However, you unfortunately cannot use wildcards to
match part of a tag, so find('Pub*/Year') would not work.
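
The path and wildcard behaviour can be tried out on a small, invented record, as in the sketch below. The 'Revised' tag is hypothetical and is only there to give the wildcard something extra to match.

```python
from xml.etree import ElementTree

node = ElementTree.fromstring("""<Article>
  <PubDate><Year>2014</Year></PubDate>
  <Revised><Year>2015</Year></Revised>
</Article>""")

print(node.find('PubDate').tag)        # a plain tag matches a child
print(node.find('PubDate/Year').text)  # a path reaches a grandchild
print(node.find('Revised/Year').text)
print(node.find('*/Year').text)        # any child, then the first Year
print(node.find('Year'))               # None: Year is not a direct child
print(node.find('Pub*/Year'))          # a partial wildcard matches nothing
```

With the standard library implementation, the partial wildcard in the last line is expected to give None rather than an error, because 'Pub*' is compared literally against the tag names.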

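The findall(pattern) function works in the same way as find(pattern) except that it
returns all matching elements, instead of just the first one. Also, the findtext(pattern)
function returns the text of the first element that matches the pattern, or None if there is no
match. This is convenient shorthand, so

text = node.findtext(pattern)

is essentially the same as

element = node.find(pattern)
if element is not None:
  text = element.text
else:
  text = None

The only difference is that findtext() gives an empty string, rather than None, when the
matching element has no text. Note that the test has to be written as ‘is not None’ rather
than just ‘if element:’, because the truth value of an element has traditionally reflected
whether it has any children, so a matching element without children would wrongly be
treated as missing.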
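
A brief sketch of findall() and findtext() on an invented record follows. The tag names and the 'n/a' default are assumptions for illustration; the optional second argument to findtext() is the value given back when nothing matches.

```python
from xml.etree import ElementTree

node = ElementTree.fromstring("""<Article>
  <Author><Name>Smith</Name></Author>
  <Author><Name>Jones</Name></Author>
  <Title>Example</Title>
</Article>""")

# findall() gives every match, in document order
names = [element.text for element in node.findall('Author/Name')]
print(names)

print(node.findtext('Title'))            # text of the first match
print(node.findtext('Abstract'))         # None: nothing matches
print(node.findtext('Abstract', 'n/a'))  # explicit default instead of None

# A found element can have no children, so test for a match
# by comparing with None rather than relying on its truth value
title = node.find('Title')
print(title is not None, len(title))
```

The final line shows a successfully found element that has zero children, which is the case where a bare truth test on the element would be misleading.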

