Tuesday, September 1, 2020

XML Processing in Python



Hello dear readers! Welcome back to another section of my tutorial on Python. In this post, we are going to discuss XML processing in Python. Read through this guide carefully and feel free to ask your questions.

XML is a portable, open standard that lets developers create applications whose data can be read by other applications, regardless of the operating system or development language.

What is XML?

The Extensible Markup Language (XML) is a markup language much like HTML, XHTML or SGML. It is recommended by the World Wide Web Consortium (W3C) and is available as an open standard.

XML is extremely useful for keeping track of small to medium amounts of data without needing a SQL-based backbone.


XML Parser Architectures and APIs

The Python standard library provides a small but useful set of interfaces for working with XML.

The two most basic and broadly used APIs to XML data are known as the SAX and DOM interfaces.

  • Simple API for XML (SAX) - Here, you register callbacks for the events of interest and let the parser proceed through the document. This is useful when your documents are very large or when memory is limited, because the parser processes the file as it reads it from disk and the full file is never stored in memory.
  • Document Object Model (DOM) API - This is a World Wide Web Consortium (W3C) recommendation in which the full file is read into memory and stored in a hierarchical (tree-based) form that represents all the features of an XML document.


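Because SAX only moves forward through a document, it is a poor fit for tasks that need to jump around in the data, which DOM handles directly once the tree is built. On the other hand, DOM holds the whole document in memory, so using it exclusively can exhaust your resources, especially on large files.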

SAX is read-only, while DOM allows changes to the XML file. Since these two APIs complement each other, there is no reason why you cannot use both in large projects.
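As a rough illustration of how the two APIs can sit side by side, the sketch below reads movie titles with SAX and then changes one title with DOM. The names TitleCollector, titles_with_sax and retitle_with_dom are made up for this example, and the XML is a cut-down version of movies.xml held in memory.

```python
import xml.dom.minidom
import xml.sax

# A shortened, in-memory stand-in for movies.xml (assumption for this sketch)
XML = b"""<collection shelf="New Arrivals">
<movie title="Enemy Behind"><stars>10</stars></movie>
<movie title="Ishtar"><stars>2</stars></movie>
</collection>"""


class TitleCollector(xml.sax.ContentHandler):
    """SAX side: react to each <movie> start tag as the parser streams by."""

    def __init__(self):
        super().__init__()
        self.titles = []

    def startElement(self, tag, attributes):
        if tag == "movie":
            self.titles.append(attributes["title"])


def titles_with_sax(data):
    """Read-only pass: collect every movie title."""
    handler = TitleCollector()
    xml.sax.parseString(data, handler)
    return handler.titles


def retitle_with_dom(data, old, new):
    """DOM side: load the whole tree, change one attribute, serialize it."""
    tree = xml.dom.minidom.parseString(data)
    for movie in tree.getElementsByTagName("movie"):
        if movie.getAttribute("title") == old:
            movie.setAttribute("title", new)
    return tree.documentElement.toxml()


if __name__ == "__main__":
    print(titles_with_sax(XML))
    print(retitle_with_dom(XML, "Ishtar", "Ishtar (VHS)"))
```

The SAX function never holds more than the list of titles, while the DOM function needs the full tree in memory before it can edit anything.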

Example

For all our XML code examples, we will be using a simple XML file movies.xml as an input -

<collection shelf="New Arrivals">
<movie title="Enemy Behind">
   <type>War, Thriller</type>
   <format>DVD</format>
   <year>2003</year>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Talk about a US-Japan war</description>
</movie>
<movie title="Transformers">
   <type>Anime, Science Fiction</type>
   <format>DVD</format>
   <year>1989</year>
   <rating>R</rating>
   <stars>8</stars>
   <description>A schientific fiction</description>
</movie>
<movie title="Trigun">
   <type>Anime, Action</type>
   <format>DVD</format>
   <episodes>4</episodes>
   <rating>PG</rating>
   <stars>10</stars>
   <description>Vash the Stampede!</description>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
   <rating>PG</rating>
   <stars>2</stars>
   <description>Viewable boredom</description>
</movie>
</collection>



Parsing XML with SAX APIs

SAX is a standard interface for event-driven XML parsing. Parsing XML with SAX normally requires you to create your own ContentHandler by subclassing xml.sax.ContentHandler.

Your ContentHandler handles the particular tags and attributes of your flavor(s) of XML. A ContentHandler object provides methods to handle various parsing events. Its owning parser calls the ContentHandler methods as it parses the XML file.

The methods startDocument and endDocument are called at the start and end of the XML file. The method characters(text) is passed the character data of the XML file via the parameter text.

The ContentHandler is called at the start and end of each element. If the parser is not in namespace mode, the methods startElement(tag, attributes) and endElement(tag) are called; otherwise, the corresponding methods startElementNS and endElementNS are called. Here, tag is the element's tag and attributes is an Attributes object.
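To see the order in which these callbacks arrive, here is a minimal sketch that simply records every event. EventLogger and log_events are names invented for this illustration, and the one-line XML input is an assumption, not part of movies.xml.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Record each SAX callback in the order the parser makes it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append("startDocument")

    def endDocument(self):
        self.events.append("endDocument")

    def startElement(self, tag, attributes):
        # attributes behaves like a read-only mapping of name -> value
        self.events.append(("startElement", tag, dict(attributes.items())))

    def endElement(self, tag):
        self.events.append(("endElement", tag))

    def characters(self, text):
        # Skip whitespace-only chunks to keep the log short
        if text.strip():
            self.events.append(("characters", text))


def log_events(xml_text):
    """Parse xml_text and return the list of recorded events."""
    handler = EventLogger()
    xml.sax.parseString(xml_text, handler)
    return handler.events


if __name__ == "__main__":
    sample = b'<movie title="Ishtar"><format>VHS</format></movie>'
    for event in log_events(sample):
        print(event)
```

Running it shows startDocument first and endDocument last, with each element's start, text and end events nested in between.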


Here are other important methods to understand before proceeding -


The make_parser Method

This method creates a new parser object and returns it. The parser object created will be of the first parser type that the system finds.

Syntax

Following is the syntax for using the make_parser method -

xml.sax.make_parser( [parser_list] )

Parameter Details

Following are the details of the parameters -

  • parser_list - An optional list of parser module names to try first. Each named module must provide a create_parser() function.
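A minimal sketch of make_parser in use, assuming the default parser list is good enough. TagCounter and count_tags are hypothetical names for this example, and the input is read from an in-memory byte stream rather than a file.

```python
import io
import xml.sax
import xml.sax.handler


class TagCounter(xml.sax.ContentHandler):
    """Count how many times each element name starts."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, tag, attributes):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tags(data):
    """Build a parser by hand, attach a handler, and parse a byte string."""
    parser = xml.sax.make_parser()  # no parser_list: use the default parsers
    parser.setFeature(xml.sax.handler.feature_namespaces, False)
    handler = TagCounter()
    parser.setContentHandler(handler)
    # parse() accepts a file name or a file-like object
    parser.parse(io.BytesIO(data))
    return handler.counts


if __name__ == "__main__":
    print(count_tags(b"<collection><movie/><movie/></collection>"))
```

Creating the parser yourself is mostly useful when you want to set features or swap handlers before parsing; for a one-off parse, the helper functions described next are shorter.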

The parse Method

This method creates a SAX parser and uses it to parse a document.

Syntax

Following is the syntax for using the parse method -

xml.sax.parse( xmlfile, contenthandler[, errorhandler])

Parameter Details

Following are the details of the parameters -

  • xmlfile - The name of the XML file to read from.
  • contenthandler - This must be a ContentHandler object.
  • errorhandler - If specified, it must be a SAX ErrorHandler object.
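The following sketch shows one way xml.sax.parse might be called with a file name. StarsHandler and read_stars are names made up for this illustration, and the sample document is written to a temporary file so the example does not depend on movies.xml being present.

```python
import os
import tempfile
import xml.sax


class StarsHandler(xml.sax.ContentHandler):
    """Collect the integer value of every <stars> element."""

    def __init__(self):
        super().__init__()
        self.in_stars = False
        self.chunks = []
        self.stars = []

    def startElement(self, tag, attributes):
        if tag == "stars":
            self.in_stars = True
            self.chunks = []

    def characters(self, content):
        # Text may arrive in several pieces, so collect them all
        if self.in_stars:
            self.chunks.append(content)

    def endElement(self, tag):
        if tag == "stars":
            self.stars.append(int("".join(self.chunks)))
            self.in_stars = False


def read_stars(path):
    """Parse the XML file at path and return its star ratings."""
    handler = StarsHandler()
    xml.sax.parse(path, handler)
    return handler.stars


if __name__ == "__main__":
    sample = ("<collection><movie><stars>10</stars></movie>"
              "<movie><stars>2</stars></movie></collection>")
    with tempfile.NamedTemporaryFile("w", suffix=".xml", delete=False) as tmp:
        tmp.write(sample)
    try:
        print(read_stars(tmp.name))
    finally:
        os.remove(tmp.name)
```

Joining the chunks in endElement, instead of storing only the last call to characters, keeps the handler correct even if the parser splits one text node across several calls.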


The parseString Method

There is one more method, which creates a SAX parser and parses the specified XML string.

Syntax

Following is the syntax for using the parseString method -

xml.sax.parseString( xmlstring, contenthandler[, errorhandler])

Parameter Details

Following are the details of the parameters -

  • xmlstring - The XML string to read from.
  • contenthandler - This must be a ContentHandler object.
  • errorhandler - If specified, it must be a SAX ErrorHandler object.
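As a small sketch of parseString, the function below checks whether a string is well-formed XML. check_well_formed is a hypothetical helper, not part of xml.sax. It relies on the default error handler, which raises SAXParseException when the parser hits a fatal error.

```python
import xml.sax


def check_well_formed(xml_text):
    """Return None if xml_text parses, else a short error description."""
    try:
        # The base ContentHandler ignores every event, which is enough here
        xml.sax.parseString(xml_text, xml.sax.ContentHandler())
    except xml.sax.SAXParseException as exc:
        return "line %d: %s" % (exc.getLineNumber(), exc.getMessage())
    return None


if __name__ == "__main__":
    print(check_well_formed(b"<movie><format>DVD</format></movie>"))
    print(check_well_formed(b"<movie><format>DVD</movie>"))
```

The input is passed as bytes here; recent Python 3 versions also accept a str.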

Example

Following is an example -

#!/usr/bin/env python3

import xml.sax
import xml.sax.handler

class MovieHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.CurrentData = ""
        self.type = ""
        self.format = ""
        self.year = ""
        self.rating = ""
        self.stars = ""
        self.description = ""

    # Called when an element starts
    def startElement(self, tag, attributes):
        self.CurrentData = tag
        if tag == "movie":
            print("*****Movie*****")
            title = attributes["title"]
            print("Title:", title)

    # Called when an element ends
    def endElement(self, tag):
        if self.CurrentData == "type":
            print("Type:", self.type)
        elif self.CurrentData == "format":
            print("Format:", self.format)
        elif self.CurrentData == "year":
            print("Year:", self.year)
        elif self.CurrentData == "rating":
            print("Rating:", self.rating)
        elif self.CurrentData == "stars":
            print("Stars:", self.stars)
        elif self.CurrentData == "description":
            print("Description:", self.description)
        self.CurrentData = ""

    # Called when character data is read.
    # Note: a parser may deliver one text node in several calls;
    # this simple handler keeps only the latest chunk.
    def characters(self, content):
        if self.CurrentData == "type":
            self.type = content
        elif self.CurrentData == "format":
            self.format = content
        elif self.CurrentData == "year":
            self.year = content
        elif self.CurrentData == "rating":
            self.rating = content
        elif self.CurrentData == "stars":
            self.stars = content
        elif self.CurrentData == "description":
            self.description = content

if __name__ == "__main__":

    # create an XMLReader
    parser = xml.sax.make_parser()
    # turn off namespaces
    parser.setFeature(xml.sax.handler.feature_namespaces, 0)

    # override the default ContentHandler
    Handler = MovieHandler()
    parser.setContentHandler(Handler)

    parser.parse("movies.xml")

Output

When the above code is executed, it will produce the following result -

*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Year: 2003
Rating: PG
Stars: 10
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Year: 1989
Rating: R
Stars: 8
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Stars: 10
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Stars: 2
Description: Viewable boredom


Parsing XML with DOM APIs
The Document Object Model (DOM) is a cross-language API from the W3C for accessing and modifying XML documents.

The DOM is extremely useful for random-access applications. SAX only allows you to view one bit of the document at a time. So if you are looking at one SAX element, you have no access to another.

Here is the simplest way to load an XML document and build a DOM tree, using the xml.dom.minidom module. The minidom module provides a simple parse function that quickly creates a DOM tree from the XML file.

The sample calls the parse(file [,parser]) function of the minidom module to parse the XML file designated by file into a DOM tree object.
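To illustrate the random access that DOM gives you, here is a short sketch that indexes movies by title and reads their children in any order. child_text and index_by_title are helper names invented for this example, and the XML is a trimmed, in-memory version of movies.xml parsed with minidom's parseString function.

```python
import xml.dom.minidom

# A shortened stand-in for movies.xml (assumption for this sketch)
XML = """<collection shelf="New Arrivals">
<movie title="Trigun">
   <type>Anime, Action</type>
   <episodes>4</episodes>
</movie>
<movie title="Ishtar">
   <type>Comedy</type>
   <format>VHS</format>
</movie>
</collection>"""


def child_text(element, tag, default=""):
    """Return the text of the first <tag> descendant, or default if absent."""
    nodes = element.getElementsByTagName(tag)
    if not nodes or nodes[0].firstChild is None:
        return default
    return nodes[0].firstChild.data


def index_by_title(xml_text):
    """Map each movie title to its element so any movie can be reached directly."""
    tree = xml.dom.minidom.parseString(xml_text)
    movies = tree.documentElement.getElementsByTagName("movie")
    return {movie.getAttribute("title"): movie for movie in movies}


if __name__ == "__main__":
    movies = index_by_title(XML)
    # Jump to the last movie first, then back to the first one
    print(child_text(movies["Ishtar"], "format"))
    print(child_text(movies["Trigun"], "episodes"))
    # A missing element falls back to the default instead of raising
    print(child_text(movies["Trigun"], "format", "unknown"))
```

Checking for a missing element before indexing avoids the IndexError that a bare getElementsByTagName(...)[0] would raise when a movie lacks that child.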

Example
#!/usr/bin/env python3

import xml.dom.minidom

# Open XML document using minidom parser
DOMTree = xml.dom.minidom.parse("movies.xml")
collection = DOMTree.documentElement
if collection.hasAttribute("shelf"):
    print("Root element : %s" % collection.getAttribute("shelf"))

# Get all the movies in the collection
movies = collection.getElementsByTagName("movie")

# Print detail of each movie.
for movie in movies:
    print("*****Movie*****")
    if movie.hasAttribute("title"):
        print("Title: %s" % movie.getAttribute("title"))

    # Names avoid shadowing the built-ins type() and format()
    movie_type = movie.getElementsByTagName("type")[0]
    print("Type: %s" % movie_type.childNodes[0].data)
    movie_format = movie.getElementsByTagName("format")[0]
    print("Format: %s" % movie_format.childNodes[0].data)
    rating = movie.getElementsByTagName("rating")[0]
    print("Rating: %s" % rating.childNodes[0].data)
    description = movie.getElementsByTagName("description")[0]
    print("Description: %s" % description.childNodes[0].data)

Output
When the above code is executed, it will produce the following result -

Root element : New Arrivals
*****Movie*****
Title: Enemy Behind
Type: War, Thriller
Format: DVD
Rating: PG
Description: Talk about a US-Japan war
*****Movie*****
Title: Transformers
Type: Anime, Science Fiction
Format: DVD
Rating: R
Description: A schientific fiction
*****Movie*****
Title: Trigun
Type: Anime, Action
Format: DVD
Rating: PG
Description: Vash the Stampede!
*****Movie*****
Title: Ishtar
Type: Comedy
Format: VHS
Rating: PG
Description: Viewable boredom


Alright guys! This is where we round up this tutorial post. In my next tutorial, we are going to discuss Python GUI programming.

Feel free to ask your questions where necessary and I will attend to them as soon as possible. If this tutorial was helpful to you, you can use the share button to share it.


Thanks for reading and bye for now.