Using the Reader API
Note: This is copy-pasted from my blog post, written a year back.
This post describes the Reader API[1] of ruby-libxml.
Several techniques exist for parsing XML documents; you can read up on them in this Wikipedia article. Reader provides a StAX-style API for parsing XML documents.
The Reader API provides a "cursor" that moves forward through the XML document node by node, and you process the data in a node while the cursor is at it. This paradigm is also called "pull parsing". You can initialize an XML document from a file, a string, a URI or an IO object and then call XML::Reader#read to move through the document. The read method returns false when there are no more nodes to read. Optionally, you can pass a hash when initializing the document to control how parsing is done. Typically, you would do something like this:
doc = XML::Reader.file("trees.xml", :options => XML::Parser::Options::NOENT)
process(doc) while doc.read
The possible parsing options are constants defined under XML::Parser::Options[2]. Several options can be combined using bitwise OR ( | ).
After a document is parsed, you should free the resources by calling doc.close.
Getting Information from the current node
While the cursor is at one of the nodes, you can query it for:
Node Type: doc.node_type returns the type of the node, which is one of the following:
    Start of an element : 1
    Attributes : 2
    Text : 3
    CDATA : 4
    Entity References : 5
    Entity Declarations : 6
    Processing Instruction : 7
    Comments : 8
    Document : 9
    DTD/Doctype : 10
    Document Fragment : 11
    Notation : 12
    Whitespace : 13
    Significant Whitespace : 14
    End of an element : 15
    End entity : 16
    XML Declaration : 17
See [3] for a description of all the node types. Constants for the node types are defined under the XML::Reader[1] class, for example XML::Reader::TYPE_END_ELEMENT.
Name: doc.name returns the qualified name of the node (prefix + local name).
Local Name: doc.local_name returns the local name of the node (the name without the associated prefix).
Prefix: doc.prefix returns the namespace prefix associated with the node.
Namespace: doc.namespace_uri returns the URI of the node's namespace. Namespace declarations are also considered nodes, in line with the DOM API. You can use doc.namespace_declaration? to find out whether an attribute node is a namespace declaration. Given a prefix (see Prefix above), you can look up the associated namespace with doc.lookup_namespace("prefix"); use nil if you want the default namespace.
Value: doc.value returns the text value of the node if present, else nil. Alternatively, you can check whether the node has a text value with doc.has_value?.
Empty: doc.empty_element? tells you whether the node is empty. Empty elements are those that are closed in their start tag itself. So <foo/> is an empty element, whereas <foo></foo> is not.
Depth: doc.depth returns the depth of the node in the tree, counted from the base element.
Reading attributes
To find out whether a node has attributes, use doc.has_attributes?. You can find the attribute count of the node with doc.attribute_count. Even though attributes are also nodes, doc.read does not move the cursor to an attribute node. Attributes can be accessed in a hash-like manner with the [] method, which can be called with the attribute's name or index (the first attribute has index 0). With doc.move_to_next_attribute you can move the cursor to the next attribute. It returns 1 if the cursor moved to the next attribute and 0 if there is no attribute to move to. While the cursor is at an attribute node, you can query it like any other node (for name, value, node type, depth) as described above. You must remember to move back to the element node with doc.move_to_element. Alternatively, you can call move_to_attribute on the cursor with an attribute's name as the argument to move to that attribute node. I prefer the array notation. read_attribute_value is a related method whose use I have not understood fully. Refer to the documentation if you wish.
Validation
To check whether the XML document conforms to a schema definition, call the schema_validate method on the reader object and pass it the location of the schema file. It returns 0 if the document validates and -1 in case of an error. Note that this function should be called just after you instantiate a Reader object. Trying to validate an XML document after you have started reading (called read on the document object) is an error.
doc.schema_validate("schema.xsd")
There are a few more API calls, which you can look up in [1].
Below is a "hello world" code example using the Reader API, a sample XML file and the result of parsing it. Since I have described the technicalities above, I am not going to walk you through the code.
require "rubygems"
require "xml"

# Parse sample.xml, ignoring blank (whitespace-only) nodes and
# performing entity substitution.
doc = XML::Reader.file("sample.xml", :options => XML::Parser::Options::NOBLANKS |
                                                 XML::Parser::Options::NOENT)

# Display a node's name; for prefixed names, also the prefix and local name.
def display_name( node )
  puts "\tName: #{node.name}"
  if node.prefix
    puts "\t\tPrefix: #{node.prefix}"
    puts "\t\tLocal: #{node.local_name}"
  end
end

# Display the attributes of a node.
def display_attributes( node )
  node.attribute_count.times do | index |
    puts "Attribute # #{index + 1}"
    node.move_to_next_attribute
    display node
  end
  node.move_to_element
end

# Process a node.
def display( node )
  display_name node
  puts "\tDepth: #{node.depth}"
  puts "\tEmpty Element" if node.empty_element?
  puts "\tValue: #{node.value}" if node.has_value?
  display_attributes node
  print "\n"
end

# Step through the document, skipping end-of-element nodes.
i = 1
while doc.read
  unless doc.node_type == XML::Reader::TYPE_END_ELEMENT
    puts "Node # #{i}"
    display doc
    i += 1
  end
end

# Free the resources.
doc.close
Sample: a NeXML file.
<?xml version="1.0" encoding="ISO-8859-1"?>
<nex:nexml
version="0.8"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://www.nexml.org/1.0 ../xsd/nexml.xsd"
xmlns:nex="http://www.nexml.org/1.0"
generator="mesquite"
xmlns:xlink="http://www.w3.org/1999/xlink"
xmlns="http://www.nexml.org/1.0">
<otus
id="taxa1"
label="My taxa block"
xml:base="http://example.org/"
xml:id="taxa1"
class="taxset1"
xml:lang="EN"
xlink:href="#taxa1">
<!--
The taxon element is analogous to a single label in
a nexus taxa block. It may have the same additional
attributes (label, xml:base, xml:lang, xml:id, xlink:href
and class) as the taxa element.
-->
<otu id="t1"/>
<otu id="t2"/>
<otu id="t3"/>
<otu id="t4"/>
<otu id="t5"/>
</otus>
</nex:nexml>
Output:
Node # 1
Name: nex:nexml
Prefix: nex
Local: nexml
Depth: 0
Attribute # 1
Name: xmlns:xsi
Prefix: xmlns
Local: xsi
Depth: 1
Value: http://www.w3.org/2001/XMLSchema-instance
Attribute # 2
Name: xmlns:xsi
Prefix: xmlns
Local: xsi
Depth: 1
Value: http://www.w3.org/2001/XMLSchema-instance
Attribute # 3
Name: xmlns:xsi
Prefix: xmlns
Local: xsi
Depth: 1
Value: http://www.w3.org/2001/XMLSchema-instance
Attribute # 4
Name: xmlns:xsi
Prefix: xmlns
Local: xsi
Depth: 1
Value: http://www.w3.org/2001/XMLSchema-instance
Attribute # 5
Name: xmlns:xsi
Prefix: xmlns
Local: xsi
Depth: 1
Value: http://www.w3.org/2001/XMLSchema-instance
Attribute # 6
Name: xmlns:xsi
Prefix: xmlns
Local: xsi
Depth: 1
Value: http://www.w3.org/2001/XMLSchema-instance
Attribute # 7
Name: xmlns:xsi
Prefix: xmlns
Local: xsi
Depth: 1
Value: http://www.w3.org/2001/XMLSchema-instance
Node # 2
Name: otus
Depth: 1
Attribute # 1
Name: id
Depth: 2
Value: taxa1
Attribute # 2
Name: id
Depth: 2
Value: taxa1
Attribute # 3
Name: id
Depth: 2
Value: taxa1
Attribute # 4
Name: id
Depth: 2
Value: taxa1
Attribute # 5
Name: id
Depth: 2
Value: taxa1
Attribute # 6
Name: id
Depth: 2
Value: taxa1
Attribute # 7
Name: id
Depth: 2
Value: taxa1
Node # 3
Name: #comment
Depth: 2
Value:
The taxon element is analogous to a single label in
a nexus taxa block. It may have the same additional
attributes (label, xml:base, xml:lang, xml:id, xlink:href
and class) as the taxa element.
Node # 4
Name: otu
Depth: 2
Empty Element
Attribute # 1
Name: id
Depth: 3
Value: t1
Node # 5
Name: otu
Depth: 2
Empty Element
Attribute # 1
Name: id
Depth: 3
Value: t2
Node # 6
Name: otu
Depth: 2
Empty Element
Attribute # 1
Name: id
Depth: 3
Value: t3
Node # 7
Name: otu
Depth: 2
Empty Element
Attribute # 1
Name: id
Depth: 3
Value: t4
Node # 8
Name: otu
Depth: 2
Empty Element
Attribute # 1
Name: id
Depth: 3
Value: t5
XML::Reader is primarily a streaming interface, but it also provides convenient methods to mix in the DOM API (XML::Parser). XPath queries can then be used. Perhaps I will write about it in a future post, after I have tried it out. You can find good info at [4].
[1] XML::Reader API Docs
[2] XML Parsing options
[3] Node Types
[4] Tutorial to LibXML Reader Interface (in C)