Clojure Cookbook: XML/HTML Processing

Traversing an XML Document

Problem

How do I read an XML document and walk through its elements?

Solution

Clojure ships with libraries for processing XML documents. One low-level approach is to use the function clojure.xml/parse, which reads a document and returns a map representing the root element, with child elements nested inside it. parse accepts a File, an InputStream, or a String containing a URI as its argument.
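
For a quick experiment without a file on disk, the InputStream form can be sketched as follows. The helper name parse-xml-string and the sample string are assumptions made for illustration; they are not part of clojure.xml.

```clojure
(require '[clojure.xml :as xml])
(import 'java.io.ByteArrayInputStream)

;; Hypothetical helper: wrap a string's bytes in an InputStream,
;; which is one of the argument types parse accepts.
(defn parse-xml-string [^String s]
  (xml/parse (ByteArrayInputStream. (.getBytes s "UTF-8"))))

(def tiny-doc
  (parse-xml-string
    "<calendar><holiday type=\"State\"><name>Kamehameha Day</name></holiday></calendar>"))

(:tag tiny-doc)                      ; the root element's tag
(:attrs (first (:content tiny-doc))) ; attributes of the first child element
```

Note that passing the XML text itself as a String would not work, because parse treats a String argument as a URI.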

Suppose the following XML document is located in a file named "calendar.xml":

<?xml version="1.0"?>
<calendar>
  <holiday type="International">
    <name>International Lefthanders Day</name>
    <date>
      <month>August</month>
      <day>13</day>
    </date>
  </holiday>
  <holiday type="Personal">
    <name>Rover's birthday</name>
    <date>
      <month>October</month>
      <day>12</day>
    </date>
  </holiday>
  <holiday type="National">
    <name>Groundhog Day</name>
    <date>
      <month>February</month>
      <day>2</day>
    </date>
  </holiday>
  <holiday type="State">
    <name>Kamehameha Day</name>
    <date>
      <month>June</month>
      <day>11</day>
    </date>
  </holiday>
</calendar>

parse returns a map with three keys:
(use '[clojure.xml :only (parse)])
(import 'java.io.File)
(def xml-doc (parse (File. "calendar.xml")))
(keys xml-doc) => (:tag :attrs :content)

The :tag key holds the name of the root element as a keyword:
(:tag xml-doc) => :calendar

It has no attributes but contains 4 child elements:
(:attrs xml-doc) => nil
(count (:content xml-doc)) => 4

The first child element is a <holiday> element:
(def holiday (first (:content xml-doc)))
(:tag holiday) => :holiday
(:attrs holiday) => {:type "International"}

The holiday contains 2 children of its own, a <name> element and a <date> element:

(:content holiday) =>
[{:tag :name, :attrs nil, :content ["International Lefthanders Day"]}
 {:tag :date, :attrs nil,
  :content [{:tag :month, :attrs nil, :content ["August"]}
            {:tag :day, :attrs nil, :content ["13"]}]}]
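
Since the printed output shows :content as a vector, a nested value can also be reached with get-in and a path of keys and indexes. This is a minimal sketch that uses a hand-written copy of the map shown above; the name sample-holiday is an assumption for illustration.

```clojure
;; Literal copy of the parsed <holiday> element shown above.
(def sample-holiday
  {:tag :holiday
   :attrs {:type "International"}
   :content [{:tag :name, :attrs nil, :content ["International Lefthanders Day"]}
             {:tag :date, :attrs nil,
              :content [{:tag :month, :attrs nil, :content ["August"]}
                        {:tag :day, :attrs nil, :content ["13"]}]}]})

;; Path: second child (<date>), its first child (<month>), then its text.
(get-in sample-holiday [:content 1 :content 0 :content 0])
;; => "August"
```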

Walking the parsed map by hand quickly becomes tedious, so a higher-level approach is often more convenient. The function clojure.core/xml-seq wraps the parsed document in a lazy sequence that visits every node, elements and text alike, in depth-first order:

(map (fn [elt] (or (:tag elt) elt)) (xml-seq xml-doc)) =>
(:calendar 
  :holiday :name "International Lefthanders Day" :date :month "August" :day "13" 
  :holiday :name "Rover's birthday" :date :month "October" :day "12" 
  :holiday :name "Groundhog Day" :date :month "February" :day "2" 
  :holiday :name "Kamehameha Day" :date :month "June" :day "11")
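
To see why text nodes show up alongside elements, it helps to know that xml-seq is a thin wrapper over tree-seq. The sketch below runs against a small hand-written map (mini-doc and nodes are illustrative names), compares xml-seq with an equivalent tree-seq call, and then keeps only the element nodes.

```clojure
(def mini-doc
  {:tag :calendar, :attrs nil,
   :content [{:tag :holiday, :attrs {:type "State"},
              :content [{:tag :name, :attrs nil, :content ["Kamehameha Day"]}]}]})

;; Branch nodes are anything that is not a string; children come from :content.
(def nodes (tree-seq (complement string?) (comp seq :content) mini-doc))

(= nodes (xml-seq mini-doc))
;; => true

;; Drop text nodes and keep only the element tags.
(map :tag (filter map? (xml-seq mini-doc)))
;; => (:calendar :holiday :name)
```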

We can use a list comprehension (for) to extract each holiday's name, month, and day:

(defn holiday-name [holiday] (-> holiday :content first :content first))
(defn holiday-month [holiday] (-> holiday :content second :content first :content first))
(defn holiday-day [holiday] (-> holiday :content second :content second :content first))

(for [elt (xml-seq xml-doc)
      :when (= :holiday (:tag elt))]
  [(holiday-name elt) (holiday-month elt) (holiday-day elt)]) =>
(["International Lefthanders Day" "August" "13"]
 ["Rover's birthday" "October" "12"]
 ["Groundhog Day" "February" "2"]
 ["Kamehameha Day" "June" "11"])
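
The positional accessors above would break if a <holiday> ever listed <date> before <name>. One possible alternative, sketched here with the hypothetical helpers child, text-of, and holiday->map, looks children up by tag instead of by position. The sample element is hand-written rather than parsed.

```clojure
;; Hypothetical helper: first child element of elt whose :tag matches.
(defn child [elt tag]
  (first (filter #(= tag (:tag %)) (:content elt))))

;; Hypothetical helper: the first text node of an element.
(defn text-of [elt]
  (first (filter string? (:content elt))))

(defn holiday->map [holiday]
  (let [date (child holiday :date)]
    {:type  (get-in holiday [:attrs :type])
     :name  (text-of (child holiday :name))
     :month (text-of (child date :month))
     :day   (text-of (child date :day))}))

;; Hand-written element with <date> deliberately placed first.
(def reordered-holiday
  {:tag :holiday, :attrs {:type "National"},
   :content [{:tag :date, :attrs nil,
              :content [{:tag :month, :attrs nil, :content ["February"]}
                        {:tag :day, :attrs nil, :content ["2"]}]}
             {:tag :name, :attrs nil, :content ["Groundhog Day"]}]})

(holiday->map reordered-holiday)
;; => {:type "National", :name "Groundhog Day", :month "February", :day "2"}
```

With a parsed document, the same function could be mapped over the :holiday elements selected from (xml-seq xml-doc).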

Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License