Clojure

Scraping an HTML dictionary with Babashka and Bootleg

In an earlier post I used this simple online dictionary to get words' parts of speech. The script isn't complex, but I'll document a few things about it here.

I usually develop scripts interactively via an nREPL. With quick-starting Babashka, an nREPL isn't always necessary, but in this case I needed to download 28 HTML pages, several of which were a megabyte or more. Working interactively let me save the downloaded data to a var and work with it without re-fetching every time I ran the script. I could have downloaded the data to a local file and then slurped it, but that's two scripts, which feels over-engineered for a small task like this when an interactive nREPL already avoids the problem.

To start a Babashka nREPL, use bb nrepl-server, which listens on port 1667 by default.
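As a rough sketch of the "save it to a var" idea, defonce is one way to keep a download around while re-evaluating the rest of a script in the REPL. The fake-fetch function and the example.com URL below are made up for illustration and stand in for a real curl/get call:

```clojure
;; Hypothetical stand-in for a slow download; it counts how often it runs.
(def fetch-count (atom 0))

(defn fake-fetch [url]
  (swap! fetch-count inc)
  (str "<html>" url "</html>"))

;; defonce evaluates its body only when the var has no value yet, so
;; re-evaluating the whole buffer in the REPL does not fetch again.
(defonce cached-page (fake-fetch "https://example.com/a.html"))
(defonce cached-page (fake-fetch "https://example.com/a.html"))

(println @fetch-count) ; prints 1
```

A plain def would re-run the fetch on every evaluation, which is exactly what I wanted to avoid with 28 large pages.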

The HTML on the dictionary site isn't strict XML, so we can't use clojure.data.xml, which errors on the doctype declaration. Instead we can use Bootleg:

(require '[babashka.pods :as pods]
         '[babashka.curl :as curl]
         '[clojure.string :as str])

(pods/load-pod 'retrogradeorbit/bootleg "0.1.9")

(require '[pod.retrogradeorbit.bootleg.utils :refer [convert-to]]
         '[pod.retrogradeorbit.hickory.select :as s])

The dictionary's index lists the separate pages for each letter, plus a page named "wb1913_new.html". While we could crawl each page based on a list of letters, it's not too difficult to collect the appropriate anchor tags' href attributes and request them:

(def base-url "https://www.mso.anu.edu.au/~ralph/OPTED/v003/")

(def index
  (-> (curl/get base-url)
      :body
      str/trim
      (convert-to :hickory)))

(def pages
  (->> index
       (s/select (s/tag :a))
       (sequence
         (comp
           (map #(-> % :attrs :href))
           (filter #(re-find #"^wb1913_.*\.html$" %))
           (map #(str base-url %))))
       (pmap #(-> % curl/get :body str/trim))
       (into [])))

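pmap lets us download the pages in parallel rather than one at a time, while still returning the results in the original order. Because pmap is lazy, the final (into []) is what forces every request to complete.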
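To see what the transducer and pmap steps do without touching the network, here is a minimal sketch that runs them over a hand-written list of hrefs. The sample hrefs, the example.com base URL, and the "body of" placeholder are all assumptions for illustration, not data from the dictionary site:

```clojure
;; Hypothetical hrefs, shaped like what (-> el :attrs :href) might return.
(def sample-hrefs
  ["wb1913_a.html" "index.html" "wb1913_b.html"
   "../credits.html" "wb1913_new.html"])

(def sample-base "https://example.com/dict/")

;; Same filter and map steps as in the script, composed as a transducer.
(def page-urls
  (into []
        (comp
          (filter #(re-find #"^wb1913_.*\.html$" %))
          (map #(str sample-base %)))
        sample-hrefs))

;; pmap runs the function on several items at once but preserves input
;; order; the string below is a placeholder for a real download.
(def bodies
  (into [] (pmap #(str "body of " %) page-urls)))

(println page-urls)
(println (count bodies))
```

Only the three wb1913_ links survive the filter, and bodies lines up index-for-index with page-urls.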

To parse each page, let's convert the HTML to Hickory data structures. Converting to :hickory returns only one HTML tree, so we need :hickory-seq to get all of them. We can then wrap the resulting sequence under a single root node so that one selector call searches everything:

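(defn parse-page [page]
  (-> page
      (convert-to :hickory-seq)
      ;; wrap the seq of trees under one synthetic <div> root
      (as-> $ {:type :element :tag :div :content $})
      (->> (s/select (s/tag :p)))
      ;; keep drops the entries for which the fn returns nil
      (->> (keep (fn [el]
                   (let [word (->> el
                                   (s/select (s/tag :b))
                                   first :content first)
                         pos (->> el
                                  (s/select (s/tag :i))
                                  first :content first)]
                     ;; skip <p> tags that have no part of speech
                     (when pos
                       (str word "," pos))))))))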

Using Hickory selectors, we get all the <p> tags, and for each <p> tag the first <b> tag and the first <i> tag, which respectively contain the word and its part of speech.
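For a feel of the data involved, here is a sketch of one entry as a Hickory-style map, with a plain tree-seq walk standing in for the Hickory selectors so it runs without the pod. The entry's shape and text are assumptions for illustration, and first-text-of is a hypothetical helper, not part of Hickory or Bootleg:

```clojure
;; A made-up entry, roughly <p><b>Word</b> (<i>n.</i>) A definition.</p>
(def sample-entry
  {:type :element :tag :p :attrs nil
   :content [{:type :element :tag :b :attrs nil :content ["Word"]}
             " ("
             {:type :element :tag :i :attrs nil :content ["n."]}
             ") A definition."]})

;; Stand-in for (->> el (s/select (s/tag tag)) first :content first):
;; walk the tree in document order and take the first matching element.
(defn first-text-of [tag node]
  (->> (tree-seq map? :content node)
       (filter #(= tag (:tag %)))
       first :content first))

(defn entry->line [el]
  (let [word (first-text-of :b el)
        pos  (first-text-of :i el)]
    (when pos
      (str word "," pos))))

(println (entry->line sample-entry)) ; prints Word,n.
```

An entry with no <i> element yields nil here, which is what lets keep discard it.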

Note the keep function above. I don't use it often, but it's more or less map combined with (filter some?): it applies a function to each element and drops the results that are nil.
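A small sketch of how keep behaves on plain numbers, using only clojure.core. The var names are just for illustration:

```clojure
;; keep applies the function and drops nil results.
(def squares-of-evens
  (keep #(when (even? %) (* % %)) [1 2 3 4]))

;; The same thing spelled out with map and filter.
(def spelled-out
  (filter some? (map #(when (even? %) (* % %)) [1 2 3 4])))

;; Only nil is dropped; false is a non-nil result and stays.
(def kept-booleans
  (keep even? [1 2 3]))

(println squares-of-evens) ; prints (4 16)
(println spelled-out)      ; prints (4 16)
(println kept-booleans)    ; prints (false true false)
```

The last case is the one place keep differs from a plain filter on the mapped values, which would discard false as well.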

I saved the data in a CSV format:

;; some-path is a placeholder for the output file path
(->> pages
     (mapcat parse-page)
     (str/join "\n")
     (spit some-path))

I could have used pr-str and written EDN, but I chose a line-based format to make editing out off-color words a little easier.
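For comparison, here is a minimal sketch of both options on a couple of made-up rows (the rows are illustrative, not taken from the dictionary). It also shows one way to read a line back, splitting on the first comma only, since a part of speech could in principle contain a comma itself:

```clojure
(require '[clojure.edn :as edn]
         '[clojure.string :as str])

;; Hypothetical rows in the same word,pos shape the script produces.
(def rows ["apple,n." "run,v. i."])

;; Line-based: one entry per line, easy to prune in a text editor.
(def as-lines (str/join "\n" rows))

;; EDN: a single printed vector that reads back as Clojure data.
(def as-edn (pr-str rows))

(def from-lines (str/split-lines as-lines))
(def from-edn (edn/read-string as-edn))

;; Limit the split to 2 parts so only the first comma separates fields.
(def parsed (mapv #(str/split % #"," 2) from-lines))

(println parsed) ; prints [[apple n.] [run v. i.]]
```

Both forms round-trip to the same vector; the line-based file is simply friendlier to delete lines from by hand.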

December 2022