In an earlier post I used this simple online dictionary to get words' parts of speech. The script isn't complex, but I'll document a few things about it here.
I usually develop scripts interactively via an nREPL. With the quick-starting Babashka an nREPL isn't always necessary, but in this case I needed to download 28 HTML pages, several of which were a megabyte or more, and working interactively let me save the downloaded data to a var and experiment with it without re-fetching every time I ran the script. Of course, I could have downloaded the data to a local file and then slurped it, but that's two scripts and feels over-engineered for a small task like this, especially when an interactive nREPL already avoids the problem.
To start a Babashka nREPL server, run bb nrepl-server; it prints the address it's listening on (port 1667 by default) so your editor can connect.
The HTML on the dictionary site isn't strict XML, so we can't use clojure.data.xml, which errors on the doctype declaration. Instead we can use Bootleg:
(require '[babashka.pods :as pods]
         '[babashka.curl :as curl]
         '[clojure.string :as str])

(pods/load-pod 'retrogradeorbit/bootleg "0.1.9")

(require '[pod.retrogradeorbit.bootleg.utils :refer [convert-to]]
         '[pod.retrogradeorbit.hickory.select :as s])
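Hickory represents each HTML element as a plain map with :type, :tag, :attrs, and :content keys. The entry below is hand-written for illustration, but it follows the same <p><b>word</b> (<i>pos.</i>) shape as the real pages; once converted, an entry looks something like this:

(def entry
  {:type :element
   :tag :p
   :attrs nil
   :content [{:type :element, :tag :b, :attrs nil, :content ["Apple"]}
             " ("
             {:type :element, :tag :i, :attrs nil, :content ["n."]}
             ") The fleshy pome or fruit of..."]})

This nested map structure is what the selectors below will walk over.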
The dictionary's index lists the separate pages for each letter, plus a page named "wb1913_new.html". While we could crawl each page based on a list of letters, it's not too difficult to collect the appropriate anchor tags' href attributes and request them:
(def base-url "https://www.mso.anu.edu.au/~ralph/OPTED/v003/")

(def index
  (-> (curl/get base-url)
      :body
      str/trim
      (convert-to :hickory)))
(def pages
  (->> index
       (s/select (s/tag :a))
       (sequence
        (comp
         (map #(-> % :attrs :href))
         (filter #(re-find #"^wb1913_.*\.html$" %))
         (map #(str base-url %))))
       (pmap #(-> % curl/get :body str/trim))
       (into [])))
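Since this is all happening in the REPL, it's easy to sanity-check the download before writing any parsing code; the count should match the 28 pages mentioned earlier:

(count pages)
;; => 28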
The pmap in the pipeline above lets us download the pages in parallel rather than one at a time.
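To get a feel for the difference, here's a sketch with a sleep standing in for the HTTP request (timings are approximate; pmap's parallelism is roughly the number of available processors plus two):

(defn slow-op [x] (Thread/sleep 1000) x)

(time (doall (map slow-op (range 8))))   ; ~8000 ms, one at a time
(time (doall (pmap slow-op (range 8))))  ; ~1000 ms, in parallel

For a couple dozen I/O-bound downloads this is plenty; for finer control over the thread pool you'd reach for an executor instead.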
To parse each page, let's convert the HTML to Hickory data structures. Converting to :hickory only returns one HTML tree, so we need to use :hickory-seq to get all of them, which we can then wrap under a single root node:
(defn parse-page [page]
  (-> page
      (convert-to :hickory-seq)
      (as-> $ {:type :element :tag :div :content $})
      (->> (s/select (s/tag :p)))
      (->> (keep (fn [el]
                   (let [word (->> el
                                   (s/select (s/tag :b))
                                   first :content first)
                         pos  (->> el
                                   (s/select (s/tag :i))
                                   first :content first)]
                     (when pos
                       (str word "," pos))))))))
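As a quick check, calling parse-page on a small hand-written fragment in the same shape as the real pages should yield one comma-separated line per entry:

(parse-page "<p><b>Apple</b> (<i>n.</i>) The fleshy pome...</p>
             <p><b>Abandon</b> (<i>v. t.</i>) To cast or drive out...</p>")
;; => ("Apple,n." "Abandon,v. t.")

Paragraphs without an <i> tag come back as nil and are dropped, which conveniently skips headers and other non-entry markup.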
Using Hickory selectors, we get all the <p> tags, and for each <p> tag the first <b> tag and the first <i> tag, which contain the word and its part of speech, respectively.
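Selectors work on any hickory node, so they're easy to try in the REPL. Against the hand-written entry map from earlier:

(s/select (s/tag :b) entry)
;; => [{:type :element, :tag :b, :attrs nil, :content ["Apple"]}]

s/select returns a vector of every match, hence the first calls in parse-page before digging into :content.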
Note the keep function above. I don't use it often, but it's more or less a combination of map and filter: it applies a function to every element and drops the nil results, and, as with some, you get back what the function returned rather than the original elements.
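For example (strictly speaking, the equivalence is with remove rather than filter, and note that only nil is dropped; a false result survives):

(keep #(when (even? %) (* % %)) (range 6))
;; => (0 4 16)

;; roughly equivalent to:
(->> (range 6)
     (map #(when (even? %) (* % %)))
     (remove nil?))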
I saved the data in a CSV format:
(->> pages
     (mapcat parse-page)
     (str/join "\n")
     (spit some-path))
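Each line is a word,pos pair; with Webster's 1913 abbreviations the file ends up looking something like this (illustrative lines, not actual output):

Apple,n.
Abandon,v. t.
Abandonment,n.

This is only valid CSV on the assumption that neither field ever contains a comma, which seems safe for single dictionary words and short part-of-speech tags.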
I could have used pr-str and written EDN, but I chose a line-based format to make editing out off-color words a little easier.