exupero's blog

Mapping regex matches

When I need to transform matches of a regular expression within a string, I can use clojure.string/replace with a replacement string that refers to capture groups:

(clojure.string/replace "This is _simple_ markup."
                        #"_([^_]+)_"
                        "<strong>$1</strong>")
"This is <strong>simple</strong> markup."

or, for more sophisticated transformations, clojure.string/replace accepts a function to generate replacement strings:

(clojure.string/replace "This is _simple_ markup."
                        #"_([^_]+)_"
                        (fn [[_ s]]
                          (str "<strong>"
                               (clojure.string/upper-case s)
                               "</strong>")))
"This is <strong>SIMPLE</strong> markup."

But occasionally I need to break matches out of a string altogether. For example, suppose that instead of converting a string of markup to a string of HTML, I wanted to convert the markup to Hiccup. I haven't found a Clojure function that can do that, so here's a small function I wrote:

(defn re-map [re f s]
  (remove #{::padding}
    (interleave
      (clojure.string/split s re)
      (concat (map f (re-seq re s)) [::padding]))))

Use it like this:

(re-map #"_([^_]+)_"
        (fn [[_ s]]
          [:strong s])
        "This is _simple_ markup.")
("This is " [:strong "simple"] " markup.")

While re-map is a small function, it deserves some explanation. At its core, all it does is split the string by the regex and interleave the resulting pieces with a sequence of transformed regex matches. The unusual bit is the appended keyword ::padding. It's there because interleave stops when it reaches the end of the shortest sequence, but the sequence of split strings can be one element longer than the sequence of regex matches. To handle that, re-map pads the end of the match sequence, then, after interleaving, removes the extraneous value.
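
To see why the padding matters, here's a minimal sketch that calls interleave directly on hand-written stand-ins for the split pieces and the transformed matches. The names pieces and matches are only for illustration and aren't part of re-map:

```clojure
;; Stand-ins for what clojure.string/split and (map f (re-seq ...)) produce
;; for "This is _simple_ markup." (illustrative names, not part of re-map).
(def pieces ["This is " " markup."])
(def matches [[:strong "simple"]])

;; interleave stops as soon as the shorter sequence runs out,
;; so the trailing " markup." piece is lost.
(interleave pieces matches)
;; => ("This is " [:strong "simple"])

;; Appending a sentinel to the shorter sequence lets every piece through;
;; the sentinel is then filtered out of the result.
(remove #{::padding}
        (interleave pieces (concat matches [::padding])))
;; => ("This is " [:strong "simple"] " markup.")
```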

This works even when the regex matches at the beginning of the string, thanks to clojure.string/split including an empty string at the start of its output:

(clojure.string/split "_This_ is simple markup." #"_([^_]+)_")
["" " is simple markup."]
(re-map #"_([^_]+)_"
        (fn [[_ s]]
          [:strong s])
        "_This_ is simple markup.")
("" [:strong "This"] " is simple markup.")

If you don't want any empty strings in the final result, you can remove them along with the ::padding keyword:

(defn re-map [re f s]
  (remove #{"" ::padding}
    (interleave
      (clojure.string/split s re)
      (concat (map f (re-seq re s)) [::padding]))))
(re-map #"_([^_]+)_"
        (fn [[_ s]]
          [:strong s])
        "_This_ is simple markup.")
([:strong "This"] " is simple markup.")
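
One edge case worth checking: per its docstring, clojure.string/split doesn't return trailing empty strings unless it's given a limit of -1. When two matches sit back to back at the end of the string, or the whole string is a single match, the split produces fewer pieces than there are matches, so interleave runs out early and drops a match. A possible fix, sketched here under the hypothetical name re-map-all, is to pass -1 as the limit:

```clojure
(require '[clojure.string :as string])

;; Hypothetical variant of re-map. The only change is the -1 limit,
;; which tells split to keep trailing empty strings, so there is always
;; one more piece than there are matches.
(defn re-map-all [re f s]
  (remove #{::padding}
    (interleave
      (string/split s re -1)
      (concat (map f (re-seq re s)) [::padding]))))

(re-map-all #"_([^_]+)_"
            (fn [[_ s]]
              [:strong s])
            "This is _simple__markup_")
;; => ("This is " [:strong "simple"] "" [:strong "markup"] "")
```

With the definitions above, the same call returns ("This is " [:strong "simple"]), because split yields only ["This is "]. Empty strings can still be filtered out by adding "" to the set passed to remove.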

If you have an alternative technique for handling this situation, feel free to email me.