In the previous post I used a Janet parsing expression grammar (PEG) to generate a concrete syntax tree made of nested tuples, similar to Instaparse's output of nested Clojure vectors. Since I wanted to pipe the CST to a Babashka script, I needed to serialize the Janet data structure as EDN. The encoding function I ended up with is far from a complete EDN serializer, since my CST only contained arrays, tuples, strings, and numbers. As a Janet novice, I may well have missed a better or more idiomatic approach, so if you have tips, please email me. In the meantime, here's what I did.
To serialize a number, we only have to convert it to a string:
    (defn encode-number [cst]
      (string cst))
    (print (encode-number 5))
    5
Encoding strings is more work, since we need to escape backslashes, double quotes, newlines, and tabs (and possibly more things that aren't covered here):
    (defn encode-string [cst]
      (let [s (->> cst
                   (string/replace-all "\\" "\\\\")
                   (string/replace-all "\"" "\\\"")
                   (string/replace-all "\n" "\\n")
                   (string/replace-all "\t" "\\t"))]
        (buffer/push-string @"\"" s "\"")))
    (print (encode-string "Does this \"escape\"...\tproperly?\n"))
    "Does this \"escape\"...\tproperly?\n"
Note that encode-string pushes the string contents into a byte buffer, denoted by @"". Initially I built the output by concatenating strings, but strings in Janet are immutable, so every concatenation required copying. On a large CST that ended up being orders of magnitude slower than using mutable buffers, though less because of string encoding and more because of encoding arrays, which had to reduce over large sequences:
    (var encode nil)
    (defn encode-array [cst]
      (buffer/push-string @"["
                          (reduce buffer/push-string @""
                                  (interpose " " (map encode cst)))
                          "]"))
The forward declaration (var encode nil) is necessary because encode-array recursively encodes its elements, which could themselves be numbers, strings, arrays, or tuples. We'll define encode as the main entrypoint below.
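The same forward-declaration pattern can be seen in isolation in the following minimal sketch. The names count-leaves and count-array are hypothetical and not part of the encoder; the point is only that a function referring to a var picks up whatever value the var holds at the time of the call.

```janet
# Forward-declare the name so count-array can refer to it
# before the real function exists.
(var count-leaves nil)

# Refers to count-leaves, which is still nil at this point.
(defn count-array [xs]
  (reduce + 0 (map count-leaves xs)))

# Now give the var its real value.
(set count-leaves
     (fn [x]
       (if (array? x)
         (count-array x)
         1)))

(print (count-leaves @[1 @[2 3] @[@[4]]]))  # 4
```

By the time count-array is actually called, count-leaves has been set, so the mutual recursion resolves as expected.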
Encoding a tuple looks very similar to encoding an array, except instead of simply concatenating all the encoded elements between two square brackets, we include as the first element a keyword that identifies the type of the CST node:
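    # encode is already forward-declared above. Repeating (var encode nil)
    # in the same file would likely create a fresh binding, leaving
    # encode-array pointing at the old one, which stays nil.
    (defn encode-tuple [cst]
      (let [[head] cst
            tail (drop 1 cst)]
        (buffer/push-string @"[:" head " "
                            (reduce buffer/push-string @""
                                    (interpose " " (map encode tail)))
                            "]")))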
Now we can define the central entrypoint:
    (set encode
         (fn [cst]
           (cond
             (number? cst) (encode-number cst)
             (string? cst) (encode-string cst)
             (array? cst) (encode-array cst)
             (tuple? cst) (encode-tuple cst))))
    (print (encode @['(:unparsed "An")
                     '(:unparsed " ")
                     '(:annotation (:name "Annotation"))]))
    [[:unparsed "An"] [:unparsed " "] [:annotation [:name "Annotation"]]]
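One detail worth spelling out is where the colon in [:unparsed ...] comes from. As a small sketch (the node value below is made up for illustration), pushing a Janet keyword into a buffer appends only its name, so encode-tuple has to supply the colon itself through the "[:" prefix:

```janet
# A hypothetical CST node: a keyword head followed by its children.
(def node [:name "Annotation"])

(let [[head] node
      tail (drop 1 node)]
  # The keyword :name contributes the bytes "name", without the colon.
  (print (buffer/push-string @"[:" head "]"))  # [:name]
  # Everything after the head is what gets encoded recursively.
  (print (length tail)))                       # 1
```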
Beyond figuring out the performance of string concatenation and the API for buffers, the implementation felt very straightforward. Much of that is due, naturally, to Janet's resemblance to Clojure, including macros like defn and let and functions like reduce, interpose, and drop.
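To make the concatenation point concrete, here is a rough sketch of the two joining strategies side by side. The function names join-with-strings and join-with-buffer are hypothetical, and this is an illustration rather than a benchmark:

```janet
# Immutable strings: each step builds a brand-new string from the
# accumulated result plus the next part.
(defn join-with-strings [parts]
  (reduce string "" parts))

# Mutable buffer: each step appends to the same buffer in place.
(defn join-with-buffer [parts]
  (reduce buffer/push-string @"" parts))

(def parts (interpose " " ["[:a 1]" "[:b 2]" "[:c 3]"]))

(print (join-with-strings parts))  # [:a 1] [:b 2] [:c 3]
(print (join-with-buffer parts))   # [:a 1] [:b 2] [:c 3]
```

Both produce the same text. The difference is that the string version re-copies everything accumulated so far at every step, while the buffer version only appends the new part.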
With a full-featured PEG and this minimal encoder, parsing the Java and Kotlin files I needed to manipulate was suitably fast. The actual refactoring of the CST, however, I implemented in Clojure using Babashka, in order to use clojure.zip, but I am keeping an eye out for more opportunities to use Janet.