In the previous post I outlined Grascii, an ASCII notation for Gregg shorthand forms. Underlying Grascii's command-line tool is a dictionary of about 15,000 words. That's a lot more words than I expected for a dictionary that has to be more or less manually constructed, covering around 75% of a Gregg dictionary I have as a PDF. As much as it is, however, 15,000 words is only a sample of the English language. For comparison, the CMU Pronouncing Dictionary has nine times as many words as Grascii's dictionary, and my system's spelling list has 15 times as many. To supplement Grascii's dictionary, let's generate Grascii forms.
Gregg has almost 100 abbreviations for common prefixes and suffixes. While Gregg is written phonetically, these prefixes and suffixes are fairly standard in English spelling, so we can find them with regular expressions and replace them with their equivalent Grascii notation. Here's an example:
(require '[clojure.string :as str])
(defn grascii [word]
  (-> word
      str/lower-case
      (str/replace #"^center" "S)N^")
      (str/replace #"^deter" "D^")
      (str/replace #"^inter" "N^")
      (str/replace #"^extra" "ES)^")
      (str/replace #"^circu" "S(^")
      (str/replace #"^grand" "G^")
      (str/replace #"^magni" "M^")
      (str/replace #"^multi" "MU^")
      (str/replace #"^over" "O^")
      (str/replace #"^under" "U^")
      ,,,))
Since we're only handling prefixes and suffixes here, the input word is lower-cased less to avoid having a (?i) in every regex and more to make it plain which characters are in Grascii notation and which haven't been rewritten.
(grascii "interview")
"N^view"
Some Gregg prefixes should be used regardless of the actual vowel sound at their end. For those, let's match any vowel using a regex character class:
(defn grascii [word]
  (-> word
      str/lower-case
      ,,,
      (str/replace #"^centr[aeiouy]+" "S)N^")
      (str/replace #"^constr[aeiouy]+" "KS(^")
      (str/replace #"^contr[aeiouy]+" "K^")
      (str/replace #"^detr[aeiouy]+" "D^")
      (str/replace #"^distr[aeiouy]+" "DS(^")
      (str/replace #"^electr[aeiouy]+" "EL^")
      ,,,))
(grascii "contrary")
"K^ry"
These regexes are getting noisy. The character class dominates and makes it harder to see the important part: the prefix. Let's add a function that generates these regexes for us:
(defn prefix [s]
  (re-pattern
    (str "^"
         (str/replace (name s) #"[aeiouy]\*" "[aeiouy]+"))))
This feels a little meta: we're using a regex to turn a string into a regex.
(prefix 'contra*)
#"^contr[aeiouy]+"
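One way to check what a generated pattern accepts is to run it against a few words. The following is only a sketch: the function is repeated so the snippet runs on its own, and the sample words are my own picks rather than entries from the Grascii dictionary.

```clojure
(require '[clojure.string :as str])

;; Same definition as above, repeated so this snippet is self-contained.
(defn prefix [s]
  (re-pattern
    (str "^"
         (str/replace (name s) #"[aeiouy]\*" "[aeiouy]+"))))

;; Regex patterns don't compare equal with =, so compare their source strings.
(= "^contr[aeiouy]+" (str (prefix 'contra*))) ; => true

;; re-find returns the matched text, or nil when there is no match.
(re-find (prefix 'contra*) "contrary")     ; => "contra"
(re-find (prefix 'contra*) "control")      ; => "contro"
(re-find (prefix 'contra*) "uncontrolled") ; => nil, the pattern is anchored

;; The + quantifier takes a whole run of vowels, not just one.
(str/replace "contrail" (prefix 'contra*) "K^") ; => "K^l"
```

The last line is worth keeping in mind: because the class is followed by +, a word with two vowels after the prefix loses both of them.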
Using that function, we can see more easily the prefixes that are being replaced:
(defn grascii [word]
  (-> word
      str/lower-case
      ,,,
      (str/replace (prefix 'centra*) "S)N^")
      (str/replace (prefix 'constru*) "KS(^")
      (str/replace (prefix 'contra*) "K^")
      (str/replace (prefix 'detra*) "D^")
      (str/replace (prefix 'distra*) "DS(^")
      (str/replace (prefix 'electri*) "EL^")
      ,,,))
(grascii "contrary")
"K^ry"
Our prefix function shows us that regular expressions aren't an inherent part of the algorithm, just an implementation detail. Rather than writing regexes manually every time we want to add a replacement, let's generate them from a more Gregg-like notation. Here's a function that uses leading and trailing hyphens to know whether to make a regex for a suffix or a prefix:
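(defn gregg [s]
  (-> (name s)
      ;; a starred vowel stands for a run of one or more vowels
      (str/replace #"\*[aeiouy]|[aeiouy]\*" "[aeiouy]+")
      ;; a trailing hyphen marks a prefix: anchor at the start of the word
      (str/replace #"^(.+)-$" "^$1")
      ;; a leading hyphen marks a suffix: anchor at the end of the word
      (str/replace #"^-(.+)$" "$1\\$")
      re-pattern))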
(gregg 'center-)
#"^center"
(gregg 'centra*-)
#"^centr[aeiouy]+"
(gregg '-bility)
#"bility$"
(gregg '-*ality)
#"[aeiouy]+lity$"
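To see these patterns at work on whole words, here's a small sketch. The matches? helper is hypothetical, added only for this illustration, and the test words are my own.

```clojure
(require '[clojure.string :as str])

;; gregg as defined above, repeated so this snippet is self-contained.
(defn gregg [s]
  (-> (name s)
      (str/replace #"\*[aeiouy]|[aeiouy]\*" "[aeiouy]+")
      (str/replace #"^(.+)-$" "^$1")
      (str/replace #"^-(.+)$" "$1\\$")
      re-pattern))

;; Hypothetical helper: does the notation's pattern match the word?
(defn matches? [notation word]
  (boolean (re-find (gregg notation) word)))

(matches? 'centra*- "centrifuge") ; => true, any vowel after "centr"
(matches? 'centra*- "eccentric")  ; => false, prefixes are anchored at the start
(matches? '-*ality "reality")     ; => true, "eality" ends the word
(matches? '-*ality "guilty")      ; => false, the word doesn't end in "lity"
```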
Now we don't have to write a regex for each new prefix or suffix replacement.
The last step is to move these replacement pairs out to a data literal:
(def replacements
'[centra*- "S)N^"
center- "S)N^"
constru*- "KS(^"
contra*- "K^"
counter- "K^"
detra*- "D^"
deter- "D^"
distra*- "DS(^"
electri*- "EL^"
,,,
-bility "^B"
-*acity "^S)"
-*ality "^L"
-*amity "^MT"
-*inity "^NT"
-*arity "^R"
-*itic "^A"
-*itical "^A|"
-*istic "^ST"
,,,])
(defn grascii [word]
  (reduce (fn [word [prefix-or-suffix replacement]]
            (str/replace word (gregg prefix-or-suffix) replacement))
          (str/lower-case word)
          (partition 2 replacements)))
These replacement pairs are specified as a vector instead of a map so we can control the order in which replacements are tried. For example, if -bility were at the end of the list, -*ality would always match before it, because the pattern for -*ality also matches the "ility" at the end of "bility". Since -bility comes earlier, we get a more specific Grascii form for certain words:
(grascii "constructability")
"KS(^cta^B"
It's not a perfect Grascii generator, but as a native English-speaker I usually don't need help knowing the Gregg forms for sounds in the middle of a word. More often I need a reminder that there's a standard prefix or suffix that would make writing a form faster and easier.
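As a quick check of the ordering point made earlier, here's a sketch that runs the same word through two orderings of the suffix rules. The grascii-with function is a hypothetical variant of grascii that takes the replacement list as an argument. It exists only for this comparison, and the two-rule lists are cut down from the full vector.

```clojure
(require '[clojure.string :as str])

;; gregg as defined earlier in the post.
(defn gregg [s]
  (-> (name s)
      (str/replace #"\*[aeiouy]|[aeiouy]\*" "[aeiouy]+")
      (str/replace #"^(.+)-$" "^$1")
      (str/replace #"^-(.+)$" "$1\\$")
      re-pattern))

;; Hypothetical variant of grascii: the replacement list is a parameter,
;; so two orderings can be compared side by side.
(defn grascii-with [replacements word]
  (reduce (fn [word [notation replacement]]
            (str/replace word (gregg notation) replacement))
          (str/lower-case word)
          (partition 2 replacements)))

;; -bility first: the more specific suffix gets the first chance.
(grascii-with '[-bility "^B" -*ality "^L"] "constructability")
;; => "constructa^B"

;; -*ality first: its pattern consumes the "ility" inside "bility".
(grascii-with '[-*ality "^L" -bility "^B"] "constructability")
;; => "constructab^L"
```

In the second ordering the -bility rule never fires, since by the time it's tried the word no longer ends in "bility".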