exupero's blog

Phonetic major system mnemonics

In the previous post we used a version of the major system to encode digit sequences as letters, then searched for phrases that encoded those digits. To make things easier, I used a letter-based encoding rather than the major system's actual phonetic encoding. In my experience the phonetic encoding takes a little getting used to, but it ultimately lets you convert words to digits by sounding them out rather than writing them down, and for me it's usually more convenient to decode by mouth than to visualize how a word is spelled. For example, in my letter-based encoding the word "knew" represented the digits 7 and 2, which can be hard to remember since the word has only one consonant sound. In the phonetic system, the silent k is ignored and only the n encodes a digit, so "knew" is just 2.

Naturally, computers know nothing of pronunciation, so to use the phonetic major system we need a pronunciation dictionary. I'll use the CMU Pronouncing Dictionary from Carnegie Mellon University. We only need lines that start with a letter (no comments and no words for symbols):

(sequence
  (comp
    (filter #(re-find #"^[A-Z]" %))
    (take 10))
  cmu-dictionary)
("A  AH0"
 "A(1)  EY1"
 "A'S  EY1 Z"
 "A.  EY1"
 "A.'S  EY1 Z"
 "A.D.  EY2 D IY1"
 "A.M.  EY2 EH1 M"
 "A.S  EY1 Z"
 "A42128  EY1 F AO1 R T UW1 W AH1 N T UW1 EY1 T"
 "AA  EY2 EY1")

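The snippets in this post assume cmu-dictionary is a sequence of the dictionary file's lines, but the loading step isn't shown. Here's a minimal sketch of one way to get there, assuming the file has already been downloaded. The function name load-lines, the path in the commented-out def, and the default encoding are all assumptions for illustration, not something taken from the original setup.

```clojure
(require '[clojure.java.io :as io])

(defn load-lines
  "Reads all lines of the file at `path` into a vector.
  The default encoding is an assumption about the downloaded file;
  pass a different one if yours differs."
  ([path] (load-lines path "ISO-8859-1"))
  ([path encoding]
   (with-open [rdr (io/reader path :encoding encoding)]
     ;; Realize every line before with-open closes the reader.
     (vec (line-seq rdr)))))

;; Hypothetical local path; adjust to wherever the file was saved.
;; (def cmu-dictionary (load-lines "cmudict-0.7b"))
```

Reading the lines into a vector up front avoids the usual trap of a lazy line-seq outliving its reader.
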
We can start parsing by splitting on whitespace:

(sequence
  (comp
    (filter #(re-find #"^[A-Z]" %))
    (map (fn [line]
           (str/split line #"\s+")))
    (take 10))
  cmu-dictionary)
(["A" "AH0"]
 ["A(1)" "EY1"]
 ["A'S" "EY1" "Z"]
 ["A." "EY1"]
 ["A.'S" "EY1" "Z"]
 ["A.D." "EY2" "D" "IY1"]
 ["A.M." "EY2" "EH1" "M"]
 ["A.S" "EY1" "Z"]
 ["A42128"
  "EY1"
  "F"
  "AO1"
  "R"
  "T"
  "UW1"
  "W"
  "AH1"
  "N"
  "T"
  "UW1"
  "EY1"
  "T"]
 ["AA" "EY2" "EY1"])

Let's map phonemes to digits as follows:

(def phoneme->digits
  {"B" 9
   "CH" 6
   "D" 1
   "DH" 1
   "ER" 4
   "F" 8
   "G" 7
   "JH" 6
   "K" 7
   "L" 5
   "M" 3
   "N" 2
   "NG" 27
   "P" 9
   "R" 4
   "S" 0
   "SH" 6
   "T" 1
   "TH" 1
   "V" 8
   "Z" 0
   "ZH" 0})

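When parsing pronunciations into digits, we can strip the trailing numbers from phonemes. Those numbers mark vowel stress and don't affect the encoding. Phonemes that aren't in the map (the vowels, plus sounds like W and Y) look up as nil and are dropped: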

(sequence
  (comp
    (filter #(re-find #"^[A-Z]" %))
    (map (fn [line]
           (let [[word & phonemes] (str/split line #"\s+")]
             [word
              (transduce
                (comp
                  (map #(str/replace % #"[0-9]+" ""))
                  (map phoneme->digits)
                  (remove nil?))
                str phonemes)])))
    (take 10))
  cmu-dictionary)
(["A" ""]
 ["A(1)" ""]
 ["A'S" "0"]
 ["A." ""]
 ["A.'S" "0"]
 ["A.D." "1"]
 ["A.M." "3"]
 ["A.S" "0"]
 ["A42128" "841211"]
 ["AA" ""])
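
To try the per-line step on individual entries, it can be pulled out into a named function. This is only a sketch: parse-line is a hypothetical name, the map is repeated from above so the block runs on its own, and the two sample lines are written the way I'd expect them to appear in the dictionary rather than copied from it.

```clojure
(require '[clojure.string :as str])

;; Same mapping as above, repeated so this block is self-contained.
(def phoneme->digits
  {"B" 9, "CH" 6, "D" 1, "DH" 1, "ER" 4, "F" 8, "G" 7, "JH" 6,
   "K" 7, "L" 5, "M" 3, "N" 2, "NG" 27, "P" 9, "R" 4, "S" 0,
   "SH" 6, "T" 1, "TH" 1, "V" 8, "Z" 0, "ZH" 0})

(defn parse-line
  "Turns a dictionary line into a [word digit-string] pair."
  [line]
  (let [[word & phonemes] (str/split line #"\s+")]
    [word
     (transduce
       (comp
         (map #(str/replace % #"[0-9]+" "")) ; strip stress: "UW1" -> "UW"
         (map phoneme->digits)               ; unmapped phonemes become nil
         (remove nil?))
       str phonemes)]))

;; Assumed sample lines, for illustration only.
(parse-line "KNEW  N UW1")
;; => ["KNEW" "2"]
(parse-line "SINGING  S IH1 NG IH0 NG")
;; => ["SINGING" "02727"]
```

The first call shows the silent k never reaching the encoder, and the second shows how the two-digit value for NG is simply concatenated into the string.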

Here's the full dictionary:

(def dictionary
  (into {}
    (comp
      (filter #(re-find #"^[A-Z]" %))
      (map (fn [line]
             (let [[word & phonemes] (str/split line #"\s+")]
               [word
                (transduce
                  (comp
                    (map #(str/replace % #"[0-9]+" ""))
                    (map phoneme->digits)
                    (remove nil?))
                  str phonemes)]))))
    cmu-dictionary))

which we can use to write an encoding function:

(defn major-phoneme-encode [text]
  (let [digits (map (comp dictionary str/upper-case)
                    (words text))]
    (when-not (some nil? digits)
      (apply str digits))))
(major-phoneme-encode "I wish I knew")
"62"

Checking for nil digits ensures that all the words were found in the dictionary. If text uses some word that's not in the dictionary, the encoder returns nil, since it can't know the pronunciation (and therefore the encoding) of the missing word.
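
The difference between an empty encoding and a nil one is easy to miss, so here's a small self-contained sketch of it. The words function below is a hypothetical stand-in for the helper from the previous post (its real definition isn't shown here), and the three-entry dictionary is a toy substitute for the full one.

```clojure
(require '[clojure.string :as str])

;; Hypothetical stand-in for the `words` helper from the previous post:
;; returns runs of letters and apostrophes.
(defn words [text]
  (re-seq #"[A-Za-z']+" text))

;; Toy dictionary holding only the words used below.
(def dictionary {"I" "", "WISH" "6", "KNEW" "2"})

(defn major-phoneme-encode [text]
  (let [digits (map (comp dictionary str/upper-case)
                    (words text))]
    (when-not (some nil? digits)
      (apply str digits))))

(major-phoneme-encode "I wish I knew") ; => "62"
(major-phoneme-encode "I")             ; => "" (known word, no digits)
(major-phoneme-encode "I wish I shn")  ; => nil ("SHN" is not a key)
```

An empty string means "this word is known and contributes no digits", while nil means "this word can't be encoded at all", and a search has to treat those two cases differently.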

Let's update find-words to handle nil encodings:

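(defn find-words [encoder digits words]
  (loop [words words
         ds digits   ; digits still to be matched
         match []]   ; words matched so far
    (cond
      (str/blank? ds) match   ; every digit has been matched
      (empty? words) nil      ; ran out of text without a full match
      :else (let [[word & words] words
                  c (encoder word)]
              (cond
                ;; unknown word: discard any partial match and start over
                (nil? c)
                , (recur words digits [])
                ;; this word's digits continue the match
                (str/starts-with? ds c)
                , (recur words
                         (subs ds (count c))
                         (conj match word))
                ;; mismatch mid-match: retry from the second matched word
                (seq match)
                , (recur (concat (drop 1 match) [word] words)
                         digits
                         [])
                ;; mismatch with nothing matched yet: move on
                :else
                , (recur words digits []))))))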

Here's an example call showing that find-words skips unknown words like "shn" and goes on to match words that are in the dictionary:

(find-words major-phoneme-encode "62" (words "shn I wish I knew"))
["I" "wish" "I" "knew"]
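
The call above doesn't exercise the branch that backs up after a partial match fails, so here's a sketch that does. It repeats find-words so the block runs on its own, and uses a toy encoder (a hypothetical three-word map, not the real dictionary) in which any other word encodes as nil.

```clojure
(require '[clojure.string :as str])

;; find-words as defined above, repeated so this block is self-contained.
(defn find-words [encoder digits words]
  (loop [words words
         ds digits
         match []]
    (cond
      (str/blank? ds) match
      (empty? words) nil
      :else (let [[word & words] words
                  c (encoder word)]
              (cond
                (nil? c)
                , (recur words digits [])
                (str/starts-with? ds c)
                , (recur words
                         (subs ds (count c))
                         (conj match word))
                (seq match)
                , (recur (concat (drop 1 match) [word] words)
                         digits
                         [])
                :else
                , (recur words digits []))))))

;; Toy encoder: a map used as a function, so unknown words return nil.
(def toy-encoder {"one" "1", "two" "2", "three" "3"})

;; The first "one" matches, the second doesn't continue "12", so the
;; search restarts from the second "one" and then succeeds.
(find-words toy-encoder "12" ["one" "one" "two"])
;; => ["one" "two"]

;; An unknown word between "one" and "two" breaks the match entirely.
(find-words toy-encoder "12" ["one" "zzz" "two"])
;; => nil
```

The second call is the behavior that matters for the phonetic search: an unknown word is not skipped within a phrase, it ends the candidate phrase.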

Here are the digit sequences we've been trying:

| Digits | Text | Verse |
|--------|------|-------|
| 1414 | Thus the heavens and the earth were finished, and all the host of them. | Genesis 2:1 |
| 1732 | And he said unto him, Take me an heifer of three years old, and a she goat of three years old, and a ram of three years old, and a turtledove, and a young pigeon. | Genesis 15:9 |
| 2236 | And the Gentiles shall see thy righteousness, and all kings thy glory: and thou shalt be called by a new name, which the mouth of the Lord shall name. | Isaiah 62:2 |
| 2646 | At the appointment of Aaron and his sons shall be all the service of the sons of the Gershonites, in all their burdens, and in all their service: and ye shall appoint unto them in charge all their burdens. | Numbers 4:27 |
| 3183 | And Jephthah said unto the elders of Gilead, Did not ye hate me, and expel me out of my father’s house? and why are ye come unto me now when ye are in distress? | Judges 11:7 |

Most of these match what the letter-based search found, but the phonetic encoding picks up the soft g in "charge" and so finds a match in an earlier verse than before.
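
As a rough check on the soft g, here's a sketch that encodes "in charge" by hand. The pronunciations are my assumption of how those two words are transcribed, the map is only the subset of the mapping needed here, and encode-phonemes is a hypothetical helper name. Whether this is the exact phrase the search matched is also an assumption.

```clojure
(require '[clojure.string :as str])

;; Subset of the phoneme map above, limited to the sounds used here.
(def phoneme->digits {"N" 2, "CH" 6, "R" 4, "JH" 6})

(defn encode-phonemes
  "Encodes a sequence of phonemes as a digit string."
  [phonemes]
  (transduce
    (comp
      (map #(str/replace % #"[0-9]+" "")) ; strip stress markers
      (map phoneme->digits)
      (remove nil?))                      ; vowels have no digit
    str phonemes))

;; Assumed pronunciations, for illustration only.
(def in-phonemes ["IH0" "N"])
(def charge-phonemes ["CH" "AA1" "R" "JH"])

(str (encode-phonemes in-phonemes)
     (encode-phonemes charge-phonemes))
;; => "2646"
```

The final JH is the soft g: it's a sound rather than a letter pattern, which is why the phonetic encoding assigns it a 6.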

In the next post we'll try constructing our own mnemonic phrases rather than trying to find them in an existing text.