Dienstag, 29. Mai 2012

Porting PAIP to clojure chapter 2

Porting PAIP to clojure chapter 2

Chapter 2 - Generating random Sentences

1 Porting PAIP to clojure - chapter 2 A Simple Lisp Program 

1.1 Introduction

This is the first post of my series porting PAIP to clojure. It is possible to follow my post even if you don't PAIP but I strongly recommend that you get a copy of it. This chapter contains a program to generate random sentences using a grammar for a subset of english. Norvig presents two versions of the program, a straightforward solution using functions to generate random sentences from a grammar Then the weaknesses of this approach are shown and a better solution is developed using a data-driven solution. Afterwards, the advantages of the data-driven approach are discussed.

1.2 Generating random Sentences

1.2.1 The Grammar

Sentence => Noun-Phrase + Verb-Phrase
Noun-Phrase => Article + Noun
 Article => the, a, ….
A rule consists either of two or more non-terminals like Noun-phrase or of a collection of terminals like the, a,… This representation is very natural if you have ever worked with context-free grammars before. The rule Article => the, a, … means: An article is either the or a or … (Article = the | a | …)

1.2.2 The Program

That is the Common Lisp code for the naive solution. It can be found here.
(defun sentence ()    (append (noun-phrase) (verb-phrase)))
(defun noun-phrase () (append (Article) (Noun)))
(defun verb-phrase () (append (Verb) (noun-phrase)))
(defun Article ()     (one-of '(the a)))
(defun Noun ()        (one-of '(man ball woman table)))
(defun Verb ()        (one-of '(hit took saw liked)))
The translation to clojure is straightforward. But I will use a slightly modified version which is, as I think, a little bot more ideomatic in clojure:
(declare sentence noun-phrase verb-phrase article noun verb one-of)

(defn  sentence [] (concat (noun-phrase) (verb-phrase)))
(defn  noun-phrase [] (concat (article) (noun)))
(defn  verb-phrase [] (concat (verb) (noun-phrase)))
(defn  article [] (one-of ["the" "a"]))
(defn  noun [] (one-of ["man" "ball" "woman" "table"]))
(defn  verb [] (one-of ["hit" "took" "saw" "liked"]))
I use vectors instead of lists and strings instead of symbols. here are the helper-functions: one-of picks one element of the collection and returns a single-element collection containing it.
(defn one-of [coll]
  (if (seq coll)
   [(rand-nth coll)]))
To test, let's get 10 random sentences:
(take 10 (repeatedly sentence))   
this evaluates to:
(("a" "ball" "liked" "the" "ball")
("the" "table" "took" "the" "ball")
("the" "man" "took" "a" "man")
("the" "woman" "saw" "a" "man")
("a" "ball" "hit" "the" "ball")
("a" "table" "hit" "the" "man")
("a" "ball" "took" "a" "ball")
("a" "man" "liked" "a" "ball")
("a" "table" "saw" "the" "ball")
("the" "woman" "hit" "a" "ball"))
I believe it is apparent that this is not a good representation for a grammar. The lisp functions are definitly harder to read than the normal grammar syntax shown above. As an example, if you wanted to represent the following rule: Adj* => e,Adj + Adj* (note that e is the empty symbol) you would have to write a function like this:
(declare Adj)
(defn Adj* []
  (if (= (rand-int 2) 0)
    (concat (Adj) (Adj*))))
There has to be a better way… enter the data-driven solution!

1.3 A data-driven solution

The naive solution is straightforward to implement while making it harder to encode the grammar rules. By using a data-driven approach, it is the other war round. Writing grammar rules becomes easy while implementation may become harder. So let's take a look how the data-driven solution works for this problem.

1.3.1 The representation of the grammar

The first step is to decouple the grammar from the code which processes it. In the book Norvig represents the grammar with the following datastructure:
 (defparameter *simple-grammar*
 '((sentence -> (noun-phrase verb-phrase))
  (noun-phrase -> (Article Noun))
  (verb-phrase -> (Verb noun-phrase))
  (Article -> the a)
  (Noun -> man ball woman table)
   (Verb -> hit took saw liked))
"A grammar for a trivial subset of English.")
Here the grammar rule sentence => noun-phrase + verbphrase gets translated to
(sentence -> (noun-phrase verb-phrase)).
So a list of elements means "produce this nonterminals in sequence".
Multiple nonterminals right of the -> sign mean "choose one of this nonterminals".
Every rule is itself a list and the grammar is a list of rules.
But this representation is not really clojureish?. We can take advantage of clojures literal representation for maps,vectors and sets in addition to lists.
Using these, the meaning of a datastructure- when choosen appropriate- becomes clearer.
Let me give you an example:
What are grammar rules? Rules map a nonterminal to - posible multiple - nonterminals or terminals. Thus it is appropriate to represent them as a map in clojure. I choose to represent the nonterminals as keywords and the values of the map as either one element or a vector of multiple elements.
A vector means: "apply all elements in order".
Norvig represents a choice of nonterminals as simply the nonterminals written after the -> sign.
For me, it was not clear at the beginning that that means "choose one of these nonterminals".
Using a set in clojure, this becomes obvious. Rewritten, the grammar becomes the following:
(def simple-grammar
{:sentence [:noun-phrase :verb-phrase]
 :noun-phrase [:Article :Noun]
 :verb-phrase [:Verb :noun-phrase]
 :Article #{"the" "a"}
 :Noun #{"man" "ball" "woman" "table"}
 :Verb #{"hit" "took" "saw" "liked"}})

(def ^:dynamic *grammar* simple-grammar)
Ok that's for the design part. Now that we have a good representation of our data, the grammar, we have to worry about evaluating it.

1.3.2 Evaluating the grammar

Because I have chosen a different and more ideomatic representation of the grammar in clojure, the code for evaluating the grammar will be different than the code in PAIP. So don't expect a literal translation of the PAIP code.
So, how can we generate a possible sentence:
the function 'generate' will take the startsymbol as argument and retrieves the rule from the grammar.
If there is not a rule for the argument in the grammar, the argument itself is evaluated (thus making it possible to call generate either with the left hand or the right hand side of a rule).
It will recurse on the elements of the rule in sequence, appending the result as it goes along.
When it encounters a set, it will generate one random element of it. it recursively generates the nonterminal. If it encounters a terminal, that is none of the above are true, it just returns a single-element vector of it.
(defn generate [phrase]
    (cond (get *grammar* phrase) (generate (get *grammar* phrase))
          (sequential? phrase) (mapcat generate phrase)
          (set? phrase)  (generate (rand-nth (seq phrase)))
          :else [phrase]))
Amazing how closely the code mimics the description. So this is data-dricen programming.
Decouple the representation of the data from the code which evaluates it.
Let's take a look at a more complicated example and see how the data-driven approach scales. Here's the bigger grammar:

(def bigger-grammar
 {:sentence [:noun-phrase :verb-phrase]
  :noun-phrase #{[:Article :Adj* :Noun :PP*] :Name :Pronoun}
  :verb-phrase [:Verb :noun-phrase :PP*]
  :PP* #{[] [:PP :PP*]}
  :Adj* #{[] [:Adj :Adj*]}
  :PP [:Prep :noun-phrase]
  :Prep #{"to" "in" "by" "with" "on"}
  :Adj #{"big" "little" "blue" "green" "adiabatic"}
  :Article #{"the" "a"}
  :Name #{"Pat" "Kim" "Lee" "Terry" "Robin"}
  :Noun #{"man" "ball" "woman" "table"}
  :Verb #{"hit" "took" "saw" "liked"}
  :Pronoun #{"he" "she" "it" "these" "those" "that"}
(def ^:dynamic *grammar* bigger-grammar)

Let's generate 10 sentences again:

(take 10 (repeatedly #(generate :sentence)))

(("these" "liked" "a" "ball")
 ("a" "ball" "in" "the" "big" "green" "big" "woman" "in" "that" "by" "Terry" "liked" "a" "man")
 ("that" "liked" "that")
 ("Lee" "hit" "it")
 ("Lee" "took" "Lee")
 ("the" "little" "blue" "little" "ball" "in" "Terry" "by" "Robin" "liked" "Pat")
 ("a" "adiabatic" "blue" "blue" "ball" "saw" "the" "man" "in" "Pat" "by" "Lee" "on" "a" "adiabatic" "table"
  "in" "Terry")
 ("Pat" "hit" "Robin" "to" "those")
 ("those" "liked" "the" "woman" "with" "Robin" "with" "these" "in" "the" "table" "to" "Robin" "to" "a"
  "blue" "adiabatic" "ball" "with" "she" "on" "those" "on" "those")
 ("it" "hit" "Kim" "on" "she" "on" "the" "table"))
It works! Enjoy the funny sentences.
But hey, what if we wanted to see how the sentences are generated, to see the parse-tree. Because of the data-driven design, it is easy to implement this. The data doesn't need to be changed, we need only a new evaluation-function.
(defn generate-all [phrase]
    (cond (get *grammar* phrase)  (list phrase (generate-all (get *grammar* phrase)))
          (sequential? phrase) (mapcat generate-all phrase)
          (set? phrase)  (generate-all (rand-nth (seq phrase)))
          :else phrase))
there are only two changes: make a list of phrase and the generated symbols before recursing and return the phrase in the base call instead of a single item vector containing it.
Imaging changing all functions to achieve this instead of just two lines …
(generate-all :sentence)
         (:Name "Kim")
         (:Verb "saw"
            (:Article "the"
               (:Adj "adiabatic" 
                :Adj* ())
             :Noun "ball" 
             :PP* ())
          :PP* ())))
Norvig gives a last example of a generate-all function, which works on the simple grammar and returns all possible sentences defined by the grammar (the language of the grammar). I leaf the implementation to the reader :)

1.4 Advantages of the data-driven solution

Gratulations for making it through the post. With this chapter, Norvig makes a strong point which will be even more important in the next chapters (and is for programming in general). If you use the data-driven approach, you use "the most natural notation available to solve the problem". So instead of worrying how to implement the problem, worry about how to represent you data, so that it is easy to understand and to scale it. With the data-driven solution, you can
  • expand and modify the program easier
  • use different datasets with the same evaluation function
  • use different evaluation functions with the same dataset
  • represent your problem so that it is easier to understand.
If there is something you take away from this post, it is this:
Use the best notation available to solve your problem!

Chapter 4 coming soon. It contains an implementation of the GPS, the General Problem Solver.
Date: 2012-05-28T18:12+0200
Author: kima
Org version 7.8.09 with Emacs version 23
Validate XHTML 1.0

Keine Kommentare:

Kommentar veröffentlichen