Last time we collected and processed the data for generating stupid fake elementary school science questions and answers. The important parts to remember are
- we generated two files
questions.jsonandanswers.jsoncontaining transition dictionaries mapping each word to an array / list of possible following words, and - we used sentinel tokens
__START__and__STOP__to indicate the beginning and end of sentences.
In this post we'll use a Haskell library called servant to build a web service that generates and returns random questions. If you know Haskell, I'm sure you can find fault with the way I did things, but if you don't know Haskell you might find it educational and/or mind-expanding.
(Code, as always, is on GitHub.)
We'll do this in four steps:
- define some types
- write code that can generate a random question, given an (abstract)
GetNextTokenfunction - implement a (concrete)
GetNextTokenbased on our transitions - create an API that serves up the random questions
The Types
Let's start with the types. We need to define the Question that our web service
will return. In our API, a question will have a questionText, a list/array of
answers (which are just strings), and an integer indicating the index of the
correctAnswer. Pretty simple:
data Question = Question
{ questionText :: String
, answers :: [Answer]
, correctAnswer :: Int
} deriving (Eq, Show)
type Answer = String
$(deriveJSON defaultOptions ''Question)
The last line is (I believe) some template Haskell voodoo
that makes it so our service knows how to serialize a Question to
JSON (since we can't send Haskell objects over the wire). I don't understand it,
I just copied it from the docs.
Now we need to define a type for our tokens. One of the benefits of working in a nicely-typed language is that we don't have to use "sentinel values", we can use our type system for that:
data Token = Start | Stop | Word String deriving (Eq, Ord)
So a token is either Start, Stop, or a Word with an associated String
value. The deriving (Eq, Ord) just makes it so that we can test two tokens
for equality and inequalities.
Since our tokens will come from deserializing JSON, we'll also need a Read
instance, which indicates how to parse text into Token objects:
instance Read Token where
readsPrec _ "__START__" = [(Start, "")]
readsPrec _ "__STOP__" = [(Stop, "")]
readsPrec _ w = [(Word w, "")]
Don't get hung up on the details, it does exactly what you'd expect it to do. (If you do get hung up on the details, read the docs.)
We also want to define a type alias
type GetNextToken = Token -> IO Token
that represents a function that takes a Token and returns an IO Token.
If you are not a Haskell person, you are at this point wondering
- Why does it not just return a
Token? - What the hell is an
IO Token?
For the first, Haskell is a pure functional language. This means that if you tried
type GetNextTokenBad = Token -> Token
any instance of GetNextTokenBad would have to always return the same value
for the same input. In particular, it wouldn't be able to choose the next token
randomly. If we want side-effects like randomness
(or printing things, or reading from files),
we need to do computations in the IO context. So when you see
type GetNextToken = Token -> IO Token
you can understand that as a function that takes a token, does something side-effectful,
and returns a new token in the IO context. In particular, this function doesn't
need to return the same value for the same inputs, but also you can only use it
in a context that allows side effects. More on that in a bit.
Generating Random Questions
Now we're ready to write the code for generating a sentence. This is where things
start to get a little complicated. We'll break it into two parts. First, given
a starting Token and a GetNextToken function, we want to generate a list of
Tokens in the IO context:
tokensFrom :: Token -> GetNextToken -> IO [Token]
tokensFrom startToken getNext = do
nextToken <- getNext startToken -- nextToken :: Token
case nextToken of
Stop -> return []
token -> liftA2 (:) (pure token) (tokensFrom token getNext)
This shouldn't be hard conceptually, it's just recursion:
tokensFromtakes a startTokenand aGetNextTokenfunction- it calls the
GetNextTokenfunction on the startingToken - if
nextTokenisStop, the result is an empty list; - otherwise, the result is the list whose first element is
nextToken, and whose subsequent elements are the results oftokensFrom nextToken.
In reality, it's complicated because of the need to do things in an effectful
context. The do is
sugar for
working in the IO context. In particular, it allows us to pull the Token
value out of the result of a GetNextToken call. That is, while getNext returns
an IO Token, as long as we're inside the do block for an IO context, we can
use <- to "get the Token out."
If we find Stop, the result is return []. Notably, this is not the return
you might know from other languages. Here this is
return :: a -> IO a
which sticks a value (in this case the empty list) into an IO context. So, since
[] is a [Token], return [] is an IO [Token].
The last line is even uglier. (:) is the "cons" operator that takes a head and
a tail and produces a list:
(:) :: a -> [a] -> [a]
Here nextToken is a Token, but the recursive call to tokensFrom produces
an IO [Token], so the types don't match up. We've already seen that we can
shove values into an IO context, so we could get by if we had something like
-- | not a real operator
(:???) :: IO a -> IO [a] -> IO [a]
We can get there with liftA2, which (specialized for IO) looks like
liftA2 :: (a -> b -> c) -> IO a -> IO b -> IO c
That is, it "lifts" a function of two arguments into an IO context. If you
work through the types, you get:
liftA2 (:) :: IO a -> IO [a] -> IO [a]
which is exactly what we want.
[Why did I use pure instead of return to stick
nextToken into the IO context? I'm not sure, exactly. In this case they're
the same thing. In the previous instance I was using IO as a Monad, so I used
return; here I'm using it as an Applicative, so I used pure. That's not a
good explanation, and it's probably not even a good reason. I don't care.
(I was also trying not to say "monad" in this post, but I guess I failed.)]
Next we want to turn a list of Tokens into a String:
smartJoin :: [Token] -> String
smartJoin = dropWhile (== ' ') . concat . addSeparators
where
addSeparators = concatMap addSeparator
addSeparator token = case token of
Word w | w `elem` ["?", ",", "."] -> ["", w]
Word w -> [" ", w]
_ -> []
The first thing we do is addSeparators, which turns each Word into a list
[separator, word] and then concatenates the resulting lists.
If the Word is punctuation, the separator is an empty string.
Otherwise it's a space.
(We should never call smartJoin on a list that includes the Start or Stop
tokens, but just in case we add in an empty list, which is the same as ignoring
the token.)
So, for instance, if you were to call
addSeparators [Word "What", Word "is", Word "love", Word "?"]
you would get
[" ", "What", " ", "is", " ", "love", "", "?"]
We then call concat on that to concatenate all the strings
" What is love?"
and dropWhile (== ' ') to get rid of the leading spaces. (I know, sort of clunky.)
Now we're ready to implement our sentence generator:
generate :: GetNextToken -> IO String
generate = fmap smartJoin . tokensFrom Start
To be a jerk, I wrote it in point-free style, it's the same as if I'd done
generate nextToken = fmap smartJoin (tokensFrom Start nextToken)
Here tokensFrom generates IO [Token] (an list of tokens in an effectful context)
and fmap lifts smartJoin (which maps [Token] -> String) into the IO
context, resulting in our desired IO String.
And finally we can create our Question generator:
randomQuestion :: Int -> GetNextToken -> GetNextToken -> IO Question
randomQuestion numAnswers getNextQuestionToken getNextAnswerToken =
Question <$> generate getNextQuestionToken
<*> replicateM numAnswers (generate getNextAnswerToken)
<*> randomRIO (0, numAnswers - 1)
It takes an Int indicating how many answers the question should have.
And it needs two GetNextToken functions, one for generating questionText
and the other for generating Answers.
You can think of <$> and <*> as
plumbing to lift the Question constructor into the
IO context. That's a Haskell-y way of doing (in essence)
-- the constructor is in essence
-- Question :: String -> [Answer] -> Int -> Question
makeEffectfulQuestion :: IO String -> IO [Answer] -> IO Int -> IO Question
makeEffectfulQuestion = liftA3 Question
Here the IO String comes from generate-ing the question,
the IO [Answer] comes from using replicateM to generate multiple answers,
and the IO Int comes from choosing a random "correct answer".
Using Transitions
Now that we have a way to generate Questions using GetNextToken functions,
we have to figure out how create GetNextToken functions from the
transition maps we generated last time. We serialized them as JSON,
but now we want a typed way to work with them in Haskell:
type Transitions = M.Map Token [Token]
Here Transitions is a Map (like a dictionary)
whose keys are Tokens and whose values are lists of Tokens.
However, our serialized map of transitions is a dictionary whose keys are
strings and whose values are lists of strings. That means we need to
deserialize it and then convert the strings to Tokens:
loadTransitions :: String -> IO Transitions
loadTransitions = fmap (textToTokens . fromJust . decode) . BS.readFile
where textToTokens = M.map (map read) . M.mapKeys read
Our loadTransitions is another point-free function. It reads a file
(which gets us some bytes in an IO context), and then uses fmap to lift the three
composed functions into the IO context.
First, decode Maybe-deserializes the bytes into a map (with text keys and values).
After that, fromJust assumes the deserialization succeeded and pulls the map out of the Maybe.
Finally, textToTokens converts the text-texts map into a Token-Tokens map.
(The fromJust isn't a "safe" way to do things (usually we'd want to check that
decode doesn't return Nothing and deal with that somehow),
but because we generated the JSON ourselves, we know it's valid.)
How does textToTokens work? First, it calls M.mapKeys read, which returns the
new Map that results from applying read to each of the input Map's keys.
So it returns a map whose keys are Tokens but whose values are still lists of text.
And then we feed it into M.Map (map read), which returns the Map that results
from calling map read on each of the input Map's values. Those values are
lists of text, so map read converts each one to a list of Tokens.
At the end of the process we have a M.Map Token [Token] as required.
Now we're ready to actually load the data:
questionTransitions :: IO Transitions
questionTransitions = loadTransitions "questions.json"
answerTransitions :: IO Transitions
answerTransitions = loadTransitions "answers.json"
Next, remember that the abstraction we used was
type GetNextToken = Token -> IO Token
so we simply need to implement a function like this that uses our Transitions.
First we write a function to pick a random element of a (nonempty) list.
We get a random Int (in an IO context, of course)
and use it to index into the list:
-- will crash if the input is an empty list
pick :: [a] -> IO a
pick xs = do
idx <- randomRIO (0, length xs - 1) -- choose a random index
return (xs !! idx) -- return that element of the list
And then our implementation is easy, we just create a function that takes as
input a Transitions object and returns the corresponding GetNextToken
function:
randomNextToken :: Transitions -> GetNextToken
randomNextToken transitions token =
case M.lookup token transitions of
Just tokens -> pick tokens
_ -> return Stop -- this shouldn't happen, but let's be safe
If you are confused about why we define it as randomNextToken transitions token,
substitute in the definition of GetNextToken:
randomNextToken :: Transitions -> Token -> IO Token
Once it's applied to a Transitions object, what's left is a function that
looks up a token in the Transitions map and pick one of the
following tokens at random.
The API
Finally, we're ready to create the actual web service. To start with, we define our API:
type API = "question" :> Get '[JSON] Question
It has a single endpoint "question", which responds to HTTP GET requests
and returns a Question serialized into JSON.
My first attempt at implementing this turned out to be really slow.
After poking around at a lot of stuff, I finally figured out it was because
every reference to the effectful questionTransitions and answerTransitions
was deserializing them from disk again. Needless to say, that was not the desired
behavior.
After some digging I found System.IO.Memoize, which memoizes expensive IO actions (like deserializing a giant transitions object).
Initially this didn't help because I was memoizing too late. So I moved it
right to app startup:
startApp :: IO ()
startApp = do
cachedQt <- eagerlyOnce questionTransitions
cachedAt <- eagerlyOnce answerTransitions
run 8080 $ simpleCors $ app cachedQt cachedAt
(Incidentally, most of this stuff is standard servant boilerplate,
just tweaked in order to use my cached Transitions.)
The type of eagerlyOnce is
eagerlyOnce :: IO a -> IO (IO a)
Since questionTransitions is IO Transitions, this means that
eagerlyOnce questionTransitions is IO (IO Transitions). Since we're in an IO
context, the <- means that cachedQt and cachedAt are both IO Transitions
(and that they should memoize their values).
(The simpleCors is just middleware that allows our service to handle
cross-origin requests.)
Now we can define our Application.
Which again needs the cached transitions as inputs, I am not very happy about the ugly way we're passing them around everywhere, but when I tried to avoid that by e.g. moving all the helpers into the `startApp` function, I got all sorts of cryptic "Couldn't match type" errors, so eventually I gave up and accepted my fate.
It's pretty simple (again, this is all basically servant boilerplate):
app :: IO Transitions -> IO Transitions -> Application
app cachedQt cachedAt = serve api (server cachedQt cachedAt)
And finally we define the server:
server :: IO Transitions -> IO Transitions -> Server API
server cachedQt cachedAt = liftIO $ do
qt <- cachedQt
at <- cachedAt
randomQuestion 4 (randomNextToken qt) (randomNextToken at)
In an IO context it retrieves the cached transitions for the questions and
answers, and then it uses them to generate a random Question. It then uses
liftIO to lift the Question out of the IO context and into the Server
context.
There is a tiny amount of more boilerplate:
api :: Proxy API
api = Proxy
AND THAT'S IT. If you build and run it, you'll end up with a (very fast) service running on localhost:8080:
$ curl http://localhost:8080/question
{"answers":["a rainstorm lasting several times","preventing too many babies the fall leaves","worms from the morning.","conserving water."],"correctAnswer":0,"questionText":"In order of behavior is most important to make life must first"}
The Punchline
After all that work, I spent a couple of hours trying to deploy this to an EC2 machine,
failing miserably. The generated executable depends on a bunch of libraries on
my system. When I tried to statically include those, the compilation failed.
And the EC2 machine was way too underpowered to install stack and build it
there. The Internet/StackOverflow was not a lot of help.
At the end of the day, I just rewrote it in flask and deployed that version. :sad_face
(However, it was only because I had a (much faster) flask version that I realized
the servant version was way too slow and went down the System.IO.Memoize path,
so in that sense it's a good thing!)
The flask version is up and running at http://54.174.99.38/question:
$ curl http://54.174.99.38/question
{"questionText": "Which system?", "answers": ["Absorbing water plants than the air pollution", "It will be healthy", "flood the air pollution", "tying a great gardener."], "correctAnswer": 0}
But it's a cheap EC2 nano instance, so please be gentle.
Next Time
In the third (and final) post, we'll build a quiz webapp that uses this service.