Let’s have a look at how to build an LL(1) parser and what considerations we need to make when writing an LL(1) grammar.
For this post I assume basic knowledge about grammars and languages, but I try to be gentle.
Motivation
The lack of clear and concise information about this topic lead me to think that it could be useful for many people to have a small guide for it. What are the key constraints that an LL(1) grammar has? How to actually write a parser for a grammar you designed? The answers to these questions are actually quite simple, yet most sources I had looked up seemed to be either unclear, not very well focused, or difficult to find.
A secondary reason is that, having just finished my compiler course, I would like to write down some of my knowledge before I forget it all again after the summer break (summer breaks have this powerful and mysterious brain erasing capability, and I have yet to find a good way to fight against it). LR parsers are more properly covered on the internet and resources about this are easier to find.
LL Grammar and Language
So what is an LL grammar anyway? It’s a subset of deterministic context-free grammars, specifically those that can be parsed by an LL parser. This parser reads the input from Left to right, and constructs the Leftmost derivation of whatever you are parsing. Note how this definition of LL grammar may sound a little bit obvious (“can be parsed by LL parser”). That is because giving a more precise definition is actually quite complex, as it is not easy to pinpoint which languages fulfill this requirement and which not. I won’t go into detail for this, since it’s not the purpose of this post.
An LL language is one that has an LL grammar to define it. You can also use a non-LL grammar to define an LL language, but this is not convenient. Also sometimes you have a language that may not look like LL at first, but with a few modifications to its grammar you may find out that you can make it LL.
Our goal will be to define a grammar for our language that happens to be LL, such that we can benefit from the properties of its parser.
LL and LR Parsers
An LL parser is predictive. That is, if you follow a set of conditions such that your grammar is LL, your parser will be really fast because it never has to do backtracking and is extremely simple. This simplicity is what it will also allow us to code a parser by hand (coding an LR parser by hand can be an extremely tough task, and everyone relies on existing tools to do this). One of the common ways to implement these parsers is by using recursive descent, which we will use in this post.
By the way, the number within parenthesis in LL(k) or LR(k) means the number of lookahead tokens that the parser is able to see to make the next decision. In our case, LL(1) means our parser can only see the next token before parsing it.
The big brother of an LL(1) parser is an LR(1) parser. Let’s set things clear here before we move on:
- LR(1) parsers can parse ANY deterministic context free language (DCFL). It might be difficult to find a grammar for it, but it is theoretically possible.
- LR(k) parsers, for any k, are just as powerful as LR(1). The grammar needed for it just becomes more complex, verbose or difficult to find.
- LR(k) parsers are more powerful because they are aware of all preceding tokens in the parsing procedure, unlike LL(k) which only knows about the current and next token.
- LL(k) parsers generally don’t use backtracking; if they did, it would defeat their purpose, which is running fast and being really simple.
- LL(1) parsers can ONLY parse LL(1) languages.
- LL(2) parsers can parse LL(1) and LL(2) languages.
- There is an infinite sequence of subsets of DCFLs as we increase the k in LL(k), none of them really ever reaching the whole set.
- Thus, an LL(k+1) parser can parse strictly more grammars than an LL(k) parser, and strictly less than an LR(1) parser.
How to make an LL(I) grammar
If you have a language and want to attempt to make an LL(1) grammar for it (thus demonstrating that what you had is indeed an LL(1) language), there are several constraints that you have to follow when defining it. If all constraints apply, you successfully obtained an LL(1) grammar and you can move on to the next section to build a parser for it.
The constraints are:
- If you have several rules \( X \rightarrow Y, X \rightarrow Z \), then \( First(Y) \cap First(Z) = \emptyset \)
- If you have several rules \( X \rightarrow Y, X \rightarrow Z \), and \( First(Z) \) contains \( \epsilon \), then \( First(Y) \cap Follow(Z) = \emptyset \)
- No left recursions: \( X \rightarrow X… \)
- If you have several rules \( X \rightarrow Y, X \rightarrow Z \), then \(Z\) must not be non-false (i.e. must be able to return false)
The intuition behind these rules boils down to allowing the parser to make the correct decision with just one lookahead token. If there are several rules that start the same way and only differenciate themselves in the last token, the parser cannot know which rule it is until it is too late. If it took the wrong rule at the beginning, it can’t go back (thats the point, remember? no backtracking!). So for example, the second of our rules says that the First set of two right hand sides that share the same left hand side must have no common tokens. If they had any token in common, the parser wouldn’t know which of the two rules to use when it sees that token. Similarly, if the first of those rules is empty, then whatever comes after it (Follow set) also has to be different from the first set of the other rule.
As you will see below, each rule will become a function. Whenever a rule has a non-terminal symbol on the right side, we will call it’s function within our function. If we have left recursions, you can see how it will produce an infinite loop of function calls. That’s the reason behind the third condition.
Lastly, the non-false condition is there to avoid some similar situations to the first two.
Some of these conditions can be derived from the others, but for the sake of clarity and simpler identification we will use them.
How to make an LL(1) parser
Let’s suppose we have a grammar that allows words like these:
{a = 1;}
{if (a) then b = 5;}
{a = a + b - c; a = 4;}
{if (a) then if (b) then c = 0;}
Some words that should not belong are:
a = b # No curly brackets
{a = b} # No semicolon
{a + b = c;} # Assignments don't have value
{if x then 5 = c;} # Numbers can't be assigned
Now it’s our task to use the conditions defined above to actually write an LL grammar. Turns out one good grammar is the following:
program -> { statement_list } eof
statement_list -> statement statement_list
statement_list -> empty
statement -> id = expr ;
statement -> if ( expr ) then statement
expr -> id expr_tail
expr -> num
expr_tail -> + expr
expr_tail -> - expr
expr_tail -> empty
Lets write a python parser that reads the input and tells us if it belongs to our new fancy language or not.
Each rule will be one function. The function will return true if the rule could be applied, and false if not. Inside the function, the right hand side of the rule will be sequantially parsed. If a non-terminal is encountered, it’s function is called. For terminals, they will be compared with the current lookahead symbol. If the first function or the first terminal symbol comparison is false, then the function returns false. If, however, the first symbol(s) already returned true and we advanced in the rule parsing procedure, a function or comparison returning false will imply a parsing error. Let’s have a look with an example.
The first function that our parser will call is the one we usually call “program”, which basically is a function that will tell us if the word if part of the language or not. It’s the function associated to our first rule in the grammar. It will check that the word starts and ends with a curly bracket, and call statement_list to make sure that the thing inside is indeed a list of statements.
program():
if not read("{") return False
if not statement_list() raise ParseError
if not read("}") raise ParseError
return True
If the first token is not an opening curly bracket, program() will return false, saying that what comes now cannot be generated with this rule. It could be that our grammar has more than one starting rule, and we would then go ahead and check those.
However, if the first token matches, it means that undeniably this is part of our program rule. If anything goes wrong while we continue parsing, we can safely assume that our input is not part of the language. We can be sure because our grammar is an LL(1) grammar and fulfils the above described conditions. There cannot be any other rule that starts with an opening curly bracket at the same level as the “program” rule.
This is the essence of LL grammars and parsers. Go ahead and read the last two paragraphs again, and let it sink in. Once you understood this concept, writing the other functions becomes trivial.
Let’s have a look at another function.
def statement_list(self):
# statement_list -> statement statement_list
if self.statement():
if not self.statement_list(): raise ParseError()
return True
# statement_list -> empty
else: return True
Here we are saying that if we find a statement, then we better find a statement_list next, otherwise something is wrong. If we didn’t find a statement in the first place, we return true because our statement list can be empty as well. In this particular case, statement_list will never return false nor raise a ParseError, but the statement function it calls could. We still write everything out for the sake of completeness. All other functions are written in the same manner.
Whenever we have a terminal symbol in our rule, we gotta shift. That’s what the read function is for. It knows where we are with the position attribute, and attempts to shift away the characters it receives as parameter.
The full program is implemented in this github gist here.
Try it out by running python3 ll1_parser.py "someword"
, and it will tell you if its part of the language or not!