An algorithm for a BNF matcher

BNF (short for Backus-Naur Form) is a notation for specifying syntaxes. It allows specifying syntax with sums (|) and products (). For example, foo ::= 1 | 2 specifies that foo can be either 1 or 2, and baz = <foo>2 specifies that baz can be either 12 or 22.

A BNF matcher is a program that takes as input a grammar and a string and tells us whether the string is captured by the grammar or not. A BNF parser works similarly, but instead, it generates a syntax tree out of it rather than telling if a particular string matches the structure.

BNF is important because it’s used as the syntax of languages in computing, mainly in programming languages. That is, any time you get a “Syntax error! Missing ;” or a similar error, you can freely blame BNF 🙂

We will start with a concrete example, then give a general algorithm and finally write a BNF matcher in PHP and Lisp.

To start, let’s consider the following grammar expressed using the BNF notation:

if_stmt := "if " test
test    := name cmp name
cmp     := "==" | "!=" | ">" | "<"
name    := alpha name | alpha
alpha   := "a" | "b" | "c"

There are a few things to note about its structure (and in general about the structure of BNF):

Rules are separated by :=, where on the left-hand side is the rule name, and on the right-hand side is the string that the rule captures
The string within the rules can be separated by zero or more |, which represents a disjunction (sum): a | b matches both a and b.
The string within the rules can also be separated by zero or more spaces, which represents conjunction (product): name := alpha name says that a name is a character followed by another name (recursively): (a a, a b, a c, b a, b b, … a a a, …)
The tokens (space-separated words) within the string in the rules can either represent a literal value (if no rule name matches that particular string), or a rule name. This allows for easier structuring and organization.
1. Note how we define name as alpha name | alpha rather than name := ("a" | "b" | "c") name | ("a" | "b" | "c"). In this case, a rule name is captured.
2. Note how we define if_stmt as "if " test. In this case, if is literally matched (since no rule exists for that name)
Conjunction and disjunction can be used together, in combination (note the definition of name).

Having explained the structure of this grammar, we can now test a few strings.

For example, does this grammar accept the string if a == a? The answer is yes. The rules to apply are in order: if_stmt,test,name,alpha,cmp,name,alpha:

Current string	Rule matched	Note
if a == a	`if_stmt`	The rule matches the string, so we consume the literal `if` and proceed to parse the remainder of the string using the other rules
a == a	`test`	`name cmp name` matches `a == a`
a	`name`, `alpha`	Since `name` can be either `alpha name` (a string of chars) or `alpha` (a single string), it captures the `a`
==	`cmp`	A literal match
a	`name`, `alpha`	Since `name` can be either `alpha name` (a string of chars) or `alpha` (a single string), it captures the `a`

Cool!

For another example, we will show why it doesn’t match the string if a. The only way for the if literal to be matched is to apply the if_stmt rule. But, having applied this rule, and after the if literal is consumed, we have an application of the test rule which is of the form name cmp name. Since we cannot unify the literal a with name cmp name, there are no rules that match this string.

Looking at these two examples, we can derive a more general algorithm for testing whether a string is captured by some BNF grammar.

Whether a rule $R$ accepts a string $S$ , we start by comparing the tokens of both the string $S$ and the rule $R_S$ by defining a rule_match procedure:

If the token within the rule represents a literal and we have a match (i.e. $x$ is a literal for $S = (x, y, ...)$ and $R_S = (x, y' ...)$ ), we proceed with comparing the remaining tokens (i.e. $S = (y, ...)$ and $R_S = (y' ...)$ )
If the token within the rule represents another rule (i.e. the $x$ within $R_S = (x, y' ...)$ is a rule), we apply rule_match recursively to the string $S$ , now applying this new rule $x$ in place of the original one. As soon as the recursion is done, we proceed processing the remainder of the tokens (if any)
Otherwise, the rule does not match the string

Having defined this rule_match procedure, now we can define a more general procedure grammar_match that will call rule_match for every possible rule. If all of the rules return false, then the string is not captured by the grammar. Otherwise, it is captured by the grammar.

Now that we described this algorithm in English, we can write it in a programing language. I wrote this algorithm in PHP, and also in Lisp.