Parsing and evaluating complex search clauses using CNF

I talked a bit about my recent internship in my previous blog post, and mentioned that I was heavily involved in the development of the search component of our data file viewer and editor. The program enabled the user to make very complex searches, for example:

(First-Name = Julie AND (Country = Canada OR Ontario)) OR (Jimi AND (Last-Name = Cullen OR London))

Although the search queries can be complicated, the way they were handled and evaluated was very simple, and I'm going talk a bit about that in this blog post.

What Are We Searching For?

First I should mention a bit about the files that were being searched in. These were data files, mostly output by COBOL programs, which might store data about a company's employees, for example. A typical raw record might look like this:

Jimi Cullen 10101992London Britain 10 Green

In this example it's relatively easy to read (for the sake of people reading this blog post!) but some records might be encoded in a non-human readable way, or include two phone numbers, and employee number, and a national insurance number all in a row: pretty difficult to decipher by eye. Thankfully, it was possible to create a 'layout' file specifying the constituent fields of the records, which would allow the program to display the record in a much more readable way, like so:

  • First-Name: Jimi
  • Last-Name: Cullen
  • Birthdate:
    • DD: 10
    • MM: 10
    • YYYY: 1992
  • City: London
  • Country: Britain
  • Favourite-Number: 10
  • Favourite-Colour: Green

Defining A Search Clause

Valid search clauses were defined recursively, using only file simple rules:

  • A plain string (such as 'Jimi') is a valid search clause, which will match any record containing the given string anywhere in the record.
  • If a layout is supplied, a field comparison (such as 'Country = Canada' or 'Age < 30') is a valid search clause, which will match any record where the specified field satisfies the given constraint.
  • If A and B are valid search clauses, then so is A AND B (e.g. 'Jimi AND City = London'), matching records matched by both of the connected search terms.
  • If A and B are valid search clauses, then so is A OR B (e.g. 'Age < 18 OR Age > 65'), matching all records matching either of the connected search terms.
  • Finally, if A is a valid search clause, then so is (A).

The final rule gives the user some control over the order in which the search clauses are evaluated. (Unless otherwise specified, AND terms are evaluated first from left to right in the search clause).

Using just these rules, we can build up complex, nested search clauses. Going back to the example at the beginning of the post:

(First-Name = Julie AND (Country = Canada OR Ontario)) OR (Jimi AND (Last-Name = Cullen OR London))

You can see that it consists of plain search terms ('Jimi', 'London') and field search terms ('Country = Canada'), connected with ANDs and ORs, and bracketed.

Having these simples rules allowed validation of search clauses to be relatively easy to implement recursively.

Storing And Evaluating The Search Clauses

The user has entered a search clause, and it has been validated so we know that it satisfies the above rules. What next? It would be nice to be able to store and handle the search clause without having to worry about all that nesting - and recursion is fine, but some nice easy iteration might save us some stack space. Hooray for...

Conjunctive Normal Form (CNF)

Consider the search clause

(A OR B OR C) AND (D OR E) AND F AND (G OR H OR I OR J)

Where the letters A, B... represent search clauses of the simplest kinds: plain search clauses and field search clauses. Then you can see that this is a valid search clause according to the rules I outlined above. What's special about this search clause is that it consists of groups of simple search clauses connected by OR (disjunctions), and these disjunctions are all connected together with AND. Hence it is a conjunction of disjunctions, and we say it is in Conjunctive Normal Form.

It is easy to show that any search clause satisfying the rules given earlier can be converted to one in conjunctive normal form:

  • A plain or field search clause is already in CNF.
  • If S is a search clause in CNF, then (S) is also in CNF.
  • If A and B are simple search clauses, then (A AND B) and (A OR B) are in CNF.
  • If A, B, and C are simple search terms, we can make the following conversions:
    • A OR (B AND C) == (A OR B) AND (A OR C)

In this way, we can convert any valid search clause to an equivalent search clause (which will match exactly the same records) in CNF.

The astute among you will notice that conversion to CNF is exponential in the worst case. Rest assured: search queries are unlikely to include enough individual clauses for the conversion to take a significant amount of time or memory. The files we are searching in can be dozens of GB, and a one-time (per query) conversion step is negligible compared to the time it takes to navigate through the file and evaluate the records against the search query.

Evaluation

Once in CNF, the clauses can be easily stored (I used a List of Lists of simple search clauses), simply iterated through, and are easy to reason with. They can also be evaluated efficiently against the records.

Suppose we are evaluating the disjunction

A OR B OR C OR D

First we check whether the record matches clause A. If it doesn't, we check the record against clause B. If it matches B, we don't need to check C or D - if it matches B then it definitely matches A OR B OR C OR D, so we can move on to the next disjunction!

Likewise, when evaluating the whole CNF clause, we only need to go until we find the first disjunction that doesn't match the record, and at that point we can reject the record.

In this way, we can simply and efficiently check the file's records against the search clause.

Potential Optimisations

There are potential optimisations to be made here which I didn't get a chance to implement before my internship ended.

I had the idea of arranging the simple search clauses within the disjunctions in an order such that the most likely clauses to match a record would be evaluated first, to reduce the average number of simple evaluations needing to be performed. An example of this is that, in general, a short string is more likely to be found in a record than a long one (the string 'en' is going to show up in more names than the string 'Cullen'). Of course, we can only guess based on probabilities which clauses are going to match more records. This could be based on more thorough heuristics on the file being searched in. Either way, I'd like to try some benchmarking tests of different ways of choosing the order of evaluation on realistic data to see what's most effective.

Likewise we can try to choose the order in which we evaluate the disjunctions themselves cleverly. It could be that evaluating short disjunctions first would be wise, since if one failed we could reject the record without having to bother checking the longer disjunctions.

Other Considerations

While search clause evaluation is important (if you're checking a few million records against the search clause, you want to be sure you're doing it properly!), a bigger time consideration is file navigation. We can read each record in, one at a time, and check them against the search clause. There are better, quicker ways of going about it - however, that's a consideration for another blog post.