How to implement string matching based on a pattern

by Vincent Rischmann   Last Updated December 03, 2017 19:05 PM

I was asked to build a tool that can identify if a string match a pattern.

Example:

{1:20} stuff t(x) {a,b,c}

would match:

  • 1 stuff tx a
  • 20 stuff t c

It is a sort of regex but with a different syntax

  • Parentheses indicate an optional value
  • {1:20} is a interval; I will have to check if the token is a number and if it is between 1 and 20
  • {a,b,c} is just an enumeration; it can be either a or b or c

Right now I implemented this with a regex, and the interval stuff was a pain to do.

On my own time I tried implementing some kind of matcher by hand, but it turns out it's not that easy to do.

By experimenting I ended up with a function that generates a state table from the pattern and a state machine. It worked well until I tried to implement the optional value, and I got stuck and how to generate the state table.

After that I searched how I could do this, and that led me to stuff like LL parser, LALR parser, recursive-descent parser, context-free grammars, etc. I never studied any of this so it's hard to know what is relevant here, but I think this is what I need:

  • A grammar
  • A parser which generates states from the grammar and a pattern
  • A state machine to see if a string match the states

So my first question is: Is this right ? And second question, what do you recommend I read/study to be able to implement this ?



Answers 1


If the requirements are as described (just matching static text, integers in a set range, optional values and alternation), then your patterns do describe regular languages, so a regular expression is probably the right way to go -- full context-free grammars give you more power than you need here.

If you want to implement a matcher from scratch, you should read up on DFAs, and NFAs. I would also recommend you read Regular Expression Matching Can Be Simple And Fast, which describes how to implement efficient (linear in the length of the text being searched) matching using NFAs and DFAs.

Alternatively, you could just hack something together to turn your patterns into a more common regular expression syntax and use an existing regex engine (I would recommend re2, which was created by the same person who wrote the above description).

To avoid the difficulties of matching integer intervals textually, I would suggest you turn intervals into general integer matching (i.e., match any sequence of digits) and use a separate pass to check whether the value is in the specified interval. If you end up writing your own matcher, you could build this functionality into the core. Note that matching integer intervals textually is entirely possible (they're still regular), but pulling out arbitrary integers and performing a separate range check may be simpler.

John Bartholomew
John Bartholomew
October 19, 2012 23:15 PM

Related Questions




How do you make decorators as powerful as macros?

Updated December 03, 2017 19:05 PM