About Me

My photo
Northglenn, Colorado, United States
I'm primarily a BI Developer on the Microsoft stack. I do sometimes touch upon other Microsoft stacks ( web development, application development, and sql server development).

Thursday, October 25, 2007

Optimize your regular expressions

An interesting article by Tess: http://blogs.msdn.com/tess/archive/2007/10/25/net-hang-case-study-high-cpu-because-of-poorly-formatted-regular-expressions-identifying-tight-loops.aspx

She went into the issue that if you don't optimize your regular expressions, the run time can be increase exponentially as the size of the string increases.

The reason for this, is because the regular expression is a NFA (Non-Deterministic Finite Automata)

Her example:
given the string: helloworldhelloworld
using regex: ([a-z]+)*= versus [a-z]*

given the string: abcde
Regexp: (abe|abc)(dg|dc|de) versus ab(e|c)d(g|c|e)

The long version

The order of execution here will be something like

a matches a in abe
b matches b in abe
c doesn't match e in abe, backtrack to try with the next expression
c matches c in abc - done with first part
d matches d in dg
e doesn't match g in dg, backtrack to try with the next expression
e doesn't match c in dc, backtrack to try with the next expression
e matches e in de - done with the regular expression


She recommends to avoid/decrease


  • nested quantifiers (*, +, {m,n})

  • backtracks

  • ambiguous matches (i.e. in this case the number of combinations that cause a positive match for part of the string)

No comments: