“All, but these” Regular Expressions

So-called “negative lookarounds” are a lesser-known but rather useful feature of regular expressions. For instance, you can use them to construct an expression that matches any text except “dog” or “cat”. Or, in ORF’s context, one that matches any email address except those ending in “@mydomain.org”.

This comes in very handy with ORF’s exception lists. For instance, let’s assume you want to turn on Greylisting for just a few selected local recipients. That makes sense: some mailboxes are just too important to suffer the delay that normally comes with the technology. As ORF has a Recipient Exception List for Greylisting, this is something you can do right away. Yet I doubt you’d love adding 990 excepted local recipients to the list just for the sake of enabling Greylisting for 10 non-excepted recipients (and then keeping that list up to date for a 1000-mailbox company). Of course, there is a better solution and yes, it is negative lookarounds.

In this particular case, a negative lookbehind regex will help. This is what it looks like for one email address (a@example.com):

.*(?<!^a@example\.com)$

Both a@example.com and b@example.com:

.*(?<!^a@example\.com)(?<!^b@example\.com)$

a@example.com, b@example.com and c@example.com:

.*(?<!^a@example\.com)(?<!^b@example\.com)(?<!^c@example\.com)$

The first regex will match anything except a@example.com. The second regex will match anything except a@example.com or b@example.com. As for the third one, you have already figured it out.
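If you want to try the combination technique outside ORF first, here is a minimal sketch in Python. The helper name all_but is made up for this illustration, and Python’s re module is not the engine ORF uses, so treat it as a sanity check rather than proof of how ORF behaves. re.match is used to imitate anchoring at the start of the string.

```python
import re

def all_but(*addresses):
    """Build a regex that matches any string except the listed addresses.

    Hypothetical helper: it chains one negative lookbehind per address,
    following the pattern shown in the article.
    """
    lookbehinds = "".join(f"(?<!^{re.escape(a)})" for a in addresses)
    return re.compile(f".*{lookbehinds}$")

one = all_but("a@example.com")
three = all_but("a@example.com", "b@example.com", "c@example.com")

for address in ("a@example.com", "c@example.com", "zzz@example.com"):
    # re.match only anchors at the start; the trailing $ does the rest.
    print(address, bool(one.match(address)), bool(three.match(address)))
```

Note that an address such as aa@example.com still matches, because the ^ inside the lookbehind only rejects the listed address when it makes up the whole string.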

The logic around exception lists gets a little complicated, but translated to human language it reads: “Make an exception if the recipient address is anything except a@example.com”. Note that all addresses must be listed in the same regex. If you tell ORF to “Make an exception if the recipient address is anything except a@example.com”, ORF will look no further for an address like zzz@example.com, because it already knows that address is excepted, regardless of whether you have a second, similar statement about b@example.com. Worse, b@example.com itself matches the first statement and a@example.com matches the second, so with two separate entries every address ends up excepted. What you really need to tell ORF is “Make an exception if the recipient address is anything except a@example.com, b@example.com or c@example.com”, in a single step, using the combination technique above.
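To see why separate entries do not work, here is a small Python model of the exception list. It assumes the semantics described above (a recipient is excepted as soon as any one entry matches); the function is_excepted is a stand-in for illustration, not ORF’s actual code.

```python
import re

def is_excepted(recipient, exception_list):
    # Assumed semantics: stop at the first entry that matches.
    return any(re.match(pattern, recipient) for pattern in exception_list)

# Two separate entries: each address is "anything but" the other one.
separate = [
    r".*(?<!^a@example\.com)$",
    r".*(?<!^b@example\.com)$",
]

# One combined entry, as recommended in the article.
combined = [r".*(?<!^a@example\.com)(?<!^b@example\.com)$"]

for recipient in ("a@example.com", "b@example.com", "zzz@example.com"):
    print(recipient,
          is_excepted(recipient, separate),
          is_excepted(recipient, combined))
```

With the separate entries, every recipient comes out as excepted, so Greylisting would apply to nobody. With the combined entry, only a@example.com and b@example.com are left non-excepted.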

Just a few ideas of what can be done with the help of this trick:

  • To enable certain tests for specific domains or recipients only.
  • To disable Recipient Validation for all domains except one or a few.
  • To blacklist all .dat attachments except winmail.dat.
  • To blacklist all attachments, except .pdf and .doc ones.
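The two attachment ideas could be sketched along the same lines. The patterns below are my own guesses at suitable expressions, checked here with Python’s re module only; case-insensitive matching is an assumption, and the patterns should be tested in ORF before you rely on them.

```python
import re

# Any .dat file name, unless the whole name is winmail.dat.
dat_except_winmail = re.compile(r".*\.dat(?<!^winmail\.dat)$", re.IGNORECASE)

# Any file name that does not end in .pdf or .doc.
# Note: "letter.docx" does not end in ".doc", so it is matched as well.
not_pdf_or_doc = re.compile(r".*(?<!\.pdf)(?<!\.doc)$", re.IGNORECASE)

for name in ("winmail.dat", "report.dat", "report.pdf", "setup.exe"):
    print(name,
          bool(dat_except_winmail.match(name)),
          bool(not_pdf_or_doc.match(name)))
```

A match means the file name would be blacklisted; no match means it is left alone.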

And the list goes on. You can learn more about lookaround assertions at regular-expressions.info and in the PCRE manual.

7 thoughts on “‘All, but these’ Regular Expressions”

  1. CFrankB

    I remember a couple of years ago in the ORF newsgroup someone posted that starting regexes with ^.*? or maybe at least .*? was much more efficient than just .*

    Anyone confirm?

  2. Peter

    @CFrankB: I do not see any reason why ^.*? would be more efficient. The ^ character matches at the beginning of the string, .* means “zero or more arbitrary characters”, ? means “zero or one of the preceding character/expression”. ^ certainly does not help here, because ORF already uses regex anchoring (implicit ^).

    I guess what it might have been is “non-capturing parentheses” (e.g. “(?:expression)”). This indeed speeds things up, because the regex engine does not have to remember so many details (http://www.regular-expressions.info/brackets.html). It is not exactly easy to calculate how efficient a regular expression is, and it also largely depends on the input data, but given how small the strings are that ORF typically applies regexes to, the difference is probably insignificant.

  3. Peter

    Oh yes, greedy expressions. Rich is probably correct, yet I would say these do not make much difference performance-wise, due to the small input data. When constructing keyword filters, however, you should watch out for these – for instance, a huge mailing list daily digest email could take significant processor resources to be evaluated against a regex.

  4. ChrisF

    Just come across this old post but it is relevant to what I am trying to do in regex.

    If I wanted a match on everything but .pl files, how would it be written? I’ve tried a few ways and always get errors on the regex validation.

    Thanks
