Using the Mac or iPhone's Built-In Regex Routines

Posted on Jan 30, 2010

In which a convenient method of using POSIX regular expressions from Objective-C is presented.

It’s a common complaint that the Mac and iPhone platforms don’t have native support for regular expressions, but that’s not entirely true. If you drop down to the UNIX core, there’s an implementation of the old (and only partially busted) POSIX regular expression interfaces. Here, I’ll show a simple Objective-C wrapper class for them that lets you use them conveniently in Mac or iPhone apps.

Before I start, some preemptive remarks: there’s a lot wrong with POSIX regexes to modern eyes.

Firstly, and most glaringly, they work on byte streams, and know nothing of characters beyond ASCII. That means that if you’re not careful about your string encoding, and what your regexes specify, you might end up mangling your strings pretty badly. If you don’t understand what UTF-8 is beyond that it’s a “text encoding”, go and read up on how it works, and how characters in it relate to bytes, before you use these routines. If you’re aware of this, they’re safe to use - just bear in mind that, for example, .{1,4} will match 1-4 bytes, not characters.

Secondly, they’re slow. Like, really slow. Even using precompiled regexes, they were around 500 times slower than Perl’s regex routines the last time I benchmarked. Some of this can be said to be the ‘fault’ of the POSIX standard, which specifies a more expressive regex concept than Perl’s, but a lot of it is just that the implementation Apple uses is crufty and old. There are better implementations out there. Anyway, suffice to say, you don’t want to be doing tons of text processing with these routines in performance-sensitive code.

Thirdly, their syntax is not the Perl-compatible syntax most of us are familiar with - you’ll need to think a bit differently to use them. In day-to-day use, this mostly just means that the character classes are specified differently (e.g. [[:digit:]] instead of \d). See man re_format for more details.
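To make the byte-versus-character caveat concrete, here’s a minimal sketch in plain C (the POSIX calls are C functions, so this compiles inside an Objective-C file too). The helper name matched_byte_count is made up for illustration, and the sketch assumes the process is in the default "C" locale, i.e. nothing has called setlocale.

```c
#include <regex.h>

/* Hypothetical helper: returns how many BYTES the first match of
   `pattern` covers in `text`, or -1 if the pattern does not compile
   or does not match. */
long matched_byte_count(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    long result = -1;

    /* REG_EXTENDED is needed for the {1,4} interval syntax. */
    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    if (regexec(&re, text, 1, m, 0) == 0) {
        /* rm_so and rm_eo are byte offsets into `text`. */
        result = (long)(m[0].rm_eo - m[0].rm_so);
    }
    regfree(&re);
    return result;
}
```

Calling matched_byte_count("^.{1,4}", "h\xC3\xA9llo") on the UTF-8 bytes of “héllo” should, under that locale assumption, report 4 bytes, which cover only the three characters “hél”, because the “é” occupies two bytes.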

Having said all that, why would you want to use these? Well, if you’re aware of the issues above, and all you want to do is use a couple of convenient regexes in your code - parse a couple of HTTP headers, or match a few strings now and again, say - these routines work just fine. Including the couple of ObjC files I present here is a lot lighter weight in code-size and complexity terms than including a whole regex library. I use it all over the place in Eucalyptus, and it does a sterling job.

So, here’s the interface:

@interface THRegex : NSObject
- (id)initWithPOSIXRegex:(NSString *)regexString;
- (id)initWithPOSIXRegex:(NSString *)regexString flags:(int)flags;
+ (id)regexWithPOSIXRegex:(NSString *)regexString;
+ (id)regexWithPOSIXRegex:(NSString *)regexString flags:(int)flags;
- (BOOL)matchString:(NSString *)string;
- (NSString *)match:(NSInteger)index;
@end

@interface NSString (THRegex)
- (THRegex *)matchPOSIXRegex:(NSString *)regexString;
- (THRegex *)matchPOSIXRegex:(NSString *)regexString flags:(int)flags;
- (NSString *)stringByEscapingPOSIXRegexCharacters;
@end
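The post doesn’t show how stringByEscapingPOSIXRegexCharacters is implemented, so the following is only a guess at the idea, written in plain C: put a backslash in front of every character that is special in a POSIX extended regex, so that arbitrary text can be matched literally. The function name escape_posix_ere and the exact set of escaped characters are assumptions, not taken from THRegex.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: returns a newly malloc'd copy of `literal` with
   each extended-regex special character preceded by a backslash.
   The caller frees the result. Returns NULL if allocation fails. */
char *escape_posix_ere(const char *literal)
{
    /* Characters that are special outside a bracket expression in
       POSIX extended regular expressions. */
    static const char special[] = ".[\\()*+?{|^$";
    size_t len = strlen(literal);
    char *out = malloc(2 * len + 1);   /* worst case: every char escaped */
    char *w = out;

    if (out == NULL) {
        return NULL;
    }
    for (const char *r = literal; *r != '\0'; r++) {
        if (strchr(special, *r) != NULL) {
            *w++ = '\\';
        }
        *w++ = *r;
    }
    *w = '\0';
    return out;
}
```

This only makes sense for patterns compiled with REG_EXTENDED; basic regexes treat some of these characters differently.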

It’s pretty easy to use. At its simplest, you can just do things like:

if([@"myString" matchPOSIXRegex:@"^.*Str.*$"]) {
    NSLog(@"Match!");
}

You can also pull out parenthesized matches (watch out - they start at 1; match 0 is the entire regex match):

NSString *contentRange = [myHTTPHeaders objectForKey:@"content-range"];
THRegex *contentStartRegex = [contentRange matchPOSIXRegex:@"^[[:space:]]*bytes[[:space:]]+([[:digit:]]+)" flags:REG_EXTENDED|REG_ICASE];
if(contentStartRegex) {
    NSString *contentsStartString = [contentStartRegex match:1];
    NSLog(@"Received Content-Range: %@", contentsStartString);
}
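Underneath, parenthesized matches come back from regexec as an array of byte offsets, with entry 0 covering the whole match. Here’s a small C sketch of pulling one out; the name copy_submatch, the MAX_GROUPS limit, and the buffer-based interface are all choices made for this example rather than anything from THRegex.

```c
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 8   /* arbitrary limit for this sketch */

/* Hypothetical helper: copies submatch number `group` of the first match
   of `pattern` in `text` into `out`. Returns 1 on success, 0 on any
   failure (bad pattern, no match, group did not participate, or the
   buffer is too small). */
int copy_submatch(const char *pattern, int flags, const char *text,
                  size_t group, char *out, size_t out_size)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    int ok = 0;

    if (group >= MAX_GROUPS || out_size == 0) {
        return 0;
    }
    if (regcomp(&re, pattern, flags) != 0) {
        return 0;
    }
    /* m[0] is the whole match; m[1] onwards are the parenthesized groups.
       Groups that did not take part in the match have rm_so == -1. */
    if (regexec(&re, text, MAX_GROUPS, m, 0) == 0 && m[group].rm_so != -1) {
        size_t len = (size_t)(m[group].rm_eo - m[group].rm_so);
        if (len < out_size) {
            memcpy(out, text + m[group].rm_so, len);
            out[len] = '\0';
            ok = 1;
        }
    }
    regfree(&re);
    return ok;
}
```

With the Content-Range pattern above and the flags REG_EXTENDED | REG_ICASE, asking for group 1 of the text "bytes 500-999/1234" yields "500".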

If you’re going to be doing a few matches at once against the same regex, you can also construct a THRegex object and use it directly, instead of the NSString category (contrived example follows - this is not the best way to get a list of filenames matching a pattern):

NSMutableArray *mySpreadsheetFiles = [NSMutableArray array];
THRegex *matchSpreadsheets = [[THRegex alloc] initWithPOSIXRegex:@".*\\.xls"];
for(NSString *filename in [[NSFileManager defaultManager] contentsOfDirectoryAtPath:myDocumentsPath error:NULL]) {
    if([matchSpreadsheets matchString:filename]) {
        [mySpreadsheetFiles addObject:filename];
    }
}
[matchSpreadsheets release];
NSLog(@".xls files: %@", mySpreadsheetFiles);

And that’s about it. One final implementation detail: because the POSIX routines are so slow at compiling regular expressions, my routines compile each regex the first time you use it and cache the result, so it’s available with no delay the next time. This makes sense if all the regexes you use are hard-coded - it only increases your memory usage by a small, fixed amount. If you don’t want this behaviour - say, because you’re generating regexes on the fly, or because users can enter regexes in your UI - you can turn the cache off with a THREGEX_DONT_CACHE compile-time define (e.g. by specifying THREGEX_DONT_CACHE in the “Preprocessor Macros” setting for your target in Xcode). I really don’t recommend letting users enter regexes, by the way: it’s a certainty that they will not know all the POSIX regex caveats I mentioned at the top of this post, and it’s unlikely they even know POSIX regex syntax.
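The compile-once-and-cache idea can be sketched in a few lines of C. This is not the THRegex cache, whose internals aren’t shown here; the fixed-size table, the names cached_regex and regex_cache_compile_count, and the lack of eviction or locking are all simplifications assumed for the example.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 16   /* arbitrary size for this sketch */

struct cache_entry {
    char *pattern;       /* owned copy of the pattern text */
    int flags;
    regex_t compiled;
};

static struct cache_entry cache[CACHE_SLOTS];
static size_t cache_used = 0;
static unsigned long compile_count = 0;

/* Returns how many times regcomp has actually succeeded for the cache. */
unsigned long regex_cache_compile_count(void)
{
    return compile_count;
}

/* Hypothetical helper: returns a compiled regex for (pattern, flags),
   compiling it only on first use. Returns NULL if the pattern is
   invalid, memory runs out, or the table is full. Not thread-safe. */
const regex_t *cached_regex(const char *pattern, int flags)
{
    struct cache_entry *e;
    size_t n;

    for (size_t i = 0; i < cache_used; i++) {
        if (cache[i].flags == flags && strcmp(cache[i].pattern, pattern) == 0) {
            return &cache[i].compiled;   /* cache hit: no recompilation */
        }
    }
    if (cache_used == CACHE_SLOTS) {
        return NULL;                     /* sketch: no eviction policy */
    }
    e = &cache[cache_used];
    n = strlen(pattern) + 1;
    e->pattern = malloc(n);
    if (e->pattern == NULL) {
        return NULL;
    }
    memcpy(e->pattern, pattern, n);
    if (regcomp(&e->compiled, pattern, flags) != 0) {
        free(e->pattern);
        return NULL;
    }
    e->flags = flags;
    compile_count++;
    cache_used++;
    return &e->compiled;
}
```

The cache is keyed on the flags as well as the pattern text, since the same pattern compiled with and without REG_ICASE behaves differently. With hard-coded patterns the table stops growing once each one has been seen; with generated or user-entered patterns it would fill up, which is the situation where switching a cache off makes sense.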

To find out more about POSIX regular expression syntax, see man re_format. man regex covers the flags (REG_EXTENDED, REG_ICASE etc.) that are also used here when constructing regex objects.

Here’s the code - it’s released under a simple BSD-style licence:

[Download: THRegex source code (zip archive)]