Current Tech Tip

Regular Expressions Part II
By: Kevin Finley



Last month, we received the following requirement:
...read an ID number from users or from a comma-separated data feed.  This ID number can be either an SSN or EIN.  They haven't asked yet; however, your users will likely ask you to accept a data field that reads “ID: X###### db” where db is a two-letter suffix indicating the database from which it came.  The origin database is important because the company sending this feed duplicates ID numbers between databases.

We also learned how to solve the first part using regular expressions: identifying a field that contains either an SSN or an EIN, depending on the placement of dashes.  This month we will learn to validate the field from the data feed as well, and then to extract the important pieces.


The validation expression looks much like what we learned last month:

db_regular_expression = ID: X[0-9]{6} [a-zA-Z]{2}

where this expression says we want to find the exact string (case-sensitive) “ID: X” followed by 6 digits followed by a single space followed by 2 upper- or lowercase letters.  Remember,
1. regular expressions are case-sensitive by default
2. [] create a character class for one character position
3. {} say to repeat the prior element a specific number of times; we used this to state “repeat the prior character class (digits) 6 times” and “repeat the prior character class (letters) 2 times”
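
To try the validation expression yourself, here is a minimal sketch in Python (our choice of language for illustration; the article's expression is not tied to any one library). The function name is_valid_db_id is hypothetical, and the sketch assumes the field has already been split out of the feed, so it uses re.fullmatch to require that the whole field matches.

```python
import re

# The validation expression from the text, unchanged.
db_regular_expression = r"ID: X[0-9]{6} [a-zA-Z]{2}"


def is_valid_db_id(field: str) -> bool:
    """Hypothetical helper: True only if the whole field matches."""
    # fullmatch requires the entire string to match, not just a prefix.
    return re.fullmatch(db_regular_expression, field) is not None


print(is_valid_db_id("ID: X123456 ab"))   # True
print(is_valid_db_id("ID: X12345 ab"))    # False: only 5 digits
print(is_valid_db_id("id: x123456 ab"))   # False: case-sensitive by default

# Case sensitivity can be switched off with a flag if the feed requires it.
relaxed = re.fullmatch(db_regular_expression, "id: x123456 ab", re.IGNORECASE)
print(relaxed is not None)                # True
```

Other libraries spell the "whole string" requirement differently, often with ^ and $ anchors around the expression.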

Now that we have valid input, how do we extract the important stuff to use in a program?  That's just as easy!  Modify the expression as follows:

db_regular_expression = ID: X([0-9]{6}) ([a-zA-Z]{2})

What changed?  We added some parentheses.  Parentheses define a capture group.  A capture group tells the regular expression engine to remember whatever text matched the part of the expression between the parentheses, so you can retrieve it if the input is valid.  Here is pseudocode to use this expression:

IF db_regular_expression.matches( input ) THEN
    id = db_regular_expression.group( 1 )
    db = db_regular_expression.group( 2 )
    CALL deal_with_id( db, id )
END IF
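
For comparison, the pseudocode above might look like this in Python. This is a sketch under assumptions, not the only way to do it: deal_with_id is a hypothetical stand-in that just joins its arguments, and process_field is a name invented for this example. Note that in Python the groups are read from the match object returned by fullmatch rather than from the expression itself.

```python
import re

# The capturing expression from the text.
db_regular_expression = re.compile(r"ID: X([0-9]{6}) ([a-zA-Z]{2})")


def deal_with_id(db: str, id_number: str) -> str:
    # Hypothetical stand-in for the real processing routine.
    return f"{db}:{id_number}"


def process_field(field: str):
    """Return the result of deal_with_id, or None if the field is invalid."""
    match = db_regular_expression.fullmatch(field)
    if match is None:
        return None
    id_number = match.group(1)  # the 6 digits
    db = match.group(2)         # the 2-letter database suffix
    return deal_with_id(db, id_number)


print(process_field("ID: X123456 ab"))  # ab:123456
print(process_field("ID: 123456 ab"))   # None (the X is missing)
```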

A few things to note:
Regular expression libraries come in many flavors, so not every expression works identically everywhere.
Capture group 0 is normally the entire matched text, which is the entire input string when the expression must match the whole input.
There are many books describing regular expressions.
You can become proficient with regular expressions in an hour, but it will probably take years to unlock the full power.
Regular expressions can be computationally expensive to construct (compile), which matters if you need to do it thousands of times per second.  Most languages and libraries provide a means to construct the state machine once and keep it around for use on future input strings.
Support is widespread:
  built into languages like Perl, Ruby, and JavaScript,
  available as a library (sometimes standard, sometimes downloadable) in C, C++, VB, Java, etc.,
  accessible from the command line using grep, egrep, sed, awk, etc., and
  supported in many editors such as vi, Emacs, Eclipse, Visual Studio, etc.
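
Two of the notes above, capture group 0 and building the expression once for reuse, can be sketched together in Python. The names DB_ID_PATTERN and extract_ids, and the sample feed lines, are invented for this illustration; the sketch also assumes the feed is simple enough to split on commas with no quoted fields.

```python
import re

# Compiled once when the module loads, then reused for every field.
DB_ID_PATTERN = re.compile(r"ID: X([0-9]{6}) ([a-zA-Z]{2})")


def extract_ids(feed_lines):
    """Yield (whole_match, id_number, db) for each valid field in the feed."""
    for line in feed_lines:
        for field in line.split(","):
            match = DB_ID_PATTERN.fullmatch(field.strip())
            if match:
                # group(0) is the entire matched text; 1 and 2 are the captures.
                yield match.group(0), match.group(1), match.group(2)


# Hypothetical feed data, made up for this example.
sample_feed = [
    "Smith,ID: X123456 ab,Cincinnati",
    "Jones,ID: X654321 CD,Dayton",
    "Brown,no id here,Columbus",
]

results = list(extract_ids(sample_feed))
for whole, id_number, db in results:
    print(whole, id_number, db)
```

Only the first two lines of the sample feed produce a result; the third has no field that matches.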


Kevin Finley is an independent consultant specializing in architecture integration including data models, relational databases, object databases, network communication, metadata, and parsing.  He can be reached at ashwood-blurbs@tomorrowenterprises.com.