How to Use Regular Expressions (2)
4. Complex Patterns

4.1 Line Boundaries
4.2 Word Boundaries
4.3 Alternatives
4.4 Special Character Groups and Classes
4.5 Overview and Summary

Ok, that was an easy start! But it wasn't very interesting, was it? If simple search patterns were all that "Regular Expressions" had to offer, they wouldn't be worth a tutorial. So, there has to be more! Okay, let's get going with the more complicated stuff:

4.1 Line Boundaries

Instead of having a regex look for text anywhere in the string, we can force it to search in specific parts of the string. These "anchored" patterns have their own metacharacters: ^ and $

The circumflex ^ means that the search pattern is anchored to the start of the line; the dollar $ means that the regex will look for the pattern at the end of a line. (Yes, dear experts, for now, let's take a string as one line. Ok?)

Example: "^give or take"

This pattern will only be matched if 'give' is at the beginning of a line and is followed by 'or take'. Or: "This is the end$" is only matched if it appears at the end of the line. It doesn't matter what comes first: 'This is the end' has to be the end of the line!

You can use these two metacharacters to speed up the regex. I admit, it is not all that important when you use regex in TB! because you won't be working with large amounts of data. But on the other hand: it can't hurt anyone ;-)

Why does the regex work faster if you use the circumflex or the dollar, you ask? Ok, let's use our example regex "^give or take" on the string 'Once upon a time': the regex machine checks whether the first thing it finds is the beginning of the line. This returns TRUE. Next it checks whether the following character is a 'g'. This returns FALSE, so the search is cancelled at once! Now what would have happened without the circumflex? The regex machine would have tried the pattern again starting at the second, third, fourth etc. character, only to find out that the search pattern doesn't exist in that string.
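If you would like to try the two anchors outside of TB!, here is a minimal sketch. It assumes Python's `re` module, which is not the regex engine TB! uses, so details may differ; the helper name `find` is made up for this illustration.

```python
import re

def find(pattern, text):
    """Return the matched text, or None if the pattern does not match."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

sentence = "You have to forgive or take the consequences!"

# Without an anchor the pattern is found anywhere in the string.
print(find(r"give or take", sentence))                         # give or take
# With ^ the pattern has to start at the beginning of the line.
print(find(r"^give or take", sentence))                        # None
print(find(r"^give or take", "give or take a few"))            # give or take
# With $ the pattern has to sit at the end of the line.
print(find(r"This is the end$", "My friend: This is the end")) # This is the end
print(find(r"This is the end$", "This is the end, my friend")) # None
```

The anchored variants return `None` as soon as the text at the required position does not fit, which is the behaviour described above.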
The longer the string, the more time the regex machine takes to fail ;-)

4.2 Word Boundaries

But there is more that regexian offers. Word boundaries! Some people forget about this because they think there is another way to define word boundaries. Believe me, there is, but it's nowhere near as easy as this!

"\b" makes the regex search for the pattern at word boundaries: "\bgive or take". Hey, we know this one, don't we? That is our first example again! That pattern was found in 'You have to forgive or take the consequences!', but now it won't be, thanks to the word boundary metacharacter.

I remember a discussion in one of the German TB-lists where someone asked why this metacharacter is necessary, because a word could be recognized by surrounding spaces. This is not a good idea: words could end at question marks, exclamation marks, a full stop.... A regex like "ain " would indeed match 'Again a good idea' but wouldn't find 'Oh no, not again.' You can avoid that when you use "\b" instead.

Of course, this metacharacter can be negated, as can the others: "\B" means that the regex should match everywhere in a string other than at word boundaries. Another example should explain this: "Re\B." The regex has to match the characters 'Re', provided they are not followed by a word boundary, and then any other character (the dot). Now, we have the string: 'Re: or Reply:'. Try it in the regex tester. What happens? The result is 'Rep'. Replace \B by \b and the regex matches 'Re:'. Everything clear now?

4.3 Alternatives

You remember the first example in this tutorial, "give or take"? When I introduced it I made the redundant remark that this regex wouldn't match 'give' OR 'take'.
Well, this remark wasn't really redundant: I needed something to start this chapter, some kind of transition.

To search for alternative patterns, regexian offers a special metacharacter: it is the vertical bar, maybe better known as the pipe symbol "|". So, what would have been necessary to search for 'give' or 'take'? "give|take". The regex checks whether it matches 'give'. If not, it checks the string for 'take'.

What happens if the string contains both alternatives? Well, to be honest, when I started with regex I was convinced that the first alternative in the regular expression would be matched. But no! The regex will match the alternative that comes first in the string! Let's get into details with an example: Given the regex "this|the|that" and the string 'the hand that signed this paper' (Ok, ok. You didn't really expect sample strings from Shakespeare or Yeats, did you?) What does the regex return? 'the' is the answer! Try it in the regex-tester!

You may combine alternatives as you have seen in the last example. Just have a look at the following: "^re:|^aw:|^fwd:". This means that in all three alternatives the regex has to match the beginning of the line first. Some characters follow and each alternative ends with a colon. Yep, you are right: there must be a way to simplify this one. And as in mathematics, you can use brackets to make the regex shorter: "^(re|aw|fwd):".

Well, those simplifications do not necessarily make it easier to read: "th(is|e|at)" would be a correct and simple alternative to the first example in this chapter but it is not exactly an easy-to-read example. ;-)

4.4 Special Character Groups and Classes

We have already introduced some of the special search patterns for groups and classes of characters. I would like to present some others with varying significance.

In almost every real regex you find the character class "\s".
It represents so-called whitespace characters, that is, any character which produces white space on the screen: space, tab, newline (line feed), carriage return. It's ok if you just remember that any void space in a string will be matched. And, of course, you may negate this pattern: "\S" matches any character that does not appear as white space in the string.

"\A" is a seldom used search pattern: it matches the beginning of the string. This is not the beginning of the line; no, to search for that we would have used ^. Later, when we talk about options like multiline, you will see where you can use this one.

"\Z" is related to "\A": \Z matches the end of the string, and again I can only say: "This is not the end of the line", because that would have been $. You will see the difference when we talk about options. Sorry, but you have to be patient :-)

4.5 Overview and Summary

This chapter explained some more possibilities for defining search patterns: line boundaries (^ and $), word boundaries (\b and \B), alternatives (| and brackets), whitespace (\s and \S) and string boundaries (\A and \Z).
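For the impatient, the difference between line and string boundaries from 4.4 can be previewed with a small sketch. It assumes Python's `re` module and its `re.MULTILINE` option, not the TB! engine; in particular, Python's \Z matches only at the very end of the string, while other engines may also accept a position before a final line break. The variable names are made up for this illustration.

```python
import re

text = "first line\nsecond line"

# \s matches any whitespace character, \S matches everything else.
whitespace = re.findall(r"\s", "a b\tc")
non_whitespace = re.findall(r"\S", "a b\tc")

# With the multiline option, ^ and $ match at every line ...
line_starts = re.findall(r"^\w+", text, re.MULTILINE)
line_ends = re.findall(r"\w+$", text, re.MULTILINE)
# ... while \A and \Z stay tied to the string as a whole.
string_start = re.findall(r"\A\w+", text, re.MULTILINE)
string_end = re.findall(r"\w+\Z", text, re.MULTILINE)

print(whitespace, non_whitespace)   # [' ', '\t'] ['a', 'b', 'c']
print(line_starts, string_start)    # ['first', 'second'] ['first']
print(line_ends, string_end)        # ['line', 'line'] ['line']
```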
Exercises:
the regex matches 'Ra:'. We expected that, didn't we? The regex matches the alternative which comes first in the string.

Ooops, the solution of the second exercise already looks quite professional, doesn't it: "^(Re|Re\[\d\]):" Ok, maybe you have something different; something that looks a bit simplified, like: "^Re(|\[\d\]):". It is a good example because the simplified version shows an absolute void as the first alternative in the brackets: the '|' symbol has nothing to the left of it other than the open bracket that starts the "sub-string".

Third exercise: "(Re|\)$)" is one solution. You didn't forget to escape the bracket, did you? Fine, well done *g*. Now, if you can, try this one in the regex-tester with the following string: 'Re[2]: bladibla (was: more bla)'. You will see that the regex matches just 'Re' because at this point the regex machine returned TRUE for the match. Only if the beginning of the string is changed to something else will the regex match the bracket.

Fourth exercise: the first pattern searches for any text that has the beginning of a line or that starts at the beginning of a line. This would include any text; even a void line would be matched. The second pattern just looks for a single x character that is alone in a line. Last, but not least, the third pattern: it searches for lines that have a beginning and an end, but nothing else: these are void lines!
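If you have no regex-tester at hand, the solutions to the second and third exercise can be checked with a short sketch like the one below. It assumes Python's `re` module instead of the TB! engine; the helper `first_match` and the sample subjects 'Re: hello', 'Re[2]: hello' and 'Fwd: bladibla (was: more bla)' are made up for this illustration.

```python
import re

def first_match(pattern, text):
    """Return the text of the leftmost match, or None if there is none."""
    match = re.search(pattern, text)
    return match.group(0) if match else None

# Second exercise: both variants accept a plain and a numbered 'Re'.
for pattern in (r"^(Re|Re\[\d\]):", r"^Re(|\[\d\]):"):
    print(first_match(pattern, "Re: hello"))      # Re:
    print(first_match(pattern, "Re[2]: hello"))   # Re[2]:

# Third exercise: 'Re' comes first in the string, so it wins.
third = r"(Re|\)$)"
print(first_match(third, "Re[2]: bladibla (was: more bla)"))   # Re
# With a different beginning, only the closing bracket is left to match.
print(first_match(third, "Fwd: bladibla (was: more bla)"))     # )
```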