How to Use Regular Expressions (3)
5. Special Elements - Part 1

Everything we've had so far hasn't been too difficult. But this chapter is heavy stuff. Please, do me a favour: read this chapter carefully. Be patient! Try everything with the regex tester and get familiar with the elements in this chapter: they are essential for creating proper regex. Although this may be a bit more complicated than the chapters before, it is certainly more interesting ;-)

5.1 Quantifier

We already know how to define patterns for matching single characters, groups of characters, character classes or ranges of characters. We can use alternatives in our search patterns. But something of absolutely vital interest is missing - the ability to define repetitions. You remember the example that was a regex to search for the European formatted date: "\d\d\.\d\d\.\d\d\d\d" For every single digit we wrote "\d". Isn't there another way, much simpler than repeating the metacharacter as often as the regex wants to find the character? Yes, there is! There are quantifiers!

+ * ? are the most important quantifiers.

The "+"-character means that the character preceding the plus-sign has to appear at least once at the specific point of the string. "fo+l" matches 'fool', 'fol' and 'foooool'. "Re:\s+", for example, means that at least one whitespace character has to follow 'Re:' to be matched. I hear some of you experts: yes, the usage of quantifiers is not restricted to characters. You can use them to repeat metacharacters, character classes and some other elements we are yet to learn. ;-)

The star "*" represents any number of occurrences of the preceding character at the specific point in the string. 'Any' really means 'any', even if the character doesn't appear at all. Ooops, what's the use of that? Well, let's have a look at the following example: "Re:\s*\w+" Huh, that already looks as cryptic as those regex the experts use <g>. What does this regex mean? Search for a 'Re' followed by a colon.
Then any number of whitespace characters may appear - even no spaces at all. What for? In proper subject lines there should be a space. But imagine we would like to match any subject string even if someone modified it manually and deleted the space. We have to tell the regex that there might or might not be a space. Either way, both possibilities should be found. This can be done with the star as quantifier. Finally, there has to be at least one alphanumeric character.

Caution: the meaning of this quantifier is sometimes misinterpreted. Look at the following task: a regex has to be defined that matches only those lines of a string that contain nothing but digits. One solution I saw was: "^[0-9]*$" But this regex matches empty lines as well; the star stands for 'no digit' as well as for 'any number of digits'. So the regex machine returns TRUE when no digit is in a line. If you want to make sure that there is at least one digit in a line you have to use the plus-sign: "^[0-9]+$".

The question mark means that the preceding character may appear once or not at all at the specific point of the string. It is a bit like the star, only that the maximum number of occurrences is 1. "h..?s" matches 'hers', 'hips' and 'his' or 'has'. Within 'house' it matches 'hous'; within 'hose' it matches 'hos'.

There is another way to define repetitions: "{x,y}" This is a way to explicitly define how many repetitions of the preceding character you want. In this formula 'x' denotes the minimum number and 'y' the maximum number of repetitions of the preceding character. "\d{2,4}" means that only two to four digits in a row are matched. If you omit the second number 'y' but leave the comma in the curly brackets "{x,}", then there is no upper limit and the minimum is x times the preceding character. "\w{3,}" matches any string with at least three word-characters. If you omit not only the second number but the comma as well "{x}", then this means the exact number of appearances of the preceding character. "\d{6}" matches exactly six digits.

This quantifier gives us a new way to write our regex that matches European formatted dates: "\d{2}\.\d{2}\.\d{4}"

The three quantifiers I introduced at the beginning of this chapter are simply special ways to write one of the following regex:

{0,1} = ?
{1,} = +
{0,} = *

Before I can tell you more about quantifiers and what has to be kept in mind when using them, I have to introduce parentheses (round brackets) as a grouping device.

5.2 Grouping of Elements, Subpattern and Quantifiers again

Grouping of Elements

In the chapter about alternatives, the parentheses crossed our path for the first time. They were used as they are in maths: common parts of the pattern are written outside the round brackets. Now we will learn something new: we can use the parentheses to group parts of the regex to be dealt with as a single element of the pattern. A following quantifier is applied to the grouped part of the regex. E.g.: "foo(bar)?" matches 'foo' and 'foobar'.

Another example: "Re\s*(\[\d+\])?:" There it is again, the reply counter in a subject line. This time it already looks quite professional. First of all we look for 'Re'. After any number of whitespace characters (or none at all), digits in square brackets may follow. This part is grouped. Finally there has to be a colon.

Let's have a closer look at the regex: why is it defined in that way? First the whitespace: we don't know whether the author of the subject line inadvertently added one or more spaces after the 'Re'. Even if he did nothing and left the string untouched we want the regex to match the string. Well, I agree, there shouldn't be any space, but you never know … ;-) That's why we use "\s*" at this point. Then the digits in square brackets: we allow any number of digits in the square brackets by using the plus-sign as quantifier. But there has to be at least one digit!
Because there is no upper limit for the number of digits, the way to infinity is free. Finally the counter '[#]' itself: this part is grouped. This element need not appear in the string to result in a successful match. That is why we use the question mark.

The regex therefore will match:

'Re:'
'Re [1]:'
'Re[123]:'

It will not match 'Re[]:'. Something to think about and to try on your own: what has to be changed so that the regex matches this one as well? Ok, here is the solution: replace the '+'-sign in the square brackets with a star: "Re\s*(\[\d*\])?:"

Within 'Re [1]: [3]:' it matches 'Re [1]:'. It does not match the second reply counter. Ok, if we want to find such awful subject lines we have to work on our regex a bit more: it should match any number of counters that may have colons and - you never know - that may or may not be followed by spaces. Finally, the last element has to be at least one colon: "Re\s*(\[\d+\]:*\s*)*:+"

Well, it is possible for a subject to begin like that, although there is only a small probability that it will really happen. I can envisage many combinations of reply counters. The regex does not match all of them. If you want to have the regex match other combinations, go ahead, try it! Test it with a regex of your own making, but there is one major point you should keep in mind: there is no perfect regex. The more you try to improve the regex to match even more possibilities and combinations of characters, the more complicated the result will be. You will have to pay for this kind of perfectionism: either you won't be able to read your regex anymore or the regex will become buggy whenever you make even the smallest change to it. It is easier to live with some erroneous matches and to sort them out manually than to create the perfect regex. Jeffrey Friedl published a regex to match email-addresses in "Mastering Regular Expressions": it is more than 6000 bytes. It was a good example of being too perfect, as he stated.
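The reply-counter patterns from this section can also be tried outside the regex tester. The following is only a minimal sketch, assuming Python's re module as a stand-in for the tester; the helper name matched_text and the constant names are made up for this illustration and do not come from the original text.

```python
import re

# The three reply-counter patterns from the text (Python's re module is an
# assumption here; the tutorial itself uses a regex tester).
OPTIONAL_COUNTER = re.compile(r"Re\s*(\[\d+\])?:")      # counter needs >= 1 digit
EMPTY_COUNTER_OK = re.compile(r"Re\s*(\[\d*\])?:")      # counter may be '[]'
MANY_COUNTERS = re.compile(r"Re\s*(\[\d+\]:*\s*)*:+")   # several counters in a row


def matched_text(pattern, subject):
    """Return the part of subject that the pattern matches, or None."""
    found = pattern.search(subject)
    return found.group(0) if found else None


if __name__ == "__main__":
    subjects = ("Re:", "Re [1]:", "Re[123]:", "Re[]:", "Re [1]: [3]:")
    patterns = (
        ("optional counter", OPTIONAL_COUNTER),
        ("empty counter ok", EMPTY_COUNTER_OK),
        ("many counters", MANY_COUNTERS),
    )
    for subject in subjects:
        for name, pattern in patterns:
            print(f"{subject!r:16} {name:18} -> {matched_text(pattern, subject)!r}")
```

Running the script prints, for every sample subject, which part each pattern picks up, so the differences between the plus-sign, the star and the repeated group become visible side by side.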
Ok, back to the job in hand: let's have another example of how to group elements. We had a pattern to match European formatted dates: "\d{2}\.\d{2}\.\d{4}" As you can see, the beginning "\d{2}\." is repeated. Right, so this can be simplified: "(\d{2}\.){2}\d{4}" The first part, now grouped in parentheses, has to appear twice. This is for example '01.02.'. This is not an optimal version of the search pattern: day and month numbers still have to be two-digit numbers and silly values for both are still allowed. But wait; you will get your chance. Let us learn some more elements before you are given the job of optimising the pattern in an exercise <g>.

Subpatterns

Grouping with parentheses has another effect in regexian that is widely used in a lot of regular expressions in TB. Characters that were found due to a grouped pattern or element are stored in a temporary variable for further use. Such a variable is known as a subpattern (SubPatt in TB). We should have a look at an example to help us understand that: 'bill.doors@macrohard.com' We use the regex "(\w+)\.(\w+)@.*". The first pair of parentheses matches 'bill', the second one 'doors'. These two are now stored in subpattern 1 and subpattern 2 respectively. Or: "(\d+\.)(\d+\.)" When the string is '22.05.' then '22.' is stored in subpattern 1 and '05.' in subpattern 2.

How do I find out which is the first subpattern? Well, in our simple examples it is obvious: everything that is matched by the first pair of round brackets goes to subpattern 1, the second pair returns subpattern 2, etc. But what if the regex looks like: "Re\s*(\[(\d+)\])*:" The part that is enclosed by the first opening bracket and its corresponding closing bracket is stored in subpattern 1. The part that is enclosed by the second pair starting at the second opening bracket is stored in subpattern 2.
With 'Re [4]:' our example would result in:

Subpattern 1 = '[4]'
Subpattern 2 = '4'

Important: each opening bracket creates a new variable or subpattern.

What does the regex-machine store in a subpattern when a quantifier is applied to a grouped element? Example: "(\d{2}\.){2}\d{4}" If the string is '23.05.2002' the first pattern is matched at '23.'. Now the regex machine goes on to find the same pattern in the string a second time. If successful, the matched characters are stored in the same subpattern. In other words: the second match overwrites the first one. In our example the subpattern will show '05.'.

The regex-tester shows the contents of each subpattern: for every subpattern it will offer another tab panel. The one with '0' on it shows the whole match, while the one with '1' on it shows the match of the first subpattern, etc.

And Quantifiers again

Ok, now let's move on to some special behaviour relating to quantifiers. Some of them have a 'human' peculiarity: they are greedy! You don't believe that? Well, look at the following string <g>: "The abbreviation 'ISP' stands for 'Internet Service Provider'." We want a regex that finds the text that is enclosed by inverted commas and stores it in a subpattern: "(.*)'(.*)'.*" Nothing difficult really: find everything that comes before an inverted comma, then everything in between and finally everything that follows…

And? Did you try it on the regex-tester? What is in subpattern 2? "Internet Service Provider". Ooops, I expected "ISP" because it comes first in the string. :-o It is quite obvious that the first group (.*) greedily matched most of the string and left only what was absolutely necessary for subpattern 2 and the rest of the regex to match. Furthermore, the last element ".*" in the regex allows 'nothing' to follow. Keep this in mind: this part leads to a successful match even if nothing is left to be matched. The star stands for as many appearances as there are or none at all!
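The subpattern numbering and the greedy star described above can be reproduced with a short sketch. It assumes Python's re module instead of the regex tester, where group(0) plays the role of the '0' tab and group(1), group(2), ... the role of the subpattern tabs; the helper name subpatterns is hypothetical.

```python
import re


def subpatterns(pattern, text):
    """Return [whole match, subpattern 1, subpattern 2, ...] or None."""
    found = re.search(pattern, text)
    if found is None:
        return None
    return [found.group(0), *found.groups()]


SENTENCE = "The abbreviation 'ISP' stands for 'Internet Service Provider'."

if __name__ == "__main__":
    # Two groups side by side: numbered from left to right.
    print(subpatterns(r"(\w+)\.(\w+)@.*", "bill.doors@macrohard.com"))

    # Nested groups: numbered by the position of the opening parenthesis.
    print(subpatterns(r"Re\s*(\[(\d+)\])*:", "Re [4]:"))

    # A quantified group keeps only its last repetition ('05.', not '23.').
    print(subpatterns(r"(\d{2}\.){2}\d{4}", "23.05.2002"))

    # The greedy first group swallows everything up to the second quoted text.
    print(subpatterns(r"(.*)'(.*)'.*", SENTENCE))
```

The last call shows the greedy behaviour directly: subpattern 1 ends just before the third inverted comma, so subpattern 2 holds 'Internet Service Provider' rather than 'ISP'.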
Ok, here's another example: we want to extract as many parts of an email-address as possible. We've already got a solution for the first part, the name; but that wasn't a good one because it only allowed word characters. We have to make this more generic. Let's take (.*) for the first part. The second part is some text delimited by a dot. But this may appear more than once before the @-sign ends the name section. The regex should therefore find the following examples of addresses:

'1234abc@mail.com'
'1234.abc@mail.com'
'12-34.abc.def@mail.com'

So, the regex starts with "(.*)\.?(.*)*@". After that any text may follow, possibly delimited by more dots. We will ignore this for the example and go for extracting only the text that comes after the last dot, so that the regex does not get too complicated. This should be done with "(.*)\.(.*)". Altogether:

"(.*)\.(.*)*@(.*)\.(.*)"

What do we expect in the subpatterns when the string is '12-34.abc.def@mail.com'?

Subpattern 1 = '12-34' ?
Subpattern 2 = '.abc' or '.def' or 'abc.def' ?
Subpattern 3 = 'mail' ?
Subpattern 4 = 'com' ?

Ask the regex-tester:

Subpattern 1 = '12-34.abc'
Subpattern 2 = 'def'
Subpattern 3 = 'mail'
Subpattern 4 = 'com'

Subpattern 1 contains almost all of the first part, subpattern 2 only the last three characters before the @. Of course, we expected that, didn't we? We already know that the star is greedy: it stored as many characters as it could into the first subpattern. Caution: not only stars, I mean star-signs are mean and greedy.

Let's take another string to test the regex: '12-34.abc.def@mail.test.com'. Now the star in the third pair of parentheses "(.*)" is greedy and 'eats' almost everything after the @ up to the last dot, storing 'mail.test' and not 'mail'.

How can we avoid that? We are going to learn another meaning of the question mark (Calm down, this is only the second one.
There are many more to come and you will eventually come to understand why a regex is full of these funny question marks *g*): just add a question mark to the greedy pattern and you make the pattern less greedy. Let's do that. We add a ?-sign to the first pattern: "(.*?)\.(.*)@(.*)\.(.*)"

Subpattern 1 = '12-34'
Subpattern 2 = 'abc.def'
Subpattern 3 = 'mail.test'
Subpattern 4 = 'com'

For a better understanding I shall try to explain what the regex-machine does. A greedy (.*) first grabs as much as possible and then steps back one character at a time until the rest of the regex matches. The question mark in (.*?) turns this around: the machine first stores as little as possible into the subpattern, which is nothing at all, and then extends it one character at a time until the rest of the regex leads to a successful match.

I'm going to explain it using our example regex "(.*?)\.(.*)*" and the string '12-34.abc.def'. The regex machine starts with an empty first subpattern and checks whether a dot follows. No, the next character is '1'. So it adds the '1' to the subpattern and checks again. Still no hit. This goes on until '12-34' is stored: now the next character is the first dot, "\." matches, and the rest of the string is left to "(.*)*". Without the question mark the machine would have started at the other end: it would have stored the whole string in the first subpattern and stepped back character by character until it reached the last dot, leaving '12-34.abc' in subpattern 1. But I reckon we've looked deep enough into the way it works for now.

Back to our first example where we wanted to match text between inverted commas. The regex was "(.*)'(.*)'.*" and the text "The abbreviation 'ISP' stands for 'Internet Service Provider'."
Let's alter the regex to "(.*?)'(.*?)'.*" Both grouped elements need a question mark, otherwise "ISP' stands for 'Internet Service Provider" would be stored in the second subpattern. Adding a question mark to the second element alone wouldn't help either: the first (.*) remains greedy, and subpattern 2 would again contain 'Internet Service Provider'.

5.3 Overview and Summary

This was quite a difficult section. Not only for you to read and understand. No, it was even difficult to write the text, from which I hope you got some idea. This section covers one of the basic elements of regexian that you will need in every regex. The following elements were presented:

· Characters that repeat preceding characters are called quantifiers:
+ the preceding character must appear at least once
? the preceding character may appear once or never
* the preceding character may appear any number of times or never

There are quantifiers that allow you to define exact ranges for the frequency of the preceding character:
2: You've got the solution for question 1? Ok, that solution is quite interesting but now we can try to write an improved regex for matching European formatted dates. If possible we would like to allow only combinations of digits that look like a terrestrial date. Well, we do not want to exaggerate: it's ok if the regex matches February 29th (29.02.) even if it isn't a leap year ;-). The only important points are: it should be in the format DD.MM.YYYY or D.M.YY or any combination and it should be restricted to dates that exist.

3: Imagine you receive bug-reports via an on-line system. The reports are standardized and all have the same format (more or less). We need a regex that extracts the more important information. The reports look like:

Sender: firstname.lastname@agency.com
Date: DD.MM.YYYY
Report-no.: xyz123

Please try to define a regex that extracts the following parts into subpatterns: first name, last name, agency, date, report-no.

4: Write a regex that matches the time in the form hh:mm:ss. Make sure that only valid combinations are returned.

Problem 1: "\d{1,2}\.\d{1,2}\.(\d{4}|\d{2})" You created something else? Doesn't matter, it may be a correct solution: there is often more than one way to do it! "(\d?\d\.){2}(\d{4}|\d{2})" is in my opinion an elegant solution. A not so good idea is something like "\d{2,4}" for matching the year: it allows three-digit years.

Problem 2: This is a bit tricky. In these cases I like to divide the problem into smaller chunks. Which days are possible?

a) 01-09, the leading zero could be missing.
b) 10-29, all months of a year have at least 29 days. Ok, there is one error we are allowed to make: February only has 29 days in leap years. We will assume this is ok, otherwise it might be almost impossible to create the regex.
c) 30, all months except February.
d) 31, only January, March, May, July, August, October, December.

Possible numbers for months are 01-09 (the leading zero might be missing) and 10, 11, 12.
We want to allow two- or four-digit years. In the case of four-digit years we only accept those that start with 19 or 20. Ok, now we have what we need. Let's start:

Cases a) and b) combined with the allowed months give us: "(0?[1-9]|[12][0-9])\.(0?[1-9]|1[0-2])\."
Case c) with all months except February: "30\.((0?[13-9])|(1[0-2]))\."
And finally case d) with its possible months: "31\.(0?[13578]|1[02])\."
Now the years: "(\d{2}|(19|20)\d{2})"

The first three parts have to be alternatives whereas the pattern for years is mandatory. To prevent the regex from matching within a longer sequence of digits and finding something that only looks like a date, we enclose the whole regex in \b metacharacters. That should give

"\b(((0?[1-9]|[12][0-9])\.(0?[1-9]|1[0-2])\.)|(30\.((0?[13-9])|(1[0-2]))\.)|(31\.(0?[13578]|1[02])\.))(\d{2}|(19|20)\d{2})\b"

Note: the regex must be used as a single long line, without any spaces or line breaks!

Incredible: that's a cracker! You found something different? Even something better? Well, I think that is 'normal'. You can always write a regex in another way to give the same result. And of course: you can improve almost every regex. My regex only shows one way to approach the problem: the way I like to do it. I hope you were able to follow my thinking.

Problem 3: This is not very difficult. Again, we divide the whole problem into chunks. First name and last name can be extracted from the mail-address. "Sender:\s*(.*?)\.(.*?)@(.*?)\.\w+\s*" should be sufficient. The question mark in the second subpattern might be redundant because the @-character follows anyway. But it won't hurt anyone, will it? Date: phew, we are in luck. The format is fixed. We don't have to use the killer regex of problem 2 ;-): "Date:\s*((\d{1,2}\.){2}\d{4})\s*" And now the report number: "Report-no.:\s*(.*)" To make sure that the regex checks the whole string we add \A at the beginning and \Z at the end:

"\ASender:\s*(.*?)\.(.*?)@(.*?)\.\w+\s*Date:\s*((\d{1,2}\.){2}\d{4})\s*Report-no.:\s*(.*)\Z"

Note: the regex must be used as a single long line, without any spaces or line breaks!

Subpatterns 1, 2, 3, 4 and 6 will contain the information we wanted. Subpattern 5 belongs to the repeated group inside the date and only holds its last repetition, the month with its dot.

Problem 4: I think we have already had some practice at dividing bigger problems into smaller ones. The time-problem is another one. It should be mere routine now. And it is much easier than it looks at first sight, because the format is fixed! Hours are from 00 to 19 and 20 to 23 (24 equals 00!!): "([01][0-9]|2[0-3]):" Minutes and seconds have the same format and the same combinations of digits, 00 to 59: "[0-5][0-9]" Altogether, enclosed by word boundary (\b) metacharacters: "\b([01][0-9]|2[0-3]):[0-5][0-9]:[0-5][0-9]\b"