How to Use Regular Expressions (5)
7. How to use Regular Expressions in TB

Finally, we can try to use our new language in TB. First of all, we have to know which tools are available for working with regular expressions. These tools are TB's macros.
7.1 Macros

Not all of TB's macros support the use of regex. Most of the macros have nothing to do with regex themselves, but you can apply a regex to the text they provide in order to extract or modify information. That is one of the features that makes TB so powerful.

The first macro we will look at is:

%REGEXPTEXT="regex"

What does it do? It searches for the pattern "regex" within the original text of a mail and returns the matched characters. The syntax is quite straightforward; look at the following example:

%REGEXPTEXT="[\d\.]+"

Used in a quick template and applied to a mail, this macro returns the first sequence of digits and dots it finds in the text.

Let's have a look at a fairly similar macro:

%REGEXPQUOTES="regex"

This macro does exactly the same as the first one, except that the returned text is not plain text but quoted text.

That was nice and easy. But when it comes to extracting text from the header of a mail (kludges) or from address book entries, we need to combine some macros.

The first one we need is %SETPATTREGEXP. It defines the search pattern: %SETPATTREGEXP="regex", where "regex" is the regular expression you created to match the text.

The second one is %REGEXPMATCH. Again, it is easily defined: %REGEXPMATCH="string", with "string" being any text. It can be a template, which means that any generic text can be used, so almost any TB macro can provide the text here.

A regex defined through %SETPATTREGEXP remains valid until it is overwritten by another %SETPATTREGEXP. This means you can use the same pattern on several different generic texts in one go.

Before we look at another example, I have to correct something. Did I say earlier in this chapter that the syntax is quite straightforward? Well, that's true as long as one only looks at one macro. But let's see how this changes when we let a macro parse some text. We already know the macro %REGEXPQUOTES. The same result could be written in a different way. Let's assume that we receive mails from a feedback form.
Part of the content is "newsletter: yes" or "newsletter: no". We would like to create an autoresponder that uses exactly this information in a reply template, for example: "Thank you for filling out our feedback form. You entered 'newsletter: yes/no'. Are you sure?" You can create more sophisticated text and a better filter to use different templates for the reply, but for the moment let's stick to this example ;-).

The macro %QUOTES defines what text is to be used as quoted text in a reply. The only problem is that we have to tell %QUOTES which text should be used. After that we can copy it to the reply template, add our standard text and save it.

OK, first the regex: "^newsletter:\s*(yes|no)". It has to be defined with %SETPATTREGEXP="^newsletter:\s*(yes|no)". We already know that %REGEXPMATCH applies the search pattern to any generic text, so we need a macro that provides the original text of the mail, and that is %TEXT.

Now we have to put it all together and create a template that uses the macros in the correct order. The only thing that makes these macros difficult to use is the "-characters that serve as delimiters for the definition part. In %SETPATTREGEXP the search pattern is defined between them, and in %QUOTES the text that will be inserted as quoted text. Once you start to combine macros, you have to tell TB which "-character is the delimiter of which macro: the first macro must know whether the second "-character is its own end or the beginning of the second macro's definition. The same applies at the end of the second macro, and so on. This can be achieved by doubling the "-character (escaping) or by using different delimiters. Schematically, this looks like:

%M1="%M2=""Def2""%M3=""Def3"""

This gets confusing and hard to follow, so we could instead write:

%M1="%M2='Def2'%M3='Def3'"
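If you want to check the newsletter regex on its own before wrestling with the delimiters, you can try it outside TB. What follows is only a sketch in Python, whose re module is used here as a stand-in for TB's regex engine. The function name and the sample mail are made up for illustration, and the re.MULTILINE flag is an assumption so that ^ matches at the start of any line of the mail, not only at its very beginning.

```python
import re

# Pattern taken from the text. re.MULTILINE is an assumption: it lets ^
# match at the start of every line of the mail body.
NEWSLETTER = re.compile(r"^newsletter:\s*(yes|no)", re.MULTILINE)

def newsletter_line(mail_text):
    """Return the whole matched text, or None if the mail has no such line.

    Hypothetical helper: it plays the role that %REGEXPMATCH plays in TB.
    """
    match = NEWSLETTER.search(mail_text)
    return match.group(0) if match else None

# Made-up mail body as a feedback form might produce it.
sample = "name: Jane Doe\nnewsletter: yes\ncomment: none\n"
print(newsletter_line(sample))
```

For the sample mail this prints the line "newsletter: yes". Inside TB the same pattern goes into %SETPATTREGEXP, wrapped in the delimiters described before.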
With the alternate delimiters, the newsletter example looks like:

%QUOTES="%SETPATTREGEXP='^newsletter:\s*(yes|no)'%REGEXPMATCH='%TEXT'"

This example could be written in a simpler way, %REGEXPQUOTES="^newsletter:\s*(yes|no)", but only because the text we searched was the original text of the mail, provided by %TEXT.

Next comes a macro combination that allows the extraction of several parts of the text. We know that we can define subpatterns in the regex by grouping sections with parentheses. We must now find a way to address them within TB. TB provides a macro for this: %REGEXPBLINDMATCH="string". It applies the pattern to the string but does not return anything itself. After all, we want to extract parts of the text, not the whole match. So we still need a macro that lets us say which of the subpatterns is to be used. This is %SUBPATT="n", where 'n' denotes the n-th subpattern in the regex.

This combination will be quite difficult to read and understand, so I will explain it using an example and build the whole macro combination bit by bit. After that I will combine everything.

From the original date of a mail we want to extract the year, two digits only, and use it as quoted text. The date is provided by %ODATE. The regex is "\d{2}(\d{2})\b". That means we want to extract two digits only if they are preceded by two digits and followed by a word boundary. Thus the first macro is %SETPATTREGEXP="\d{2}(\d{2})\b". The text in which the date is searched is defined using %REGEXPBLINDMATCH="%ODATE". We are looking for the first subpattern, so %SUBPATT="1". Now we put it all together, not forgetting to use the alternate '-characters:

%QUOTES="%SETPATTREGEXP='\d{2}(\d{2})\b'%-
%REGEXPBLINDMATCH='%ODATE'%SUBPATT='1'"

Note: the regex is split using the %- macro and can be entered as two lines!

Another example? There is a regex for reply templates that modifies the name of the recipient. Instead of 'Gerd Ewald' we would like to have 'Gerd Ewald at TBUDL…..'
Well, we could download this regex somewhere, but let us try to create it ourselves. %OFROMNAME will give us the name. The reply address is given by %OREPLYADDR. We will extract the list's name with a regex. Usually the name of the list precedes the @-character:

%SETPATTREGEXP="(.*?)\@"

This is used in combination with

%REGEXPBLINDMATCH="%OREPLYADDR"

of which we only want subpattern one:

%SUBPATT="1"

The result then becomes the contents of the TO field. Watch out: before you can enter text, this field has to be cleared. This is done by an initial assignment which is void.

%TO=""%TO='"%OFROMNAME at %-
%SETPATTREGEXP=_(.*?)\@_%-
%REGEXPBLINDMATCH=_%OREPLYADDR_%-
%SUBPATT=_1_" <%OREPLYADDR>'

Note 1: the regex is split using the %- macro and can be entered as seen!

Note 2: the regex makes use of a feature of recent versions of TB where any character may be used as a quoting delimiter, in this case the underscore and the single quote as well as the double quote. Users of earlier versions will have to resort to the clumsier doubled-delimiter syntax.

The original reply address has to be added at the end, enclosed in "<>"-characters. As you can see, the syntax is quite easy and stereotypical. The only difficult thing is to find out which macro provides the necessary information and how to extract it with the regex.

Here's another example, which is available on the FAQ page:

%WRAPPED='Historians believe that on %ODATE%-
%SETPATTREGEXP="(?m-s)Date\:\s*?((.*?[\d]{4})\s*?([\d]{0,2}\:%-
[\d]{0,2}\:[\d]{0,2})\s*?(.*))"%-
%REGEXPBLINDMATCH="%HEADERS" , at %SUBPATT="3"[GMT%SUBPATT="4"]%-
(which was %OTIME where I live) you wrote:'%-

Here, once again, the %- macro is used to make the whole combination easier to read. It has no special meaning except that it tells TB to treat the following line as a continuation of the current one. %WRAPPED means that the result of the macro combination will be word-wrapped at the column defined in TB. What does the macro do?
The first part, "%WRAPPED='Historians believe that on %ODATE%-", is just a lead-in: on every reply the date of the original mail is appended to the text 'Historians believe that on '. The second part contains the regex, which is much more interesting to us (I deleted the %- macro to show the regex in one line):

"(?m-s)Date\:\s*?((.*?[\d]{4})\s*?([\d]{0,2}\:[\d]{0,2}\:[\d]{0,2})\s*?(.*))"

The multiline option is switched on and DotAll is switched off: (?m-s). Then the regex looks for 'Date:', which may be followed by any amount of whitespace. Because the star is greedy, a question mark follows it. The author escaped the colon with a backslash, which isn't necessary. I don't know why he did that, but it won't cause problems, so we'll leave it alone.

Now the first parenthesis follows. There is no need to group this part; I assume it is done for easier reading. You may delete it, but bear in mind that the numbering of the subpatterns then changes. The second parenthesis matches anything up to and including four digits. We know that the regex will look in the kludges (%HEADERS) for the date, so we guess that the author is looking for something like the year. This may be followed by whitespace.

Now we come to the third parenthesis. This is the one the author needs. It searches for three numbers with zero, one or two digits each, separated by colons. That is obviously the time. Whitespace may follow, and the fourth subpattern matches the rest of the line, which is no more than the GMT information.

A closer look at the regex shows that it is applied to the header lines and that only subpatterns three and four are really needed. The result could be:

'Historians believe that on Sunday, 7. April 2002 , at 11:22:59[GMT +0200](which was 11:22 where I live) you wrote:'

It works, although the layout could do with a bit of DIY.

7.2 Other Ways to Use Regular Expressions in TB

There are other possibilities for using regex in TB than macros.
One example is the text search in the mail editor. The special features that regex offer make it especially useful for finding strings in long mails.

(Picture: in the PDF and online versions only.)

This window may be opened with Ctrl-F or using the 'Edit Find' menu entry in the mail editor. Just enter the regex in the text line. Don't forget to check the 'regular expressions' box in the Options section.

In almost the same way, you can search the mails stored in your folders using regex. Just press F7 while in folder view. This opens a search window, which offers the facility to search for text in mails. In the 'Options' tab panel you can enter the regex in the 'Search for' field.

(Picture: in the PDF and online versions only.)

Go to the 'Advanced' tab panel and check "Regular Expressions".

(Picture: in the PDF and online versions only.)

You can also use regex in filter conditions to optimise the organisation of your inbox. This is a field where regex are as efficient as in macros. Go to the 'Account, Sorting Office/Filters' menu item. Open the filter definition, go to the 'Options' tab panel and check 'Regular Expressions'.

(Picture: in the PDF and online versions only.)

7.3 Overview and Summary

What did we learn in this final chapter? There are several ways to use regular expressions in TB: in macros, in the text search of the mail editor, in the search across folders, and in filter conditions.
Exercise 1: You remember the regex we wrote to clean the subject line?

"^Re(.*?):\s*(.*?)\s*(\(was:.*\))*$"

Try to improve this one: instead of '(was:xyz)', PGP users will find '(PGP Decrypted)'. The regex should find these kinds of subject as well. Furthermore, the regex should be usable within a reply template.

Exercise 2: In section 7.1 I described a macro that modifies the TO address for mailing lists:

%TO=""%TO='"%OFROMNAME at %-
%SETPATTREGEXP=_(.*?)\@_%-
%REGEXPBLINDMATCH=_%OREPLYADDR_%-
%SUBPATT=_1_" <%OREPLYADDR>'

Try to change it in such a way that %REGEXPBLINDMATCH and %SUBPATT are no longer necessary and %REGEXPMATCH is used instead. You will need to modify the regex.

Hint wanted? OK: the subpattern was created because otherwise the @-character would have been included in the match. The only thing you have to do is to find a regex that does not match the @-character and has no subpattern.

Solution 1: Well, that is not too difficult. You only extend the last part of the regex with an alternative:

"^Re(.*?):\s*(.*?)\s*(\(was:.*\)|\(PGP Decrypted\))*$"

Now let's have a look at the template. We would like to create a new subject. The macro we need is %SUBJECT. Because we use it when we reply to a message and want a proper subject line, it should start with:

%SUBJECT="Re:

Then we add the regex:

%SUBJECT="Re: %SETPATTREGEXP=""^Re(.*?):\s*(.*?)\s*(\(was:.*\)|\(PGP Decrypted\))*$""

Note: the regex is wrapped due to layout reasons. All must be used as a single long line!

We will apply it to the original subject, %OFULLSUBJ, and need the second subpattern, %SUBPATT="2":

%SUBJECT="Re: %SETPATTREGEXP=""^Re(.*?):\s*(.*?)\s*(\(was:.*\)|\(PGP Decrypted\))*$""%REGEXPBLINDMATCH=""%OFULLSUBJ""%SUBPATT=""2"""

Note: the regex is wrapped due to layout reasons. All must be used as a single long line!

OK, that's it. Additional exercise: what if the subject does not have any 'Re' but 'AW', 'FWD' or anything else? Go, try to add further alternatives at the start of the regex.
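The regex from Solution 1, written with the opening parenthesis of '(was:' escaped, can be checked outside TB as well. The following is a minimal sketch in Python, used here only as a stand-in for TB's regex engine. The function name clean_subject and the sample subjects are hypothetical. The second subpattern is picked out the way %SUBPATT="2" would do it, and 'Re: ' is put in front.

```python
import re

# Subject regex from Solution 1, with "\(was:" written out in full.
SUBJECT = re.compile(r"^Re(.*?):\s*(.*?)\s*(\(was:.*\)|\(PGP Decrypted\))*$")

def clean_subject(original_subject):
    """Build a reply subject from the second subpattern (hypothetical helper).

    Subjects that do not start with 'Re' are returned unchanged.
    """
    match = SUBJECT.match(original_subject)
    if match is None:
        return original_subject
    return "Re: " + match.group(2)

# Made-up subjects for illustration.
for subject in ("Re[2]: Meeting (was: Lunch)", "Re: Report (PGP Decrypted)"):
    print(clean_subject(subject))
```

The two sample subjects come out as "Re: Meeting" and "Re: Report": the counter after 'Re' and the trailing remark in parentheses are both dropped.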
Solution 2: A positive lookahead assertion will help:

".*?(?=\@)"

The assertion looks for the @-character but won't include it in the match. Therefore, the template is easier to write:

%TO=""%TO='"%OFROMNAME at %-
%SETPATTREGEXP=_.*?(?=\@)_%-
%REGEXPMATCH=_%-
%OREPLYADDR_" <%OREPLYADDR>'

8. Final Conclusion

Now let's try to explain the example that was given in chapter 1.

%QUOTES="%SETPATTREGEXP=""(?is)(-----BEGIN PGP SIGNED.*?\n(Hash:.*?\n)?\s*)?(.*?)(^(- --|--\n|-----BEGIN PGP SIGNATURE)|\z)""%REGEXPBLINDMATCH=""%text""%SUBPATT=""3"""

It starts with %QUOTES=: the text that is matched by the following regex is to be used as quoted text. "%SETPATTREGEXP="" defines the regex:

(?is)(-----BEGIN PGP SIGNED.*?\n(Hash:.*?\n)?\s*)?(.*?)(^(- --|--\n|-----BEGIN PGP SIGNATURE)|\z)

Note: the regex is wrapped due to layout reasons. All must be used as a single long line!

You already know why there are doubled "-characters: they are escaped so that they are not taken as part of another macro by mistake (although you also know there are better ways of writing that).

"(?is)" is the options setting: ignore case and treat the whole text as one single line, which means the dot also matches newline characters.

"(-----BEGIN PGP SIGNED.*?\n(Hash:.*?\n)?\s*)?"

This opens the first subpattern. The regex says: find five hyphens followed by the string BEGIN PGP SIGNED. This may be followed by any character sequence or none at all (.*?); because .* is greedy, it is restricted by a question mark. Next comes a newline (\n). The new line starts with the string 'Hash:', followed by any character sequence, and ends with a newline again. This is the second subpattern, and it may appear once or not at all. Any amount of whitespace may follow the second subpattern. Then the first subpattern is closed by the final parenthesis. Again this is followed by a question mark, which means that the first subpattern may appear once or not at all.
These lines are created by PGP or GnuPG when a message is clear-signed. The text is standard and therefore it is easy to define the regex. But the author of that macro combination did not want to use it on PGP-signed messages only: he or she wanted to use it even on text that hasn't been touched by PGP and therefore does not have these lines.

"(.*?)"

This is the important third subpattern: the unmodified message text itself. The preceding regex was necessary to locate and isolate this subpattern. The regex just says: "Find anything, no matter what, but don't be greedy."

Now the alternation starts:

"(^(- --|--\n|-----BEGIN PGP SIGNATURE)|\z)"

Subpattern 4 starts and looks for the beginning of a line: anything we now define in this subpattern has to be at the beginning of a line, "(^". Then subpattern 5 follows:

"(- --|--\n|-----BEGIN PGP SIGNATURE)"

It consists of three alternatives: "- --", "--\n" or "-----BEGIN PGP SIGNATURE".

The first alternative is well known once you have seen a clear-signed PGP message: it is the modified signature separator that PGP uses, with the extra hyphen and space as an indicator to show where it inserted its own lines. Quite unfortunate really, but we won't discuss it here. Let's just take it as it is.

The second alternative is the original signature separator. That means that this will be found if the text had no contact with PGP. Actually, it's not quite right, because the proper cut mark is dash-dash-space-newline, so this regex should be:

"(^(- --|--\s\n|-----BEGIN PGP SIGNATURE)|\z)"

The third alternative is necessary to look for lines that contain the PGP-created hash (ok, ok, there is only a part of the hash, but this is a regex tutorial and not a PGP tutorial. If you need that one go to www.pro-privacy.de ;-)). This is the end of subpattern 5's definition.

The second alternative of subpattern 4, "\z)", searches for the end of the string as a counterpart to subpattern 5's search for the beginning of a line.
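The complete pattern, with the corrected signature separator, can be tried outside TB. The following is only a sketch in Python, whose re module stands in for TB's regex engine. Three things are assumptions and not part of the original macro: \Z is used where the macro has \z, re.MULTILINE is added so that ^ matches at the start of every line as described above, and the function name and sample mails are made up.

```python
import re

# Assumptions: Python's \Z replaces \z, and re.MULTILINE makes ^ match
# at the start of every line. IGNORECASE and DOTALL correspond to (?is).
PGP_BODY = re.compile(
    r"(-----BEGIN PGP SIGNED.*?\n(Hash:.*?\n)?\s*)?"  # subpatterns 1 and 2
    r"(.*?)"                                          # subpattern 3: message text
    r"(^(- --|--\s\n|-----BEGIN PGP SIGNATURE)|\Z)",  # subpatterns 4 and 5
    re.IGNORECASE | re.DOTALL | re.MULTILINE,
)

def message_body(mail_text):
    """Return subpattern 3, as %SUBPATT="3" would (hypothetical helper)."""
    match = PGP_BODY.search(mail_text)
    return match.group(3) if match else ""

# Made-up sample mails: clear-signed, plain with signature, and bare.
signed = (
    "-----BEGIN PGP SIGNED MESSAGE-----\n"
    "Hash: SHA1\n"
    "\n"
    "Hello Marck,\n"
    "thanks for the hint.\n"
    "- -- \n"
    "Gerd\n"
    "-----BEGIN PGP SIGNATURE-----\n"
    "abc\n"
    "-----END PGP SIGNATURE-----\n"
)
plain = "Hi there,\nsee you soon.\n-- \nGerd\n"
bare = "Just a short note.\n"

for sample in (signed, plain, bare):
    print(repr(message_body(sample)))
```

In the first two samples the match stops in front of the signature separator. The third sample shows the end-of-string alternative at work: no separator is found, so the match runs to the end of the text.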
Because of the \z alternative, there doesn't have to be a signature separator or a PGP hash: the mail just has to end somewhere… To be honest, the author looks for this funny ending of the mail only so that the proper text of the mail can be easily located and extracted. There is no further interest in these parts.

Now the next macro follows: %REGEXPBLINDMATCH=""%text"", which lets the machine apply the regex to the text. The %SUBPATT=""3""" macro returns the proper part of the mail to the %QUOTES variable. That's it.

A tutorial that is written entirely without direct feedback was something new to me: you don't notice when it gets too complicated or too academic. I tried to avoid both, and I tried to concentrate on those elements of regular expressions that are most useful. I really hope I was successful and that it wasn't boring ;-)

The tutorial isn't a perfect and full description of regexian. If I had wanted to offer that, I could have copied J. Friedl's book into TB's help file. No, the tutorial was meant to give an idea, an initial help to get started. As with any other language, you will only learn the vocabulary by using it. If I was able to give you a hand to get started, I'm content!

I would like to thank those who helped convert my ideas into something readable and useful. My special thanks go to Marck, who was very patient and who improved my translation. Thanks to (in alphabetical order):