Diffstat (limited to 'doc/kate/regular-expressions.docbook')
-rw-r--r-- | doc/kate/regular-expressions.docbook | 664 |
1 files changed, 664 insertions, 0 deletions
diff --git a/doc/kate/regular-expressions.docbook b/doc/kate/regular-expressions.docbook new file mode 100644 index 000000000..c15685d75 --- /dev/null +++ b/doc/kate/regular-expressions.docbook @@ -0,0 +1,664 @@ +<appendix id="regular-expressions"> +<appendixinfo> +<authorgroup> +<author>&Anders.Lund; &Anders.Lund.mail;</author> +<!-- TRANS:ROLES_OF_TRANSLATORS --> +</authorgroup> +</appendixinfo> + +<title>Regular Expressions</title> + +<synopsis> This Appendix contains a brief but hopefully sufficient +introduction to the world of <emphasis>regular +expressions</emphasis>. It documents regular expressions in the form +available within &kate;, which is not compatible with the regular +expressions of perl, nor with those of, for example, +<command>grep</command>.</synopsis> + +<sect1> + +<title>Introduction</title> + +<para><emphasis>Regular Expressions</emphasis> provide us with a way +to describe some possible contents of a text string in a way +understood by a small piece of software, so that it can check whether +a text matches and, in the case of advanced applications, with the +means of saving pieces of the matching text.</para> + +<para>An example: Say you want to search a text for paragraphs that +start with either of the names <quote>Henrik</quote> or +<quote>Pernille</quote> followed by some form of the verb +<quote>say</quote>.</para> + +<para>With a normal search, you would start out searching for the +first name, <quote>Henrik</quote>, maybe followed by <quote>sa</quote> +like this: <userinput>Henrik sa</userinput>, and while looking for +matches, you would have to discard those not at the beginning of a +paragraph, as well as those in which the word starting with the +letters <quote>sa</quote> was neither <quote>says</quote> nor +<quote>said</quote>. 
And then of course repeat all of that with +the next name...</para> + +<para>With Regular Expressions, that task could be accomplished with a +single search, and with a greater degree of precision.</para> + +<para>To achieve this, Regular Expressions define rules for +expressing in detail a generalization of a string to match. Our +example, which we might literally express like this: <quote>A line +starting with either <quote>Henrik</quote> or <quote>Pernille</quote> +(possibly following up to 4 blanks or tab characters) followed by a +whitespace followed by <quote>sa</quote> and then either +<quote>ys</quote> or <quote>id</quote></quote> could be expressed with +the following regular expression:</para> <para><userinput>^[ +\t]{0,4}(Henrik|Pernille) sa(ys|id)</userinput></para> + +<para>The above example demonstrates all four major concepts of modern +Regular Expressions, namely:</para> + +<itemizedlist> +<listitem><para>Patterns</para></listitem> +<listitem><para>Assertions</para></listitem> +<listitem><para>Quantifiers</para></listitem> +<listitem><para>Back references</para></listitem> +</itemizedlist> + +<para>The caret (<literal>^</literal>) starting the expression is an +assertion, being true only if the following matching string is at the +start of a line.</para> + +<para>The strings <literal>[ \t]</literal> and +<literal>(Henrik|Pernille) sa(ys|id)</literal> are patterns. 
The first +one is a <emphasis>character class</emphasis> that matches either a +blank or a (horizontal) tab character; the other pattern contains +first a subpattern matching either <literal>Henrik</literal> +<emphasis>or</emphasis> <literal>Pernille</literal>, then a piece +matching the exact string <literal> sa</literal> and finally a +subpattern matching either <literal>ys</literal> +<emphasis>or</emphasis> <literal>id</literal>.</para> + +<para>The string <literal>{0,4}</literal> is a quantifier saying +<quote>anywhere from 0 up to 4 of the previous</quote>.</para> + +<para>Because regular expression software supporting the concept of +<emphasis>back references</emphasis> saves the entire matching part of +the string as well as sub-patterns enclosed in parentheses, given some +means of access to those references, we could get our hands on +the whole match (when searching a text document in an editor with a +regular expression, that is often marked as selected), the +name found, or the last part of the verb.</para> + +<para>Altogether, the expression will match where we wanted it to, +and only there.</para> + +<para>The following sections will describe in detail how to construct +and use patterns, character classes, assertions, quantifiers and +back references, and the final section will give a few useful +examples.</para> + +</sect1> + +<sect1 id="regex-patterns"> + +<title>Patterns</title> + +<para>Patterns consist of literal strings and character +classes. Patterns may contain sub-patterns, which are patterns enclosed +in parentheses.</para> + +<sect2> +<title>Escaping characters</title> + +<para>In patterns as well as in character classes, some characters +have a special meaning. 
To literally match any of those characters, +they must be marked or <emphasis>escaped</emphasis> to let the regular +expression software know that it should interpret such characters in +their literal meaning.</para> + +<para>This is done by prefixing the character with a backslash +(<literal>\</literal>).</para> + + +<para>The regular expression software will silently ignore escaping a +character that does not have any special meaning in the context, so +escaping for example a <quote>j</quote> (<userinput>\j</userinput>) is +safe. If you are in doubt whether a character could have a special +meaning, you can therefore escape it safely. Be aware, though, that some escaped letters and digits (for example <literal>\d</literal> or <literal>\1</literal>) have meanings of their own, as described later in this appendix.</para> + +<para>Escaping of course applies to the backslash character itself: to +match one literally, you would write +<userinput>\\</userinput>.</para> + +</sect2> + +<sect2> +<title>Character Classes and abbreviations</title> + +<para>A <emphasis>character class</emphasis> is an expression that +matches one of a defined set of characters. In Regular Expressions, +character classes are defined by putting the legal characters for the +class in square brackets, <literal>[]</literal>, or by using one of +the abbreviated classes described below.</para> + +<para>Simple character classes just contain one or more literal +characters, for example <userinput>[abc]</userinput> (matching either +of the letters <quote>a</quote>, <quote>b</quote> or <quote>c</quote>) +or <userinput>[0123456789]</userinput> (matching any digit).</para> + +<para>Because letters and digits have a logical order, you can +abbreviate those by specifying ranges of them: +<userinput>[a-c]</userinput> is equal to <userinput>[abc]</userinput> +and <userinput>[0-9]</userinput> is equal to +<userinput>[0123456789]</userinput>. 
Combining these constructs, for +example <userinput>[a-fynot1-38]</userinput>, is completely legal (this +class would match any of +<quote>a</quote>, <quote>b</quote>, <quote>c</quote>, <quote>d</quote>, +<quote>e</quote>, <quote>f</quote>, <quote>y</quote>, <quote>n</quote>, <quote>o</quote>, <quote>t</quote>, +<quote>1</quote>, <quote>2</quote>, <quote>3</quote> or +<quote>8</quote>).</para> + +<para>As capital letters are different characters from their +non-capital equivalents, to create a caseless character class matching +<quote>a</quote> or <quote>b</quote>, in any case, you need to write it as +<userinput>[aAbB]</userinput>.</para> + +<para>It is of course possible to create a <quote>negative</quote> +class matching <quote>anything but</quote>. To do so, put a caret +(<literal>^</literal>) at the beginning of the class: </para> + +<para><userinput>[^abc]</userinput> will match any character +<emphasis>but</emphasis> <quote>a</quote>, <quote>b</quote> or +<quote>c</quote>.</para> + +<para>In addition to literal characters, some abbreviations are +defined, making life still a bit easier: + +<variablelist> + +<varlistentry> +<term><userinput>\a</userinput></term> +<listitem><para> This matches the <acronym>ASCII</acronym> bell character (BEL, 0x07).</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\f</userinput></term> +<listitem><para> This matches the <acronym>ASCII</acronym> form feed character (FF, 0x0C).</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\n</userinput></term> +<listitem><para> This matches the <acronym>ASCII</acronym> line feed character (LF, 0x0A, Unix newline).</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\r</userinput></term> +<listitem><para> This matches the <acronym>ASCII</acronym> carriage return character (CR, 0x0D).</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\t</userinput></term> +<listitem><para> This matches the 
<acronym>ASCII</acronym> horizontal tab character (HT, 0x09).</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\v</userinput></term> +<listitem><para> This matches the <acronym>ASCII</acronym> vertical tab character (VT, 0x0B).</para></listitem> +</varlistentry> +<varlistentry> +<term><userinput>\xhhhh</userinput></term> + +<listitem><para> This matches the Unicode character corresponding to +the hexadecimal number hhhh (between 0x0000 and 0xFFFF). \0ooo (&ie;, +\zero ooo) matches the <acronym>ASCII</acronym>/Latin-1 character +corresponding to the octal number ooo (between 0 and +0377).</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>.</userinput> (dot)</term> +<listitem><para> This matches any character (including newline).</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\d</userinput></term> +<listitem><para> This matches a digit. Equal to <literal>[0-9]</literal>.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\D</userinput></term> +<listitem><para> This matches a non-digit. Equal to <literal>[^0-9]</literal> or <literal>[^\d]</literal>.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\s</userinput></term> +<listitem><para> This matches a whitespace character. Practically equal to <literal>[ \t\n\r]</literal>.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\S</userinput></term> +<listitem><para> This matches a non-whitespace character. Practically equal to <literal>[^ \t\r\n]</literal>, and equal to <literal>[^\s]</literal>.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\w</userinput></term> +<listitem><para>Matches any <quote>word character</quote> - in this case any letter or digit. Note that, +unlike in perl regular expressions, underscore (<literal>_</literal>) is not matched. 
+Equal to <literal>[a-zA-Z0-9]</literal>.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\W</userinput></term> +<listitem><para>Matches any non-word character - anything but letters or digits. +Equal to <literal>[^a-zA-Z0-9]</literal> or <literal>[^\w]</literal>.</para></listitem> +</varlistentry> + + +</variablelist> + +</para> + +<para>The abbreviated classes can be put inside a custom class; for +example, to match a word character, a blank or a dot, you could write +<userinput>[\w \.]</userinput>.</para> + +<note> <para>The POSIX notation of classes, <userinput>[:&lt;class +name&gt;:]</userinput>, is currently not supported.</para> </note> + +<sect3> +<title>Characters with special meanings inside character classes</title> + +<para>The following characters have a special meaning inside the +<quote>[]</quote> character class construct, and must be escaped to be +literally included in a class:</para> + +<variablelist> +<varlistentry> +<term><userinput>]</userinput></term> +<listitem><para>Ends the character class. Must be escaped unless it is the very first character in the +class (may follow an unescaped caret).</para></listitem> +</varlistentry> +<varlistentry> +<term><userinput>^</userinput> (caret)</term> +<listitem><para>Denotes a negative class, if it is the first character. Must be escaped to match literally if it is the first character in the class.</para></listitem> +</varlistentry> +<varlistentry> +<term><userinput>-</userinput> (dash)</term> +<listitem><para>Denotes a logical range. Must always be escaped within a character class.</para></listitem> +</varlistentry> +<varlistentry> +<term><userinput>\</userinput> (backslash)</term> +<listitem><para>The escape character. 
Must always be escaped.</para></listitem> +</varlistentry> + +</variablelist> + +</sect3> + +</sect2> + +<sect2> + +<title>Alternatives: matching <quote>one of</quote></title> + +<para>If you want to match one of a set of alternative patterns, you +can separate those with <literal>|</literal> (vertical bar character).</para> + +<para>For example, to find either <quote>John</quote> or <quote>Harry</quote>, you would use the expression <userinput>John|Harry</userinput>.</para> + +</sect2> + +<sect2> + +<title>Sub Patterns</title> + +<para><emphasis>Sub patterns</emphasis> are patterns enclosed in +parentheses, and they have several uses in the world of regular +expressions.</para> + +<sect3> + +<title>Specifying alternatives</title> + +<para>You may use a sub pattern to group a set of alternatives within +a larger pattern. The alternatives are separated by the character +<quote>|</quote> (vertical bar).</para> + +<para>For example, to match either of the words <quote>int</quote>, +<quote>float</quote> or <quote>double</quote>, you could use the +pattern <userinput>int|float|double</userinput>. If you only want to +find one of them when it is followed by some whitespace and then some word characters, +put the alternatives inside a subpattern: +<userinput>(int|float|double)\s+\w+</userinput>.</para> + +</sect3> + +<sect3> + +<title>Capturing matching text (back references)</title> + +<para>If you want to use a back reference, use a sub pattern to have +the desired part of the pattern remembered.</para> + +<para>For example, if you want to find two occurrences of the same +word separated by a comma and possibly some whitespace, you could +write <userinput>(\w+),\s*\1</userinput>. The sub pattern +<literal>\w+</literal> would find a chunk of word characters, and the +entire expression would match if those were followed by a comma, zero or +more whitespace characters and then an identical chunk of word characters. 
(The +string <literal>\1</literal> references <emphasis>the first sub pattern +enclosed in parentheses</emphasis>.)</para> + +<!-- <para>See also <link linkend="backreferences">Back references</link>.</para> --> + +</sect3> + +<sect3 id="lookahead-assertions"> +<title>Lookahead Assertions</title> + +<para>A lookahead assertion is a sub pattern, starting with either +<literal>?=</literal> or <literal>?!</literal>.</para> + +<para>For example, to match the literal string <quote>Bill</quote> but +only if not followed by <quote> Gates</quote>, you could use this +expression: <userinput>Bill(?! Gates)</userinput>. (This would find the <quote>Bill</quote> in +<quote>Bill Clinton</quote> as well as in <quote>Billy the kid</quote>, +but silently skip the one in <quote>Bill Gates</quote>.)</para> + +<para>Sub patterns used for assertions are not captured.</para> + +<para>See also <link linkend="assertions">Assertions</link>.</para> + +</sect3> + +</sect2> + +<sect2 id="special-characters-in-patterns"> +<title>Characters with a special meaning inside patterns</title> + +<para>The following characters have a special meaning inside a pattern, and +must be escaped if you want to literally match them: + +<variablelist> + +<varlistentry> +<term><userinput>\</userinput> (backslash)</term> +<listitem><para>The escape character.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>^</userinput> (caret)</term> +<listitem><para>Asserts the beginning of the string.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>$</userinput></term> +<listitem><para>Asserts the end of the string.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>()</userinput> (left and right parentheses)</term> +<listitem><para>Denotes sub patterns.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>{}</userinput> (left and right curly braces)</term> +<listitem><para>Denotes numeric quantifiers.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>[]</userinput> (left and right 
square brackets)</term> +<listitem><para>Denotes character classes.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>|</userinput> (vertical bar)</term> +<listitem><para>Logical OR. Separates alternatives.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>+</userinput> (plus sign)</term> +<listitem><para>Quantifier, 1 or more.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>*</userinput> (asterisk)</term> +<listitem><para>Quantifier, 0 or more.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>?</userinput> (question mark)</term> +<listitem><para>An optional character. Can be interpreted as a quantifier, 0 or 1.</para></listitem> +</varlistentry> + +</variablelist> + +</para> + +</sect2> + +</sect1> + +<sect1 id="quantifiers"> +<title>Quantifiers</title> + +<para><emphasis>Quantifiers</emphasis> allow a regular expression to +match a specified number, or range of numbers, of occurrences of a character, +character class or sub pattern.</para> + +<para>Quantifiers are enclosed in curly braces (<literal>{</literal> +and <literal>}</literal>) and have the general form +<literal>{[minimum-occurrences][,[maximum-occurrences]]}</literal> +</para> + +<para>The usage is best explained by example: + +<variablelist> + +<varlistentry> +<term><userinput>{1}</userinput></term> +<listitem><para>Exactly 1 occurrence.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>{0,1}</userinput></term> +<listitem><para>Zero or 1 occurrences.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>{,1}</userinput></term> +<listitem><para>The same, with less work;)</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>{5,10}</userinput></term> +<listitem><para>At least 5 but at most 10 occurrences.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>{5,}</userinput></term> +<listitem><para>At least 5 occurrences, no maximum.</para></listitem> 
+</varlistentry> + +</variablelist> + +</para> + +<para>Additionally, there are some abbreviations: + +<variablelist> + +<varlistentry> +<term><userinput>*</userinput> (asterisk)</term> +<listitem><para>Equivalent to <literal>{0,}</literal>, any number of occurrences.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>+</userinput> (plus sign)</term> +<listitem><para>Equivalent to <literal>{1,}</literal>, at least 1 occurrence.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>?</userinput> (question mark)</term> +<listitem><para>Equivalent to <literal>{0,1}</literal>, zero or 1 occurrence.</para></listitem> +</varlistentry> + +</variablelist> + +</para> + +<sect2> + +<title>Greed</title> + +<para>When using quantifiers with no maximum, regular expressions +default to matching as much of the searched string as possible, commonly +known as <emphasis>greedy</emphasis> behavior.</para> + +<para>Modern regular expression software provides the means of +<quote>turning off greediness</quote>, though in a graphical +environment it is up to the interface to provide you with access to +this feature. 
For example, a search dialog providing a regular +expression search could have a check box labeled <quote>Minimal +matching</quote>, and it ought to indicate whether greediness is the +default behavior.</para> + +</sect2> + +<sect2> +<title>In context examples</title> + +<para>Here are a few examples of using quantifiers:</para> + +<variablelist> + +<varlistentry> +<term><userinput>^\d{4,5}\s</userinput></term> +<listitem><para>Matches the digits (and the whitespace character following them) in <quote>1234 go</quote> and <quote>12345 now</quote>, but neither in <quote>567 eleven</quote> +nor in <quote>223459 somewhere</quote>.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\s+</userinput></term> +<listitem><para>Matches one or more whitespace characters.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>(bla){1,}</userinput></term> +<listitem><para>Matches all of <quote>blablabla</quote> and the <quote>bla</quote> in <quote>blackbird</quote> or <quote>tabla</quote>.</para></listitem> +</varlistentry> + +<varlistentry> +<term><userinput>/?&gt;</userinput></term> +<listitem><para>Matches <quote>/&gt;</quote> in <quote>&lt;closeditem/&gt;</quote> as well as +<quote>&gt;</quote> in <quote>&lt;openitem&gt;</quote>.</para></listitem> +</varlistentry> + +</variablelist> + +</sect2> + +</sect1> + +<sect1 id="assertions"> +<title>Assertions</title> + +<para><emphasis>Assertions</emphasis> allow a regular expression to +match only under certain controlled conditions.</para> + +<para>An assertion does not need a character to match; rather, it +investigates the surroundings of a possible match before acknowledging +it. For example, the <emphasis>word boundary</emphasis> assertion does +not try to find a non word character opposite a word one at its +position; instead it makes sure that there is not a word +character. 
This means that the assertion can match where there is no +character, &ie; at the ends of a searched string.</para> + +<para>Some assertions actually do have a pattern to match, but the +part of the string matching that pattern will not be a part of the result of +the match of the full expression.</para> + +<para>Regular Expressions as documented here support the following +assertions: + +<variablelist> + +<varlistentry> +<term><userinput>^</userinput> (caret: beginning of +string)</term> +<listitem><para>Matches the beginning of the searched +string.</para> <para>The expression <userinput>^Peter</userinput> will +match at <quote>Peter</quote> in the string <quote>Peter, hey!</quote> +but not in <quote>Hey, Peter!</quote>.</para> </listitem> +</varlistentry> + +<varlistentry> +<term><userinput>$</userinput> (end of string)</term> +<listitem><para>Matches the end of the searched string.</para> + +<para>The expression <userinput>you\?$</userinput> will match at the +last <quote>you</quote> in the string <quote>You didn't do that, did you?</quote> but +nowhere in <quote>You didn't do that, right?</quote>.</para> + +</listitem> +</varlistentry> + +<varlistentry> +<term><userinput>\b</userinput> (word boundary)</term> +<listitem><para>Matches if there is a word character at one side and not a word character at the +other.</para> +<para>This is useful for finding word ends, for example both ends in order to find +a whole word. 
The expression <userinput>\bin\b</userinput> will match +at the separate <quote>in</quote> in the string <quote>He came in +through the window</quote>, but not at the <quote>in</quote> in +<quote>window</quote>.</para></listitem> + +</varlistentry> + +<varlistentry> +<term><userinput>\B</userinput> (non word boundary)</term> +<listitem><para>Matches wherever <quote>\b</quote> does not.</para> +<para>That means that it will match, for example, within words: The expression +<userinput>\Bin\B</userinput> will match the <quote>in</quote> in <quote>window</quote> but not in <quote>integer</quote> or <quote>I'm in love</quote>.</para> +</listitem> +</varlistentry> + +<varlistentry> +<term><userinput>(?=PATTERN)</userinput> (Positive lookahead)</term> +<listitem><para>A lookahead assertion looks at the part of the string following a possible match. +The positive lookahead will prevent the string from matching if the text following the possible match +does not match the <emphasis>PATTERN</emphasis> of the assertion, but the text matched by that pattern will +not be included in the result.</para> +<para>The expression <userinput>handy(?=\w)</userinput> will match at <quote>handy</quote> in +<quote>handyman</quote> but not in <quote>That came in handy!</quote>.</para> +</listitem> +</varlistentry> + +<varlistentry> +<term><userinput>(?!PATTERN)</userinput> (Negative lookahead)</term> + +<listitem><para>The negative lookahead prevents a possible match from being +acknowledged if the following part of the searched string does match +its <emphasis>PATTERN</emphasis>.</para> +<para>The expression <userinput>const \w+\b(?!\s*&amp;)</userinput> +will match at <quote>const char</quote> in the string <quote>const +char* foo</quote> while it cannot match <quote>const QString</quote> +in <quote>const QString&amp; bar</quote> because the +<quote>&amp;</quote> matches the negative lookahead assertion +pattern.</para> +</listitem> +</varlistentry> + +</variablelist> + +</para> + +</sect1> + +<!-- TODO sect1 id="backreferences"> + 
+<title>Back References</title> + +<para></para> + +</sect1 --> + +</appendix>
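The patterns discussed in the appendix can be tried outside Kate with a short script. The following is only a sketch, assuming Python's standard `re` module as a stand-in engine; it is not the engine Kate uses, and the two differ in details (for instance, Python's `\w` also matches the underscore). The names `INTRO`, `REPEATED`, `NOT_GATES` and `intro_match` are hypothetical and chosen for this illustration only.

```python
import re

# The introductory pattern: an assertion (^), a quantified character
# class ([ \t]{0,4}) and two sub patterns holding alternatives.
INTRO = re.compile(r"^[ \t]{0,4}(Henrik|Pernille) sa(ys|id)")

# Back reference: the same word twice, separated by a comma and
# optional whitespace. \1 refers to the first sub pattern.
REPEATED = re.compile(r"(\w+),\s*\1")

# Negative lookahead: "Bill", unless it is followed by " Gates".
NOT_GATES = re.compile(r"Bill(?! Gates)")


def intro_match(line):
    """Return (whole match, name, verb ending), or None if no match."""
    m = INTRO.search(line)
    if m is None:
        return None
    return (m.group(0), m.group(1), m.group(2))


if __name__ == "__main__":
    print(intro_match("  Henrik says hello"))    # leading blanks are allowed
    print(intro_match("Then Pernille said no"))  # name is not at line start
    print(REPEATED.search("well, well then").group(0))
    text = "Bill Gates, Bill Clinton, Billy the kid"
    print([m.start() for m in NOT_GATES.finditer(text)])
```

Keeping each pattern in its own compiled object makes it easy to swap in other expressions from the appendix and compare the results with what Kate's search dialog reports.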