Regular Expressions in RGSS (Regexp)
Preface
This article is intended to give basic insight into the use of regular expressions in RGSS and to serve as a future reference.
I assume that you, the reader, have scripting knowledge. If you run into scripting problems while reading or testing this tutorial that are not directly related to regular expressions, it would probably be a good idea to seek out basic scripting tutorials before you continue reading, or you will most likely not get as much out of this tutorial.
This is not an in-depth tutorial on regular expressions. It is more of an appetizer, a help to get you started (along with the reference part, naturally).
I hope you will be able to figure out how to use regular expressions to your advantage with what I have provided, as well as being able to figure out what I have not run through.
I am sure I will have made at least one error. Please do tell me if you find it.
If you have any problems understanding any of the material here, ask. Worst case scenario is me not answering. Best case is you getting an answer that will help you understand the stuff here.
Worth the risk? I would say so.
Enjoy reading
- Zeriab
Contents
- Preface
- Contents
- Reference
- What are regular expressions/languages?
- How to create regular expressions
- How to use regular expressions
- Non-regular languages accepted
- Examples
- Exercises
- Credits and thanks
- Sources and useful links
Reference
Reference from ZenSpider:
Regexen
/normal regex/iomx[neus]
%r|alternate form|
options:
/i case insensitive
/o only interpolate #{} blocks once
/m multiline mode - '.' will match newline
/x extended mode - whitespace is ignored
/[neus] encoding: none, EUC, UTF-8, SJIS, respectively
regex characters:
. any character except newline
[ ] any single character of set
[^ ] any single character NOT of set
* 0 or more previous regular expression
*? 0 or more previous regular expression(non greedy)
+ 1 or more previous regular expression
+? 1 or more previous regular expression(non greedy)
? 0 or 1 previous regular expression
| alternation
( ) grouping regular expressions
^ beginning of a line or string
$ end of a line or string
{m,n} at least m but at most n previous regular expression
{m,n}? at least m but at most n previous regular expression(non greedy)
\A beginning of a string
\b backspace(0x08)(inside[]only)
\B non-word boundary
\b word boundary(outside[]only)
\d digit, same as[0-9]
\D non-digit
\S non-whitespace character
\s whitespace character[ \t\n\r\f]
\W non-word character
\w word character[0-9A-Za-z_]
\z end of a string
\Z end of a string, or before newline at the end
(?# ) comment
(?: ) grouping without backreferences
(?= ) zero-width positive look-ahead assertion
(?! ) zero-width negative look-ahead assertion
(?ix-ix) turns on/off i/x options, localized in group if any.
(?ix-ix: ) turns on/off i/x options, localized in non-capturing group.
What are regular expressions/languages?
You can skip this section if you want, since it is a bit of theory without too much practical application.
A regular language is a language that can be expressed with a regular expression.
What do I mean by language?
I mean that words either can be in the language or not in the language.
Let us, for simplicity, consider for the moment an alphabet with just a and b. This means that you only have the symbol a and the symbol b to make words from when using this alphabet.
A regular language with that alphabet will define which combinations of a and b are words in the language. When I say a regular language accepts a word, I mean that it is an actual word in the language.
We could define a regular language to be an a followed by any number of bs and ended by one a. Zero bs is a possibility.
For example, aa, aba and abbba would be in the language.
b, abbbbab and bba would, on the other hand, not be in the language.
We can quickly see that there is an infinite number of words belonging to the language.
A regular expression is the definition of a regular language.
Let us start with a simple regular expression:
aba
This means that an accepted word must consist of an a followed by a b followed by an a.
This particular regular expression defines a regular language that only accepts one word, aba.
The previous regular language can be expressed as:
a(b)*a
Notice the new notation, the star *.
It literally means any number of the symbol it encompasses, i.e. 0, 1, 2, ...
As long as just one of the amounts fits, the word will be accepted.
The regular expression defines a regular language that accepts any word that consists of an a followed by any number of bs followed by an a and nothing more.
Note that a(b)*(b)*a expresses the same regular language. This is easy to see, since one of the (b)* can always match 0 bs, which leaves you with the same regular expression.
We can conclude that there can exist an infinite number of ways to express a regular language.
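A small Ruby sketch of the language above, checking the sample words against a(b)*a and against the equivalent a(b)*(b)*a. The \A and \z anchors are an addition made here, because a Ruby match otherwise succeeds as soon as any part of the string fits; the constant and variable names are only for illustration.

```ruby
# Whole-word versions of the two expressions from the text.
# \A and \z (an assumption added here) force the entire word to match.
LANGUAGE      = /\Aa(b)*a\z/
SAME_LANGUAGE = /\Aa(b)*(b)*a\z/

words = ["aa", "aba", "abbba", "b", "abbbbab", "bba"]

# Keep only the words that each expression accepts.
accepted  = words.select { |word| word =~ LANGUAGE }
accepted2 = words.select { |word| word =~ SAME_LANGUAGE }

p accepted               # ["aa", "aba", "abbba"]
p accepted == accepted2  # true
```

Both expressions pick out exactly the same words, which is what "expresses the same regular language" means in practice.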
If we wanted the words just to start with a(b)*a, and what comes after is irrelevant, we can use this regular expression:
a(b)*a(a|b)*
Note: The () are not really needed when there is only 1 letter; I just put them in for clarity. It is perfectly fine to write ab*a(a|b)*
Let us define a new regular expression:
abba|baba
We have new notation again. The | simply means or, like in normal RGSS.
So this regular expression defines a language that accepts abba and baba. It is pretty straightforward.
The ? means either 0 or 1 of the symbol.
ab?a accepts aa and aba.
One final often-used notation is the + operator:
a(b)+a
It is basically the same as the star * operator, with the difference that there must be at least one b.
(x)+ is the same as x(x)*, where x could be anything, i.e. any regular expression.
Note that (a)* also accepts the empty word, i.e. "" in Ruby syntax.
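The notations |, ? and + can be tried out in Ruby with a sketch like the following. The helper accepts? is hypothetical (it is not part of RGSS); it wraps the expression in \A(?: )\z so that the whole word has to match, which is the "accepts" meaning used in this section.

```ruby
# Hypothetical helper: true if the whole word is in the language
# described by the given regular expression source.
# The (?: ) group keeps the anchors from binding to only one side of a |.
def accepts?(source, word)
  !(Regexp.new('\A(?:' + source + ')\z') =~ word).nil?
end

p accepts?('abba|baba', 'baba')        # true
p accepts?('abba|baba', 'abab')        # false
p accepts?('ab?a', 'aa')               # true
p accepts?('ab?a', 'abba')             # false
p accepts?('a(b)+a', 'aa')             # false, at least one b is needed
p accepts?('(a)*', '')                 # true, the empty word
p accepts?('ab*a(a|b)*', 'abbabab')    # true, anything may follow ab*a
```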
I will end this section with an example of a non-regular language:
Any word in this language has a number of as at the start followed by the same number of bs, i.e. ab, aabb, aaabbb and so on.
I will not trouble you with the proof because I think you will find it as boring as I do.
If you really want it I am sure you can manage to search for it yourself.
How to create regular expressions
You can create a regular expression with one of the following syntaxes (I will use the first one in my examples):
/.../ol
%r|...|ol
Regexp.new('...', options, language)
The dots (...) symbolize the actual regular expression. You can see the syntax up in the reference section.
The o symbolizes the place for options, which are optional. You do not have to write anything here.
o can be any of the following: i, o, m, x. You can have several of them. You can choose them all if you want. From the reference section:
/i case insensitive
/o only interpolate #{} blocks once
/m multiline mode - '.' will match newline
/x extended mode - whitespace is ignored
/[neus] encoding: none, EUC, UTF-8, SJIS, respectively
The /[neus] is for the l. You can choose either n, e, u or s. Only one encoding is possible at a time.
The options block in Regexp.new is optional. This is a little different from the options part in the previous syntaxes.
Here you can put:
Regexp::EXTENDED - Newlines and spaces are ignored.
Regexp::IGNORECASE - Case insensitive.
Regexp::MULTILINE - Newlines are treated as any other character.
If you want more than one option, let us say ignore case and multiline, you "or" them together like this:
Regexp::IGNORECASE | Regexp::MULTILINE
We move on to what you actually write instead of the dots (...).
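Before that, a minimal sketch comparing the creation syntaxes described above. The pattern ab*a and the variable names are just examples; the encoding (language) argument is left out on purpose.

```ruby
# Three ways to write the same case-insensitive pattern.
literal = /ab*a/i
percent = %r|ab*a|i
built   = Regexp.new('ab*a', Regexp::IGNORECASE)

# All three have the same pattern text and the same option bits.
same_source  = literal.source == percent.source && percent.source == built.source
same_options = literal.options == percent.options && percent.options == built.options
p same_source && same_options   # true

# Several options are combined with the bitwise or operator.
combined = Regexp.new('a.b', Regexp::IGNORECASE | Regexp::MULTILINE)
p "A\nB" =~ combined            # 0, '.' matches the newline and case is ignored
p "A\nB" =~ /a.b/i              # nil, without multiline '.' stops at the newline
```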
Notice that . means any character.
If you want to find just a dot, use \.
Let us take the example where you want to accept words that start with either a, b or c. What comes after does not matter.
/(a|b|c).*/ is a possibility, so are [abc].* and [a-c].*
This illustrates the use of the []. If you want all letters use [a-zA-Z], or just [a-z] if you have ignore case on.
If you have unusual letters, i.e. not a-z, then you probably have to enter them manually.
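A short sketch of the three spellings side by side. The \A anchor is an addition made here: without it, Ruby is happy with an a, b or c anywhere in the word, which the last line shows. The word list is made up for the example.

```ruby
# The three spellings from the text, anchored with \A (an addition here)
# so that the test really is "starts with a, b or c".
patterns = [/\A(a|b|c).*/, /\A[abc].*/, /\A[a-c].*/]
words = ["apple", "banana", "cherry", "donkey", "zebra"]

# For each pattern, collect the words it accepts.
results = patterns.map do |pattern|
  words.select { |word| word =~ pattern }
end
p results.uniq   # [["apple", "banana", "cherry"]], all three agree

# Without the anchor "zebra" slips through because it contains b and a.
unanchored = words.select { |word| word =~ /[abc].*/ }
p unanchored     # ["apple", "banana", "cherry", "zebra"]
```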
Example, any word:
/[a-z]*/i gives the same as /[a-zA-Z]*/
The first case will not allow case-sensitive parts later in the expression, though. I.e. /[a-z]*g/i is not the same as /[a-zA-Z]*g/.
In the first case words that end with a capital G are accepted, while they are not in the latter case.
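The difference can be checked directly with the two expressions from the text. The sample words are made up for the example.

```ruby
ignore_case = /[a-z]*g/i     # the whole expression ignores case
explicit    = /[a-zA-Z]*g/   # only the letters before g may be upper case

words = ["big", "Big", "BIG"]

with_i    = words.select { |word| word =~ ignore_case }
without_i = words.select { |word| word =~ explicit }

p with_i      # ["big", "Big", "BIG"]
p without_i   # ["big", "Big"], "BIG" has no lower-case g
```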
Numbers are [0-9]
Just use \w to get numbers, letters and _
Let us take a slightly more difficult example.
We have used Dir to get the files in a directory. We want to remove the file extension from the strings. How exactly we will do it is shown in the next section.
Here we will write the regular expression that accepts the file extension with dot and only the file extension.
If there are more dots, as in FileName.txt.dat, we will only consider the end extension, i.e. only .dat.
If you have enough self-discipline, this would be a good time to try to figure out how to do it on your own. Just consider every extension.
I will go through this step by step. The regular expression, as used in the next section, is:
/\.[^\.]*\Z/
First we have \. which means a dot, like I told you earlier. It is the dot in the file name.
Next we have the [^\.]* bit. The [] means one of the characters in the brackets. [^] means any character that is not in the brackets. Since you have the dot \. in the brackets, it means any character but the dot.
The star simply means any amount, so any number of non-dot characters is accepted.
Finally we have \Z which means at the end of the string. It will be explained in the next section.
How to use regular expressions
You may have wondered why there are both a * and a *? operator that basically do the same thing.
I have also avoided other use-specific operators.
These are related to how they should be applied to strings.
You now know how to create regular expressions, and in this section I will tell you how to actually use them.
I will continue the example from the previous section.
We have the string:
str = "Read.me.txt"
We want to remove the .txt, which can be done this way:
str.gsub!(/\.[^\.]*\Z/) {|s| ""}
We use gsub!, which modifies the string in place. It returns "Read.me", which is also the new content of str.
The \Z means that the match has to be at the end of the string, or just before a newline ("\n") that ends the string. It does not take .me because there is a dot after it and it is therefore not at the end of the string. (Remember that [^\.] does not accept dots.)
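A sketch of the same expression used without modifying the original string. The constant EXTENSION and the variable names are only illustrative. It also shows one thing to watch out for: gsub! returns nil when it did not replace anything.

```ruby
EXTENSION = /\.[^\.]*\Z/

name = "Read.me.txt"
extension = name[EXTENSION]          # the part the expression matches
base      = name.sub(EXTENSION, "")  # a new string, name itself is untouched

p extension   # ".txt"
p base        # "Read.me"
p name        # "Read.me.txt"

# A name without any dot has nothing to remove.
no_dot = "README"
result = no_dot.gsub!(EXTENSION) { |s| "" }
p result      # nil, because no substitution was made
p no_dot      # "README"
```

Because of that nil, it is safer not to chain other calls directly onto gsub! or sub!.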
Here is a list of the methods on strings where you can apply the regular expression.
Look here for how to use them:
http://www.rubycentral.com/ref/ref_c_string.html
- =~
- [ ]
- [ ]=
- gsub
- gsub!
- index
- rindex
- scan
- slice
- slice!
- split
- sub
- sub!
Basically, whenever you see aRegexp or pattern as an argument, you can apply your regular expression.
The effects vary from method to method, but I am sure you can figure it out, as the principle of regular expressions stays the same.
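To give a feeling for a few of the methods in the list, here is a minimal sketch. The strings and variable names are made up for the example; the results in the comments are what standard Ruby returns.

```ruby
text = "hello world"

position = text =~ /wor/        # index where the match starts
piece    = text[/o\s*w/]        # the matched part of the string
first_l  = text.index(/l+/)     # first position that matches
last_l   = text.rindex(/l/)     # last position that matches
once     = text.sub(/o/, "0")   # replace the first match only
all      = text.gsub(/o/, "0")  # replace every match

numbers = "a1b22c333".scan(/\d+/)         # every match as an array
parts   = "one, two,three".split(/,\s*/)  # split on a pattern

p position   # 6
p piece      # "o w"
p first_l    # 2
p last_l     # 9
p once       # "hell0 world"
p all        # "hell0 w0rld"
p numbers    # ["1", "22", "333"]
p parts      # ["one", "two", "three"]
```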
Another fishy thing you might have noticed is the non-greedy variants of * and +.
To illustrate the different effects, try this code:
"boobs".gsub(/bo*/) {|s| p s} # 1
"boobs".gsub(/bo*?/) {|s| p s} # 2
"boobs".gsub(/bo+/) {|s| p s} # 3
"boobs".gsub(/bo+?/) {|s| p s} # 4
The first one (greedy) will print out "boo" and "b". It takes all the os it can.
The next one (non-greedy) will print out "b" and "b". It takes as few os as possible.
That is basically the difference. The non-greedy version will still take os if necessary. In the following code "boob" is printed out in both examples.
"boobs".gsub(/bo*b/) {|s| p s}
"boobs".gsub(/bo*?b/) {|s| p s}
The + operator is similar except that there has to be at least 1 o.
In the 3rd case you will get "boo" and in the 4th case you will get "bo".
All in all, the longest possible string is taken unless you use non-greedy operators. The non-greedy operators will try to get the shortest possible string.
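The same comparison can be collected into arrays with scan instead of printing from a gsub block. This is only a sketch with illustrative variable names; the tag example at the end is an extra case, not taken from the text above.

```ruby
word = "boobs"

greedy_star = word.scan(/bo*/)     # takes all the os it can
lazy_star   = word.scan(/bo*?/)    # takes as few os as possible
greedy_plus = word.scan(/bo+/)     # needs at least one o
lazy_plus   = word.scan(/bo+?/)    # stops after the first o
forced      = word.scan(/bo*?b/)   # must take the os to reach the next b

p greedy_star   # ["boo", "b"]
p lazy_star     # ["b", "b"]
p greedy_plus   # ["boo"]
p lazy_plus     # ["bo"]
p forced        # ["boob"]

# Extra illustration: greedy runs to the last >, non-greedy stops at the first.
tag = "<b>bold</b>"
p tag[/<.*>/]    # "<b>bold</b>"
p tag[/<.*?>/]   # "<b>"
```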
"boobs".gsub(/o*?/) {|s| p s} will give pure ""s.
As a finale I will talk about escaping characters. You may have wondered how to search for the characters that have special meaning, like *, | and /.
The trick is to escape them. That is what it is called. You basically just have to put a \ in front of them: \*, \| and \/. To get \ just use \\.
I have already shown an example where I escape the dot (\.).
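A small sketch of escaping in practice. The sample strings are made up. Regexp.escape is a standard Ruby method that is not covered in this tutorial; it puts the backslashes in for you when the text to search for comes from a string.

```ruby
text = "a*b|c/d"

star  = text.index(/\*/)   # a literal star
pipe  = text.index(/\|/)   # a literal |
slash = text.index(/\//)   # a literal /
backslash = 'C:\Games'.index(/\\/)   # a literal backslash

p star        # 1
p pipe        # 3
p slash       # 5
p backslash   # 2

# Let Ruby escape the special characters of an arbitrary string.
escaped = Regexp.escape("1+1=2?")
puts escaped                                   # 1\+1=2\?
p "so 1+1=2? yes" =~ Regexp.new(escaped)       # 3
```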
There are still loads of operators I have not shown. Some are a bit advanced, some are not. They will not be included in this tutorial, except for the back-reference shown in the next section.
Until I make another tutorial or extend this one, you can have fun discovering how they work on your own ^_^
Non-regular languages accepted
It is a bit ironic that the regular expression implementation in Ruby also accepts some non-regular languages.
It is the back-references I am talking about.
Look at this example:
/(wii|psp)\1/
This accepts wiiwii and psppsp, but neither wiipsp nor pspwii.
You can use back-references to make non-regular languages.
I am going to supply neither proof nor example. You can google it yourself if you are in doubt ^_^
One problem with back-references is speed. Matching can go from polynomial time to exponential time. If the regular expressions have been implemented just a little sensibly, the slowdown will only affect regular expressions that use the extra functionality.
It should not be too much of a problem with short regular expressions, but it is still something to consider.
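The back-reference example can be tried like this. The \A and \z anchors and the constant name are additions made here so that the whole word has to match; the repeated-word search at the end is an extra illustration of \1, not something from the text above.

```ruby
# \1 must repeat exactly what the first group matched.
DOUBLED = /\A(wii|psp)\1\z/

words = ["wiiwii", "psppsp", "wiipsp", "pspwii"]
accepted = words.select { |word| word =~ DOUBLED }
p accepted   # ["wiiwii", "psppsp"]

# Extra illustration: find a word that is typed twice in a row.
repeated = "this is is a test"[/\b(\w+) \1\b/]
p repeated   # "is is"
```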
Examples
A couple of examples for reference and guidance ^_^
Example 1
I will start by giving some code:
files = Dir["Data/*"]
files.each {|s| s.gsub!(/\.[^\.]*$|[^\/]*\//) {|str| ""}}
The filenames of all the files in the Data directory are stored in the files variable as strings in an Array. (The contents of subdirectories are not included.)
That is what the Dir["Data/*"] command returns.
The next line calls gsub!(/\.[^\.]*$|[^\/]*\//) {|str| ""} on each filename. (Remember that it is a string.)
Now we finally come to the big regular expression:
/\.[^\.]*$|[^\/]*\//
Notice the |. This means that we accept strings that fulfill either what comes before or what comes after. If a string is accepted in both cases, it will still be accepted.
Let us look at \.[^\.]*$. This looks like something we have seen before. Since $ means end of string/line, it basically does the same thing as \Z here. We have already run through this example; it removes the extension.
Next we will look at [^\/]*\/. Remember the bit about escaping? [^\/] means any character but /. The star means any number of them.
It is followed by \/ which means the character /. The last character MUST be /.
Since the greedy version of the star is used it will try to find the longest possible string which ends with /.
So this basically finds the directory bit of the path.
This means that a match has to be either the extension or the path before the filename. We remove these parts by substituting the found strings with "".
This can be used if you for some reason want to draw the file names without the path and extension. (You might have them elsewhere.)
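To see the effect without touching the disk, the expression can be applied to a fixed list of paths. The paths below are made-up examples in the style of an RPG Maker project, and the constant name is only illustrative. File.basename with ".*" is a standard Ruby alternative that gives the same result for these paths.

```ruby
# Matches either the extension at the end or a directory part ending in /.
PATH_OR_EXTENSION = /\.[^\.]*$|[^\/]*\//

# Hypothetical paths standing in for what Dir would return.
paths = [
  "Data/Actors.rxdata",
  "Data/Map001.rxdata",
  "Graphics/Characters/001-Fighter01.png"
]

# gsub without ! leaves the original strings alone.
names = paths.map { |path| path.gsub(PATH_OR_EXTENSION, "") }
p names   # ["Actors", "Map001", "001-Fighter01"]

# The built-in way of stripping the directory and the last extension.
basenames = paths.map { |path| File.basename(path, ".*") }
p names == basenames   # true
```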
Exercises
I have made up a couple of exercises you can try to deal with. I have not provided the answers and I do not plan to provide them.
I consider them more of a stimulant, a way to help you learn regular expressions.
If you really want to check your answers then PM me. Do not post your answers here.
Exercise 1
Let str be an actual string. What does this code do?
str.gsub!(/.*\..*/) {|s| ""}
Exercise 2
You have one long string that contains a certain number of words. Whitespace characters separate the words.
These words have a structure. They all start with either "donkey_" or "monkey_".
What comes after the "_" differs from word to word.
What you want is to separate the monkeys from the donkeys.
You want to put the donkeys in a donkey array and the monkeys in a monkey array.
Make the 2 arrays beforehand and use just one gsub call to separate the monkeys from the donkeys.
Credits and thanks
This article has been made by Zeriab, please give credit.
Credits to ZenSpider.com for the reference list.
Thanks to:
Ragnarok Rob for the example word "boobs". (An amazingly good word :P)
SephirothSpawn for getting me to do this tutorial.
Nitt (jimme reashu) for reporting a bug
Sources and useful links
ZenSpider - http://www.zenspider.com/Languages/Ruby/QuickRef.html#11
Ruby Central - http://www.rubycentral.com/ref/ref_c_regexp.html
Regular-Expressions.info - http://www.regular-expressions.info/