#1 • May 19, 2007

Regular Expressions in RGSS (Regexp)

Preface

This article is intended to give basic insight into the use of regular expressions in RGSS and to serve as a future reference.
I assume you, the reader, have scripting knowledge. If you run into scripting problems that are not directly related to regular expressions while you read or test this tutorial, it would probably be a good idea to seek out basic scripting tutorials before you continue reading; otherwise you will most likely not get as much out of it.
This is not an in-depth tutorial on regular expressions. It is more of an appetizer, a help to get you started (along with the reference part, naturally).
I hope you will be able to figure out how to use regular expressions to your advantage with what I have provided, and also to figure out the parts I have not run through.

I am sure I have made at least one error. Please do tell me if you find one.
If you have any problems understanding any of the material here, ask. The worst case is that I do not answer. The best case is that you get an answer that helps you understand the material here.

Worth the risk? I would say so.

Enjoy reading
- Zeriab

Contents

Preface
Contents
Reference
What are regular expressions/languages?
How to create regular expressions
How to use regular expressions
Non-regular languages accepted
Examples
Exercises
Credits and thanks
Sources and useful links

Reference

Reference from ZenSpider:

Quote from: Reference

Regexen

/normal regex/iomx[neus]
%r|alternate form|

options:

/i case insensitive
/o only interpolate #{} blocks once
/m multiline mode - '.' will match newline
/x extended mode - whitespace is ignored
/[neus] encoding: none, EUC, UTF-8, SJIS, respectively

regex characters:

. any character except newline
[ ] any single character of set
[^ ] any single character NOT of set
* 0 or more previous regular expression
*? 0 or more previous regular expression(non greedy)
+ 1 or more previous regular expression
+? 1 or more previous regular expression(non greedy)
? 0 or 1 previous regular expression
| alternation
( ) grouping regular expressions
^ beginning of a line or string
$ end of a line or string
{m,n} at least m but at most n previous regular expression
{m,n}? at least m but at most n previous regular expression(non greedy)
\A beginning of a string
\b backspace(0x08)(inside[]only)
\B non-word boundary
\b word boundary(outside[]only)
\d digit, same as[0-9]
\D non-digit
\S non-whitespace character
\s whitespace character[ \t\n\r\f]
\W non-word character
\w word character[0-9A-Za-z_]
\z end of a string
\Z end of a string, or before newline at the end
(?# ) comment
(?: ) grouping without backreferences
(?= ) zero-width positive look-ahead assertion
(?! ) zero-width negative look-ahead assertion
(?ix-ix) turns on/off i/x options, localized in group if any.
(?ix-ix: ) turns on/off i/x options, localized in non-capturing group.

What are regular expressions/languages?
You can skip this section if you want since it is a bit of theory without too much practical application.

A regular language is a language that can be expressed with a regular expression.
What do I mean by language?
I mean that words either can be in the language or not in the language.
To keep things simple, let us for the moment consider an alphabet with just a and b. This means that you only have the symbol a and the symbol b to make words from when using this alphabet.
A regular language over that alphabet defines which combinations of a and b are words in the language. When I say a regular language accepts a word, I mean that it is an actual word in the language.
We could define a regular language to be an a followed by any number of bs and ended by one a. Zero bs is a possibility.
E.g. aa, aba and abbba would be in the language.
b, abbbbab and bba would on the other hand not be in the language.
We can quickly see that there is an infinite number of words belonging to the language.

A regular expression is the definition of a regular language.
Let us start with a simple regular expression:
aba
This means that an accepted word must consist of an a followed by a b followed by an a.
This particular regular expression defines a regular language that only accepts one word, aba.

The earlier regular language (an a, any number of bs, then an a) can be expressed as:
a(b)*a

Notice the new notation, the star *.
It means any number of repetitions of what it is applied to: 0, 1, 2, ...
A word is accepted as long as at least one of those repetition counts makes it fit.
The regular expression defines a regular language that accepts any word that consists of an a followed by any number of bs followed by an a and nothing more.
Note that a(b)*(b)*a expresses the same regular language. This is easy to see, since one of the (b)* can always match 0 bs, which leaves you with the same regular expression.
We can conclude that there can be infinitely many ways to express a regular language.

If we wanted the words just to start with a(b)*a, with whatever comes after being irrelevant, we could use this regular expression: a(b)*a(a|b)*

Note: The () are not really needed when there is only 1 letter. I just put them in for clarity. It is perfectly fine to write ab*a(a|b)*
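
A small sketch of these two expressions in Ruby. One assumption to state up front: Ruby looks for a match anywhere inside a string, so the sketch adds \A and \z (both in the reference list) to force the whole word to fit. The names AB_STAR_A, STARTS_WITH and in_language? are made up for this illustration.

```ruby
# Whole-word versions of the two expressions above.
# \A and \z tie the match to the start and the end of the string.
AB_STAR_A   = /\Aab*a\z/          # an a, any number of bs, an a
STARTS_WITH = /\Aab*a(a|b)*\z/    # the same start, then anything made of a and b

# Hypothetical helper: true if the word is in the language of the pattern.
def in_language?(word, pattern)
  !(word =~ pattern).nil?
end

%w[aa aba abbba b abbbbab bba].each do |word|
  puts "#{word}: #{in_language?(word, AB_STAR_A)} / #{in_language?(word, STARTS_WITH)}"
end
```

Without the anchors, "bba" would be reported as a match, because the pattern is only required to fit somewhere inside the string.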

Let us define a new regular expression:
abba|baba

We have new notation again. The | simply means or, like in normal RGSS.
So this regular expression defines a language that accepts abba and baba. It is pretty straightforward.

The ? means either 0 or 1 of the symbol. ab?a accepts aa and aba.

One final often used notation is the + operator - a(b)+a
It is basically the same as the star * operator, except that there has to be at least one b.
(x)+ is the same as x(x)*, where x can be anything, i.e. any regular expression.

Note that (a)* also accepts the empty word, i.e. "" in Ruby syntax.
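
The operators |, ?, + and * can be tried out with a few lines of Ruby. This is only a sketch: the constant names and the accepts? helper are invented here, and \A and \z are added so that the whole word has to match.

```ruby
EITHER_WORD = /\A(abba|baba)\z/   # | : one alternative or the other
OPTIONAL_B  = /\Aab?a\z/          # ? : zero or one b
SOME_BS     = /\Aab+a\z/          # + : one or more bs
ANY_AS      = /\Aa*\z/            # * : zero or more as

# Hypothetical helper: true if the pattern accepts the whole word.
def accepts?(pattern, word)
  !(word =~ pattern).nil?
end

p accepts?(EITHER_WORD, "baba")   # true
p accepts?(EITHER_WORD, "abab")   # false
p accepts?(OPTIONAL_B, "aa")      # true
p accepts?(OPTIONAL_B, "abba")    # false, two bs is one too many
p accepts?(SOME_BS, "aa")         # false, + needs at least one b
p accepts?(SOME_BS, "abbba")      # true
p accepts?(ANY_AS, "")            # true, the empty word
```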

I will end this section with an example of a non-regular language:
Every word in this language has a number of as at the start followed by the same number of bs.
E.g. ab, aabb, aaabbb and so on.

I will not trouble you with the proof because I think you will find it as boring as I do.
If you really want it I am sure you can manage to search for it yourself.
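
To make that language concrete, here is a sketch of how it could be recognised in Ruby: a regular expression checks the shape (some as, then some bs) and ordinary code compares the two counts, which is the part a pure regular expression cannot do. The method name same_count? is made up for this example.

```ruby
# Hypothetical helper: true for ab, aabb, aaabbb and so on.
def same_count?(word)
  # The regular expression can check "some as, then some bs"...
  return false unless word =~ /\A(a+)(b+)\z/
  # ...but comparing the two counts is done with plain Ruby.
  $1.length == $2.length
end

p same_count?("aaabbb")   # true
p same_count?("aab")      # false, the counts differ
p same_count?("abab")     # false, wrong shape
```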

How to create regular expressions

You can create a regular expression with one of the following syntaxes: (I will use the first one in my examples)

/.../ol
%r|...|ol
Regexp.new('...', options, language)

The dots (...) symbolize the actual regular expression. You can see the syntax up in the reference section.
The o marks the place for options, which are optional. You do not have to write anything here.
o can be any of the following: i, o, m, x. You can use several of them at once, or all of them if you want. From the reference section:

Quote

/i case insensitive
/o only interpolate #{} blocks once
/m multiline mode - '.' will match newline
/x extended mode - whitespace is ignored
/[neus] encoding: none, EUC, UTF-8, SJIS, respectively

The /[neus] is for the l. You can choose either n, e, u or s. Only one encoding is possible at a time.

The options block in Regexp.new is optional. This is a little different from the options part in the previous syntaxes.
Here you can put:
Regexp::EXTENDED - Newlines and spaces are ignored.
Regexp::IGNORECASE - Case insensitive.
Regexp::MULTILINE - Newlines are treated as any other character.

If you want more than one option, let us say ignore case and multiline, you OR them together with | like this: Regexp::IGNORECASE | Regexp::MULTILINE
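
A short sketch of the creation syntaxes side by side. The constant names and sample strings are invented for illustration, and the optional third (language) argument of Regexp.new is left out on purpose.

```ruby
LITERAL  = /monster/i                                   # /.../ with the i option
PERCENT  = %r|monster|i                                 # %r|...| with the i option
BUILT    = Regexp.new('monster', Regexp::IGNORECASE)    # the same thing via Regexp.new

# Two options OR'ed together: ignore case, and let . match a newline.
COMBINED = Regexp.new('a.b', Regexp::IGNORECASE | Regexp::MULTILINE)

[LITERAL, PERCENT, BUILT].each do |regexp|
  p("Big MONSTER" =~ regexp)   # 4 each time, the index where the match starts
end

p("A\nB" =~ COMBINED)   # 0
p("A\nB" =~ /a.b/i)     # nil, without m the . does not match the newline
```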

We move on to what you actually write instead of the dots (...).
Notice that . means any character (except a newline, unless the m option is set).
If you want to find just a dot use \.

Let us take the example where you want to accept words that start with either a, b or c. What comes after does not matter.
/(a|b|c).*/ is a possibility, and so are /[abc].*/ and /[a-c].*/
This illustrates the use of the []. If you want all letters use [a-zA-Z] or just [a-z] if you have ignore case on.
If you have unusual letters, i.e. ones outside a-z, then you probably have to enter them manually.
Example, any word:
/[a-z]*/i gives the same as /[a-zA-Z]*/
The first form makes the whole expression case insensitive though, so you cannot be case sensitive later in it. I.e. /[a-z]*g/i is not the same as /[a-zA-Z]*g/.
In the first case words that end with a capital G are accepted, while they are not in the latter case.
Digits are [0-9]
Just use \w to get digits, letters and _
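
The character classes above can be checked with a few matches. This is a sketch with made-up constant names and sample words; \A and \z are added where the whole word is supposed to fit.

```ruby
FIRST_ABC    = /\A[a-c].*/          # starts with a, b or c
G_ANY_CASE   = /\A[a-z]*g\z/i       # i applies to the final g as well
G_LOWER_ONLY = /\A[a-zA-Z]*g\z/     # letters in any case, but a lowercase g at the end
WORD_CHARS   = /\A\w+\z/            # only digits, letters and _

p("banana" =~ FIRST_ABC)        # 0
p("dog" =~ FIRST_ABC)           # nil
p("BIG" =~ G_ANY_CASE)          # 0, the capital G is accepted
p("BIG" =~ G_LOWER_ONLY)        # nil, the capital G is not accepted
p("hero_01" =~ WORD_CHARS)      # 0
p("hero 01" =~ WORD_CHARS)      # nil, the space is not a word character
p("Map001.rxdata" =~ /\./)      # 6, the position of the literal dot
```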

Let us take a slightly more difficult example.
We have used Dir to get the files in a directory. We want to remove the file extension from the strings. Exactly how we will do it is shown in the next section.
Here we will write the regular expression that accepts the file extension including the dot, and only the file extension.
If there are more dots, as in FileName.txt.dat, we will only consider the last extension, i.e. only .dat.
If you have enough self-discipline, this would be a good time to try to figure out how to do it on your own. Just consider every extension.

Spoiler for My answer:

/\.[^\.]*\Z/

I will go through this step by step.
First we have \. which means a literal dot, as I mentioned earlier. It is the dot in the file name.
Next we have the [^\.]* bit. The [] means one of the characters in the brackets. [^] means any character that is not in the brackets. Since the dot \. is in the brackets, it means any character but the dot.
The star simply means any number, so any number of non-dot characters is accepted.
Finally we have \Z which means at the end of the string. It will be explained in the next section.

How to use regular expressions

You may have wondered why there are both a * and a *? operator that basically do the same thing.
I have also avoided other use-specific operators.
These are related to how the expression is applied to strings.

You now know how to create regular expressions, and in this section I will tell you how to actually use them.

I will continue the example from the previous section.
We have the string:

Code: [Select]

str = "Read.me.txt"

We want to remove the .txt which can be done this way:

Code: [Select]

str.gsub!(/\.[^\.]*\Z/) {|s| ""}

We use gsub!, which modifies the string in place. Afterwards str is "Read.me".
The \Z means that the match has to end at the end of the string (or just before a newline that ends the string). It does not take .me because there is a dot after it, so it is not at the end of the string. (Remember that [^\.] does not accept dots)
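
A sketch of the same substitution wrapped in a method. The names EXTENSION and strip_extension are invented for this example, and it uses sub instead of gsub! so that the original string is left untouched. It also shows a detail of gsub! that is easy to trip over: it returns nil when nothing was replaced.

```ruby
EXTENSION = /\.[^\.]*\Z/

# Hypothetical helper: returns a new string and leaves the original alone.
def strip_extension(name)
  name.sub(EXTENSION, "")
end

p strip_extension("Read.me.txt")        # "Read.me"
p strip_extension("FileName.txt.dat")   # "FileName.txt"
p strip_extension("README")             # "README", no dot so nothing is removed

# gsub! changes the string in place and returns nil when nothing matched,
# so check its return value before using it.
name = "README".dup
p name.gsub!(EXTENSION, "")             # nil
p name                                  # "README"
```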
Here is a list of the methods on strings where you can apply the regular expression.
Look here for how to use them: http://www.rubycentral.com/ref/ref_c_string.html

=~
[ ]
[ ]=
gsub
gsub!
index
rindex
scan
slice
slice!
split
sub
sub!

Basically whenever you see aRegexp or pattern as an argument you can apply your regular expression.

The effects vary from method to method, but I am sure you can figure them out, as the principle behind the regular expressions is the same everywhere.
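
To give a feel for the differences, here is a sketch that applies a regular expression through several of the listed methods. The sample string and the name STATS are made up.

```ruby
STATS = "HP 120, MP 45"

p(STATS =~ /\d+/)           # 3, the index where the first match starts
p STATS[/\d+/]              # "120", the first matching substring
p STATS.scan(/\d+/)         # ["120", "45"], every match
p STATS.split(/,\s*/)       # ["HP 120", "MP 45"], the pattern is the separator
p STATS.sub(/\d+/, "0")     # "HP 0, MP 45", only the first match is replaced
p STATS.gsub(/\d+/, "0")    # "HP 0, MP 0", every match is replaced
p STATS.index(/MP/)         # 8
```

The methods without ! return a new value and leave STATS unchanged.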

Another fishy thing you might have noticed is the non-greedy variants of * and +.
To illustrate the difference in effect, try this code:

Code: [Select]

 "boobs".gsub(/bo*/) {|s| p s}     # 1
 "boobs".gsub(/bo*?/) {|s| p s}    # 2
 "boobs".gsub(/bo+/) {|s| p s}     # 3
 "boobs".gsub(/bo+?/) {|s| p s}    # 4

The first one (greedy) will print out "boo" and "b". It takes all the os it can.
The next one (non-greedy) will print out "b" and "b". It takes as few os as possible.
That is basically the difference. A non-greedy operator will still take os if that is necessary to get a match. In the following code "boob" is printed out in both examples.

Code: [Select]

 "boobs".gsub(/bo*b/) {|s| p s} 
 "boobs".gsub(/bo*?b/) {|s| p s}

The + operator is similar except that there has to be at least 1 o.
In the 3rd case you will get "boo" and in the 4th case you will get "bo".
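
The same six patterns can also be compared with scan, which returns the matches as an array instead of printing them one by one. This is just an alternative way of looking at the results described above; the name WORD is made up.

```ruby
WORD = "boobs"

p WORD.scan(/bo*/)     # ["boo", "b"]  greedy *
p WORD.scan(/bo*?/)    # ["b", "b"]    non-greedy *
p WORD.scan(/bo+/)     # ["boo"]       greedy +
p WORD.scan(/bo+?/)    # ["bo"]        non-greedy +
p WORD.scan(/bo*b/)    # ["boob"]      the os are needed to reach the second b
p WORD.scan(/bo*?b/)   # ["boob"]      same result, even though *? is non-greedy
```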

All in all: the longest possible string is taken unless you use non-greedy operators. The non-greedy operators will try to get the shortest possible string.
"boobs".gsub(/o*?/) {|s| p s} will give nothing but ""s.
As a finale I will talk about escaping characters. You may have wondered how to search for the characters that have a special meaning, like *, | and /.
The trick is to escape them. That is what it is called. You basically just have to put a \ in front of them.
\*, \| and \/. To get \ just use \\.
I have already shown an example where I escape the dot. (\.)
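
A few escaped characters in action, as a sketch. Regexp.escape is a standard Ruby method that adds the backslashes for you; that it behaves the same inside RGSS is an assumption here. The helper name literal_pattern is made up.

```ruby
p("3*4" =~ /\*/)            # 1, a literal star
p "a|b".split(/\|/)         # ["a", "b"], a literal |
p("Data/Map001" =~ /\//)    # 4, a literal /
p("C:\\Games" =~ /\\/)      # 2, a literal \

# Hypothetical helper: builds a pattern that matches the text exactly as written.
def literal_pattern(text)
  Regexp.new(Regexp.escape(text))
end

p Regexp.escape("1+1=2?")                     # "1\\+1=2\\?"
p("is 1+1=2?" =~ literal_pattern("1+1=2?"))   # 3
```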

There are still loads of operators I have not shown. Some are a bit advanced, some are not. They will not be included in this tutorial, except for the back-reference shown in the next section.
Until I make another tutorial or extend this one, you can have fun discovering how they work on your own ^_^

Non-regular languages accepted

It is a bit ironic that the regular expression implementation in Ruby also accepts some non-regular languages.
It is the back-references I am talking about.
Look at this example: /(wii|psp)\1/
It accepts wiiwii and psppsp, but neither wiipsp nor pspwii.
You can use back-references to make non-regular languages.
I am not going to supply either proof or example. You can google it yourself if you are in doubt ^_^
One problem with back-references is speed. In the worst case matching goes from polynomial time to exponential time. If the regular expressions have been implemented even a little sensibly, the slowdown will only affect regular expressions that use the extra functionality.
It should not be too much of a problem with short regular expressions, but it is still something to consider.
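
A sketch of the back-reference in use. The first pattern is the one from the text with \A and \z added so that the whole word has to match. The second pattern is not from the original text: it is the common textbook illustration of a non-regular language (any word over a and b written twice in a row), added here as a labelled assumption. Both constant names are made up.

```ruby
DOUBLED_CONSOLE = /\A(wii|psp)\1\z/   # \1 must repeat exactly what group 1 matched
SAME_TWICE      = /\A([ab]*)\1\z/     # any word of as and bs, written twice

p("wiiwii" =~ DOUBLED_CONSOLE)   # 0
p("psppsp" =~ DOUBLED_CONSOLE)   # 0
p("wiipsp" =~ DOUBLED_CONSOLE)   # nil

p("abab" =~ SAME_TWICE)          # 0, "ab" followed by "ab"
p("abaaba" =~ SAME_TWICE)        # 0, "aba" followed by "aba"
p("abba" =~ SAME_TWICE)          # nil, no word repeated twice gives abba
```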

Examples

A couple of examples for reference and guidance ^_^

Example 1
I will start by giving some code:

Code: [Select]

files = Dir["Data/*"]
files.each {|s| s.gsub!(/\.[^\.]*$|[^\/]*\//) {|str| ""}}

The filenames of all the files in the Data directory are stored in the files variable as strings in an Array. (subdirectories and their contents not included)
That is what the Dir["Data/*"] command returns.
The next line calls gsub!(/\.[^\.]*$|[^\/]*\//) {|str| ""} on each filename. (Remember that it is a string)
Now we finally come to the big regular expression: /\.[^\.]*$|[^\/]*\//
Notice the |. This means that we accept strings that fulfil either what comes before it or what comes after it. If a string is accepted in both cases it will still be accepted.
Let us look at the first half, \.[^\.]*$. This looks like something we have seen before. Since $ means end of string/line it basically does the same thing as \Z here. We have already run through this example; it removes the extension.
Next we will look at the second half, [^\/]*\/. Remember the bit about escaping?
[^\/] means any character but /. The star means any number of them.
It is followed by \/ which means the character /. The last character MUST be /
Since the greedy version of the star is used, it takes as many non-slash characters as it can before the /, and gsub! repeats the match for every directory in the path.
So this basically finds the directory bit of the path.
This means that a match is either the extension or the path before the filename. We remove these parts by substituting the found strings with "".

This can be used if you for some reason want to draw the file names in a location without the path and extension. (You might have them elsewhere.)
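
The same expression can be tried on plain strings without touching the disk. In this sketch the paths are invented examples, and PATH_OR_EXT and bare_name are made-up names. The last line uses File.basename from standard Ruby as a cross-check; whether you prefer it over the regular expression is a matter of taste.

```ruby
PATH_OR_EXT = /\.[^\.]*$|[^\/]*\//

# Hypothetical helper: drops the directories and the last extension.
def bare_name(path)
  path.gsub(PATH_OR_EXT, "")
end

p bare_name("Data/Actors.rxdata")         # "Actors"
p bare_name("Data/Maps/Map001.rxdata")    # "Map001", both directories are removed
p bare_name("Data/Read.me.txt")           # "Read.me", only the last extension goes

# Standard Ruby alternative that gives the same result for these paths.
p File.basename("Data/Read.me.txt", ".*") # "Read.me"
```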

Exercises

I have made up a couple of exercises you can try your hand at. I have not provided the answers and I do not plan to provide them.
I consider them more of a stimulant, a way to help you learn regular expressions.
If you really want to check your answers then PM me. Do not post your answers here.

Exercise 1
Let str be an actual string. What does this code do?

Code: [Select]

str.gsub!(/.*\..*/) {|s| ""}

Exercise 2
You have one long string that contains a certain number of words. Whitespace characters separate the words.
These words have a structure. They all start with either "donkey_" or "monkey_".
What comes after the "_" differs from word to word.
What you want is to separate the monkeys from the donkeys.
You want to put the donkeys in a donkey array and the monkeys in a monkey array.
Make the 2 arrays beforehand and use just one gsub call to separate the monkeys from the donkeys.

Credits and thanks
This article has been made by Zeriab, please give credit.
Credits to ZenSpider.com for the reference list.

Thanks to:
Ragnarok Rob for the example word "boobs". (An amazingly good word :P)
SephirothSpawn for getting me to do this tutorial.
Nitt (jimme reashu) for reporting a bug

Sources and useful links

ZenSpider - http://www.zenspider.com/Languages/Ruby/QuickRef.html#11
Ruby Central - http://www.rubycentral.com/ref/ref_c_regexp.html
Regular-Expressions.info - http://www.regular-expressions.info/

#2 • May 21, 2007

Good tutorial.

I've never used regular expressions before but they seem quite helpful. Thank you for writing this.

#3 • Mar 11, 2008

I know this is a double post and a necro, but I've recently been drawing on this a lot for a script I am writing and I wanted to thank you legitimately. When I last posted, I had no clue how useful this tutorial actually was and I understood very little. So anyway, thanks, the tutorial is great!

About my assignment, I haven't had time to work on it yet, but I will try to get around to it. I have a ton of school work to do though and I am almost finished the script that I have been working on (at least a beta).

#4 • Mar 11, 2008

Aww, thanks

*blushes*
I'm glad you find it useful

*gives Modern a cry-hug*

#5 • Jan 14, 2010

i have a problem

Code: [Select]

string= "<msg hello>"
#so i use reg exp like this
code = /<msg (.*)>/.match(string).to_a
code.delete_at(0)
p code #==> ["hello"]#nice

but

Code: [Select]

string = "<msg hello><msg another message>"
#again, i write like this
code = /<msg (.*)>/.match(string).to_a
code.delete_at(0)
p code #==> ["msg hello"]# huh, what's happen

help me, i want final array return are ["msg hello","msg another message"]

#6 • Jan 14, 2010

Hi ngoaho

First of all the second case looks like this since * is greedy:

Code: [Select]

string = "<msg hello><msg another message>"
#again, i write like this
code = /<msg (.*)>/.match(string).to_a
code.delete_at(0)
p code #==> ["hello><msg another message"]# huh, what's happen

You should use the non-greedy version *? instead. Of course you still won't get more than one message that way.
Take a look at String#scan since it is very helpful in your situation.

Code: [Select]

# The string to decode
string = "<msg hello><msg another message>"
# Let's make an array to fill up
array = []
# Scan for the messages and add them to the array
string.scan(/<msg (.*?)>/) {|x| array << $1}
p array #==> ["hello","another message"]
# Another way to use scan:
p string.scan(/<msg (.*?)>/) #==> [["hello"],["another message"]]

Good luck with whatever you are scripting.

*hugs*

#7 • Jan 14, 2010

ye, great, i love you zeriab.

#8 • Jan 21, 2010

an other problem,

Code: [Select]

p "<if 1==2;<msg aaaaaa>>".scan(/<(.*?)>/)# ==> ["if","1==2;<msg aaaaaa"]

">" missing, help me, i want array return are ["if","1==2;<msg aaaaaa>"]

#9 • Jan 21, 2010

The problem is that you want to have the same number of < and >.
This cannot be expressed in a regular expression. (Unless you are allowing infinitely long expressions)
I don't think you can use back references to solve the problem either.

You could use a different symbol for the if construct. That wouldn't allow any nesting, so it may not solve the problem for you.
You could use references to solve the problem. For example <ref c1><msg aaaaaa></ref c1> and then <if 1==2; \ref[c1]>.
You could also write a parser which counts the number of the symbols. A recursive descent parser would for example solve the problem.
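
A minimal sketch of the counting idea, assuming the only goal is to pull out the top-level <...> groups. The method name top_level_tags is made up, and this is a simple depth counter rather than a full recursive descent parser.

```ruby
# Hypothetical helper: returns the contents of each top-level <...> group.
def top_level_tags(text)
  tags  = []
  depth = 0      # how many < are currently open
  start = nil    # index just after the outermost <
  0.upto(text.length - 1) do |i|
    case text[i, 1]
    when "<"
      start = i + 1 if depth == 0
      depth += 1
    when ">"
      next if depth == 0                      # ignore an unmatched >
      depth -= 1
      tags << text[start...i] if depth == 0   # the outermost group just closed
    end
  end
  tags
end

p top_level_tags("<if 1==2;<msg aaaaaa>>")            # ["if 1==2;<msg aaaaaa>"]
p top_level_tags("<msg hello><msg another message>")  # ["msg hello", "msg another message"]
```

Each result can then be split further with the usual string methods, for example split(" ", 2) to separate the command from the rest.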

*hugs*

#10 • Jan 21, 2010

hmmm, thank you, i'll learn more about it.