realbasic-nug

Re: Regex Hanging

To: REALbasic NUG <realbasic-nug at lists dot realsoftware dot com>, dda <headspin at gmail dot com>
Subject: Re: Regex Hanging
From: Stan Busk <maxprog at mac dot com>
Date: Tue, 31 Aug 2004 21:26:26 +0200
Cc:
Delivered-to: realbasic-nug at lists dot realsoftware dot com
References: <A0D127F0-FA90-11D8-9119-000A95C377AA at mac dot com> <eb8fc19304083010536bd90fa9 at mail dot gmail dot com>
Hi,

Thanks a lot for your response, but none of your solutions actually work (at least not with my code). Only mine does: removing the trailing '$'.

~/Stan

No need to run it, I believe you!
May I suggest first that when you write regexes, you don't build them
like legos, as you did, since the end result is invisible. Your regex
looks like this, in fine:

^[a-zA-Z0-9!#\$%&\+\-/=\?\^_{}~\.']+@(\w+[-.]?)+\.([a-zA-Z]+){2,4}$

Bound to lock up the engine, mind. The PCRE engine, which RB uses, is
an NFA engine, which means, among other things, that it can be tricked
into endless loops (backtracking forever). A DFA engine would fail this
pattern pronto, but here, nonono... The end $ is the killer. When the
match cannot succeed at the end of the string, it forces the engine to
back up again and again and again... trying every other way of
splitting the input.
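
To get a feel for the scale, here is a rough sketch in Python (not RB, and not a model of PCRE's real step count) that counts how many different ways a group shaped like (\w+[-.]?)+ can carve up the domain of the address from the original post. The helper name count_splits is made up for this illustration; the domain labels are taken from Stan's test address.

```python
# Hypothetical helper for illustration: counts the ways a run of
# `run_length` word characters can be cut into one or more non-empty
# chunks. Each chunk corresponds to one pass through the outer "+" of a
# group shaped like (\w+[-.]?)+.
def count_splits(run_length):
    ways = [0] * (run_length + 1)
    ways[0] = 1  # one way to cover zero characters: no chunks yet
    for end in range(1, run_length + 1):
        # The last chunk can start at any earlier cut position.
        ways[end] = sum(ways[start] for start in range(end))
    return ways[run_length]


# Domain labels of the address that hangs in the original post.
labels = ["martinuscollege", "grootebroek", "kennisnet"]

total = 1
for label in labels:
    splits = count_splits(len(label))
    total *= splits
    print(label, len(label), splits)

# The dots inside the domain group can only be taken by [-.]?, so the
# labels split independently and the counts multiply.
print("combined:", total)
```

A label of n letters can be split in 2^(n-1) ways, so the three labels together allow 2^32 (roughly 4.3 billion) combinations. When the tail of the pattern can never match, as with the one-letter TLD, a backtracking engine has to reject combinations on that order before it gives up.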

Not that just removing the $ will fix all your problems. It WILL
prevent the engine from locking up, but it won't match the full email
address. On the other hand, this will:

^[a-zA-Z0-9!#\$%&\+\-/=\?\^_{}~\.']+@(\w+[-.]?)+(\.[a-zA-Z]+){2,4}$
(with or without the ending $, but without it you will need to set
greediness to true).
You see, the real problem was the position of \. in your pattern.
Outside the last brackets, the pattern meant "[....] and finally a dot
and 2 to 4 times several letters", whereas it should've been inside,
meaning "and finally 2 to 4 times (a dot and several letters)".
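
A small Python check shows how the two placements differ on short suffixes. Python's re module stands in for RB's RegEx here, on the assumption that both treat these simple constructs the same way. Note that the code in the original post actually builds a third variant, \.([a-zA-Z]{2,4}), which is included for comparison. The variable names are made up for this sketch.

```python
import re

# One dot, then 2 to 4 groups of one or more letters.
dot_outside = re.compile(r"\.([a-zA-Z]+){2,4}")
# 2 to 4 groups, each a dot followed by one or more letters.
dot_inside = re.compile(r"(\.[a-zA-Z]+){2,4}")
# What the posted code builds: one dot, then 2 to 4 letters.
as_posted = re.compile(r"\.([a-zA-Z]{2,4})")

for suffix in [".com", ".co.uk", ".n"]:
    print(
        suffix,
        bool(dot_outside.fullmatch(suffix)),
        bool(dot_inside.fullmatch(suffix)),
        bool(as_posted.fullmatch(suffix)),
    )
```

Here ".com" matches dot_outside and as_posted but not dot_inside, which needs at least two dot-plus-letters groups; ".co.uk" matches only dot_inside; ".n" matches none of them.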

Finally, I'd like to offer another optimization to this regex:

^[a-zA-Z0-9!#\$%&\+\-/=\?\^_{}~\.']+@(\w[-\w]+)+(\.[a-zA-Z]+){2,4}$

Since we took care of the dot in the last set of parens, we don't need
it anymore in the first half of the domain matching pattern. Saved us
some backtracking -- caused by [-.]?
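
Another way to cut the backtracking, sketched here in Python as an alternative that nobody in this thread proposed: make the separator mandatory between chunks, so a run of letters can be split in only one way. The pattern and the names SAFE_EMAIL and looks_like_email are assumptions for illustration, not a tested replacement for the RB code, and \w in Python may accept more characters than it does in RB.

```python
import re

# User part copied from the thread; domain rewritten as one label,
# then zero or more (separator + label), then the TLD.
SAFE_EMAIL = re.compile(
    r"^[a-zA-Z0-9!#\$%&\+\-/=\?\^_{}~\.']+"  # user name, as in the thread
    r"@"
    r"\w+(?:[-.]\w+)*"                       # labels joined by a required - or .
    r"\.[a-zA-Z]{2,4}$"                      # TLD of 2 to 4 letters
)


def looks_like_email(address):
    # True when the whole address fits the sketch pattern above.
    return SAFE_EMAIL.match(address) is not None


print(looks_like_email("user@mail.example.com"))
print(looks_like_email("ggoos@martinuscollege.grootebroek.kennisnet.n"))
```

With the separator required, giving back a letter from one label never opens a new way to continue, so the one-letter TLD from the original post should be rejected after a short, bounded amount of backtracking.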

HTH


--
dda

anger is more useful than despair
Mac OS X 10.3.5, RB 5.5.3 and up



On Mon, 30 Aug 2004 16:26:48 +0200, Stan Busk <maxprog at mac dot com> wrote:
Hi,

The code below uses a very simple Regex expression to
validate e-mail addresses. I created that expression after reading the
corresponding internet RFC and testing it with thousands of addresses.
However, there is one type of invalid e-mail address that simply
hangs the app and shows the beach ball. Such an address has a TLD
of only one character rather than the correct 2 to 4 characters. The
code is pasted below. It seems the RB Regex implementation has a problem
with the extension part '([a-zA-Z]{2,4})'. Anybody on this list can copy
this code as is and run it to see the result. I have tried with RB
5.5.1 to 5.5.4 on Mac OS X.

   Dim myRegEx As new RegEx
   Dim myRegexMatch As RegExMatch
   Dim UserName, Domain, Extention As String

   // Pattern fragments: user name, domain, and a TLD of 2 to 4 letters
   UserName  = "[a-zA-Z0-9!#\$%&\+\-/=\?\^_{}~\.']+"
   Domain    = "(\w+[-.]?)+"
   Extention = "([a-zA-Z]{2,4})"

   myRegEx.options.Greedy = false
   myRegEx.options.TreatTargetAsOneLine = true
   myRegEx.SearchPattern = "^" + UserName + "@" + Domain + "\." + Extention + "$"

   // One-letter TLD: this is the search that hangs
   myRegexMatch = myRegEx.search("ggoos@martinuscollege.grootebroek.kennisnet.n")

~/Stan
_______________________________________________
Unsubscribe or switch delivery mode:
<http://www.realsoftware.com/support/listmanager/>

Search the archives of this list here:
<http://www.realsoftware.com/listarchives/lists.html>


