[IronPython] differences in IronPython/CPython regular expressions?
janssen at parc.com
Wed Jun 1 16:44:16 PDT 2011
Jeff Hardy <jdhardy at gmail.com> wrote:
> On Wed, Jun 1, 2011 at 4:03 PM, Bill Janssen <janssen at parc.com> wrote:
> > I have a large RE (223613 chars) that works fine in CPython 2.6, but
> That's truly horrible, but I assume you have a good reason for it.
Hi, Jeff. Yes, I think so.
> > seems to produce an endless loop in IronPython (see below). I'm using
> > Mono 2.10 (.NET 4.0.x) on Ubuntu, with IronPython 2.7. Anyone have
> > pointers to the differences between them? Is
> > System.Text.RegularExpressions in .NET configurable in some fashion
> > that might help?
> First off, is there a reason you don't use re.IGNORECASE? That would
> cut the regex in half, at least.
Sure. Names are sensitive to capitalization; the rule I'm implementing says
names are either capitalized or upper-case.
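
To make the rule concrete, here is a minimal sketch of how such a case-sensitive pattern might be assembled from a name list. The helper name build_name_pattern and the two sample names are hypothetical, and the real expression in this thread is far larger; this only illustrates why each name contributes two alternatives instead of one.

```python
import re

def build_name_pattern(names):
    """Return a compiled, case-sensitive regex that matches each name
    in either Capitalized or UPPER-CASE form (hypothetical helper)."""
    alternatives = []
    for name in names:
        # Exactly two spellings are accepted per name: "Smith" and "SMITH".
        for form in (name.capitalize(), name.upper()):
            alternatives.append(re.escape(form))
    # Word boundaries keep "Smith" from matching inside "Smithson".
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b")

pattern = build_name_pattern(["smith", "jones"])
```

With re.IGNORECASE the alternation would be half as long, but it would also accept "smith" and "SmItH", which the rule excludes.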
> For the most part, CPython and IronPython regexes should be fairly
> compatible - IronPython takes the regex and massages it to work with
> System.Text.RE, but the changes are pretty straightforward and small,
Are those changes documented anywhere?
> and I don't think the re you provided hits any of them. It's quite
> possible that the Mono version of System.Text.RE can't handle the
> expression; you could test this by saving the full regex and building a
> small C# program that runs it. The regex template has a lot of
> potential backtracking in it; are you sure it's not caught in a
> pathological (exponential) case?
No; all I'm sure of is that this runs in 1.2 seconds in CPython, and
takes up a core for 15 minutes (till I kill it) with IronPython/Mono.
Something is clearly hitting a bug somewhere... I suppose I should
try it on Windows.
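
A small timing harness along these lines could be run unchanged under CPython, IronPython on Mono, and IronPython on Windows to compare the three. It is only a sketch: time_search is a hypothetical helper, and it separates compile time from search time because either step could be the slow one.

```python
import re
import time

def time_search(pattern_text, text):
    """Compile pattern_text, search text once, and return
    (compile_seconds, search_seconds, matched)."""
    start = time.time()
    compiled = re.compile(pattern_text)
    compile_seconds = time.time() - start

    start = time.time()
    match = compiled.search(text)
    search_seconds = time.time() - start

    return compile_seconds, search_seconds, match is not None

# Tiny stand-in pattern; the real one would be read from a file.
compile_s, search_s, matched = time_search(r"\b(?:Smith|SMITH)\b", "Mr. Smith")
```

Running it on progressively longer prefixes of the name list would show whether the time grows gradually or blows up at some size.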
> Finally, is one ginormous regex really the best way to do this? Have you
> tried other approaches?
No need, until I hit .NET. I'm used to working with a full-featured
finite-state machine (PARC's xfst; see
http://www.cis.upenn.edu/~cis639/docs/xfst.html), and was wondering if
we could do similar things with Python's RE machinery. Long lists like
these names are often used for lists of companies or cities or such.
People's names are actually a fairly simple and short example of this :-).
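
For comparison, one alternative to a single huge alternation is a small fixed tokenizer plus a set lookup. This is a sketch under a strong assumption, namely that every entry is a single word; multi-word company or city names would need a different tokenizer. The names WORD and find_names are hypothetical.

```python
import re

# Small, fixed tokenizer: capitalized words or all-upper-case words.
WORD = re.compile(r"\b[A-Z][a-z]+\b|\b[A-Z]{2,}\b")

def find_names(text, names):
    """Return the tokens in text that are known names, in order.

    names is an iterable of lower-case single-word names (an assumption
    of this sketch).
    """
    allowed = set()
    for name in names:
        allowed.add(name.capitalize())
        allowed.add(name.upper())
    return [m.group(0) for m in WORD.finditer(text) if m.group(0) in allowed]

found = find_names("Mr. Smith met JONES and smith and Brown", ["smith", "jones"])
```

The regex stays a few dozen characters long however many names there are, so the engine's handling of very large patterns no longer matters.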