Opened 6 years ago

Closed 6 years ago

Last modified 6 years ago

#2540 closed bug (fixed)

[Text.Regex] incorrect word boundary ("\\b") substitutions. Bug in regex-compat's subRegex handling of BOL flags.

Reported by: Eelis-
Owned by: ChrisKuklewicz
Priority: normal
Milestone: Not GHC
Component: libraries (other)
Version: 6.8.3
Keywords: regex regex-compat
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Difficulty: Unknown

Description

Consider:

  import Text.Regex
  main = putStrLn $ subRegex (mkRegex "\\b(.)") "abcdef" "|\\1"

This outputs "|a|b|c|d|e|f", while it really should output "|abcdef" (at least according to Perl and Ruby).

Change History (6)

comment:1 Changed 6 years ago by dons

Inconsistency between the regex.h C library and PCRE?

comment:2 Changed 6 years ago by igloo

  • Difficulty set to Unknown
  • Milestone set to Not GHC
  • Owner set to TextRegexLazy@…

I'd expect it to do the same things as sed:

$ echo "abcdef" | sed -r 's/\b(.)/|\1/g'
|abcdef

i.e. it looks like a bug to me.

Looks like the problem is how subRegex recurses on what comes after the match
(trail):

        case matchRegexAll regexp inp of
            Nothing -> inp
            Just (lead, match, trail, groups) ->
              lead ++ lookup match repl groups ++ (subRegex regexp trail repl)

Christopher, I've assigned it to you as the regex libraries maintainer.

comment:3 Changed 6 years ago by ChrisKuklewicz

  • Architecture changed from x86_64 (amd64) to Multiple
  • Keywords regex-compat added
  • Operating System changed from Unknown to Multiple
  • Owner changed from TextRegexLazy@… to ChrisKuklewicz
  • Status changed from new to assigned
  • Summary changed from [Text.Regex] incorrect word boundary ("\\b") substitutions to [Text.Regex] incorrect word boundary ("\\b") substitutions. Bug in regex-compat's subRegex handling of BOL flags.

Ah bollocks, there is a bug here but it is subtle. The above
complaint is actually to do with the lack of support for GNU
extensions to regex/sed. The regex-posix library expects to implement
just the POSIX regular expressions and none of the different
extensions. This is also consistent with the BSD sed.

The actual c-library calls in regex-posix are regcomp and regexec (and
regfree, regerror).

In GNU regex/sed (I tested version 4.1.5 on linux) the \b means a word
boundary. I assume that this is also the case in Perl and Ruby. Thus
\b matches only at the front of the abcdef word for these systems.

In POSIX sed the \b is not recognized as a known escape, but is
accepted as a literal b. So it matches the bc in abcdef and is
replaced by |c.

On Mac OS 10.5.4 the equivalent to -r is -E and then:

$ echo "abcdef" | sed -E  's/\b(.)/|\1/g'
a|cdef

With ghc version 6.8.3 on OS X I get the same answer as POSIX sed:

Prelude> :m +Text.Regex
Prelude Text.Regex> subRegex (mkRegex "\\b(.)") "abcdef" "|\\1"
"a|cdef"

On linux I can reproduce the bug report:

Prelude Text.Regex>  subRegex (mkRegex "\\b(.)") "abcdef" "|\\1"
"|a|b|c|d|e|f"

Note that man 3 regexec and man 7 regex on linux do not describe
the \b behavior. It is mis-documented.

But there is a further problem: Change \b to ^ and it is clear that
Text.Regex is getting the wrong answer on all systems. On OS X:

$ echo "abcdef" | sed -E  's/^(.)/|\1/'
|abcdef
Prelude Text.Regex>  subRegex (mkRegex "^(.)") "abcdef" "|\\1"
"|a|b|c|d|e|f"

So there is a bug to fix with respect to ^. Fixing this may also
accidentally fix the \b handling on GNU systems. I thought I had
added enough 'execNotBOL' (REG_NOTBOL) flags to cover all these cases,
but regex-compat's subRegex is clearly not clever enough.
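
The ^ failure can be modelled without any regex library. The sketch below is a toy, not regex-compat's code: matchCaretDot is a hypothetical stand-in for a compiled "^(.)" regex whose Bool argument plays the role of the "input starts at beginning of line" context (the opposite of REG_NOTBOL), and naiveSub / awareSub are illustrative names. It only shows why recursing on the trail as if it were a fresh input lets ^ match again at every position.

```haskell
-- Toy stand-in for a compiled "^(.)" regex (hypothetical, for illustration).
-- The Bool says whether the input begins at the start of a line.
-- The result mimics the (lead, match, trail) part of matchRegexAll.
matchCaretDot :: Bool -> String -> Maybe (String, String, String)
matchCaretDot True (c : rest) = Just ("", [c], rest)
matchCaretDot _    _          = Nothing

-- Naive substitution: every recursive call treats the trail as a fresh
-- input, so the matcher believes it is at the beginning of a line again.
naiveSub :: String -> String
naiveSub inp =
  case matchCaretDot True inp of
    Nothing               -> inp
    Just (lead, m, trail) -> lead ++ "|" ++ m ++ naiveSub trail

-- Context-aware substitution: after the first match, the remainder is
-- flagged as "not at beginning of line", so ^ cannot match a second time.
awareSub :: String -> String
awareSub = go True
  where
    go atBOL inp =
      case matchCaretDot atBOL inp of
        Nothing               -> inp
        Just (lead, m, trail) -> lead ++ "|" ++ m ++ go False trail
```

In this model, naiveSub "abcdef" gives "|a|b|c|d|e|f" and awareSub "abcdef" gives "|abcdef", mirroring the wrong and expected answers above.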

I will update this bug report when there is a fixed version to announce.

comment:4 Changed 6 years ago by ChrisKuklewicz

  • Resolution set to fixed
  • Status changed from assigned to closed

I have uploaded regex-compat 0.92 and regex-posix 0.93.2 to hackage (they are also in darcs). These contain two changes:

regex-posix's Wrap.hsc now defines _POSIX_C_SOURCE, which should (untested) stop GNU systems from trying to handle non-POSIX escapes like \b.

regex-compat's splitRegex and subRegex are changed to use matchAllText and not to use recursion. This may (untested) create differences in behavior compared to the old version, including laziness.
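
As a rough illustration of the non-recursive approach (not the actual regex-compat 0.92 source), the sketch below assumes the matches have already been found in one scan over the whole input and are available as ascending, non-overlapping (offset, length) pairs, similar in spirit to the positions matchAllText reports. The name spliceMatches and its render argument are hypothetical.

```haskell
-- Replace each (offset, length) span of the input with `render` applied to
-- the matched text. Spans are assumed to be in ascending order and
-- non-overlapping, as produced by a single scan over the whole input.
spliceMatches :: (String -> String) -> [(Int, Int)] -> String -> String
spliceMatches render = go 0
  where
    -- `pos` is the absolute offset of the first character of `rest`.
    go _ [] rest = rest
    go pos ((off, len) : more) rest =
      let (before, fromMatch) = splitAt (off - pos) rest
          (matched, after)    = splitAt len fromMatch
      in before ++ render matched ++ go (off + len) more after
```

Because the matcher sees the whole string once, anchors such as ^ are evaluated against the original input rather than against each leftover trail. For example, spliceMatches ('|' :) [(0, 1)] "abcdef" yields "|abcdef".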

Both require the newer regex-base to compile and run, the current version of which is 0.93.1 at this time.

comment:5 Changed 6 years ago by simonmar

  • Architecture changed from Multiple to Unknown/Multiple

comment:6 Changed 6 years ago by simonmar

  • Operating System changed from Multiple to Unknown/Multiple