Opened 9 years ago

Closed 9 years ago

Last modified 9 years ago

#2540 closed bug (fixed)

[Text.Regex] incorrect word boundary ("\\b") substitutions. Bug in regex-compat's subRegex handling of BOL flags.

Reported by: Eelis-
Owned by: ChrisKuklewicz
Priority: normal
Milestone: Not GHC
Component: libraries (other)
Version: 6.8.3
Keywords: regex regex-compat
Cc:
Operating System: Unknown/Multiple
Architecture: Unknown/Multiple
Type of failure: None/Unknown
Test Case:
Blocked By:
Blocking:
Related Tickets:
Differential Rev(s):
Wiki Page:



  import Text.Regex
  main = putStrLn $ subRegex (mkRegex "\\b(.)") "abcdef" "|\\1"

This outputs "|a|b|c|d|e|f", while it really should output "|abcdef" (at least according to Perl and Ruby).

Change History (6)

comment:1 Changed 9 years ago by dons

Inconsistency between the regex.h C library and PCRE?

comment:2 Changed 9 years ago by igloo

difficulty: Unknown
Milestone: Not GHC
Owner: set to TextRegexLazy@…

I'd expect it to do the same things as sed:

$ echo "abcdef" | sed -r 's/\b(.)/|\1/g'

i.e. it looks like a bug to me.

Looks like the problem is how subRegex recurses on what comes after the match (trail):

        case matchRegexAll regexp inp of
            Nothing -> inp
            Just (lead, match, trail, groups) ->
              lead ++ lookup match repl groups ++ (subRegex regexp trail repl)

Christopher, I've assigned it to you as the regex libraries maintainer.

comment:3 Changed 9 years ago by ChrisKuklewicz

Architecture: changed from x86_64 (amd64) to Multiple
Keywords: regex-compat added
Operating System: changed from Unknown to Multiple
Owner: changed from TextRegexLazy@… to ChrisKuklewicz
Status: changed from new to assigned
Summary: changed from [Text.Regex] incorrect word boundary ("\\b") substitutions to [Text.Regex] incorrect word boundary ("\\b") substitutions. Bug in regex-compat's subRegex handling of BOL flags.

Ah bollocks, there is a bug here but it is subtle. The above complaint is actually to do with the lack of support for GNU extensions to regex/sed. The regex-posix library expects to implement just the POSIX regular expressions and none of the different extensions. This is also consistent with the BSD sed.

The actual C library calls in regex-posix are regcomp and regexec (and regfree, regerror).

In GNU regex/sed (I tested version 4.1.5 on Linux) the \b means a word boundary. I assume that this is also the case in Perl and Ruby. Thus \b matches only at the front of the abcdef word for these systems.

In POSIX sed the \b is not recognized as a known escape, but is accepted as a literal b. So it matches the bc in abcdef and is replaced by |c.
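As a rough illustration of the literal-b reading described above, here is a toy Haskell function that hard-codes the pattern instead of calling regcomp/regexec. The name subLiteralB is hypothetical and not part of any regex library; it only mimics "a literal b followed by any one character, replaced by | plus that character", applied globally.

```haskell
-- Toy model of how a strict POSIX engine reads "\b(.)":
-- a literal 'b' followed by any single character (the capture group).
-- Hypothetical helper for illustration; not part of regex-posix.
subLiteralB :: String -> String
subLiteralB ('b' : c : rest) = '|' : c : subLiteralB rest -- "b?" becomes "|?"
subLiteralB (x : rest)       = x : subLiteralB rest       -- copy everything else
subLiteralB []               = []
```

Under this reading, subLiteralB "abcdef" evaluates to "a|cdef": only the bc pair is touched.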

On Mac OS 10.5.4 the equivalent to -r is -E and then:

$ echo "abcdef" | sed -E  's/\b(.)/|\1/g'

With GHC version 6.8.3 on OS X I get the same answer as POSIX sed:

Prelude> :m +Text.Regex
Prelude Text.Regex> subRegex (mkRegex "\\b(.)") "abcdef" "|\\1"

On Linux I can reproduce the bug report:

Prelude Text.Regex>  subRegex (mkRegex "\\b(.)") "abcdef" "|\\1"

Note that man 3 regexec and man 7 regex on Linux do not describe the \b behavior, so this extension is undocumented there.

But there is a further problem: Change \b to ^ and it is clear that Text.Regex is getting the wrong answer on all systems. On OS X:

$ echo "abcdef" | sed -E  's/^(.)/|\1/'
Prelude Text.Regex>  subRegex (mkRegex "^(.)") "abcdef" "|\\1"

So there is a bug to fix with respect to ^. Fixing this may also accidentally fix the \b handling on GNU systems. I thought I had added enough 'execNotBOL' (REG_NOTBOL) flags to cover all these cases, but regex-compat's subRegex is clearly not clever enough.
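To make the ^ problem concrete, here is a minimal sketch that uses only base and a toy matcher in place of the real regex engine. The names matchCaretDot, naiveSub and fixedSub are hypothetical, and the Bool argument is only a stand-in for the idea behind REG_NOTBOL; this is not the regex-compat source.

```haskell
-- Toy stand-in for a compiled "^(.)": it matches the first character,
-- but only if the caller says the input starts at beginning-of-line.
-- Returns (lead, match, trail) like a simplified matchRegexAll.
matchCaretDot :: Bool -> String -> Maybe (String, String, String)
matchCaretDot atBOL (c : rest)
  | atBOL = Just ("", [c], rest)
matchCaretDot _ _ = Nothing

-- Naive substitution: every recursive call hands the trail to the matcher
-- as if it were a fresh line, so "^" matches again at each step.
naiveSub :: String -> String
naiveSub inp =
  case matchCaretDot True inp of
    Nothing               -> inp
    Just (lead, m, trail) -> lead ++ "|" ++ m ++ naiveSub trail

-- Same recursion, but the trail is marked as "not at beginning-of-line".
fixedSub :: String -> String
fixedSub = go True
  where
    go atBOL inp =
      case matchCaretDot atBOL inp of
        Nothing               -> inp
        Just (lead, m, trail) -> lead ++ "|" ++ m ++ go False trail
```

With this toy matcher, naiveSub "abcdef" gives "|a|b|c|d|e|f" (the shape of the reported output), while fixedSub "abcdef" gives "|abcdef".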

I will update this bug report when there is a fixed version to announce.

comment:4 Changed 9 years ago by ChrisKuklewicz

Resolution: fixed
Status: changed from assigned to closed

I have uploaded regex-compat 0.92 and regex-posix 0.93.2 to hackage (they are also in darcs). These contain two changes:

regex-posix's Wrap.hsc now defines _POSIX_C_SOURCE, which should (untested) make GNU systems stop trying to handle non-POSIX escapes like \b.

regex-compat's splitRegex and subRegex are changed to use matchAllText and not to use recursion. This may (untested) create differences in behavior compared to the old version, including laziness.

Both require the newer regex-base to compile and run, the current version of which is 0.93.1 at this time.
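The non-recursive approach can be sketched roughly as follows, assuming the matches have already been found against the whole input and are given as (offset, length) pairs in ascending, non-overlapping order. That is the kind of position information matchAllText reports, but the function below is a hypothetical base-only helper (substituteSpans), not the actual regex-compat 0.92 code.

```haskell
-- Rebuild the output from match spans computed once over the whole input,
-- so anchors like "^" are never re-evaluated on a truncated string.
-- Spans are (offset, length), ascending and non-overlapping (assumed).
substituteSpans :: (String -> String) -> [(Int, Int)] -> String -> String
substituteSpans f = go 0
  where
    -- pos is the absolute offset of the first character of the remaining text
    go _ [] rest = rest
    go pos ((off, len) : more) rest =
      let (before, fromMatch) = splitAt (off - pos) rest
          (matched, after)    = splitAt len fromMatch
      in before ++ f matched ++ go (off + len) more after
```

For example, a single match at offset 0 of length 1 gives substituteSpans ('|' :) [(0, 1)] "abcdef" == "|abcdef". Because the text between matches is only copied, never matched again, laziness and corner cases may differ from the recursive version.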

comment:5 Changed 9 years ago by simonmar

Architecture: changed from Multiple to Unknown/Multiple

comment:6 Changed 9 years ago by simonmar

Operating System: changed from Multiple to Unknown/Multiple