Opened 6 months ago

Closed 3 months ago

#12554 closed bug (fixed)

Testsuite exhibits large amount of framework failures

Reported by: Phyx- Owned by:
Priority: normal Milestone: 8.2.1
Component: Test Suite Version: 8.0.1
Keywords: Cc: RyanGlScott
Operating System: Windows Architecture: Unknown/Multiple
Type of failure: Incorrect result at runtime Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s): Phab:D2684
Wiki Page:

Description

I've tried this on multiple computers and they all have the same issue.

I get around 400 testsuite failures with the error:

  r:/temp/ghctest-0u4c8o/test   spaces/./th/T11680.run                                    T11680 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T11680.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T11797.run                                    T11797 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T11797.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T10734.run                                    T10734 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T10734.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T11345.run                                    T11345 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T11345.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T10820.run                                    T10820 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T10820.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T11484.run                                    T11484 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T11484.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T9022.run                                     T9022 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T9022.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T12130.run                                    T12130 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T12130.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T8761.run                                     T8761 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T8761.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T12407.run                                    T12407 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T12407.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T11463.run                                    T11463 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T11463.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T12478_4.run                                  T12478_4 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T12478_4.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T12478_3.run                                  T12478_3 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T12478_3.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T12513.run                                    T12513 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T12513.run')
   r:/temp/ghctest-0u4c8o/test   spaces/./th/T12530.run                                    T12530 [ext-interp] ([Error 183] Cannot create a file when that file already exists: 'r:/temp/ghctest-0u4c8o/test   spaces/./th/T12530.run')

What they all have in common is that they are all using the latest MSYS2 and tools:

Tamar@Squid MINGW64 ~/ghc
$ python3 --version
Python 3.5.2

Tamar@Squid MINGW64 ~/ghc
$ uname -a
MINGW64_NT-10.0 Kino 2.5.2(0.297/5/3) 2016-07-15 08:31 x86_64 Msys

Attachments (3)

Logfile.CSV (12.9 KB) - added by Phyx- 4 months ago.
I/O trace
cgrun035.zip (97.4 KB) - added by bgamari 4 months ago.
Zipped ProcMon log of a failure of cgrun035
ProcessName=cgrun035.exe.CSV (28.7 KB) - added by bgamari 4 months ago.
Different subset of the same trace, this time filtered on processname=cgrun035

Download all attachments as: .zip

Change History (37)

comment:1 Changed 6 months ago by Phyx-

Summary: Testsuite exibits large amount of framework failuresTestsuite exhibits large amount of framework failures

comment:2 Changed 5 months ago by bgamari

Indeed I am also seeing this on the new Windows build bot.

comment:3 Changed 4 months ago by Phyx-

Taking a look again, but my Python foo is weak..

comment:4 Changed 4 months ago by bgamari

I wonder if the testsuite driver is attempting to run the same test in multiple ways at concurrently, therefore attempting to copy the same file into place more than once. thomie, does this ring any bells?

comment:5 Changed 4 months ago by bgamari

Oh, right, we disable threading on Windows. Never mind then.

comment:6 Changed 4 months ago by Phyx-

Well, my TeamCity runs patch the testrunner to run in parallel.

The weird thing is, it doesn't happen on my old msys installs, with python 3.4.3, I don't know if this started because of a change in python or the msys runtime.

Another odd thing is that I would expect the tests that trigger this to be pretty random. But according to my teamcity data of the past ~1 month, the tests that fail are pretty stable. +- 3-4 tests each run, but all in the same category. Mostly the TH tests.

Last edited 4 months ago by Phyx- (previous) (diff)

comment:7 Changed 4 months ago by Phyx-

I have my doubts about the threadsafety of the testsuite. In particular, the location of the function that's failing is right after a rmtree call that's supposed to ensure the path doesn't exist. Yet it does.

Also I see that only relative paths are used in all directory name. I can only assume that it's relative to the cwd as explicit shell commands are issued to change the shell's cwd. Why the python call for this is not used I don't understand.

I must be missing something since this is reliably working on Linux..

comment:8 Changed 4 months ago by bgamari

Alright, I think I see what is going on here.

But first some background. The testsuite driver runs each testcase in its own fresh directory to ensure a clean test environment. The code responsible for this can be found in testlib.do_test(). This looks something like the following,

def do_test(name, way, func, args, files):
    opts = getTestOpts()
    shutil.rmtree(opts.testdir, ignore_errors=True)
    os.makedirs(opts.testdir)

    # Copy test files into testdir and run test
    ...

Note how we invoke rmtree with ignore_errors=True, presumably to account for the fact that testdir may not exist (e.g. if we are running in a fresh working directory). However, this unfortunately hides any real errors that we may encounter while trying to delete testdir, in which case we'll attempt to create testdir when it already exists.

This appears to be precisely what happens some fraction of the time on Windows. Namely, if we rework do_test as follows,

def do_test(name, way, func, args, files):
    opts = getTestOpts()
    if os.path.exists(opts.testdir):
        print('cleaning %s' % opts.testdir)
        shutil.rmtree(opts.testdir, ignore_errors=False)
    os.makedirs(opts.testdir)

    # Copy test files into testdir and run test
    ...

We find the following,

=====> TH_nameSpace(ext-interp) 39 of 42 [0, 3, 12]
cleaning ./th/TH_nameSpace.run
*** framework failure for TH_nameSpace(ext-interp) [Error 5] Access is denied: './th/TH_nameSpace.run/TH_nameSpace.exe' 

I suspect this is due to the fact that, on Windows, executable images of running programs are locked and therefore cannot be deleted. The strange thing here is that the process has presumably already terminated. It seems like the test process must still be in some sort of zombie state where it has terminated and we have its exit code, but its resources (e.g. executable image) have still not been released.

As one would expect, adding a time.sleep(0.1) before the rmtree hides the issue. It's not clear to me what the proper solution is here.

Last edited 4 months ago by bgamari (previous) (diff)

Changed 4 months ago by Phyx-

Attachment: Logfile.CSV added

I/O trace

comment:9 Changed 4 months ago by Phyx-

That (though filtering on the python process, will make a wider filter now, seems to indicate, that at least the stderr is still locked when it tries to delete the folder.)

comment:10 Changed 4 months ago by Phyx-

oh and I seem to get a different error from you

=====> TH_repGuardOutput(ext-interp) 17 of 239 [0, 9, 8]
cleaning ./TH_repGuardOutput.run
cd "./TH_spliceE5_prof.run" && $MAKE -s --no-print-directory TH_spliceE5_prof
*** framework failure for TH_repGuardOutput(ext-interp) [Error 32] The process cannot access the file because it is being used by another process: './TH_repGuardOutput.run/TH_repGuardOutput.comp.stderr'

Changed 4 months ago by bgamari

Attachment: cgrun035.zip added

Zipped ProcMon log of a failure of cgrun035

Changed 4 months ago by bgamari

Different subset of the same trace, this time filtered on processname=cgrun035

comment:11 Changed 4 months ago by bgamari

I spent quite a while today cleaning up the testsuite driver's Python 3 compatibility story and it seems that Python 3.5.2 is significantly less buggy than 2.7.11 with respect to this issue..

comment:12 Changed 4 months ago by bgamari

Here is a fairly minimal testcase which reproduces the issue. Phyx-, can you verify that this doesn't crash in your working environment? I used some random executable produced by the testsuite for test.exe.

#!/usr/bin/python

import sys
import subprocess
import os
import os.path
import time
import shutil

d = 'testdir'
f = d + '/test.exe'
log = d + '/log'
if os.path.exists(d):
    shutil.rmtree(d)

while True:
    os.makedirs(d)
    shutil.copy('test.exe', f)
    subprocess.check_call(f, stdout=open(log, 'w'))
    time.sleep(0.04)
    shutil.rmtree(d)
    print '.',
    sys.stdout.flush()
Last edited 4 months ago by bgamari (previous) (diff)

comment:13 Changed 4 months ago by Phyx-

@bgamari, yes the script works fine for me as well.

comment:14 Changed 4 months ago by Phyx-

that said, it also doesn't crash without the sleep in there. May be my test.exe is too simple? it's just "print 123" atm.

comment:15 Changed 4 months ago by thomie

Usage of ignore_errors=True was sloppy program, sorry about that.

It would be better to revert back to the original (working and commented) code:

def clean_full_path(name):
        try:
            # Remove files...
            os.remove(name)
        except OSError as e1:
            try:
                # ... and empty directories
                os.rmdir(name)
            except OSError as e2:
                # We don't want to fail here, but we do want to know
                # what went wrong, so print out the exceptions.
                # ENOENT isn't a problem, though, as we clean files
                # that don't necessarily exist.
                if e1.errno != errno.ENOENT:
                    print(e1)
                if e2.errno != errno.ENOENT:
                    print(e2)

On the real bug: you could try to eliminate any possible problems with timeout:

-     r = subprocess.call([timeout_prog, timeout, cmd],
+     r = subprocess.call(['sh', '-c', cmd],

On Windows the testsuite uses a timeout program written in Haskell, on all other platforms timeout is written in Python. I'm not quite sure why two separate versions exist, but all the code is in testsuite/timeout.

Fwiw: I couldn't reproduce this issue myself, last time I tried.

comment:16 Changed 4 months ago by bgamari

I pushed some patches which I was testing with to Phab:D2598. They cover a few things including Python 3 compatibility, working around the cleanup issue described above, disabling usage of symlinks on Windows (since they require administrator privileges)n and marking a few tests as broken. Unfortunately the cleanup workaround only narrows the Window and does not close it entirely.

Phyx- has isolated the cleanup problem as being a regression in the msys2-runtime package. 2.5.1 apparently works but 2.5.2 seems to fail.

comment:17 Changed 4 months ago by bgamari

Alright, so I've been trying to work out how to downgrade our Windows builder to a non-broken version of msys2-runtime (namely 2.5.1). It seems that the current installer executable ships with a 2.6 runtime, which is not compatible with the 2.5 series. Consequently it is necessary to use a release from July 2016 instead (http://repo.msys2.org/distrib/x86_64/msys2-base-x86_64-20160719.tar.xz). From this release you can downgrade the runtime with pacman -U. Namely I'm using http://repo.msys2.org/msys/x86_64/msys2-runtime-2.5.1-1-x86_64.pkg.tar.xz.

comment:18 Changed 4 months ago by bgamari

So we're still at it. The current state of things is that,

  • msys python-3.4.3 works with msys2-runtime < 2.5.2
  • msys python-3.4.3` does not work with runtime 2.5.1 nor 2.5.2
  • mingw python-2.7.11 does not work with either runtime
  • mingw python-3.5.2 does not work with either runtime

In all cases the failure mode is error 183 during test directory deletion.

In light of this it seems likely that there are two bugs:

  • CPython has a Windows compatibility bug (hence none of the mingw builds work)
  • msys2-runtime >= 2.5.2 has a regression

While the msys2-runtime issue is likely easier to track down since its a regression over a rather small diff, it's unfortunately not the issue we really care about. We'd far prefer to have mingw python working; unfortunately this is a much harder issue to debug.

Last edited 4 months ago by bgamari (previous) (diff)

comment:19 Changed 4 months ago by bgamari

Phyx- contributed another invaluable data point:

  • mingw python-3.4.3 works with runtime 2.5.1

So it seems that there is very likely a Python regression between 3.4.3 and 3.5.2

comment:20 Changed 4 months ago by bgamari

Phyx- also tested mingw python-3.4.3 with runtime 2.5.2. It failed, although perhaps not as frequently as with runtime 2.5.1.

comment:21 Changed 4 months ago by bgamari

Phyx said that his nightly server has been reliably running msys python-3.4.3 and msys2-runtime 2.5.0 for quite some time.

comment:22 Changed 4 months ago by Phyx-

Not my nightly, my main development machine.

comment:23 Changed 4 months ago by bgamari

Phyx- ran a run on his server to msys2-runtime 2.5.0 and mingw python-3.4.3 and found that it too failed.

Last edited 4 months ago by bgamari (previous) (diff)

comment:24 Changed 4 months ago by bgamari

Two runs with msys python-3.4.3 and msys-runtime-2.5.0 passed cleanly.

Last edited 4 months ago by bgamari (previous) (diff)

comment:25 Changed 4 months ago by RyanGlScott

Cc: RyanGlScott added

comment:26 Changed 4 months ago by Ben Gamari <ben@…>

In 9cb44598/ghc:

testsuite: Work around #12554

It seems that Python 2.7.11 and "recent" msys2 releases are broken,
holding open file locks unexpected. This causes rmtree to intermittently
fail. Even worse, it would fail silently (since we pass
ignore_errors=True), causing makedirs to fail later.

We now explicitly check for the existence of the test directory before
attempting to delete it and disable ignore_errors. Moreover, on Windows
we now try multiple times to rmtree the testdir, working around the
apparently msys bug.

This is all just terrible, but Phyx and I spent several hours trying to
track down the issue to no available. The workaround is better than
nothing.

comment:27 Changed 4 months ago by Ben Gamari <ben@…>

In 17d696f/ghc:

validate: Allow user to override Python interpreter

Due to #12554 and #12661 we must be quite picky about our choice of
Python interpreter on Windows. Allow the user to override it.

Test Plan: `PYTHON=/usr/bin/python2 ./validate` on Windows

Reviewers: austin, Phyx

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2603

GHC Trac Issues: #12554, #12661

comment:28 Changed 4 months ago by Ben Gamari <ben@…>

In 2864ad7/ghc:

testsuite/driver: Allow threading on Windows

It seems that threading now works fine. The only caveat here is that it
makes some race conditions more likely (e.g. #12554), although these
also appear to affect single-threaded runs.

Test Plan: Validate on Windows

Reviewers: austin, Phyx

Subscribers: thomie

Differential Revision: https://phabricator.haskell.org/D2600

GHC Trac Issues: #10510

comment:29 Changed 4 months ago by Phyx-

Differential Rev(s): Phab:D2684
Status: newpatch

comment:30 Changed 3 months ago by Ben Gamari <ben@…>

In 4d4f3533/ghc:

testsuite: Rip out hack for #12554

@Phyx is working on correctly fixing (pun intended) the underlying issue
that prompted this hack. It turns out that `timeout` it the culprit.
Moreover, this hack breaks on msys python builds, which don't export
`WindowsError`.

Test Plan: Validate on Windows with `msys` python.

Reviewers: Phyx, austin

Subscribers: thomie, Phyx

Differential Revision: https://phabricator.haskell.org/D2724

GHC Trac Issues: #12554

comment:31 Changed 3 months ago by bgamari

Phyx has put in a herculean effort to try to fix up the testsuite driver in Phab:D2684. Unfortunately, while the result works on Harbormaster and RyanGlScott's machine, it still fails for me. However, it turns out that only mingw-w64 Python 2.7 is affected.

Given that maintaining Python 2 support is already becoming rather tiresome and that Python 3 seems to work just fine, I say let's just deprecate Python 2 support in the testsuite driver. I've proposed this as Phab:D2766.

comment:32 Changed 3 months ago by Ben Gamari <ben@…>

In 0ce59be/ghc:

Fix testsuite threading, timeout, encoding and performance issues on Windows

In a land far far away, a project called Cygwin was born.
Cygwin used newlib as it's standard C library implementation.

But Cygwin wanted to emulate POSIX systems as closely as possible.
So it implemented `execv` using the Windows function `spawnve`.

Specifically

```
spawnve (_P_OVERLAY, path, argv, cur_environ ())
```

`_P_OVERLAY` is crucial, as it makes the function behave *sort of*
like execv on linux. the child process replaces the original process.

With one major difference because of the difference in process models
on Windows: the original process signals the caller that it's done.

this is why the file is still locked. because it's still running,
control was returned because the parent process was destroyed,
but the child is still running.

I think it's just pure dumb luck, that the older runtimes are slow
enough to give the process time to terminate before we tried deleting
the file.  Which explains why you do have sporadic failures even on
older runtimes like 2.5.0, of a test or two (like T7307).

So this patch fixes a couple of things. I leverage the existing
`timeout.exe` to implement a workaround for this issue.

a) The old timeout used to start the process then assign it to the job.
   This is slightly faulty since child processes are only assigned to a
   job is their parent were assigned at the time they started. So this
   was a race condition. I now create the process suspended, assign it
   to the job and then resume it. Which means all child processes are
   not running under the same job.

b) First things, Is to prevent dangling child processes. I mark the job
   with `JOB_OBJECT_LIMIT_KILL_ON_JOB_CLOSE` so when the last process in
   the job is done, it insures all processes under the job are killed.

c) Secondly, I change the way we wait for results. Instead of waiting
   for the parent process to terminate, I wait for the job itself to
   terminate.

   There's a slight subtlety there, we can't wait on the job itself.
   Instead we have to create an I/O Completion port and wait for signals
   on it.  See
   https://blogs.msdn.microsoft.com/oldnewthing/20130405-00/?p=4743

This fixes the issues on all runtimes for me and makes T7307 pass
consistenly.

The threading was also simplified by hiding all the locking in a single
semaphore and a completion class. Futhermore some additional error
reporting was added.

For encoding the testsuite now no longer passes a file handle to the
subprocess since on windows, sh.exe seems to acquire a lock on the file
that is not released in a timely fashion.

I suspect this because cygwin seems to emulate console handles by
creating file handles and using those for std handles. So when we give
it an existing file handle it just locks the file. I what's happening is
that it's not releasing the handle until all shared cygwin processes are
dead. Which explains why it worked in single threaded mode.

So now instead we pass a pipe and do not interpret the resulting data.

Any bytes written to stdin or read out of stdout/stderr are done so in
binary mode and we do not interpret the data. The reason for this is
that we have encoding tests in GHC which pass invalid utf-8. If we try
to handle the data as text then python will throw an exception instead
of a test comparison failing.

Also I have fixed the ability to override `PYTHON` when calling `make
tests`. This now works the same as with `.\validate`.

Finally, after cleaning up the locks I was able to make the abort
behavior work correctly as I believe it was intended: when you press
Ctrl+C and send an interrupt signal, the testsuite finishes the active
tests and then gracefully exits showing you a report of the progress it
did make. So using Ctrl+C will not just *die* as it did before.

These changes lift the restriction on which python version you use
(msys/mingw) or which runtime or python 3 or python 2.  All combinations
should now be supported.

Test Plan:
PATH=/usr/local/bin:/mingw64/bin:$APPDATA/cabal/bin:$PATH &&
PYTHON=/usr/bin/python THREADS=9 make test
THREADS=9 make test
PATH=/usr/local/bin:/mingw64/bin:$APPDATA/cabal/bin:$PATH &&
PYTHON=/usr/bin/python ./validate --quiet --testsuite-only

Reviewers: erikd, RyanGlScott, bgamari, austin

Subscribers: jrtc27, mpickering, thomie, #ghc_windows_task_force

Differential Revision: https://phabricator.haskell.org/D2684

GHC Trac Issues: #12725, #12554, #12661, #12004

comment:33 Changed 3 months ago by Ben Gamari <ben@…>

In 605bb9b/ghc:

testsuite: Use python3 by default

Summary:
It turns out that Phyx's fix for #12554 (D2684) still fails with mingw-w64
python 2.7. However, Python 3 (both msys2 and mingw-w64) work fine. Given that
supporting Python 2 has already become rather tiresome (as @thomie warned it
would), let's just move to python3 by default.

Test Plan: Validate

Reviewers: austin, Phyx

Reviewed By: Phyx

Subscribers: Phyx, thomie

Differential Revision: https://phabricator.haskell.org/D2766

GHC Trac Issues: #12554

comment:34 Changed 3 months ago by bgamari

Milestone: 8.2.1
Resolution: fixed
Status: patchclosed

This should at long last be fixed by the above two patches.

Note: See TracTickets for help on using tickets.