Opened 3 years ago

Closed 2 years ago

#9266 closed bug (fixed)

getDirectoryContents blow its stack in a huge directory

Reported by: joeyhess Owned by: snoyberg
Priority: normal Milestone:
Component: Core Libraries Version: 7.6.2
Keywords: Cc: core-libraries-committee@…
Operating System: Linux Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:


Once a directory has around 2 million files in it, a lack of an accumulator in getDirectoryContents (unix version only; windows already has an acc) causes it to blow the stack:

joey@darkstar:‾/src/git-annex>cat test.hs
import System.Directory
main = do
	l <- getDirectoryContents "/tmp/big"
	print (null l)
joey@darkstar:‾/src/git-annex>ghc --make -O2 test
[1 of 1] Compiling Main             ( test.hs, test.o )
Linking test ...
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

I suggest the attached patch.

Attachments (2)

0001-use-an-accumulator-in-getDirectoryContents.patch (1.0 KB) - added by joeyhess 3 years ago.
0002-add-portable-directory-reading-interface.patch (5.8 KB) - added by joeyhess 3 years ago.

Download all attachments as: .zip

Change History (16)

comment:1 Changed 3 years ago by joeyhess

While my patch avoids the stack overflow, getDirectoryContents still seems to be using more memory than I would expect if it's lazily generating the list of files. The test program uses around 600 mb.

I think it also needs to use unsafeInterleaveIO, but have not been able to get that to work.

comment:2 Changed 3 years ago by joeyhess

I've gotten it to work with unsafeInterleaveIO. I can provide a patch, but I don't think you'l find this approach acceptable. The problem is that any errors would be deferred until the list is consumed, ie the readFile problem all over again.

With that said, it does work; the test program now uses only 1776kb, and a variant that gets the length of the list behaves similarly well.

Last edited 3 years ago by joeyhess (previous) (diff)

comment:3 Changed 3 years ago by nomeata

This seems to be an instance of the problem described at (for which I have not found a suitable solution yet).

Have you tried passing an a dlist in the accumulator? OTOH, it shouldn’t perform better than your first patch, just result in a different order...

You say you use 7.8. Does this mean that the rumors that the stack would be unlimited in 7.8 are not true?

comment:4 Changed 3 years ago by joeyhess

difficulty: Easy (less than 1 hour)Unknown

comment:5 Changed 3 years ago by joeyhess

Sorry, had wrong ghc version entered before.

It might indeed save some memory to deepseq the accumulator as described in the blog post. I have not tried, since in my use case I want to avoid buffering the whole list in memory at all.

I'm currently using a getDirectoryContents' that uses unsafeInterleaveIO. To avoid exceptions being thrown after the call has succeeded, when the return list is traversed, I made it catch and discard exceptions from Posix.readDirStream etc, so in an exceptional condition the list may not contain all items in the directory.

That was ok in my use case, but I dunno if it would be acceptable for the real getDirectoryContents. It would probably be fine to just fix it to not blow the stack, and perhaps add a note to its documentation that the list of directory contents is not streamed lazily. (Although note that eg, removeDirectoryRecursive uses getDirectoryContents and so can also unexpectedly use large amounts of memory..)

I do wonder if conduit has a better way to handle this.

comment:6 Changed 3 years ago by joeyhess

One other idea:

openDirectory :: FilePath -> IO DirHandle

readDirectory :: DirHandle -> IO (Maybe FilePath)
-- closes DirHandle automatically at end

If directory included that interface, then code that needs to stream the directory could easily do so. removeDirectoryRecursive could use that instead of getDirectoryContents and avoid ever using a lot of memory. It could also probably be used to make a conduit, etc.

comment:7 Changed 3 years ago by nomeata

I was under the impression that people tend to think badly about lazy IO these days, so I’m not sure if it makes sense to make getDirectoryContents lazy. Unless of course most people expect it to be.

I think a way forward is to make getDirectoryContents non-lazy, but not blowing the stack (if that still is a problem on 7.8), and alternatively think about an interface for people who need more control over that. Maybe is already that solution.

comment:8 Changed 3 years ago by joeyhess

Data.Conduit.Filesystem uses listDirectory to generate a list, and loops over it instead. listDirectory comes from system-fileio, and on Windows just uses getDirectoryContents. On unix, it essentially re-implements getDirectoryContents (unsure why). So, it does not avoid buffering [FilePath] in memory.

But, looking at Data.Conduit.Filesystem, it could certianly be changed to use the openDirectory/readDirectory interface prototyped above and avoid that problem. Essentially, it would: liftIO (readDirectory h) >>= yield

In fact, system-fileio has internally a similar openDir/readDir/closeDir, although that interface is not exported. So, I think adding that low-level interface to directory or somewhere and using it in conduit etc would be a good plan.

comment:9 Changed 2 years ago by thoughtpolice

Component: libraries/directoryCore Libraries
Owner: set to ekmett

Moving over to new owning component 'Core Libraries'.

comment:10 Changed 2 years ago by thomie

Cc: core-libraries-committee@… added
Status: newupstream

joeyhess: you might have more luck submitting a pull request to, and poking someone from the corelibcom to look at it.

comment:11 Changed 2 years ago by snoyberg

For the record, recent versions of conduit-extra provide a proper streaming solution, e.g.:

This is based on non-conduit-specific code available in streaming-commons:

comment:12 Changed 2 years ago by snoyberg

Owner: changed from ekmett to snoyberg

comment:13 Changed 2 years ago by snoyberg

Reviewing your code, it looks incredibly similar to the code that I wrote in streaming-commons. I think it does make sense to include such a patch in directory, but that should likely be handled via a pull request there.

comment:14 Changed 2 years ago by snoyberg

Resolution: fixed
Status: upstreamclosed

I've implemented a fix for the stack overflow at:

For the streaming interface: please send a pull request with your changes. I'm going to close this ticket as resolved.

Note: See TracTickets for help on using tickets.