Opened 12 months ago

Closed 5 months ago

#9266 closed bug (fixed)

getDirectoryContents blow its stack in a huge directory

Reported by: joeyhess Owned by: snoyberg
Priority: normal Milestone:
Component: Core Libraries Version: 7.6.2
Keywords: Cc: core-libraries-committee@…
Operating System: Linux Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Revisions:

Description

Once a directory has around 2 million files in it, a lack of an accumulator in getDirectoryContents (unix version only; windows already has an acc) causes it to blow the stack:

joey@darkstar:‾/src/git-annex>cat test.hs
import System.Directory
main = do
	l <- getDirectoryContents "/tmp/big"
	print (null l)
joey@darkstar:‾/src/git-annex>ghc --make -O2 test
[1 of 1] Compiling Main             ( test.hs, test.o )
Linking test ...
joey@darkstar:‾/src/git-annex>./test
Stack space overflow: current size 8388608 bytes.
Use `+RTS -Ksize -RTS' to increase it.

I suggest the attached patch.

Attachments (2)

0001-use-an-accumulator-in-getDirectoryContents.patch (1.0 KB) - added by joeyhess 12 months ago.
0002-add-portable-directory-reading-interface.patch (5.8 KB) - added by joeyhess 12 months ago.

Download all attachments as: .zip

Change History (16)

comment:1 Changed 12 months ago by joeyhess

While my patch avoids the stack overflow, getDirectoryContents still seems to be using more memory than I would expect if it's lazily generating the list of files. The test program uses around 600 mb.

I think it also needs to use unsafeInterleaveIO, but have not been able to get that to work.

comment:2 Changed 12 months ago by joeyhess

I've gotten it to work with unsafeInterleaveIO. I can provide a patch, but I don't think you'l find this approach acceptable. The problem is that any errors would be deferred until the list is consumed, ie the readFile problem all over again.

With that said, it does work; the test program now uses only 1776kb, and a variant that gets the length of the list behaves similarly well.

Last edited 12 months ago by joeyhess (previous) (diff)

comment:3 Changed 12 months ago by nomeata

This seems to be an instance of the problem described at http://www.joachim-breitner.de/blog/archives/620-Constructing-a-list-in-a-Monad.html (for which I have not found a suitable solution yet).

Have you tried passing an a dlist in the accumulator? OTOH, it shouldn’t perform better than your first patch, just result in a different order...

You say you use 7.8. Does this mean that the rumors that the stack would be unlimited in 7.8 are not true?

comment:4 Changed 12 months ago by joeyhess

  • difficulty changed from Easy (less than 1 hour) to Unknown
  • Version changed from 7.8.2 to 7.6.2

comment:5 Changed 12 months ago by joeyhess

Sorry, had wrong ghc version entered before.

It might indeed save some memory to deepseq the accumulator as described in the blog post. I have not tried, since in my use case I want to avoid buffering the whole list in memory at all.

I'm currently using a getDirectoryContents' that uses unsafeInterleaveIO. To avoid exceptions being thrown after the call has succeeded, when the return list is traversed, I made it catch and discard exceptions from Posix.readDirStream etc, so in an exceptional condition the list may not contain all items in the directory.

That was ok in my use case, but I dunno if it would be acceptable for the real getDirectoryContents. It would probably be fine to just fix it to not blow the stack, and perhaps add a note to its documentation that the list of directory contents is not streamed lazily. (Although note that eg, removeDirectoryRecursive uses getDirectoryContents and so can also unexpectedly use large amounts of memory..)

I do wonder if conduit has a better way to handle this.

comment:6 Changed 12 months ago by joeyhess

One other idea:

openDirectory :: FilePath -> IO DirHandle

readDirectory :: DirHandle -> IO (Maybe FilePath)
-- closes DirHandle automatically at end

If directory included that interface, then code that needs to stream the directory could easily do so.
removeDirectoryRecursive could use that instead of getDirectoryContents and avoid ever using a lot of memory. It could also probably be used to make a conduit, etc.

comment:7 Changed 12 months ago by nomeata

I was under the impression that people tend to think badly about lazy IO these days, so I’m not sure if it makes sense to make getDirectoryContents lazy. Unless of course most people expect it to be.

I think a way forward is to make getDirectoryContents non-lazy, but not blowing the stack (if that still is a problem on 7.8), and alternatively think about an interface for people who need more control over that. Maybe http://hackage.haskell.org/package/filesystem-conduit-1.0.0.2/docs/Data-Conduit-Filesystem.html is already that solution.

comment:8 Changed 12 months ago by joeyhess

Data.Conduit.Filesystem uses listDirectory to generate a list, and loops over it instead. listDirectory comes from system-fileio, and on Windows just uses getDirectoryContents. On unix, it essentially re-implements getDirectoryContents (unsure why). So, it does not avoid buffering [FilePath] in memory.

But, looking at Data.Conduit.Filesystem, it could certianly be changed to use the openDirectory/readDirectory interface prototyped above and avoid that problem. Essentially, it would: liftIO (readDirectory h) >>= yield

In fact, system-fileio has internally a similar openDir/readDir/closeDir, although that interface is not exported. So, I think adding that low-level interface to directory or somewhere and using it in conduit etc would be a good plan.

comment:9 Changed 9 months ago by thoughtpolice

  • Component changed from libraries/directory to Core Libraries
  • Owner set to ekmett

Moving over to new owning component 'Core Libraries'.

comment:10 Changed 8 months ago by thomie

  • Cc core-libraries-committee@… added
  • Status changed from new to upstream

joeyhess: you might have more luck submitting a pull request to http://github.com/haskell/directory, and poking someone from the corelibcom to look at it.

comment:11 Changed 5 months ago by snoyberg

For the record, recent versions of conduit-extra provide a proper streaming solution, e.g.:

http://hackage.haskell.org/package/conduit-extra-1.1.6.2/docs/src/Data-Conduit-Filesystem.html#sourceDirectory

This is based on non-conduit-specific code available in streaming-commons:

http://hackage.haskell.org/package/streaming-commons-0.1.9.1/docs/Data-Streaming-Filesystem.html

comment:12 Changed 5 months ago by snoyberg

  • Owner changed from ekmett to snoyberg

comment:13 Changed 5 months ago by snoyberg

Reviewing your code, it looks incredibly similar to the code that I wrote in streaming-commons. I think it does make sense to include such a patch in directory, but that should likely be handled via a pull request there.

comment:14 Changed 5 months ago by snoyberg

  • Resolution set to fixed
  • Status changed from upstream to closed

I've implemented a fix for the stack overflow at:

https://github.com/haskell/directory/pull/17

For the streaming interface: please send a pull request with your changes. I'm going to close this ticket as resolved.

Note: See TracTickets for help on using tickets.