Efficient Maps using Generalised Tries
|Reported by:||AdrianHey||Owned by:|
|Cc:||ekmett@…||Difficulty:||1 person Summer|
Description (last modified by )
The Haskell standard libraries currently provide Map and Set implementations based on balanced binary trees. It is possible to improve on the performance of these marginally (by a factor of two or so at best, less typically) by using better and cheaper tree type, such as AVL trees. However, the fundamental problem with all such implementations is that the search used by common lookup and insert operations still requires O(log n) potentially expensive comparison operations (n is the number of entries in the Map). To understand how inefficient this can be, imagine what is involved in lookup on a large map where the keys are all filepaths that just happen to share exactly the same 20 character prefix.
This solution to this problem for character strings and similar list like keys has been known for a long time, that is to use some form of trie. Generalised tries extend this idea to provide effiecient Maps for arbitrary "sum of product" types (a typical algebraic data type in Haskell). The basic idea is that for sum types you maintain separate maps for each data constructor and for products you use nested maps (maps of maps of maps..).
Googling will reveal many interesting papers on this subject, but my personal impression is that with many implementations simplicity and elegance has been given priority over performance and API completeness. In reality producing a useable and efficient generalised trie implementation is not at all simple. The sum of product rules given above will yield a rather inefficient implementation for lists, for example.
Implementations should also bear in mind (and ideally exploit) the fact that with nested maps of maps.. singleton maps will be very common.
There is also an important invariant that must be maintained with nested maps of maps.. This is that empty "inner" maps are illegal. If some operation yields an empty "inner" map the corresponding entry in the "outer" map should be deleted (which may cause the outer map to also become empty).
Immediate Project Goals
This is what I think could reasonably be achieved in a summer of code project.
1- Define the generalised trie class API (or APIs)
The class definition should make appropriate use of the new type families extension provided by ghc. The class API needs to be complete, in the sense that it must contain all the methods needed to create new instances from existing instances (not just those that a typical Map user would need). Writing a few common instances by hand will help in discover exactly what a complete API should look like.
It may be that several closely related classes are needed. For example, tries generally don't store copies of their keys. This has some efficiency advantages but also means that "withKeys" operations can be expensive as the actual keys must be reconstructed from the trie. It may be advantageous to have a separate class for tries which do store copies of the keys, and this class is where various "withKeys" methods would live.
There is also a potential issue with having an Ord class constraint on keys. This has definite performance advantages for some operations and also enables fast trie based sort implementations. But some people feel that Ord instances should be meaningful in some deeper philosophical sense, not just the arbitrary total ordering of concrete data types that you would get by deriving Ord. There is also a practical problem in that despite the Ord constraint, a trie implementation will not actually use any Ord class methods. It's ordering will be consistent with that that would have been defined by a wholly derived Ord instance, but how do we ensure ordering consistency with hand written Ord instances?
One possible solution might be for the generalised trie class to define it's own private total ordering of keys (and corresponding compare method) which may or may not agree with the ordering defined by the Ord instance (if it exists).
2- Produce a few hand written instances of the class API(s)
These should include common standard types like Ints, Lists, Strings and Byte Strings. Also a common default implementation of the same APIs for any Ord instance is needed (based on AVL trees for example).
The purpose of this exercise is to help ensure that the defined class APIs are reasonably complete and usable (in the sense described above) and also to provide the community with something that is actually useable in a reasonable time scale.
This isn't the perfect solution, but even providing a uniform Map API with efficient specialised (type specific) implementations for this limited range of types will still be a big improvement on the current situation.
3- Write a test and benchmarking suite for any generalised trie class instance
It is obviouly desirable that as far as is possible we ensure that defined class instances are correct. Being able to assess relative efficiencies of multiple possible candidate implementations would be good too.
Future Project Goals
In the long run what we need is some way to automatically derive generalised trie instances (from data type definitions). I'm reluctant to make this part of this years SOC project definition as I think this is not realistically achievable in the time available (given the likely size and complexity of the generalised trie API). It might be realistic in future years assuming all the necessary preliminaries described above are in place.
But I'd be happy to be proven wrong about this too! So if anyone wants to give it a shot that would be very welcome.
Starting Point (where we're at now - 2008)
In this darcs repository..
..you can find my own initial attempts at this. This includes a preliminary generalised trie class API (called GT) using functional dependencies, a few hand written instances for common prelude types and a default instance for any Ord instance (which uses AVL trees).
This may be modified, finished (or just ignored) at your discretion :-)
I would encourage interested parties to at least look at the implementation for Lists (called ListGT) and understand how and why it differs from that that would be obtained from the simple sum of product rules for generalised tries. Maybe the same idea could be used for arbitrary product types. By that I mean treat an arbitrary product type as if it were an heterogeneous list of known length (using tuples perhaps).
Starting Point (where we're at now - 2009)
There seems to have been some interest in this project from students for 2009. Anyone wishing to submit a proposal for this should be aware that Jamie Brandon did a lot of work on this for 2008. But it's still not finished, so there is room for 2009 applications and further work.
Currently we have preliminary class definitions, a few hand written instances and a test suite as a package on hackage (gmap). This makes use of functional dependencies.
The darcs repository for the library is in the process of being converted to use type families (instead of functional dependencies), but can't be compiled at the moment (apparently there's a problem with ghc type families implementation).
Jamie Brandon listed what he sees as outstanding work in this email:
One thing that seems to be missing from this post is that we need a way of deriving instances for arbitrary key types. The class API is somewhat huge and writing instances by hand is not really a practical proposition for most users.
- Don Stewart
- Edward Kmett <ekmett@…>
- Adrian Hey <ahey@…>
- Jamie Brandon <jamiiecb@…>
- Brendan Hickey <brendan.m.hickey+soc@…>
- Hrushikesh Tilak <hrushikesh.tilak+haskell@…>
- Louis Wasserman <wasserman.louis@…>