Opened 11 years ago

Closed 9 years ago

Last modified 9 years ago

#1218 closed proposal (wontfix)

Add sortNub and sortNubBy to Data.List

Reported by: neil Owned by:
Priority: normal Milestone: Not GHC
Component: libraries/base Version: 6.8.2
Keywords: Cc: libraries@…, gwern0@…
Operating System: Linux Architecture: Unknown/Multiple
Type of failure: None/Unknown Test Case:
Blocked By: Blocking:
Related Tickets: Differential Rev(s):
Wiki Page:


I propose the addition of sortNub and sortNubBy to Data.List

sortNub = sort . nub, but can run in O(n log n) instead of O(n2) with the implementation:

sortNub = map head . group . sort

I personally have defined this function over 25 times. Debate on the libraries list.

Attachments (1)

sortnub.patch.gz (16.7 KB) - added by neil 11 years ago.
Proposed patch

Download all attachments as: .zip

Change History (5)

Changed 11 years ago by neil

Attachment: sortnub.patch.gz added

Proposed patch

comment:1 Changed 11 years ago by dons

How about rather than changing the api, purely for efficiency reasons, we just add a rewrite rule to Data.List for this optimisation?

The rule would be:

"sort/nub" sort . nub = map head . group . sort

sort . nub will be rewritten to the optimised case, and we won't need to further complicate the API.

Quite often now it is possible for the library author to expose only a small, generic API, while providing internal rules for specific optimised cases. This keeps the interface small, while still providing performance.

Data.ByteString does this in a number of places, for example:

  "FPS specialise filter (== x)" forall x.
     filter (== x) = filterByte x

I'd really not like to see more functions exposed in the list interface, only as optimisations, that could be better provided via RULES.

comment:2 Changed 9 years ago by igloo

Resolution: wontfix
Status: newclosed

This proposal seems to be abandoned

comment:3 Changed 9 years ago by guest

Cc: gwern0@… added
Operating System: UnknownLinux
Version: 6.8.2

This may be abandoned, but I'm curious. Is this actually an optimization? I've been trying out some toy examples, and the 'map head . group . sort' version seems to perform absolutely terribly compared to just 'sort . nub', at least on my sample.


gwern@craft:15969~>cat tmp2.hs [ 9:58AM] import Data.List

main = print $ uniq $ take 1000000000 $ cycle [1]

uniq = sort . nub% gwern@craft:15970~>time runhaskell tmp2.hs [ 9:58AM] [1] runhaskell tmp2.hs 45.90s user 0.23s system 66% cpu 1:09.24 total gwern@craft:15971~>cat tmp4.hs [ 9:59AM] import Data.Set

main = print $ uniq $ take 1000000000 $ cycle [1]

uniq = toList . fromList gwern@craft:15974~>time runhaskell tmp4.hs [10:00AM] [1] runhaskell tmp4.hs 38.14s user 0.14s system 96% cpu 39.743 total gwern@craft:15975~>cat tmp3.hs [10:02AM] import Data.List

main = print $ uniq $ take 1000000000 $ cycle [1]

uniq = map head . group . sort gwern@craft:15976~>

You'll understand if I don't provide timings for tmp3.hs (the map head implementation) since it takes vastly longer than 30 seconds and eats so much RAM and CPU time that it locks my system overnight.

Admittedly, perhaps a list of repeated items is not the best usecase, but shouldn't optimizations - particularly optimizations in the base library be a win in all cases? (and particularly not huge pessimizations in some cases)?

-- gwern

comment:4 Changed 9 years ago by simonmar

Architecture: UnknownUnknown/Multiple
Note: See TracTickets for help on using tickets.