Opened 8 years ago

Closed 7 years ago

Last modified 7 years ago

#1218 closed proposal (wontfix)

Add sortNub and sortNubBy to Data.List

Reported by: neil Owned by:
Priority: normal Milestone: Not GHC
Component: libraries/base Version: 6.8.2
Keywords: Cc: libraries@…, gwern0@…
Operating System: Linux Architecture: Unknown/Multiple
Type of failure: Test Case:
Blocked By: Blocking:
Related Tickets: Differential Revisions:

Description

I propose the addition of sortNub and sortNubBy to Data.List

sortNub = sort . nub, but can run in O(n log n) instead of O(n2) with the implementation:

sortNub = map head . group . sort

I personally have defined this function over 25 times. Debate on the libraries list.

Attachments (1)

sortnub.patch.gz (16.7 KB) - added by neil 8 years ago.
Proposed patch

Download all attachments as: .zip

Change History (5)

Changed 8 years ago by neil

Proposed patch

comment:1 Changed 8 years ago by dons

How about rather than changing the api, purely for efficiency reasons, we just add a rewrite rule to Data.List for this optimisation?

The rule would be:

{-# RULES
"sort/nub" sort . nub = map head . group . sort
  #-}

sort . nub will be rewritten to the optimised case, and we won't need to further complicate the API.

Quite often now it is possible for the library author to expose only a small, generic API, while providing internal rules for specific optimised cases. This keeps the interface small, while still providing performance.

Data.ByteString does this in a number of places, for example:

{-# RULES
  "FPS specialise filter (== x)" forall x.
     filter (== x) = filterByte x
  #-}

I'd really not like to see more functions exposed in the list interface, only as optimisations, that could be better provided via RULES.

comment:2 Changed 7 years ago by igloo

  • Resolution set to wontfix
  • Status changed from new to closed

This proposal seems to be abandoned

comment:3 Changed 7 years ago by guest

  • Cc gwern0@… added
  • Operating System changed from Unknown to Linux
  • Version set to 6.8.2

This may be abandoned, but I'm curious. Is this actually an optimization? I've been trying out some toy examples, and the 'map head . group . sort' version seems to perform absolutely terribly compared to just 'sort . nub', at least on my sample.

Example:


gwern@craft:15969~>cat tmp2.hs [ 9:58AM]
import Data.List

main = print $ uniq $ take 1000000000 $ cycle [1]

uniq = sort . nub% gwern@craft:15970~>time runhaskell tmp2.hs [ 9:58AM]
[1]
runhaskell tmp2.hs 45.90s user 0.23s system 66% cpu 1:09.24 total
gwern@craft:15971~>cat tmp4.hs [ 9:59AM]
import Data.Set

main = print $ uniq $ take 1000000000 $ cycle [1]

uniq = toList . fromList gwern@craft:15974~>time runhaskell tmp4.hs [10:00AM]
[1]
runhaskell tmp4.hs 38.14s user 0.14s system 96% cpu 39.743 total
gwern@craft:15975~>cat tmp3.hs [10:02AM]
import Data.List

main = print $ uniq $ take 1000000000 $ cycle [1]

uniq = map head . group . sort gwern@craft:15976~>


You'll understand if I don't provide timings for tmp3.hs (the map head implementation) since it takes vastly longer than 30 seconds and eats so much RAM and CPU time that it locks my system overnight.

Admittedly, perhaps a list of repeated items is not the best usecase, but shouldn't optimizations - particularly optimizations in the base library be a win in all cases? (and particularly not huge pessimizations in some cases)?

--
gwern

comment:4 Changed 7 years ago by simonmar

  • Architecture changed from Unknown to Unknown/Multiple
Note: See TracTickets for help on using tickets.