Monday, January 3, 2011

Persistent Vectors and Hash Maps for Haskell

One of the prominent features of Clojure are a core set of immutable data structures with efficient manipulation operations. Two of the most innovative and important are the persistent vector and persistent hash map.

As a little project I set myself in order to get to know Haskell better, I have been porting these structures to Haskell. I think it's now at a state where the basics are there and usable, so I've put it up on my Github. The API provides Data.PVector (the persistent vector) and Data.PHashMap (the persistent hash map). The interface for both has been kept as consistent as possible with Data.Map.

Basic usage

These two data structures are:

  • Immutable. Unlike Data.Array.ST and Data.HashTable, there are no monads in sight.
  • Persistent. They provide "update" operations which do not destroy the original structure.
  • Efficient. Unlike Data.Array, updating a PVector or PHashMap doesn't copy the entire structure, but only the changed path.

Here's a demo of what you can do with a PVector:

ghci> :m + Data.PVector
ghci> empty  -- the empty pvector
fromList []

ghci> append 1 it
fromList [1]

ghci> append 42 it
fromList [1,42]

ghci> append 13 it
fromList [1,42,13]

ghci> let a = it
ghci> a ! 0  -- indexes are from 0 to n-1
1

ghci> a ! 1
42

ghci> a ! 2
13

ghci> set 1 71 a  -- a new PVector with the element replaced
fromList [1,71,13]

ghci> adjust succ 2 a  -- apply a function to a single element
fromList [1,42,14]

ghci> Data.PVector.map succ a  -- apply a function to all elements
fromList [2,43,14]

ghci> fromList [1..10]  -- convert a list to a PVector
fromList [1,2,3,4,5,6,7,8,9,10]

ghci> elems it  -- convert a PVector to a list
[1,2,3,4,5,6,7,8,9,10]

And here's a demo of the basic functionality of PHashMap:

ghci> :m + Data.PHashMap
ghci> empty Data.HashTable.hashString
        -- an empty PHashMap (requires a key hash function)
fromList hashFn []

ghci> insert "foo" 1 it
fromList hashFn [("foo",1)]

ghci> insert "bar" 42 it
fromList hashFn [("foo",1),("bar",42)]

ghci> insert "qux" 123 it
fromList hashFn [("qux",12),("foo",1),("bar",42)]

ghci> insert "qux" 13 it  -- inserting an existing key overwrites by
default
fromList hashFn [("qux",13),("foo",1),("bar",42)]

ghci> let a = it
ghci> a ! "foo"
1

ghci> a ! "baz"  -- using (!) is unsafe
*** Exception: array index out of range: element not in the map

ghci> Data.PHashMap.lookup "bar" a
Just 42

ghci> Data.PHashMap.lookup "baz" a  -- 'lookup' returns a safe Maybe
Nothing

ghci> adjust succ "foo" a  -- apply a function to a value
fromList hashFn [("qux",13),("foo",2),("bar",42)]

ghci> Data.PHashMap.map succ a  -- apply a function to all values
fromList hashFn [("qux",14),("foo",2),("bar",43)]

ghci> keys a
["qux","foo","bar"]

ghci> elems a
[13,1,42]

ghci> fromList Data.HashTable.hashString [("a", 1), ("b", 2), ("c", 3)]
fromList hashFn [("b",2),("c",3),("a",1)]

ghci> toList it
[("b",2),("c",3),("a",1)]

Installation

To try it yourself, just clone it from my Github repo, then configure, build and install it:

$ runhaskell Setup.hs configure --user
$ runhaskell Setup.hs build
$ runhaskell Setup.hs install

Performance

The single-element operations for each of these structures technically run in logarithmic time. However, they implemented as 32-ary trees which never exceed a depth of 7 nodes, so you can treat them as constant-time operations (for relatively large constants).

How it works

I wrote this code after reading the following explanatory blog posts on how the structures work in Clojure. I haven't veered too far from Clojure's implementation, so these posts should also provide a decent birds-eye overview of my Haskell implementation.

To do

Things that I haven't gotten round to but would be good to have are:
  • Match Data.Map in completeness
  • Performance tuning
    • More strictness
    • A more efficient fromList (it currently constructs lots of intermediary structures
  • Make a PVector-based implementation of IArray (?)
  • Unit tests
  • Benchmarks

This is my first proper project in Haskell, so I'm sure there are many problems with it. If you have any suggestions, I'd love to hear your comments. And Github pull requests would be very welcome!

11 comments:

  1. You should upload it to Hackage, even if you consider it incomplete.

    Also, do you have benchmarks against the already existing structures, like Data.Map and Data.IntMap? Benchmarks against clojure would also be nice.

    ReplyDelete
  2. That's very cool. However, I don't see why you are calling them persistent vectors or persistent hash maps - in Haskell people expect persistence by default, so you don't need to call it out. The default Haskell arrays aren't very good, so your work is very welcome!

    And do upload them to Hackage, it would be great.

    ReplyDelete
  3. Nice to see someone trying to port Clojure's nice data structures. Edward Z. Yang have tried a port as well so you might want to chat with him:

    http://blog.ezyang.com/2010/03/the-case-of-the-hash-array-mapped-trie/

    I wrote a little Criterion benchmark for your code but it crashed with the following error:

    Enum.pred{Int32}: tried to take `pred' of minBound

    Here's the code:


    {-# LANGUAGE GADTs #-}

    module Main where

    import Control.DeepSeq
    import Control.Exception (evaluate)
    import Control.Monad.Trans (liftIO)
    import Criterion.Config
    import Criterion.Main
    import Data.Hashable (Hashable(hash))
    import Data.Int (Int32)
    import qualified Data.ByteString as BS
    import qualified Data.ByteString.Char8 as C
    import qualified Data.PHashMap as PM
    import Data.List (foldl')
    import Data.Maybe (fromMaybe)
    import Prelude hiding (lookup)
    import System.Random (mkStdGen, randomRs)

    instance NFData BS.ByteString

    instance (Eq k, NFData k, NFData v) => NFData (PM.PHashMap k v) where
    rnf (PM.PHM f r) = f `seq` rnf r

    instance (Eq k, NFData k, NFData v) => NFData (PM.Node k v) where
    rnf PM.EmptyNode = ()
    rnf (PM.LeafNode h k v) = rnf h `seq` rnf k `seq` rnf v
    rnf (PM.HashCollisionNode h xs) = rnf h `seq` rnf xs
    rnf (PM.BitmapIndexedNode bm arr) = rnf bm `seq` rnf arr
    rnf (PM.ArrayNode n arr) = rnf n `seq` rnf arr

    data B where
    B :: NFData a => a -> B

    instance NFData B where
    rnf (B b) = rnf b

    hashBS :: BS.ByteString -> Int32
    hashBS = fromIntegral . hash

    main :: IO ()
    main = do
    let hmbs = PM.fromList hashBS elemsBS :: PM.PHashMap BS.ByteString Int
    defaultMainWith defaultConfig
    (liftIO . evaluate $ rnf [B hmbs])
    [ bench "lookup" $ nf (lookup keysBS) hmbs
    , bench "insert" $ nf (insert elemsBS) (PM.empty hashBS)
    , bench "delete" $ nf (delete keysBS) hmbs
    ]
    where
    n :: Int
    n = 2^(12 :: Int)

    elemsBS = zip keysBS [1..n]
    keysBS = rnd 8 n

    lookup :: Eq k => [k] -> PM.PHashMap k Int -> Int
    lookup xs m = foldl' (\z k -> fromMaybe z (PM.lookup k m)) 0 xs

    insert :: Eq k => [(k, Int)] -> PM.PHashMap k Int -> PM.PHashMap k Int
    insert xs m0 = foldl' (\m (k, v) -> PM.insert k v m) m0 xs

    delete :: Eq k => [k] -> PM.PHashMap k Int -> PM.PHashMap k Int
    delete xs m0 = foldl' (\m k -> PM.delete k m) m0 xs

    -- | Generate a number of fixed length strings where the content of
    -- the strings are letters in random order.
    rnd :: Int -- ^ Length of each string
    -> Int -- ^ Number of strings
    -> [BS.ByteString]
    rnd strlen num = map C.pack $ take num $ split $ randomRs ('a', 'z') $ mkStdGen 1234
    where
    split cs = case splitAt strlen cs of (str, cs') -> str : split cs'

    ReplyDelete
  4. This comment has been removed by the author.

    ReplyDelete
  5. Thanks for the encouraging comments, everyone!

    Neil, that's a good point; I'll need to look for a better name (keeping in mind that Data.Vector is taken).

    Johan, thanks for your efforts! I've now fixed two bugs which were exposed by that test and running it gives: lookup: 12ms, insert: 29ms, delete: 22ms. I don't have time right now to compare this anything else, but a quick test I did before with the knucleotides benchmark showed that this about an order of magnitude slower than Data.Map :(

    I suspect it should be fixable with a few strictness annotations.

    PS: How do you access those non-exported symbols in the benchmark (I had to temporarily export everything in my code to make it work)?

    ReplyDelete
  6. I think you want the term "pure functional" rather than "persistent". The latter usually means they can be saved to disk and restored.

    ReplyDelete
  7. Warren, "persistent" is indeed an overloaded word but it's correct in this context.

    ReplyDelete
  8. Kevin, for this quick benchmark I just changed the export list of your library. If I wanted to have more permanent access I would put internal details in e.g. Data.PHashMap.Internal and not export that module in the .cabal file. I would then compile the benchmark manually (or using a Makefile) and get access to the internals that way.

    ReplyDelete
  9. This comment has been removed by the author.

    ReplyDelete
  10. Appreciate someone can give some solution to me. I have downloaded the zip file - exclipy-pdata-51eb586 to try the PVector adding element function. I'm using WinHugs for executing haskell code. The error that I was facing is

    ERROR file:{Hugs}\packages\base\Data\PVector.hs:46 - Type error in application
    *** Expression : PV 0 shiftStep (BodyNode (array (0,-1) [])) (array (0,-1) [])
    *** Term : shiftStep
    *** Type : Integer
    *** Does not match : Int

    I'm newbie in haskell, hope some one can help me.

    ReplyDelete
  11. Hi aLvin,

    Since posting the original blog post, I have discontinued the vector part of the library so I can focus on the hashmap part (http://hackage.haskell.org/package/vector provides a better vector implementation).

    Also, I haven't had the chance to make it work in Hugs. I suggest that you give GHC a go.

    ReplyDelete