Implement Concurrent Hash-table / Hashmap

Loren Davis

Short description: Concurrent data structures for Haskell are currently a work in progress, and are necessary for parallel and high-performance computing. A few data structures, such as wait-free lists, already have Haskell implementations. One that does not yet is a thread-safe hash table. I propose to implement one as a library available under the new BSD license.

Introduction

 

There are several implementations of hash tables in Haskell, including the standard Data.HashTable, and a number of optimized and mutable implementations in the hashtables package.  There is, however, a need for at least one mutable, concurrent hashtable implementation, and possibly more if different use cases require different trade-offs.


This is an active area of research, and several different cutting-edge algorithms for concurrent hash tables have been published in recent years, including Click's highly-scalable hashtables, Shalev and Shavit’s split-ordered lists and Gao, Groote and Hesselink’s almost-wait-free resizable hashtables.  Discussion with the community shows that several approaches have at least one advocate.  As the lock-striping algorithm proposed for this would not scale well with the number of cores, there could also be room for a lock-free implementation.  This would be useful for high-performance computing, and perhaps on future hardware that we would call high-performance today, as well as of some research interest, and could perhaps make a good follow-up project.

 

 

About the Author

I (Loren Davis) am a graduate student at Oregon State University, in the AI, Machine Learning and Algorithms research cluster (http://eecs.oregonstate.edu/research/research-areas/ai-machine-learning-algorithms).  I learned Haskell while doing postbac work in Portland, OR and have used it for a number of projects since then, most recently:policy iteration, simulation and reinforcement learning for Markov Decision Processe models in Alan Fern’s intelligent agents class (CS 533, Winter ’12) and all my assignments in Martin Erwig’s programming languages class (CS581, Winter ’12).  Another example was reduction-to-SAT and SAT-solving in a course by Jim Hook and Tim Sheard at Portland State University.

 

My postgrad coursework also includes parallel programming (CS 515, Spring ’11, Portland State University), although the implementation language for these course projects was C rather than Haskell.

 

I successfully completed an earlier project for Google Summer of Code 2008, a cross-platform implementation of Sun's Doors library, originally for Solaris.  My mentor was Greg Kroah-Hartman, and I also would like to acknowledge the advice and support of professor Bart Massey of Portland State University.

 

I can provide my source code for each of those projects, and several others, if you would be interested.

 

Scope of the Project

The project should consist of a library, released under a FLOSS license such as new BSD, and compatible with the existing hashtables implementations.  A possible naming scheme might be Data.HashTables.IO.Concurrent (if we use lock-striping), Data.HashTables.IO.Lockfree (if we use Gao’s algorithm) or Data.HashTables.IO.Waitfree (if we use split-ordered lists).  (Thanks to Gregory Collins for pointing out that locking and CAS require IO, rather than ST.)  Note that multiple implementations can coexist.  The interface should be as compatible with existing libraries as possible, to ease parallelization (although some implementations might encourage the programmer to refactor code to reduce overhead).  I should also release the test cases and framework.  This release might be a separate package on Hackage, or might be integrated into hashtables if the maintainer prefers.  Ideally, it would be possible to simply change the line import Data.HashTables.ST.Basic to use the new library and make existing code thread-safe.

 

While I have learned to code in a literate style, there should be at least some separate documentation for programmers and maintainers.

 

Proposed Timeline

  • Community Bonding Period: Make a final decision which algorithm to implement (currently leaning toward lock-striping, as that seems to be what the community is looking for).  Hack around with mutable hashtables and parallel code in Haskell.
  • May 21–28: Finalize interface; create all exported functions as stubs
  • May 29–July 13: Implement alpha version of library
  • Note: The last week of my spring term is June 11–15
  • Midterm: Review progress, set goals for final deliverable
  • July 13-August 6: Finish test framework and debug
  • August 7–13: Profile beta version and optimize
  • August 13–20: Package and integrate.  Finish documentation.

These milestones are somewhat flexible, but I found that having the alpha version done by the midterm and reserving the second half of the schedule to debug and polish worked very successfully for me last time.

 

My school's schedule does not synchronize perfectly with the program: in particular, there is about a three-week overlap between the end of my spring term and the start of the coding period.  I am not proposing to delay the start of coding by that long, but I cannot guarantee availability full-time until my classes end.  The week before finals should not be very busy, and finals week itself has been anything from very light to very intense for me in the past; last term, one of my professors allowed me to opt out of the final.  The schedule does provide for an assessment of progress at the midterm and allow for some adjustment.