NUMA supporting features for GHC

Sajith Sasidharan

Short description: The broad goal is to improve GHC's suport for NUMA systems.

Additional info: http://hackage.haskell.org/trac/summer-of-code/...

My name is Sajith Sasidharan.  I study at School of Informatics and Computer Science at Indiana University Bloomington (Computer Science, PhD, into third semester as we speak).  I'm broadly interested in the implementation of programming languages, parallelism, and systems.  I would like to propose certain NUMA-enhancement features for Glasgow Haskell Compiler as a Google Summer of Code project.

(And before anyone asks, yes, I'm too lazy to look for a real summer internship and too excited about the Summer of Code!)

Questions and Answers

What is the goal of the project you propose to do?

The proposal is outlined in this ticket in haskell.org's Summer of Code trac.  The ticket proposes support for these specific areas (quoting verbatim):

  • Process pinning: when launching processes, we might want to specify a list of CPUs rather than the number of CPUs, so that we can pin a process to a particular NUMA node. Say, a -N [0,1,3] flag rather than -N 3 flag.
  • Memory locality: From a recent discussion at parallel-haskell mailing list, we learned that there is a clear path to improving the storage manager's NUMA support. The hypothesis is that allocating node-local nurseries per capability will improve performance over the bump-pointer approach in allocate. We might use the NUMA-aware allocation primitives from the  Portable Hardware Locality (hwloc) library for this.
  • Logging and tracing: Add NUMA-specific event logging and profiling information to support performance analysis and debugging of user-level NUMA-aware programs.

(Another additional item in the ticket is support for thread-pinning; but Simon Marlow says that this is already possible.)


In what ways will this project benefit the wider Haskell community?

Haskell indeed has excellent support for concurrency and parallelism, and this project's aim is to push further towards extracting that last ounce of performance from NUMA systems, by firing all CPUs if/when necessary and by ensuring a suitably NUMA-aware memory allocation behavior.  This could be especially useful to scientific workloads that needs greater amounts of task and data parallelism, for example.


Can you give some more detailed design of what precisely you intend to achieve?

Bulk of the hacking is going to be on and around GHC's runtime system (RTS).

Process pinning would require changes to capabilities and scheduler.  On init, a probe to determine NUMA-ness of the machine would be appropriate, before even trying to set up process/memory affinity. Also important is that these changes also should not affect non-NUMA systems in any adverse manner.

How would we determine the NUMA-ness and topology of the host machine?  The above-mentioned hwloc library could help, not just in determining the above parameters, but also in binding threads and memory to specific nodes.  Hwloc website has some examples

Second bullet point above -- improved memory locality -- involves hacking the block allocator and free list manager, implemented in rts/sm/BlockAlloc.c.  Instead of the global pool of blocks, we'd need allocations from per-capability nurseries, while being careful that deallocations are also appropriately managed. This discussion in parallel-haskell mailing list has greater details.

Related to the above, while answering off-list from glasgow-haskell-users mailing list, Duncan Coutts suggested that huge pages could be useful, perhaps in using them for each capability's first generation; this is something I would like to investigate.

The third bullet point will help in actually verifying the assumptions.  Note that the proposal is not to build a new logging/monitoring framework, but to extend the existing ghc-events package (and perhaps ThreadScope).  We might need to emit new events as well.  One of the ways to get to NUMA specific events would be by leveraging GHC's PAPI support -- we can, at least in theory, get PAPI to measure DRAM requests to specific NUMA domains.


What deliverables do you think are reasonable targets? Can you outline an approximate schedule of milestones?

Relevant patches against GHC and benchmark results to verify that these optimizations actually work would be reasonable targets.

I expect to spend some time familiarizing myself with the details of RTS implementation and the development process, preferably before the actual start date of the program.  (I have the upcoming semester-end craziness and some travel planned in the interim period; so I'm not sure how much of this is realizable.)

Week 1-3 will be spent working on process pinning.  Week 4-8 will be spent working on memory locality.  Support for logging and tracing will not be a separate activity; this needs to happen in parallel with the first two.  I expect that the final month will be spent on debugging and performance tuning.  Obviously, we also would need benchmarks to verify results and untangle performance problems.  I plan to borrow from the existing tests in monad-par (and perhaps others), and extend them if necessary.

Real world experience suggests that software projects rarely ever go according to the original schedule.  I hope that the one outlined here is reasonably realistic, while leaving some margin for slippage.


What relevant experience do you have? e.g. Have you coded anything in Haskell? Have you contributed to any other open source software? Been studying advanced courses in a related topic? 

I have been looking at Haskell's support for parallelism and concurrency of late; task-parallelism as supported by Strategies and Monad-par, arrays (such as Repa, DPH, and Accelerate), and other things like CloudHaskell and Hdph.  I have had a chance to familiarize with relevant literature and implementations by the way of course work.  I am also fairly new to Haskell.  I have used C for quite a while, which is perhaps more important in hacking the RTS itself.

My contribution to open source is minimal.  (Wrote a JFS port for GNU Hurd back when I was an undergraduate student; the few years spent in the industry after that and before coming to grad school didn't help further in that regard.)

In Fall 2011 I took courses on Domain Specific Languages for Parallel Performance and Parallel Architectures and Programming; and in Spring 2011 an Advanced Operating Systems course -- I imagine all of them could be useful.


In what ways do you envisage interacting with the wider Haskell community during your project? e.g. How would you seek help on something your mentor wasn't able to deal with? How will you get others interested in what you are doing?

I would use the IRC channel (#haskell and #haskell-soc on freenode) and the relevant mailing lists (glasgow-haskell-users, haskell-parallel).  I also plan to maintain a weblog that is updated on a regular (weekly perhaps) basis.  The development will happen in the open, possibly in a github fork of ghc repository.


Why do you think you would be the best person to tackle this project?

The wishlist above is result of some recent work on a scheduler for par-monad.  There's a research agenda to this, although this proposal is specifically about improvement of infrastructure.  As someone studying parallelism, languages and runtime systems, this suits my medium-term learning goals.  

I have access to a respectably "muscular" NUMA system (4 nodes, 32 cores "Intel(R) Xeon(R) CPU E7- 4830" machine with 64 GiB memory) to develop and test.  More importantly, there are some excellent people around to discuss ideas with, besides the excellent people from Haskell community.

Am I the "best" person to tackle this project?  That would of course be a tall claim!  The best person(s) for this project would be the longtime GHC hackers.  However, this wishlist is something I would like to see realized, and have the willingness to spend time and effort on.

Thank you for considering my application, and I look forward to having fun with this project over the summer.