Implicit locking

In my quest for ever more efficient locking strategies, implicit locking is one of my favourites along with the lock free pub/sub queue. It's for use with data that is frequently read but rarely updated as it incurs zero overhead while reading at the cost of a much greater overhead when locking for modification.

A synthetic test on a 32-way Power7 box using a large number of reading threads and few writers running in loops showed around a 2000% performance improvement (yup, that's the correct number of zeros!) over pthread rwlocks.

Prerequisites for this technique are:

  1. hardware support for storage keys

  2. data is locked with page granularity

The only systems I'm certain support storage key semantics for this are:

Power 6, 6+ and 7 processors.

Systems on which it may be possible but would need investigating are:

IBM mainframes

These have supported storage keys for years but I'm unaware of the semantics

PS3/Cell based systems

The Cell PPE is Power based so may include this capability. This is speculation.

For an efficient use of this technique it is necessary to have efficient toggling of key permissions on pages, meaning direct access page table permission words. Unfortunately on AIX this must be done via the ukey_protect system call; on Linux I'm not aware of any storage key support at all.

Even without efficient toggling my synthetic testing showed a performance improvement because the overhead is only incurred on the write path. Improving the efficiency of the toggle mechanic would simply make this technique effective in circumstances with a less extreme reading bias.

It is also worth noting that while arbitrary amounts of data can be protected under one of these locks and writing is atomic, reading anything other than an aligned sig_atomic_t value from this locked data my be preempted by a write. This can be dealt with easily via various means but a developer must be aware of the necessity. As such it's likely safer to restrict the use of implicit locking to atomic data.

Hard performance figures and source code are currently unavailable as this basic testing was performed while at IBM on IBM systems and, lacking access to an AIX Power7 box, I'm currently unable to recreate the test in circumstances where I can publish the results. Should anyone have such a machine I'd be interested in access!