Some background: The setups that prompted me to work on the patchset didn't really have a pgbench-like workload; the individual queries were/are more complicated, even though it's still a high-throughput OLTP workload. And the contention was *much* higher than what I can reproduce with pgbench -S: often nearly all the time was spent in the lwlock's spinlock, and it was primarily the buffer mapping lwlocks, locked in shared mode. The difference is that instead of touching very few buffers per query like pgbench does, those queries touched many more.
Perhaps I should try to argue for this extension to pgbench again:
I think it would do a good job of exercising what you want, provided you set the scale so that all data fits in RAM but not in shared_buffers.
Or maybe you want it to fit in shared_buffers, since the buffer mapping lock was contended in shared mode--that suggests the problem is finding the buffer that already holds the page, not allocating a buffer to hold it.
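
To make the two sizing choices concrete, here is a rough sketch of how one might pick a pgbench scale for either case. The figure of about 15 MB per scale unit is only a ballpark assumption, and the function names and the headroom/fill fractions are made up for illustration; the real size should be checked with pg_database_size() after pgbench -i.

```python
# Rough sizing helper. BYTES_PER_SCALE is an assumed ballpark, not a measured
# value; verify the real on-disk size after initializing with pgbench -i -s N.
BYTES_PER_SCALE = 15 * 1024 * 1024  # assumption: ~15 MB per scale unit
GIB = 1024 ** 3


def scale_between(shared_buffers_bytes, ram_bytes, headroom=0.5):
    """Scale whose data set is larger than shared_buffers but no larger than
    headroom * RAM (hypothetical fraction left for the OS page cache)."""
    low = shared_buffers_bytes // BYTES_PER_SCALE + 1   # just past shared_buffers
    high = int(ram_bytes * headroom) // BYTES_PER_SCALE  # still cached by the OS
    if low > high:
        raise ValueError("no scale fits between shared_buffers and RAM")
    return (low + high) // 2  # midpoint, to stay clear of both limits


def scale_within(shared_buffers_bytes, fill=0.8):
    """Scale whose data set fits inside shared_buffers, leaving some slack
    (fill is a hypothetical fraction)."""
    return max(1, int(shared_buffers_bytes * fill) // BYTES_PER_SCALE)


if __name__ == "__main__":
    print("between:", scale_between(8 * GIB, 64 * GIB))
    print("within: ", scale_within(8 * GIB))
```

The first function targets the "fits in RAM but not in shared_buffers" case, which exercises buffer replacement; the second targets the "fits in shared_buffers" case, where every lookup should find the page already mapped and only the shared-mode lookup path is stressed.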