This new patch is simpler than the previous one, and more effective at speeding up replication. I assume it would speed up pgbench with synchronous_commit turned off (or against unlogged tables) as well, but I don't think I have the hardware needed to test that.
If I use the 'tpcb-func' script embodied in the attached patch to pgbench, then I can see the performance difference against unlogged tables using 8 clients on an 8-CPU virtual machine. The normal tpcb-like script has too much communication overhead to show the difference, because it bounces from pgbench to the postgres backend 7 times per transaction. I also had to set autovacuum_vacuum_cost_delay=0; otherwise autoanalyze holds a snapshot long enough to bloat the HOT chains, which injects a great deal of variability into the timings.
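For readers without the patch at hand, here is a rough sketch of the general idea: fold the statements of the tpcb-like transaction into one server-side function so that each transaction costs a single round trip instead of seven (BEGIN, five statements, END). The function name tpcb_func_sketch and its signature are made up for illustration and are not taken from the attached patch, which may well do it differently.

```sql
-- Hypothetical sketch only: name and signature are assumptions,
-- not the contents of the attached patch.
CREATE OR REPLACE FUNCTION tpcb_func_sketch(p_aid int, p_bid int, p_tid int, p_delta int)
RETURNS int
LANGUAGE plpgsql AS $$
DECLARE
    v_balance int;
BEGIN
    -- Same five statements as the built-in tpcb-like script,
    -- executed inside one function call (one implicit transaction).
    UPDATE pgbench_accounts SET abalance = abalance + p_delta WHERE aid = p_aid;
    SELECT abalance INTO v_balance FROM pgbench_accounts WHERE aid = p_aid;
    UPDATE pgbench_tellers SET tbalance = tbalance + p_delta WHERE tid = p_tid;
    UPDATE pgbench_branches SET bbalance = bbalance + p_delta WHERE bid = p_bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
        VALUES (p_tid, p_bid, p_aid, p_delta, CURRENT_TIMESTAMP);
    RETURN v_balance;
END;
$$;
```

A custom pgbench script (9.6 or later syntax) that drives it could then be as short as the following. With a custom script, :scale is only set from -s, so pass the scale the tables were initialized with.

```sql
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
SELECT tpcb_func_sketch(:aid, :bid, :tid, :delta);
```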
Commit 7975c5e0a992ae9 in the 9.6 branch causes a regression of about 10%, and my patch from the previous email recovers that regression. It also gives the same improvement against 10dev HEAD.