This is your problem. There is only one row in the pgbench_branches table, and every transaction has to update that one row. This is inherently a serialized event.
Indeed it was!
One solution is to just use a larger scale on the benchmark so that the transactions update random pgbench_branches rows, rather than all updating the same row:
pgbench -i -s50
With a scale of 1000, everything except the END took roughly the latency time. Interestingly, the END still seems to take longer when threads/clients are really ramped up (100 vs 8). Why would that be?
Could it be the fsync time? Presumably your server has a BBU on it, but perhaps the transaction rate at the high scale factor is high enough to overwhelm it. Are you sure you don't see the same timing when you run pgbench locally?
Alternatively, you could write a custom transaction file so that all 7 commands are sent down in one packet.
How would you restructure the SQL so as to make that happen?
Just make a file with all the commands in it, and then remove the newlines from the non-backslash commands so that they are all on the same line (separated by semicolons). Leave the backslash commands on their own lines. Then use the -f switch to pgbench, giving it the file you just made. Also, make sure you give it '-s 1000' as well, because when you use the -f option pgbench does not auto-detect the scale factor.
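As a rough sketch, the file might look like the following. The file name is made up, and this assumes an older pgbench whose built-in TPC-B script uses \setrandom (newer pgbench versions replaced it with \set ... random(...)), so adjust the backslash commands to match what your version's default script actually contains:

```shell
# Hypothetical example: write the default TPC-B transaction as a custom
# script, with the backslash commands on their own lines and all 7 SQL
# commands collapsed onto one line so they go down in a single packet.
cat > tpcb_oneline.sql <<'EOF'
\set nbranches :scale
\set ntellers 10 * :scale
\set naccounts 100000 * :scale
\setrandom aid 1 :naccounts
\setrandom bid 1 :nbranches
\setrandom tid 1 :ntellers
\setrandom delta -5000 5000
BEGIN; UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid; SELECT abalance FROM pgbench_accounts WHERE aid = :aid; UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid; UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid; INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP); END;
EOF

# Then run it, passing the scale explicitly since -f disables auto-detection
# (connection options and client/thread counts are illustrative):
# pgbench -f tpcb_oneline.sql -s 1000 -c 8 -T 60
```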