On 16.06.2025 17:41, Andres Freund wrote:
> TBH, I don't see a point in continuing with this thread without something that
> others can test. I rather doubt that the right fix here is to just change the
> lock model over, but without a repro I can't evaluate that.
Hello,
I think I can reproduce the issue with pgbench on a multi-core server. I
start a regular select-only test with 64 clients, and while it's
running, I start a plpgsql loop creating and dropping temporary tables
from a single psql session. I observe a ~25% drop in tps reported by
pgbench until I cancel the query in psql.
$ pgbench -n -S -c64 -j64 -T300 -P1
progress: 10.0 s, 1249724.7 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 11.0 s, 1248289.0 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 12.0 s, 1246001.0 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 13.0 s, 1247832.5 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 14.0 s, 1248205.8 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 15.0 s, 1247737.3 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 16.0 s, 1219444.3 tps, lat 0.052 ms stddev 0.039, 0 failed
progress: 17.0 s, 893943.4 tps, lat 0.071 ms stddev 0.159, 0 failed
progress: 18.0 s, 927861.3 tps, lat 0.069 ms stddev 0.150, 0 failed
progress: 19.0 s, 886317.1 tps, lat 0.072 ms stddev 0.163, 0 failed
progress: 20.0 s, 877200.1 tps, lat 0.073 ms stddev 0.164, 0 failed
progress: 21.0 s, 875424.4 tps, lat 0.073 ms stddev 0.163, 0 failed
progress: 22.0 s, 877693.0 tps, lat 0.073 ms stddev 0.165, 0 failed
progress: 23.0 s, 897202.8 tps, lat 0.071 ms stddev 0.158, 0 failed
progress: 24.0 s, 917853.4 tps, lat 0.070 ms stddev 0.153, 0 failed
progress: 25.0 s, 907865.1 tps, lat 0.070 ms stddev 0.154, 0 failed
Here I started the following loop in psql around 17s and tps dropped by
~25%:
do $$
begin
  for i in 1..1000000 loop
    create temp table tt1 (a bigserial primary key, b text);
    drop table tt1;
    commit;
  end loop;
end;
$$;
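For anyone who wants to quantify the drop from the -P1 output above rather than eyeball it, here is a minimal Python sketch. It is not part of the original measurements: the helper names (parse_progress, relative_drop) and the choice of windows (baseline up to 15 s, loaded from 17 s, skipping the 16 s transition sample) are assumptions made for illustration.

```python
import re
from statistics import mean

# Progress lines copied verbatim from the pgbench -P1 output above.
SAMPLE = """\
progress: 10.0 s, 1249724.7 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 11.0 s, 1248289.0 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 12.0 s, 1246001.0 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 13.0 s, 1247832.5 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 14.0 s, 1248205.8 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 15.0 s, 1247737.3 tps, lat 0.051 ms stddev 0.002, 0 failed
progress: 16.0 s, 1219444.3 tps, lat 0.052 ms stddev 0.039, 0 failed
progress: 17.0 s, 893943.4 tps, lat 0.071 ms stddev 0.159, 0 failed
progress: 18.0 s, 927861.3 tps, lat 0.069 ms stddev 0.150, 0 failed
progress: 19.0 s, 886317.1 tps, lat 0.072 ms stddev 0.163, 0 failed
progress: 20.0 s, 877200.1 tps, lat 0.073 ms stddev 0.164, 0 failed
progress: 21.0 s, 875424.4 tps, lat 0.073 ms stddev 0.163, 0 failed
progress: 22.0 s, 877693.0 tps, lat 0.073 ms stddev 0.165, 0 failed
progress: 23.0 s, 897202.8 tps, lat 0.071 ms stddev 0.158, 0 failed
progress: 24.0 s, 917853.4 tps, lat 0.070 ms stddev 0.153, 0 failed
progress: 25.0 s, 907865.1 tps, lat 0.070 ms stddev 0.154, 0 failed
"""

# Matches the elapsed time and tps fields of a pgbench progress line.
PROGRESS_RE = re.compile(r"progress: ([\d.]+) s, ([\d.]+) tps")


def parse_progress(text):
    """Return a list of (elapsed_seconds, tps) tuples from pgbench -P output."""
    return [(float(m.group(1)), float(m.group(2)))
            for m in PROGRESS_RE.finditer(text)]


def relative_drop(samples, baseline_until, loaded_from):
    """Fractional drop in mean tps between the baseline and loaded windows."""
    base = mean(tps for t, tps in samples if t <= baseline_until)
    loaded = mean(tps for t, tps in samples if t >= loaded_from)
    return 1.0 - loaded / base


samples = parse_progress(SAMPLE)
# Assumed windows: baseline is 10..15 s, DDL loop is active from 17 s on.
drop = relative_drop(samples, baseline_until=15.0, loaded_from=17.0)
print(f"mean tps drop under concurrent DDL: {drop:.1%}")
```

The result depends on where the windows are placed, so treat it as a rough cross-check of the ~25% figure, not as a separate measurement.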
Now, if I simply remove the spinlock in SIGetDataEntries, I see a drop
of just ~6% under concurrent DDL. I think this strongly suggests that
the spinlock is the bottleneck.
Before that, I tried removing the `if (!hasMessages) return` optimization
in SIGetDataEntries to stress the spinlock and observed a ~35% drop in
select-only tps with an empty sinval queue (no DDL running in the
background). Then I also removed the spinlock in SIGetDataEntries, and
the loss was just ~4%, which may be noise. I think this also suggests
that the spinlock could be the bottleneck.
I'm running this on a 2-socket AMD EPYC 9654 96-Core server with
postgres and pgbench bound to distinct CPUs. PGDATA is placed on tmpfs.
postgres is running with the default settings. The pgbench tables are at
scale 1. pgbench is connecting via loopback/127.0.0.1.
Does this sound convincing?
Best regards,
--
Sergey Shinderuk https://postgrespro.com/