Thread: debugging what might be a perf regression in 17beta2

From: Mark Callaghan
I am seeking advice. For now I am hoping for suggestions about changes from 17beta1 to 17beta2 that might cause the problem, assuming there is a problem and not a mistake in my testing.

One of the sysbench microbenchmarks that I run does a table scan with a WHERE clause that filters out all rows. That WHERE clause is there to reduce network IO.

While running it on a server with 16 real cores, 12 concurrent queries, and a cached database, the query takes ~5% more time on 17beta2 than on 17beta1 or 16.3. Alas, this is a Google Cloud server and perf doesn't work there.

On small servers I have at home I can reproduce the problem without concurrent queries and 17beta2 is 5% to 10% slower there.

The SQL statement for the scan microbenchmark is:
SELECT * from %s WHERE LENGTH(c) < 0 
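Since LENGTH() never returns a negative value, the WHERE clause matches no rows, which is what keeps network IO out of the measurement. For context, sysbench names its test tables sbtest1 .. sbtestN by default, so the rendered statements look roughly like this (a minimal sketch; the rendering helper is my own illustration, not the benchmark driver's actual code):

```python
# Sketch: how the scan statement template might be rendered per table.
# The sbtest1..sbtestN naming is sysbench's default convention; the
# helper below is hypothetical and only illustrates the %s substitution.
TEMPLATE = "SELECT * from %s WHERE LENGTH(c) < 0"

def rendered_queries(num_tables):
    """Return the scan statement for each sysbench table, sbtest1..sbtestN."""
    return [TEMPLATE % ("sbtest%d" % i) for i in range(1, num_tables + 1)]
```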

I will call my small home servers SER4 and PN53. They are described here:
https://smalldatum.blogspot.com/2022/10/small-servers-for-performance-testing-v4.html

The SER4 is a SER 4700u from Beelink and the PN53 is an ASUS ExpertCenter PN53. Both use an AMD CPU with 8 cores, AMD SMT disabled and Ubuntu 22.04. The SER4 has an older, slower CPU than the PN53. In all cases I compile from source using a configure command line like:
./configure --prefix=$pfx --enable-debug CFLAGS="-O2 -fno-omit-frame-pointer"
I used perf to get flamegraphs during the scan microbenchmark and they are archived here:
https://github.com/mdcallag/mytools/tree/master/bench/bugs/pg17beta2/24Jul5.sysbench.scan

For both SER4 and PN53 the time to finish the scan microbenchmark is ~10% longer in 17beta2 than it was in 17beta1 and 16.3. On the PN53 the query takes ~20 seconds with 16.3 and 17beta1 vs ~22.5 seconds for 17beta2 when the table has 60M rows.

From the SVG files for SER4 and 17beta2 I see ~2X more time in slot_getsomeattrs_int vs 17beta1 or 16.3 with all of that time spent in its child -- tts_buffer_heap_getsomeattrs. That function is defined in src/backend/executor/execTuples.c and that file has not changed from 17beta1 to 17beta2. But I don't keep up with individual commits to Postgres so I won't guess as to the root cause.

But the SVG files for PN53 don't show the same problem:
  • for 16.3 I see 85.24% in ExecInterpExpr vs 11.64% in SeqNext
  • for 17beta1 I see 82.82% in ExecInterpExpr vs 14.51% in SeqNext
  • for 17beta2 I see 85.03% in ExecInterpExpr vs 12.31% in SeqNext
  • for 17beta1 and 17beta2 the flamegraphs show time spent handling page faults during SeqNext, which isn't visible on the 16.3 flamegraph
And then for PN53, looking at slot_getsomeattrs_int (a child of ExecInterpExpr):
  • for 16.3 I see 6.99% in slot_getsomeattrs_int
  • for 17beta1 I see 4.29% in slot_getsomeattrs_int
  • for 17beta2 I see 3.99% in slot_getsomeattrs_int
So at this point I am confused. I am repeating the test with a slightly larger table, while keeping it small enough to fit in the Postgres buffer pool. I also have results from tables that are much larger than memory, and the problem reproduces even in that case.

--
Mark Callaghan
mdcallag@gmail.com

Re: debugging what might be a perf regression in 17beta2

From: David Rowley
On Sat, 6 Jul 2024 at 15:11, MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> On small servers I have at home I can reproduce the problem without concurrent queries and 17beta2 is 5% to 10% slower there.
>
> The SQL statement for the scan microbenchmark is:
> SELECT * from %s WHERE LENGTH(c) < 0

Can you share the CREATE TABLE and script to populate so others can try?

Also, have you tried with another compiler?  It does not seem
impossible that the refactoring done in heapam.c or the read stream
code might have changed something to make the binary more sensitive to
caching effects in this area.  One thing I often try when I can't
pinpoint the exact offending commit is to write a script to try the
first commit of each day for, say, 30 days to see if there is any
saw-toothing in performance numbers over that period.
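The selection step of that approach can be sketched as follows: feed it chronologically ordered (timestamp, sha) pairs, e.g. parsed from `git log --reverse --format='%ct %H'` over the range being tested, and it returns one sha per calendar day. This is a minimal sketch under the assumption that bucketing days by UTC is acceptable; the git invocation above is illustrative, not prescribed by the thread.

```python
# Sketch: pick the first commit of each calendar day from chronologically
# ordered (unix_timestamp, sha) pairs. The caller is assumed to have
# produced the pairs, e.g. via: git log --reverse --format='%ct %H' <range>
from datetime import datetime, timezone

def first_commit_per_day(commits):
    """commits: iterable of (unix_timestamp, sha), oldest first.
    Returns a list of shas, one per calendar day (UTC), keeping the
    first commit seen on each day."""
    picked = []
    seen_days = set()
    for ts, sha in commits:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).date()
        if day not in seen_days:
            seen_days.add(day)
            picked.append(sha)
    return picked
```

Each returned sha would then be checked out, built, and benchmarked in turn to look for the saw-toothing David describes.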

David



Re: debugging what might be a perf regression in 17beta2

From: Mark Callaghan
My results have too much variance, so this is a false alarm. One day I might learn whether the noise comes from the hardware, Postgres, or my test method.
I ended up trying 10 builds between 17beta1 and 17beta2, but even with that I don't have a clear signal.

On Fri, Jul 5, 2024 at 8:48 PM David Rowley <dgrowleyml@gmail.com> wrote:
> On Sat, 6 Jul 2024 at 15:11, MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> > On small servers I have at home I can reproduce the problem without concurrent queries and 17beta2 is 5% to 10% slower there.
> >
> > The SQL statement for the scan microbenchmark is:
> > SELECT * from %s WHERE LENGTH(c) < 0
>
> Can you share the CREATE TABLE and script to populate so others can try?
>
> Also, have you tried with another compiler?  It does not seem
> impossible that the refactoring done in heapam.c or the read stream
> code might have changed something to make the binary more sensitive to
> caching effects in this area.  One thing I often try when I can't
> pinpoint the exact offending commit is to write a script to try the
> first commit of each day for, say, 30 days to see if there is any
> saw-toothing in performance numbers over that period.
>
> David


--
Mark Callaghan
mdcallag@gmail.com

Re: debugging what might be a perf regression in 17beta2

From: Mark Callaghan
A writeup of the benchmark results is here: https://smalldatum.blogspot.com/2024/07/postgres-17beta2-vs-sysbench-looking.html
pg17beta2 and pg17beta1 look good so far.

On Mon, Jul 8, 2024 at 10:49 AM MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> My results have too much variance, so this is a false alarm. One day I might learn whether the noise comes from the hardware, Postgres, or my test method.
> I ended up trying 10 builds between 17beta1 and 17beta2, but even with that I don't have a clear signal.
>
> On Fri, Jul 5, 2024 at 8:48 PM David Rowley <dgrowleyml@gmail.com> wrote:
> > On Sat, 6 Jul 2024 at 15:11, MARK CALLAGHAN <mdcallag@gmail.com> wrote:
> > > On small servers I have at home I can reproduce the problem without concurrent queries and 17beta2 is 5% to 10% slower there.
> > >
> > > The SQL statement for the scan microbenchmark is:
> > > SELECT * from %s WHERE LENGTH(c) < 0
> >
> > Can you share the CREATE TABLE and script to populate so others can try?
> >
> > Also, have you tried with another compiler?  It does not seem
> > impossible that the refactoring done in heapam.c or the read stream
> > code might have changed something to make the binary more sensitive to
> > caching effects in this area.  One thing I often try when I can't
> > pinpoint the exact offending commit is to write a script to try the
> > first commit of each day for, say, 30 days to see if there is any
> > saw-toothing in performance numbers over that period.
> >
> > David
>
> --
> Mark Callaghan
> mdcallag@gmail.com


--
Mark Callaghan
mdcallag@gmail.com