From: Tomas Vondra
Subject: Re: AIO v2.5
Msg-id: f50ad709-4fe4-4eea-a02a-a0f32c702e95@vondra.me
In response to: Re: AIO v2.5 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 7/14/25 20:44, Andres Freund wrote:
> On 2025-07-13 20:04:51 +0200, Tomas Vondra wrote:
>> On 7/11/25 23:03, Tomas Vondra wrote:
>>> ...
>>>
>>> e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png)
>>>
>>> There's an interesting difference I noticed in the run with
>>> checksums on PG17. The full PDF is available here:
>>>
>>> https://github.com/tvondra/iomethod-tests/blob/run2-17-checksums-on/ryzen-rows-cold-32GB-16-unscaled.pdf
>>>
>>> The interesting thing is that PG17 indexscans on the uniform dataset
>>> got a little bit faster. In the attached PDF it's exactly on par with
>>> PG18, but here it got a bit faster. Which makes no sense if it also
>>> has to verify checksums. I haven't had time to investigate this yet.
>>
>> I was intrigued by this, so I looked into it today.
>>
>> TL;DR I believe it was caused by something in the filesystem or even the
>> storage devices, making the "PG17" data directory (or maybe even just
>> the "uniform" table) a bit faster.
>>
>> I started by reproducing the behavior with an indexscan matching 10% of
>> the rows, and it was very easy to reproduce the difference shown on the
>> chart (all timings in milliseconds):
>>
>> PG17: 14112.800 ms
>> PG18: 21612.090 ms
> 
> That's a decidedly nontrivial difference.
> 

It is. It surprised me.
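
(If anyone wants to replicate this: the runs were cold-cache, and the
measurement was of roughly the following shape. Table and column names
here are just placeholders, the actual schema is in the repo linked
above.)

  # drop shared buffers and the OS page cache, so the run is actually cold
  sudo systemctl restart postgresql       # or however the service is managed
  sync && echo 3 | sudo tee /proc/sys/vm/drop_caches

  # force the index scan and time it; ~10% selectivity via a range predicate
  psql -d bench \
       -c "SET enable_seqscan = off;" \
       -c "EXPLAIN (ANALYZE, BUFFERS)
           SELECT * FROM uniform WHERE a BETWEEN 1 AND 100000;"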

> Did you keep any metrics from those runs? E.g. whether there were larger IOs
> or such?
> 

Unfortunately, no. I should have investigated more before rebuilding the
filesystem. But I suspect I might be able to reproduce it if I do the
loads in a loop or something like that.

> 
>> This was perfectly reproducible, affecting the whole table (not just one
>> part of it), etc. At some point I recalled that I might have initialized
>> the databases in slightly different ways - one by running the SQL, the
>> other one by pg_dump/pg_restore (likely with multiple jobs).
> 
> I guess that's an INSERT ...  SELECT vs COPY?
> 
> Which one was the faster one?
> 

I suspect I initialized one instance using the SQL script that generates
data with INSERT ... SELECT, one table at a time (and then builds the
indexes, also one by one). And I might have initialized the other
database with pg_dump/pg_restore, so that'd be COPY.

FWIW I might have used pg_restore with parallelism, and I suspect that
might be more important than INSERT vs. COPY.

But maybe I'm misremembering things; I've rebuilt the benchmark databases
so many times over the past couple of days ... not sure.


> If this ever re-occurs, it might be interesting to look at the fragmentation
> of the underlying files with filefrag.
> 

Yeah, I'll keep that in mind. And I'll try to reproduce this once I'm
done with those benchmarks.
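
For the archives, checking a relation's on-disk layout would go something
like this (the table name is a placeholder):

  # map a table to its files, then inspect the extent layout
  pgdata=$(psql -At -d bench -c "SHOW data_directory;")
  relpath=$(psql -At -d bench -c "SELECT pg_relation_filepath('uniform');")
  sudo filefrag -v "$pgdata/$relpath"*    # the glob also picks up the .1,
                                          # .2, ... segments and _fsm/_vm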

> 
>> I couldn't think of any other difference between the data directories,
>> so I simply reloaded them both with pg_restore (from the same dump).
>> Which, however, made them both slow :O
> 
> So that suggests that COPY is the slow case, interesting.
> 
> One potentially relevant factor could be that parallel COPY into logged tables
> currently leads to really sub-optimal write patterns, due to us writing back
> buffers one-by-one, interspersed by WAL writes and file extensions. I know how
> that affects write speed, but it's not entirely obvious how it would affect
> read speed...
> 

Maybe. I don't remember which database I loaded first. But this time I
simply copied a backup of the data directory from a different device,
which would "clean up" the write pattern.

I'd assume filesystems allocate space in larger chunks, e.g. ext4 does
delayed allocation (and I don't see why that wouldn't work here).
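
If it reoccurs, a cheap sanity check on ext4 would be something like this
(assuming e4defrag from e2fsprogs is installed; -c only reports, it
doesn't move anything):

  # report the fragmentation score for the cluster's data files,
  # without rewriting anything on disk; the path is a placeholder
  sudo e4defrag -c /path/to/pgdata/base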

> 
>> And it didn't matter how many jobs were used, or anything else I tried.
>> But every now and then an instance (17 or 18) happened to be fast
>> (~14000 ms). Consistently, for all queries on the table, not randomly.
>>
>> In the end I recreated the (ext4) filesystem, loaded the databases and
>> now both instances are fast. I have no idea what the root cause was, and
>> I assume recreating the filesystem destroyed all the evidence.
> 
> Besides differences in filesystem level fragmentation, another potential
> theory is that the SSDs were internally more fragmented. Occasionally
> dumping/restoring the data could allow the drive to do internal wear
> leveling before the new data is loaded, leading to a better layout.
> 
> I found that I get more consistent benchmark performance if I delete as much
> of the data as possible, run fstrim -v -a and then load the data. And do
> another round of fstrim.
> 

Maybe, but the filesystem is far from full, which I think helps wear
leveling. The RAID has ~4TB and only ~500GB was ever used. But who knows;
if that's the cause, we'll never really know, I'm afraid. It's hard to
confirm what happens inside an SSD :-(
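
(For the archives, the trim routine described above amounts to roughly
this; the database name is a placeholder:)

  # free as much space as possible first, then trim, reload, trim again
  dropdb bench                      # or remove the old data directory
  sudo fstrim -v -a
  # ... reload the benchmark data ...
  sudo fstrim -v -a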


regards

-- 
Tomas Vondra



