From: Tomas Vondra
Subject: Re: AIO v2.5
Msg-id: f50ad709-4fe4-4eea-a02a-a0f32c702e95@vondra.me
In response to: Re: AIO v2.5 (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On 7/14/25 20:44, Andres Freund wrote:
> On 2025-07-13 20:04:51 +0200, Tomas Vondra wrote:
>> On 7/11/25 23:03, Tomas Vondra wrote:
>>> ...
>>>
>>> e) indexscan regression (ryzen-indexscan-uniform-pg17-checksums.png)
>>>
>>> There's an interesting difference I noticed in the run with
>>> checksums on PG17. The full PDF is available here:
>>>
>>> https://github.com/tvondra/iomethod-tests/blob/run2-17-checksums-on/ryzen-rows-cold-32GB-16-unscaled.pdf
>>>
>>> The interesting thing is that PG17 indexscans on the uniform dataset
>>> got a little bit faster. In the attached PDF it's exactly on par
>>> with PG18, but here it got a bit faster. Which makes no sense, if it
>>> also has to verify checksums. I haven't had time to investigate this
>>> yet.
>>
>> I was intrigued by this, so I looked into it today.
>>
>> TL;DR I believe it was caused by something in the filesystem or even
>> the storage devices, making the "PG17" data directory (or maybe even
>> just the "uniform" table) a bit faster.
>>
>> I started by reproducing the behavior with an indexscan matching 10%
>> of the rows, and it was very easy to reproduce the difference shown
>> in the chart (all timings in milliseconds):
>>
>> PG17: 14112.800 ms
>> PG18: 21612.090 ms
>
> That's a decidedly nontrivial difference.

It is. It surprised me.

> Did you keep any metrics from those runs? E.g. whether there were
> larger IOs or such?

Unfortunately no. I should have investigated more before rebuilding the
filesystem. But I suspect I might be able to reproduce it, if I do the
loads in a loop or something like that.

>> This was perfectly reproducible, affecting the whole table (not just
>> one part of it), etc. At some point I recalled that I might have
>> initialized the databases in slightly different ways - one by running
>> the SQL, the other one by pg_dump/pg_restore (likely with multiple
>> jobs).
>
> I guess that's an INSERT ... SELECT vs COPY?
>
> Which one was the faster one?

I suspect I initialized one instance by the SQL script that generates
data by INSERT ... SELECT for tables one by one (and then builds
indexes, also one by one). And then I might have initialized the other
database by pg_dump/pg_restore, so that'd be COPY. FWIW I might have
used pg_restore with parallelism, and I suspect that might be more
important than INSERT vs. COPY.

But maybe I'm misremembering things, I rebuilt the benchmark databases
so many times over the past couple of days ... not sure.

> If this ever re-occurs, it might be interesting to look at the
> fragmentation of the underlying files with filefrag.

Yeah, I'll keep that in mind. And I'll try to reproduce this, once I'm
done with those benchmarks.
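For the archives, this is roughly what I plan to check next time (a
sketch - the database name, data directory path and relfilenode below
are just placeholders from my setup, not the real ones):

  # find the file backing the table, relative to the data directory
  psql -d benchdb -Atc "SELECT pg_relation_filepath('uniform')"
  # => base/16384/16397 (for example)

  # inspect the extent layout of that file (and the .1, .2, ... segments
  # for relations over 1GB) - fewer/larger extents = less fragmentation
  sudo filefrag -v /mnt/raid/pg17/base/16384/16397

Comparing the extent counts between a "fast" and a "slow" data directory
should tell us whether filesystem-level fragmentation is the culprit.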
>> I couldn't think of any other difference between the data
>> directories, so I simply reloaded them by pg_restore (from the same
>> dump). Which however made them both slow :O
>
> So that suggests that COPY is the slow case, interesting.
>
> One potentially relevant factor could be that parallel COPY into
> logged tables currently leads to really sub-optimal write patterns,
> due to us writing back buffers one-by-one, interspersed by WAL writes
> and file extensions. I know how that affects write speed, but it's
> not entirely obvious how it would affect read speed...

Maybe. I don't remember which database I loaded first. But this time I
simply copied a backup of the data directory from a different device,
which would "clean up" the write pattern. I'd assume filesystems
allocate space in larger chunks, e.g. ext4 does delayed allocation (and
I don't see why that wouldn't work here).

>> And it didn't matter how many jobs were used, or anything else I
>> tried. But every now and then an instance (17 or 18) happened to be
>> fast (~14000 ms). Consistently, for all queries on the table, not
>> randomly.
>>
>> In the end I recreated the (ext4) filesystem, loaded the databases,
>> and now both instances are fast. I have no idea what the root cause
>> was, and I assume recreating the filesystem destroyed all the
>> evidence.
>
> Besides differences in filesystem-level fragmentation, another
> potential theory is that the SSDs were internally more fragmented.
> Occasionally dumping/restoring the data could allow the drive to do
> internal wear leveling before the new data is loaded, leading to a
> better layout.
>
> I found that I get more consistent benchmark performance if I delete
> as much of the data as possible, run fstrim -v -a and then load the
> data. And do another round of fstrim.

Maybe, but the filesystem is far from full, which I think helps wear
leveling. The RAID has ~4TB and only ~500GB was ever used. But who
knows, and if that's the cause, we'll never really know, I'm afraid.
It's hard to confirm what happens inside an SSD :-(

regards

--
Tomas Vondra