Re: index prefetching - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: index prefetching
Msg-id a84f5f7a-b146-4ebf-8ebd-dbdde070d4ba@vondra.me
In response to Re: index prefetching  (Andres Freund <andres@anarazel.de>)
Responses Re: index prefetching
List pgsql-hackers
On 8/14/25 01:19, Andres Freund wrote:
> Hi,
> 
> On 2025-08-14 01:11:07 +0200, Tomas Vondra wrote:
>> On 8/13/25 23:57, Peter Geoghegan wrote:
>>> On Wed, Aug 13, 2025 at 5:19 PM Tomas Vondra <tomas@vondra.me> wrote:
>>>> It's also not very surprising this happens with backwards scans more.
>>>> The I/O is apparently much slower (due to missing OS prefetch), so we're
>>>> much more likely to hit the I/O limits (max_ios and various other limits
>>>> in read_stream_start_pending_read).
>>>
>>> But there's no OS prefetch with direct I/O. At most, there might be
>>> some kind of readahead implemented in the SSD's firmware.
>>>
>>
>> Good point, I keep forgetting direct I/O means no OS read-ahead. Not
>> sure if there's a good way to determine if the SSD can do something like
>> that (and how well). I wonder if there's a way to do backward sequential
>> scans in fio ..
> 
> In theory, yes, in practice, not quite:
> https://github.com/axboe/fio/issues/1963
> 
> So right now it only works if you skip over some blocks. For that there are rather
> significant performance differences on my SSDs. E.g.
> 
> andres@awork3:~/src/fio$ fio --directory /srv/fio --size=$((1024*1024*1024)) --name test --bs=4k --rw read:8k --buffered 0 2>&1|grep READ
>    READ: bw=179MiB/s (188MB/s), 179MiB/s-179MiB/s (188MB/s-188MB/s), io=341MiB (358MB), run=1907-1907msec
> andres@awork3:~/src/fio$ fio --directory /srv/fio --size=$((1024*1024*1024)) --name test --bs=4k --rw read:-8k --buffered 0 2>&1|grep READ
>    READ: bw=70.6MiB/s (74.0MB/s), 70.6MiB/s-70.6MiB/s (74.0MB/s-74.0MB/s), io=1024MiB (1074MB), run=14513-14513msec
> 
> So on this WD Red SN700 there's a rather substantial performance difference.
> 
> On a Samsung 970 PRO I don't see much of a difference. Nor on an ADATA
> SX8200PNP.
> 

I experimented with this a little bit today. Given the fio issues, I
ended up writing a simple tool in C, doing pread() forward/backward with
different block sizes and direct I/O. AFAICS this is roughly equivalent
to fio with iodepth=1 (based on a couple of tests).
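
For illustration, a minimal sketch of what such a pread() loop might look
like. This is not the actual tool; the helper name scan_file(), its
parameters and the 4096-byte buffer alignment are assumptions made up for
this example. It reads a file block by block, forward or backward, and
reports the elapsed time, optionally with O_DIRECT where available.

```c
#define _GNU_SOURCE             /* exposes O_DIRECT on Linux */
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: read all whole blocks of a file with pread(), one
 * request at a time, either front to back or back to front.
 *
 * Returns the number of bytes read, or -1 on error. If elapsed is not NULL,
 * the wall-clock duration of the read loop (in seconds) is stored there.
 * With use_direct set, block_size must satisfy the device's O_DIRECT
 * alignment rules (typically a multiple of the logical block size).
 */
static int64_t
scan_file(const char *path, size_t block_size, int backward, int use_direct,
          double *elapsed)
{
    int         flags = O_RDONLY;
    int         fd;
    struct stat st;
    void       *buf = NULL;
    int64_t     nblocks;
    int64_t     total = 0;
    struct timespec start, end;

    assert(block_size > 0);

#ifdef O_DIRECT
    if (use_direct)
        flags |= O_DIRECT;      /* bypass the page cache and OS read-ahead */
#else
    (void) use_direct;          /* no O_DIRECT here, fall back to buffered */
#endif

    fd = open(path, flags);
    if (fd < 0)
        return -1;

    /* O_DIRECT needs an aligned buffer; 4096 is an assumed safe alignment */
    if (fstat(fd, &st) != 0 || posix_memalign(&buf, 4096, block_size) != 0)
    {
        close(fd);
        return -1;
    }

    nblocks = (int64_t) st.st_size / (int64_t) block_size;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int64_t i = 0; i < nblocks; i++)
    {
        /* the only difference between directions is the block order */
        int64_t     blkno = backward ? (nblocks - 1 - i) : i;
        ssize_t     n = pread(fd, buf, block_size,
                              (off_t) (blkno * (int64_t) block_size));

        if (n != (ssize_t) block_size)
        {
            total = -1;
            break;
        }
        total += n;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);

    if (elapsed != NULL)
        *elapsed = (double) (end.tv_sec - start.tv_sec) +
            (double) (end.tv_nsec - start.tv_nsec) / 1e9;

    free(buf);
    close(fd);
    return total;
}
```

Calling scan_file(path, 8192, 0, 1, &secs) and then
scan_file(path, 8192, 1, 1, &secs) on the same large file would give a
forward/backward comparison at 8K. Only one request is in flight at a time,
which is why this kind of loop should behave like iodepth=1.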

Too bad fio has issues with backward sequential tests ... I'll see if I
can get at least some fio results to validate mine.

On all my SSDs there's a massive difference between forward and backward
sequential scans. It depends on the block size, but for the smaller
block sizes (1-16KB) backward scans are roughly 4x slower. It gets better
for larger blocks, but while that's interesting, we're stuck with 8K blocks.


FWIW I'm not claiming this explains all the odd things we're investigating
in this thread, it's more a confirmation that the scan direction may
matter if it translates to direction at the device level. I don't think
it can explain the strange stuff with the "random" data sets constructed
by Peter.


regards

-- 
Tomas Vondra

