Re: BitmapHeapScan streaming read user and prelim refactoring - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: BitmapHeapScan streaming read user and prelim refactoring
Date
Msg-id 059b1665-38aa-450f-a7a0-0bdd37104069@enterprisedb.com
Whole thread Raw
In response to Re: BitmapHeapScan streaming read user and prelim refactoring  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-hackers
On 3/17/24 20:36, Tomas Vondra wrote:
> 
> ...
> 
>> Besides a lot of other things, I finally added debugging fprintfs printing the
>> pid, (prefetch, read), block number. Even looking at tiny excerpts of the
>> large amount of output that generates shows that two iterators were out of
>> sync.
>>
> 
> Thanks. I did experiment with fprintf, but it's quite cumbersome, so I
> was hoping you came up with some smart way to trace this king of stuff.
> For example I was wondering if ebpf would be a more convenient way.
> 

FWIW I just realized why I failed to identify this "late prefetch" issue
during my investigation. I was experimenting with instrumenting this by
adding a LD_PRELOAD library, logging all pread/fadvise calls. But the
FilePrefetch call is skipped in the page is already in shared buffers,
so this case "disappeared" during processing which matched the two calls
by doing an "inner join".

That being said, I think tracing this using LD_PRELOAD or perf may be
more convenient way to see what's happening. For example I ended up
doing this:

  perf record -a -e syscalls:sys_enter_fadvise64 \
                 -e syscalls:sys_exit_fadvise64 \
                 -e syscalls:sys_enter_pread64 \
                 -e syscalls:sys_exit_pread64

  perf script -ns

Alternatively, perf-trace can be used and prints the filename too (but
time has ms resolution only). Processing this seems comparable to the
fprintf approach.

It still has the issue that some of the fadvise calls may be absent if
the prefetch iterator gets too far behind, but I think that can be
detected / measured by simply counting the fadvise calls, and comparing
them to pread calls. We expect these to be about the same, so

   (#pread - #fadvise) / #fadvise

is a measure of how many were "late" and skipped.

It also seems better than fprintf because it traces the actual syscalls,
not just calls to glibc wrappers. For example I saw this

postgres   54769 [001] 33768.771524828:
           syscalls:sys_enter_pread64: ..., pos: 0x30d04000

postgres   54769 [001] 33768.771526867:
           syscalls:sys_exit_pread64: 0x2000

postgres   54820 [000] 33768.771527473:
           syscalls:sys_enter_fadvise64: ..., offset: 0x30d04000, ...

postgres   54820 [000] 33768.771528320:
           syscalls:sys_exit_fadvise64: 0x0

which is clearly a case where we issue fadvise after pread of the same
block already completed.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Alexander Korotkov
Date:
Subject: Re: Add pg_basetype() function to obtain a DOMAIN base type
Next
From: Amit Langote
Date:
Subject: Re: Memory consumed by child SpecialJoinInfo in partitionwise join planning