Re: pg_buffercache: Add per-relation summary stats - Mailing list pgsql-hackers
| From | Heikki Linnakangas |
|---|---|
| Subject | Re: pg_buffercache: Add per-relation summary stats |
| Date | |
| Msg-id | f64f741f-36db-4b5b-b7ff-ee86af65e3e6@iki.fi |
| In response to | Re: pg_buffercache: Add per-relation summary stats (Andres Freund <andres@anarazel.de>) |
| Responses | Re: pg_buffercache: Add per-relation summary stats |
| List | pgsql-hackers |
On 07/04/2026 16:47, Andres Freund wrote:
> On 2026-04-07 16:07:45 +0300, Heikki Linnakangas wrote:
>> On 28/03/2026 06:18, Ashutosh Bapat wrote:
>>> In parallel, Palak Chaturvedi and I developed a quick patch to
>>> modernise pg_buffercache_pages() and use a tuplestore, so that it
>>> doesn't have to rely on NBuffers staying the same between the start
>>> of the scan, when the memory is allocated, and the end of the scan -
>>> a condition that becomes possible with buffer cache resizing. It
>>> seems to improve the timings by about 10-30% on my laptop for a
>>> 128MB buffer cache. Without this patch the time taken to execute
>>> Lukas's query varies between 10-15ms on my laptop. With this patch
>>> it varies between 8-9ms. So the timing is more stable as a side
>>> effect. It's not the 10x improvement we are looking for, but it
>>> looks like a step in the right direction. That improvement seems to
>>> come purely because we avoid creating a heap tuple. I wonder if
>>> there are some places up in the execution tree where full heap
>>> tuples get formed again instead of continuing to use minimal tuples,
>>> or places where we perform some extra actions that are not required.
>
> I don't think that's the reason for the improvement - tuplestore_putvalues()
> forms a minimal tuple, and the cost of forming a minimal tuple and a heap
> tuple isn't meaningfully different.
Yeah, I wasn't fully convinced of that part either, which is why I left
it out of the commit message. I mostly wanted to get rid of the
double-buffering where we first accumulated all the data in an array.
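
For reference, the committed shape is roughly this - just a sketch of the
per-buffer loop from memory, not the exact committed code:

    /*
     * Sketch only: fill the tuplestore set up by InitMaterializedSRF()
     * row by row, instead of accumulating everything in a palloc'd
     * array first and converting it to tuples at the end.
     */
    ReturnSetInfo *rsinfo = (ReturnSetInfo *) fcinfo->resultinfo;

    InitMaterializedSRF(fcinfo, 0);

    for (int i = 0; i < NBuffers; i++)
    {
        BufferDesc *bufHdr = GetBufferDescriptor(i);
        uint32      buf_state = LockBufHdr(bufHdr);
        Datum       values[NUM_BUFFERCACHE_PAGES_ELEM] = {0};
        bool        nulls[NUM_BUFFERCACHE_PAGES_ELEM] = {0};

        values[0] = Int32GetDatum(i + 1);
        /* ... fill in the remaining columns from bufHdr ... */

        UnlockBufHdr(bufHdr, buf_state);

        tuplestore_putvalues(rsinfo->setResult, rsinfo->setDesc,
                             values, nulls);
    }
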
> I think the problem is that we materialize rowmode SRFs as a tuplestore if
> they are in the FROM list. You can easily see this even with just
> generate_series():
>
> postgres[1520825][1]=# SELECT count(*) FROM generate_series(1, 1000000);
> ┌─────────┐
> │ count │
> ├─────────┤
> │ 1000000 │
> └─────────┘
> (1 row)
>
> Time: 117.939 ms
> postgres[1520825][1]=# SELECT count(*) FROM (SELECT generate_series(1, 1000000));
> ┌─────────┐
> │ count │
> ├─────────┤
> │ 1000000 │
> └─────────┘
> (1 row)
>
> Time: 58.914 ms
Oh, to be honest I didn't remember that we *don't* materialize when it's
in the target list.
> Of course, because pg_buffercache_pages() is archaically defined without
> defining its output columns, you can't actually use it in the select list.
>
> But that can be fixed:
>
> CREATE FUNCTION pg_buffercache_pages_fast(OUT bufferid integer, OUT relfilenode oid, OUT reltablespace oid,
>     OUT reldatabase oid, OUT relforknumber int2, OUT relblocknumber int8, OUT isdirty bool, OUT usagecount int2,
>     OUT pinning_backends int4)
> RETURNS SETOF RECORD
> AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> LANGUAGE C PARALLEL SAFE;
>
> 60GB of s_b, mostly filled, with 257c8231bf97a77378f6fedb826b1243f0a41612
> reverted.
>
> SELECT count(*) FROM (SELECT pg_buffercache_pages_fast());
> Time: 1518.704 ms (00:01.519)
>
> SELECT count(*) FROM pg_buffercache_pages_fast();
> Time: 2008.101 ms (00:02.008)
Hmm, we could easily go back to ValuePerCall mode, while still getting
rid of the temporary array and keeping the other modernizations. But
"SELECT pg_buffercache_pages_fast()" doesn't actually produce the result
we want:
postgres=# SELECT pg_buffercache_pages_fast() limit 1;
pg_buffercache_pages_fast
---------------------------
(1,1262,1664,0,0,0,f,5,0)
(1 row)
Can we turn that into the right shape for the pg_buffercache view? This
works:
postgres=# SELECT (pg_buffercache_pages_fast()).* limit 1;
 bufferid | relfilenode | reltablespace | reldatabase | relforknumber | relblocknumber | isdirty | usagecount | pinning_backends
----------+-------------+---------------+-------------+---------------+----------------+---------+------------+------------------
        1 |        1262 |          1664 |           0 |             0 |              0 | f       |          5 |                0
(1 row)
But that doesn't seem to be faster anymore, despite avoiding the
tuplestore, because of overheads elsewhere in the executor.
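
Just to illustrate what I mean by ValuePerCall mode, this is roughly the
shape it would take - hypothetical function name, untested sketch, and it
ignores the NBuffers-changing-mid-scan problem that the tuplestore
version sidesteps:

#include "postgres.h"
#include "fmgr.h"
#include "funcapi.h"
#include "storage/bufmgr.h"

PG_FUNCTION_INFO_V1(pg_buffercache_pages_vpc);

Datum
pg_buffercache_pages_vpc(PG_FUNCTION_ARGS)
{
    FuncCallContext *funcctx;

    if (SRF_IS_FIRSTCALL())
    {
        MemoryContext oldcontext;
        TupleDesc     tupdesc;

        funcctx = SRF_FIRSTCALL_INIT();
        oldcontext = MemoryContextSwitchTo(funcctx->multi_call_memory_ctx);

        /* the OUT parameters of the CREATE FUNCTION give us the rowtype */
        if (get_call_result_type(fcinfo, NULL, &tupdesc) != TYPEFUNC_COMPOSITE)
            elog(ERROR, "return type must be a row type");
        funcctx->tuple_desc = tupdesc;
        funcctx->max_calls = NBuffers;

        MemoryContextSwitchTo(oldcontext);
    }

    funcctx = SRF_PERCALL_SETUP();

    if (funcctx->call_cntr < funcctx->max_calls)
    {
        /* 9 columns, matching the CREATE FUNCTION above */
        Datum     values[9] = {0};
        bool      nulls[9] = {0};
        HeapTuple tuple;

        /* read one buffer header here, as in the materialized loop ... */
        values[0] = Int32GetDatum((int32) funcctx->call_cntr + 1);

        tuple = heap_form_tuple(funcctx->tuple_desc, values, nulls);
        SRF_RETURN_NEXT(funcctx, HeapTupleGetDatum(tuple));
    }

    SRF_RETURN_DONE(funcctx);
}
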
>>> I didn't dig into the history to find out why we didn't modernize
>>> pg_buffercache_pages(). I don't see any hazard though.
>>
>> Committed this modernization patch, thanks!
>>
>> It would be nice to have a proper row-at-a-time mode that would avoid
>> materializing the result, but collecting all the data in a temporary array
>> is clearly worse than just putting them into the tuplestore directly. The only
>> reason I can think of why we'd prefer to use a temporary array like that is
>> to get a more consistent snapshot of all the buffers, by keeping the time
>> spent scanning the buffers as short as possible. But we're not getting a
>> consistent view anyway, it's just a matter of degree.
>
> Seems like a reasonably large difference in degree whether you have a snapshot
> collected in one loop, or you do things like spilling a tuplestore to disk in
> between.
Looking at the original discussion when pg_buffercache was introduced
[1], the first patch version didn't have the array, but it was added in
v2 to avoid holding BufMappingLock for long. But we gave up on getting a
consistent snapshot and stopped holding the lock in commit 6e654546fb.
To summarize my current thinking, I think this is fine as committed. I'm
not worried about the more "stretched out" snapshot that you get. And it
would be nice if we had a better, faster ValuePerCall mode that also
worked with FunctionScans, but we don't.
[1] https://www.postgresql.org/message-id/42297D6E.3000505%40coretech.co.nz
- Heikki