Re: pg_buffercache: Add per-relation summary stats - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: pg_buffercache: Add per-relation summary stats
Msg-id f64f741f-36db-4b5b-b7ff-ee86af65e3e6@iki.fi
In response to Re: pg_buffercache: Add per-relation summary stats  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 07/04/2026 16:47, Andres Freund wrote:
> On 2026-04-07 16:07:45 +0300, Heikki Linnakangas wrote:
>> On 28/03/2026 06:18, Ashutosh Bapat wrote:
>>> In parallel, Palak Chaturvedi and I developed a quick patch to
>>> modernise pg_buffercache_pages() and use a tuplestore, so that it
>>> doesn't have to rely on NBuffers staying the same between the start
>>> of the scan (when memory is allocated) and the end of the scan - a
>>> condition that becomes possible with buffer cache resizing. It seems
>>> to improve the timings by about 10-30% on my laptop for a 128MB
>>> buffercache size. Without this patch the time
>>> taken to execute Lukas's query varies between 10-15ms on my laptop.
>>> With this patch it varies between 8-9ms. So the timing is more stable
>>> as a side effect. It's not a 10x improvement that we are looking for
>>> but it looks like a step in the right direction. That improvement
>>> seems to come purely because we avoid creating a heap tuple. I wonder
>>> if there are some places up in the execution tree where full
>>> heap tuples get formed again instead of continuing to use minimal
>>> tuples or places where we perform some extra actions that are not
>>> required.
> 
> I don't think that's the reason for the improvement - tuplestore_putvalues()
> forms a minimal tuple, and the cost to form a minimal tuple and a heap tuple
> aren't meaningfully different.

Yeah, I wasn't fully convinced of that part either, which is why I left 
it out of the commit message. I mostly wanted to get rid of the 
double-buffering where we first accumulated all the data in an array.

> I think the problem is that we materialize rowmode SRFs as a tuplestore if
> they are in the from list.  You can easily see this even with just
> generate_series():
> 
> postgres[1520825][1]=# SELECT count(*) FROM generate_series(1, 1000000);
> ┌─────────┐
> │  count  │
> ├─────────┤
> │ 1000000 │
> └─────────┘
> (1 row)
> 
> Time: 117.939 ms
> postgres[1520825][1]=# SELECT count(*) FROM (SELECT generate_series(1, 1000000));
> ┌─────────┐
> │  count  │
> ├─────────┤
> │ 1000000 │
> └─────────┘
> (1 row)
> 
> Time: 58.914 ms

Oh, to be honest I didn't remember that we *don't* materialize when it's 
in the target list.
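
(For the record, the planner makes the difference visible: the 
FROM-clause form goes through a Function Scan node, which collects a 
rowmode SRF's output into a tuplestore, while the target-list form runs 
through a ProjectSet node value-per-call. Abbreviated plans for 
illustration - exact plan text varies by version:

EXPLAIN SELECT count(*) FROM generate_series(1, 1000000);
 Aggregate
   ->  Function Scan on generate_series

EXPLAIN SELECT count(*) FROM (SELECT generate_series(1, 1000000));
 Aggregate
   ->  ProjectSet
         ->  Result
)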

> Of course, because pg_buffercache_pages() is archaically defined without
> defining its output columns, you can't actually use it in the select list.
> 
> But that can be fixed:
> 
> CREATE FUNCTION pg_buffercache_pages_fast(OUT bufferid integer, OUT relfilenode oid, OUT reltablespace oid,
>           OUT reldatabase oid, OUT relforknumber int2, OUT relblocknumber int8, OUT isdirty bool,
>           OUT usagecount int2, OUT pinning_backends int4)
> RETURNS SETOF RECORD
> AS '$libdir/pg_buffercache', 'pg_buffercache_pages'
> LANGUAGE C PARALLEL SAFE;
> 
> 60GB of s_b, mostly filled, with 257c8231bf97a77378f6fedb826b1243f0a41612
> reverted.
> 
> SELECT count(*) FROM (SELECT pg_buffercache_pages_fast());
> Time: 1518.704 ms (00:01.519)
> 
> SELECT count(*) FROM pg_buffercache_pages_fast();
> Time: 2008.101 ms (00:02.008)

Hmm, we could easily go back to ValuePerCall mode, while still getting 
rid of the temporary array and keeping the other modernizations. But 
putting "pg_buffercache_pages_fast()" in the target list doesn't 
actually produce the result we want:

postgres=# SELECT pg_buffercache_pages_fast() limit 1;
  pg_buffercache_pages_fast
---------------------------
  (1,1262,1664,0,0,0,f,5,0)
(1 row)

Can we turn that into the right shape for the pg_buffercache view? This 
works:

postgres=# SELECT (pg_buffercache_pages_fast()).* limit 1;
 bufferid | relfilenode | reltablespace | reldatabase | relforknumber | relblocknumber | isdirty | usagecount | pinning_backends
----------+-------------+---------------+-------------+---------------+----------------+---------+------------+------------------
        1 |        1262 |          1664 |           0 |             0 |              0 | f       |          5 |                0
(1 row)

But that doesn't seem to be faster anymore, despite avoiding the 
tuplestore, because of overheads elsewhere in the executor.
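
(If we did keep a fast record-returning wrapper, one way to give the 
view the right column names would be the target-list form with composite 
expansion - a sketch, assuming the hypothetical 
pg_buffercache_pages_fast() wrapper above:

CREATE VIEW pg_buffercache AS
  SELECT (pg_buffercache_pages_fast()).*;

Though as noted, the composite-expansion overhead seems to eat the 
tuplestore savings anyway.)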

>>> I didn't dig into the history to find out why we didn't modernize
>>> pg_buffercache_pages(). I don't see any hazard though.
>>
>> Committed this modernization patch, thanks!
>>
>> It would be nice to have a proper row-at-a-time mode that would avoid
>> materializing the result, but collecting all the data in a temporary array
>> is clearly worse than just putting them to the tuplestore directly. The only
>> reason I can think of why we'd prefer to use a temporary array like that is
>> to get a more consistent snapshot of all the buffers, by keeping the time
>> spent scanning the buffers as short as possible. But we're not getting a
>> consistent view anyway, it's just a matter of degree.
> 
> Seems like a reasonably large difference in degree whether you have a snapshot
> collected in one loop, or you do things like spilling a tuplestore to disk in
> between.

Looking at the original discussion when pg_buffercache was introduced 
[1], the first patch version didn't have the array, but it was added in 
v2 to avoid holding BufMappingLock for long. But we gave up on getting a 
consistent snapshot and stopped holding the lock in commit 6e654546fb.

To summarize my current thinking, I think this is fine as committed. I'm 
not worried about the more "stretched out" snapshot that you get. And it 
would be nice if we had a better, faster ValuePerCall mode that also 
worked with FunctionScans, but we don't.

[1] https://www.postgresql.org/message-id/42297D6E.3000505%40coretech.co.nz

- Heikki



