Thread: less expensive pg_buffercache on big shmem

less expensive pg_buffercache on big shmem

From
Ivan Kartyshov
Date:
Hi hackers,

Recently I have finished my work on a patch for the pg_buffercache contrib
module, and I think it's time to share my results.


Introduction
============

I want to offer you an implementation that decreases the system workload
by partially sacrificing data consistency (full snapshot consistency).
Sometimes we do not need full data consistency, for example for
quantitative rather than qualitative analysis of memory contents, or
when we want to detect insufficient memory resources or measure how often
a relation is used.


Problem description
===================

Currently, pg_buffercache v1.1 and prior takes an exclusive lock on
all shared buffers, which greatly affects system performance.
Usually we use pg_buffercache to find out why the DB is working slower than
expected or to examine what occupies the entire memory. So when we run
pg_buffercache on such a system, we make it even slower.


Implementation
==============

The vanilla implementation contains a loop which collects statistics from
the whole of shared memory, acquiring, reading and releasing spinlocks one
by one, page by page, while holding the LWLocks.

The v1.2 implementation contains a flexible loop which can collect shared
memory statistics using three different methods:
1) holding an LWLock on only one partition of shared memory at a time
(semiconsistent method),
2) without LWLocks (nonconsistent method),
3) or in the vanilla way (consistent method).

This allows pg_buffercache to be launched in three different ways.
Each of them has some advantages and some disadvantages:

Consistent:
+ 100% consistency of the shared memory snapshot
- Slows down the system with a whole-shared-memory exclusive lock

Semiconsistent:
+ Faster than the consistent method
+ Mostly doesn't affect the system load
- Speed of taking the snapshot is low

Nonconsistent:
+ The fastest
+ Doesn't noticeably affect the system
- <3% loss of snapshot consistency
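
To make the three modes concrete, the collecting loop in the patch looks
roughly like this (a simplified sketch only: snaptype, CONS_SNAP and
SEMICONS_SNAP are the names used in the patch, num_partit is presumably 1
except for the semiconsistent mode, and the per-buffer bookkeeping is
abbreviated):

    for (j = 0; j < num_partit; j++)
    {
        if (snaptype == CONS_SNAP)
        {
            /* consistent: take every buffer mapping partition lock at once */
            for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
                LWLockAcquire(BufMappingPartitionLockByIndex(i), LW_SHARED);
        }
        else if (snaptype == SEMICONS_SNAP)
        {
            /* semiconsistent: hold only the j-th partition lock for this pass */
            LWLockAcquire(BufMappingPartitionLockByIndex(j), LW_SHARED);
        }
        /* nonconsistent: no LWLocks at all, only per-buffer header locks */

        for (i = 0; i < NBuffers; i++)
        {
            BufferDesc *bufHdr = GetBufferDescriptor(i);
            uint32      buf_state;

            /* every mode locks each buffer header while copying its data */
            buf_state = LockBufHdr(bufHdr);
            /* ... copy tag, usage count, pin count into the result set ... */
            UnlockBufHdr(bufHdr, buf_state);
        }

        /* ... release whatever LWLocks this mode acquired, in reverse order ... */
    }

The only difference between the modes is which LWLocks are held around the
scan; the per-buffer header spinlocks are taken in the same way in all of
them.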

What works
==========

Actually, it works well even under heavy load, but of course there might be
things I've overlooked.
VIEW pg_buffercache_cons
VIEW pg_buffercache_noncons
VIEW pg_buffercache_semicons

Examples from the docs, adapted to the new views:
SELECT c.relname, count(*) AS buffers FROM pg_buffercache_noncons b
INNER JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND
b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname =
current_database())) GROUP BY c.relname ORDER BY 2 DESC LIMIT 10;

SELECT c.relname, count(*) AS buffers FROM pg_buffercache_semicons b
INNER JOIN pg_class c ON b.relfilenode = pg_relation_filenode(c.oid) AND
b.reldatabase IN (0, (SELECT oid FROM pg_database WHERE datname =
current_database())) GROUP BY c.relname ORDER BY 2 DESC LIMIT 10;


Testing the implementation
==========================

How the implementation was tested:
1) Start the server
2) Run pgbench
    pgbench -c 250 -s 1000  -T 200 -P1
3) Compare how TPS sags under load while running:
    SELECT count(*) FROM pg_buffercache_cons;
    SELECT count(*) FROM pg_buffercache_semicons;
    SELECT count(*) FROM pg_buffercache_noncons;

This test was made on a server with the following parameters:
Model name:            Intel(R) Xeon(R) CPU E7-8890 v3 @ 2.50GHz
CPU(s):                144
Socket(s):             4
Shared_buffers:        200GB


Results of testing
==================

Our DBA team obtained the following results:
Nonconsistent:
    * 10% faster than the consistent method
    * doesn't noticeably affect the system
    * the maximum loss of accuracy was less than 3%* (in most situations this
is an acceptable accuracy loss)

Semiconsistent:
    * 5 times slower than nonconsistent
    * affects the system less than the consistent method

Overall results:
Our clients were pleased with this implementation.
The implementation is backward compatible, so old pg_buffercache v1.1
queries will keep working.
Semiconsistent shows results approaching nonconsistent on SELECT-only
queries.

* these values were obtained from our DBA tests.

What can be better
==================

It is unclear how to optimize the semiconsistent method to make it
faster, and to reduce the temporary effects that appear from time to time.


I will be glad to see your feedback!

---
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: less expensive pg_buffercache on big shmem

From
Robert Haas
Date:
On Wed, Aug 31, 2016 at 8:27 PM, Ivan Kartyshov
<i.kartyshov@postgrespro.ru> wrote:
> Recently I have finished my work on a patch for the pg_buffercache contrib
> module, and I think it's time to share my results.

Thanks for sharing your results.

> The v1.2 implementation contains a flexible loop which can collect shared
> memory statistics using three different methods:
> 1) holding an LWLock on only one partition of shared memory at a time
> (semiconsistent method),
> 2) without LWLocks (nonconsistent method),
> 3) or in the vanilla way (consistent method).
>
> This allows pg_buffercache to be launched in three different ways.

I am a little skeptical about the idea of offering the user three
different choices about how to do this.  That seems like it is
exposing what ought to be an internal implementation detail and I'm
not sure users will really know how to make an intelligent choice
between the three options.  But of course others may have a different
view on that.

I wonder whether we ought to just switch from the consistent method to
the semiconsistent method and call it good.  I agree with you that
taking every buffer partition lock simultaneously seems like too much
locking.  And in the future if we replace the buffer mapping hash with
something that is lock-free or close to it, then we wouldn't even have
buffer partition locks any more, and probably no way at all to get an
entirely consistent snapshot.

What do you think of this?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: less expensive pg_buffercache on big shmem

From
andres@anarazel.de (Andres Freund)
Date:
On 2016-09-02 08:31:42 +0530, Robert Haas wrote:
> I wonder whether we ought to just switch from the consistent method to
> the semiconsistent method and call it good.

+1. I think, before long, we're going to have to switch away from having
locks & partitions in the first place. So I don't see a problem relaxing
this. It's not like that consistency really buys you anything...  I'd
even consider not using any locks.

Andres



Re: less expensive pg_buffercache on big shmem

From
Peter Geoghegan
Date:
On Thu, Sep 1, 2016 at 8:19 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-09-02 08:31:42 +0530, Robert Haas wrote:
>> I wonder whether we ought to just switch from the consistent method to
>> the semiconsistent method and call it good.
>
> +1. I think, before long, we're going to have to switch away from having
> locks & partitions in the first place. So I don't see a problem relaxing
> this. It's not like that consistency really buys you anything...  I'd
> even consider not using any locks.

Right. ISTM that the consistency guarantee was added on the off chance
that it mattered, without actually being justified. I would like to be
able to run pg_buffercache in production from time to time.


-- 
Peter Geoghegan



Re: less expensive pg_buffercache on big shmem

From
Robert Haas
Date:
On Fri, Sep 2, 2016 at 8:49 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-09-02 08:31:42 +0530, Robert Haas wrote:
>> I wonder whether we ought to just switch from the consistent method to
>> the semiconsistent method and call it good.
>
> +1. I think, before long, we're going to have to switch away from having
> locks & partitions in the first place. So I don't see a problem relaxing
> this. It's not like that consistency really buys you anything...  I'd
> even consider not using any locks.

I think we certainly want to lock the buffer header, because otherwise
we might get a torn read of the buffer tag, which doesn't seem good.
But it's not obvious to me that there's any point in taking the lock
on the buffer mapping partition; I'm thinking that doesn't really do
anything unless we lock them all, and we all seem to agree that's
going too far.
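
To spell that out, I'm imagining roughly this per-buffer pattern (just an
untested sketch, reusing the field names of the existing extension code):

    /* Inspect one buffer without any buffer mapping partition lock held. */
    BufferDesc *bufHdr = GetBufferDescriptor(i);
    uint32      buf_state;

    /* The header lock is what prevents a torn read of the buffer tag. */
    buf_state = LockBufHdr(bufHdr);
    fctx->record[i].relfilenode = bufHdr->tag.rnode.relNode;
    fctx->record[i].forknum = bufHdr->tag.forkNum;
    fctx->record[i].blocknum = bufHdr->tag.blockNum;
    fctx->record[i].usagecount = BUF_STATE_GET_USAGECOUNT(buf_state);
    fctx->record[i].isvalid = (buf_state & BM_VALID) != 0;
    UnlockBufHdr(bufHdr, buf_state);

That's essentially what the existing code already does inside the
partition-locked loop, just without the surrounding partition locks.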

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: less expensive pg_buffercache on big shmem

From
Ivan Kartyshov
Date:
On 09/02/2016 06:01 AM, Robert Haas wrote:
> I wonder whether we ought to just switch from the consistent method to
> the semiconsistent method and call it good.  I agree with you that
> taking every buffer partition lock simultaneously seems like too much
> locking.  And in the future if we replace the buffer mapping hash with
> something that is lock-free or close to it, then we wouldn't even have
> buffer partition locks any more, and probably no way at all to get an
> entirely consistent snapshot.
> What do you think of this?

I fully agree with you that it would be preferable in the future to 
replace the buffer mapping hash with some lock-free algorithm.

In the question of replacing the consistent method I agree with you, 
Andres Freund and Peter Geoghegan: the consistent method does not bring 
any considerable benefits.

You might be right regarding the three different modes, but our DBAs 
asked if we could preserve a legacy mode too, thus the choice.

On 09/02/2016 06:19 AM, Andres Freund wrote:
> +1. I think, before long, we're going to have to switch away from having
> locks & partitions in the first place. So I don't see a problem relaxing
> this. It's not like that consistency really buys you anything...  I'd
> even consider not using any locks.

If we replace the consistent method, then we should replace it with the 
partially consistent method (called "nonconsistent") because:
1) it's based on fast spinlocks (it's not fully up to its name, though)
2) it's *almost* the fastest one (the less time needed for execution of 
method, the less data core will change and as a consequence the more 
consistent snapshot will be)
3) and it has less influence on the entire system during query processing.

On 09/02/2016 06:30 AM, Peter Geoghegan wrote:
> I would like to be able to run pg_buffercache in production from time
> to time.

Yes, in our experience the usage of fully consistent pg_buffercache in
production is quite a courageous decision.


---
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: less expensive pg_buffercache on big shmem

From
Tomas Vondra
Date:
Hi Ivan,

This patch needs a rebase, as 06d7fd6e bumped the version to 1.2.

On 09/02/2016 01:38 PM, Ivan Kartyshov wrote:
> On 09/02/2016 06:01 AM, Robert Haas wrote:
>> I wonder whether we ought to just switch from the consistent method to
>> the semiconsistent method and call it good.  I agree with you that
>> taking every buffer partition lock simultaneously seems like too much
>> locking.  And in the future if we replace the buffer mapping hash with
>> something that is lock-free or close to it, then we wouldn't even have
>> buffer partition locks any more, and probably no way at all to get an
>> entirely consistent snapshot.
>> What do you think of this?
>
> I fully agree with you that it would be preferable in the future to
> replace the buffer mapping hash with some lock-free algorithm.
>
> In the question of replacing the consistent method I agree with you,
> Andres Freund and Peter Geoghegan: the consistent method does not bring
> any considerable benefits.
>
> You might be right regarding the three different modes, but our DBAs
> asked if we could preserve a legacy mode too, thus the choice.
>
> On 09/02/2016 06:19 AM, Andres Freund wrote:
>  > +1. I think, before long, we're going to have to switch away from having
>  > locks & partitions in the first place. So I don't see a problem relaxing
>  > this. It's not like that consistency really buys you anything...  I'd
>  > even consider not using any locks.
>
> If we replace the consistent method, then we should replace it with the
> partially consistent method (called "nonconsistent") because:
> 1) it's based on fast spinlocks (it's not fully up to its name, though)

In other words, it does exactly the thing proposed up-thread, i.e. 
locking only buffer headers.

What do you mean by fast spinlocks? And why aren't they up to the name?

> 2) it's *almost* the fastest one (the less time needed for execution of
> method, the less data core will change and as a consequence the more
> consistent snapshot will be)

I'm not sure I understand what you're saying here? What do you mean by 
"almost the fastest one"?

I'm a bit confused by the results you reported before, i.e. that

1) nonconsistent is 10% faster than consistent method
2) semiconsistent is 5-times slower than nonconsistent method

What does that mean? Are you referring to duration of the queries reading 
data from pg_buffercache, or to impact on the other queries?

How can semiconsistent be 5x slower than nonconsistent? Wouldn't that 
make it significantly slower than the consistent method?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: less expensive pg_buffercache on big shmem

From
Ivan Kartyshov
Date:
On 09/03/2016 05:04 AM, Tomas Vondra wrote:
> This patch needs a rebase, as 06d7fd6e bumped the version to 1.2.

Thank you for a valuable hint.
> > If we replace the consistent method, then we should replace it with the
> > partially consistent method (called "nonconsistent") because:
> > 1) it's based on fast spinlocks (it's not fully up to its name, though)
>
> In other words, it does exactly the thing proposed up-thread, i.e.
> locking only buffer headers.
>
> What do you mean by fast spinlocks? And why aren't they up to the name?

Not the spinlocks; the name “nonconsistent” is somewhat misleading. At
the moment we can't implement concurrent shared memory access without
locks in general, so the most inconsistent method that has been proposed
is the "nonconsistent" one. Roughly speaking, *nonconsistent* does not
fully live up to its name, because it still contains a locking mechanism
(spinlocks).
> > 2) it's *almost* the fastest one (the less time needed for execution of
> > method, the less data core will change and as a consequence the more
> > consistent snapshot will be)
>
> I'm not sure I understand what you're saying here? What do you mean by
> "almost the fastest one"?

I mean the fastest one that has been proposed ("consistent"| 
"semiconsistent"| "nonconsistent").
> I'm a bit confused by the results you reported before, i.e. that
>
> 1) nonconsistent is 10% faster than consistent method
> 2) semiconsistent is 5-times slower than nonconsistent method
>
> What does that mean? Are you referring to duration of the queries reading
> data from pg_buffercache, or to impact on the other queries?

Here I mean "working duration time".
> How can semiconsistent be 5x slower than nonconsistent? Wouldn't that
> make it significantly slower than the consistent method?

Firstly, when we want to measure the quality of pg_buffercache, we must 
measure several values:
1) Execution time (duration of the queries reading data from pg_buffercache)
2) How it influences the system (and other queries) during its work

Secondly, the semiconsistent is slower than the nonconsistent and consistent 
methods, but it has less influence on other queries than consistent.

Thirdly, it is slower because it locks the partitions of shared memory 
in a different way than the consistent or nonconsistent methods.
The semi-consistent strategy implies that only one buffer is locked at a 
time. Therefore it has no significant effect on the system, but it takes 
more time.


---
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: less expensive pg_buffercache on big shmem

From
Mark Kirkwood
Date:
On 02/09/16 15:19, Andres Freund wrote:

> On 2016-09-02 08:31:42 +0530, Robert Haas wrote:
>> I wonder whether we ought to just switch from the consistent method to
>> the semiconsistent method and call it good.
> +1. I think, before long, we're going to have to switch away from having
> locks & partitions in the first place. So I don't see a problem relaxing
> this. It's not like that consistency really buys you anything...  I'd
> even consider not using any locks.
>

+1 as well. When I wrote the original module I copied the design of the 
pg_locks view - as it was safe and consistent. Now it is clear that the 
ability to look at (semi-consistent) contents of the buffer cache is 
more important than having a theoretically correct point in time 
snapshot. Go for it :-)

regards

Mark




Re: less expensive pg_buffercache on big shmem

From
Tomas Vondra
Date:
On 09/05/2016 06:19 PM, Ivan Kartyshov wrote:
> On 09/03/2016 05:04 AM, Tomas Vondra wrote:
>> This patch needs a rebase, as 06d7fd6e bumped the version to 1.2.
> Thank you for a valuable hint.
>

So, will we get a rebased patch? I see the patch is back in 'needs 
review' but there's no new version.

>> > If we replace the consistent method, then we should replace it with the
>> > partially consistent method (called "nonconsistent") because:
>> > 1) it's based on fast spinlocks (it's not fully up to its name, though)
>>
>> In other words, it does exactly the thing proposed up-thread, i.e.
>> locking only buffer headers.
>>
>> What do you mean by fast spinlocks? And why aren't they up to the name?
>
> Not the spinlocks; the name “nonconsistent” is somewhat misleading. At
> the moment we can't implement concurrent shared memory access without
> locks in general, so the most inconsistent method that has been proposed
> is the "nonconsistent" one. Roughly speaking, *nonconsistent* does not
> fully live up to its name, because it still contains a locking mechanism
> (spinlocks).
>

I'm a bit confused by the naming, but my understanding is that 
nonconsistent only acquires (spin)locks on individual buffers, so it 
does not have a consistent view on shared_buffers as a whole, but data 
about individual buffers are consistent. I think that's fine and 
perfectly acceptable for this extension.

>> > 2) it's *almost* the fastest one (the less time needed for execution of
>> > method, the less data core will change and as a consequence the more
>> > consistent snapshot will be)
>>
>> I'm not sure I understand what you're saying here? What do you mean by
>> "almost the fastest one"?
>
> I mean the fastest one that has been proposed ("consistent"|
> "semiconsistent"| "nonconsistent").
>
>> I'm a bit confused by the results you reported before, i.e. that
>>
>> 1) nonconsistent is 10% faster than consistent method
>> 2) semiconsistent is 5-times slower than nonconsistent method
>>
>> What does that mean? Are you referring to duration of the queries reading
>> data from pg_buffercache, or to impact on the other queries?
>
> Here I mean "working duration time".
>

I have no idea what "working duration time" is.

>> How can semiconsistent be 5x slower than nonconsistent? Wouldn't that
>> make it significantly slower than the consistent method?
>
> Firstly, when we want to measure the quality of pg_buffercache, we must
> measure several values:
> 1) Execution time (duration of the queries reading data from
> pg_buffercache)
> 2) How it influences the system (and other queries) during its work
>
> Secondly, the semiconsistent is slower than the nonconsistent and consistent
> methods, but it has less influence on other queries than consistent.
>
> Thirdly, it is slower because it locks the partitions of shared memory
> in a different way than the consistent or nonconsistent methods.
> The semi-consistent strategy implies that only one buffer is locked at a
> time. Therefore it has no significant effect on the system, but it takes
> more time.
>

It'd be really useful if you could provide actual numbers, explain what 
metrics you compare and how. I'm sure it makes sense to you but it's 
utterly confusing for everyone else, and it also makes it impossible to 
reproduce the benchmark.

I've looked at the code (applied on top of e3b607c) and I see this in 
pg_buffercache_pages:
    for (j = 0; j < num_partit; j++)
    {
        if (snaptype == CONS_SNAP)
            for (i = 0; i < NUM_BUFFER_PARTITIONS; i++)
                LWLockAcquire(BufMappingPartitionLockByIndex(i), LW_SHARED);
        else if (snaptype == SEMICONS_SNAP)
            LWLockAcquire(BufMappingPartitionLockByIndex(j), LW_SHARED);

        for (i = 0; i < NBuffers; i++)
        {
            ... walk through all shared buffers ...
        }
    }

How is the SEMICONS_SNAP case supposed to only scan the currently locked 
partition, when it simply scans all shared buffers? That seems outright 
broken; it should scan only the currently locked partition.
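
Presumably the intent was something more like the following inner loop
(only a sketch of what I'd expect, reusing the existing BufTableHashCode /
BufTableHashPartition helpers; I haven't tried to compile it):

        for (i = 0; i < NBuffers; i++)
        {
            BufferDesc *bufHdr = GetBufferDescriptor(i);
            uint32      buf_state;
            uint32      hashcode;

            /* lock the header so we read a consistent buffer tag */
            buf_state = LockBufHdr(bufHdr);
            hashcode = BufTableHashCode(&bufHdr->tag);

            /* in the SEMICONS_SNAP case, report only the partition we hold */
            if (snaptype == SEMICONS_SNAP &&
                BufTableHashPartition(hashcode) != j)
            {
                UnlockBufHdr(bufHdr, buf_state);
                continue;
            }

            /* ... fill in fctx->record as before ... */

            UnlockBufHdr(bufHdr, buf_state);
        }

Even with such a filter the function still walks all NBuffers descriptors
once per partition, which may be part of why the semiconsistent mode is
reported as so much slower than the other two.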

Secondly, I see this bit added to the loop over buffers:
    if (bufHdr->tag.forkNum == -1)
    {
        fctx->record[i].blocknum = InvalidBlockNumber;
        continue;
    }

and I have no idea why this is needed (when it was not needed before).

The patch also breaks a few comments, because it adds code between the 
comment and the related code. For example the (likely unnecessary) bit 
of code changes this:
    /* Lock each buffer header before inspecting. */
    buf_state = LockBufHdr(bufHdr);

into this
    /* Lock each buffer header before inspecting. */
    if (bufHdr->tag.forkNum == -1)
    {
        fctx->record[i].blocknum = InvalidBlockNumber;
        continue;
    }
    buf_state = LockBufHdr(bufHdr);

which is confusing (and I'd argue broken).

Similarly when releasing the locks, the original comment/code block 
looks like this:
  /*
   * And release locks.  We do this in reverse order for two reasons:
   * (1) Anyone else who needs more than one of the locks will be trying
   * to lock them in increasing order; we don't want to release the
   * other process until it can get all the locks it needs. (2) This
   * avoids O(N^2) behavior inside LWLockRelease.
   */
  for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)
      LWLockRelease(BufMappingPartitionLockByIndex(i));

but the patch changes it to this:
  if (snaptype == CONS_SNAP)
    for (i = NUM_BUFFER_PARTITIONS; --i >= 0;)
      LWLockRelease(BufMappingPartitionLockByIndex(i));
  else if (snaptype == SEMICONS_SNAP)
    /*
     * And release locks.  We do this in reverse order for two reasons:
     * (1) Anyone else who needs more than one of the locks will be
     * trying to lock them in increasing order; we don't want to
     * release the other process until it can get all the locks it
     * needs. (2) This avoids O(N^2) behavior inside LWLockRelease.
     */
    LWLockRelease(BufMappingPartitionLockByIndex(j));

Notice how the comment is moved from the loop to the block that only 
unlocks a single partition. That clearly makes no sense, as the comment 
is about the CONS_SNAP loop.

Now that I read the comment, I wonder whether the semiconsistent case 
(locking one partition at a time) changes this reverse-order unlocking 
behavior or not - clearly, the guarantee that the other process will get 
all the locks at once is no longer true, as pg_buffercache will likely 
hold the next one.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: less expensive pg_buffercache on big shmem

From
Tomas Vondra
Date:
On 09/02/2016 11:01 AM, Robert Haas wrote:
> On Fri, Sep 2, 2016 at 8:49 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-09-02 08:31:42 +0530, Robert Haas wrote:
>>> I wonder whether we ought to just switch from the consistent method to
>>> the semiconsistent method and call it good.
>>
>> +1. I think, before long, we're going to have to switch away from having
>> locks & partitions in the first place. So I don't see a problem relaxing
>> this. It's not like that consistency really buys you anything...  I'd
>> even consider not using any locks.
>
> I think we certainly want to lock the buffer header, because otherwise
> we might get a torn read of the buffer tag, which doesn't seem good.
> But it's not obvious to me that there's any point in taking the lock
> on the buffer mapping partition; I'm thinking that doesn't really do
> anything unless we lock them all, and we all seem to agree that's
> going too far.

+1 from me to only locking the buffer headers. IMHO that's perfectly 
fine for the purpose of this extension.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: less expensive pg_buffercache on big shmem

From
Robert Haas
Date:
On Tue, Sep 20, 2016 at 7:43 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> On 09/02/2016 11:01 AM, Robert Haas wrote:
>>
>> On Fri, Sep 2, 2016 at 8:49 AM, Andres Freund <andres@anarazel.de> wrote:
>>>
>>> On 2016-09-02 08:31:42 +0530, Robert Haas wrote:
>>>>
>>>> I wonder whether we ought to just switch from the consistent method to
>>>> the semiconsistent method and call it good.
>>>
>>>
>>> +1. I think, before long, we're going to have to switch away from having
>>> locks & partitions in the first place. So I don't see a problem relaxing
>>> this. It's not like that consistency really buys you anything...  I'd
>>> even consider not using any locks.
>>
>> I think we certainly want to lock the buffer header, because otherwise
>> we might get a torn read of the buffer tag, which doesn't seem good.
>> But it's not obvious to me that there's any point in taking the lock
>> on the buffer mapping partition; I'm thinking that doesn't really do
>> anything unless we lock them all, and we all seem to agree that's
>> going too far.
>
> +1 from me to only locking the buffer headers. IMHO that's perfectly fine
> for the purpose of this extension.

So, I think we have agreement on the way forward here, but what we
don't have is a committable patch.  I'm willing to commit one before
the end of this CommitFest if somebody produces one RSN; otherwise,
this is going to have to go into the "Returned with Feedback" bucket.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: less expensive pg_buffercache on big shmem

From
Ivan Kartyshov
Date:
Hello everyone, the patch has been rebased.

Thank you, Tomas, for reviewing this patch and for your valuable
comments.

From the very beginning we had a misunderstanding about the naming of the
methods.

 > It'd be really useful if you could provide actual numbers, explain what
 > metrics you compare and how. I'm sure it makes sense to you but it's
 > utterly confusing for everyone else, and it also makes it impossible to
 > reproduce the benchmark.

I tested it as I wrote earlier: I ran it several times, collecting TPS in
each series, and calculated the average value.

 > Secondly, I see this bit added to the loop over buffers:
 >
 >     if (bufHdr->tag.forkNum == -1)
 >     {
 >         fctx->record[i].blocknum = InvalidBlockNumber;
 >         continue;
 >     }
 >
 > and I have no idea why this is needed (when it was not needed before).

This helps to skip unused buffer pages. It is valuable on a big shared
memory that is not warmed up.

On 09/02/2016 12:01 PM, Robert Haas wrote:
> I think we certainly want to lock the buffer header, because otherwise
> we might get a torn read of the buffer tag, which doesn't seem good.
> But it's not obvious to me that there's any point in taking the lock
> on the buffer mapping partition; I'm thinking that doesn't really do
> anything unless we lock them all, and we all seem to agree that's
> going too far.

I replaced the consistent method with the semiconsistent one (which locks
buffer headers without the partition lock), and made some additional fixes
thanks to the reviewers.


--
Ivan Kartyshov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachment

Re: less expensive pg_buffercache on big shmem

From
Heikki Linnakangas
Date:
On 09/29/2016 02:45 AM, Ivan Kartyshov wrote:
>  > Secondly, I see this bit added to the loop over buffers:
>  >
>  >     if (bufHdr->tag.forkNum == -1)
>  >     {
>  >         fctx->record[i].blocknum = InvalidBlockNumber;
>  >         continue;
>  >     }
>  >
>  > and I have no idea why this is needed (when it was not needed before).
>
> This helps to skip unused buffer pages. It is valuable on a big shared
> memory that is not warmed up.

That seems like an unrelated change. Checking for forkNum, instead of 
e.g. BM_VALID seems a bit surprising, too.
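
If the goal is just to skip unused buffers cheaply, something along these
lines would be less surprising (an untested sketch; it peeks at the atomic
state word without taking the spinlock, which seems acceptable for an
inconsistent view):

    /* Cheap pre-check: skip buffers that don't hold a valid page. */
    if (!(pg_atomic_read_u32(&bufHdr->state) & BM_VALID))
    {
        fctx->record[i].blocknum = InvalidBlockNumber;
        continue;
    }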

> On 09/02/2016 12:01 PM, Robert Haas wrote:
>> I think we certainly want to lock the buffer header, because otherwise
>> we might get a torn read of the buffer tag, which doesn't seem good.
>> But it's not obvious to me that there's any point in taking the lock
>> on the buffer mapping partition; I'm thinking that doesn't really do
>> anything unless we lock them all, and we all seem to agree that's
>> going too far.
>
> I replaced the consistent method with the semiconsistent one (which locks
> buffer headers without the partition lock), and made some additional fixes
> thanks to the reviewers.

This patch also does:

+++ b/contrib/pg_buffercache/pg_buffercache--1.2--1.3.sql
@@ -0,0 +1,6 @@
+/* contrib/pg_buffercache/pg_buffercache--1.2--1.3.sql */
+
+-- complain if script is sourced in psql, rather than via ALTER EXTENSION
+\echo Use "ALTER EXTENSION pg_buffercache(text) UPDATE TO '1.3'" to 
load this file. \quit
+
+ALTER FUNCTION pg_buffercache_pages() PARALLEL SAFE;

Why? That seems unrelated to what's been discussed in this thread.


I have committed the straightforward part of this, removing the 
partition-locking. I think we're done here for this commitfest, but feel 
free to post new patches for that PARALLEL SAFE and the quick-check for 
unused buffers, if you think it's worthwhile.

- Heikki