Thread: [WIP PATCH] for Performance Improvement in Buffer Management

[WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:

This patch is based on the following TODO item:

Consider adding buffers the background writer finds reusable to the free list

 

I have tried implementing it and taken readings for SELECT when all the data is in either the OS cache or shared buffers.

 

The patch has a simple implementation: the bgwriter or checkpoint process moves unused buffers (unpinned buffers with zero usage_count) onto the freelist.
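For readers skimming the thread, here is a minimal sketch of the kind of step being described, written against the 9.2-era buffer-manager internals. MoveBufferToFreeList is the name used later in this thread; the locking is simplified and the hash-table removal the actual patch does is omitted, so treat it as an illustration rather than the patch itself:

    #include "storage/buf_internals.h"

    /*
     * Sketch only: if the buffer the bgwriter just looked at is unpinned,
     * has zero usage_count and is clean, hand it to the shared freelist.
     */
    static void
    MoveBufferToFreeList(volatile BufferDesc *buf)
    {
        LockBufHdr(buf);
        if (buf->refcount == 0 && buf->usage_count == 0 &&
            !(buf->flags & BM_DIRTY))
        {
            UnlockBufHdr(buf);
            StrategyFreeBuffer(buf);    /* link it into the freelist */
            return;
        }
        UnlockBufHdr(buf);
    }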

Results (Results.html, attached to this mail) were taken with the following configuration.

The test scenario is:
    1. Load all the files of all tables and indexes into the OS cache (using pg_prewarm with the 'read' operation).
    2. Load shared buffers with the "pgbench_accounts" table and "pgbench_accounts_pkey" index pages (using pg_prewarm
       with the 'buffers' operation).
    3. Run pgbench with select-only transactions for 20 minutes.
   
Platform details:
    Operating System: Suse-Linux 10.2 x86_64
    Hardware : 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
    RAM : 24GB
   
Server Configuration:
    shared_buffers = 6GB     (1/4 th of RAM size)

Pgbench configuration:
        transaction type: SELECT only
        scaling factor: 1200
        query mode: simple
        number of clients: <varying from 8 to 64 >
        number of threads: <varying from 8 to 64 >
        duration: 1200 s

 

 

Comments or suggestions?

 

I am still collecting performance data for UPDATE and other operations with different database configurations.

 

With Regards,

Amit Kapila.

Attachment

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Jeff Janes
Date:
On Mon, Sep 3, 2012 at 7:15 AM, Amit kapila <amit.kapila@huawei.com> wrote:
> This patch is based on below Todo Item:
>
> Consider adding buffers the background writer finds reusable to the free
> list
>
>
>
> I have tried implementing it and taken the readings for Select when all the
> data is in either OS buffers
>
> or Shared Buffers.
>
>
>
> The Patch has simple implementation for  "bgwriter or checkpoint process
> moving the unused buffers (unpinned with "ZERO" usage_count buffers) into
> "freelist".

I don't think InvalidateBuffer can be safely used in this way.  It
says "We assume
that no other backend could possibly be interested in using the page",
which is not true here.

Also, do we want to actually invalidate the buffers?  If someone does
happen to want one after it is put on the freelist, making it read it
in again into a different buffer doesn't seem like a nice thing to do,
rather than just letting it reclaim it.

Cheers,

Jeff



Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Tuesday, September 04, 2012 12:42 AM Jeff Janes wrote:
On Mon, Sep 3, 2012 at 7:15 AM, Amit kapila <amit.kapila@huawei.com> wrote:
>> This patch is based on below Todo Item:
>
>> Consider adding buffers the background writer finds reusable to the free
>> list
>
>
>
>> I have tried implementing it and taken the readings for Select when all the
>> data is in either OS buffers
>
>> or Shared Buffers.
>
>
>
>> The Patch has simple implementation for  "bgwriter or checkpoint process
>> moving the unused buffers (unpinned with "ZERO" usage_count buffers) into
>> "freelist".

> I don't think InvalidateBuffer can be safely used in this way.  It
> says "We assume
> that no other backend could possibly be interested in using the page",
> which is not true here.

As I understood and analyzed it based on the above, the problem in the attached patch is that in
InvalidateBuffer(), after UnlockBufHdr() and before the partition lock is taken, some backend can start using that buffer and
increase its usage_count to 1, yet InvalidateBuffer() will still remove the buffer from the hash table and put it on the freelist.
I have modified the code to address this by re-checking refcount and usage_count after taking the partition lock
and LockBufHdr, and only then moving the buffer to the freelist, which is similar to what InvalidateBuffer does.
In the actual code we could optimize this by adding an extra parameter to InvalidateBuffer.
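A rough sketch of the ordering I mean, written against the 9.2-era buffer-manager internals (the function name TryMoveToFreeList and the exact flag handling are illustrative, not the patch itself):

    /*
     * Illustration only: re-check the buffer while holding both the mapping
     * partition lock and the buffer header spinlock, and give up if anyone
     * has started using it since the earlier check.
     */
    static bool
    TryMoveToFreeList(volatile BufferDesc *buf)
    {
        BufferTag   tag = buf->tag;
        uint32      hashcode = BufTableHashCode(&tag);
        LWLockId    partitionLock = BufMappingPartitionLock(hashcode);

        LWLockAcquire(partitionLock, LW_EXCLUSIVE);
        LockBufHdr(buf);

        if (buf->refcount != 0 || buf->usage_count != 0 ||
            !BUFFERTAGS_EQUAL(buf->tag, tag))
        {
            /* somebody pinned/used/retagged it: leave it alone */
            UnlockBufHdr(buf);
            LWLockRelease(partitionLock);
            return false;
        }

        /* safe: drop it from the hash table and put it on the freelist */
        BufTableDelete(&tag, hashcode);
        buf->flags &= ~(BM_VALID | BM_TAG_VALID);
        CLEAR_BUFFERTAG(buf->tag);
        UnlockBufHdr(buf);
        LWLockRelease(partitionLock);

        StrategyFreeBuffer(buf);
        return true;
    }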

Please let me know if I understood you correctly, or whether you meant something else by the above comment.

> Also, do we want to actually invalidate the buffers?  If someone does
> happen to want one after it is put on the freelist, making it read it
> in again into a different buffer doesn't seem like a nice thing to do,
> rather than just letting it reclaim it.

But even if the bgwriter/checkpointer doesn't do it, a backend needing a new buffer will do similar things (remove it from the hash table)
for this buffer, since it is the next victim buffer.
The main intention of MoveBufferToFreeList is to avoid contention on the partition locks and BufFreeListLock
among backends, which has given a performance improvement in high-contention scenarios.

One problem I can see with the proposed change is that in some cases the usage count of a buffer allocated
from the freelist will get decremented immediately, since it can also be the next victim buffer.
However, there can be a solution to this problem.

Can you suggest some scenarios where I should do more performance tests?

With Regards,
Amit Kapila.
Attachment

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Tuesday, September 04, 2012 6:55 PM Amit kapila wrote:
On Tuesday, September 04, 2012 12:42 AM Jeff Janes wrote:
On Mon, Sep 3, 2012 at 7:15 AM, Amit kapila <amit.kapila@huawei.com> wrote:
>>> This patch is based on below Todo Item:
>
>>> Consider adding buffers the background writer finds reusable to the free
>>> list
>
>
>
>>> I have tried implementing it and taken the readings for Select when all the
>>> data is in either OS buffers
>
>>> or Shared Buffers.
>
>
>
>>> The Patch has simple implementation for  "bgwriter or checkpoint process
>>> moving the unused buffers (unpinned with "ZERO" usage_count buffers) into
>>> "freelist".

>> I don't think InvalidateBuffer can be safely used in this way.  It
>> says "We assume
>> that no other backend could possibly be interested in using the page",
>> which is not true here.

> As I understood and anlyzed based on above, that there is problem in attached patch such that in function
> InvalidateBuffer(), after UnlockBufHdr() and before PartitionLock if some backend uses that buffer and
> increase the usage count to 1, still
> InvalidateBuffer() will remove the buffer from hash table and put it in Freelist.
> I have modified the code to address above by checking refcount & usage_count  inside Partition Lock
> , LockBufHdr and only after that move it to freelist which is similar to InvalidateBuffer.
> In actual code we can optimize the current code by using extra parameter in InvalidateBuffer.

> Please let me know if I understood you correctly or you want to say something else by above comment?

The results for the updated code are attached with this mail.
The scenario is the same as in the original mail.
    1. Load all the files in to OS buffers (using pg_prewarm with 'read' operation) of all tables and indexes.
    2. Try to load all buffers with "pgbench_accounts" table and "pgbench_accounts_pkey" pages (using pg_prewarm with
       'buffers' operation).
    3. Run the pgbench with select only for 20 minutes.

Platform details:
    Operating System: Suse-Linux 10.2 x86_64
    Hardware : 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
    RAM : 24GB

Server Configuration:
    shared_buffers = 5GB     (1/4 th of RAM size)
    Total data size = 16GB
Pgbench configuration:
        transaction type: SELECT only
        scaling factor: 1200
        query mode: simple
        number of clients: <varying from 8 to 64 >
        number of threads: <varying from 8 to 64 >
        duration: 1200 s

I shall take further readings for the following configurations and post them:
1. The intention of the configuration below is that, with the defined test case, there will be some cases where
I/O can happen. So I wanted to check the impact of that.

Shared_buffers - 7 GB
number of clients: <varying from 8 to 64 >
 number of threads: <varying from 8 to 64 >
transaction type: SELECT only


2. The intention of the configuration below is that, with the defined test case, the memory kept for shared buffers
is less than recommended. So I wanted to check the impact of that.
Shared_buffers - 2 GB
number of clients: <varying from 8 to 64 >
number of threads: <varying from 8 to 64 >
transaction type: SELECT only


3. The intention of the configuration below is that, with the defined test case, it will test a mix of DML
operations where there will be I/O due to those operations. So I wanted to check the impact of that.
Shared_buffers - 5GB
number of clients: <varying from 8 to 64 >
number of threads: <varying from 8 to 64 >
transaction type: tpc_b

> One problem I could see with proposed change is that in some cases the usage count will get decrement for a buffer
> allocated from free list immediately as it can be nextvictimbuffer.
> However there can be solution to this problem.


With Regards,
Amit Kapila.
Attachment

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Thursday, September 06, 2012 2:38 PM Amit kapila wrote:
On Tuesday, September 04, 2012 6:55 PM Amit kapila wrote:
On Tuesday, September 04, 2012 12:42 AM Jeff Janes wrote:
On Mon, Sep 3, 2012 at 7:15 AM, Amit kapila <amit.kapila@huawei.com> wrote:
>>> This patch is based on below Todo Item:
>
>>> Consider adding buffers the background writer finds reusable to the free
>>> list

> The results for the updated code is attached with this mail.
> The scenario is same as in original mail.
>    1. Load all the files in to OS buffers (using pg_prewarm with 'read' operation) of all tables and indexes.
>    2. Try to load all buffers with "pgbench_accounts" table and "pgbench_accounts_pkey" pages (using pg_prewarm with
> 'buffers' operation).
>    3. Run the pgbench with select only for 20 minutes.

> Platform details:
>    Operating System: Suse-Linux 10.2 x86_64
>    Hardware : 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
>    RAM : 24GB

> Server Configuration:
>    shared_buffers = 5GB     (1/4 th of RAM size)
>    Total data size = 16GB
> Pgbench configuration:
>        transaction type: SELECT only
>        scaling factor: 1200
>        query mode: simple
>        number of clients: <varying from 8 to 64 >
>        number of threads: <varying from 8 to 64 >
>        duration: 1200 s

> I shall take further readings for following configurations and post the same:
> 1. The intention for taking with below configuration is that, with the defined testcase, there will be some cases
> where I/O can happen. So I wanted to check the impact of it.

> Shared_buffers - 7 GB
> number of clients: <varying from 8 to 64 >
> number of threads: <varying from 8 to 64 >
> transaction type: SELECT only

The data for shared_buffers = 7GB is attached with this mail. I have also attached scripts used to take this data.

Note: I am using the pg_prewarm utility to warm up buffers, which was developed by Robert Haas in the last CommitFest but
was not committed. Some other utility could also be used to warm up the buffers if required.

> One problem I could see with proposed change is that in some cases the usage count will get decrement for a buffer
> allocated from free list immediately as it can be nextvictimbuffer.
> However there can be solution to this problem.


With Regards,
Amit Kapila.
Attachment

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Friday, September 07, 2012 6:44 PM Amit kapila wrote:
On Thursday, September 06, 2012 2:38 PM Amit kapila wrote:
On Tuesday, September 04, 2012 6:55 PM Amit kapila wrote:
On Tuesday, September 04, 2012 12:42 AM Jeff Janes wrote:
On Mon, Sep 3, 2012 at 7:15 AM, Amit kapila <amit.kapila@huawei.com> wrote:
>>>> This patch is based on below Todo Item:
>>
>>>> Consider adding buffers the background writer finds reusable to the free
>>>> list

> The results for the updated code is attached with this mail.
> The scenario is same as in original mail.
>    1. Load all the files in to OS buffers (using pg_prewarm with 'read' operation) of all tables and indexes.
>    2. Try to load all buffers with "pgbench_accounts" table and "pgbench_accounts_pkey" pages (using pg_prewarm with
> 'buffers' operation).
>    3. Run the pgbench with select only for 20 minutes.

> Platform details:
>    Operating System: Suse-Linux 10.2 x86_64
>    Hardware : 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
>    RAM : 24GB

> Server Configuration:
>    shared_buffers = 5GB     (1/4 th of RAM size)
>    Total data size = 16GB
> Pgbench configuration:
>        transaction type: SELECT only
>        scaling factor: 1200
>        query mode: simple
>        number of clients: <varying from 8 to 64 >
>        number of threads: <varying from 8 to 64 >
>        duration: 1200 s

> I shall take further readings for following configurations and post the same:

> 2.The intention for taking with below configuration is that, with the defined testcase, memory kept for shared
> buffers is less then the recommended. So I wanted to check the impact of it.
> Shared_buffers - 2 GB
> number of clients: <varying from 8 to 64 >
> number of threads: <varying from 8 to 64 >
> transaction type: SELECT only

The results for this test are attached in Results_v2_sharedbuffers_2G.html

> 3. The intention for taking with below configuration is that, with the defined testcase, it will test mix of dml
> operations where there will be I/O due to dml operations. So I wanted to check the impact of it.
> Shared_buffers - 5GB
> number of clients: <varying from 8 to 64 >
> number of threads: <varying from 8 to 64 >
> transaction type: tpc_b

The results for this test are attached in Results_v2_sharedbuffers_5G_tcp_b.html

Conclusion from the data collected so far:
1. When the shared_buffers configuration is the recommended 25% of RAM, there is a good performance improvement for the
   SELECT operation. It improves further when there is high contention.
2. When the shared_buffers configuration is less than the recommended 25% of RAM, there is no performance improvement, or a
   slight dip, for the SELECT operation. It stabilizes when there are more threads.
3. When the shared_buffers configuration is the recommended 25% of RAM, there is a negligible dip for the tpc_b
   benchmark.

If there is no objection to this performance improvement related to buffer management, I shall upload it for the coming
CommitFest.
I know that I might need to do much more data collection to validate this patch; however, if I get some feedback it
will make much more sense.

Suggestions/Opinions?

With Regards,
Amit Kapila.
Attachment

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Jeff Janes
Date:
On Tue, Sep 4, 2012 at 6:25 AM, Amit kapila <amit.kapila@huawei.com> wrote:
> On Tuesday, September 04, 2012 12:42 AM Jeff Janes wrote:
> On Mon, Sep 3, 2012 at 7:15 AM, Amit kapila <amit.kapila@huawei.com> wrote:
>>> This patch is based on below Todo Item:
>>
>>> Consider adding buffers the background writer finds reusable to the free
>>> list
>>
>>
>>
>>> I have tried implementing it and taken the readings for Select when all the
>>> data is in either OS buffers
>>
>>> or Shared Buffers.
>>
>>
>>
>>> The Patch has simple implementation for  "bgwriter or checkpoint process
>>> moving the unused buffers (unpinned with "ZERO" usage_count buffers) into
>>> "freelist".
>
>> I don't think InvalidateBuffer can be safely used in this way.  It
>> says "We assume
>> that no other backend could possibly be interested in using the page",
>> which is not true here.
>
> As I understood and anlyzed based on above, that there is problem in attached patch such that in function
> InvalidateBuffer(), after UnlockBufHdr() and before PartitionLock if some backend uses that buffer and increase the
> usage count to 1, still
> InvalidateBuffer() will remove the buffer from hash table and put it in Freelist.
> I have modified the code to address above by checking refcount & usage_count  inside Partition Lock
> , LockBufHdr and only after that move it to freelist which is similar to InvalidateBuffer.
> In actual code we can optimize the current code by using extra parameter in InvalidateBuffer.
>
> Please let me know if I understood you correctly or you want to say something else by above comment?

Yes, I think that this is part of the risk I was hinting at.  I
haven't evaluated your fix to it.  But assuming it is now safe, I
still think it is a bad idea to invalidate a perfectly good buffer.
Now a process that wants that page will have to read it in again, even
though it is still sitting there.  This is particularly bad because
the background writer is coded to always circle the buffer pool every
2 minutes, whether that many clean buffers are needed or not.  I think
that that is a bad idea, but having it invalidate buffers as it goes
is even worse.

I think the code for the free-list linked list is written so that it
performs correctly for a valid buffer to be on the freelist, even
though that does not happen under current implementations.  If you
find that a buffer on the freelist has become pinned, used, or dirty
since it was added (which can only happen if it is still valid), you
just remove it and try again.
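For reference, that consumer-side re-check is roughly the following (a paraphrased sketch of the freelist path in StrategyGetBuffer(), not a verbatim copy of the source):

    /*
     * Paraphrased sketch: a valid buffer sitting on the freelist is harmless,
     * because it is re-checked here and skipped if anyone is using it.
     */
    while (StrategyControl->firstFreeBuffer >= 0)
    {
        volatile BufferDesc *buf = &BufferDescriptors[StrategyControl->firstFreeBuffer];

        /* unlink it from the freelist */
        StrategyControl->firstFreeBuffer = buf->freeNext;
        buf->freeNext = FREENEXT_NOT_IN_LIST;

        /* pinned or recently used?  then it is not a victim; try the next one */
        LockBufHdr(buf);
        if (buf->refcount == 0 && buf->usage_count == 0)
            return buf;         /* caller receives it with the header lock held */
        UnlockBufHdr(buf);
    }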

>
>> Also, do we want to actually invalidate the buffers?  If someone does
>> happen to want one after it is put on the freelist, making it read it
>> in again into a different buffer doesn't seem like a nice thing to do,
>> rather than just letting it reclaim it.
>
> But even if bgwriter/checkpoint don't do, Backend needing new buffer will do similar things (remove from hash table)
forthis buffer as this is nextvictim buffer.
 

Right, but only if it is the nextvictim, here we do it if it is
nextvictim+N, for some largish values of N.  (And due to the 2 minutes
rule, sometimes for very large values of N)

I'm not sure how to devise a test case to prove that this can be
important, though.

Robert wrote an accounting patch a while ago that tallied how often a
buffer was cleaned but then reclaimed for the same page before being
evicted.  But now I can't find it.  If you can find that thread, there
might be some benchmarks posted to it that would be useful.


Cheers,

Jeff



Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Friday, October 19, 2012 9:15 PM Jeff Janes wrote:
On Tue, Sep 4, 2012 at 6:25 AM, Amit kapila <amit.kapila@huawei.com> wrote:
> On Tuesday, September 04, 2012 12:42 AM Jeff Janes wrote:
> On Mon, Sep 3, 2012 at 7:15 AM, Amit kapila <amit.kapila@huawei.com> wrote:
>>> This patch is based on below Todo Item:
>>
>>> Consider adding buffers the background writer finds reusable to the free
>>> list
>>
>>
>>
>>> I have tried implementing it and taken the readings for Select when all the
>>> data is in either OS buffers
>>
>>> or Shared Buffers.
>>
>>
>>
>> As I understood and anlyzed based on above, that there is problem in attached patch such that in function
>> InvalidateBuffer(), after UnlockBufHdr() and before PartitionLock if some backend uses that buffer and increase the
>> usage count to 1, still
>> InvalidateBuffer() will remove the buffer from hash table and put it in Freelist.
>> I have modified the code to address above by checking refcount & usage_count  inside Partition Lock
>> , LockBufHdr and only after that move it to freelist which is similar to InvalidateBuffer.
>> In actual code we can optimize the current code by using extra parameter in InvalidateBuffer.
>
>> Please let me know if I understood you correctly or you want to say something else by above comment?

> Yes, I think that this is part of the risk I was hinting at.  I
> haven't evaluated your fix to it.  But assuming it is now safe, I
> still think it is a bad idea to invalidate a perfectly good buffer.
> Now a process that wants that page will have to read it in again, even
> though it is still sitting there.  This is particularly bad because
> the background writer is coded to always circle the buffer pool every
> 2 minutes, whether that many clean buffers are needed or not.  I think
> that that is a bad idea, but having it invalidate buffers as it goes
> is even worse.

That is true, but isn't that the low-activity case? In general the bgwriter takes into account how many buffers were
allocated and how many clock-sweep passes completed, to make sure it cleans the buffers appropriately.
One more doubt I have is whether this behavior (circling the buffer pool every 2 minutes) can't be controlled by
'bgwriter_lru_maxpages', since this number can dictate how many buffers to clean in each cycle.

> I think the code for the free-list linked list is written so that it
> performs correctly for a valid buffer to be on the freelist, even
> though that does not happen under current implementations.

> If you
> find that a buffer on the freelist has become pinned, used, or dirty
> since it was added (which can only happen if it is still valid), you
> just remove it and try again.

Is it actually possible, in any use case, for the buffer management algorithm to find a buffer on the freelist which is pinned or dirty?

>
>> Also, do we want to actually invalidate the buffers?  If someone does
>> happen to want one after it is put on the freelist, making it read it
>> in again into a different buffer doesn't seem like a nice thing to do,
>> rather than just letting it reclaim it.
>
> But even if bgwriter/checkpoint don't do, Backend needing new buffer will do similar things (remove from hash table)
> for this buffer as this is nextvictim buffer.

> Right, but only if it is the nextvictim, here we do it if it is
> nextvictim+N, for some largish values of N.  (And due to the 2 minutes
> rule, sometimes for very large values of N)

Can't we control this 2-minute rule using a new or existing GUC? Is there any harm in that, given that you also pointed
out earlier in the mail chain that it is not good?
Such a parameter could make the flushing done by the bgwriter more valuable.

>I'm not sure how to devise a test case to prove that this can be important, though.

To start with, can't we do a simple test where all (or most) of the pages are in shared buffers and then run a pgbench
select-only test?
We can run this test with various shared_buffers configurations.

I have done tests similar to the above, and they show a good performance improvement for a shared_buffers configuration of 25% of RAM.


> Robert wrote an accounting patch a while ago that tallied how often a
> buffer was cleaned but then reclaimed for the same page before being
> evicted.  But now I can't find it.  If you can find that thread, there
> might be some benchmarks posted to it that would be useful.

In my first-level search, I was also not able to find it. But now I am planning to check all the
mails of Robert Haas on the PostgreSQL site (which are approximately 13,000).
If you can tell me approximately how long ago (last year, two years back, ...) or whether such a patch was submitted to any
CF or was just discussed in a mail chain, it will be a little easier for me.


Thank you for doing the initial review of this work.

With Regards,
Amit Kapila.






Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Jeff Janes
Date:
On Fri, Sep 7, 2012 at 6:14 AM, Amit kapila <amit.kapila@huawei.com> wrote:
> On Thursday, September 06, 2012 2:38 PM Amit kapila wrote:
> On Tuesday, September 04, 2012 6:55 PM Amit kapila wrote:
> On Tuesday, September 04, 2012 12:42 AM Jeff Janes wrote:
> On Mon, Sep 3, 2012 at 7:15 AM, Amit kapila <amit.kapila@huawei.com> wrote:
>>>> This patch is based on below Todo Item:
>>
>>>> Consider adding buffers the background writer finds reusable to the free
>>>> list
>
>> The results for the updated code is attached with this mail.
>> The scenario is same as in original mail.
>>    1. Load all the files in to OS buffers (using pg_prewarm with 'read' operation) of all tables and indexes.
>>    2. Try to load all buffers with "pgbench_accounts" table and "pgbench_accounts_pkey" pages (using pg_prewarm with
>> 'buffers' operation).
>>    3. Run the pgbench with select only for 20 minutes.
>
>> Platform details:
>>    Operating System: Suse-Linux 10.2 x86_64
>>    Hardware : 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
>>    RAM : 24GB
>
>> Server Configuration:
>>    shared_buffers = 5GB     (1/4 th of RAM size)
>>    Total data size = 16GB
>> Pgbench configuration:
>>        transaction type: SELECT only
>>        scaling factor: 1200
>>        query mode: simple
>>        number of clients: <varying from 8 to 64 >
>>        number of threads: <varying from 8 to 64 >
>>        duration: 1200 s
>
>> I shall take further readings for following configurations and post the same:
>> 1. The intention for taking with below configuration is that, with the defined testcase, there will be some cases
>> where I/O can happen. So I wanted to check the impact of it.
>
>> Shared_buffers - 7 GB
>> number of clients: <varying from 8 to 64 >
>> number of threads: <varying from 8 to 64 >
>> transaction type: SELECT only
>
> The data for shared_buffers = 7GB is attached with this mail. I have also attached scripts used to take this data.

Is this result reproducible?  Did you monitor IO (with something like
vmstat) to make sure there was no IO going on during the runs?  Run
the modes in reciprocating order?

If you have 7GB of shared_buffers and 16GB of database, that comes out
to 23GB of data to be held in 24GB of RAM.  In my experience it is
hard to get that much data cached by simple prewarm.  The newer data
will drive out the older data even if technically there is room.  So
then when you start running the benchmark, you still have to read in
some of the data which dramatically slows down the benchmark.

I haven't been able to detect any reliable difference in performance
with this patch.  I've been testing with 150 scale factor with 4GB of
ram and 4 cores, over a variety of shared_buffers and concurrencies.

Cheers,

Jeff



Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Jeff Janes
Date:
On Fri, Oct 19, 2012 at 11:00 PM, Amit kapila <amit.kapila@huawei.com> wrote:
>
>> Robert wrote an accounting patch a while ago that tallied how often a
>> buffer was cleaned but then reclaimed for the same page before being
>> evicted.  But now I can't find it.  If you can find that thread, there
>> might be some benchmarks posted to it that would be useful.
>
> In my first level search, I am also not able to find it. But now I am planning to check all
> mails of Robert Haas on PostgreSQL site (which are approximately 13,000).
> If you can tell me how long ago approximately (last year, 2 yrs back, ..) or whether such a patch is submitted
> to any CF or was just discussed in mail chain, then it will be little easier for me.

It was just an instrumentation patch for doing experiments, not
intended for commit.

I've tracked it down to the thread "Initial 9.2 pgbench write
results".  But I don't think it applies to the -S benchmark, because
it records when the background writer cleaned a buffer by finding it
dirty and writing it out to make it clean, while in this situation we
would need something more like "either made the buffer clean and
reusable, observed the buffer to already be clean and reusable"


Cheers,

Jeff



Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Saturday, October 20, 2012 11:03 PM Jeff Janes wrote:
On Fri, Sep 7, 2012 at 6:14 AM, Amit kapila <amit.kapila@huawei.com> wrote:
> On Thursday, September 06, 2012 2:38 PM Amit kapila wrote:
> On Tuesday, September 04, 2012 6:55 PM Amit kapila wrote:
> On Tuesday, September 04, 2012 12:42 AM Jeff Janes wrote:
> On Mon, Sep 3, 2012 at 7:15 AM, Amit kapila <amit.kapila@huawei.com> wrote:
>>>> This patch is based on below Todo Item:
>>
>>>> Consider adding buffers the background writer finds reusable to the free
>>>> list
>
>>> The results for the updated code is attached with this mail.
>>> The scenario is same as in original mail.
>>>    1. Load all the files in to OS buffers (using pg_prewarm with 'read' operation) of all tables and indexes.
>>>    2. Try to load all buffers with "pgbench_accounts" table and "pgbench_accounts_pkey" pages (using pg_prewarm
>>> with 'buffers' operation).
>>>    3. Run the pgbench with select only for 20 minutes.
>
>>> Platform details:
>>>    Operating System: Suse-Linux 10.2 x86_64
>>>    Hardware : 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
>>>    RAM : 24GB
>
>>> Server Configuration:
>>>    shared_buffers = 5GB     (1/4 th of RAM size)
>>>    Total data size = 16GB
>>> Pgbench configuration:
>>>        transaction type: SELECT only
>>>        scaling factor: 1200
>>>        query mode: simple
>>>        number of clients: <varying from 8 to 64 >
>>>        number of threads: <varying from 8 to 64 >
>>>        duration: 1200 s
>
>>> I shall take further readings for following configurations and post the same:
>>> 1. The intention for taking with below configuration is that, with the defined testcase, there will be some cases
>>> where I/O can happen. So I wanted to check the impact of it.
>
>>> Shared_buffers - 7 GB
>>> number of clients: <varying from 8 to 64 >
>>> number of threads: <varying from 8 to 64 >
>>> transaction type: SELECT only
>
>> The data for shared_buffers = 7GB is attached with this mail. I have also attached scripts used to take this data.

> Is this result reproducible?  Did you monitor IO (with something like
>vmstat) to make sure there was no IO going on during the runs?

Yes, I have reproduced it twice. However, I shall reproduce it once more and use vmstat as well.
I have not observed it with vmstat, but it is observable in the data.
When I kept shared buffers at 5G the tps was higher, and when I increased it to 7G the tps was reduced, which shows
that some I/O started happening.
When I increased it to 10G, the tps reduced drastically, which shows there is a lot of I/O. Tomorrow I will post the 10G
shared buffers data as well.

>Run the modes in reciprocating order?
Sorry, I didn't understand this. What do you mean by running the modes in reciprocating order?

> If you have 7GB of shared_buffers and 16GB of database, that comes out
> to 23GB of data to be held in 24GB of RAM.  In my experience it is
> hard to get that much data cached by simple prewarm. the newer data
> will drive out the older data even if technically there is room.  So
> then when you start running the benchmark, you still have to read in
> some of the data which dramatically slows down the benchmark.

Yes, with 7G the chances of doing I/O are high, but with 5G the chances are lower, which is observed in the data as well
(TPS in the 7G data is less than in the 5G data).
Please see the results for 5G shared buffers in the mail below:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00318.php

In the 7G case, you can see in the data that without this patch the tps with the original code is quite low compared to
the 5G data.
I am sorry, there is one typo in the 7G shared buffers data: the heading wrongly says 5G.

>I haven't been able to detect any reliable difference in performance
>with this patch.  I've been testing with 150 scale factor with 4GB of
>ram and 4 cores, over a variety of shared_buffers and concurrencies.

I think the main reason for this is that when shared buffers are small, there is no performance gain;
I observed the same when I ran this test with shared_buffers=2G, and there was no performance gain.
Please see the results for shared_buffers=2G in the mail below:
http://archives.postgresql.org/pgsql-hackers/2012-09/msg00422.php

The reason I can think of is that when shared buffers are small, the clock sweep runs very fast and there is no
bottleneck.
Only when shared buffers increase above some threshold does it spend a reasonable amount of time in the clock sweep.

I shall run once with the same configuration you mentioned, but I think it will not give any performance gain, for the
reason mentioned above.
Is it feasible for you to run with larger shared buffers and also somewhat larger data and RAM?
Basically, I want to know if you can mimic the situation covered by the tests I have posted. In any case, I shall run the
tests once again and post the data.


With Regards,
Amit Kapila.




Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit Kapila
Date:
On Saturday, October 20, 2012 11:07 PM  Jeff Janes wrote:
> On Fri, Oct 19, 2012 at 11:00 PM, Amit kapila <amit.kapila@huawei.com>
> wrote:
> >
> >> Robert wrote an accounting patch a while ago that tallied how often a
> >> buffer was cleaned but then reclaimed for the same page before being
> >> evicted.  But now I can't find it.  If you can find that thread,
> there
> >> might be some benchmarks posted to it that would be useful.
> >
> > In my first level search, I am also not able to find it. But now I am
> planning to check all
> > mails of Robert Haas on PostgreSQL site (which are approximately
> 13,000).
> > If you can tell me how long ago approximately (last year, 2 yrs back,
> ..) or whether such a patch is submitted
> > to any CF or was just discussed in mail chain, then it will be little
> easier for me.
> 
> It was just an instrumentation patch for doing experiments, not
> intended for commit.
> 
> I've tracked it down to the thread "Initial 9.2 pgbench write
> results".  But I don't think it applies to the -S benchmark, because
> it records when the background writer cleaned a buffer by finding it
> dirty and writing it out to make it clean, while in this situation we
> would need something more like "either made the buffer clean and
> reusable, observed the buffer to already be clean and reusable"

Do you think an instrumentation patch that tells us how many times a
buffer is found by the clock sweep and how many times it's found on the freelist
would be useful?
I wrote something along similar lines when I was testing this patch, to
find out how many times this patch can avoid the clock sweep.
My observation was that although the new implementation saves many cycles of
the clock sweep, with shared buffers up to 2-2.5GB there is still no visible
performance gain.
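If it helps, the kind of counters I mean look roughly like this (a sketch with made-up, per-backend counter names placed at the two exits of StrategyGetBuffer(); not something intended for commit):

    /* Hypothetical instrumentation: how often a request is satisfied from
     * the freelist versus by running the clock sweep. */
    static uint64 buffers_from_freelist = 0;
    static uint64 buffers_from_clocksweep = 0;

    /* ... in StrategyGetBuffer(), on the freelist hit path ... */
    buffers_from_freelist++;

    /* ... in StrategyGetBuffer(), when the clock sweep picks a victim ... */
    buffers_from_clocksweep++;

    /* dumped now and then, e.g. from a periodic DEBUG log line */
    elog(DEBUG1, "freelist hits: " UINT64_FORMAT ", clock-sweep victims: " UINT64_FORMAT,
         buffers_from_freelist, buffers_from_clocksweep);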

With Regards,
Amit Kapila.




Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Sunday, October 21, 2012 1:29 PM Amit kapila wrote:
On Saturday, October 20, 2012 11:03 PM Jeff Janes wrote:
On Fri, Sep 7, 2012 at 6:14 AM, Amit kapila <amit.kapila@huawei.com> wrote:

>>>> The results for the updated code is attached with this mail.
>>>> The scenario is same as in original mail.
>>>>    1. Load all the files in to OS buffers (using pg_prewarm with 'read' operation) of all tables and indexes.
>>>>    2. Try to load all buffers with "pgbench_accounts" table and "pgbench_accounts_pkey" pages (using pg_prewarm
>>>> with 'buffers' operation).
>>>>    3. Run the pgbench with select only for 20 minutes.
>
>>>> Platform details:
>>>>    Operating System: Suse-Linux 10.2 x86_64
>>>>    Hardware : 4 core (Intel(R) Xeon(R) CPU L5408 @ 2.13GHz)
>>>>    RAM : 24GB
>
>>>> Server Configuration:
>>>>    shared_buffers = 5GB     (1/4 th of RAM size)
>>>>    Total data size = 16GB
>>>> Pgbench configuration:
>>>>        transaction type: SELECT only
>>>>        scaling factor: 1200
>>>>        query mode: simple
>>>>        number of clients: <varying from 8 to 64 >
>>>>        number of threads: <varying from 8 to 64 >
>>>>        duration: 1200 s
>
>>>> I shall take further readings for following configurations and post the same:
>>>> 1. The intention for taking with below configuration is that, with the defined testcase, there will be some cases
>>>> where I/O can happen. So I wanted to check the impact of it.
>
>>>> Shared_buffers - 7 GB
>>>> number of clients: <varying from 8 to 64 >
>>>> number of threads: <varying from 8 to 64 >
>>>> transaction type: SELECT only
>
>>> The data for shared_buffers = 7GB is attached with this mail. I have also attached scripts used to take this data.

>> Is this result reproducible?  Did you monitor IO (with something like
>>vmstat) to make sure there was no IO going on during the runs?

> Yes, I have reproduced it 2 times. However I shall reproduce once more and use vmstat as well.
> I have not observed with vmstat but it is observable in the data.
> When I have kept shared buffers = 5G, the tps is more and when I increased it to 7G, the tps is reduced which shows
> there is some I/O started happening.
> When I increased to 10G, the tps reduced drastically which shows there is lot of I/O. Tommorow I will post 10G shared
> buffers data as well.

Today I have again collected the data for the shared_buffers = 7G configuration, along with vmstat.
The data and the vmstat information (bi) are attached with this mail. It is observed from the vmstat info that I/O happens
in both cases; however, after running for a long time, the I/O is also comparatively less with the new patch.

I have attached the vmstat report for only one type of configuration, but I have data for the others as well.
Please let me know if you want to have a look at that data too.

With Regards,
Amit Kapila.
Attachment

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Monday, October 22, 2012 11:21 PM Amit kapila wrote
On Sunday, October 21, 2012 1:29 PM Amit kapila wrote:
On Saturday, October 20, 2012 11:03 PM Jeff Janes wrote:
On Fri, Sep 7, 2012 at 6:14 AM, Amit kapila <amit.kapila@huawei.com> wrote:

>>>>> The results for the updated code is attached with this mail.
>>>>> The scenario is same as in original mail.

>
>>>> The data for shared_buffers = 7GB is attached with this mail. I have also attached scripts used to take this data.

>>> Is this result reproducible?  Did you monitor IO (with something like
>>>vmstat) to make sure there was no IO going on during the runs?

>> Yes, I have reproduced it 2 times. However I shall reproduce once more and use vmstat as well.
>> I have not observed with vmstat but it is observable in the data.
>> When I have kept shared buffers = 5G, the tps is more and when I increased it to 7G, the tps is reduced which shows
>> there is some I/O started happening.
>> When I increased to 10G, the tps reduced drastically which shows there is lot of I/O. Tommorow I will post 10G
>> shared buffers data as well.

> Today again I have again collected the data for configuration Shared_buffers = 7G along with vmstat.
> The data and vmstat information (bi) are attached with this mail. It is observed from vmstat info that I/O is
> happening for both cases, however after running for
> long time, the I/O is also comparatively less with new patch.

Please find the data for shared buffers = 5G and 10G attached with this mail.
The following is consolidated data, averaged over multiple runs:

-Patch-               -tps@-c8-  -tps@-c16-  -tps@-c32-  -tps@-c64-  -tps@-c100-
head, sb-5G               59731       59185       56282       30068        12608
head+patch, sb-5G         59177       59957       57831       47986        21325

head, sb-7G                5866        6319        6604        5841
head+patch, sb-7G         15939       40501       38199       18025

head, sb-10G               2079        2824        3217        3206         2657
head+patch, sb-10G         2044        2706        3012        2967         2515

The script for collecting the performance data is also attached with this mail:

# $1 = Initialize pgbench
# $2 = Scale Factor
# $3 = No Of Clients
# $4 = No Of pgbench Threads
# $5 = Execution time in seconds
# $6 = Shared Buffers
# $7 = number of sample runs
# $8 = Drop the tables

E.g., for a 16GB database and 5GB shared buffers:

./run_reading.sh 1 1200 8 8 1200 5GB 4 0
./run_reading.sh 0 1200 16 16 1200 5GB 4 0
./run_reading.sh 0 1200 32 32 1200 5GB 4 0
./run_reading.sh 0 1200 64 64 1200 5GB 4 0

Let me know your suggestions on how we can proceed to establish whether such a patch is a win or a loss.


With Regards,
Amit Kapila.
Attachment

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Jeff Janes
Date:
On Sun, Oct 21, 2012 at 12:59 AM, Amit kapila <amit.kapila@huawei.com> wrote:
> On Saturday, October 20, 2012 11:03 PM Jeff Janes wrote:
>
>>Run the modes in reciprocating order?
> Sorry, I didn't understood this, What do you mean by modes in reciprocating order?

Sorry for the long delay.  In your scripts, it looks like you always
run the unpatched first, and then the patched second.

By reciprocating, I mean to run them in the reverse order, or in random order.

Also, for the select only transactions, I think that 20 minutes is
much longer than necessary.  I'd rather see many more runs, each one
being shorter.

Because you can't restart the server without wiping out the
shared_buffers, what I would do is make a test patch which introduces
a new guc.c setting which allows the behavior to be turned on and off
with a SIGHUP (pg_ctl reload).
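For illustration, such a test-only setting could take roughly this shape as an entry in guc.c's ConfigureNamesBool table (the GUC name and C variable below are made up; PGC_SIGHUP is what makes it reloadable with pg_ctl reload):

    /* hypothetical toggle for the test patch */
    bool move_buffers_to_freelist = false;

    /* ... entry added to ConfigureNamesBool[] in guc.c ... */
    {
        {"move_buffers_to_freelist", PGC_SIGHUP, RESOURCES_BGWRITER,
            gettext_noop("Background writer puts reusable buffers on the freelist."),
            NULL
        },
        &move_buffers_to_freelist,
        false,
        NULL, NULL, NULL
    },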


>
>>I haven't been able to detect any reliable difference in performance
>>with this patch.  I've been testing with 150 scale factor with 4GB of
>>ram and 4 cores, over a variety of shared_buffers and concurrencies.
>
> I think the main reason for this is that when shared buffers are less, then there is no performance gain,
> even the same is observed by me when I ran this test with shared buffers=2G, there is no performance gain.
> Please see the results of shared buffers=2G in below mail:
> http://archives.postgresql.org/pgsql-hackers/2012-09/msg00422.php

True, but I think that testing with shared_buffers=2G when RAM is 4GB
(and pgbench scale is also lower) should behave different than doing
so when RAM is 24 GB.

>
> The reason I can think of is because when shared buffers are less then clock sweep runs very fast and there is no
> bottleneck.
> Only when shared buffers increase above some threshhold, it spends reasonable time in clock sweep.

I am rather skeptical of this.  When the work set doesn't fit in
memory under a select-only workload, then about half the buffers will
be evictable at any given time, and half will have usagecount=1, and a
handful will have usagecount>=4 (index meta, root and branch blocks).  This
will be the case over a wide range of shared_buffers, as long as it is
big enough to hold all index branch blocks but not big enough to hold
everything.  Given this state of affairs, the average clock sweep
should be about 2, regardless of the exact size of shared_buffers.
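(Spelling out that arithmetic: if each buffer examined is evictable with probability p, the sweep length is geometric, so the expected number of buffers examined per eviction is

    E[sweep length] = \sum_{k>=1} k (1-p)^{k-1} p = 1/p = 2    for p = 1/2,

independent of the total number of buffers.)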

The one wrinkle I could think of is if all the usagecount=1 buffers
are grouped into a continuous chunk, and all the usagecount=0 are in
another chunk.  The average would still be 2, but the average would be
made up of N/2 runs of length 1, followed by one run of length N/2.
Now if 1 process is stuck in the N/2 stretch and all other processes
are waiting on that, maybe that somehow escalates the waits so that
they are larger when N is larger, but I still don't see how the math
works on that.

Are you working on this just because it was on the ToDo List, or
because you have actually run into a problem with it?  I've never seen
freelist lock contention be a problem on machines with less than 8
CPU, but both of us are testing on smaller machines.  I think we
really need to test this on something bigger.

Cheers,

Jeff



Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Jeff Janes
Date:
On Mon, Oct 22, 2012 at 10:51 AM, Amit kapila <amit.kapila@huawei.com> wrote:
>

> Today again I have again collected the data for configuration Shared_buffers = 7G along with vmstat.
> The data and vmstat information (bi) are attached with this mail. It is observed from vmstat info that I/O is
> happening for both cases, however after running for
> long time, the I/O is also comparatively less with new patch.

What I see in the vmstat report is that it takes 5.5 "runs" to get
really good and warmed up, and so it crawls for the first 5.5
benchmarks and then flies for the last 0.5 benchmark.  The way you
have your runs ordered, that last 0.5 of a benchmark is for the
patched version, and this drives up the average tps for the patched
case.

Also, there is no theoretical reason to think that your patch would
decrease the amount of IO needed (in fact, by invalidating buffers
early, it could be expected to increase the amount of IO).  So this
also argues that the increase in performance is caused by the decrease
in IO, but the patch isn't causing that decrease, it merely benefits
from it due to an accident of timing.

Cheers,

Jeff



Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Monday, November 19, 2012 5:53 AM Jeff Janes wrote:
On Sun, Oct 21, 2012 at 12:59 AM, Amit kapila <amit.kapila@huawei.com> wrote:
> On Saturday, October 20, 2012 11:03 PM Jeff Janes wrote:
>
>>Run the modes in reciprocating order?
>> Sorry, I didn't understood this, What do you mean by modes in reciprocating order?

> Sorry for the long delay.  In your scripts, it looks like you always
> run the unpatched first, and then the patched second.

   Yes, that's true.

> By reciprocating, I mean to run them in the reverse order, or in random order.

Today, for some configurations, I ran them in reciprocating order.
Below are the readings:
Configuration:
16GB database, 7GB shared buffers

Here I ran in the following order:
        1. Run the performance test with the patch for 32 clients
        2. Run the performance test without the patch for 32 clients
        3. Run the performance test with the patch for 16 clients
        4. Run the performance test without the patch for 16 clients

Each execution is 5 minutes.
    16 client / 16 thread   |   32 client / 32 thread
   @mv-free-lst   @9.3devl  |  @mv-free-lst   @9.3devl
-------------------------------------------------------
       3669           4056  |       5356          5258
       3987           4121  |       4625          5185
       4840           4574  |       4502          6796
       6465           6932  |       4558          8233
       6966           7222  |       4955          8237
       7551           7219  |       9115          8269
       8315           7168  |      43171          8340
       9102           7136  |      57920          8349
-------------------------------------------------------
 avg:  6362           6054  |      16775          7333

increase 16c/16t: 5.09%
increase 32c/32t: 128.76%

Apart from the above, I also kept the test running for 1 hour. Here again the order of execution was first the run with the
patch and then the original.

 32 client /32 thread for 1 hour
                    @mv-free-lst    @9.3devl
Single-run:    9842.019229      8050.357981

Increase 32c/32t: 22%



> Also, for the select only transactions, I think that 20 minutes is
> much longer than necessary.  I'd rather see many more runs, each one
> being shorter.

I have taken care of this; I don't know whether 5 minutes is appropriate or you meant something even shorter.

> Because you can't restart the server without wiping out the
> shared_buffers, what I would do is make a test patch which introduces
> a new guc.c setting which allows the behavior to be turned on and off
> with a SIGHUP (pg_ctl reload).

Okay, this is a good idea.


>
>> The reason I can think of is because when shared buffers are less then clock sweep runs very fast and there is no
>> bottleneck.
>> Only when shared buffers increase above some threshhold, it spends reasonable time in clock sweep.

> I am rather skeptical of this.  When the work set doesn't fit in
> memory under a select-only workload, then about half the buffers will
> be evictable at any given time, and half will have usagecount=1, and a
> handful will usagecount>=4 (index meta, root and branch blocks).  This
> will be the case over a wide range of shared_buffers, as long as it is
> big enough to hold all index branch blocks but not big enough to hold
> everything.  Given this state of affairs, the average clock sweep
> should be about 2, regardless of the exact size of shared_buffers.

> The one wrinkle I could think of is if all the usagecount=1 buffers
> are grouped into a continuous chunk, and all the usagecount=0 are in
> another chunk.  The average would still be 2, but the average would be
> made up of N/2 runs of length 1, followed by one run of length N/2.
> Now if 1 process is stuck in the N/2 stretch and all other processes
> are waiting on that, maybe that somehow escalates the waits so that
> they are larger when N is larger, but I still don't see how the math
> works on that.

The two problems observed in the VTune profiler reports for buffer management are:
a. partition lock
b. BufFreeListLock

Tomorrow I will send you some of the profiler reports for the different scenarios where the above is observed.

I think that as long as there is contention on the partition lock, reducing contention on BufFreeListLock might not even show up.
The idea (move the buffers to the freelist) will improve the situation for both locks to an extent:
after the bgwriter has invalidated a buffer, a backend doesn't need to remove it from the hash table and hence
doesn't need the old partition lock.
Hash partition lock contention will be reduced by this only if the new and old partitions are different,
which is quite probable since the clock sweep does not care about partitions when it tries to find a usable buffer.


> Are you working on this just because it was on the ToDo List, or
> because you have actually run into a problem with it?

The reason behind this work is that late last year I did some benchmarking of PostgreSQL 9.1 against some other
commercial databases and found that the performance of PG's SELECT operation is much lower than the others'.

"One key observation was that for PostgreSQL, increasing shared buffers improves performance up to a certain
level, but after a certain point this parameter doesn't increase performance further; the situation in other databases is better."

As part of that activity, I did some design study of buffer management/checkpointing and some other areas such as MVCC for
various databases.
Part of the study for buffer management and checkpointing is attached with this mail.

IMO, there are certain other things we could attempt in buffer management:

1. Have separate hot and cold lists of shared buffers, something similar to Clock-Pro.
2. Have some amount of shared buffers reserved for hot tables.
3. Instead of moving buffers to the freelist, have the BGWriter/Checkpoint move them to a cold list (a rough sketch
   follows below).
   The cold-list concept is as follows:
   a. Have cold lists equal in number to the hash partitions.
   b. BGWriter/Checkpoint moves a buffer to the cold list whose number equals its hash partition number.
      This addresses the point that, even after the buffer is moved to a cold list, if the same page is accessed before
      somebody reuses the buffer, no I/O will be required.
4. Reduce the freelist-lock contention by having multiple freelists.
   This has been tried, but gave no performance improvement.
5. Reduce the granularity of the freelist lock used to get the next buffer.
   Some time back Ants Aasma sent a patch for this, with which no performance improvement was observed.

Considering that points 4 and 5 have not given performance benefits, I think
reducing contention around only the freelist lock will not yield any performance gain.
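A rough structural sketch of the per-partition cold-list idea in point 3 (all names and the layout are purely illustrative; a real design would need to integrate with the existing freelist and buffer-header locking):

    /*
     * Illustrative only: one cold list per buffer-mapping partition.  A
     * buffer pushed here stays valid and mapped, so a hit on the same page
     * before reuse needs no I/O.
     */
    typedef struct ColdList
    {
        slock_t     lock;       /* protects this list only */
        int         first;      /* buf_id of list head, or -1 if empty */
    } ColdList;

    static ColdList coldLists[NUM_BUFFER_PARTITIONS];

    /* bgwriter/checkpointer side: push a reusable, clean buffer */
    static void
    PushToColdList(volatile BufferDesc *buf)
    {
        BufferTag   tag = buf->tag;
        uint32      hashcode = BufTableHashCode(&tag);
        ColdList   *cl = &coldLists[BufTableHashPartition(hashcode)];

        SpinLockAcquire(&cl->lock);
        buf->freeNext = cl->first;      /* reusing the freelist link field */
        cl->first = buf->buf_id;
        SpinLockRelease(&cl->lock);
    }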


>I've never seen
> freelist lock contention be a problem on machines with less than 8
> CPU, but both of us are testing on smaller machines.  I think we
> really need to test this on something bigger.

Yeah, you are right. I shall try to do so.

With Regards,
Amit Kapila.
Attachment

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Monday, November 19, 2012 6:05 AM Jeff Janes  wrote:
On Mon, Oct 22, 2012 at 10:51 AM, Amit kapila <amit.kapila@huawei.com> wrote:
>

>> Today again I have again collected the data for configuration Shared_buffers = 7G along with vmstat.
>> The data and vmstat information (bi) are attached with this mail. It is observed from vmstat info that I/O is
happeningfor both cases, however after running for 
>> long time, the I/O is also comparatively less with new patch.

>What I see in the vmstat report is that it takes 5.5 "runs" to get
>really good and warmed up, and so it crawls for the first 5.5
>benchmarks and then flies for the last 0.5 benchmark.  The way you
>have your runs ordered, that last 0.5 of a benchmark is for the
>patched version, and this drives up the average tps for the patched
>case.


> Also, there is no theoretical reason to think that your patch would
> decrease the amount of IO needed (in fact, by invalidating buffers
> early, it could be expected to increase the amount of IO).  So this
> also argues that the increase in performance is caused by the decrease
> in IO, but the patch isn't causing that decrease, it merely benefits
> from it due to an accident of timing.

Today I ran in the opposite order, and for some readings I still see a similar observation.
I am also not sure about the I/O part; I was just trying to interpret it that way based on the data. However,
maybe for some particular scenario the OS buffer management behaves that way.
As I am not aware of the OS buffer management algorithm, it's difficult to say whether such a change would have any impact
on OS buffer management that could yield better performance.

With Regards,
Amit Kapila.


Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Monday, November 19, 2012 8:52 PM Amit kapila wrote:
On Monday, November 19, 2012 5:53 AM Jeff Janes wrote:
On Sun, Oct 21, 2012 at 12:59 AM, Amit kapila <amit.kapila@huawei.com> wrote:
> On Saturday, October 20, 2012 11:03 PM Jeff Janes wrote:
>
>>Run the modes in reciprocating order?
>>> Sorry, I didn't understood this, What do you mean by modes in reciprocating order?


>> By reciprocating, I mean to run them in the reverse order, or in random order.

> Today for some configurations, I have ran by reciprocating the order.

The detailed data for all the other configurations is attached with this mail.

> The 2 problems which are observed in V-Tune Profiler Reports for Buffer Management are:
> a. Partition Lock
> b. Buf-Free List Lock

> Tommorow, I will send you some of the profiler reports for different scenario where the above is observed.

   Still not able to prepare the report. I shall try tomorrow.


With Regards,
Amit Kapila.
Attachment

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Tuesday, November 20, 2012 7:19 PM Amit kapila wrote:
On Monday, November 19, 2012 8:52 PM Amit kapila wrote:
On Monday, November 19, 2012 5:53 AM Jeff Janes wrote:
On Sun, Oct 21, 2012 at 12:59 AM, Amit kapila <amit.kapila@huawei.com> wrote:
> On Saturday, October 20, 2012 11:03 PM Jeff Janes wrote:
>
>> Tommorow, I will send you some of the profiler reports for different scenario where the above is observed.

>   Still not able to prepare reoprt. Shall try tommorow.

Please find the report attached with this mail.


With Regards,
Amit Kapila.
Attachment

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Pavan Deolasee
Date:



On Mon, Nov 19, 2012 at 8:52 PM, Amit kapila <amit.kapila@huawei.com> wrote:
On Monday, November 19, 2012 5:53 AM Jeff Janes wrote:
On Sun, Oct 21, 2012 at 12:59 AM, Amit kapila <amit.kapila@huawei.com> wrote:
> On Saturday, October 20, 2012 11:03 PM Jeff Janes wrote:
>
>>Run the modes in reciprocating order?
>> Sorry, I didn't understood this, What do you mean by modes in reciprocating order?

> Sorry for the long delay.  In your scripts, it looks like you always
> run the unpatched first, and then the patched second.

   Yes, thats true.

> By reciprocating, I mean to run them in the reverse order, or in random order.

Today for some configurations, I have ran by reciprocating the order.
Below are readings:
Configuration
16GB (Database) -7GB (Shared Buffers)

Here i had run in following order
        1. Run perf report with patch for 32 client
        2. Run perf report without patch for 32 client
        3. Run perf report with patch for 16 client
        4. Run perf report without patch for 16 client

Each execution is 5 minutes,
    16 client /16 thread     |   32 client /32 thread
   @mv-free-lst   @9.3devl   |   @mv-free-lst   @9.3devl
---------------------------------------------------------
        3669        4056     |       5356         5258
        3987        4121     |       4625         5185
        4840        4574     |       4502         6796
        6465        6932     |       4558         8233
        6966        7222     |       4955         8237
        7551        7219     |       9115         8269
        8315        7168     |      43171         8340
        9102        7136     |      57920         8349
---------------------------------------------------------
        6362        6054     |      16775         7333


Sorry, I haven't followed this thread at all, but the numbers (43171 and 57920) in the last two runs of @mv-free-list for 32 clients look like aberrations, no? I wonder if that's skewing the average.

I also looked at the Results.htm file down-thread. There seems to be a steep degradation when the shared buffers are increased from 5GB to 10GB, both with and without the patch. Is that expected? If so, isn't that worth investigating and possibly even fixing before we do anything else?

Thanks,
Pavan

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit Kapila
Date:

 

From: Pavan Deolasee [mailto:pavan.deolasee@gmail.com]
Sent: Thursday, November 22, 2012 12:26 PM
To: Amit kapila
Cc: Jeff Janes; pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] [WIP PATCH] for Performance Improvement in Buffer Management

 

 

 


> Sorry, I haven't followed this thread at all, but the numbers (43171 and 57920) in the last two runs of @mv-free-list for 32 clients look like aberrations, no? I wonder if that's skewing the average.

 

Yes, that is one of the main reasons, but it is consistent across all runs that this kind of numbers are observed for 32 clients or above.

Jeff also pointed out something similar in one of his mails and suggested running the tests such that the first test runs “with patch” and then “without patch”.

After doing what he suggested, the observations are still similar.

 

 

> I also looked at the Results.htm file down-thread. There seems to be a steep degradation when the shared buffers are increased from 5GB to 10GB, both with and
> without the patch. Is that expected? If so, isn't that worth investigating and possibly even fixing before we do anything else?

 

The reason for the decrease in performance is that when shared buffers are increased from 5GB to 10GB, I/O starts because the OS can no longer hold all
the data in its buffers.

 

With Regards,

Amit Kapila

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Pavan Deolasee
Date:



On Thu, Nov 22, 2012 at 2:05 PM, Amit Kapila <amit.kapila@huawei.com> wrote:

 


 

> Sorry, I haven't followed this thread at all, but the numbers (43171 and 57920) in the last two runs of @mv-free-list for 32 clients look like aberrations, no? I wonder if that's skewing the average.

Yes, that is one of the main reasons, but it is consistent across all runs that this kind of numbers are observed for 32 clients or above.

Jeff also pointed out something similar in one of his mails and suggested running the tests such that the first test runs “with patch” and then “without patch”.

After doing what he suggested, the observations are still similar.

 


Are we convinced that the jump we are seeing is a real one, then? I'm a bit surprised because it happens only with the patch and only for 32 clients. How would you explain that?

 

 

> I also looked at the Results.htm file down-thread. There seems to be a steep degradation when the shared buffers are increased from 5GB to 10GB, both with and
> without the patch. Is that expected? If so, isn't that worth investigating and possibly even fixing before we do anything else?

The reason for the decrease in performance is that when shared buffers are increased from 5GB to 10GB, I/O starts because the OS can no longer hold all
the data in its buffers.


Shouldn't that data be in the shared buffers if not the OS cache, and hence approximately the same IO will be required? Again, the drop in performance is so severe that it seems worth investigating further, especially because you can reproduce it reliably.

Thanks,
Pavan 

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit kapila
Date:
On Friday, November 23, 2012 11:15 AM Pavan Deolasee wrote:
On Thu, Nov 22, 2012 at 2:05 PM, Amit Kapila <amit.kapila@huawei.com> wrote:



>>> Sorry, I haven't followed this thread at all, but the numbers (43171 and 57920) in the last two runs of
>>> @mv-free-list for 32 clients look like aberrations, no? I wonder if that's skewing the average.

>> Yes, that is one of the main reasons, but it is consistent across all runs that this kind of numbers are observed for 32 clients or above.
>> Jeff also pointed out something similar in one of his mails and suggested running the tests such that the first test runs “with patch” and then “without patch”.
>> After doing what he suggested, the observations are still similar.


>Are we convinced that the jump that we are seeing is a real one then ?
 Still not convinced, as the data has been collected only in my setup.

> I'm a bit surprised because it happens only with the patch and only for 32 clients. How would you explain that?

The reason this patch can improve performance is that it reduces contention on BufFreeListLock and the PartitionLock (which a backend takes in BufferAlloc (a) to remove an old page from a buffer, or (b) to see whether the block is already in the buffer pool). As the number of backends increases, the chances of improved performance are much better. In particular, for 32 clients the results are not that skewed when the tests run for a longer time.

For 32 clients, as mentioned in the previous mail, when the test ran for 1 hour the difference is not very skewed:
32 client /32 thread for 1 hour                    @mv-free-lst    @9.3devl
Single-run:    9842.019229      8050.357981
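
To make the contention argument concrete, below is a minimal, self-contained sketch of the idea. The types and names are simplified stand-ins of my own, not the actual PostgreSQL BufferDesc/StrategyControl definitions or the patch itself: the background writer links unpinned buffers with usage_count zero onto the freelist, so a backend allocating a buffer can pop a victim from the freelist instead of running the clock sweep under BufFreeListLock and evicting a page under the partition lock.

/* Simplified model of "bgwriter moves reusable buffers to the freelist".
 * Toy types for illustration only; compile with any C99 compiler. */
#include <stdio.h>

#define NBUFFERS 8
#define FREENEXT_NOT_IN_LIST (-2)

typedef struct
{
    int refcount;      /* pinned if > 0 */
    int usage_count;   /* clock-sweep usage counter */
    int freeNext;      /* freelist link, or FREENEXT_NOT_IN_LIST */
} ToyBufferDesc;

static ToyBufferDesc buffers[NBUFFERS];
static int firstFreeBuffer = -1;    /* head of the freelist */

/* What the patch conceptually adds to the bgwriter scan: any buffer that is
 * unpinned and has usage_count == 0 is linked onto the freelist, so a later
 * allocation can take it without running the clock sweep. */
static void
bgwriter_move_reusable_to_freelist(void)
{
    for (int i = 0; i < NBUFFERS; i++)
    {
        ToyBufferDesc *buf = &buffers[i];

        if (buf->refcount == 0 &&
            buf->usage_count == 0 &&
            buf->freeNext == FREENEXT_NOT_IN_LIST)
        {
            buf->freeNext = firstFreeBuffer;
            firstFreeBuffer = i;
        }
    }
}

int
main(void)
{
    /* Mark a few buffers as reusable, the rest as pinned/in use. */
    for (int i = 0; i < NBUFFERS; i++)
    {
        buffers[i].refcount = (i % 3 == 0) ? 0 : 1;
        buffers[i].usage_count = 0;
        buffers[i].freeNext = FREENEXT_NOT_IN_LIST;
    }

    bgwriter_move_reusable_to_freelist();

    for (int i = firstFreeBuffer; i >= 0; i = buffers[i].freeNext)
        printf("buffer %d is on the freelist\n", i);
    return 0;
}

In the real backend such a scan would of course have to take the buffer-header spinlock and BufFreeListLock, which is exactly where the contention being measured above comes from.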

>>> I also looked at the Results.htm file down-thread. There seems to be a steep degradation when the shared buffers
>>> are increased from 5GB to 10GB, both with and without the patch. Is that expected? If so, isn't that worth
>>> investigating and possibly even fixing before we do anything else?

>> The reason for the decrease in performance is that when shared buffers are increased from 5GB to 10GB, I/O starts
>> because the OS can no longer hold all the data in its buffers.


> Shouldn't that data be in the shared buffers if not the OS cache, and hence approximately the same IO will be required?

I don't think so, as the data in the OS cache and in PG shared buffers have no direct relation; the OS can flush its
buffers based on its own scheduling algorithm.

Let us try to see by example:
Total RAM - 22G
Database size - 16G

Case -1 (Shared Buffers - 5G)
a. Load all the files into OS buffers. Chances are good that all 16G of data will be there in OS buffers, as the OS still has 17G of memory available.
b. Try to load all into shared buffers. The last 5G will be there in shared buffers.
c. Chances are high that access to the remaining 11G of buffers will not lead to IO, as they will be in OS buffers.

Case -2 (Shared Buffers - 10G)
a. Load all the files into OS buffers. In the best case the OS buffers can contain 10-12G of data, as the OS has 12G of memory available.
b. Try to load all into shared buffers. The last 10G will be there in shared buffers.
c. Now, as there is no direct correlation of data between shared buffers and OS buffers, whenever PG has to access any data which is not in shared buffers, there is a good chance that it can lead to IO.
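
As a back-of-envelope check of the Case-1/Case-2 reasoning above, here is a small sketch under a simplifying assumption of my own (not from the thread): pages missing from shared buffers are accessed roughly uniformly, so the chance of real IO on a shared-buffer miss is about the fraction of the database that does not fit in the OS cache.

/* Back-of-envelope estimate for the Case-1 / Case-2 example above.
 * Assumption: uniformly random access to pages not in shared buffers. */
#include <stdio.h>

static double
io_chance_on_shared_miss(double db_gb, double os_cache_gb)
{
    /* Fraction of the database that cannot be held in the OS cache. */
    double uncached_gb = db_gb - os_cache_gb;

    return (uncached_gb > 0.0) ? uncached_gb / db_gb : 0.0;
}

int
main(void)
{
    /* Case-1: 16G database, OS cache can hold roughly all 16G. */
    printf("Case-1: ~%.0f%% of shared-buffer misses need IO\n",
           100.0 * io_chance_on_shared_miss(16.0, 16.0));

    /* Case-2: 16G database, OS cache realistically holds only ~12G. */
    printf("Case-2: ~%.0f%% of shared-buffer misses need IO\n",
           100.0 * io_chance_on_shared_miss(16.0, 12.0));
    return 0;
}

With these numbers, Case-1 gives roughly a 0% chance of IO on a shared-buffer miss while Case-2 gives roughly 25%, which matches the direction of the observed degradation, though the real access pattern and cache behaviour are of course more complicated.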


> Again, the drop in the performance is so severe that it seems worth investigating that further, especially because
> you can reproduce it reliably.

Yes, I agree that it is worth investigating, but IMO this is a different problem which might not be addressed with
the patch in discussion. The two reasons I can think of for the dip in performance when shared buffers increase beyond
a certain threshold percentage of RAM are:
  a. either the algorithm of buffer management has some bottleneck, or
  b. the way data is managed in shared buffers and the OS buffer cache

Any Suggestion/Comments?

With Regards,
Amit Kapila.


Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit Kapila
Date:
>> Shouldn't that data be in the shared buffers if not the OS cache, and hence approximately the same IO will be required?

> I don't think so, as the data in the OS cache and in PG shared buffers have no direct relation; the OS can flush its
> buffers based on its own scheduling algorithm.

> Let us try to see by example:
> Total RAM - 22G
> Database size - 16G

> Case -1 (Shared Buffers - 5G)
> a. Load all the files into OS buffers. Chances are good that all 16G of data will be there in OS buffers, as the OS still has 17G of memory available.
> b. Try to load all into shared buffers. The last 5G will be there in shared buffers.
> c. Chances are high that access to the remaining 11G of buffers will not lead to IO, as they will be in OS buffers.

> Case -2 (Shared Buffers - 10G)
> a. Load all the files into OS buffers. In the best case the OS buffers can contain 10-12G of data, as the OS has 12G of memory available.
> b. Try to load all into shared buffers. The last 10G will be there in shared buffers.
> c. Now, as there is no direct correlation of data between shared buffers and OS buffers, whenever PG has to access any data
>    which is not in shared buffers, there is a good chance that it can lead to IO.

>> Again, the drop in the performance is so severe that it seems worth investigating that further, especially because you can reproduce it reliably.

>   Yes, I agree that it is worth investigating, but IMO this is a different problem which might not be addressed with the patch in discussion.
>   The two reasons I can think of for the dip in performance when shared buffers increase beyond a certain threshold percentage of RAM are:
>   a. either the algorithm of buffer management has some bottleneck, or
>   b. the way data is managed in shared buffers and the OS buffer cache

The point I want to make is explained at the below link as well:
http://blog.kimiensoftware.com/2011/05/postgresql-vs-oracle-differences-4-shared-memory-usage-257

So if the above is true, I think the performance will be regained if, in the test, shared buffers are set to 16G. I shall try that setting for a test run.

With Regards,
Amit Kapila.

Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Greg Smith
Date:
On 11/23/12 5:57 AM, Amit kapila wrote:
> Let us try to see by example:
> Total RAM - 22G
> Database size - 16G
>...
> Case -2 (Shared Buffers - 10G)
> a. Load all the files into OS buffers. In the best case the OS buffers can contain 10-12G of data, as the OS has 12G of
> memory available.
> b. Try to load all into shared buffers. The last 10G will be there in shared buffers.
> c. Now, as there is no direct correlation of data between shared buffers and OS buffers, whenever PG has to access any data
>    which is not in shared buffers, there is a good chance that it can lead to IO.

I don't think either of these examples are very representative of 
real-world behavior.  The idea of "load all the files in OS buffers" 
assumes someone has used a utility like pg_prewarm or pgfincore.  It's 
not something that happens in normal use.  Being able to re-populate all 
of RAM using those utilities isn't realistic anyway.  Anyone who tries 
to load more than (memory - shared_buffers) that way is likely to be 
disappointed by the result.

Similarly, the problem you're describing here has been described as the 
"double buffering" one for a while now.  The old suggestion that 
shared_buffers not be set about 25% of RAM comes from this sort of 
concern.  If triggering a problem requires doing that, essentially 
misconfiguring the server, it's hard to get too excited about it.

Anyway, none of that impacts on me mixing testing for this into what I'm 
working on.  The way most pgbench tests happen, it's hard to *not* have 
the important data in cache.  Once you run the init step, you have to 
either reboot or drop the OS cache to get those pages out of RAM.  That 
means the sort of cached setup you're using pg_prewarm to 
simulate--things are in the OS cache, but not the PostgreSQL one--is one 
that anyone running an init/test pair will often create.  You don't need 
pg_prewarm to do it.  If you initialize the database, then restart the 
server to clear shared_buffers, the result will be similar to what 
you're doing.

-- 
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support www.2ndQuadrant.com



Re: [WIP PATCH] for Performance Improvement in Buffer Management

From
Amit Kapila
Date:
On Wednesday, December 12, 2012 5:23 AM Greg Smith wrote:
> On 11/23/12 5:57 AM, Amit kapila wrote:
> > Let us try to see by example:
> > Total RAM - 22G
> > Database size - 16G
> >...
> > Case -2 (Shared Buffers - 10G)
> > a. Load all the files in OS buffers. In best case OS buffers can
> contain10-12G data as OS has 12G of memory available.
> > b. Try to load all in Shared buffers. Last 10G will be there in shared
> buffers.
> > c. Now as there is no direct correlation of data between Shared
> Buffers and OS buffers, so whenever PG has to access any data
> >     which is not there in Shared Buffers, good chances are there that
> it can lead to IO.
> 
> I don't think either of these examples are very representative of
> real-world behavior.  The idea of "load all the files in OS buffers"
> assumes someone has used a utility like pg_prewarm or pgfincore.  It's
> not something that happens in normal use.  Being able to re-populate all
> of RAM using those utilities isn't realistic anyway.  Anyone who tries
> to load more than (memory - shared_buffers) that way is likely to be
> disappointed by the result.

True, I also think nobody will directly try to do it this way, but similar
situations can arise after a long run, for example if we assume the most-used
pages fall within the range of RAM.

> Similarly, the problem you're describing here has been described as the
> "double buffering" one for a while now.  The old suggestion that
> shared_buffers not be set above 25% of RAM comes from this sort of
> concern.  If triggering a problem requires doing that, essentially
> misconfiguring the server, it's hard to get too excited about it.
> 
> Anyway, none of that impacts on me mixing testing for this into what I'm
> working on.  The way most pgbench tests happen, it's hard to *not* have
> the important data in cache.  Once you run the init step, you have to
> either reboot or drop the OS cache to get those pages out of RAM.  That
> means the sort of cached setup you're using pg_prewarm to
> simulate--things are in the OS cache, but not the PostgreSQL one--is one
> that anyone running an init/test pair will often create.  You don't need
> pg_prewarm to do it.  

The way I have run the tests is to try to simulate scenarios where invalidating
buffers by bgwriter/checkpoint can have an advantage.

With Regards,
Amit Kapila.