Re: [WIP PATCH] for Performance Improvement in Buffer Management - Mailing list pgsql-hackers

From Amit kapila
Subject Re: [WIP PATCH] for Performance Improvement in Buffer Management
Msg-id 6C0B27F7206C9E4CA54AE035729E9C383BC76A4C@szxeml509-mbx
In response to Re: [WIP PATCH] for Performance Improvement in Buffer Management  (Jeff Janes <jeff.janes@gmail.com>)
Responses Re: [WIP PATCH] for Performance Improvement in Buffer Management
List pgsql-hackers
On Monday, November 19, 2012 5:53 AM Jeff Janes wrote:
On Sun, Oct 21, 2012 at 12:59 AM, Amit kapila <amit.kapila@huawei.com> wrote:
> On Saturday, October 20, 2012 11:03 PM Jeff Janes wrote:
>
>> Run the modes in reciprocating order?
>> Sorry, I didn't understand this. What do you mean by modes in reciprocating order?

> Sorry for the long delay.  In your scripts, it looks like you always
> run the unpatched first, and then the patched second.

   Yes, that's true.

> By reciprocating, I mean to run them in the reverse order, or in random order.

Today I ran some configurations with the order reciprocated.
Below are the readings:
Configuration
16GB (Database) - 7GB (Shared Buffers)

Here I ran in the following order:
        1. Run perf report with patch for 32 clients
        2. Run perf report without patch for 32 clients
        3. Run perf report with patch for 16 clients
        4. Run perf report without patch for 16 clients

Each run is 5 minutes.
                16 client / 16 thread   |       32 client / 32 thread
           @mv-free-lst      @9.3devl   |  @mv-free-lst      @9.3devl
---------------------------------------------------------------------
                   3669          4056   |          5356          5258
                   3987          4121   |          4625          5185
                   4840          4574   |          4502          6796
                   6465          6932   |          4558          8233
                   6966          7222   |          4955          8237
                   7551          7219   |          9115          8269
                   8315          7168   |         43171          8340
                   9102          7136   |         57920          8349
---------------------------------------------------------------------
Average            6362          6054   |         16775          7333

increase 16c/16t: 5.09%
increase 32c/32t: 128.76%
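
For anyone who wants to recheck the arithmetic, here is a minimal sketch that recomputes the averages and the percentage increase from the readings in the table above. The variable and function names are illustrative only; it assumes the percentages were derived from the rounded averages, which is what reproduces the figures quoted.

```python
from statistics import mean

# Readings copied from the table above
# (patched = @mv-free-lst, base = @9.3devl).
runs_16 = {
    "patched": [3669, 3987, 4840, 6465, 6966, 7551, 8315, 9102],
    "base": [4056, 4121, 4574, 6932, 7222, 7219, 7168, 7136],
}
runs_32 = {
    "patched": [5356, 4625, 4502, 4558, 4955, 9115, 43171, 57920],
    "base": [5258, 5185, 6796, 8233, 8237, 8269, 8340, 8349],
}


def percent_increase(runs):
    """Percentage gain of the patched average over the base average.

    Assumption: averages are rounded to whole numbers first, as in the table.
    """
    patched = round(mean(runs["patched"]))
    base = round(mean(runs["base"]))
    return (patched - base) / base * 100


print(f"increase 16c/16t: {percent_increase(runs_16):.2f}%")
print(f"increase 32c/32t: {percent_increase(runs_32):.2f}%")
```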

Apart from the above, I also ran the test for 1 hour. Here again the order of execution is first with the patch and then
the original.

 32 client /32 thread for 1 hour
                    @mv-free-lst    @9.3devl
Single-run:    9842.019229      8050.357981

Increase 32c/32t: 22%



> Also, for the select only transactions, I think that 20 minutes is
> much longer than necessary.  I'd rather see many more runs, each one
> being shorter.

I have taken care of this. I don't know whether 5 minutes is appropriate or whether you meant the runs to be even shorter.

> Because you can't restart the server without wiping out the
> shared_buffers, what I would do is make a test patch which introduces
> a new guc.c setting which allows the behavior to be turned on and off
> with a SIGHUP (pg_ctl reload).

Okay, this is a good idea.


>
>> The reason I can think of is that when shared buffers are small, the clock sweep runs very fast and there is no bottleneck.
>> Only when shared buffers increase above some threshold does it spend significant time in the clock sweep.

> I am rather skeptical of this.  When the work set doesn't fit in
> memory under a select-only workload, then about half the buffers will
> be evictable at any given time, and half will have usagecount=1, and a
> handful will usagecount>=4 (index meta, root and branch blocks).  This
> will be the case over a wide range of shared_buffers, as long as it is
> big enough to hold all index branch blocks but not big enough to hold
> everything.  Given this state of affairs, the average clock sweep
> should be about 2, regardless of the exact size of shared_buffers.

> The one wrinkle I could think of is if all the usagecount=1 buffers
> are grouped into a continuous chunk, and all the usagecount=0 are in
> another chunk.  The average would still be 2, but the average would be
> made up of N/2 runs of length 1, followed by one run of length N/2.
> Now if 1 process is stuck in the N/2 stretch and all other processes
> are waiting on that, maybe that somehow escalates the waits so that
> they are larger when N is larger, but I still don't see how the math
> works on that.
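
The run-length argument quoted above can be sketched as a toy model. This is only an illustration under simplifying assumptions, not the real buffer manager: the hand makes one rotation starting at index 0, usage counts are not decremented as the hand passes, and `sweep_runs` and both layouts are hypothetical names made up for this sketch.

```python
from statistics import mean


def sweep_runs(usage_counts):
    """Buffers examined per allocation during one rotation of the clock hand.

    The hand starts at index 0 and each buffer with usage count 0 ends a run.
    Trailing non-zero buffers after the last zero are ignored (neither layout
    below has any).
    """
    runs, steps = [], 0
    for count in usage_counts:
        steps += 1
        if count == 0:
            runs.append(steps)
            steps = 0
    return runs


N = 1000  # assumed pool size, chosen only for illustration
interleaved = [1, 0] * (N // 2)            # usagecount 1 and 0 alternate
chunked = [1] * (N // 2) + [0] * (N // 2)  # one chunk of 1s, then one chunk of 0s

for name, layout in (("interleaved", interleaved), ("chunked", chunked)):
    runs = sweep_runs(layout)
    print(name, "mean run:", mean(runs), "longest run:", max(runs))
```

Both layouts give the same mean run length, but in the chunked layout a single allocation has to walk across the whole chunk of usagecount=1 buffers, which is the long run described in the quoted text.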

The 2 problems observed in the V-Tune Profiler reports for Buffer Management are:
a. Partition Lock
b. Buf-Free List Lock

Tomorrow, I will send you some of the profiler reports for the different scenarios where the above is observed.

I think that as long as there is contention on the partition lock, the benefit of reducing contention on the Buf-Free list lock might not even show up.
The idea (moving buffers to the freelist) will improve the situation for both locks to some extent:
once BGWriter has invalidated a buffer, the backend does not need to remove it from the hash table and hence
does not need the old partition lock.
This reduces hash partition lock contention only if the new and old partitions are different,
which is quite probable, since the clock sweep pays no attention to partitions when it looks for a usable buffer.
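
To put a rough number on "quite probable": a back-of-the-envelope sketch, assuming page tags hash uniformly and independently across the partitions. The partition count of 16 is an assumption used only for illustration, and `prob_different_partition` is a hypothetical helper.

```python
from fractions import Fraction


def prob_different_partition(num_partitions):
    """Chance that the victim's old partition differs from the new page's
    partition, assuming uniform and independent hashing (an assumption)."""
    return 1 - Fraction(1, num_partitions)


p = prob_different_partition(16)  # 16 partitions is an assumed value
print(f"P(old != new partition) = {p} = {float(p):.4f}")
```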


> Are you working on this just because it was on the ToDo List, or
> because you have actually run into a problem with it?

The reason behind this work is that late last year I did some benchmarking of PostgreSQL 9.1 against some other
commercial databases and found that
the SELECT performance of PG is well below that of the others.

"One of the key observations was that for PostgreSQL, increasing Shared Buffers improves performance up to a certain
level, but beyond that point
this parameter doesn't increase performance, whereas the situation in other databases is better."

As part of that activity, I did some design study of Buffer Management/Checkpoint and some other areas like MVCC for
various databases.
Some parts of the study for Buffer Management and Checkpoint are attached to this mail.

IMO, there are certain other things which we can attempt in Buffer Management:

1. Have separate Hot and Cold lists of shared buffers, something similar to Clock-Pro.
2. Have some amount of shared buffers reserved for Hot tables.
3. Instead of moving buffers to the freelist by BGWriter/Checkpoint, move them to a Cold list.
   The Cold list concept is as follows:
   a. Have as many cold lists as there are partitions.
   b. BGWriter/Checkpoint will move a buffer to the cold list whose number equals its hash partition number.
      This addresses the point that, even after a buffer is moved to the cold list, if the same page
      is accessed before somebody reuses the buffer, no I/O will be required.
4. Reduce the free-list lock contention by having multiple freelists.
   This has been tried, but gave no performance improvement.
5. Reduce the granularity of the free-list lock taken to get the next buffer.
   Some time back Ants Aasma sent a patch for this, with which no performance improvement was observed.
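
The cold-list idea in point 3 can be sketched roughly as below. This is a toy model of the data structure only, not PostgreSQL code: the class, the method names, the `(relation, block)` page tag and the partition count of 16 are all assumptions made for illustration.

```python
from collections import OrderedDict

NUM_PARTITIONS = 16  # assumed value, for illustration only


class ColdLists:
    """One cold list per hash partition, keyed by page tag (hypothetical)."""

    def __init__(self, num_partitions=NUM_PARTITIONS):
        self.lists = [OrderedDict() for _ in range(num_partitions)]

    def partition_of(self, page_tag):
        # Stand-in for the buffer mapping hash; page_tag is (relation, block).
        return hash(page_tag) % len(self.lists)

    def move_to_cold(self, page_tag, buffer_id):
        """BGWriter/Checkpoint side: park the buffer, keeping its page identity."""
        self.lists[self.partition_of(page_tag)][page_tag] = buffer_id

    def reclaim(self, page_tag):
        """Backend re-accesses the page before reuse: no I/O needed.

        Returns the buffer id, or None if the page is not on a cold list.
        """
        return self.lists[self.partition_of(page_tag)].pop(page_tag, None)

    def take_victim(self, partition):
        """Backend needs a buffer: take the oldest cold buffer of a partition.

        Returns (page_tag, buffer_id), or None if that cold list is empty.
        """
        cold = self.lists[partition]
        return cold.popitem(last=False) if cold else None


cold = ColdLists()
cold.move_to_cold((16384, 7), buffer_id=42)
print(cold.reclaim((16384, 7)))  # page found on its cold list, buffer reused
```

Because a page sits on the cold list of its own hash partition, looking it up and taking it back need only that one partition, which is the property point 3b relies on.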

Considering that points 4 & 5 have not given performance benefits, I think
reducing contention around only the free-list lock will not yield any performance gain.


>I've never seen
> freelist lock contention be a problem on machines with less than 8
> CPU, but both of us are testing on smaller machines.  I think we
> really need to test this on something bigger.

Yeah, you are right. I shall try to do so.

With Regards,
Amit Kapila.