Thread: Performance under contention

Performance under contention

From
Ivan Voras
Date:
This is not a request for help but a report, in case it helps developers
or someone in the future. The setup is:

AMD64 machine, 24 GB RAM, 2x6-core Xeon CPU + HTT (24 logical CPUs)
FreeBSD 8.1-stable, AMD64
PostgreSQL 9.0.1, 10 GB shared buffers, using pgbench with a scale
factor of 500 (7.5 GB database)

With pgbench -S (SELECT-only queries), the performance curve is:

-c#    result (tps)
4    33549
8    64864
12    79491
16    79887
20    66957
24    52576
28    50406
32    49491
40    45535
50    39499
75    29415
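
To put the curve above in relative terms, here is a minimal Python sketch that expresses each result as a percentage of the best one. The numbers are copied from the table; the function name `relative_to_peak` is made up for this illustration.

```python
# Results copied from the table above: clients (-c) -> pgbench result.
results = {
    4: 33549, 8: 64864, 12: 79491, 16: 79887, 20: 66957, 24: 52576,
    28: 50406, 32: 49491, 40: 45535, 50: 39499, 75: 29415,
}


def relative_to_peak(results):
    """Return {clients: percent of the best result}, rounded to one decimal."""
    peak = max(results.values())
    return {c: round(100.0 * r / peak, 1) for c, r in results.items()}


if __name__ == "__main__":
    for clients, pct in relative_to_peak(results).items():
        print(f"{clients:>3} clients: {pct:5.1f}% of peak")
```

With these figures the peak is at 16 clients, and the 75-client run lands at roughly 37% of it.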

After 16 clients (which is still good since there are only 12 "real"
cores in the system), the performance drops sharply. Looking at the
processes' state, most of them seem to spend their time in system calls
(i.e. executing in the kernel) in the states "semwait" and "sbwait",
i.e. semaphore wait and socket buffer wait, for example:

  3047 pgsql       1  60    0 10533M   283M sbwait 12   0:01  6.79% postgres
  3055 pgsql       1  64    0 10533M   279M sbwait 15   0:01  6.79% postgres
  3033 pgsql       1  64    0 10533M   279M semwai  6   0:01  6.69% postgres
  3038 pgsql       1  64    0 10533M   283M CPU5   13   0:01  6.69% postgres
  3037 pgsql       1  62    0 10533M   279M sbwait 23   0:01  6.69% postgres
  3048 pgsql       1  65    0 10533M   280M semwai  4   0:01  6.69% postgres
  3056 pgsql       1  65    0 10533M   277M semwai  1   0:01  6.69% postgres
  3002 pgsql       1  62    0 10533M   284M CPU19   0   0:01  6.59% postgres
  3042 pgsql       1  63    0 10533M   279M semwai 21   0:01  6.59% postgres
  3029 pgsql       1  63    0 10533M   277M semwai 23   0:01  6.59% postgres
  3046 pgsql       1  63    0 10533M   278M RUN     5   0:01  6.59% postgres
  3036 pgsql       1  63    0 10533M   278M CPU1   12   0:01  6.59% postgres
  3051 pgsql       1  63    0 10533M   277M semwai  1   0:01  6.59% postgres
  3030 pgsql       1  63    0 10533M   281M semwai  1   0:01  6.49% postgres
  3050 pgsql       1  60    0 10533M   276M semwai  1   0:01  6.49% postgres

The "sbwait" part is from FreeBSD - IPC sockets, but so much blocking on
semwait indicates large contention in PostgreSQL.

Re: Performance under contention

From
"Kevin Grittner"
Date:
Ivan Voras  wrote:

> After 16 clients (which is still good since there are only 12
> "real" cores in the system), the performance drops sharply

Yet another data point to confirm the importance of connection
pooling.  :-)

-Kevin

Re: Performance under contention

From
Ivan Voras
Date:
On 11/22/10 02:47, Kevin Grittner wrote:
> Ivan Voras  wrote:
>
>> After 16 clients (which is still good since there are only 12
>> "real" cores in the system), the performance drops sharply
>
> Yet another data point to confirm the importance of connection
> pooling.  :-)

I agree, connection pooling will get rid of the symptom. But not the
underlying problem. I'm not saying that having 1000s of connections to
the database is a particularly good design, only that there shouldn't be
a sharp decline in performance when it does happen. Ideally, the
performance should remain the same as it was at its peak.

I've been monitoring the server some more and it looks like there are
periods where almost all servers are in the semwait state followed by
periods of intensive work - approximately similar to the "thundering
herd" problem, or maybe to what Josh Berkus posted a few days ago.


Re: Performance under contention

From
Jignesh Shah
Date:
On Sun, Nov 21, 2010 at 9:18 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 11/22/10 02:47, Kevin Grittner wrote:
>>
>> Ivan Voras  wrote:
>>
>>> After 16 clients (which is still good since there are only 12
>>> "real" cores in the system), the performance drops sharply
>>
>> Yet another data point to confirm the importance of connection
>> pooling.  :-)
>
> I agree, connection pooling will get rid of the symptom. But not the
> underlying problem. I'm not saying that having 1000s of connections to the
> database is a particularly good design, only that there shouldn't be a sharp
> decline in performance when it does happen. Ideally, the performance should
> remain the same as it was at its peak.
>
> I've been monitoring the server some more and it looks like there are
> periods where almost all servers are in the semwait state followed by
> periods of intensive work - approximately similar to the "thundering herd"
> problem, or maybe to what Josh Berkus has posted a few days ago.

Try it with systemtap or dtrace and see if you find the same
bottlenecks as I do in
http://jkshah.blogspot.com/2010/11/postgresql-90-simple-select-scaling.html

I will probably retry it with pgbench and see what I find.

Regards,
Jignesh

Re: Performance under contention

From
Omar Kilani
Date:
Hi Ivan,

We have the same issue on our database machines (2x 6-core Intel(R)
Xeon(R) X5670 @ 2.93GHz CPUs with 24 logical cores and 144 GB of RAM)
-- they run RHEL 5. The issue occurs with our normal OLTP
workload, so it's not just pgbench.

We use pgbouncer to limit total connections to 15 (this seemed to be
the 'sweet spot' for us) -- there's definitely a bunch of contention
on ... something... for a workload where you're running a lot of very
fast SELECTs (around 2000-4000/s) from more than 15-16 clients.

I had a chat with Neil C or Gavin S about this at some point, but I
forget the reason for it. I don't think there's anything you can do
for it configuration-wise except use a connection pool.

Regards,
Omar

On Mon, Nov 22, 2010 at 5:54 PM, Jignesh Shah <jkshah@gmail.com> wrote:
> On Sun, Nov 21, 2010 at 9:18 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> On 11/22/10 02:47, Kevin Grittner wrote:
>>>
>>> Ivan Voras  wrote:
>>>
>>>> After 16 clients (which is still good since there are only 12
>>>> "real" cores in the system), the performance drops sharply
>>>
>>> Yet another data point to confirm the importance of connection
>>> pooling.  :-)
>>
>> I agree, connection pooling will get rid of the symptom. But not the
>> underlying problem. I'm not saying that having 1000s of connections to the
>> database is a particularly good design, only that there shouldn't be a sharp
>> decline in performance when it does happen. Ideally, the performance should
>> remain the same as it was at its peak.
>>
>> I've been monitoring the server some more and it looks like there are
>> periods where almost all servers are in the semwait state followed by
>> periods of intensive work - approximately similar to the "thundering herd"
>> problem, or maybe to what Josh Berkus has posted a few days ago.
>
> Try it with systemtap or dtrace and see if you find the same
> bottlenecks as I do in
> http://jkshah.blogspot.com/2010/11/postgresql-90-simple-select-scaling.html
>
> I will probably retry it with pgbench and see what I find.
>
> Regards,
> Jignesh

Re: Performance under contention

From
"Kevin Grittner"
Date:
Ivan Voras <ivoras@freebsd.org> wrote:
> On 11/22/10 02:47, Kevin Grittner wrote:
>> Ivan Voras  wrote:
>>
>>> After 16 clients (which is still good since there are only 12
>>> "real" cores in the system), the performance drops sharply
>>
>> Yet another data point to confirm the importance of connection
>> pooling.  :-)
>
> I agree, connection pooling will get rid of the symptom. But not
> the underlying problem. I'm not saying that having 1000s of
> connections to the database is a particularly good design, only
> that there shouldn't be a sharp decline in performance when it
> does happen. Ideally, the performance should remain the same as it
> was at its peak.

Well, I suggested that we add an admission control[1] mechanism,
with at least part of the initial default policy being that there is
a limit on the number of active database transactions.  Such a
policy would do what you are suggesting, but the idea was shot down
on the basis that in most of the cases where this would help, people
would be better served by using an external connection pool.

If interested, search the archives for details of the discussion.

-Kevin

[1] http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007.
Architecture of a Database System. Foundations and Trends(R) in
Databases Vol. 1, No. 2 (2007) 141-259
(see Section 2.4 - Admission Control)

Re: Performance under contention

From
Ivan Voras
Date:
On 11/22/10 16:26, Kevin Grittner wrote:
> Ivan Voras<ivoras@freebsd.org>  wrote:
>> On 11/22/10 02:47, Kevin Grittner wrote:
>>> Ivan Voras  wrote:
>>>
>>>> After 16 clients (which is still good since there are only 12
>>>> "real" cores in the system), the performance drops sharply
>>>
>>> Yet another data point to confirm the importance of connection
>>> pooling.  :-)
>>
>> I agree, connection pooling will get rid of the symptom. But not
>> the underlying problem. I'm not saying that having 1000s of
>> connections to the database is a particularly good design, only
>> that there shouldn't be a sharp decline in performance when it
>> does happen. Ideally, the performance should remain the same as it
>> was at its peak.
>
> Well, I suggested that we add an admission control[1] mechanism,

It looks like a hack (and one which is already implemented by connection
pool software); the underlying problem should be addressed.

But on the other hand if it's affecting so many people, maybe a warning
comment in postgresql.conf around max_connections would be helpful.

Re: Performance under contention

From
"Kevin Grittner"
Date:
Ivan Voras <ivoras@freebsd.org> wrote:

> It looks like a hack

Not to everyone.  In the referenced section, Hellerstein,
Stonebraker and Hamilton say:

"any good multi-user system has an admission control policy"

In the case of PostgreSQL I understand the counter-argument,
although I'm inclined to think that it's prudent for a product to
limit resource usage to a level at which it can still function well,
even if there's an external solution which can also work, should
people use it correctly.  It seems likely that a mature admission
control policy could do a better job of managing some resources than
an external product could.

-Kevin

Re: Performance under contention

From
Craig Ringer
Date:
On 11/22/2010 11:38 PM, Ivan Voras wrote:
> On 11/22/10 16:26, Kevin Grittner wrote:
>> Ivan Voras<ivoras@freebsd.org> wrote:
>>> On 11/22/10 02:47, Kevin Grittner wrote:
>>>> Ivan Voras wrote:
>>>>
>>>>> After 16 clients (which is still good since there are only 12
>>>>> "real" cores in the system), the performance drops sharply
>>>>
>>>> Yet another data point to confirm the importance of connection
>>>> pooling. :-)
>>>
>>> I agree, connection pooling will get rid of the symptom. But not
>>> the underlying problem. I'm not saying that having 1000s of
>>> connections to the database is a particularly good design, only
>>> that there shouldn't be a sharp decline in performance when it
>>> does happen. Ideally, the performance should remain the same as it
>>> was at its peak.
>>
>> Well, I suggested that we add an admission control[1] mechanism,
>
> It looks like a hack (and one which is already implemented by connection
> pool software); the underlying problem should be addressed.

My (poor) understanding is that addressing the underlying problem would
require a massive restructure of postgresql to separate "connection and
session state" from "executor and backend". Idle connections wouldn't
require a backend to sit around unused but participating in all-backends
synchronization and signalling. Active connections over a configured
maximum concurrency limit would queue for access to a backend rather
than fighting it out for resources at the OS level.

The trouble is that this would be an *enormous* rewrite of the codebase,
and would still only solve part of the problem. See the prior discussion
on in-server connection pooling and admission control.

Personally I think the current approach is clearly difficult for many
admins to understand and it's unfortunate that it requires external
software to be effective. OTOH, I'm not sure what the answer is.

--
Craig Ringer


Re: Performance under contention

From
Ivan Voras
Date:
On 24 November 2010 01:11, Craig Ringer <craig@postnewspapers.com.au> wrote:
> On 11/22/2010 11:38 PM, Ivan Voras wrote:

>> It looks like a hack (and one which is already implemented by connection
>> pool software); the underlying problem should be addressed.
>
> My (poor) understanding is that addressing the underlying problem would
> require a massive restructure of postgresql to separate "connection and
> session state" from "executor and backend". Idle connections wouldn't
> require a backend to sit around unused but participating in all-backends
> synchronization and signalling. Active connections over a configured maximum
> concurrency limit would queue for access to a backend rather than fighting
> it out for resources at the OS level.
>
> The trouble is that this would be an *enormous* rewrite of the codebase, and
> would still only solve part of the problem. See the prior discussion on
> in-server connection pooling and admission control.

I'm (also) not a PostgreSQL developer so I'm hoping that someone who
is will join the thread, but speaking generally, there is no reason
why this couldn't be a simpler problem which just requires
finer-grained locking or smarter semaphore usage.

I'm not talking about forcing performance out of a situation where there
are no more CPU cycles to take, but about degrading gracefully in
those circumstances and not taking an 80%+ drop because of spinning
around in semaphore syscalls.

Re: Performance under contention

From
Vitalii Tymchyshyn
Date:
On 24.11.10 02:11, Craig Ringer wrote:
> On 11/22/2010 11:38 PM, Ivan Voras wrote:
>> On 11/22/10 16:26, Kevin Grittner wrote:
>>> Ivan Voras<ivoras@freebsd.org> wrote:
>>>> On 11/22/10 02:47, Kevin Grittner wrote:
>>>>> Ivan Voras wrote:
>>>>>
>>>>>> After 16 clients (which is still good since there are only 12
>>>>>> "real" cores in the system), the performance drops sharply
>>>>>
>>>>> Yet another data point to confirm the importance of connection
>>>>> pooling. :-)
>>>>
>>>> I agree, connection pooling will get rid of the symptom. But not
>>>> the underlying problem. I'm not saying that having 1000s of
>>>> connections to the database is a particularly good design, only
>>>> that there shouldn't be a sharp decline in performance when it
>>>> does happen. Ideally, the performance should remain the same as it
>>>> was at its peak.
>>>
>>> Well, I suggested that we add an admission control[1] mechanism,
>>
>> It looks like a hack (and one which is already implemented by connection
>> pool software); the underlying problem should be addressed.
>
> My (poor) understanding is that addressing the underlying problem
> would require a massive restructure of postgresql to separate
> "connection and session state" from "executor and backend". Idle
> connections wouldn't require a backend to sit around unused but
> participating in all-backends synchronization and signalling. Active
> connections over a configured maximum concurrency limit would queue
> for access to a backend rather than fighting it out for resources at
> the OS level.
>
> The trouble is that this would be an *enormous* rewrite of the
> codebase, and would still only solve part of the problem. See the
> prior discussion on in-server connection pooling and admission control.
Hello.

IMHO the main problem is not a backend sitting and doing nothing, but
multiple backends trying to do their work at once. So, to me, the
simplest option that would make most people happy would be a limit
(a waitable semaphore) on the number of backends actively executing a
query. Such a limit could even be detected automatically from the
number of CPUs (simple) and spindles (not sure if that is simple, but
some default could be used). An idle backend (or one waiting for a
lock) consumes few resources. If one wants to reduce resource usage
for such backends, one can introduce external pooling, but such a
simple limit would make me happy (e.g. having
max_active_connections=1000, max_active_queries=20).
The main question here is how many resources a backend that is waiting
for a lock can hold. Is locking done at query start, or can a backend
go into a wait after it has already consumed much of its work_mem? In
the second case, the limit would not act as a work_mem limit, but it
would still prevent much of the contention.
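
The "waitable semaphore on active backends" idea can be illustrated outside PostgreSQL. The following is only a toy Python simulation, not PostgreSQL code: threads stand in for backends, `time.sleep` stands in for query execution, and `MAX_ACTIVE_QUERIES` is a hypothetical stand-in for the `max_active_queries` setting suggested above (no such setting exists in PostgreSQL).

```python
import threading
import time

MAX_ACTIVE_QUERIES = 4   # hypothetical limit, stands in for "max_active_queries"
CONNECTIONS = 16         # number of simulated backends

active_slots = threading.BoundedSemaphore(MAX_ACTIVE_QUERIES)
state_lock = threading.Lock()
active = 0
peak_active = 0


def run_query(work_seconds=0.01):
    """Simulate one backend executing one query under the limit."""
    global active, peak_active
    with active_slots:                 # block here until a slot is free
        with state_lock:
            active += 1
            peak_active = max(peak_active, active)
        time.sleep(work_seconds)       # stand-in for real query execution
        with state_lock:
            active -= 1


def simulate():
    """Start all simulated backends and return the peak number active at once."""
    threads = [threading.Thread(target=run_query) for _ in range(CONNECTIONS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return peak_active


if __name__ == "__main__":
    print("peak concurrently active queries:", simulate())
```

All 16 simulated connections are accepted, but the number doing work at any instant never exceeds the limit; the rest sleep on the semaphore instead of competing for CPU and locks.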

Best regards, Vitalii Tymchyshyn

Re: Performance under contention

From
"Kevin Grittner"
Date:
Vitalii Tymchyshyn <tivv00@gmail.com> wrote:

> the simplest option that will make most people happy would be to
> have a limit (waitable semaphore) on backends actively executing
> the query.

That's very similar to the admission control policy I proposed,
except that I suggested a limit on the number of active database
transactions rather than the number of queries.  The reason is that
you could still get into a lot of lock contention with a query-based
limit -- a query could acquire locks (perhaps by writing rows to the
database) and then be blocked waiting its turn, leading to conflicts
with other transactions.  Such problems would be less common with a
transaction limit, since most common locks don't persist past the
end of the transaction.

-Kevin

Re: Performance under contention

From
Ivan Voras
Date:
On 11/22/10 18:47, Kevin Grittner wrote:
> Ivan Voras<ivoras@freebsd.org>  wrote:
>
>> It looks like a hack
>
> Not to everyone.  In the referenced section, Hellerstein,
> Stonebraker and Hamilton say:
>
> "any good multi-user system has an admission control policy"
>
> In the case of PostgreSQL I understand the counter-argument,
> although I'm inclined to think that it's prudent for a product to
> limit resource usage to a level at which it can still function well,
> even if there's an external solution which can also work, should
> people use it correctly.  It seems likely that a mature admission
> control policy could do a better job of managing some resources than
> an external product could.

I didn't think it would be that useful but yesterday I did some
(unrelated) testing with MySQL and it looks like its configuration
parameter "thread_concurrency" does something to that effect.

Initially I thought it was equivalent to PostgreSQL's max_connections,
but it is not: connections can grow (MySQL spawns a thread per
connection by default) while the actual concurrency is limited in some
way by this parameter.

The comment for the parameter says "# Try number of CPU's*2 for
thread_concurrency" but obviously it would depend a lot on the
real-world load.


Re: Performance under contention

From
Greg Smith
Date:
Ivan Voras wrote:
> PostgreSQL 9.0.1, 10 GB shared buffers, using pgbench with a scale
> factor of 500 (7.5 GB database)
>
> with pgbench -S (SELECT-queries only) the performance curve is:
>
> -c#    result
> 4    33549
> 8    64864
> 12    79491
> 16    79887
> 20    66957
> 24    52576
> 28    50406
> 32    49491
> 40    45535
> 50    39499
> 75    29415

Two suggestions to improve your results here:

1) Don't set shared_buffers to 10GB.  There are some known issues with
large settings for that which may or may not be impacting your results.
Try 4GB instead, just to make sure you're not even on the edge of that area.

2) pgbench itself is known to become a bottleneck when running with lots
of clients.  You should be using the "-j" option to spawn multiple
workers, probably 12 of them (one per core), to make some of this go
away.  On the system I saw the most improvement here, I got a 15-25%
gain having more workers at the higher client counts.
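
As a rough illustration of suggestion 2, here is a small Python helper that builds a SELECT-only pgbench command line with multiple worker threads. The helper name, the 60-second duration and the database name are assumptions made for this sketch. It also assumes the pgbench in use requires the client count to be a multiple of the thread count, which is worth checking against the documentation for your version.

```python
def pgbench_command(clients, cores=12, seconds=60, dbname="pgbench"):
    """Build a SELECT-only pgbench command line using several worker threads.

    Picks the largest thread count <= cores that divides the client count,
    on the assumption that -c must be a multiple of -j. The duration and
    database name are placeholder values.
    """
    threads = max(
        j for j in range(1, min(cores, clients) + 1) if clients % j == 0
    )
    return ["pgbench", "-S", "-c", str(clients), "-j", str(threads),
            "-T", str(seconds), dbname]


if __name__ == "__main__":
    for c in (4, 16, 24, 75):
        print(" ".join(pgbench_command(c)))
```

The returned list can be passed to `subprocess.run` on a machine where pgbench is installed and the test database has been initialized.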

> The "sbwait" part is from FreeBSD - IPC sockets, but so much blocking
> on semwait indicates large contention in PostgreSQL.

It will be interesting to see if that's different after the changes
suggested above.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: Performance under contention

From
Ivan Voras
Date:
On 26 November 2010 03:00, Greg Smith <greg@2ndquadrant.com> wrote:

> Two suggestions to improve your results here:
>
> 1) Don't set shared_buffers to 10GB.  There are some known issues with large
> settings for that which may or may not be impacting your results.  Try 4GB
> instead, just to make sure you're not even on the edge of that area.
>
> 2) pgbench itself is known to become a bottleneck when running with lots of
> clients.  You should be using the "-j" option to spawn multiple workers,
> probably 12 of them (one per core), to make some of this go away.  On the
> system I saw the most improvement here, I got a 15-25% gain having more
> workers at the higher client counts.

> It will be interesting to see if that's different after the changes
> suggested above.

Too late, can't test on the hardware anymore. I did use -j on pgbench,
but after 2 threads there were no significant improvements - the two
threads did not saturate two CPU cores.

However, I did run a similar select-only test on tmpfs on different
hardware with much less memory (4 GB total), with shared_buffers
somewhere around 2 GB, with the same performance curve:

http://ivoras.sharanet.org/blog/tree/2010-07-21.postgresql-on-tmpfs.html

so I doubt the curve would change by reducing shared_buffers below
what I used in the original post.

Re: Performance under contention

From
Robert Haas
Date:
On Sun, Nov 21, 2010 at 7:15 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> The "sbwait" part is from FreeBSD - IPC sockets, but so much blocking on
> semwait indicates large contention in PostgreSQL.

I can reproduce this.  I suspect, but cannot yet prove, that this is
contention over the lock manager partition locks or the buffer mapping
locks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Performance under contention

From
Robert Haas
Date:
On Mon, Dec 6, 2010 at 12:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 21, 2010 at 7:15 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> The "sbwait" part is from FreeBSD - IPC sockets, but so much blocking on
>> semwait indicates large contention in PostgreSQL.
>
> I can reproduce this.  I suspect, but cannot yet prove, that this is
> contention over the lock manager partition locks or the buffer mapping
> locks.

I compiled with LWLOCK_STATS defined and found that exactly one lock
manager partition lwlock was heavily contended, because, of course,
the SELECT-only test only hits one table, and all the threads fight
over acquisition and release of AccessShareLock on that table.  One
might argue that in more normal workloads there will be more than one
table involved, but that's not necessarily true, and in any case there
might not be more than a handful of major ones.

However, I don't have a very clear idea what to do about it.
Increasing the number of lock partitions doesn't help, because the one
table you care about is still only in one partition.
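
A toy Python model of why more partitions do not help: if the partition is chosen by hashing the identity of the locked object, every backend locking the same table lands on the same partition, whatever the partition count. The hash used here (`zlib.crc32`) and the OID values are stand-ins for illustration only; this is not PostgreSQL's lock-tag hashing.

```python
import zlib


def lock_partition(db_oid, rel_oid, num_partitions):
    """Map a relation lock to a partition by hashing its identity.

    zlib.crc32 is only a stand-in hash for illustration; it is not the
    hash function PostgreSQL uses.
    """
    tag = f"{db_oid}/{rel_oid}".encode()
    return zlib.crc32(tag) % num_partitions


def partitions_touched(rel_oids, num_partitions, db_oid=1):
    """Return the set of partitions hit by a sequence of relation locks."""
    return {lock_partition(db_oid, r, num_partitions) for r in rel_oids}


if __name__ == "__main__":
    same_table = [16384] * 24          # 24 backends, one (made-up) table OID
    for n in (16, 64, 1024):
        print(n, "partitions ->", len(partitions_touched(same_table, n)),
              "partition(s) in use")
```

Only a workload spread over many tables would distribute across partitions in this model, which matches the observation above.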

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Performance under contention

From
Jignesh Shah
Date:
On Tue, Dec 7, 2010 at 1:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 21, 2010 at 7:15 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> The "sbwait" part is from FreeBSD - IPC sockets, but so much blocking on
>> semwait indicates large contention in PostgreSQL.
>
> I can reproduce this.  I suspect, but cannot yet prove, that this is
> contention over the lock manager partition locks or the buffer mapping
> locks.


Hi Robert,

That's exactly what I concluded when I was doing the sysbench simple
read-only test. I had also tried with different lock partitions and it
did not help since they all go after the same table. I think one way
to kind of avoid the problem on the same table is to do more granular
locking (maybe at page level instead of table level). But then I don't
really understand how to even create a prototype for this
one. If you can help create a prototype then I can test it out with my
setup and see if it helps us to catch up with the other guys out there.

Also, on the subject of whether this is a real workload: in fact it
seems all social networks use this frequently with their user tables,
and this test actually came from my talks with Mark Callaghan, who
says it is very common in their environment, where thousands of users
pull up their user profile data from the same table. That is why I got
interested in trying it more.

Regards,
Jignesh

Re: Performance under contention

From
Jignesh Shah
Date:
On Tue, Dec 7, 2010 at 10:59 AM, Jignesh Shah <jkshah@gmail.com> wrote:
> On Tue, Dec 7, 2010 at 1:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Nov 21, 2010 at 7:15 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>>> The "sbwait" part is from FreeBSD - IPC sockets, but so much blocking on
>>> semwait indicates large contention in PostgreSQL.
>>
>> I can reproduce this.  I suspect, but cannot yet prove, that this is
>> contention over the lock manager partition locks or the buffer mapping
>> locks.
>
>
> Hi Robert,
>
> That's exactly what I concluded when I was doing the sysbench simple
> read-only test. I had also tried with different lock partitions and it
> did not help since they all go after the same table. I think one way
> to kind of avoid the problem on the same table is to do more granular
> locking (maybe at page level instead of table level). But then I don't
> really understand how to even create a prototype for this
> one. If you can help create a prototype then I can test it out with my
> setup and see if it helps us to catch up with the other guys out there.
>
> Also, on the subject of whether this is a real workload: in fact it
> seems all social networks use this frequently with their user tables,
> and this test actually came from my talks with Mark Callaghan, who
> says it is very common in their environment, where thousands of users
> pull up their user profile data from the same table. That is why I got
> interested in trying it more.
>
> Regards,
> Jignesh
>

Also, I forgot to mention that in my sysbench test I saw exactly two
locks: one related to AccessShareLock on the table, and the other
related to RevalidateCachePlan, which at least to me seemed to be a
slightly bigger problem than the AccessShareLock one.

But I will take anything. Ideally both :-)

Regards,
Jignesh

Re: Performance under contention

From
Robert Haas
Date:
On Mon, Dec 6, 2010 at 9:59 PM, Jignesh Shah <jkshah@gmail.com> wrote:
> That's exactly what I concluded when I was doing the sysbench simple
> read-only test. I had also tried with different lock partitions and it
> did not help since they all go after the same table. I think one way
> to kind of avoid the problem on the same table is to do more granular
> locking (maybe at page level instead of table level). But then I don't
> really understand how to even create a prototype for this
> one. If you can help create a prototype then I can test it out with my
> setup and see if it helps us to catch up with the other guys out there.

We're trying to lock the table against a concurrent DROP or schema
change, so locking only part of it doesn't really work.  I don't
really see any way to avoid needing some kind of a lock here; the
trick is how to take it quickly.  The main obstacle to making this
faster is that the deadlock detector needs to be able to obtain enough
information to break cycles, which means we've got to record in shared
memory not only the locks that are granted but who has them.  However,
I wonder if it would be possible to have a very short critical section
where we grab the partition lock, acquire the heavyweight lock, and
release the partition lock; and then only as a second step record (in
the form of a PROCLOCK) the fact that we got it.  During this second
step, we'd hold a lock associated with the PROC, not the LOCK.  If the
deadlock checker runs after we've acquired the lock and before we've
recorded that we have it, it'll see more locks than lock holders, but
that should be OK, since the process which hasn't yet recorded its
lock acquisition is clearly not part of any deadlock.

Currently, PROCLOCKs are included in both a list of locks held by that
PROC, and a list of lockers of that LOCK.  The latter list would be
hard to maintain in this scheme, but maybe that's OK too.  We really
only need that information for the deadlock checker, and the deadlock
checker could potentially still get the information by grovelling
through all the PROCs.  That might be a bit slow, but maybe it'd be
OK, or maybe we could think of a clever way to speed it up.

Just thinking out loud here...

> Also, on the subject of whether this is a real workload: in fact it
> seems all social networks use this frequently with their user tables,
> and this test actually came from my talks with Mark Callaghan, who
> says it is very common in their environment, where thousands of users
> pull up their user profile data from the same table. That is why I got
> interested in trying it more.

Yeah.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Performance under contention

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> I wonder if it would be possible to have a very short critical section
> where we grab the partition lock, acquire the heavyweight lock, and
> release the partition lock; and then only as a second step record (in
> the form of a PROCLOCK) the fact that we got it.

[ confused... ]  Exactly what do you suppose "acquire the lock" would
be represented as, if not "create a PROCLOCK entry attached to it"?

In any case, I think this is another example of not understanding where
the costs really are.  As far as I can tell, on modern MP systems much
of the elapsed time in these operations comes from acquiring exclusive
access to shared-memory cache lines.  Reducing the number of changes you
have to make within a small area of shared memory won't save much, once
you've paid for the first one.  Changing structures that aren't heavily
contended (such as a proc's list of its own locks) doesn't cost much at
all.

One thing that might be interesting, but that I don't know how to attack
in a reasonably machine-independent way, is to try to ensure that shared
and local data structures don't accidentally overlap within cache lines.
When they do, you pay for fighting the cache line away from another
processor even when there's no real need.

            regards, tom lane

Re: Performance under contention

From
Dave Crooke
Date:
Hi Tom

I suspect I may be missing something here, but I think it's a pretty
universal truism that cache lines are aligned to power-of-2 memory
addresses, so it would suffice to ensure during setup that the
lower-order n bits of the object address are all zeros for each
critical object; if the malloc() routine being used doesn't support
that, it could be done by allocating a slightly larger than necessary
block of memory and choosing a location within that.

The value of n could be architecture dependent, but n=8 would cover everyone, hopefully without wasting too much RAM.
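
A rough sketch of the over-allocation trick described above, in C. It uses plain malloc() and n=8 (256-byte alignment) as in the text; the function name alloc_aligned and the raw_out parameter are made up for this example, and a shared memory allocator would need the same rounding applied in its own way.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* n = 8 low-order bits must be zero, i.e. 256-byte alignment (assumed). */
#define ASSUMED_ALIGN ((uintptr_t) 1 << 8)

/*
 * Allocate `size` usable bytes starting at an address whose low 8 bits
 * are zero.  The pointer actually returned by malloc() is stored in
 * *raw_out; that is the one to pass to free(), not the aligned pointer.
 */
void *
alloc_aligned(size_t size, void **raw_out)
{
    /* Over-allocate by one alignment unit minus one byte. */
    void       *raw = malloc(size + ASSUMED_ALIGN - 1);
    uintptr_t   addr;

    if (raw == NULL)
        return NULL;
    *raw_out = raw;

    /* Round the address up to the next multiple of ASSUMED_ALIGN. */
    addr = ((uintptr_t) raw + ASSUMED_ALIGN - 1) & ~(ASSUMED_ALIGN - 1);
    assert(addr % ASSUMED_ALIGN == 0);
    return (void *) addr;
}
```

The wasted space is at most 255 bytes per allocation, so this only makes sense for a small number of critical objects.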

Cheers
Dave

On Tue, Dec 7, 2010 at 11:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> One thing that might be interesting, but that I don't know how to attack
> in a reasonably machine-independent way, is to try to ensure that shared
> and local data structures don't accidentally overlap within cache lines.
> When they do, you pay for fighting the cache line away from another
> processor even when there's no real need.
>
>             regards, tom lane


Re: Performance under contention

From
Ivan Voras
Date:
On 7 December 2010 18:37, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 6, 2010 at 9:59 PM, Jignesh Shah <jkshah@gmail.com> wrote:
>> That's exactly what I concluded when I was doing the sysbench simple
>> read-only test. I had also tried with different lock partitions and it
>> did not help since they all go after the same table. I think one way
>> to kind of avoid the problem on the same table is to do more granular
>> locking (Maybe at page level instead of table level). But then I don't
>> really understand on how to even create a prototype related to this
>> one. If you can help create a prototype then I can test it out with my
>> setup and see if it helps us to catch up with other guys out there.
>
> We're trying to lock the table against a concurrent DROP or schema
> change, so locking only part of it doesn't really work.  I don't
> really see any way to avoid needing some kind of a lock here; the
> trick is how to take it quickly.  The main obstacle to making this
> faster is that the deadlock detector needs to be able to obtain enough
> information to break cycles, which means we've got to record in shared
> memory not only the locks that are granted but who has them.

I'm not very familiar with PostgreSQL code but if we're
brainstorming... if you're only trying to protect against a small
number of expensive operations (like DROP, etc.) that don't really
happen often, wouldn't an atomic reference counter be good enough for
the purpose (e.g. the expensive operations would spin-wait until the
counter is 0)?
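
To make the idea concrete, here is a minimal C11 sketch of such a counter. All names (RefGate, gate_enter, gate_leave, gate_close, gate_open) are hypothetical and nothing here is PostgreSQL code; it supports only one exclusive operation at a time, busy-waits, and provides no deadlock detection.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

typedef struct RefGate
{
    atomic_int  users;      /* sessions currently using the object */
    atomic_bool closing;    /* set while an exclusive operation is pending */
} RefGate;

/* Cheap path: register as a user; fails if a DROP-like operation is pending. */
bool
gate_enter(RefGate *g)
{
    atomic_fetch_add(&g->users, 1);
    if (atomic_load(&g->closing))
    {
        /* Back out so the exclusive operation can make progress. */
        atomic_fetch_sub(&g->users, 1);
        return false;
    }
    return true;
}

/* Cheap path: deregister. */
void
gate_leave(RefGate *g)
{
    int         prev = atomic_fetch_sub(&g->users, 1);

    assert(prev > 0);
    (void) prev;
}

/* Expensive, rare path: announce intent, then spin until no users remain. */
void
gate_close(RefGate *g)
{
    atomic_store(&g->closing, true);
    while (atomic_load(&g->users) != 0)
        ;                       /* busy-wait */
}

/* Let ordinary users in again once the exclusive operation is done. */
void
gate_open(RefGate *g)
{
    atomic_store(&g->closing, false);
}
```

Because a user increments the counter before checking the flag, and the exclusive operation sets the flag before reading the counter, at least one side always notices the other (with the default sequentially consistent atomics).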

Re: Performance under contention

From
Robert Haas
Date:
On Tue, Dec 7, 2010 at 12:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I wonder if it would be possible to have a very short critical section
>> where we grab the partition lock, acquire the heavyweight lock, and
>> release the partition lock; and then only as a second step record (in
>> the form of a PROCLOCK) the fact that we got it.
>
> [ confused... ]  Exactly what do you suppose "acquire the lock" would
> be represented as, if not "create a PROCLOCK entry attached to it"?

Update the "granted" array and, if necessary, the grantMask.

> In any case, I think this is another example of not understanding where
> the costs really are.

Possible.

> As far as I can tell, on modern MP systems much
> of the elapsed time in these operations comes from acquiring exclusive
> access to shared-memory cache lines.  Reducing the number of changes you
> have to make within a small area of shared memory won't save much, once
> you've paid for the first one.

Seems reasonable.

> Changing structures that aren't heavily
> contended (such as a proc's list of its own locks) doesn't cost much at
> all.

I'm not sure where you're getting the idea that a proc's list of its
own locks isn't heavily contended.   That could be true, but it isn't
obvious to me.  We allocate PROCLOCK structures out of a shared hash
table while holding the lock manager partition lock, and we add every
lock to a queue associated with the PROC and a second queue associated
with the LOCK.  So if two processes acquire an AccessShareLock on the
same table, both the LOCK object and at least the SHM_QUEUE portions
of each PROCLOCK are shared, and those aren't necessarily nearby in
memory.

> One thing that might be interesting, but that I don't know how to attack
> in a reasonably machine-independent way, is to try to ensure that shared
> and local data structures don't accidentally overlap within cache lines.
> When they do, you pay for fighting the cache line away from another
> processor even when there's no real need.

I'd be sort of surprised if this is a problem - as I understand it,
cache lines are small, contiguous chunks, and surely the heap and the
shared memory segment are mapped into different portions of the
address space...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Performance under contention

From
Robert Haas
Date:
On Tue, Dec 7, 2010 at 1:08 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 7 December 2010 18:37, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Dec 6, 2010 at 9:59 PM, Jignesh Shah <jkshah@gmail.com> wrote:
>>> That's exactly what I concluded when I was doing the sysbench simple
>>> read-only test. I had also tried with different lock partitions and it
>>> did not help since they all go after the same table. I think one way
>>> to kind of avoid the problem on the same table is to do more granular
>>> locking (Maybe at page level instead of table level). But then I don't
>>> really understand on how to even create a prototype related to this
>>> one. If you can help create a prototype then I can test it out with my
>>> setup and see if it helps us to catch up with other guys out there.
>>
>> We're trying to lock the table against a concurrent DROP or schema
>> change, so locking only part of it doesn't really work.  I don't
>> really see any way to avoid needing some kind of a lock here; the
>> trick is how to take it quickly.  The main obstacle to making this
>> faster is that the deadlock detector needs to be able to obtain enough
>> information to break cycles, which means we've got to record in shared
>> memory not only the locks that are granted but who has them.
>
> I'm not very familiar with PostgreSQL code but if we're
> brainstorming... if you're only trying to protect against a small
> number of expensive operations (like DROP, etc.) that don't really
> happen often, wouldn't an atomic reference counter be good enough for
> the purpose (e.g. the expensive operations would spin-wait until the
> counter is 0)?

No, because (1) busy-waiting is only suitable for locks that will only
be held for a short time, and an AccessShareLock on a table might be
held while we read 10GB of data in from disk, and (2) that wouldn't
allow for deadlock detection.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Performance under contention

From
Ivan Voras
Date:
On 7 December 2010 19:10, Robert Haas <robertmhaas@gmail.com> wrote:

>> I'm not very familiar with PostgreSQL code but if we're
>> brainstorming... if you're only trying to protect against a small
>> number of expensive operations (like DROP, etc.) that don't really
>> happen often, wouldn't an atomic reference counter be good enough for
>> the purpose (e.g. the expensive operations would spin-wait until the
>> counter is 0)?
>
> No, because (1) busy-waiting is only suitable for locks that will only
> be held for a short time, and an AccessShareLock on a table might be
> held while we read 10GB of data in from disk,

Generally yes, but a variant with adaptive sleeping could possibly be
used if it would be acceptable to delay (uncertainly) the already
expensive and rare operations.

> and (2) that wouldn't
> allow for deadlock detection.

Probably :)

Re: Performance under contention

From
Віталій Тимчишин
Date:


2010/12/7 Robert Haas <robertmhaas@gmail.com>:
> On Tue, Dec 7, 2010 at 1:08 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>
>> I'm not very familiar with PostgreSQL code but if we're
>> brainstorming... if you're only trying to protect against a small
>> number of expensive operations (like DROP, etc.) that don't really
>> happen often, wouldn't an atomic reference counter be good enough for
>> the purpose (e.g. the expensive operations would spin-wait until the
>> counter is 0)?
>
> No, because (1) busy-waiting is only suitable for locks that will only
> be held for a short time, and an AccessShareLock on a table might be
> held while we read 10GB of data in from disk, and (2) that wouldn't
> allow for deadlock detection.

As far as I understand this thread, the discussion is about contention: a large number of processors all want the same partition lock in order to take a high-level shared lock.
As far as I can see from the source, a lot of code is executed while holding the partition lock, such as two hash searches (and possibly allocations).
What could be done is to increase the number of locks: one could use spinlocks for the hash table manipulations, e.g. one lock that prevents rehashing (changing the number of buckets) and one lock for the bucket being accessed.
In that case only a small range of code would need to be protected by the partition lock.
In my view this would make locking more CPU-intensive (more locks would be acquired and released during execution) but would decrease contention (since all but one of the locks could be spinlocks built on atomic counters, hash searches could proceed in parallel), wouldn't it?
What I am not sure about is how much spinlocks on atomic counters cost today.

--
Best regards,
 Vitalii Tymchyshyn

Re: Performance under contention

From
Ivan Voras
Date:
2010/12/7 Віталій Тимчишин <tivv00@gmail.com>:
>
>
> 2010/12/7 Robert Haas <robertmhaas@gmail.com>
>>
>> On Tue, Dec 7, 2010 at 1:08 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>>
>> > I'm not very familiar with PostgreSQL code but if we're
>> > brainstorming... if you're only trying to protect against a small
>> > number of expensive operations (like DROP, etc.) that don't really
>> > happen often, wouldn't an atomic reference counter be good enough for
>> > the purpose (e.g. the expensive operations would spin-wait until the
>> > counter is 0)?
>>
>> No, because (1) busy-waiting is only suitable for locks that will only
>> be held for a short time, and an AccessShareLock on a table might be
>> held while we read 10GB of data in from disk, and (2) that wouldn't
>> allow for deadlock detection.

> What could be done is to increase the number of locks: one could use
> spinlocks for the hash table manipulations, e.g. one lock that prevents
> rehashing (changing the number of buckets) and one lock for the bucket
> being accessed.
> In that case only a small range of code would need to be protected by the
> partition lock.
> In my view this would make locking more CPU-intensive (more locks would be
> acquired and released during execution) but would decrease contention
> (since all but one of the locks could be spinlocks built on atomic
> counters, hash searches could proceed in parallel), wouldn't it?

For what it's worth, this is pretty much the opposite of what I had in
mind. I proposed atomic reference counters (as others pointed out, this
probably won't work) as a poor man's shared-exclusive lock, so that
most operations would not have to contend on them.

Re: Performance under contention

From
Robert Haas
Date:
2010/12/7 Віталій Тимчишин <tivv00@gmail.com>:
> As far as I can see from the source, there is a lot of code executed under
> the partition lock protection, like two hash searches (and possibly
> allocations).

Yeah, that was my concern, too, though Tom seems skeptical (perhaps
rightly).  And I'm not really sure why the PROCLOCKs need to be in a
hash table anyway - if we know the PROC and LOCK we can surely look up
the PROCLOCK pretty expensively by following the PROC SHM_QUEUE.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Performance under contention

From
Robert Haas
Date:
2010/12/7 Robert Haas <robertmhaas@gmail.com>:
> 2010/12/7 Віталій Тимчишин <tivv00@gmail.com>:
>> As far as I can see from the source, there is a lot of code executed under
>> the partition lock protection, like two hash searches (and possibly
>> allocations).
>
> Yeah, that was my concern, too, though Tom seems skeptical (perhaps
> rightly).  And I'm not really sure why the PROCLOCKs need to be in a
> hash table anyway - if we know the PROC and LOCK we can surely look up
> the PROCLOCK pretty expensively by following the PROC SHM_QUEUE.

Err, pretty INexpensively.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Performance under contention

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
>> Yeah, that was my concern, too, though Tom seems skeptical (perhaps
>> rightly).  And I'm not really sure why the PROCLOCKs need to be in a
>> hash table anyway - if we know the PROC and LOCK we can surely look up
>> the PROCLOCK pretty expensively by following the PROC SHM_QUEUE.

> Err, pretty INexpensively.

There are plenty of scenarios in which a proc might hold hundreds or
even thousands of locks.  pg_dump, for example.  You do not want to be
doing seq search there.

Now, it's possible that you could avoid *ever* needing to search for a
specific PROCLOCK, in which case eliminating the hash calculation
overhead might be worth it.  Of course, you'd still have to replicate
all the space-management functionality of a shared hash table.

            regards, tom lane

Re: Performance under contention

From
Robert Haas
Date:
2010/12/8 Tom Lane <tgl@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
>>> Yeah, that was my concern, too, though Tom seems skeptical (perhaps
>>> rightly).  And I'm not really sure why the PROCLOCKs need to be in a
>>> hash table anyway - if we know the PROC and LOCK we can surely look up
>>> the PROCLOCK pretty expensively by following the PROC SHM_QUEUE.
>
>> Err, pretty INexpensively.
>
> There are plenty of scenarios in which a proc might hold hundreds or
> even thousands of locks.  pg_dump, for example.  You do not want to be
> doing seq search there.
>
> Now, it's possible that you could avoid *ever* needing to search for a
> specific PROCLOCK, in which case eliminating the hash calculation
> overhead might be worth it.

That seems like it might be feasible.  The backend that holds the lock
ought to be able to find out whether there's a PROCLOCK by looking at
the LOCALLOCK table, and the LOCALLOCK has a pointer to the PROCLOCK.
It's not clear to me whether there's any other use case for doing a
lookup for a particular combination of PROC A + LOCK B, but I'll have
to look at the code more closely.

> Of course, you'd still have to replicate
> all the space-management functionality of a shared hash table.

Maybe we ought to revisit Markus Wanner's wamalloc.  Although given
our recent discussions, I'm thinking that you might want to try to
design any allocation system so as to minimize cache line contention.
For example, you could hard-allocate each backend 512 bytes of
dedicated shared memory in which to record the locks it holds.  If it
needs more, it allocates additional 512 byte chunks.
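
A very rough C sketch of that chunk scheme, only to show the shape of the idea. Everything here is hypothetical (LockChunk, chunk_pool, chunk_alloc, backend_record_lock, the 32-bit "tags", the pool size, the assumed 64-byte cache line), and a real shared pool would need some protection around chunk allocation, which is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHUNK_SIZE  512         /* chunk size suggested above */
#define POOL_CHUNKS 8           /* hypothetical pool size for this sketch */
#define TAGS_PER_CHUNK \
    ((CHUNK_SIZE - sizeof(void *) - sizeof(uint32_t)) / sizeof(uint32_t))

typedef struct LockChunk
{
    struct LockChunk *next;     /* older chunk owned by the same backend */
    uint32_t    nused;          /* number of tags stored in this chunk */
    uint32_t    tags[TAGS_PER_CHUNK];   /* stand-in for real lock records */
} LockChunk;

_Static_assert(sizeof(LockChunk) == CHUNK_SIZE, "chunk must be 512 bytes");

/* Pool aligned to an assumed 64-byte line; 512 is a multiple of 64, so
 * chunks owned by different backends never share a cache line. */
_Alignas(64) LockChunk chunk_pool[POOL_CHUNKS];
size_t      chunk_pool_used = 0;

/* Hand out the next unused chunk, or NULL if the pool is exhausted. */
LockChunk *
chunk_alloc(void)
{
    LockChunk  *c;

    if (chunk_pool_used >= POOL_CHUNKS)
        return NULL;
    c = &chunk_pool[chunk_pool_used++];
    c->next = NULL;
    c->nused = 0;
    return c;
}

/* Record one lock for a backend; returns 0 on success, -1 if out of chunks. */
int
backend_record_lock(LockChunk **head, uint32_t tag)
{
    assert(head != NULL);
    if (*head == NULL || (*head)->nused == TAGS_PER_CHUNK)
    {
        LockChunk  *c = chunk_alloc();

        if (c == NULL)
            return -1;
        c->next = *head;        /* newest chunk goes to the front */
        *head = c;
    }
    (*head)->tags[(*head)->nused++] = tag;
    return 0;
}
```

A backend that holds few locks touches only its own first chunk; only a backend that overflows it has to go back to the shared pool.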

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Performance under contention

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> 2010/12/8 Tom Lane <tgl@sss.pgh.pa.us>:
>> Now, it's possible that you could avoid *ever* needing to search for a
>> specific PROCLOCK, in which case eliminating the hash calculation
>> overhead might be worth it.

> That seems like it might be feasible.  The backend that holds the lock
> ought to be able to find out whether there's a PROCLOCK by looking at
> the LOCALLOCK table, and the LOCALLOCK has a pointer to the PROCLOCK.

Hm, that is a real good point.  Those shared memory data structures
predate the invention of the local lock tables, and I don't think we
looked real hard at whether we should rethink the fundamental
representation in shared memory given the additional local state.
The issue though is whether any other processes ever need to look
at a proc's PROCLOCKs.  I think at least deadlock detection does.

            regards, tom lane

Re: Performance under contention

From
Robert Haas
Date:
2010/12/8 Tom Lane <tgl@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
>> 2010/12/8 Tom Lane <tgl@sss.pgh.pa.us>:
>>> Now, it's possible that you could avoid *ever* needing to search for a
>>> specific PROCLOCK, in which case eliminating the hash calculation
>>> overhead might be worth it.
>
>> That seems like it might be feasible.  The backend that holds the lock
>> ought to be able to find out whether there's a PROCLOCK by looking at
>> the LOCALLOCK table, and the LOCALLOCK has a pointer to the PROCLOCK.
>
> Hm, that is a real good point.  Those shared memory data structures
> predate the invention of the local lock tables, and I don't think we
> looked real hard at whether we should rethink the fundamental
> representation in shared memory given the additional local state.
> The issue though is whether any other processes ever need to look
> at a proc's PROCLOCKs.  I think at least deadlock detection does.

Sure, but it doesn't use the hash table to do it.  All the PROCLOCKs
for any given LOCK are in a linked list; we just walk it.
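
For readers not familiar with the layout, a simplified C sketch of that kind of walk follows. The structures are hypothetical stand-ins (SketchLock, SketchProcLock, a singly linked list, an integer mode bitmask), not PostgreSQL's actual LOCK, PROCLOCK or SHM_QUEUE definitions; the point is only that the per-lock list is traversed without any hash lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not PostgreSQL's real definitions. */
typedef struct SketchProcLock
{
    int         proc_id;        /* which backend this entry belongs to */
    unsigned    hold_mask;      /* bitmask of lock modes it holds */
    struct SketchProcLock *next_on_lock;    /* next holder of the same lock */
} SketchProcLock;

typedef struct SketchLock
{
    SketchProcLock *holders;    /* head of the per-lock holder list */
} SketchLock;

/*
 * Walk the per-lock list and collect the ids of backends that hold any
 * mode in conflict_mask.  Returns the number of ids written to out[],
 * which is at most max_out.
 */
size_t
find_conflicting_holders(const SketchLock *lock, unsigned conflict_mask,
                         int *out, size_t max_out)
{
    size_t      n = 0;
    const SketchProcLock *pl;

    assert(lock != NULL && out != NULL);
    for (pl = lock->holders; pl != NULL && n < max_out; pl = pl->next_on_lock)
    {
        if (pl->hold_mask & conflict_mask)
            out[n++] = pl->proc_id;
    }
    return n;
}
```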

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company