Thread: Performance under contention
This is not a request for help but a report, in case it helps developers or someone in the future.

The setup is: AMD64 machine, 24 GB RAM, 2x 6-core Xeon CPUs + HTT (24 logical CPUs), FreeBSD 8.1-stable, AMD64, PostgreSQL 9.0.1, 10 GB shared buffers, using pgbench with a scale factor of 500 (7.5 GB database).

With pgbench -S (SELECT queries only) the performance curve is:

-c#  result
  4   33549
  8   64864
 12   79491
 16   79887
 20   66957
 24   52576
 28   50406
 32   49491
 40   45535
 50   39499
 75   29415

After 16 clients (which is still good since there are only 12 "real" cores in the system), the performance drops sharply. Looking at the processes' state, most of them seem to spend their time in system calls (i.e. executing in the kernel), in the states "semwait" and "sbwait", i.e. semaphore wait and socket buffer wait, for example:

3047 pgsql  1  60  0 10533M  283M sbwait  12  0:01  6.79% postgres
3055 pgsql  1  64  0 10533M  279M sbwait  15  0:01  6.79% postgres
3033 pgsql  1  64  0 10533M  279M semwai   6  0:01  6.69% postgres
3038 pgsql  1  64  0 10533M  283M CPU5    13  0:01  6.69% postgres
3037 pgsql  1  62  0 10533M  279M sbwait  23  0:01  6.69% postgres
3048 pgsql  1  65  0 10533M  280M semwai   4  0:01  6.69% postgres
3056 pgsql  1  65  0 10533M  277M semwai   1  0:01  6.69% postgres
3002 pgsql  1  62  0 10533M  284M CPU19    0  0:01  6.59% postgres
3042 pgsql  1  63  0 10533M  279M semwai  21  0:01  6.59% postgres
3029 pgsql  1  63  0 10533M  277M semwai  23  0:01  6.59% postgres
3046 pgsql  1  63  0 10533M  278M RUN      5  0:01  6.59% postgres
3036 pgsql  1  63  0 10533M  278M CPU1    12  0:01  6.59% postgres
3051 pgsql  1  63  0 10533M  277M semwai   1  0:01  6.59% postgres
3030 pgsql  1  63  0 10533M  281M semwai   1  0:01  6.49% postgres
3050 pgsql  1  60  0 10533M  276M semwai   1  0:01  6.49% postgres

The "sbwait" part is FreeBSD waiting on IPC sockets, but so much blocking in semwait indicates heavy contention inside PostgreSQL.
Ivan Voras wrote:
> After 16 clients (which is still good since there are only 12
> "real" cores in the system), the performance drops sharply

Yet another data point to confirm the importance of connection pooling. :-)

-Kevin
On 11/22/10 02:47, Kevin Grittner wrote:
> Ivan Voras wrote:
>
>> After 16 clients (which is still good since there are only 12
>> "real" cores in the system), the performance drops sharply
>
> Yet another data point to confirm the importance of connection
> pooling. :-)

I agree, connection pooling will get rid of the symptom, but not the underlying problem. I'm not saying that having 1000s of connections to the database is a particularly good design, only that there shouldn't be a sharp decline in performance when it does happen. Ideally, the performance should remain the same as it was at its peak.

I've been monitoring the server some more and it looks like there are periods where almost all server processes are in the semwait state, followed by periods of intensive work - approximately similar to the "thundering herd" problem, or maybe to what Josh Berkus posted about a few days ago.
On Sun, Nov 21, 2010 at 9:18 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 11/22/10 02:47, Kevin Grittner wrote:
>>
>> Ivan Voras wrote:
>>
>>> After 16 clients (which is still good since there are only 12
>>> "real" cores in the system), the performance drops sharply
>>
>> Yet another data point to confirm the importance of connection
>> pooling. :-)
>
> I agree, connection pooling will get rid of the symptom, but not the
> underlying problem. I'm not saying that having 1000s of connections
> to the database is a particularly good design, only that there
> shouldn't be a sharp decline in performance when it does happen.
> Ideally, the performance should remain the same as it was at its peak.
>
> I've been monitoring the server some more and it looks like there are
> periods where almost all server processes are in the semwait state,
> followed by periods of intensive work - approximately similar to the
> "thundering herd" problem, or maybe to what Josh Berkus posted about
> a few days ago.

Try it with SystemTap or DTrace and see if you find the same bottlenecks as I do in
http://jkshah.blogspot.com/2010/11/postgresql-90-simple-select-scaling.html

I will probably retry it with pgbench and see what I find.

Regards,
Jignesh
Hi Ivan,

We have the same issue on our database machines (which are 2x 6-core Intel(R) Xeon(R) X5670 CPUs @ 2.93GHz with 24 logical cores and 144 GB of RAM) - they run RHEL 5. The issue occurs with our normal OLTP workload, so it's not just pgbench.

We use pgbouncer to limit total connections to 15 (this seemed to be the 'sweet spot' for us) - there's definitely a bunch of contention on ... something... for a workload where you're running a lot of very fast SELECTs (around 2000-4000/s) from more than 15-16 clients.

I had a chat with Neil C or Gavin S about this at some point, but I forget the reason for it. I don't think there's anything you can do for it configuration-wise except use a connection pool.

Regards,
Omar

On Mon, Nov 22, 2010 at 5:54 PM, Jignesh Shah <jkshah@gmail.com> wrote:
> On Sun, Nov 21, 2010 at 9:18 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> On 11/22/10 02:47, Kevin Grittner wrote:
>>>
>>> Ivan Voras wrote:
>>>
>>>> After 16 clients (which is still good since there are only 12
>>>> "real" cores in the system), the performance drops sharply
>>>
>>> Yet another data point to confirm the importance of connection
>>> pooling. :-)
>>
>> I agree, connection pooling will get rid of the symptom, but not the
>> underlying problem. I'm not saying that having 1000s of connections
>> to the database is a particularly good design, only that there
>> shouldn't be a sharp decline in performance when it does happen.
>> Ideally, the performance should remain the same as it was at its peak.
>>
>> I've been monitoring the server some more and it looks like there are
>> periods where almost all server processes are in the semwait state,
>> followed by periods of intensive work - approximately similar to the
>> "thundering herd" problem, or maybe to what Josh Berkus posted about
>> a few days ago.
>
> Try it with SystemTap or DTrace and see if you find the same
> bottlenecks as I do in
> http://jkshah.blogspot.com/2010/11/postgresql-90-simple-select-scaling.html
>
> I will probably retry it with pgbench and see what I find.
>
> Regards,
> Jignesh
Ivan Voras <ivoras@freebsd.org> wrote:
> On 11/22/10 02:47, Kevin Grittner wrote:
>> Ivan Voras wrote:
>>
>>> After 16 clients (which is still good since there are only 12
>>> "real" cores in the system), the performance drops sharply
>>
>> Yet another data point to confirm the importance of connection
>> pooling. :-)
>
> I agree, connection pooling will get rid of the symptom, but not the
> underlying problem. I'm not saying that having 1000s of connections
> to the database is a particularly good design, only that there
> shouldn't be a sharp decline in performance when it does happen.
> Ideally, the performance should remain the same as it was at its peak.

Well, I suggested that we add an admission control[1] mechanism, with at least part of the initial default policy being that there is a limit on the number of active database transactions. Such a policy would do what you are suggesting, but the idea was shot down on the basis that in most of the cases where this would help, people would be better served by using an external connection pool.

If interested, search the archives for details of the discussion.

-Kevin

[1] Joseph M. Hellerstein, Michael Stonebraker and James Hamilton. 2007. Architecture of a Database System. Foundations and Trends(R) in Databases Vol. 1, No. 2 (2007) 141-259 (see Section 2.4 - Admission Control). http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf
On 11/22/10 16:26, Kevin Grittner wrote:
> Ivan Voras<ivoras@freebsd.org> wrote:
>> On 11/22/10 02:47, Kevin Grittner wrote:
>>> Ivan Voras wrote:
>>>
>>>> After 16 clients (which is still good since there are only 12
>>>> "real" cores in the system), the performance drops sharply
>>>
>>> Yet another data point to confirm the importance of connection
>>> pooling. :-)
>>
>> I agree, connection pooling will get rid of the symptom, but not the
>> underlying problem. I'm not saying that having 1000s of connections
>> to the database is a particularly good design, only that there
>> shouldn't be a sharp decline in performance when it does happen.
>> Ideally, the performance should remain the same as it was at its peak.
>
> Well, I suggested that we add an admission control[1] mechanism,

It looks like a hack (and one which is already implemented by connection pool software); the underlying problem should be addressed. But on the other hand, if it's affecting so many people, maybe a warning comment in postgresql.conf around max_connections would be helpful.
Ivan Voras <ivoras@freebsd.org> wrote:
> It looks like a hack

Not to everyone. In the referenced section, Hellerstein, Stonebraker and Hamilton say:

"any good multi-user system has an admission control policy"

In the case of PostgreSQL I understand the counter-argument, although I'm inclined to think that it's prudent for a product to limit resource usage to a level at which it can still function well, even if there's an external solution which can also work, should people use it correctly. It seems likely that a mature admission control policy could do a better job of managing some resources than an external product could.

-Kevin
On 11/22/2010 11:38 PM, Ivan Voras wrote:
> On 11/22/10 16:26, Kevin Grittner wrote:
>> Ivan Voras<ivoras@freebsd.org> wrote:
>>> On 11/22/10 02:47, Kevin Grittner wrote:
>>>> Ivan Voras wrote:
>>>>
>>>>> After 16 clients (which is still good since there are only 12
>>>>> "real" cores in the system), the performance drops sharply
>>>>
>>>> Yet another data point to confirm the importance of connection
>>>> pooling. :-)
>>>
>>> I agree, connection pooling will get rid of the symptom, but not the
>>> underlying problem. I'm not saying that having 1000s of connections
>>> to the database is a particularly good design, only that there
>>> shouldn't be a sharp decline in performance when it does happen.
>>> Ideally, the performance should remain the same as it was at its peak.
>>
>> Well, I suggested that we add an admission control[1] mechanism,
>
> It looks like a hack (and one which is already implemented by connection
> pool software); the underlying problem should be addressed.

My (poor) understanding is that addressing the underlying problem would require a massive restructuring of PostgreSQL to separate "connection and session state" from "executor and backend". Idle connections wouldn't require a backend to sit around unused but participating in all-backends synchronization and signalling. Active connections over a configured maximum concurrency limit would queue for access to a backend rather than fighting it out for resources at the OS level.

The trouble is that this would be an *enormous* rewrite of the codebase, and would still only solve part of the problem. See the prior discussion on in-server connection pooling and admission control.

Personally I think the current approach is clearly difficult for many admins to understand, and it's unfortunate that it requires external software to be effective. OTOH, I'm not sure what the answer is.

--
Craig Ringer
On 24 November 2010 01:11, Craig Ringer <craig@postnewspapers.com.au> wrote:
> On 11/22/2010 11:38 PM, Ivan Voras wrote:
>> It looks like a hack (and one which is already implemented by connection
>> pool software); the underlying problem should be addressed.
>
> My (poor) understanding is that addressing the underlying problem would
> require a massive restructuring of PostgreSQL to separate "connection and
> session state" from "executor and backend". Idle connections wouldn't
> require a backend to sit around unused but participating in all-backends
> synchronization and signalling. Active connections over a configured maximum
> concurrency limit would queue for access to a backend rather than fighting
> it out for resources at the OS level.
>
> The trouble is that this would be an *enormous* rewrite of the codebase, and
> would still only solve part of the problem. See the prior discussion on
> in-server connection pooling and admission control.

I'm (also) not a PostgreSQL developer, so I'm hoping that someone who is will join the thread, but speaking generally, there is no reason why this couldn't be a simpler problem which just requires finer-grained locking or smarter semaphore usage.

I'm not talking about forcing performance out of a situation where there are no more CPU cycles to take, but about degrading gracefully in those circumstances and not taking an 80%+ drop because of spinning around in semaphore syscalls.
24.11.10 02:11, Craig Ringer wrote:
> On 11/22/2010 11:38 PM, Ivan Voras wrote:
>> On 11/22/10 16:26, Kevin Grittner wrote:
>>> Ivan Voras<ivoras@freebsd.org> wrote:
>>>> On 11/22/10 02:47, Kevin Grittner wrote:
>>>>> Ivan Voras wrote:
>>>>>
>>>>>> After 16 clients (which is still good since there are only 12
>>>>>> "real" cores in the system), the performance drops sharply
>>>>>
>>>>> Yet another data point to confirm the importance of connection
>>>>> pooling. :-)
>>>>
>>>> I agree, connection pooling will get rid of the symptom, but not the
>>>> underlying problem. I'm not saying that having 1000s of connections
>>>> to the database is a particularly good design, only that there
>>>> shouldn't be a sharp decline in performance when it does happen.
>>>> Ideally, the performance should remain the same as it was at its peak.
>>>
>>> Well, I suggested that we add an admission control[1] mechanism,
>>
>> It looks like a hack (and one which is already implemented by connection
>> pool software); the underlying problem should be addressed.
>
> My (poor) understanding is that addressing the underlying problem
> would require a massive restructuring of PostgreSQL to separate
> "connection and session state" from "executor and backend". Idle
> connections wouldn't require a backend to sit around unused but
> participating in all-backends synchronization and signalling. Active
> connections over a configured maximum concurrency limit would queue
> for access to a backend rather than fighting it out for resources at
> the OS level.
>
> The trouble is that this would be an *enormous* rewrite of the
> codebase, and would still only solve part of the problem. See the
> prior discussion on in-server connection pooling and admission control.

Hello.

IMHO the main problem is not a backend sitting around doing nothing, but multiple backends trying to do their work. So, as for me, the simplest option that would make most people happy is a limit (a waitable semaphore) on the number of backends actively executing a query. Such a limit could even be derived automatically from the number of CPUs (simple) and spindles (not sure if that's simple, but some default could be used). An idle backend (or one waiting for a lock) consumes few resources. If one wants to reduce resource usage for such backends, one can introduce external pooling, but such a simple limit would make me happy (e.g. max_active_connections=1000, max_active_queries=20).

The main question here is how many resources a backend that is waiting for a lock can hold. Is locking done at query start, or can a backend go into a wait after it has already consumed much of its work_mem? In the second case the limit won't cap work_mem usage, but it would still prevent much of the contention.

Best regards,
Vitalii Tymchyshyn
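[For concreteness, a minimal sketch of such a waitable-semaphore gate, using POSIX semaphores; the names (gate_init, run_query_gated, max_active_queries) are hypothetical and this is not PostgreSQL code. In a real multi-process server the sem_t would have to live in the shared memory segment so that all backends see the same gate.]

#include <semaphore.h>

/* Hypothetical admission gate; in a real multi-process server this
 * sem_t would be placed in shared memory, not static storage. */
static sem_t active_query_gate;

void
gate_init(int max_active_queries)
{
    /* pshared = 1: the semaphore is shared between processes. */
    sem_init(&active_query_gate, 1, max_active_queries);
}

/* At most max_active_queries backends execute at once; the rest
 * sleep in sem_wait() instead of thrashing CPUs, semaphores and
 * lock-manager cache lines. */
void
run_query_gated(void (*exec_query)(void *), void *arg)
{
    sem_wait(&active_query_gate);   /* blocks while the gate is full */
    exec_query(arg);
    sem_post(&active_query_gate);   /* wakes one waiter, if any */
}

[Under this scheme, max_active_connections=1000 with max_active_queries=20 would mean up to 980 backends sleeping cheaply in sem_wait() rather than contending for the CPUs.]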
Vitalii Tymchyshyn <tivv00@gmail.com> wrote:
> the simplest option that would make most people happy is a limit
> (a waitable semaphore) on the number of backends actively
> executing a query.

That's very similar to the admission control policy I proposed, except that I suggested a limit on the number of active database transactions rather than the number of queries. The reason is that you could still get into a lot of lock contention with a query-based limit -- a query could acquire locks (perhaps by writing rows to the database) and then be blocked waiting its turn, leading to conflicts with other transactions. Such problems would be less common with a transaction limit, since most common locks don't persist past the end of the transaction.

-Kevin
On 11/22/10 18:47, Kevin Grittner wrote:
> Ivan Voras<ivoras@freebsd.org> wrote:
>
>> It looks like a hack
>
> Not to everyone. In the referenced section, Hellerstein,
> Stonebraker and Hamilton say:
>
> "any good multi-user system has an admission control policy"
>
> In the case of PostgreSQL I understand the counter-argument,
> although I'm inclined to think that it's prudent for a product to
> limit resource usage to a level at which it can still function well,
> even if there's an external solution which can also work, should
> people use it correctly. It seems likely that a mature admission
> control policy could do a better job of managing some resources than
> an external product could.

I didn't think it would be that useful, but yesterday I did some (unrelated) testing with MySQL and it looks like its configuration parameter "thread_concurrency" does something to that effect. Initially I thought it was equivalent to PostgreSQL's max_connections, but no: the number of connections can grow (MySQL spawns a thread per connection by default), while the actual concurrency is limited in some way by this parameter. The comment for the parameter says "# Try number of CPU's*2 for thread_concurrency", but obviously it would depend a lot on the real-world load.
Ivan Voras wrote:
> PostgreSQL 9.0.1, 10 GB shared buffers, using pgbench with a scale
> factor of 500 (7.5 GB database)
>
> With pgbench -S (SELECT queries only) the performance curve is:
>
> -c#  result
>   4   33549
>   8   64864
>  12   79491
>  16   79887
>  20   66957
>  24   52576
>  28   50406
>  32   49491
>  40   45535
>  50   39499
>  75   29415

Two suggestions to improve your results here:

1) Don't set shared_buffers to 10GB. There are some known issues with large settings for that which may or may not be impacting your results. Try 4GB instead, just to make sure you're not even on the edge of that area.

2) pgbench itself is known to become a bottleneck when running with lots of clients. You should be using the "-j" option to spawn multiple workers, probably 12 of them (one per core), to make some of this go away. On the system I saw the most improvement here, I got a 15-25% gain having more workers at the higher client counts.

> The "sbwait" part is FreeBSD waiting on IPC sockets, but so much
> blocking in semwait indicates heavy contention inside PostgreSQL.

It will be interesting to see if that's different after the changes suggested above.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services and Support        www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books
On 26 November 2010 03:00, Greg Smith <greg@2ndquadrant.com> wrote:
> Two suggestions to improve your results here:
>
> 1) Don't set shared_buffers to 10GB. There are some known issues with large
> settings for that which may or may not be impacting your results. Try 4GB
> instead, just to make sure you're not even on the edge of that area.
>
> 2) pgbench itself is known to become a bottleneck when running with lots of
> clients. You should be using the "-j" option to spawn multiple workers,
> probably 12 of them (one per core), to make some of this go away. On the
> system I saw the most improvement here, I got a 15-25% gain having more
> workers at the higher client counts.
>
> It will be interesting to see if that's different after the changes
> suggested above.

Too late, I can't test on that hardware anymore. I did use -j on pgbench, but beyond 2 threads there were no significant improvements - the two threads did not saturate two CPU cores.

However, I did run a similar select-only test on tmpfs on different hardware with much less memory (4 GB total) and shared_buffers somewhere around 2 GB, with the same performance curve:

http://ivoras.sharanet.org/blog/tree/2010-07-21.postgresql-on-tmpfs.html

so I doubt the curve would change by reducing shared_buffers below what I used in the original post.
On Sun, Nov 21, 2010 at 7:15 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> The "sbwait" part is FreeBSD waiting on IPC sockets, but so much
> blocking in semwait indicates heavy contention inside PostgreSQL.

I can reproduce this. I suspect, but cannot yet prove, that this is contention over the lock manager partition locks or the buffer mapping locks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Mon, Dec 6, 2010 at 12:10 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 21, 2010 at 7:15 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> The "sbwait" part is FreeBSD waiting on IPC sockets, but so much
>> blocking in semwait indicates heavy contention inside PostgreSQL.
>
> I can reproduce this. I suspect, but cannot yet prove, that this is
> contention over the lock manager partition locks or the buffer mapping
> locks.

I compiled with LWLOCK_STATS defined and found that exactly one lock manager partition lwlock was heavily contended, because, of course, the SELECT-only test only hits one table, and all the threads fight over acquisition and release of AccessShareLock on that table.

One might argue that in more normal workloads there will be more than one table involved, but that's not necessarily true, and in any case there might not be more than a handful of major ones. However, I don't have a very clear idea what to do about it. Increasing the number of lock partitions doesn't help, because the one table you care about is still only in one partition.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Dec 7, 2010 at 1:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Sun, Nov 21, 2010 at 7:15 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> The "sbwait" part is FreeBSD waiting on IPC sockets, but so much
>> blocking in semwait indicates heavy contention inside PostgreSQL.
>
> I can reproduce this. I suspect, but cannot yet prove, that this is
> contention over the lock manager partition locks or the buffer mapping
> locks.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Hi Robert,

That's exactly what I concluded when I was doing the sysbench simple read-only test. I had also tried with different lock partitions and it did not help, since they all go after the same table. I think one way to avoid the problem on the same table is to do more granular locking (maybe at page level instead of table level). But I don't really understand how to even create a prototype for this. If you can help create a prototype then I can test it out with my setup and see if it helps us to catch up with the other guys out there.

Also, on the subject of whether this is a real workload: in fact it seems all social networks use this pattern frequently with their user tables, and this test actually came from my talks with Mark Callaghan, who says it is very common in their environment where thousands of users pull up their user-profile data from the same table. Which is why I got interested in trying it more.

Regards,
Jignesh
On Tue, Dec 7, 2010 at 10:59 AM, Jignesh Shah <jkshah@gmail.com> wrote:
> On Tue, Dec 7, 2010 at 1:10 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Nov 21, 2010 at 7:15 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>>> The "sbwait" part is FreeBSD waiting on IPC sockets, but so much
>>> blocking in semwait indicates heavy contention inside PostgreSQL.
>>
>> I can reproduce this. I suspect, but cannot yet prove, that this is
>> contention over the lock manager partition locks or the buffer mapping
>> locks.
>
> That's exactly what I concluded when I was doing the sysbench simple
> read-only test. I had also tried with different lock partitions and it
> did not help, since they all go after the same table. I think one way
> to avoid the problem on the same table is to do more granular locking
> (maybe at page level instead of table level). But I don't really
> understand how to even create a prototype for this. If you can help
> create a prototype then I can test it out with my setup and see if it
> helps us to catch up with the other guys out there.
>
> Also, on the subject of whether this is a real workload: in fact it seems
> all social networks use this pattern frequently with their user tables, and
> this test actually came from my talks with Mark Callaghan, who says it is
> very common in their environment where thousands of users pull up their
> user-profile data from the same table. Which is why I got interested in
> trying it more.

Also, I forgot to mention that in my sysbench test I saw exactly two locks: one related to the AccessShareLock on the table, but the other related to RevalidateCachedPlan, which at least to me seemed to be a slightly bigger problem than the AccessShareLock one. But I will take anything. Ideally both :-)

Regards,
Jignesh
On Mon, Dec 6, 2010 at 9:59 PM, Jignesh Shah <jkshah@gmail.com> wrote:
> That's exactly what I concluded when I was doing the sysbench simple
> read-only test. I had also tried with different lock partitions and it
> did not help, since they all go after the same table. I think one way
> to avoid the problem on the same table is to do more granular locking
> (maybe at page level instead of table level). But I don't really
> understand how to even create a prototype for this. If you can help
> create a prototype then I can test it out with my setup and see if it
> helps us to catch up with the other guys out there.

We're trying to lock the table against a concurrent DROP or schema change, so locking only part of it doesn't really work. I don't really see any way to avoid needing some kind of a lock here; the trick is how to take it quickly. The main obstacle to making this faster is that the deadlock detector needs to be able to obtain enough information to break cycles, which means we've got to record in shared memory not only the locks that are granted but who has them.

However, I wonder if it would be possible to have a very short critical section where we grab the partition lock, acquire the heavyweight lock, and release the partition lock; and then only as a second step record (in the form of a PROCLOCK) the fact that we got it. During this second step, we'd hold a lock associated with the PROC, not the LOCK. If the deadlock checker runs after we've acquired the lock and before we've recorded that we have it, it'll see more locks than lock holders, but that should be OK, since the process which hasn't yet recorded its lock acquisition is clearly not part of any deadlock.

Currently, PROCLOCKs are included in both a list of locks held by that PROC, and a list of lockers of that LOCK. The latter list would be hard to maintain in this scheme, but maybe that's OK too. We really only need that information for the deadlock checker, and the deadlock checker could potentially still get the information by grovelling through all the PROCs. That might be a bit slow, but maybe it'd be OK, or maybe we could think of a clever way to speed it up. Just thinking out loud here...

> Also, on the subject of whether this is a real workload: in fact it seems
> all social networks use this pattern frequently with their user tables, and
> this test actually came from my talks with Mark Callaghan, who says it is
> very common in their environment where thousands of users pull up their
> user-profile data from the same table. Which is why I got interested in
> trying it more.

Yeah.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
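[As a rough illustration of the two-step idea sketched above - and only an illustration: every name below is invented, pthread mutexes stand in for LWLocks, and the structs are drastic simplifications of lock.c's LOCK and PGPROC.]

#include <pthread.h>
#include <stdbool.h>

#define MAX_LOCKMODES 10

typedef struct
{
    pthread_mutex_t partition_lock;   /* stands in for the partition LWLock */
    int             granted[MAX_LOCKMODES];
    int             grant_mask;
} FakeLock;

typedef struct
{
    pthread_mutex_t proclock_lock;    /* per-proc, NOT per-partition */
    /* ... this proc's list of held locks would live here ... */
} FakeProc;

/* Assumed helpers, declared but not defined in this sketch. */
bool conflicts_with_granted(FakeLock *lock, int mode);
void record_proclock(FakeProc *proc, FakeLock *lock, int mode);

bool
lock_acquire_two_step(FakeProc *proc, FakeLock *lock, int mode)
{
    /* Step 1: a short critical section that only grants the lock. */
    pthread_mutex_lock(&lock->partition_lock);
    if (conflicts_with_granted(lock, mode))
    {
        pthread_mutex_unlock(&lock->partition_lock);
        return false;                 /* caller falls back to the wait path */
    }
    lock->granted[mode]++;
    lock->grant_mask |= (1 << mode);
    pthread_mutex_unlock(&lock->partition_lock);

    /* Step 2: record who holds it, under a lock tied to the PROC rather
     * than to the LOCK.  A deadlock checker running between the steps
     * sees more granted locks than recorded holders, which is harmless:
     * a holder that hasn't recorded itself yet cannot be waiting, so it
     * cannot be part of any cycle. */
    pthread_mutex_lock(&proc->proclock_lock);
    record_proclock(proc, lock, mode);
    pthread_mutex_unlock(&proc->proclock_lock);

    return true;
}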
Robert Haas <robertmhaas@gmail.com> writes:
> I wonder if it would be possible to have a very short critical section
> where we grab the partition lock, acquire the heavyweight lock, and
> release the partition lock; and then only as a second step record (in
> the form of a PROCLOCK) the fact that we got it.

[ confused... ] Exactly what do you suppose "acquire the lock" would be represented as, if not "create a PROCLOCK entry attached to it"?

In any case, I think this is another example of not understanding where the costs really are. As far as I can tell, on modern MP systems much of the elapsed time in these operations comes from acquiring exclusive access to shared-memory cache lines. Reducing the number of changes you have to make within a small area of shared memory won't save much, once you've paid for the first one. Changing structures that aren't heavily contended (such as a proc's list of its own locks) doesn't cost much at all.

One thing that might be interesting, but that I don't know how to attack in a reasonably machine-independent way, is to try to ensure that shared and local data structures don't accidentally overlap within cache lines. When they do, you pay for fighting the cache line away from another processor even when there's no real need.

regards, tom lane
Hi Tom
I suspect I may be missing something here, but I think it's a pretty universal truism that cache lines are aligned to power-of-2 memory addresses, so it would suffice to ensure during setup that the lower order n bits of the object address are all zeros for each critical object; if the malloc() routine being used doesn't support that, it could be done by allocating a slightly larger than necessary block of memory and choosing a location within that.
The value of n could be architecture dependent, but n=8 would cover everyone, hopefully without wasting too much RAM.
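[In portable C, that over-allocate-then-align trick might look like the sketch below, assuming a 64-byte line for illustration (n=8 in the terms above would mean 256-byte alignment); where available, posix_memalign() does the same job directly.]

#include <stdint.h>
#include <stdlib.h>

#define CACHE_LINE 64   /* illustration only; n=8 above would mean 256 */

/* Over-allocate so that an aligned address is guaranteed to exist
 * inside the block, and stash the raw pointer just before the address
 * we hand out, so it can be freed later. */
void *
alloc_cacheline_aligned(size_t size)
{
    void *raw = malloc(size + CACHE_LINE + sizeof(void *));

    if (raw == NULL)
        return NULL;

    uintptr_t base = (uintptr_t) raw + sizeof(void *);
    uintptr_t aligned = (base + CACHE_LINE - 1) & ~(uintptr_t) (CACHE_LINE - 1);

    ((void **) aligned)[-1] = raw;  /* remember what to free() */
    return (void *) aligned;
}

void
free_cacheline_aligned(void *p)
{
    if (p != NULL)
        free(((void **) p)[-1]);
}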
Cheers
Dave
On Tue, Dec 7, 2010 at 11:50 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> One thing that might be interesting, but that I don't know how to attack
> in a reasonably machine-independent way, is to try to ensure that shared
> and local data structures don't accidentally overlap within cache lines.
> When they do, you pay for fighting the cache line away from another
> processor even when there's no real need.
>
> regards, tom lane
On 7 December 2010 18:37, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Dec 6, 2010 at 9:59 PM, Jignesh Shah <jkshah@gmail.com> wrote:
>> That's exactly what I concluded when I was doing the sysbench simple
>> read-only test. I had also tried with different lock partitions and it
>> did not help, since they all go after the same table. I think one way
>> to avoid the problem on the same table is to do more granular locking
>> (maybe at page level instead of table level). But I don't really
>> understand how to even create a prototype for this. If you can help
>> create a prototype then I can test it out with my setup and see if it
>> helps us to catch up with the other guys out there.
>
> We're trying to lock the table against a concurrent DROP or schema
> change, so locking only part of it doesn't really work. I don't
> really see any way to avoid needing some kind of a lock here; the
> trick is how to take it quickly. The main obstacle to making this
> faster is that the deadlock detector needs to be able to obtain enough
> information to break cycles, which means we've got to record in shared
> memory not only the locks that are granted but who has them.

I'm not very familiar with PostgreSQL code, but if we're brainstorming... if you're only trying to protect against a small number of expensive operations (like DROP, etc.) that don't really happen often, wouldn't an atomic reference counter be good enough for the purpose (e.g. the expensive operations would spin-wait until the counter is 0)?
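[For illustration, the scheme being proposed here might look like the C11 sketch below; all names are hypothetical and fields are assumed zero-initialized. As the reply further down points out, it busy-waits and offers no deadlock detection, so it shows the proposal rather than a workable replacement.]

#include <stdatomic.h>
#include <stdbool.h>

typedef struct
{
    atomic_int  readers;        /* backends currently "holding" the table */
    atomic_bool drop_pending;   /* set by DROP/ALTER to stop new readers */
} table_guard;

/* Cheap path taken by every SELECT. */
bool
guard_enter(table_guard *g)
{
    if (atomic_load(&g->drop_pending))
        return false;                   /* expensive op in progress */
    atomic_fetch_add(&g->readers, 1);
    /* Re-check: a DROP may have slipped in between the two steps. */
    if (atomic_load(&g->drop_pending))
    {
        atomic_fetch_sub(&g->readers, 1);
        return false;
    }
    return true;
}

void
guard_exit(table_guard *g)
{
    atomic_fetch_sub(&g->readers, 1);
}

/* Taken by the rare, expensive operations (DROP, schema change). */
void
guard_exclusive(table_guard *g)
{
    atomic_store(&g->drop_pending, true);
    while (atomic_load(&g->readers) != 0)
        ;                               /* spin-wait until readers drain */
}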
On Tue, Dec 7, 2010 at 12:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I wonder if it would be possible to have a very short critical section
>> where we grab the partition lock, acquire the heavyweight lock, and
>> release the partition lock; and then only as a second step record (in
>> the form of a PROCLOCK) the fact that we got it.
>
> [ confused... ] Exactly what do you suppose "acquire the lock" would
> be represented as, if not "create a PROCLOCK entry attached to it"?

Update the "granted" array and, if necessary, the grantMask.

> In any case, I think this is another example of not understanding where
> the costs really are.

Possible.

> As far as I can tell, on modern MP systems much
> of the elapsed time in these operations comes from acquiring exclusive
> access to shared-memory cache lines. Reducing the number of changes you
> have to make within a small area of shared memory won't save much, once
> you've paid for the first one.

Seems reasonable.

> Changing structures that aren't heavily
> contended (such as a proc's list of its own locks) doesn't cost much at
> all.

I'm not sure where you're getting the idea that a proc's list of its own locks isn't heavily contended. That could be true, but it isn't obvious to me. We allocate PROCLOCK structures out of a shared hash table while holding the lock manager partition lock, and we add every lock to a queue associated with the PROC and a second queue associated with the LOCK. So if two processes acquire an AccessShareLock on the same table, both the LOCK object and at least the SHM_QUEUE portions of each PROCLOCK are shared, and those aren't necessarily nearby in memory.

> One thing that might be interesting, but that I don't know how to attack
> in a reasonably machine-independent way, is to try to ensure that shared
> and local data structures don't accidentally overlap within cache lines.
> When they do, you pay for fighting the cache line away from another
> processor even when there's no real need.

I'd be sort of surprised if this is a problem - as I understand it, cache lines are small, contiguous chunks, and surely the heap and the shared memory segment are mapped into different portions of the address space...

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Dec 7, 2010 at 1:08 PM, Ivan Voras <ivoras@freebsd.org> wrote:
> On 7 December 2010 18:37, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Mon, Dec 6, 2010 at 9:59 PM, Jignesh Shah <jkshah@gmail.com> wrote:
>>> That's exactly what I concluded when I was doing the sysbench simple
>>> read-only test. I had also tried with different lock partitions and it
>>> did not help, since they all go after the same table. I think one way
>>> to avoid the problem on the same table is to do more granular locking
>>> (maybe at page level instead of table level). But I don't really
>>> understand how to even create a prototype for this. If you can help
>>> create a prototype then I can test it out with my setup and see if it
>>> helps us to catch up with the other guys out there.
>>
>> We're trying to lock the table against a concurrent DROP or schema
>> change, so locking only part of it doesn't really work. I don't
>> really see any way to avoid needing some kind of a lock here; the
>> trick is how to take it quickly. The main obstacle to making this
>> faster is that the deadlock detector needs to be able to obtain enough
>> information to break cycles, which means we've got to record in shared
>> memory not only the locks that are granted but who has them.
>
> I'm not very familiar with PostgreSQL code, but if we're
> brainstorming... if you're only trying to protect against a small
> number of expensive operations (like DROP, etc.) that don't really
> happen often, wouldn't an atomic reference counter be good enough for
> the purpose (e.g. the expensive operations would spin-wait until the
> counter is 0)?

No, because (1) busy-waiting is only suitable for locks that will only be held for a short time, and an AccessShareLock on a table might be held while we read 10GB of data in from disk, and (2) that wouldn't allow for deadlock detection.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 7 December 2010 19:10, Robert Haas <robertmhaas@gmail.com> wrote:
>> I'm not very familiar with PostgreSQL code, but if we're
>> brainstorming... if you're only trying to protect against a small
>> number of expensive operations (like DROP, etc.) that don't really
>> happen often, wouldn't an atomic reference counter be good enough for
>> the purpose (e.g. the expensive operations would spin-wait until the
>> counter is 0)?
>
> No, because (1) busy-waiting is only suitable for locks that will only
> be held for a short time, and an AccessShareLock on a table might be
> held while we read 10GB of data in from disk,

Generally yes, but a variant with adaptive sleeping could possibly be used, if it would be acceptable to delay (by an uncertain amount) the already expensive and rare operations.

> and (2) that wouldn't
> allow for deadlock detection.

Probably :)
2010/12/7 Robert Haas <robertmhaas@gmail.com>:
> On Tue, Dec 7, 2010 at 1:08 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>> I'm not very familiar with PostgreSQL code, but if we're
>> brainstorming... if you're only trying to protect against a small
>> number of expensive operations (like DROP, etc.) that don't really
>> happen often, wouldn't an atomic reference counter be good enough for
>> the purpose (e.g. the expensive operations would spin-wait until the
>> counter is 0)?
>
> No, because (1) busy-waiting is only suitable for locks that will only
> be held for a short time, and an AccessShareLock on a table might be
> held while we read 10GB of data in from disk, and (2) that wouldn't
> allow for deadlock detection.
As far as I can see from the source, a lot of code is executed under the partition lock's protection, such as two hash searches (and possibly allocations).

What can be done is to increase the number of locks: one could use spin locks for the hash table manipulations, e.g. one lock that prevents rehashing (the number of buckets changing) and one lock per required bucket. In this case only a small range of code would be protected by the partition lock.

As for me, this would make the locking process more CPU-intensive (more locks acquired and freed during execution), but it would decrease contention (since all but one lock can be spin locks working on atomic counters, and hash searches can be done in parallel), won't it? The thing I am not sure about is how much spinlocks on atomic counters cost today.
--
Best regards,
Vitalii Tymchyshyn
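[A self-contained sketch of the two-level locking suggested above: one (omitted) lock that excludes rehashing, plus a spinlock per bucket, so lookups of different buckets proceed in parallel. All names are illustrative, and a real version would also have to protect the returned entry's lifetime.]

#include <stdatomic.h>

typedef struct
{
    atomic_flag locked;         /* initialize with ATOMIC_FLAG_INIT */
} spinlock_t;

static inline void
spin_lock(spinlock_t *s)
{
    while (atomic_flag_test_and_set(&s->locked))
        ;                       /* spin on an atomic flag */
}

static inline void
spin_unlock(spinlock_t *s)
{
    atomic_flag_clear(&s->locked);
}

typedef struct hash_entry hash_entry;
struct hash_entry
{
    unsigned    key;
    hash_entry *next;
};

typedef struct
{
    unsigned     nbuckets;      /* stable while the no-rehash lock is held */
    hash_entry **buckets;
    spinlock_t  *bucket_locks;  /* one lock per bucket */
} shared_hash;

/* Only the one bucket is locked; searches of other buckets, by other
 * backends, proceed in parallel. */
hash_entry *
hash_lookup(shared_hash *h, unsigned key)
{
    unsigned    b = key % h->nbuckets;
    hash_entry *e;

    spin_lock(&h->bucket_locks[b]);
    for (e = h->buckets[b]; e != NULL && e->key != key; e = e->next)
        ;
    spin_unlock(&h->bucket_locks[b]);
    return e;                   /* lifetime must be protected elsewhere */
}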
2010/12/7 Віталій Тимчишин <tivv00@gmail.com>:
> 2010/12/7 Robert Haas <robertmhaas@gmail.com>:
>> On Tue, Dec 7, 2010 at 1:08 PM, Ivan Voras <ivoras@freebsd.org> wrote:
>>> I'm not very familiar with PostgreSQL code, but if we're
>>> brainstorming... if you're only trying to protect against a small
>>> number of expensive operations (like DROP, etc.) that don't really
>>> happen often, wouldn't an atomic reference counter be good enough for
>>> the purpose (e.g. the expensive operations would spin-wait until the
>>> counter is 0)?
>>
>> No, because (1) busy-waiting is only suitable for locks that will only
>> be held for a short time, and an AccessShareLock on a table might be
>> held while we read 10GB of data in from disk, and (2) that wouldn't
>> allow for deadlock detection.
>
> What can be done is to increase the number of locks: one could use
> spin locks for the hash table manipulations, e.g. one lock that prevents
> rehashing (the number of buckets changing) and one lock per required
> bucket. In this case only a small range of code would be protected by
> the partition lock.
> As for me, this would make the locking process more CPU-intensive (more
> locks acquired and freed during execution), but it would decrease
> contention (since all but one lock can be spin locks working on atomic
> counters, and hash searches can be done in parallel), won't it?

For what it's worth, this is pretty much the opposite of what I had in mind. I proposed atomic reference counters (as others pointed out, this probably won't work) as a poor man's version of shared-exclusive locks, so that most operations would not have to contend on them.
2010/12/7 Віталій Тимчишин <tivv00@gmail.com>:
> As far as I can see from the source, a lot of code is executed under
> the partition lock's protection, such as two hash searches (and possibly
> allocations).

Yeah, that was my concern, too, though Tom seems skeptical (perhaps rightly). And I'm not really sure why the PROCLOCKs need to be in a hash table anyway - if we know the PROC and LOCK we can surely look up the PROCLOCK pretty expensively by following the PROC SHM_QUEUE.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
2010/12/7 Robert Haas <robertmhaas@gmail.com>:
> 2010/12/7 Віталій Тимчишин <tivv00@gmail.com>:
>> As far as I can see from the source, a lot of code is executed under
>> the partition lock's protection, such as two hash searches (and possibly
>> allocations).
>
> Yeah, that was my concern, too, though Tom seems skeptical (perhaps
> rightly). And I'm not really sure why the PROCLOCKs need to be in a
> hash table anyway - if we know the PROC and LOCK we can surely look up
> the PROCLOCK pretty expensively by following the PROC SHM_QUEUE.

Err, pretty INexpensively.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
>> Yeah, that was my concern, too, though Tom seems skeptical (perhaps
>> rightly). And I'm not really sure why the PROCLOCKs need to be in a
>> hash table anyway - if we know the PROC and LOCK we can surely look up
>> the PROCLOCK pretty expensively by following the PROC SHM_QUEUE.
>
> Err, pretty INexpensively.

There are plenty of scenarios in which a proc might hold hundreds or even thousands of locks. pg_dump, for example. You do not want to be doing seq search there.

Now, it's possible that you could avoid *ever* needing to search for a specific PROCLOCK, in which case eliminating the hash calculation overhead might be worth it. Of course, you'd still have to replicate all the space-management functionality of a shared hash table.

regards, tom lane
2010/12/8 Tom Lane <tgl@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
>>> Yeah, that was my concern, too, though Tom seems skeptical (perhaps
>>> rightly). And I'm not really sure why the PROCLOCKs need to be in a
>>> hash table anyway - if we know the PROC and LOCK we can surely look up
>>> the PROCLOCK pretty expensively by following the PROC SHM_QUEUE.
>
>> Err, pretty INexpensively.
>
> There are plenty of scenarios in which a proc might hold hundreds or
> even thousands of locks. pg_dump, for example. You do not want to be
> doing seq search there.
>
> Now, it's possible that you could avoid *ever* needing to search for a
> specific PROCLOCK, in which case eliminating the hash calculation
> overhead might be worth it.

That seems like it might be feasible. The backend that holds the lock ought to be able to find out whether there's a PROCLOCK by looking at the LOCALLOCK table, and the LOCALLOCK has a pointer to the PROCLOCK. It's not clear to me whether there's any other use case for doing a lookup for a particular combination of PROC A + LOCK B, but I'll have to look at the code more closely.

> Of course, you'd still have to replicate
> all the space-management functionality of a shared hash table.

Maybe we ought to revisit Markus Wanner's wamalloc. Although given our recent discussions, I'm thinking that you might want to try to design any allocation system so as to minimize cache line contention. For example, you could hard-allocate each backend 512 bytes of dedicated shared memory in which to record the locks it holds. If it needs more, it allocates additional 512 byte chunks.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
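[A sketch of that hard-allocation idea; lock_record and shmem_alloc_chunk() are invented stand-ins, and the point is only that each backend writes lock records into cache lines no other backend ever touches.]

#include <stddef.h>

#define LOCK_CHUNK_SIZE 512

typedef struct
{
    void *lock;                 /* which lock is held */
    int   mode;                 /* in which mode */
} lock_record;

typedef struct lock_chunk lock_chunk;
struct lock_chunk
{
    lock_chunk *next;           /* overflow chain of extra 512-byte chunks */
    int         nused;
    lock_record records[(LOCK_CHUNK_SIZE - sizeof(lock_chunk *) - sizeof(int))
                        / sizeof(lock_record)];
};

/* Hypothetical shared-memory allocator handing out further chunks. */
extern lock_chunk *shmem_alloc_chunk(void);

/* Record a held lock in this backend's private chunk chain.  No other
 * backend writes these cache lines, so there is nothing to fight over;
 * a deadlock checker could still read them by walking every proc's chain. */
void
record_lock(lock_chunk *head, void *lock, int mode)
{
    lock_chunk *c = head;

    while (c->nused == (int) (sizeof(c->records) / sizeof(c->records[0])))
    {
        if (c->next == NULL)
            c->next = shmem_alloc_chunk();
        c = c->next;
    }
    c->records[c->nused].lock = lock;
    c->records[c->nused].mode = mode;
    c->nused++;
}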
Robert Haas <robertmhaas@gmail.com> writes:
> 2010/12/8 Tom Lane <tgl@sss.pgh.pa.us>:
>> Now, it's possible that you could avoid *ever* needing to search for a
>> specific PROCLOCK, in which case eliminating the hash calculation
>> overhead might be worth it.
>
> That seems like it might be feasible. The backend that holds the lock
> ought to be able to find out whether there's a PROCLOCK by looking at
> the LOCALLOCK table, and the LOCALLOCK has a pointer to the PROCLOCK.

Hm, that is a real good point. Those shared memory data structures predate the invention of the local lock tables, and I don't think we looked real hard at whether we should rethink the fundamental representation in shared memory given the additional local state. The issue though is whether any other processes ever need to look at a proc's PROCLOCKs. I think at least deadlock detection does.

regards, tom lane
2010/12/8 Tom Lane <tgl@sss.pgh.pa.us>:
> Robert Haas <robertmhaas@gmail.com> writes:
>> That seems like it might be feasible. The backend that holds the lock
>> ought to be able to find out whether there's a PROCLOCK by looking at
>> the LOCALLOCK table, and the LOCALLOCK has a pointer to the PROCLOCK.
>
> Hm, that is a real good point. Those shared memory data structures
> predate the invention of the local lock tables, and I don't think we
> looked real hard at whether we should rethink the fundamental
> representation in shared memory given the additional local state.
> The issue though is whether any other processes ever need to look
> at a proc's PROCLOCKs. I think at least deadlock detection does.

Sure, but it doesn't use the hash table to do it. All the PROCLOCKs for any given LOCK are in a linked list; we just walk it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company