From: Andres Freund
Subject: Re: Wait free LW_SHARED acquisition - v0.2
Msg-id: 20140202140014.GM5930@awork2.anarazel.de
In response to: Re: Wait free LW_SHARED acquisition - v0.2 (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers
Hi,

On 2014-02-01 19:47:29 -0800, Peter Geoghegan wrote:
> Here are the results of a benchmark on Nathan Boley's 64-core, 4
> socket server:
> http://postgres-benchmarks.s3-website-us-east-1.amazonaws.com/amd-4-socket-rwlocks/

That's interesting. The maximum numbers you see here (~293125) are
markedly lower than what I can get.

... poke around ...

Hm, that's partially because you're using pgbench without -M prepared,
if I see that correctly. The bottleneck in that case is primarily
memory allocation. But even accounting for that, I am getting higher
numbers: ~342497.

Trying to nail down the difference, it oddly seems to be your
max_connections=80 vs my 100. The profile in both cases is markedly
different: way more spinlock contention with 80, all of it in
Pin/UnpinBuffer(). I think =80 has to lead to some data being badly
aligned. I can reproduce that =91 has *much* better performance than
=90: 170841.844938 vs 368490.268577 in a 10s test, reproducible both
with and without the test. That's certainly worth some investigation.

This is *not* reproducible on the Intel machine, so it might be
related to the associativity of the L1/L2 caches on the AMD.

> Perhaps I should have gone past 64 clients, because in the document
> "Lock Scaling Analysis on Intel Xeon Processors" [1], Intel write:
>
> "This implies that with oversubscription (more threads running than
> available logical CPUs), the performance of spinlocks can depend
> heavily on the exact OS scheduler behavior, and may change drastically
> with operating system or VM updates."
>
> I haven't bothered with higher client counts though, because Andres
> noted it's the same with 90 clients on this AMD system. Andres: Do you
> see big problems when # clients < # logical cores on the affected
> Intel systems?

There's some slowdown with the patch applied, but it's not big. Without
it, the slowdown starts much earlier.

> There is only a marginal improvement in performance on this big 4
> socket system. Andres informs me privately that he has reproduced the
> problem on multiple new 4-socket Intel servers, so it seems reasonable
> to suppose that it's more or less an Intel thing.

I've just poked around: it's not just 4-socket, but also 2-socket
systems.

Some background: the setups that triggered me into working on the
patchset didn't really have a pgbench-like workload; the individual
queries were/are more complicated, even though it's still a
high-throughput OLTP workload. And the contention was *much* higher
than what I can reproduce with pgbench -S: there was often nearly all
time spent in the lwlocks' spinlocks, and it was primarily the buffer
mapping lwlocks, locked in shared mode. The difference is that instead
of locking very few buffers per query like pgbench does, those queries
touched many more.
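To make it concrete why that access pattern hammers the buffer mapping
locks in shared mode, here is a minimal, self-contained illustration of
the lookup pattern, using pthread rwlocks as a stand-in for the
partitioned buffer-mapping lwlocks. This is only a sketch of the shape
of the code, not PostgreSQL's actual bufmgr/buf_table code, and the
partition count and names are arbitrary:

#include <pthread.h>

#define NUM_PARTITIONS 16               /* illustrative; PostgreSQL likewise splits the
                                         * mapping table into a fixed number of partitions */

typedef struct
{
    pthread_rwlock_t lock;              /* stand-in for one buffer-mapping partition lwlock */
    /* ... this partition's share of the page->buffer hash table ... */
} mapping_partition;

static mapping_partition partitions[NUM_PARTITIONS];

static void
init_partitions(void)
{
    for (int i = 0; i < NUM_PARTITIONS; i++)
        pthread_rwlock_init(&partitions[i].lock, NULL);
}

/* Look up which buffer (if any) currently holds a given page.  Lookups
 * only read the mapping, so the partition lock is taken in shared
 * mode; this is exactly the LW_SHARED traffic discussed above. */
static int
lookup_page(unsigned page_hash)
{
    mapping_partition *p = &partitions[page_hash % NUM_PARTITIONS];
    int buf_id = -1;

    pthread_rwlock_rdlock(&p->lock);
    /* ... hash table lookup for page_hash would go here ... */
    pthread_rwlock_unlock(&p->lock);

    return buf_id;
}

Every page touched by such a query goes through a lookup of this shape,
so with many concurrent backends the shared acquisitions themselves
become the hot spot.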
If you look at a profile of a pgbench -S -cj64 workload, it's pretty
much all bottlenecked by GetSnapshotData():

unpatched:
-  40.91%  postgres_plainl  postgres_plainlw    [.] s_lock
   - s_lock
      - 51.34% LWLockAcquire
           GetSnapshotData
         - GetTransactionSnapshot
            + 55.23% PortalStart
            + 44.77% PostgresMain
      - 48.66% LWLockRelease
           GetSnapshotData
         - GetTransactionSnapshot
            + 53.64% PostgresMain
            + 46.36% PortalStart
+   2.65%  pgbench          [kernel.kallsyms]   [k] try_to_wake_up
+   2.61%  postgres_plainl  postgres_plainlw    [.] LWLockRelease
+   1.36%  postgres_plainl  postgres_plainlw    [.] LWLockAcquire
+   1.25%  postgres_plainl  postgres_plainlw    [.] GetSnapshotData

patched:
-   2.94%  postgres  postgres            [.] LWLockAcquire
   - LWLockAcquire
      + 26.61% ReadBuffer_common
      + 17.08% GetSnapshotData
      + 10.71% _bt_relandgetbuf
      + 10.24% LockAcquireExtended
      + 10.11% VirtualXactLockTableInsert
      +  9.17% VirtualXactLockTableCleanup
      +  7.55% _bt_getbuf
      +  3.40% index_fetch_heap
      +  1.51% LockReleaseAll
      +  0.89% StartTransactionCommand
      +  0.83% _bt_next
      +  0.75% LockRelationOid
      +  0.52% ReadBufferExtended
-   2.75%  postgres  postgres            [.] _bt_compare
   - _bt_compare
      + 96.72% _bt_binsrch
      +  1.71% _bt_moveright
      +  1.29% _bt_search
-   2.67%  postgres  postgres            [.] GetSnapshotData
   - GetSnapshotData
      + 97.03% GetTransactionSnapshot
      +  2.16% PostgresMain
      +  0.81% PortalStart

So now the profile looks much saner/less contended, which is
immediately visible in the transaction rates: 192584.218530 vs
552834.002573.

But if you try to attack the contention from the other end, by setting
default_transaction_isolation='repeatable read' to reduce the number of
snapshots taken, it's suddenly 536789.807391 vs 566839.328922. A *much*
smaller benefit. The reason the patch doesn't help that much with that
setting is that there simply isn't as much actual contention there:

+   2.77%  postgres  postgres            [.] _bt_compare
-   2.72%  postgres  postgres            [.] LWLockAcquire
   - LWLockAcquire
      + 30.51% ReadBuffer_common
      + 12.45% VirtualXactLockTableInsert
      + 12.07% _bt_relandgetbuf
      + 10.61% VirtualXactLockTableCleanup
      +  9.95% LockAcquireExtended
      +  8.59% GetSnapshotData
      +  7.51% _bt_getbuf
      +  2.54% index_fetch_heap
      +  1.47% LockReleaseAll
      +  1.16% StartTransactionCommand
      +  0.95% LockRelationOid
      +  0.89% _bt_next
      +  0.77% ReadBufferExtended
-   2.19%  postgres  postgres            [.] s_lock
   - s_lock
      + 52.41% PinBuffer
      + 47.04% UnpinBuffer

While locking is still visible, it's not so extraordinarily dominant.
If you then profile using -e stalled-cycles-backend, you can see that
the actually problematic pipeline stalls are caused by Pin/UnpinBuffer.
So that's now the bottleneck. Since that path doesn't use lwlocks, it's
not surprising that an lwlock patch doesn't bring as much benefit.
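For context on why Pin/UnpinBuffer shows up as s_lock time and pipeline
stalls rather than lwlock time: roughly speaking, the buffer header's
refcount is currently manipulated under a plain per-buffer spinlock. A
rough, self-contained sketch of that shape (an illustration only, not
the actual bufmgr code; names and layout are made up):

#include <stdatomic.h>

typedef struct
{
    atomic_flag lock;       /* stand-in for the per-buffer header spinlock;
                             * initialize with ATOMIC_FLAG_INIT */
    int         refcount;   /* how many backends currently pin the buffer */
} buf_header_sketch;

static void
pin_buffer_sketch(buf_header_sketch *buf)
{
    /* Take the header spinlock: under contention the test-and-set keeps
     * bouncing the cacheline between sockets, which is what shows up as
     * s_lock time and stalled cycles in the profiles above. */
    while (atomic_flag_test_and_set_explicit(&buf->lock, memory_order_acquire))
        ;                   /* spin */
    buf->refcount++;
    atomic_flag_clear_explicit(&buf->lock, memory_order_release);
}

static void
unpin_buffer_sketch(buf_header_sketch *buf)
{
    while (atomic_flag_test_and_set_explicit(&buf->lock, memory_order_acquire))
        ;
    buf->refcount--;
    atomic_flag_clear_explicit(&buf->lock, memory_order_release);
}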
> So, I think it's fair to say, given what we now know from Andres'
> numbers and the numbers I got from Nathan's server, that this patch is
> closer to being something that addresses a particularly unfortunate
> pathology on many-socket Intel systems than it is to being a general
> performance optimization.

I think one would see bigger improvements if we had a testcase that
also caused real lock contention on the AMDs. If you look at a profile
there:

unpatched, -c default_transaction_isolation=read committed:
-   3.64%  postgres_plainl  postgres_plainlw    [.] s_lock
   - s_lock
      + 51.49% LWLockAcquire
      + 45.19% LWLockRelease
      +  1.73% PinBuffer
      +  1.59% UnpinBuffer
-   3.46%  postgres_plainl  postgres_plainlw    [.] LWLockAcquire
   - LWLockAcquire
      + 36.67% GetSnapshotData
      + 24.61% ReadBuffer_common
      + 11.45% _bt_relandgetbuf
      +  6.78% _bt_getbuf
      +  5.99% LockAcquireExtended
      +  5.27% VirtualXactLockTableInsert
      +  5.00% VirtualXactLockTableCleanup
      +  1.52% _bt_next
      +  1.31% index_fetch_heap
      +  0.76% LockReleaseAll
-   2.90%  postgres_plainl  postgres_plainlw    [.] GetSnapshotData
   - GetSnapshotData
      + 97.40% GetTransactionSnapshot
      +  2.31% PostgresMain

unpatched, -c default_transaction_isolation=repeatable read:
-   2.78%  postgres_plainl  postgres_plainlw    [.] LWLockAcquire
   - LWLockAcquire
      + 35.41% ReadBuffer_common
      + 15.12% _bt_relandgetbuf
      + 11.78% GetSnapshotData
      +  9.29% _bt_getbuf
      +  7.36% VirtualXactLockTableInsert
      +  7.25% LockAcquireExtended
      +  6.67% VirtualXactLockTableCleanup
      +  1.98% index_fetch_heap
      +  1.88% _bt_next
      +  1.47% LockReleaseAll
      +  0.50% StartTransactionCommand
-   2.21%  postgres_plainl  postgres_plainlw    [.] hash_search_with_hash_value
   - hash_search_with_hash_value
      + 53.23% BufTableLookup
      + 13.92% LockAcquireExtended
      +  8.99% GetPortalByName
      +  6.79% RelationIdGetRelation
      +  4.98% FetchPreparedStatement
      +  4.43% CreatePortal
      +  3.61% PortalDrop
      +  2.46% RemoveLocalLock

There's obviously not even close to as much contention as in the Intel
case, which shows in the benchmark results: 440910.048200 vs.
473789.980486. So the lwlock patch cannot improve concurrency by the
degree it did for Intel, since there simply isn't much of a bottleneck.
But I'd be surprised if there aren't cases with much more prominent
bottlenecks; this just isn't one of them.

The changed lwlock algorithm imo is an *algorithmic* improvement, not
one for a particular architecture. The advantage is that locking a
lwlock which is primarily taken in shared mode will never need to wait
or loop.
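To illustrate the idea, here is a self-contained sketch of just the
principle, with a made-up bit layout and names; it is not the actual
lwlock.c change. The lock state lives in a single atomic word, and the
shared-mode fast path is one atomic fetch-add with no spinlock and no
retry loop; only if an exclusive holder is present does the caller back
out and fall back to the usual queue-and-sleep path:

#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define LW_VAL_EXCLUSIVE ((uint32_t) 1 << 31)   /* high bit: exclusive holder present */

typedef struct
{
    _Atomic uint32_t state;                     /* low bits: number of shared holders */
} lwlock_sketch;

/* Shared-mode fast path: a single atomic add, no spinlock, no loop. */
static bool
lwlock_try_acquire_shared(lwlock_sketch *lock)
{
    uint32_t oldstate = atomic_fetch_add(&lock->state, 1);

    if (oldstate & LW_VAL_EXCLUSIVE)
    {
        /* An exclusive holder beat us: undo the optimistic increment
         * and let the caller queue/sleep as before. */
        atomic_fetch_sub(&lock->state, 1);
        return false;
    }
    return true;
}

static void
lwlock_release_shared(lwlock_sketch *lock)
{
    atomic_fetch_sub(&lock->state, 1);
}

The exclusive path, queueing, and release-time wakeups obviously need
more care than this; the point is just that the shared path no longer
serializes on a spinlock.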
> Based on the above quoted passage, it isn't
> unreasonable to suppose other vendors or architectures could be
> affected, but that isn't in evidence. While I welcome the use of
> atomic operations in the context of LW_SHARED acquisition as general
> progress, I think that to the outside world my proposed messaging is
> more useful. It's not quite a bug-fix, but if you're using a
> many-socket Intel server, you're *definitely* going to want to use a
> PostgreSQL version that is unaffected. You may well not want to take
> on the burden of waiting for 9.4, or waiting for it to fully
> stabilize.
>
> I note that Andres has a feature branch of this backported to Postgres
> 9.2, no doubt because of a request from a 2ndQuadrant customer. I have
> to wonder if we should think about making this available with a
> configure switch in one or more back branches.

Yes, that branch is used by some of them.

But to make that clear to everyone still reading: I *first* presented
the patch & findings to -hackers and *then* backported it, and I have
referenced the existence of the 9.2 patch on-list before. This isn't
some kind of "secret sauce" deal...

Note that there is one significant difference between the 9.2 and HEAD
versions: the former directly uses gcc intrinsics, without any form of
checks or abstraction across compilers.

> I think that the
> complete back-porting of the fsync request queue issue's fix in commit
> 758728 could be considered a precedent - that too was a fix for a
> really bad worst-case that was encountered fairly infrequently in the
> wild. It's sort of horrifying to have red-hot spinlocks in production,
> so that seems like the kind of thing we should make an effort to
> address for those running multi-socket systems. Of those running
> Postgres on new multi-socket systems, the reality is that the majority
> are running on Intel hardware. Unfortunately, everyone knows that
> Intel will soon be the only game in town when it comes to high-end
> x86_64 servers, which contributes to my feeling that we need to target
> back branches. We should do something about the possible regression
> with older compilers using the fallback first, though.

That might be something to do later, as it *really* can hurt in
practice. We had one server go from load 240 to 11...

But I think we should first focus on getting the patch ready for
master; then we can see where it's going. At the very least I'd like to
split off the part modifying the current spinlocks to use the atomics,
as that seems far too invasive.

> It would be useful to know more about the nature of the problem that
> made such an appreciable difference in Andres' original post.

I unfortunately can't tell you that much more, not because it's
private, but because it mostly was diagnosed by remote hand debugging,
which limits the insights considerably. What I *do* know is that the
bottleneck was entirely caused by the buffer mapping lwlocks, all taken
in shared mode according to call-graph profiles. The call graphs and
the queries I have seen indicated that lots of the frequent queries
involved nested loops over a not inconsiderable number of pages. We
also tried increasing the number of buffer mapping locks, but that
didn't prove to be very helpful.

Hm, that went on a bit...

Greetings,

Andres Freund

--
 Andres Freund                     http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services