Thread: MVCC catalog access
We've had a number of discussions about the evils of SnapshotNow. As far as I can tell, nobody likes it and everybody wants it gone, but there is concern about the performance impact. I decided to do some testing to measure that impact, and I was pleasantly surprised by the results.

The attached patch is a quick hack to provide for MVCC catalog access. It adds a GUC called "mvcc_catalog_access". When this GUC is set to true and heap_beginscan() or index_beginscan() is called with SnapshotNow, they call GetLatestSnapshot() and use the resulting snapshot in lieu of SnapshotNow. As a debugging double-check, I modified HeapTupleSatisfiesNow to elog(FATAL) if called while mvcc_catalog_access is true. Of course, both of these are dirty hacks. If we were actually to implement MVCC catalog access, I think we'd probably just go through and replace instances of SnapshotNow with GetLatestSnapshot(), but I wanted to make it easy to do perf testing.

When I first made this change, I couldn't detect any real difference; indeed, make check seemed to run ever-so-slightly faster than before, although that may well have been a testing artifact. I wrote a test case that created a schema with 100,000 functions in it and then dropped the schema (I believe it was Tom who previously suggested this as a worst-case scenario for MVCC catalog access). That didn't seem to be adversely affected either, even though it must take ~700k additional MVCC snapshots with mvcc_catalog_access = true.

MVCC Off: Create 8743.101 ms, Drop 9655.471 ms
MVCC On: Create 7462.882 ms, Drop 9515.537 ms
MVCC Off: Create 7519.160 ms, Drop 9380.905 ms
MVCC On: Create 7517.382 ms, Drop 9394.857 ms

The first "Create" seems to be artificially slow because of some kind of backend startup overhead; I'm not sure exactly what.
After wracking my brain for a few minutes, I realized that the lack of any apparent performance regression was probably due to the lack of any concurrent connections, which makes the scans of the PGXACT array very cheap. So I wrote a little program to open a bunch of extra connections. My MacBook Pro grumbled when I tried to open more than about 600, so I had to settle for that number. That was enough to show the cost of all those extra snapshots:

MVCC Off: Create 9065.887 ms, Drop 9599.494 ms
MVCC On: Create 8384.065 ms, Drop 10532.909 ms
MVCC Off: Create 7632.197 ms, Drop 9499.502 ms
MVCC On: Create 8215.443 ms, Drop 10033.499 ms

Now, I don't know about you, but I'm having a hard time getting agitated about those numbers. Most people are not going to drop 100,000 objects in a single cascaded drop, and most people are not going to have 600 connections open when they do. (The snapshot overhead should be roughly proportional to the product of the number of drops and the number of open connections, and the number of cases where that product is as high as 60 million has got to be pretty small.) But suppose someone is in that situation. Well, then they will take a... 10% performance penalty? That sounds plenty tolerable to me if it means we can start moving in the direction of allowing some concurrent DDL.

Am I missing an important test case here? Are these results worse than I think they are? Did I botch this testing somehow?

[MVCC catalog access patch, test program to create lots of idle connections, and pg_depend stress test case attached.]

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 21, 2013 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> [ MVCC catalog access seems to be pretty cheap ]

In discussions today, Tom Lane suggested testing the time to start up a backend and run a simple query such as "SELECT 2+2" in the absence of a relcache file. I did this and can't measure any overhead as a result of MVCC catalog access. I tried it with no active connections, and with 600 idle connections (to make taking MVCC snapshots more expensive). I couldn't quite believe it made no difference, so I tried doing it in a tight loop under pgbench. I still can't measure any difference. I haven't tested carefully enough to rule out the possibility of an effect <1/2% at 600 connections, but there certainly isn't anything bigger than that, and I don't even think there's that much of a difference.

Andres Freund suggested creating a couple of simple tables and having lots of short-lived backends select data from them:

rhaas=# create table af1 (x) as select g from generate_series(1,4) g;
SELECT 4
rhaas=# create table af2 (x) as select g from generate_series(4,7) g;
SELECT 4

Test query: SELECT * FROM af1, af2 WHERE af1.x = af2.x;
pgbench command: pgbench -T 10 -c 50 -j 50 -n -f f -C

With mvcc_catalog_access=off, I get ~1553 tps; with it on, I get ~1557 tps. Hmm... that could be because of the two-line debugging hunk my patch adds to HeapTupleSatisfiesNow(). After removing that, I get maybe a 1% regression with mvcc_catalog_access=on on this test, but it's not very consistent. If I restart the database server a few times, the overhead bounces around each time, and sometimes it's zero; the highest I saw is 1.4%. But it's not much, and this is a pretty brutal workload for PostgreSQL, since starting up >1500 connections per second is not a workload for which we're well-suited in the first place.

All in all, I'm still feeling optimistic.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-05-22 22:51:13 -0400, Robert Haas wrote:
> In discussions today, Tom Lane suggested testing the time to start up
> a backend and run a simple query such as "SELECT 2+2" in the absence
> of a relcache file.
>
> I did this and can't measure any overhead as a result of MVCC catalog
> access. I tried it with no active connections. I tried it with 600
> idle active connections (to make taking MVCC snapshots more
> expensive).

Did you try it with the 600 transactions actually being in a transaction and having acquired a snapshot?

> All in all, I'm still feeling optimistic.

+1. I still feel like this has to be much harder, since we made it out to be hard for a long time ;)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 22, 2013 at 11:02 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-05-22 22:51:13 -0400, Robert Haas wrote:
>> In discussions today, Tom Lane suggested testing the time to start up
>> a backend and run a simple query such as "SELECT 2+2" in the absence
>> of a relcache file.
>>
>> I did this and can't measure any overhead as a result of MVCC catalog
>> access. I tried it with no active connections. I tried it with 600
>> idle active connections (to make taking MVCC snapshots more
>> expensive).
>
> Did you try it with the 600 transactions actually being in a transaction
> and having acquired a snapshot?

No... I can hack something up for that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-05-22 23:05:40 -0400, Robert Haas wrote:
> On Wed, May 22, 2013 at 11:02 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Did you try it with the 600 transactions actually being in a transaction
>> and having acquired a snapshot?
>
> No... I can hack something up for that.

Make that actually having acquired an xid. We skip a large part of the work if a transaction doesn't yet have one, AFAIR. I don't think the mere presence of 600 idle connections without an xid, in contrast to just having max_connections at 600, should actually make a difference in the cost of acquiring a snapshot?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 22, 2013 at 11:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Make that actually having acquired an xid. We skip a large part of the
> work if a transaction doesn't yet have one afair. I don't think the mere
> presence of 600 idle connections without an xid in contrast to just
> having max_connection at 600 should actually make a difference in the
> cost of acquiring a snapshot?

Attached is a slightly updated version of the patch I'm using for testing, and an updated version of the pg_cxn source that I'm using to open lots of connections. With this version, I can do this:

./pg_cxn -n 600 -c BEGIN -c 'SELECT txid_current()'

...which I think is sufficient to make sure all those transactions have XIDs. Then I reran the "depend" test case (create a schema with 100,000 functions and then drop the schema with CASCADE) that I mentioned in my original posting. Here are the results:

MVCC Off: Create 8685.662 ms, Drop 9973.233 ms
MVCC On: Create 7931.039 ms, Drop 10189.189 ms
MVCC Off: Create 7810.084 ms, Drop 9594.580 ms
MVCC On: Create 8854.577 ms, Drop 10240.024 ms

OK, let's try the rebuild-the-relcache test using the same pg_cxn scenario (600 connections that have started a transaction and selected txid_current()).

[rhaas ~]$ time for s in `seq 1 1000`; do rm -f pgdata/global/pg_internal.init && psql -c 'SELECT 2+2' >/dev/null; done

MVCC catalog access on:
real 0m11.006s
user 0m2.746s
sys 0m2.664s

MVCC catalog access off:
real 0m10.583s
user 0m2.745s
sys 0m2.661s

MVCC catalog access on:
real 0m10.646s
user 0m2.750s
sys 0m2.661s

MVCC catalog access off:
real 0m10.823s
user 0m2.756s
sys 0m2.681s

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Perhaps we see little difference in performance because PGPROC has been separated into PGPROC and PGXACT, reducing lock contention when getting snapshot data?
By the way, I grabbed a 32-core machine and did some more performance tests, with some open connections holding XIDs (assigned using the pg_cxn v2 given by Robert in his previous mail) to make sure that the snapshots get pretty large. The read test:

$ time for s in `seq 1 1000`
> do
> rm -f ~/bin/pgsql/master/global/pg_internal.init && psql -c 'SELECT 2+2' >/dev/null;
> done
And then the create/drop test.
I have done those tests with 250, 500, 1000 and 2000 open connections:
1) 250 open connections
1-1) read test
Round 1:
mvcc_catalog_access off:
real 0m9.124s
user 0m0.200s
sys 0m0.392s
mvcc_catalog_access on:
real 0m9.297s
user 0m0.148s
sys 0m0.444s
Round 2:
mvcc_catalog_access off:
real 0m8.985s
user 0m0.160s
sys 0m0.372s
mvcc_catalog_access on:
real 0m9.244s
user 0m0.240s
sys 0m0.400s
1-2) DDL test (drop and creation of 100,000 objects)
mvcc off: Create: 24554.849, Drop: 29755.146
mvcc on: Create: 26904.755, Drop: 32891.556
mvcc off: Create: 23337.342, Drop: 29921.990
mvcc on: Create: 24533.708, Drop: 31670.840
2) 500 open connections
2-1) read test
Round 1:
mvcc_catalog_access off:
real 0m9.123s
user 0m0.200s
sys 0m0.396s
mvcc_catalog_access on:
real 0m9.627s
user 0m0.156s
sys 0m0.460s
Round 2:
mvcc_catalog_access off:
real 0m9.221s
user 0m0.316s
sys 0m0.392s
mvcc_catalog_access on:
real 0m9.592s
user 0m0.160s
sys 0m0.484s
2-2) DDL test (drop and creation of 100,000 objects)
mvcc off: Create: 25872.886, Drop: 31723.921
mvcc on: Create: 27076.429, Drop: 33674.336
mvcc off: Create: 24039.456, Drop: 30434.019
mvcc on: Create: 29105.713, Drop: 33821.170
3) 1000 open connections
3-1) read test
Round 1:
mvcc_catalog_access off:
real 0m9.240s
user 0m0.192s
sys 0m0.396s
mvcc_catalog_access on:
real 0m9.674s
user 0m0.236s
sys 0m0.440s
Round 2:
mvcc_catalog_access off:
real 0m9.302s
user 0m0.308s
sys 0m0.392s
mvcc_catalog_access on:
real 0m9.746s
user 0m0.204s
sys 0m0.436s
3-2) DDL test (drop and creation of 100,000 objects)
mvcc off: Create: 25563.705, Drop: 31747.451
mvcc on: Create: 33281.246, Drop: 36618.166
mvcc off: Create: 28178.210, Drop: 30550.166
mvcc on: Create: 31849.825, Drop: 36831.245
4) 2000 open connections
4-1) read test
Round 1:
mvcc_catalog_access off:
real 0m9.066s
user 0m0.128s
sys 0m0.420s
mvcc_catalog_access on:
real 0m9.978s
user 0m0.168s
sys 0m0.412s
Round 2:
mvcc_catalog_access off:
real 0m9.113s
user 0m0.152s
sys 0m0.444s
mvcc_catalog_access on:
real 0m9.974s
user 0m0.176s
sys 0m0.436s
More or less the same results as previously, with a ~10% performance drop.
4-2) DDL test (drop and creation of 100,000 objects)
mvcc off: Create: 28708.095 ms, Drop: 32510.057 ms
mvcc on: Create: 39987.815 ms, Drop: 43157.006 ms
mvcc off: Create: 28409.853 ms, Drop: 31248.163 ms
mvcc on: Create: 41669.829 ms, Drop: 44645.109 ms
For the read tests, we see a performance drop of up to 10% at 2000 connections.
For the write tests, we see a performance drop of 9~10% at 250 connections, rising to roughly 30% at 2000 connections.
We rarely see users drop that many objects at once with so many open transactions; they would switch to a connection pooler before opening that many connections to the server. I am not sure that such a performance drop is acceptable as-is, but perhaps it is if we consider the functionality we gain thanks to MVCC catalogs.
--
Michael
On Sun, May 26, 2013 at 9:10 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Perhaps we see little difference in performance because PGPROC has been
> separated into PGPROC and PGXACT, reducing lock contention with getting
> snapshot data?
>
> By the way, I grabbed a 32-core machine and did some more performance tests
> with some open connections with XIDs assigned using pg_cxn v2 given by
> Robert in his previous mail to make sure that the snapshots get pretty
> large.

Thanks for checking this on another machine. It's interesting that you were able to measure a hit for relcache rebuild, whereas I was not, but it doesn't look horrible.

IMHO, we should press forward with this approach. Considering that these are pretty extreme test cases, I'm inclined to view the performance loss as acceptable. We've never really viewed DDL as something that needs to be micro-optimized, and there is ample testimony to that fact in the existing code and in the treatment of prior patches in this area. This is not to say that we want to go around willy-nilly making it slower, but I think there will be very few users for whom the number of microseconds it takes to create or drop an SQL object is performance-critical, especially when you consider that (1) the effect will be quite a bit less when the objects are tables, since in that case the snapshot cost will tend to be drowned out by the filesystem cost, and (2) people who don't habitually keep hundreds and hundreds of connections open - which hopefully most people don't - won't see the effect anyway. Against that, this removes the single largest barrier to allowing more concurrent DDL, a feature that I suspect will make a whole lot of people *very* happy.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 28, 2013 at 10:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> IMHO, we should press forward with this approach. Considering that
> these are pretty extreme test cases, I'm inclined to view the
> performance loss as acceptable. We've never really viewed DDL as
> something that needs to be micro-optimized, and there is ample
> testimony to that fact in the existing code and in the treatment of
> prior patches in this area. This is not to say that we want to go
> around willy-nilly making it slower, but I think there will be very
> few users for which the number of microseconds it takes to create or
> drop an SQL object is performance-critical, especially when you
> consider that (1) the effect will be quite a bit less when the objects
> are tables, since in that case the snapshot cost will tend to be
> drowned out by the filesystem cost and (2) people who don't habitually
> keep hundreds and hundreds of connections open - which hopefully most
> people don't - won't see the effect anyway. Against that, this
> removes the single largest barrier to allowing more concurrent DDL, a
> feature that I suspect will make a whole lot of people *very* happy.
+1.
So, I imagine that the next step would be to add a new Snapshot validation
level in tqual.h. Something like SnapshotMVCC? Then replace SnapshotNow
by SnapshotMVCC where it is required.
I am also seeing that SnapshotNow is used in places where we might not want to
have it changed. For example, shouldn't the autovacuum code path that retrieves
the database or table list be left unchanged?
--
Michael
On Thu, May 30, 2013 at 1:39 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> +1.

Here's a more serious patch for MVCC catalog access. This one involves more data copying than the last one, I think, because the previous version did not register the snapshots it took, which I think is not safe. So this needs to be re-tested for performance, which I have so far made no attempt to do.

It strikes me as rather unfortunate that the snapshot interface is designed in such a way as to require so much data copying. It seems we always take a snapshot by copying from PGXACT/PGPROC into CurrentSnapshotData or SecondarySnapshotData, and then copying data a second time from there to someplace more permanent. It would be nice to avoid that, at least in common cases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 4, 2013 at 3:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, May 30, 2013 at 1:39 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> +1.
> Here's a more serious patch for MVCC catalog access. This one
> involves more data copying than the last one, I think, because the
> previous version did not register the snapshots it took, which I think
> is not safe. So this needs to be re-tested for performance, which I
> have so far made no attempt to do.
>
> It strikes me as rather unfortunate that the snapshot interface is
> designed in such a way as to require so much data copying. It seems
> we always take a snapshot by copying from PGXACT/PGPROC into
> CurrentSnapshotData or SecondarySnapshotData, and then copying data a
> second time from there to someplace more permanent. It would be nice
> to avoid that, at least in common cases.
And here are more results comparing master branch with and without this patch...
1) DDL CREATE/DROP test:
1-1) master:
250 connections:
Create: 24846.060, Drop: 30391.713
Create: 23771.394, Drop: 29769.396
500 connections:
Create: 24339.449, Drop: 30084.741
Create: 24152.176, Drop: 30643.471
1000 connections:
Create: 26007.960, Drop: 31019.918
Create: 25937.592, Drop: 30600.341
2000 connections:
Create: 26900.324, Drop: 30741.989
Create: 26910.660, Drop: 31577.247
1-2) mvcc catalogs:
250 connections:
Create: 25371.342, Drop: 31458.952
Create: 25685.094, Drop: 31492.377
500 connections:
Create: 28557.882, Drop: 33673.266
Create: 27901.910, Drop: 33223.006
1000 connections:
Create: 31910.130, Drop: 36770.062
Create: 32210.093, Drop: 36754.888
2000 connections:
Create: 40374.754, Drop: 43442.528
Create: 39763.691, Drop: 43234.243
2) backend startup test:
2-1) master branch:
250 connections:
real 0m8.993s
user 0m0.128s
sys 0m0.380s
500 connections:
real 0m9.004s
user 0m0.212s
sys 0m0.340s
1000 connections:
real 0m9.072s
user 0m0.272s
sys 0m0.332s
2000 connections:
real 0m9.257s
user 0m0.204s
sys 0m0.392s
2-2) MVCC catalogs:
250 connections:
real 0m9.067s
user 0m0.108s
sys 0m0.396s
500 connections:
real 0m9.034s
user 0m0.112s
sys 0m0.376s
1000 connections:
real 0m9.303s
user 0m0.176s
sys 0m0.328s
2000 connections
real 0m9.916s
user 0m0.160s
sys 0m0.428s
Except for the backend startup test with 500 connections, which looks to have some noise, performance degradation reaches 6% for 2000 connections and less than 1% for 250 connections. This is better than last time.
For the CREATE/DROP case, the performance drop reaches 40% for 2000 connections (32% during the last tests). I also noticed a lower performance drop for 250 connections now (3~4%) compared to the first run (9%).
I compiled the main results on tables here:
http://michael.otacoo.com/postgresql-2/postgres-9-4-devel-mvcc-catalog-access-take-2-2/
The results of last time are also available here:
http://michael.otacoo.com/postgresql-2/postgres-9-4-devel-mvcc-catalog-access-2/
Regards,
--
Michael
On Wed, May 22, 2013 at 3:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> We've had a number of discussions about the evils of SnapshotNow. As
> far as I can tell, nobody likes it and everybody wants it gone, but
> there is concern about the performance impact.

I was always under the impression that the problem was that we weren't quite sure what changes would be needed to make MVCC snapshots work for the catalog lookups. The semantics of SnapshotNow aren't terribly clear either, but we have years of experience telling us they seem to basically work; most of the problems we've run into, we have worked around in the catalog accesses. Nobody really knows how many of the call sites will need different logic to behave properly with MVCC snapshots.

I thought there were many call sites that were specifically depending on seeing dirty reads to avoid race conditions with other backends -- which probably just narrowed the race conditions or created different ones. If you clean those all up, it will probably be cleaner and better, but we don't know how many such sites will need to be modified. I'm not even sure what "clean them up" means. You can replace checks with things like constraints and locks, but the implementation of constraints and locks will still need to use SnapshotNow, surely?

--
greg
On 06/05/2013 04:28 PM, Greg Stark wrote:
> I was always under the impression that the problem was we weren't
> quite sure what changes would be needed to make mvcc-snapshots work
> for the catalog lookups. The semantics of SnapshotNow aren't terribly
> clear either but we have years of experience telling us they seem to
> basically work. Most of the problems we've run into we either have
> worked around in the catalog accesses. Nobody really knows how many of
> the call sites will need different logic to behave properly with mvcc
> snapshots.
>
> I thought there were many call sites that were specifically depending
> on seeing dirty reads to avoid race conditions with other backends --
> which probably just narrowed the race condition or created different
> ones. If you clean those all up it will be probably be cleaner and
> better but we don't know how many such sites will need to be modified.
> I'm not even sure what "clean them up" means. You can replace checks
> with things like constraints and locks but the implementation of
> constraints and locks will still need to use SnapshotNow surely?

I guess that anything that does *not* write should be happier with MVCC catalogs, especially if there has been any DDL after its snapshot. For writers, the ability to compare MVCC and SnapshotNow snapshots would tell them whether they need to take extra steps. But undoubtedly the whole thing would be a lot of work.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
On 2013-06-05 15:28:09 +0100, Greg Stark wrote:
> On Wed, May 22, 2013 at 3:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> We've had a number of discussions about the evils of SnapshotNow. As
>> far as I can tell, nobody likes it and everybody wants it gone, but
>> there is concern about the performance impact.
>
> I thought there were many call sites that were specifically depending
> on seeing dirty reads to avoid race conditions with other backends --
> which probably just narrowed the race condition or created different
> ones.

But SnapshotNow doesn't allow you to do actual dirty reads? It only gives you rows back that were actually visible when we checked. The difference from an MVCC snapshot is that, during a scan, the picture of which transactions are committed can change.

> I'm not even sure what "clean them up" means. You can replace checks
> with things like constraints and locks but the implementation of
> constraints and locks will still need to use SnapshotNow surely?

The places that require this should already use HeapTupleSatisfiesDirty, which is something different.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-06-05 15:28:09 +0100, Greg Stark wrote:
>> I thought there were many call sites that were specifically depending
>> on seeing dirty reads to avoid race conditions with other backends --
>> which probably just narrowed the race condition or created different
>> ones.
>
> But SnapshotNow doesn't allow you to do actual dirty reads?

Yeah. I believe the issue is that we can't simply do MVCC catalog reads with a snapshot taken at transaction start time or statement start time, as we would do if executing an MVCC scan for a user query. Rather, the snapshot has to be recent enough to ensure we see the current definition of any table we've just acquired lock on, *even if that's newer than the snapshot prevailing for the user's purposes*. Otherwise we might be using the wrong rowtype definition or failing to enforce a just-added constraint.

The last time we talked about this, we were batting around ideas of keeping a "current snapshot for catalog purposes", which we'd update or at least invalidate anytime we acquired a new lock. (In principle, if that isn't new enough, we have a race condition that we'd better fix by adding some more locking.) Robert's results seem to say that that might be unnecessary optimization, and that it'd be sufficient to just take a new snap each time we need to do a catalog scan. TBH I'm not sure I believe that; it seems to me that this approach is surely going to create a great deal more contention from concurrent GetSnapshotData calls. But at the very least, this says we can experiment with the behavioral aspects without bothering to build infrastructure for tracking an appropriate catalog snapshot.

regards, tom lane
On 2013-06-05 11:35:58 -0400, Tom Lane wrote:
> Yeah. I believe the issue is that we can't simply do MVCC catalog reads
> with a snapshot taken at transaction start time or statement start time,
> as we would do if executing an MVCC scan for a user query. Rather, the
> snapshot has to be recent enough to ensure we see the current definition
> of any table we've just acquired lock on, *even if that's newer than the
> snapshot prevailing for the user's purposes*. Otherwise we might be
> using the wrong rowtype definition or failing to enforce a just-added
> constraint.

Oh, definitely. At least Robert's previous prototype tried to do that (although I am not sure if it went far enough), and I'd be surprised if the current one didn't do so as well.

> The last time we talked about this, we were batting around ideas of
> keeping a "current snapshot for catalog purposes", which we'd update
> or at least invalidate anytime we acquired a new lock. (In principle,
> if that isn't new enough, we have a race condition that we'd better fix
> by adding some more locking.) Robert's results seem to say that that
> might be unnecessary optimization, and that it'd be sufficient to just
> take a new snap each time we need to do a catalog scan. TBH I'm not
> sure I believe that; it seems to me that this approach is surely going
> to create a great deal more contention from concurrent GetSnapshotData
> calls.
I still have a hard time believing those results as well, but I think we might have underestimated the effectiveness of the syscache during workloads which are sufficiently concurrent to make locking in GetSnapshotData() a problem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jun 5, 2013 at 10:28 AM, Greg Stark <stark@mit.edu> wrote:
> I was always under the impression that the problem was we weren't
> quite sure what changes would be needed to make mvcc-snapshots work
> for the catalog lookups. The semantics of SnapshotNow aren't terribly
> clear either but we have years of experience telling us they seem to
> basically work. Most of the problems we've run into we either have
> worked around in the catalog accesses. Nobody really knows how many of
> the call sites will need different logic to behave properly with mvcc
> snapshots.

With all respect, I think this is just plain wrong. SnapshotNow is just like an up-to-date MVCC snapshot. The only difference is that an MVCC snapshot, once established, stays fixed for the lifetime of the scan. With SnapshotNow, on the other hand, the view of the world changes the instant another transaction commits, meaning that scans can see multiple versions of a row, or no versions of a row, where any MVCC scan would have seen just one. There are very few places that want that behavior.

Now, I did find a couple that I thought should probably stick with SnapshotNow, specifically pgrowlocks and pgstattuple. Those are just gathering statistical information, so there's no harm in having the snapshot change partway through the scan, and if the scan is long, the user might actually regard the results under SnapshotNow as more accurate. Whether that's the case or not, holding back xmin for those kinds of scans does not seem wise.

But in most other parts of the code, the changes-in-mid-scan behavior of SnapshotNow is a huge liability. The fact that it is fully up-to-date *as of the time the scan starts* is critical for correctness. But the fact that it can then change during the scan is, in almost every case, something that we do not want. The patch preserves the first property while ditching the second one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Now, I did find a couple that I thought should probably stick with > SnapshotNow, specifically pgrowlocks and pgstattuple. Those are just > gathering statistical information, so there's no harm in having the > snapshot change part-way through the scan, and if the scan is long, > the user might actually regard the results under SnapshotNow as more > accurate. Whether that's the case or not, holding back xmin for those > kinds of scans does not seem wise. FWIW, I think if we're going to ditch SnapshotNow we should ditch SnapshotNow, full stop, even removing the tqual.c routines for it. Then we can require that *any* reference to SnapshotNow is replaced by an MVCC reference prior to execution, and throw an error if we actually try to test a tuple with that snapshot. If we don't do it like that I think we'll have errors of omission all over the place (I had really no confidence in your original patch because of that worry). The fact that there are a couple of contrib modules for which there might be an arguable advantage in doing it the old way isn't sufficient reason to expose ourselves to bugs like that. If they really want that sort of uncertain semantics they could use SnapshotDirty, no? regards, tom lane
On Wed, Jun 5, 2013 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Now, I did find a couple that I thought should probably stick with >> SnapshotNow, specifically pgrowlocks and pgstattuple. Those are just >> gathering statistical information, so there's no harm in having the >> snapshot change part-way through the scan, and if the scan is long, >> the user might actually regard the results under SnapshotNow as more >> accurate. Whether that's the case or not, holding back xmin for those >> kinds of scans does not seem wise. > > FWIW, I think if we're going to ditch SnapshotNow we should ditch > SnapshotNow, full stop, even removing the tqual.c routines for it. > Then we can require that *any* reference to SnapshotNow is replaced by > an MVCC reference prior to execution, and throw an error if we actually > try to test a tuple with that snapshot. If we don't do it like that > I think we'll have errors of omission all over the place (I had really > no confidence in your original patch because of that worry). The fact > that there are a couple of contrib modules for which there might be an > arguable advantage in doing it the old way isn't sufficient reason to > expose ourselves to bugs like that. If they really want that sort of > uncertain semantics they could use SnapshotDirty, no? I had the same thought, initially. I went through the exercise of doing a grep for SnapshotNow and trying to eliminate as many references as possible, but there were a few that I couldn't convince myself to rip out. However, if you'd like to apply the patch and grep for SnapshotNow and suggest what to do about the remaining cases (or hack the patch up yourself) I think that would be great. I'd love to see it completely gone if we can see our way clear to that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-06-05 18:56:28 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > Now, I did find a couple that I thought should probably stick with > > SnapshotNow, specifically pgrowlocks and pgstattuple. Those are just > > gathering statistical information, so there's no harm in having the > > snapshot change part-way through the scan, and if the scan is long, > > the user might actually regard the results under SnapshotNow as more > > accurate. Whether that's the case or not, holding back xmin for those > > kinds of scans does not seem wise. > > FWIW, I think if we're going to ditch SnapshotNow we should ditch > SnapshotNow, full stop, even removing the tqual.c routines for it. > Then we can require that *any* reference to SnapshotNow is replaced by > an MVCC reference prior to execution, and throw an error if we actually > try to test a tuple with that snapshot. I suggest simply renaming it to something else. Every external project that decides to follow the renaming either has a proper use case for it, or the amount of sympathy for them keeping the old behaviour is rather limited. > If they really want that sort of uncertain semantics they could use > SnapshotDirty, no? Not that happy with that. A scan behaving inconsistently over its proceedings is something rather different than reading uncommitted rows. I have the feeling that trouble is waiting that way. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi Robert, Took a quick look through the patch to understand what your current revision is actually doing and to facilitate thinking about possible pain points. Here are the notes I made during my reading: On 2013-06-03 14:57:12 -0400, Robert Haas wrote: > +++ b/src/backend/catalog/catalog.c > @@ -232,6 +232,10 @@ IsReservedName(const char *name) > * know if it's shared. Fortunately, the set of shared relations is > * fairly static, so a hand-maintained list of their OIDs isn't completely > * impractical. > + * > + * XXX: Now that we have MVCC catalog access, the reasoning above is no longer > + * true. Are there other good reasons to hard-code this, or should we revisit > + * that decision? > */ We could just implement the function by looking in the shared relmapper. Everything that can be mapped via it is shared. > --- a/src/backend/commands/cluster.c > +++ b/src/backend/commands/cluster.c > @@ -480,6 +480,11 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, bool recheck, LOCKMOD > * against concurrent SnapshotNow scans of pg_index. Therefore this is unsafe > * to execute with less than full exclusive lock on the parent table; > * otherwise concurrent executions of RelationGetIndexList could miss indexes. > + * > + * XXX: Now that we have MVCC catalog access, SnapshotNow scans of pg_index > + * shouldn't be common enough to worry about. The above comment needs > + * to be updated, and it may be possible to simplify the logic here in other > + * ways also. > */ You're right, the comment needs to be changed, but I don't think the effect can. A non-inplace upgrade changes the xmin of the row which is relevant for indcheckxmin. (In fact, isn't this update possibly causing problems like delaying the use of such an index already?) > --- a/src/backend/commands/tablecmds.c > +++ b/src/backend/commands/tablecmds.c > @@ -2738,7 +2738,7 @@ AlterTableGetLockLevel(List *cmds) > * multiple DDL operations occur in a stream against frequently accessed > * tables. > * > - * 1. 
Catalog tables are read using SnapshotNow, which has a race bug that > + * 1. Catalog tables were read using SnapshotNow, which has a race bug that Heh. > --- a/src/backend/utils/time/snapmgr.c > +++ b/src/backend/utils/time/snapmgr.c > @@ -207,6 +207,19 @@ GetLatestSnapshot(void) > /* > + * GetInstantSnapshot > + * Get a snapshot that is up-to-date as of the current instant, > + * but don't set the transaction snapshot. > + */ > +Snapshot > +GetInstantSnapshot(void) > +{ > + SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData); > + > + return SecondarySnapshot; > +} Hm. Looks like this should also change the description of SecondarySnapshot: /* * CurrentSnapshot points to the only snapshot taken in transaction-snapshot * mode, and to the latest one taken in a read-committed transaction. * SecondarySnapshot is a snapshot that's always up-to-date as of the current * instant, even in transaction-snapshot mode. It should only be used for * special-purpose code (say, RI checking.) */ and /* * Checking SecondarySnapshot is probably useless here, but it seems * better to be sure. */ Also looks like BuildEventTriggerCache() in evtcache.c should use GetInstantSnapshot() now. I actually wonder if we shouldn't just abolish GetLatestSnapshot(). None of the callers seem to rely on its behaviour from a quick look and it seems rather confusing to have both. > --- a/src/bin/pg_dump/pg_dump.c > +++ b/src/bin/pg_dump/pg_dump.c > @@ -14,13 +14,13 @@ > * Note that pg_dump runs in a transaction-snapshot mode transaction, > * so it sees a consistent snapshot of the database including system > * catalogs. However, it relies in part on various specialized backend > - * functions like pg_get_indexdef(), and those things tend to run on > - * SnapshotNow time, ie they look at the currently committed state. So > - * it is possible to get 'cache lookup failed' error if someone > - * performs DDL changes while a dump is happening. 
The window for this > - * sort of thing is from the acquisition of the transaction snapshot to > - * getSchemaData() (when pg_dump acquires AccessShareLock on every > - * table it intends to dump). It isn't very large, but it can happen. > + * functions like pg_get_indexdef(), and those things tend to look at > + * the currently committed state. So it is possible to get 'cache > + * lookup failed' error if someone performs DDL changes while a dump is > + * happening. The window for this sort of thing is from the acquisition > + * of the transaction snapshot to getSchemaData() (when pg_dump acquires > + * AccessShareLock on every table it intends to dump). It isn't very large, > + * but it can happen. I think we need another term for what SnapshotNow used to express here... Imo this description got less clear with this change. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 6, 2013 at 5:30 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> + * XXX: Now that we have MVCC catalog access, the reasoning above is no longer >> + * true. Are there other good reasons to hard-code this, or should we revisit >> + * that decision? > > We could just implement the function by looking in the shared > relmapper. Everything that can be mapped via it is shared. I suspect there are several possible sources for this information, but it's hard to beat a hard-coded list for efficiency, so I wasn't sure if we should tinker with this or not. >> --- a/src/backend/commands/cluster.c >> +++ b/src/backend/commands/cluster.c >> @@ -480,6 +480,11 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, bool recheck, LOCKMOD >> * against concurrent SnapshotNow scans of pg_index. Therefore this is unsafe >> * to execute with less than full exclusive lock on the parent table; >> * otherwise concurrent executions of RelationGetIndexList could miss indexes. >> + * >> + * XXX: Now that we have MVCC catalog access, SnapshotNow scans of pg_index >> + * shouldn't be common enough to worry about. The above comment needs >> + * to be updated, and it may be possible to simplify the logic here in other >> + * ways also. >> */ > > You're right, the comment needs to be changed, but I don't think the > effect can. A non-inplace upgrade changes the xmin of the row which is > relevant for indcheckxmin. OK. > (In fact, isn't this update possibly causing problems like delaying the > use of such an index already) Well, maybe. In general, the ephemeral snapshot taken for a catalog scan can't be any older than the primary snapshot already held. But there could be some corner case where that's not true, if we use this technique somewhere that such a snapshot hasn't already been acquired. > Hm. 
Looks like this should also change the description of SecondarySnapshot: > > /* > * CurrentSnapshot points to the only snapshot taken in transaction-snapshot > * mode, and to the latest one taken in a read-committed transaction. > * SecondarySnapshot is a snapshot that's always up-to-date as of the current > * instant, even in transaction-snapshot mode. It should only be used for > * special-purpose code (say, RI checking.) > * I think that's still more or less true, though we could add catalog scans as another example. > and > /* > * Checking SecondarySnapshot is probably useless here, but it seems > * better to be sure. > */ Yeah, that is not useless any more, for sure. > Also looks like BuildEventTriggerCache() in evtcache.c should use > GetInstantSnapshot() now. OK. > I actually wonder if we shouldn't just abolish GetLatestSnapshot(). None > of the callers seem to rely on its behaviour from a quick look and it > seems rather confusing to have both. I assume Tom had some reason for making GetLatestSnapshot() behave the way it does, so I refrained from doing that. I might be wrong. > I think we need another term for what SnapshotNow used to express > here... Imo this description got less clear with this change. I thought it was OK but I'm open to suggestions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6/5/13 3:49 PM, Robert Haas wrote: > Now, I did find a couple that I thought should probably stick with > SnapshotNow, specifically pgrowlocks and pgstattuple. FWIW, I've often wished for a way to make all stat access transactional, across all the stats views. Perhaps that couldn't be done by default, but I'd love something like a function that would make a "snapshot" of all stats data as of one point. Even if that snapshot itself wasn't completely atomic, at least then you could query any stats views however you wanted and know that the info wasn't changing over time. The reason I don't think this would work so well if done in userspace is how long it would take. Presumably making a complete backend-local copy of pg_proc etc and the stats file would be orders of magnitude faster than a bunch of CREATE TEMP TABLE's. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Thu, Jun 6, 2013 at 2:48 PM, Jim Nasby <jim@nasby.net> wrote: > On 6/5/13 3:49 PM, Robert Haas wrote: >> >> Now, I did find a couple that I thought should probably stick with >> SnapshotNow, specifically pgrowlocks and pgstattuple. > > FWIW, I've often wished for a way to make all stat access transactional, > across all the stats views. Perhaps that couldn't be done by default, but > I'd love something like a function that would make a "snapshot" of all stats > data as of one point. Even if that snapshot itself wasn't completely atomic, > at least then you could query any stats views however you wanted and know > that the info wasn't changing over time. > > The reason I don't think this would work so well if done in userspace is how > long it would take. Presumably making a complete backend-local copy of > pg_proc etc and the stats file would be orders of magnitude faster than a > bunch of CREATE TEMP TABLE's. Well, maybe. But at any rate this is completely unrelated to the main topic of this thread. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-06-06 12:49:14 -0400, Robert Haas wrote: > On Thu, Jun 6, 2013 at 5:30 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> + * XXX: Now that we have MVCC catalog access, the reasoning above is no longer > >> + * true. Are there other good reasons to hard-code this, or should we revisit > >> + * that decision? > > > > We could just implement the function by looking in the shared > > relmapper. Everything that can be mapped via it is shared. > > I suspect there are several possible sources for this information, but > it's hard to beat a hard-coded list for efficiency, so I wasn't sure > if we should tinker with this or not. I can tell from experience that it makes adding a new shared relation more of a pain than it already is, but we're hopefully not doing that all that often. I just don't think that the mvcc angle has much to do with the decision. > >> --- a/src/backend/commands/cluster.c > >> +++ b/src/backend/commands/cluster.c > >> @@ -480,6 +480,11 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, bool recheck, LOCKMOD > >> * against concurrent SnapshotNow scans of pg_index. Therefore this is unsafe > >> * to execute with less than full exclusive lock on the parent table; > >> * otherwise concurrent executions of RelationGetIndexList could miss indexes. > >> + * > >> + * XXX: Now that we have MVCC catalog access, SnapshotNow scans of pg_index > >> + * shouldn't be common enough to worry about. The above comment needs > >> + * to be updated, and it may be possible to simplify the logic here in other > >> + * ways also. > >> */ > > > > You're right, the comment needs to be changed, but I don't think the > > effect can. A non-inplace upgrade changes the xmin of the row which is > > relevant for indcheckxmin. > > OK. > > > (In fact, isn't this update possibly causing problems like delaying the > > use of such an index already) > Well, maybe. 
In general, the ephemeral snapshot taken for a catalog > scan can't be any older than the primary snapshot already held. But > there could be some corner case where that's not true, if we use this > technique somewhere that such a snapshot hasn't already been acquired. I wasn't talking about catalog scans or this patch; I wonder whether the update we do there won't cause the index not being used for concurrent (normal) scans since now the xmin is newer while it might be far in the past before. I.e. we might need to unset indcheckxmin if xmin is far enough in the past. > > Hm. Looks like this should also change the description of SecondarySnapshot: > > > > /* > > * CurrentSnapshot points to the only snapshot taken in transaction-snapshot > > * mode, and to the latest one taken in a read-committed transaction. > > * SecondarySnapshot is a snapshot that's always up-to-date as of the current > > * instant, even in transaction-snapshot mode. It should only be used for > > * special-purpose code (say, RI checking.) > > * > > I think that's still more or less true, though we could add catalog > scans as another example. I guess my feeling is that once catalog scans use it, it's not so much special purpose anymore ;). But I admit that the frequency of usage doesn't say much about its specificity... > > I actually wonder if we shouldn't just abolish GetLatestSnapshot(). None > > of the callers seem to rely on its behaviour from a quick look and it > > seems rather confusing to have both. > > I assume Tom had some reason for making GetLatestSnapshot() behave the > way it does, so I refrained from doing that. I might be wrong. At least I don't see any on a quick look - which doesn't say very much. I think I just dislike having *instant and *latest functions in there, seems to be confusing to me. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-03 14:57:12 -0400, Robert Haas wrote: > On Thu, May 30, 2013 at 1:39 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > +1. > > Here's a more serious patch for MVCC catalog access. This one > involves more data copying than the last one, I think, because the > previous version did not register the snapshots it took, which I think > is not safe. So this needs to be re-tested for performance, which I > have so far made no attempt to do. Ok, I am starting to take a bit more serious look. Minor issues I noticed: * index.c:index_constraint_create() - comments need to get updated * index.c:IndexCheckExclusion() - why do we still use a SnapshotNow? I'd rather not use *Now if it isn't necessary. * the *CONCURRENTLY infrastructure should be simplified once this has been applied, but I think it makes sense to keep that separate. * index.c:reindex_index() - SnapshotNow comment should be updated I still think that renaming SnapshotNow to something like SnapshotPerTuple to force everyone to reevaluate their usage would be good. So, the biggest issue with the patch seems to be performance worries. 
I tried to create a worst case scenario: postgres (patched and HEAD) running with: -c shared_buffers=4GB \ -c max_connections=2000 \ -c maintenance_work_mem=2GB \ -c checkpoint_segments=300 \ -c wal_buffers=64MB \ -c synchronous_commit=off \ -c autovacuum=off \ -p 5440 With one background pgbench running: pgbench -p 5440 -h /tmp -f /tmp/readonly-busy.sql -c 1000 -j 10 -T 100 postgres readonly-busy.sql: BEGIN; SELECT txid_current(); SELECT pg_sleep(0.0001); COMMIT; I measured the performance of one other pgbench: pgbench -h /tmp -p 5440 postgres -T 10 -c 100 -j 100 -n -f /tmp/simplequery.sql -C simplequery.sql: SELECT * FROM af1, af2 WHERE af1.x = af2.x; tables: create table af1 (x) as select g from generate_series(1,4) g; create table af2 (x) as select g from generate_series(4,7) g; With that setup one can create quite a noticeable overhead for the mvcc patch (best of 5): master-optimize: tps = 1261.629474 (including connections establishing) tps = 15121.648834 (excluding connections establishing) dev-optimize: tps = 773.719637 (including connections establishing) tps = 2804.239979 (excluding connections establishing) Most of the time in both, patched and unpatched is by far spent in GetSnapshotData. 
I think the reason this shows a far higher overhead than what you previously measured is that a) in your test the other backends were idle, in mine they actually modify PGXACT which causes noticeable cacheline bouncing b) I have a higher number of connections & #max_connections A quick test shows that even with max_connection=600, 400 background, and 100 foreground pgbenches there's noticeable overhead: master-optimize: tps = 2221.226711 (including connections establishing) tps = 31203.259472 (excluding connections establishing) dev-optimize: tps = 1629.734352 (including connections establishing) tps = 4754.449726 (excluding connections establishing) Now I grant that's a somewhat harsh test for postgres, but I don't think it's entirely unreasonable and the performance impact is quite stark. > It strikes me as rather unfortunate that the snapshot interface is > designed in such a way as to require so much data copying. It seems > we always take a snapshot by copying from PGXACT/PGPROC into > CurrentSnapshotData or SecondarySnapshotData, and then copying data a > second time from there to someplace more permanent. It would be nice > to avoid that, at least in common cases. Sounds doable. But let's do one thing at a time ;). That copy wasn't visible in the rather extreme workload from above btw... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jun 17, 2013 at 8:12 AM, Andres Freund <andres@2ndquadrant.com> wrote: > So, the biggest issue with the patch seems to be performance worries. I > tried to create a worst case scenario: > postgres (patched and HEAD) running with: > -c shared_buffers=4GB \ > -c max_connections=2000 \ > -c maintenance_work_mem=2GB \ > -c checkpoint_segments=300 \ > -c wal_buffers=64MB \ > -c synchronous_commit=off \ > -c autovacuum=off \ > -p 5440 > > With one background pgbench running: > pgbench -p 5440 -h /tmp -f /tmp/readonly-busy.sql -c 1000 -j 10 -T 100 postgres > readonly-busy.sql: > BEGIN; > SELECT txid_current(); > SELECT pg_sleep(0.0001); > COMMIT; > > I measured the performance of one other pgbench: > pgbench -h /tmp -p 5440 postgres -T 10 -c 100 -j 100 -n -f /tmp/simplequery.sql -C > simplequery.sql: > SELECT * FROM af1, af2 WHERE af1.x = af2.x; > tables: > create table af1 (x) as select g from generate_series(1,4) g; > create table af2 (x) as select g from generate_series(4,7) g; > > With that setup one can create quite a noticeable overhead for the mvcc > patch (best of 5): > > master-optimize: > tps = 1261.629474 (including connections establishing) > tps = 15121.648834 (excluding connections establishing) > > dev-optimize: > tps = 773.719637 (including connections establishing) > tps = 2804.239979 (excluding connections establishing) > > Most of the time in both, patched and unpatched is by far spent in > GetSnapshotData. 
I think the reason this shows a far higher overhead > than what you previously measured is that a) in your test the other > backends were idle, in mine they actually modify PGXACT which causes > noticeable cacheline bouncing b) I have a higher number of connections & > #max_connections > > A quick test shows that even with max_connection=600, 400 background, > and 100 foreground pgbenches there's noticeable overhead: > master-optimize: > tps = 2221.226711 (including connections establishing) > tps = 31203.259472 (excluding connections establishing) > dev-optimize: > tps = 1629.734352 (including connections establishing) > tps = 4754.449726 (excluding connections establishing) > > Now I grant that's a somewhat harsh test for postgres, but I don't > think it's entirely unreasonable and the performance impact is quite > stark. It's not entirely unreasonable, but it *is* mostly unreasonable. I mean, nobody is going to run 1000 connections in the background that do nothing but thrash PGXACT on a real system. I just can't get concerned about that. What I am concerned about is that there may be other, more realistic workloads that show similar regressions. But I don't know how to find out whether that's actually the case. On the IBM POWER box where I tested this, it's not even GetSnapshotData() that kills you; it's the system CPU scheduler. The thing about this particular test is that it's artificial - normally, any operation that wants to modify PGXACT will spend a lot more time fighting over WALInsertLock and maybe waiting for disk I/O than is the case here. Of course, with Heikki's WAL scaling patch and perhaps other optimizations we expect that other overhead to go down, which might make the problems here more visible; and some of Heikki's existing testing has shown significant contention around ProcArrayLock as things stand. But I'm still on the fence about whether this is really a valid test. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-06-20 09:45:26 -0400, Robert Haas wrote: > > With that setup one can create quite a noticeable overhead for the mvcc > > patch (best of 5): > > > > master-optimize: > > tps = 1261.629474 (including connections establishing) > > tps = 15121.648834 (excluding connections establishing) > > > > dev-optimize: > > tps = 773.719637 (including connections establishing) > > tps = 2804.239979 (excluding connections establishing) > > > > Most of the time in both, patched and unpatched is by far spent in > > GetSnapshotData. I think the reason this shows a far higher overhead > > than what you previously measured is that a) in your test the other > > backends were idle, in mine they actually modify PGXACT which causes > > noticeable cacheline bouncing b) I have a higher number of connections & > > #max_connections > > > > A quick test shows that even with max_connection=600, 400 background, > > and 100 foreground pgbenches there's noticeable overhead: > > master-optimize: > > tps = 2221.226711 (including connections establishing) > > tps = 31203.259472 (excluding connections establishing) > > dev-optimize: > > tps = 1629.734352 (including connections establishing) > > tps = 4754.449726 (excluding connections establishing) > > > > Now I grant that's a somewhat harsh test for postgres, but I don't > > think it's entirely unreasonable and the performance impact is quite > > stark. > > It's not entirely unreasonable, but it *is* mostly unreasonable. Well, sure. So are the tests that you ran. But that's *completely* fine. In contrast to evaluating whether a performance improvement is worth its complexity, we're not trying to measure real world improvements. We're trying to test the worst cases we can think of, even if they aren't really interesting, by stressing potential pain points. If we can't find a relevant regression for those using something akin to microbenchmarks it's less likely that there are performance regressions. 
The "not entirely unreasonable" point is just about making sure you're not testing something entirely irrelevant. Say, performance of a 1TB database when shared_buffers is set to 64k. Or testing DDL performance while locking pg_class exclusively. The test was specifically chosen to: * do uncached syscache lookups (-C) to measure the impact of the added GetSnapshotData() calls * make individual GetSnapshotData() calls slower (all processes have an xid) * contend on ProcArrayLock but not much else (high number of clients in the background) > I > mean, nobody is going to run 1000 connections in the background that > do nothing but thrash PGXACT on a real system. I just can't get > concerned about that. In the original mail I did retry it with 400 and the regression is still pretty big. And the "background" things could also be doing something that's not that likely to be blocked by global locks. Say, operate on temporary or unlogged tables. Or just acquire a single row level lock and then continue to do readonly work in a read committed transaction. I think we both can come up with scenarios where at least part of the above scenario is present. But imo that doesn't really matter. > What I am concerned about is that there may be > other, more realistic workloads that show similar regressions. But I > don't know how to find out whether that's actually the case. So, given the results from that test and the profile I got where GetSnapshotData was by far the most expensive thing, a more representative test might be something like a readonly pgbench with a moderately high number of short lived connections. I wouldn't be surprised if that still showed performance problems. If that's not enough, something like: BEGIN; SELECT * FROM my_client WHERE client_id = :id FOR UPDATE; SELECT * FROM key_table WHERE key = :random ... SELECT * FROM key_table WHERE key = :random COMMIT; will surely still show the problem. 
> On the > IBM POWER box where I tested this, it's not even GetSnapshotData() > that kills you; it's the system CPU scheduler. I haven't tried yet, but I'd guess the above setup shows the difference with less than 400 clients. Might make it more reasonable to run there. > But I'm still on the fence about whether this is really a valid test. I think it shows that we need to be careful and do further performance evaluations and/or alleviate the pain by making things cheaper (say, a "ddl counter" in shared mem, allowing to cache snapshots for the syscache). If that artificial test hadn't shown problems I'd have voted for just going ahead and not worry further. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 20, 2013 at 10:35 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> But I'm still on the fence about whether this is really a valid test. > > I think it shows that we need to be careful and do further performance > evaluations and/or alleviate the pain by making things cheaper (say, a > "ddl counter" in shared mem, allowing to cache snapshots for the > syscache). If that artificial test hadn't shown problems I'd have voted > for just going ahead and not worry further. I tried a general snapshot counter that rolls over every time any transaction commits, but that doesn't help much. It's a small improvement on general workloads, but it's not effective against this kind of hammering. A DDL counter would be a bit more expensive because we'd have to insert an additional branch into GetSnapshotData() while ProcArrayLock is held, but it might be tolerable. Do you have code for this (or some portion of it) already? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-06-20 10:58:59 -0400, Robert Haas wrote: > On Thu, Jun 20, 2013 at 10:35 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> But I'm still on the fence about whether this is really a valid test. > > > > I think it shows that we need to be careful and do further performance > > evaluations and/or alleviate the pain by making things cheaper (say, a > > "ddl counter" in shared mem, allowing to cache snapshots for the > > syscache). If that artificial test hadn't shown problems I'd have voted > > for just going ahead and not worry further. > > I tried a general snapshot counter that rolls over every time any > transaction commits, but that doesn't help much. It's a small > improvement on general workloads, but it's not effective against this > kind of hammering. A DDL counter would be a bit more expensive > because we'd have to insert an additional branch into > GetSnapshotData() while ProcArrayLock is held, but it might be > tolerable. I actually wasn't thinking of adding it at that level. More like just not fetching a new snapshot in systable_* if a) the global ddl counter hasn't been increased b) our transaction hasn't performed any ddl. Instead, use the one we fetched the last time for that purpose. The global ddl counter could be increased every time we commit and the commit has invalidations queued. The local one every time we execute local cache invalidation messages (i.e. around CommandCounterIncrement). I think something roughly along those lines should be doable without adding new overhead to global paths. Except maybe some check in snapmgr.c for that new, longer living, snapshot. > Do you have code for this (or some portion of it) already? Nope, sorry. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 20, 2013 at 11:13 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I actually wasn't thinking of adding it at that level. More like just > not fetching a new snapshot in systable_* if a) the global ddl counter > hasn't been increased b) our transactions hasn't performed any > ddl. Instead use the one used the last time we fetched one for that > purpose. All right, so here's a patch that does something along those lines. We have to always take a new snapshot when somebody scans a catalog that has no syscache, because there won't be any invalidation messages to work off of in that case. The only catalog in that category that's accessed during backend startup (which is mostly what your awful test case is banging on) is pg_db_role_setting. We could add a syscache for that catalog or somehow force invalidation messages to be sent despite the lack of a syscache, but what I chose to do instead is refactor things slightly so that we use the same snapshot for all four scans of pg_db_role_setting, instead of taking a new one each time. I think that's unimpeachable on correctness grounds; it's no different than if we'd finished all four scans in the time it took us to finish the first one, and then gotten put to sleep by the OS scheduler for as long as it took us to scan the other three. Point being that there's no interlock there. Anyway, with that change, plus the general mechanism of the patch, each backend takes just two MVCC scans during startup. The first catalog access takes an MVCC snapshot for obvious reasons, and then there's one additional snapshot for the access to pg_db_role_setting for the reasons stated above. Everything else piggybacks on those snapshots, unless of course an invalidation intervenes. In my testing, this did not completely eliminate the performance hit on your test case, but it got it down to 3-4%. 
Best of five runs, with max_connections=2000 and 1000 connections running readonly-busy.sql: (patched) tps = 183.224651 (including connections establishing) tps = 1091.813178 (excluding connections establishing) (unpatched) tps = 190.598045 (including connections establishing) tps = 1129.422537 (excluding connections establishing) The difference is 3-4%, which is quite a lot less than what you measured before, although on different hardware, so results may vary. Now, I'm not sure this fix actually helps the other test scenarios very much; for example, it's not going to help the cases that pound on pg_depend, because that doesn't have a system cache either. As with pg_db_role_setting, we could optimize this mechanism by sending invalidation messages for pg_depend changes, but I'm not sure it's worth it. I'm also attaching a fixed version of pg_cxn.c; the last version had a few bugs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Robert Haas wrote: > All right, so here's a patch that does something along those lines. > We have to always take a new snapshot when somebody scans a catalog > that has no syscache, because there won't be any invalidation messages > to work off of in that case. The only catalog in that category that's > accessed during backend startup (which is mostly what your awful test > case is banging on) is pg_db_role_setting. We could add a syscache > for that catalog or somehow force invalidation messages to be sent > despite the lack of a syscache, but what I chose to do instead is > refactor things slightly so that we use the same snapshot for all four > scans of pg_db_role_setting, instead of taking a new one each time. I > think that's unimpeachable on correctness grounds; it's no different > than if we'd finished all four scans in the time it took us to finish > the first one, and then gotten put to sleep by the OS scheduler for as > long as it took us to scan the other three. Point being that there's > no interlock there. That seems perfectly acceptable to me, yeah. > The difference is 3-4%, which is quite a lot less than what you > measured before, although on different hardware, so results may vary. 3-4% on that synthetic benchmark sounds pretty acceptable to me, as well. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 28, 2013 at 12:22 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> The difference is 3-4%, which is quite a lot less than what you >> measured before, although on different hardware, so results may vary. > > 3-4% on that synthetic benchmark sounds pretty acceptable to me, as > well. Here's a further update of this patch. In this version, I added some mechanism to send a new kind of sinval message that is sent when a catalog without catcaches is updated; it doesn't apply to all catalogs, just to whichever ones we want to have this treatment. That means we don't need to retake snapshots for those catalogs on every access, so backend startup requires just one extra MVCC snapshot as compared with current master. Assorted cleanup has been done, along with the removal of a few more SnapshotNow references. It's still possible to construct test cases that perform badly by pounding the server with 1000 clients running Andres's readonly-busy.sql. Consider the following test case: use a DO block to create a schema with 10,000 functions in it and then DROP .. CASCADE. When the server is unloaded, the extra MVCC overhead is pretty small. master Create: 1010.225 ms, Drop: 444.891 ms Create: 1001.237 ms, Drop: 444.084 ms Create: 979.621 ms, Drop: 446.091 ms patched Create: 992.366 ms, Drop: 459.334 ms Create: 992.436 ms, Drop: 459.921 ms Create: 990.971 ms, Drop: 459.573 ms The create case is actually running just a hair faster with the patch, and the drop case is about 3% slower. Now let's add 1000 clients running Andres's readonly-busy.sql in the background and retest: master Create: 21554.387 ms, Drop: 2594.534 ms Create: 32189.395 ms, Drop: 2493.213 ms Create: 30627.964 ms, Drop: 1813.160 ms patched Create: 44312.107 ms, Drop: 11718.305 ms Create: 46683.021 ms, Drop: 11732.284 ms Create: 50766.615 ms, Drop: 9363.742 ms Well, now the create is 52% slower and the drop is a whopping 4.7x slower. It's worth digging into the reasons just a bit. 
I was able to speed up this case quite a bit - it was 30x slower a few hours ago - by adding a few new relations to the switch in RelationInvalidatesSnapshotsOnly(). But the code still takes one MVCC snapshot per object dropped, because deleteOneObject() calls CommandCounterIncrement() and that, as it must, invalidates our previous snapshot. We could, if we were inclined to spend the effort, probably work out that although we need to change curcid, the rest of the snapshot is still OK, but I'm not too convinced that it's worth adding an even-more-complicated mechanism for this. We could probably also optimize the delete code to increment the command counter fewer times, but I'm not convinced that's worth doing either. I think my general feeling about this is that we're going to have to accept that there's no such thing as a free lunch, but maybe that's OK. The testing done to date shows that MVCC snapshots are not terribly expensive when PGXACT isn't heavily updated, even if you take an awful lot of them, but with enough concurrent activity on PGXACT they do get expensive enough to care about. And that's still not a big problem on normal workloads, but if you combine that with a workload that requires an abnormally high number of snapshots compared to the overall work it does (like a DROP CASCADE on a schema with many objects but no tables) then you can make it quite slow. That's not great, but on the other hand, if it had always been that slow, I'm not all that sure anyone would have complained. DDL performance is not something we've spent a lot of cycles on, and for good reason. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
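[The RelationInvalidatesSnapshotsOnly() idea referred to above can be sketched as follows. The OID constants here are placeholders chosen for illustration — not necessarily the relations the patch actually lists — and the function shape is a guess at the patch's intent: a switch reporting whether updates to a given catalog require only a catalog-snapshot refresh, rather than any catcache invalidation, because that catalog has no syscache.]

```c
#include <assert.h>

/*
 * Hedged sketch: catalogs with no catcache are only ever read through
 * snapshots, so an update to one of them need not flush cache entries;
 * it only needs to mark the cached catalog snapshot stale.  The OID
 * macros below are illustrative placeholders.
 */
typedef unsigned int Oid;

#define DependRelationId        2608
#define SharedDependRelationId  1214
#define DbRoleSettingRelationId 2964

static int
RelationInvalidatesSnapshotsOnly(Oid relid)
{
    switch (relid)
    {
        case DependRelationId:
        case SharedDependRelationId:
        case DbRoleSettingRelationId:
            return 1;
        default:
            return 0;
    }
}
```

This is why adding "a few new relations to the switch" helps the DROP CASCADE case: pg_depend traffic stops forcing full snapshot-plus-cache invalidation work.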
On 2013-06-28 23:14:23 -0400, Robert Haas wrote: > Here's a further update of this patch. In this version, I added some > mechanism to send a new kind of sinval message that is sent when a > catalog without catcaches is updated; it doesn't apply to all > catalogs, just to whichever ones we want to have this treatment. That > means we don't need to retake snapshots for those catalogs on every > access, so backend startup requires just one extra MVCC snapshot as > compared with current master. Assorted cleanup has been done, along > with the removal of a few more SnapshotNow references. This is really cool stuff. > It's still possible to construct test cases that perform badly by > pounding the server with 1000 clients running Andres's > readonly-busy.sql. Consider the following test case: use a DO block > to create a schema with 10,000 functions in it and then DROP .. > CASCADE. When the server is unloaded, the extra MVCC overhead is > pretty small. > Well, now the create is 52% slower and the drop is a whopping 4.7x > slower. It's worth digging into the reasons just a bit. I was able > to speed up this case quite a bit - it was 30x slower a few hours ago > - by adding a few new relations to the switch in > RelationInvalidatesSnapshotsOnly(). But the code still takes one MVCC > snapshot per object dropped, because deleteOneObject() calls > CommandCounterIncrement() and that, as it must, invalidates our > previous snapshot. I have to say, if the thing that primarily suffers is pretty extreme DDL in extreme situations I am not really worried. Anybody running anything close to the territory of such concurrency won't perform that much DDL. > We could, if we were inclined to spend the effort, > probably work out that although we need to change curcid, the rest of > the snapshot is still OK, but I'm not too convinced that it's worth > adding an even-more-complicated mechanism for this. 
We could probably > also optimize the delete code to increment the command counter fewer > times, but I'm not convinced that's worth doing either. I am pretty convinced we shouldn't do either for now. Something picked up when quickly scanning over the last version of the patch: > +/* > + * Staleness detection for CatalogSnapshot. > + */ > +static bool CatalogSnapshotStale = true; > > /* > * These are updated by GetSnapshotData. We initialize them this way > @@ -177,6 +188,9 @@ GetTransactionSnapshot(void) > else > CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); > > + /* Don't allow catalog snapshot to be older than xact snapshot. */ > + CatalogSnapshotStale = true; > + > FirstSnapshotSet = true; > return CurrentSnapshot; > } > @@ -184,6 +198,9 @@ GetTransactionSnapshot(void) > if (IsolationUsesXactSnapshot()) > return CurrentSnapshot; > > + /* Don't allow catalog snapshot to be older than xact snapshot. */ > + CatalogSnapshotStale = true; > + > CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); > > return CurrentSnapshot; > @@ -207,6 +224,54 @@ GetLatestSnapshot(void) > } Do we really need to invalidate snapshots in either situation? Isn't it implied, that if it's still valid, according to a) no invalidation via local invalidation messages b) no invalidations from other backends, there shouldn't be any possible differences when you only look at the catalog? And if it needs to change, we could copy the newly generated snapshot to the catalog snapshot if it's currently valid. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jul 1, 2013 at 10:04 AM, Andres Freund <andres@2ndquadrant.com> wrote: > This is really cool stuff. Thanks. > I have to say, if the thing that primarily suffers is pretty extreme DDL > in extreme situations I am not really worried. Anybody running anything > close to the territory of such concurrency won't perform that much DDL. /me wipes brow. > Something picked up when quickly scanning over the last version of the > patch: > >> +/* >> + * Staleness detection for CatalogSnapshot. >> + */ >> +static bool CatalogSnapshotStale = true; >> >> /* >> * These are updated by GetSnapshotData. We initialize them this way >> @@ -177,6 +188,9 @@ GetTransactionSnapshot(void) >> else >> CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); >> >> + /* Don't allow catalog snapshot to be older than xact snapshot. */ >> + CatalogSnapshotStale = true; >> + >> FirstSnapshotSet = true; >> return CurrentSnapshot; >> } >> @@ -184,6 +198,9 @@ GetTransactionSnapshot(void) >> if (IsolationUsesXactSnapshot()) >> return CurrentSnapshot; >> >> + /* Don't allow catalog snapshot to be older than xact snapshot. */ >> + CatalogSnapshotStale = true; >> + >> CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); >> >> return CurrentSnapshot; >> @@ -207,6 +224,54 @@ GetLatestSnapshot(void) >> } > > Do we really need to invalidate snapshots in either situation? Isn't it > implied, that if it's still valid, according to a) no invalidation via local > invalidation messages b) no invalidations from other backends, there > shouldn't be any possible differences when you only look at the catalog? I had the same thought, removed that code, and then put it back. The problem is that if we revive an older snapshot "from the dead", so to speak, our backend's advertised xmin might need to go backwards, and that seems unsafe - e.g. suppose another backend has updated a tuple but not yet committed. We don't see any invalidation messages so decide to reuse our existing (old) snapshot and begin a scan. 
After we've looked at the page containing the new tuple (and decided not to see it), vacuum nukes the old tuple (which we then also don't see). Bad things ensue. It might be possible to avoid the majority of problems in this area via an appropriate set of grotty hacks, but I don't want to go there. > And if it needs to change, we could copy the newly generated snapshot > to the catalog snapshot if it's currently valid. Yeah, I think there's room for further fine-tuning there. But I think it would make sense to push the patch at this point, and then if we find cases that can be further improved, or things that it breaks, we can fix them. This area is complicated enough that I wouldn't be horribly surprised if we end up having to fix a few more problem cases or even revert the whole thing, but I think we've probably reached the point where further review has less value than getting the code out there in front of more people and seeing where (if anywhere) the wheels come off out in the wild. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-07-01 15:02:41 -0400, Robert Haas wrote: > >> * These are updated by GetSnapshotData. We initialize them this way > >> @@ -177,6 +188,9 @@ GetTransactionSnapshot(void) > >> else > >> CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); > >> > >> + /* Don't allow catalog snapshot to be older than xact snapshot. */ > >> + CatalogSnapshotStale = true; > >> + > > Do we really need to invalidate snapshots in either situation? Isn't it > > implied, that if it's still valid, according to a) no invalidation via local > > invalidation messages b) no invalidations from other backends, there > > shouldn't be any possible differences when you only look at the catalog? > > I had the same thought, removed that code, and then put it back. The > problem is that if we revive an older snapshot "from the dead", so to > speak, our backend's advertised xmin might need to go backwards, and > that seems unsafe - e.g. suppose another backend has updated a tuple > but not yet committed. We don't see any invalidation messages so > decide reuse our existing (old) snapshot and begin a scan. After > we've looked at the page containing the new tuple (and decided not to > see it), vacuum nukes the old tuple (which we then also don't see). > Bad things ensue. It might be possible to avoid the majority of > problems in this area via an appropriate set of grotty hacks, but I > don't want to go there. Yes, I thought about that and I think it's a problem that can be solved without too ugly hacks. But, as you say: > Yeah, I think there's room for further fine-tuning there. But I think > it would make sense to push the patch at this point, and then if we > find cases that can be further improved, or things that it breaks, we > can fix them. 
> This area is complicated enough that I wouldn't be > horribly surprised if we end up having to fix a few more problem cases > or even revert the whole thing, but I think we've probably reached the > point where further review has less value than getting the code out > there in front of more people and seeing where (if anywhere) the > wheels come off out in the wild. I am pretty sure that we will have to fix more stuff, but luckily we're at the beginning of the cycle. And while I'd prefer more eyes on the patch before it gets applied, especially ones knowledgeable about the implications this has, I don't really see that happening soon. So applying is more likely to lead to more review than waiting. So, from me: +1. Some things that might be worth changing when committing: * Could you add an Assert(!RelationHasSysCache(relid)) to RelationInvalidatesSnapshotsOnly? It's not unlikely that it will be missed by the next person adding a syscache and that seems like it could have ugly and hard to detect consequences. * maybe use bsearch(3) instead of open coding the binary search? We already use it in the backend. * possibly paranoid, but I'd add an Assert to heap_beginscan_catalog or GetCatalogSnapshot ensuring we're dealing with a catalog relation. The consistency mechanisms in GetCatalogSnapshot() only work for those, so there doesn't seem to be a valid usecase for non-catalog relations. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 2, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Some things that might be worth changing when committing: > * Could you add a Assert(!RelationHasSysCache(relid)) to > RelationInvalidatesSnapshotsOnly? It's not unlikely that it will be > missed by the next person adding a syscache and that seems like it > could have ugly and hard to detect consequences. There's a cross-check in InitCatalogCache() for that very issue. > * maybe use bsearch(3) instead of open coding the binary search? We > already use it in the backend. I found comments elsewhere indicating that bsearch() was slower than open-coding it, so I copied the logic used for ScanKeywordLookup(). > * possibly paranoid, but I'd add a Assert to heap_beginscan_catalog or > GetCatalogSnapshot ensuring we're dealing with a catalog relation. The > consistency mechanisms in GetCatalogSnapshot() only work for those, so > there doesn't seem to be a valid usecase for non-catalog relations. It'll actually work fine as things stand; it'll just take a new snapshot every time. I have a few ideas for getting rid of the remaining uses of SnapshotNow that I'd like to throw out there: - In pgrowlocks and pgstattuple, I think it would be fine to use SnapshotSelf instead of SnapshotNow. The only difference is that it includes changes made by the current command that wouldn't otherwise be visible until CommandCounterIncrement(). That, however, is not really a problem for their usage. - In genam.c and execUtils.c, we treat SnapshotNow as a kind of default snapshot. That seems like a crappy idea. I propose that we either set that pointer to NULL and let the server core dump if the snapshot doesn't get set or (maybe better) add a new special snapshot called SnapshotError which just errors out if you try to use it for anything, and initialize to that. - I'm not quite sure what to do about get_actual_variable_range(). Taking a new MVCC snapshot there seems like it might be pricey on some workloads. 
However, I wonder if we could use SnapshotDirty. Presumably, even uncommitted tuples affect the amount of index-scanning we have to do, so that approach seems to have some theoretical justification. But I'm worried there could be unfortunate consequences to looking at uncommitted data, either now or in the future. SnapshotSelf seems less likely to have that problem, but feels wrong somehow. - currtid_byreloid() and currtid_byrelname() use SnapshotNow as an argument to heap_get_latest_tid(). I don't know what these functions are supposed to be good for, but taking a new snapshot for every function call seems to guarantee that someone will find a way to use these functions as a poster child for how to brutalize PGXACT, so I don't particularly want to do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
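[The open-coded binary search Robert mentions above — modeled on ScanKeywordLookup() rather than calling bsearch(3) — looks roughly like this. The sorted OID array is invented for illustration; the real patch searches the list of catalogs whose updates invalidate only snapshots.]

```c
#include <assert.h>

/*
 * Sketch of an open-coded binary search over a sorted array of relation
 * OIDs, in the style of ScanKeywordLookup().  Array contents are
 * illustrative placeholders, not actual catalog OIDs from the patch.
 */
typedef unsigned int Oid;

static const Oid sorted_relids[] = {1213, 1214, 2608, 2610, 2964};

static int
relid_in_sorted_list(Oid relid)
{
    int low = 0;
    int high = (int) (sizeof(sorted_relids) / sizeof(sorted_relids[0])) - 1;

    while (low <= high)
    {
        int mid = low + (high - low) / 2;   /* avoids overflow on huge arrays */

        if (sorted_relids[mid] == relid)
            return 1;
        else if (sorted_relids[mid] < relid)
            low = mid + 1;
        else
            high = mid - 1;
    }
    return 0;
}
```

Open-coding avoids bsearch()'s indirect comparator calls, which is the usual argument for it on tiny, hot lookup tables like this one.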
On 2013-07-02 09:31:23 -0400, Robert Haas wrote: > On Tue, Jul 2, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > Some things that might be worth changing when committing: > > * Could you add a Assert(!RelationHasSysCache(relid)) to > > RelationInvalidatesSnapshotsOnly? It's not unlikely that it will be > > missed by the next person adding a syscache and that seems like it > > could have ugly and hard to detect consequences. > > There's a cross-check in InitCatalogCache() for that very issue. Great. > > * maybe use bsearch(3) instead of open coding the binary search? We > > already use it in the backend. > > I found comments elsewhere indicating that bsearch() was slower than > open-coding it, so I copied the logic used for ScanKeywordLookup(). Hm. Ok. > > * possibly paranoid, but I'd add a Assert to heap_beginscan_catalog or > > GetCatalogSnapshot ensuring we're dealing with a catalog relation. The > > consistency mechanisms in GetCatalogSnapshot() only work for those, so > > there doesn't seem to be a valid usecase for non-catalog relations. > > It'll actually work find as things stand; it'll just take a new > snapshot every time. Ok. Doesn't really change my opinion that it's a crappy idea to use it otherwise ;) > - In genam.c and execUtils.c, we treat SnapshotNow as a kind of > default snapshot. That seems like a crappy idea. I propose that we > either set that pointer to NULL and let the server core dump if the > snapshot doesn't get set or (maybe better) add a new special snapshot > called SnapshotError which just errors out if you try to use it for > anything, and initialize to that. I vote for SnapshotError. > - I'm not quite sure what to do about get_actual_variable_range(). > Taking a new MVCC snapshot there seems like it might be pricey on some > workloads. However, I wonder if we could use SnapshotDirty. 
> Presumably, even uncommitted tuples affect the amount of > index-scanning we have to do, so that approach seems to have some > theoretical justification. But I'm worried there could be unfortunate > consequences to looking at uncommitted data, either now or in the > future. SnapshotSelf seems less likely to have that problem, but > feels wrong somehow. I don't like using SnapshotDirty either. Can't we just use the currently active snapshot? Unless I miss something this always will be called while we have one since when we plan we've done an explicit PushActiveSnapshot() and if we need to replan stuff during execution PortalRunSelect() will have pushed one. > - currtid_byreloid() and currtid_byrelname() use SnapshotNow as an > argument to heap_get_latest_tid(). I don't know what these functions > are supposed to be good for, but taking a new snapshot for every > function call seems to guarantee that someone will find a way to use > these functions as a poster child for how to brutalize PGXACT, so I > don't particularly want to do that. Heikki mentioned that at some point they were added for the odbc driver. I am not particularly inclined to worry about taking too many snapshots here. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 2, 2013 at 9:52 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> > * possibly paranoid, but I'd add a Assert to heap_beginscan_catalog or >> > GetCatalogSnapshot ensuring we're dealing with a catalog relation. The >> > consistency mechanisms in GetCatalogSnapshot() only work for those, so >> > there doesn't seem to be a valid usecase for non-catalog relations. >> >> It'll actually work find as things stand; it'll just take a new >> snapshot every time. > > Ok. Doesn't really change my opinion that it's a crappy idea to use it > otherwise ;) I agree, but I don't see an easy way to write the assertion you want using only the OID. >> - In genam.c and execUtils.c, we treat SnapshotNow as a kind of >> default snapshot. That seems like a crappy idea. I propose that we >> either set that pointer to NULL and let the server core dump if the >> snapshot doesn't get set or (maybe better) add a new special snapshot >> called SnapshotError which just errors out if you try to use it for >> anything, and initialize to that. > > I vote for SnapshotError. OK. >> - I'm not quite sure what to do about get_actual_variable_range(). >> Taking a new MVCC snapshot there seems like it might be pricey on some >> workloads. However, I wonder if we could use SnapshotDirty. >> Presumably, even uncommitted tuples affect the amount of >> index-scanning we have to do, so that approach seems to have some >> theoretical justification. But I'm worried there could be unfortunate >> consequences to looking at uncommitted data, either now or in the >> future. SnapshotSelf seems less likely to have that problem, but >> feels wrong somehow. > > I don't like using SnapshotDirty either. Can't we just use the currently > active snapshot? Unless I miss something this always will be called > while we have one since when we plan we've done an explicit > PushActiveSnapshot() and if we need to replan stuff during execution > PortalRunSelect() will have pushed one. 
We could certainly do that, but I'd be a bit reluctant to do so without input from Tom. I imagine there might be cases where it could cause a regression. >> - currtid_byreloid() and currtid_byrelname() use SnapshotNow as an >> argument to heap_get_latest_tid(). I don't know what these functions >> are supposed to be good for, but taking a new snapshot for every >> function call seems to guarantee that someone will find a way to use >> these functions as a poster child for how to brutalize PGXACT, so I >> don't particularly want to do that. > > Heikki mentioned that at some point they were added for the odbc > driver. I am not particularly inclined to worry about taking too many > snapshots here. Well, if it uses it with any regularity I think it's a legitimate concern. We have plenty of customers who use ODBC and some of them allow frightening numbers of concurrent server connections. Now you may say that's a bad idea, but so is 1000 backends doing txid_current() in a loop. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-07-02 10:38:17 -0400, Robert Haas wrote: > On Tue, Jul 2, 2013 at 9:52 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> > * possibly paranoid, but I'd add a Assert to heap_beginscan_catalog or > >> > GetCatalogSnapshot ensuring we're dealing with a catalog relation. The > >> > consistency mechanisms in GetCatalogSnapshot() only work for those, so > >> > there doesn't seem to be a valid usecase for non-catalog relations. > >> > >> It'll actually work find as things stand; it'll just take a new > >> snapshot every time. > > > > Ok. Doesn't really change my opinion that it's a crappy idea to use it > > otherwise ;) > > I agree, but I don't see an easy way to write the assertion you want > using only the OID. Let's add

/*
 * IsSystemRelationId
 *		True iff the relation is a system catalog relation.
 */
bool
IsSystemRelationId(Oid relid)
{
	return relid < FirstNormalObjectId;
}

and change IsSystemRelation() to use that instead of what it does now... > >> - currtid_byreloid() and currtid_byrelname() use SnapshotNow as an > >> argument to heap_get_latest_tid(). I don't know what these functions > >> are supposed to be good for, but taking a new snapshot for every > >> function call seems to guarantee that someone will find a way to use > >> these functions as a poster child for how to brutalize PGXACT, so I > >> don't particularly want to do that. > > > > Heikki mentioned that at some point they were added for the odbc > > driver. I am not particularly inclined to worry about taking too many > > snapshots here. > > Well, if it uses it with any regularity I think it's a legitimate > concern. We have plenty of customers who use ODBC and some of them > allow frightening numbers of concurrent server connections. I've quickly verified that it indeed uses them. I wish I hadn't. Brrr. I can't even guess what that should do from the surrounding code/function names. 
Except that it looks broken under concurrency as long as SnapshotNow is used (because the query's snapshot won't be as new as SnapshotNow, even in read committed mode). Heikki, do you understand the code well enough to explain it without investing time? > Now you may say that's a bad idea, but so is 1000 backends doing > txid_current() in a loop. Hehe ;). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 02.07.2013 18:24, Andres Freund wrote: > I've quickly verified that it indeed uses them. I wish I hadn't. Brrr. I > can't even guess what that should do from the surrounding code/function > names. Except that it looks broken under concurrency as long as > SnapshotNow is used (because the query's snapshot won't be as new as > SnapshotNow, even in read committed mode). > > Heikki, do you understand the code well enough to explain it without > investing time? No, sorry. I think it has something to do with updateable cursors, but I don't understand the details. - Heikki
Hi Robert, On 2013-07-02 09:31:23 -0400, Robert Haas wrote: > I have a few ideas for getting rid of the remaining uses of > SnapshotNow that I'd like to throw out there: Is your current plan to get rid of SnapshotNow entirely? I wonder because the changeset extraction needs to care, and how the proper fix for dealing with CatalogSnapshotData looks depends on it... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jul 5, 2013 at 11:27 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi Robert, > > On 2013-07-02 09:31:23 -0400, Robert Haas wrote: >> I have a few ideas for getting rid of the remaining uses of >> SnapshotNow that I'd like to throw out there: > > Is your current plan to get rid of SnapshotNow entirely? I am wonder > because the changeset extraction needs to care and how the proper fix > for dealing with CatalogSnapshotData looks depends on it... I would like to do that, but I haven't quite figured out how to get rid of the last few instances, per discussion upthread. I do plan to spend some more time on it, but likely not this week. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company