Thread: MVCC catalog access
We've had a number of discussions about the evils of SnapshotNow. As far as I can tell, nobody likes it and everybody wants it gone, but there is concern about the performance impact. I decided to do some testing to measure that impact, and I was pleasantly surprised by the results.

The attached patch is a quick hack to provide for MVCC catalog access. It adds a GUC called "mvcc_catalog_access". When this GUC is set to true and heap_beginscan() or index_beginscan() is called with SnapshotNow, they call GetLatestSnapshot() and use the resulting snapshot in lieu of SnapshotNow. As a debugging double-check, I modified HeapTupleSatisfiesNow to elog(FATAL) if called while mvcc_catalog_access is true. Of course, both of these are dirty hacks. If we were actually to implement MVCC catalog access, I think we'd probably just go through and replace instances of SnapshotNow with GetLatestSnapshot(), but I wanted to make it easy to do perf testing.

When I first made this change, I couldn't detect any real difference; indeed, make check seemed to run ever-so-slightly faster than before, although that may well have been a testing artifact. I wrote a test case that created a schema with 100,000 functions in it and then dropped the schema (I believe it was Tom who previously suggested this as a worst-case scenario for MVCC catalog access). That didn't seem to be adversely affected either, even though it must take ~700k additional MVCC snapshots with mvcc_catalog_access = true.

MVCC Off: Create 8743.101 ms, Drop 9655.471 ms
MVCC On: Create 7462.882 ms, Drop 9515.537 ms
MVCC Off: Create 7519.160 ms, Drop 9380.905 ms
MVCC On: Create 7517.382 ms, Drop 9394.857 ms

The first "Create" seems to be artificially slow because of some kind of backend startup overhead; I'm not sure exactly what.
After wracking my brain for a few minutes, I realized that the lack of any apparent performance regression was probably due to the lack of any concurrent connections, which makes the scans of the PGXACT array very cheap. So I wrote a little program to open a bunch of extra connections. My MacBook Pro grumbled when I tried to open more than about 600, so I had to settle for that number. That was enough to show the cost of all those extra snapshots:

MVCC Off: Create 9065.887 ms, Drop 9599.494 ms
MVCC On: Create 8384.065 ms, Drop 10532.909 ms
MVCC Off: Create 7632.197 ms, Drop 9499.502 ms
MVCC On: Create 8215.443 ms, Drop 10033.499 ms

Now, I don't know about you, but I'm having a hard time getting agitated about those numbers. Most people are not going to drop 100,000 objects in a single cascaded drop, and most people are not going to have 600 connections open when they do. (The snapshot overhead should be roughly proportional to the product of the number of drops and the number of open connections, and the number of cases where that product is as high as 60 million has got to be pretty small.) But suppose someone is in that situation. Well, then they will take a... 10% performance penalty? That sounds plenty tolerable to me if it means we can start moving in the direction of allowing some concurrent DDL.

Am I missing an important test case here? Are these results worse than I think they are? Did I botch this testing somehow?

[MVCC catalog access patch, test program to create lots of idle connections, and pg_depend stress test case attached.]

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 21, 2013 at 10:18 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> [ MVCC catalog access seems to be pretty cheap ]

In discussions today, Tom Lane suggested testing the time to start up a backend and run a simple query such as "SELECT 2+2" in the absence of a relcache file. I did this and can't measure any overhead as a result of MVCC catalog access. I tried it with no active connections, and with 600 idle connections (to make taking MVCC snapshots more expensive). I couldn't quite believe it made no difference, so I tried doing it in a tight loop under pgbench. I still can't measure any difference. I haven't tested carefully enough to rule out the possibility of an effect <1/2% at 600 connections, but there certainly isn't anything bigger than that, and I don't even think there's that much of a difference.

Andres Freund suggested creating a couple of simple tables and having lots of short-lived backends select data from them:

rhaas=# create table af1 (x) as select g from generate_series(1,4) g;
SELECT 4
rhaas=# create table af2 (x) as select g from generate_series(4,7) g;
SELECT 4

Test query: SELECT * FROM af1, af2 WHERE af1.x = af2.x;
pgbench command: pgbench -T 10 -c 50 -j 50 -n -f f -C

With mvcc_catalog_access=off, I get ~1553 tps; with it on, I get ~1557 tps. Hmm... that could be because of the two-line debugging hunk my patch adds to HeapTupleSatisfiesNow(). After removing that, I get maybe a 1% regression with mvcc_catalog_access=on on this test, but it's not very consistent. If I restart the database server a few times, the overhead bounces around each time, and sometimes it's zero; the highest I saw is 1.4%. But it's not much, and this is a pretty brutal workload for PostgreSQL, since starting up >1500 connections per second is not a workload for which we're well-suited in the first place.

All in all, I'm still feeling optimistic.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-05-22 22:51:13 -0400, Robert Haas wrote:
> In discussions today, Tom Lane suggested testing the time to start up
> a backend and run a simple query such as "SELECT 2+2" in the absence
> of a relcache file.
>
> I did this and can't measure any overhead as a result of MVCC catalog
> access. I tried it with no active connections. I tried it with 600
> idle active connections (to make taking MVCC snapshots more
> expensive).

Did you try it with the 600 transactions actually being in a transaction and having acquired a snapshot?

> All in all, I'm still feeling optimistic.

+1. I still feel like this has to be much harder, since we made it out to be hard for a long time ;)

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 22, 2013 at 11:02 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2013-05-22 22:51:13 -0400, Robert Haas wrote:
>> In discussions today, Tom Lane suggested testing the time to start up
>> a backend and run a simple query such as "SELECT 2+2" in the absence
>> of a relcache file.
>>
>> I did this and can't measure any overhead as a result of MVCC catalog
>> access. I tried it with no active connections. I tried it with 600
>> idle active connections (to make taking MVCC snapshots more
>> expensive).
>
> Did you try it with the 600 transactions actually being in a transaction
> and having acquired a snapshot?

No... I can hack something up for that.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2013-05-22 23:05:40 -0400, Robert Haas wrote:
> On Wed, May 22, 2013 at 11:02 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> Did you try it with the 600 transactions actually being in a transaction
>> and having acquired a snapshot?
>
> No... I can hack something up for that.

Make that actually having acquired an xid. We skip a large part of the work if a transaction doesn't yet have one, AFAIR. I don't think the mere presence of 600 idle connections without an xid, in contrast to just having max_connections at 600, should actually make a difference in the cost of acquiring a snapshot?

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, May 22, 2013 at 11:11 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> Make that actually having acquired an xid. We skip a large part of the
> work if a transaction doesn't yet have one afair. I don't think the mere
> presence of 600 idle connections without an xid in contrast to just
> having max_connection at 600 should actually make a difference in the
> cost of acquiring a snapshot?

Attached is a slightly updated version of the patch I'm using for testing, and an updated version of the pg_cxn source that I'm using to open lots of connections. With this version, I can do this:

./pg_cxn -n 600 -c BEGIN -c 'SELECT txid_current()'

...which I think is sufficient to make sure all those transactions have XIDs. Then I reran the "depend" test case (create a schema with 100,000 functions and then drop the schema with CASCADE) that I mentioned in my original posting. Here are the results:

MVCC Off: Create 8685.662 ms, Drop 9973.233 ms
MVCC On: Create 7931.039 ms, Drop 10189.189 ms
MVCC Off: Create 7810.084 ms, Drop 9594.580 ms
MVCC On: Create 8854.577 ms, Drop 10240.024 ms

OK, let's try the rebuild-the-relcache test using the same pg_cxn scenario (600 connections that have started a transaction and selected txid_current()).

[rhaas ~]$ time for s in `seq 1 1000`; do rm -f pgdata/global/pg_internal.init && psql -c 'SELECT 2+2' >/dev/null; done

MVCC catalog access on:
real 0m11.006s
user 0m2.746s
sys 0m2.664s

MVCC catalog access off:
real 0m10.583s
user 0m2.745s
sys 0m2.661s

MVCC catalog access on:
real 0m10.646s
user 0m2.750s
sys 0m2.661s

MVCC catalog access off:
real 0m10.823s
user 0m2.756s
sys 0m2.681s

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Perhaps we see little difference in performance because PGPROC has been separated into PGPROC and PGXACT, reducing lock contention when getting snapshot data?
By the way, I grabbed a 32-core machine and did some more performance tests, with some open connections holding XIDs (assigned using the pg_cxn v2 given by Robert in his previous mail) to make sure that the snapshots get pretty large. The read test:

$ time for s in `seq 1 1000`
> do
> rm -f ~/bin/pgsql/master/global/pg_internal.init && psql -c 'SELECT 2+2' >/dev/null;
> done
And then the create/drop test.
I have done those tests with 250, 500, 1000 and 2000 open connections:
1) 250 open connections
1-1) read test
Round 1:
mvcc_catalog_access off:
real 0m9.124s
user 0m0.200s
sys 0m0.392s
mvcc_catalog_access on:
real 0m9.297s
user 0m0.148s
sys 0m0.444s
Round 2:
mvcc_catalog_access off:
real 0m8.985s
user 0m0.160s
sys 0m0.372s
mvcc_catalog_access on:
real 0m9.244s
user 0m0.240s
sys 0m0.400s
1-2) DDL test (drop and creation of 100,000 objects)
mvcc off: Create: 24554.849, Drop: 29755.146
mvcc on: Create: 26904.755, Drop: 32891.556
mvcc off: Create: 23337.342, Drop: 29921.990
mvcc on: Create: 24533.708, Drop: 31670.840
2) 500 open connections
2-1) read test
Round 1:
mvcc_catalog_access off:
real 0m9.123s
user 0m0.200s
sys 0m0.396s
mvcc_catalog_access on:
real 0m9.627s
user 0m0.156s
sys 0m0.460s
Round 2:
mvcc_catalog_access off:
real 0m9.221s
user 0m0.316s
sys 0m0.392s
mvcc_catalog_access on:
real 0m9.592s
user 0m0.160s
sys 0m0.484s
2-2) DDL test (drop and creation of 100,000 objects)
mvcc off: Create: 25872.886, Drop: 31723.921
mvcc on: Create: 27076.429, Drop: 33674.336
mvcc off: Create: 24039.456, Drop: 30434.019
mvcc on: Create: 29105.713, Drop: 33821.170
3) 1000 open connections
3-1) read test
Round 1:
mvcc_catalog_access off:
real 0m9.240s
user 0m0.192s
sys 0m0.396s
mvcc_catalog_access on:
real 0m9.674s
user 0m0.236s
sys 0m0.440s
Round 2:
mvcc_catalog_access off:
real 0m9.302s
user 0m0.308s
sys 0m0.392s
mvcc_catalog_access on:
real 0m9.746s
user 0m0.204s
sys 0m0.436s
3-2) DDL test (drop and creation of 100,000 objects)
mvcc off: Create: 25563.705, Drop: 31747.451
mvcc on: Create: 33281.246, Drop: 36618.166
mvcc off: Create: 28178.210, Drop: 30550.166
mvcc on: Create: 31849.825, Drop: 36831.245
4) 2000 open connections
4-1) read test
Round 1:
mvcc_catalog_access off:
real 0m9.066s
user 0m0.128s
sys 0m0.420s
mvcc_catalog_access on:
real 0m9.978s
user 0m0.168s
sys 0m0.412s
Round 2:
mvcc_catalog_access off:
real 0m9.113s
user 0m0.152s
sys 0m0.444s
mvcc_catalog_access on:
real 0m9.974s
user 0m0.176s
sys 0m0.436s
More or less the same results as previously, with a ~10% performance drop.
4-2) DDL test (drop and creation of 100,000 objects)
mvcc off: Create: 28708.095 ms, Drop: 32510.057 ms
mvcc on: Create: 39987.815 ms, Drop: 43157.006 ms
mvcc off: Create: 28409.853 ms, Drop: 31248.163 ms
mvcc on: Create: 41669.829 ms, Drop: 44645.109 ms
For the read tests, we see a performance drop of up to 10% at 2000 connections.
For the write tests, we see a performance drop of 9~10% at 250 connections, rising to roughly 30% at 2000 connections.
We rarely see users drop that many objects at once with so many open transactions; they would switch to a connection pooler before opening that many connections to the server. I am not sure that such a performance drop is acceptable as-is, but perhaps it is if we consider the functionality we gain thanks to MVCC catalogs.
--
Michael
On Sun, May 26, 2013 at 9:10 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
> Perhaps we see little difference in performance because PGPROC has been
> separated into PGPROC and PGXACT, reducing lock contention with getting
> snapshot data?
>
> By the way, I grabbed a 32-core machine and did some more performance tests
> with some open connections with XIDs assigned using pg_cxn v2 given by
> Robert in his previous mail to make sure that the snapshots get pretty
> large.

Thanks for checking this on another machine. It's interesting that you were able to measure a hit for relcache rebuild, whereas I was not, but it doesn't look horrible.

IMHO, we should press forward with this approach. Considering that these are pretty extreme test cases, I'm inclined to view the performance loss as acceptable. We've never really viewed DDL as something that needs to be micro-optimized, and there is ample testimony to that fact in the existing code and in the treatment of prior patches in this area. This is not to say that we want to go around willy-nilly making it slower, but I think there will be very few users for whom the number of microseconds it takes to create or drop an SQL object is performance-critical, especially when you consider that (1) the effect will be quite a bit less when the objects are tables, since in that case the snapshot cost will tend to be drowned out by the filesystem cost, and (2) people who don't habitually keep hundreds and hundreds of connections open - which hopefully most people don't - won't see the effect anyway. Against that, this removes the single largest barrier to allowing more concurrent DDL, a feature that I suspect will make a whole lot of people *very* happy.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, May 28, 2013 at 10:39 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> IMHO, we should press forward with this approach. Considering that
> these are pretty extreme test cases, I'm inclined to view the
> performance loss as acceptable. We've never really viewed DDL as
> something that needs to be micro-optimized, and there is ample
> testimony to that fact in the existing code and in the treatment of
> prior patches in this area. This is not to say that we want to go
> around willy-nilly making it slower, but I think there will be very
> few users for which the number of microseconds it takes to create or
> drop an SQL object is performance-critical, especially when you
> consider that (1) the effect will be quite a bit less when the objects
> are tables, since in that case the snapshot cost will tend to be
> drowned out by the filesystem cost and (2) people who don't habitually
> keep hundreds and hundreds of connections open - which hopefully most
> people don't - won't see the effect anyway. Against that, this
> removes the single largest barrier to allowing more concurrent DDL, a
> feature that I suspect will make a whole lot of people *very* happy.
+1.
So, I imagine that the next step would be to add a new Snapshot validation
level in tqual.h. Something like SnapshotMVCC? Then replace SnapshotNow
by SnapshotMVCC where it is required.
I am also seeing that SnapshotNow is used in places where we might not want to
have it changed. For example, shouldn't the autovacuum code path that retrieves
the database or table list be left unchanged?
--
Michael
On Thu, May 30, 2013 at 1:39 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
> +1.

Here's a more serious patch for MVCC catalog access. This one involves more data copying than the last one, I think, because the previous version did not register the snapshots it took, which I think is not safe. So this needs to be re-tested for performance, which I have so far made no attempt to do.

It strikes me as rather unfortunate that the snapshot interface is designed in such a way as to require so much data copying. It seems we always take a snapshot by copying from PGXACT/PGPROC into CurrentSnapshotData or SecondarySnapshotData, and then copying data a second time from there to someplace more permanent. It would be nice to avoid that, at least in common cases.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Tue, Jun 4, 2013 at 3:57 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, May 30, 2013 at 1:39 AM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> +1.
> Here's a more serious patch for MVCC catalog access. This one
> involves more data copying than the last one, I think, because the
> previous version did not register the snapshots it took, which I think
> is not safe. So this needs to be re-tested for performance, which I
> have so far made no attempt to do.
>
> It strikes me as rather unfortunate that the snapshot interface is
> designed in such a way as to require so much data copying. It seems
> we always take a snapshot by copying from PGXACT/PGPROC into
> CurrentSnapshotData or SecondarySnapshotData, and then copying data a
> second time from there to someplace more permanent. It would be nice
> to avoid that, at least in common cases.
And here are more results comparing master branch with and without this patch...
1) DDL CREATE/DROP test:
1-1) master:
250 connections:
Create: 24846.060, Drop: 30391.713
Create: 23771.394, Drop: 29769.396
500 connections:
Create: 24339.449, Drop: 30084.741
Create: 24152.176, Drop: 30643.471
1000 connections:
Create: 26007.960, Drop: 31019.918
Create: 25937.592, Drop: 30600.341
2000 connections:
Create: 26900.324, Drop: 30741.989
Create: 26910.660, Drop: 31577.247
1-2) mvcc catalogs:
250 connections:
Create: 25371.342, Drop: 31458.952
Create: 25685.094, Drop: 31492.377
500 connections:
Create: 28557.882, Drop: 33673.266
Create: 27901.910, Drop: 33223.006
1000 connections:
Create: 31910.130, Drop: 36770.062
Create: 32210.093, Drop: 36754.888
2000 connections:
Create: 40374.754, Drop: 43442.528
Create: 39763.691, Drop: 43234.243
2) backend startup test:
2-1) master branch:
250 connections:
real 0m8.993s
user 0m0.128s
sys 0m0.380s
500 connections:
real 0m9.004s
user 0m0.212s
sys 0m0.340s
1000 connections:
real 0m9.072s
user 0m0.272s
sys 0m0.332s
2000 connections:
real 0m9.257s
user 0m0.204s
sys 0m0.392s
2-2) MVCC catalogs:
250 connections:
real 0m9.067s
user 0m0.108s
sys 0m0.396s
500 connections:
real 0m9.034s
user 0m0.112s
sys 0m0.376s
1000 connections:
real 0m9.303s
user 0m0.176s
sys 0m0.328s
2000 connections
real 0m9.916s
user 0m0.160s
sys 0m0.428s
Except for the backend startup test with 500 connections, which looks to have some noise, performance degradation reaches 6% for 2000 connections and less than 1% for 250 connections. This is better than last time.
For the CREATE/DROP case, the performance drop reaches 40% for 2000 connections (32% during the last tests). I also noticed a lower performance drop for 250 connections now (3~4%) compared to the first run (9%).
I compiled the main results on tables here:
http://michael.otacoo.com/postgresql-2/postgres-9-4-devel-mvcc-catalog-access-take-2-2/
The results of last time are also available here:
http://michael.otacoo.com/postgresql-2/postgres-9-4-devel-mvcc-catalog-access-2/
Regards,
--
Michael
On Wed, May 22, 2013 at 3:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> We've had a number of discussions about the evils of SnapshotNow. As
> far as I can tell, nobody likes it and everybody wants it gone, but
> there is concern about the performance impact.

I was always under the impression that the problem was that we weren't quite sure what changes would be needed to make MVCC snapshots work for the catalog lookups. The semantics of SnapshotNow aren't terribly clear either, but we have years of experience telling us they seem to basically work; most of the problems we've run into, we have worked around in the catalog accesses. Nobody really knows how many of the call sites will need different logic to behave properly with MVCC snapshots.

I thought there were many call sites that were specifically depending on seeing dirty reads to avoid race conditions with other backends -- which probably just narrowed the race conditions or created different ones. If you clean those all up, it will probably be cleaner and better, but we don't know how many such sites will need to be modified. I'm not even sure what "clean them up" means. You can replace checks with things like constraints and locks, but the implementation of constraints and locks will still need to use SnapshotNow, surely?

--
greg
On 06/05/2013 04:28 PM, Greg Stark wrote:
> I was always under the impression that the problem was we weren't
> quite sure what changes would be needed to make mvcc-snapshots work
> for the catalog lookups. The semantics of SnapshotNow aren't terribly
> clear either but we have years of experience telling us they seem to
> basically work. Most of the problems we've run into we either have
> worked around in the catalog accesses. Nobody really knows how many of
> the call sites will need different logic to behave properly with mvcc
> snapshots.
>
> I thought there were many call sites that were specifically depending
> on seeing dirty reads to avoid race conditions with other backends --
> which probably just narrowed the race condition or created different
> ones. If you clean those all up it will be probably be cleaner and
> better but we don't know how many such sites will need to be modified.
> I'm not even sure what "clean them up" means. You can replace checks
> with things like constraints and locks but the implementation of
> constraints and locks will still need to use SnapshotNow surely?

I guess that anything that does *not* write should be happier with MVCC catalogs, especially if there has been any DDL after its snapshot. For writers, the ability to compare MVCC and SnapshotNow snapshots would tell them whether they need to take extra steps. But undoubtedly the whole thing would be a lot of work.

--
Hannu Krosing
PostgreSQL Consultant
Performance, Scalability and High Availability
2ndQuadrant Nordic OÜ
On 2013-06-05 15:28:09 +0100, Greg Stark wrote:
> On Wed, May 22, 2013 at 3:18 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> We've had a number of discussions about the evils of SnapshotNow. As
>> far as I can tell, nobody likes it and everybody wants it gone, but
>> there is concern about the performance impact.
>
> I thought there were many call sites that were specifically depending
> on seeing dirty reads to avoid race conditions with other backends --
> which probably just narrowed the race condition or created different
> ones.

But SnapshotNow doesn't allow you to do actual dirty reads? It only gives you rows back that were actually visible when we checked. The difference from an MVCC snapshot is that, during a scan, the picture of which transactions are committed can change.

> I'm not even sure what "clean them up" means. You can replace checks
> with things like constraints and locks but the implementation of
> constraints and locks will still need to use SnapshotNow surely?

The places that require this should already use HeapTupleSatisfiesDirty, which is something different.

Greetings,

Andres Freund

--
Andres Freund http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Andres Freund <andres@2ndquadrant.com> writes:
> On 2013-06-05 15:28:09 +0100, Greg Stark wrote:
>> I thought there were many call sites that were specifically depending
>> on seeing dirty reads to avoid race conditions with other backends --
>> which probably just narrowed the race condition or created different
>> ones.
>
> But SnapshotNow doesn't allow you to do actual dirty reads?

Yeah. I believe the issue is that we can't simply do MVCC catalog reads with a snapshot taken at transaction start time or statement start time, as we would do if executing an MVCC scan for a user query. Rather, the snapshot has to be recent enough to ensure we see the current definition of any table we've just acquired lock on, *even if that's newer than the snapshot prevailing for the user's purposes*. Otherwise we might be using the wrong rowtype definition or failing to enforce a just-added constraint.

The last time we talked about this, we were batting around ideas of keeping a "current snapshot for catalog purposes", which we'd update or at least invalidate anytime we acquired a new lock. (In principle, if that isn't new enough, we have a race condition that we'd better fix by adding some more locking.) Robert's results seem to say that that might be unnecessary optimization, and that it'd be sufficient to just take a new snap each time we need to do a catalog scan. TBH I'm not sure I believe that; it seems to me that this approach is surely going to create a great deal more contention from concurrent GetSnapshotData calls. But at the very least, this says we can experiment with the behavioral aspects without bothering to build infrastructure for tracking an appropriate catalog snapshot.

regards, tom lane
On 2013-06-05 11:35:58 -0400, Tom Lane wrote:
> Yeah. I believe the issue is that we can't simply do MVCC catalog reads
> with a snapshot taken at transaction start time or statement start time,
> as we would do if executing an MVCC scan for a user query. Rather, the
> snapshot has to be recent enough to ensure we see the current definition
> of any table we've just acquired lock on, *even if that's newer than the
> snapshot prevailing for the user's purposes*. Otherwise we might be
> using the wrong rowtype definition or failing to enforce a just-added
> constraint.

Oh, definitely. At least Robert's previous prototype tried to do that (although I am not sure if it went far enough), and I'd be surprised if the current one didn't do so as well.

> The last time we talked about this, we were batting around ideas of
> keeping a "current snapshot for catalog purposes", which we'd update
> or at least invalidate anytime we acquired a new lock. (In principle,
> if that isn't new enough, we have a race condition that we'd better fix
> by adding some more locking.) Robert's results seem to say that that
> might be unnecessary optimization, and that it'd be sufficient to just
> take a new snap each time we need to do a catalog scan. TBH I'm not
> sure I believe that; it seems to me that this approach is surely going
> to create a great deal more contention from concurrent GetSnapshotData
> calls.
I still have a hard time believing those results as well, but I think we might have underestimated the effectiveness of the syscache during workloads which are sufficiently concurrent to make locking in GetSnapshotData() a problem. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Jun 5, 2013 at 10:28 AM, Greg Stark <stark@mit.edu> wrote:
> I was always under the impression that the problem was we weren't
> quite sure what changes would be needed to make mvcc-snapshots work
> for the catalog lookups. The semantics of SnapshotNow aren't terribly
> clear either but we have years of experience telling us they seem to
> basically work. Most of the problems we've run into we either have
> worked around in the catalog accesses. Nobody really knows how many of
> the call sites will need different logic to behave properly with mvcc
> snapshots.

With all respect, I think this is just plain wrong. SnapshotNow is just like an up-to-date MVCC snapshot. The only difference is that an MVCC snapshot, once established, stays fixed for the lifetime of the scan. With SnapshotNow, on the other hand, the view of the world changes the instant another transaction commits, meaning that scans can see multiple versions of a row, or no versions of a row, where any MVCC scan would have seen just one. There are very few places that want that behavior.

Now, I did find a couple that I thought should probably stick with SnapshotNow, specifically pgrowlocks and pgstattuple. Those are just gathering statistical information, so there's no harm in having the snapshot change partway through the scan, and if the scan is long, the user might actually regard the results under SnapshotNow as more accurate. Whether that's the case or not, holding back xmin for those kinds of scans does not seem wise.

But in most other parts of the code, the changes-in-mid-scan behavior of SnapshotNow is a huge liability. The fact that it is fully up-to-date *as of the time the scan starts* is critical for correctness. But the fact that it can then change during the scan is, in almost every case, something that we do not want. The patch preserves the first property while ditching the second one.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes: > Now, I did find a couple that I thought should probably stick with > SnapshotNow, specifically pgrowlocks and pgstattuple. Those are just > gathering statistical information, so there's no harm in having the > snapshot change part-way through the scan, and if the scan is long, > the user might actually regard the results under SnapshotNow as more > accurate. Whether that's the case or not, holding back xmin for those > kinds of scans does not seem wise. FWIW, I think if we're going to ditch SnapshotNow we should ditch SnapshotNow, full stop, even removing the tqual.c routines for it. Then we can require that *any* reference to SnapshotNow is replaced by an MVCC reference prior to execution, and throw an error if we actually try to test a tuple with that snapshot. If we don't do it like that I think we'll have errors of omission all over the place (I had really no confidence in your original patch because of that worry). The fact that there are a couple of contrib modules for which there might be an arguable advantage in doing it the old way isn't sufficient reason to expose ourselves to bugs like that. If they really want that sort of uncertain semantics they could use SnapshotDirty, no? regards, tom lane
On Wed, Jun 5, 2013 at 6:56 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote: > Robert Haas <robertmhaas@gmail.com> writes: >> Now, I did find a couple that I thought should probably stick with >> SnapshotNow, specifically pgrowlocks and pgstattuple. Those are just >> gathering statistical information, so there's no harm in having the >> snapshot change part-way through the scan, and if the scan is long, >> the user might actually regard the results under SnapshotNow as more >> accurate. Whether that's the case or not, holding back xmin for those >> kinds of scans does not seem wise. > > FWIW, I think if we're going to ditch SnapshotNow we should ditch > SnapshotNow, full stop, even removing the tqual.c routines for it. > Then we can require that *any* reference to SnapshotNow is replaced by > an MVCC reference prior to execution, and throw an error if we actually > try to test a tuple with that snapshot. If we don't do it like that > I think we'll have errors of omission all over the place (I had really > no confidence in your original patch because of that worry). The fact > that there are a couple of contrib modules for which there might be an > arguable advantage in doing it the old way isn't sufficient reason to > expose ourselves to bugs like that. If they really want that sort of > uncertain semantics they could use SnapshotDirty, no? I had the same thought, initially. I went through the exercise of doing a grep for SnapshotNow and trying to eliminate as many references as possible, but there were a few that I couldn't convince myself to rip out. However, if you'd like to apply the patch and grep for SnapshotNow and suggest what to do about the remaining cases (or hack the patch up yourself) I think that would be great. I'd love to see it completely gone if we can see our way clear to that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-06-05 18:56:28 -0400, Tom Lane wrote: > Robert Haas <robertmhaas@gmail.com> writes: > > Now, I did find a couple that I thought should probably stick with > > SnapshotNow, specifically pgrowlocks and pgstattuple. Those are just > > gathering statistical information, so there's no harm in having the > > snapshot change part-way through the scan, and if the scan is long, > > the user might actually regard the results under SnapshotNow as more > > accurate. Whether that's the case or not, holding back xmin for those > > kinds of scans does not seem wise. > > FWIW, I think if we're going to ditch SnapshotNow we should ditch > SnapshotNow, full stop, even removing the tqual.c routines for it. > Then we can require that *any* reference to SnapshotNow is replaced by > an MVCC reference prior to execution, and throw an error if we actually > try to test a tuple with that snapshot. I suggest simply renaming it to something else. Every external project that decides to follow the renaming either has a proper use case for it, or the amount of sympathy for them keeping the old behaviour is rather limited. > If they really want that sort of uncertain semantics they could use > SnapshotDirty, no? Not that happy with that. A scan behaving inconsistently over its proceedings is something rather different than reading uncommitted rows. I have the feeling that trouble is waiting that way. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
Hi Robert, Took a quick look through the patch to understand what your current revision is actually doing and to facilitate thinking about possible pain points. Here are the notes I made during my reading: On 2013-06-03 14:57:12 -0400, Robert Haas wrote: > +++ b/src/backend/catalog/catalog.c > @@ -232,6 +232,10 @@ IsReservedName(const char *name) > * know if it's shared. Fortunately, the set of shared relations is > * fairly static, so a hand-maintained list of their OIDs isn't completely > * impractical. > + * > + * XXX: Now that we have MVCC catalog access, the reasoning above is no longer > + * true. Are there other good reasons to hard-code this, or should we revisit > + * that decision? > */ We could just implement the function by looking in the shared relmapper. Everything that can be mapped via it is shared. > --- a/src/backend/commands/cluster.c > +++ b/src/backend/commands/cluster.c > @@ -480,6 +480,11 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, bool recheck, LOCKMOD > * against concurrent SnapshotNow scans of pg_index. Therefore this is unsafe > * to execute with less than full exclusive lock on the parent table; > * otherwise concurrent executions of RelationGetIndexList could miss indexes. > + * > + * XXX: Now that we have MVCC catalog access, SnapshotNow scans of pg_index > + * shouldn't be common enough to worry about. The above comment needs > + * to be updated, and it may be possible to simplify the logic here in other > + * ways also. > */ You're right, the comment needs to be changed, but I don't think the effect can. A non-inplace upgrade changes the xmin of the row which is relevant for indcheckxmin. (In fact, isn't this update possibly causing problems like delaying the use of such an index already?) > --- a/src/backend/commands/tablecmds.c > +++ b/src/backend/commands/tablecmds.c > @@ -2738,7 +2738,7 @@ AlterTableGetLockLevel(List *cmds) > * multiple DDL operations occur in a stream against frequently accessed > * tables. > * > - * 1. 
Catalog tables are read using SnapshotNow, which has a race bug that > + * 1. Catalog tables were read using SnapshotNow, which has a race bug that Heh. > --- a/src/backend/utils/time/snapmgr.c > +++ b/src/backend/utils/time/snapmgr.c > @@ -207,6 +207,19 @@ GetLatestSnapshot(void) > /* > + * GetInstantSnapshot > + * Get a snapshot that is up-to-date as of the current instant, > + * but don't set the transaction snapshot. > + */ > +Snapshot > +GetInstantSnapshot(void) > +{ > + SecondarySnapshot = GetSnapshotData(&SecondarySnapshotData); > + > + return SecondarySnapshot; > +} Hm. Looks like this should also change the description of SecondarySnapshot: /* * CurrentSnapshot points to the only snapshot taken in transaction-snapshot * mode, and to the latest one taken in a read-committed transaction. * SecondarySnapshot is a snapshot that's always up-to-date as of the current * instant, even in transaction-snapshot mode. It should only be used for * special-purpose code (say, RI checking.) */ and /* * Checking SecondarySnapshot is probably useless here, but it seems * better to be sure. */ Also looks like BuildEventTriggerCache() in evtcache.c should use GetInstantSnapshot() now. I actually wonder if we shouldn't just abolish GetLatestSnapshot(). None of the callers seem to rely on its behaviour from a quick look and it seems rather confusing to have both. > --- a/src/bin/pg_dump/pg_dump.c > +++ b/src/bin/pg_dump/pg_dump.c > @@ -14,13 +14,13 @@ > * Note that pg_dump runs in a transaction-snapshot mode transaction, > * so it sees a consistent snapshot of the database including system > * catalogs. However, it relies in part on various specialized backend > - * functions like pg_get_indexdef(), and those things tend to run on > - * SnapshotNow time, ie they look at the currently committed state. So > - * it is possible to get 'cache lookup failed' error if someone > - * performs DDL changes while a dump is happening. 
The window for this > - * sort of thing is from the acquisition of the transaction snapshot to > - * getSchemaData() (when pg_dump acquires AccessShareLock on every > - * table it intends to dump). It isn't very large, but it can happen. > + * functions like pg_get_indexdef(), and those things tend to look at > + * the currently committed state. So it is possible to get 'cache > + * lookup failed' error if someone performs DDL changes while a dump is > + * happening. The window for this sort of thing is from the acquisition > + * of the transaction snapshot to getSchemaData() (when pg_dump acquires > + * AccessShareLock on every table it intends to dump). It isn't very large, > + * but it can happen. I think we need another term for what SnapshotNow used to express here... Imo this description got less clear with this change. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 6, 2013 at 5:30 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> + * XXX: Now that we have MVCC catalog access, the reasoning above is no longer >> + * true. Are there other good reasons to hard-code this, or should we revisit >> + * that decision? > > We could just implement the function by looking in the shared > relmapper. Everything that can be mapped via it is shared. I suspect there are several possible sources for this information, but it's hard to beat a hard-coded list for efficiency, so I wasn't sure if we should tinker with this or not. >> --- a/src/backend/commands/cluster.c >> +++ b/src/backend/commands/cluster.c >> @@ -480,6 +480,11 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, bool recheck, LOCKMOD >> * against concurrent SnapshotNow scans of pg_index. Therefore this is unsafe >> * to execute with less than full exclusive lock on the parent table; >> * otherwise concurrent executions of RelationGetIndexList could miss indexes. >> + * >> + * XXX: Now that we have MVCC catalog access, SnapshotNow scans of pg_index >> + * shouldn't be common enough to worry about. The above comment needs >> + * to be updated, and it may be possible to simplify the logic here in other >> + * ways also. >> */ > > You're right, the comment needs to be changed, but I don't think the > effect can. A non-inplace upgrade changes the xmin of the row which is > relevant for indcheckxmin. OK. > (In fact, isn't this update possibly causing problems like delaying the > use of such an index already) Well, maybe. In general, the ephemeral snapshot taken for a catalog scan can't be any older than the primary snapshot already held. But there could be some corner case where that's not true, if we use this technique somewhere that such a snapshot hasn't already been acquired. > Hm. 
Looks like this should also change the description of SecondarySnapshot: > > /* > * CurrentSnapshot points to the only snapshot taken in transaction-snapshot > * mode, and to the latest one taken in a read-committed transaction. > * SecondarySnapshot is a snapshot that's always up-to-date as of the current > * instant, even in transaction-snapshot mode. It should only be used for > * special-purpose code (say, RI checking.) > * I think that's still more or less true, though we could add catalog scans as another example. > and > /* > * Checking SecondarySnapshot is probably useless here, but it seems > * better to be sure. > */ Yeah, that is not useless any more, for sure. > Also looks like BuildEventTriggerCache() in evtcache.c should use > GetInstantSnapshot() now. OK. > I actually wonder if we shouldn't just abolish GetLatestSnapshot(). None > of the callers seem to rely on its behaviour from a quick look and it > seems rather confusing to have both. I assume Tom had some reason for making GetLatestSnapshot() behave the way it does, so I refrained from doing that. I might be wrong. > I think we need another term for what SnapshotNow used to express > here... Imo this description got less clear with this change. I thought it was OK but I'm open to suggestions. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 6/5/13 3:49 PM, Robert Haas wrote: > Now, I did find a couple that I thought should probably stick with > SnapshotNow, specifically pgrowlocks and pgstattuple. FWIW, I've often wished for a way to make all stat access transactional, across all the stats views. Perhaps that couldn't be done by default, but I'd love something like a function that would make a "snapshot" of all stats data as of one point. Even if that snapshot itself wasn't completely atomic, at least then you could query any stats views however you wanted and know that the info wasn't changing over time. The reason I don't think this would work so well if done in userspace is how long it would take. Presumably making a complete backend-local copy of pg_proc etc and the stats file would be orders of magnitude faster than a bunch of CREATE TEMP TABLE's. -- Jim C. Nasby, Data Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Thu, Jun 6, 2013 at 2:48 PM, Jim Nasby <jim@nasby.net> wrote: > On 6/5/13 3:49 PM, Robert Haas wrote: >> >> Now, I did find a couple that I thought should probably stick with >> SnapshotNow, specifically pgrowlocks and pgstattuple. > > FWIW, I've often wished for a way to make all stat access transactional, > across all the stats views. Perhaps that couldn't be done by default, but > I'd love something like a function that would make a "snapshot" of all stats > data as of one point. Even if that snapshot itself wasn't completely atomic, > at least then you could query any stats views however you wanted and know > that the info wasn't changing over time. > > The reason I don't think this would work so well if done in userspace is how > long it would take. Presumably making a complete backend-local copy of > pg_proc etc and the stats file would be orders of magnitude faster than a > bunch of CREATE TEMP TABLE's. Well, maybe. But at any rate this is completely unrelated to the main topic of this thread. :-) -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-06-06 12:49:14 -0400, Robert Haas wrote: > On Thu, Jun 6, 2013 at 5:30 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> + * XXX: Now that we have MVCC catalog access, the reasoning above is no longer > >> + * true. Are there other good reasons to hard-code this, or should we revisit > >> + * that decision? > > > > We could just implement the function by looking in the shared > > relmapper. Everything that can be mapped via it is shared. > > I suspect there are several possible sources for this information, but > it's hard to beat a hard-coded list for efficiency, so I wasn't sure > if we should tinker with this or not. I can tell from experience that it makes adding a new shared relation more of a pain than it already is, but we're hopefully not doing that all that often. I just don't think that the mvcc angle has much to do with the decision. > >> --- a/src/backend/commands/cluster.c > >> +++ b/src/backend/commands/cluster.c > >> @@ -480,6 +480,11 @@ check_index_is_clusterable(Relation OldHeap, Oid indexOid, bool recheck, LOCKMOD > >> * against concurrent SnapshotNow scans of pg_index. Therefore this is unsafe > >> * to execute with less than full exclusive lock on the parent table; > >> * otherwise concurrent executions of RelationGetIndexList could miss indexes. > >> + * > >> + * XXX: Now that we have MVCC catalog access, SnapshotNow scans of pg_index > >> + * shouldn't be common enough to worry about. The above comment needs > >> + * to be updated, and it may be possible to simplify the logic here in other > >> + * ways also. > >> */ > > > > You're right, the comment needs to be changed, but I don't think the > > effect can. A non-inplace upgrade changes the xmin of the row which is > > relevant for indcheckxmin. > > OK. > > > (In fact, isn't this update possibly causing problems like delaying the > > use of such an index already) > Well, maybe. 
In general, the ephemeral snapshot taken for a catalog > scan can't be any older than the primary snapshot already held. But > there could be some corner case where that's not true, if we use this > technique somewhere that such a snapshot hasn't already been acquired. I wasn't talking about catalog scans or this patch; I wonder whether the update we do there won't cause the index not being used for concurrent (normal) scans since now the xmin is newer while it might be far in the past before. I.e. we might need to unset indcheckxmin if xmin is far enough in the past. > > Hm. Looks like this should also change the description of SecondarySnapshot: > > > > /* > > * CurrentSnapshot points to the only snapshot taken in transaction-snapshot > > * mode, and to the latest one taken in a read-committed transaction. > > * SecondarySnapshot is a snapshot that's always up-to-date as of the current > > * instant, even in transaction-snapshot mode. It should only be used for > > * special-purpose code (say, RI checking.) > > * > > I think that's still more or less true, though we could add catalog > scans as another example. I guess my feeling is that once catalog scans use it, it's not so much special purpose anymore ;). But I admit that the frequency of usage doesn't say much about its specificity... > > I actually wonder if we shouldn't just abolish GetLatestSnapshot(). None > > of the callers seem to rely on its behaviour from a quick look and it > > seems rather confusing to have both. > > I assume Tom had some reason for making GetLatestSnapshot() behave the > way it does, so I refrained from doing that. I might be wrong. At least I don't see any on a quick look - which doesn't say very much. I think I just dislike having *instant and *latest functions in there, seems to be confusing to me. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 2013-06-03 14:57:12 -0400, Robert Haas wrote: > On Thu, May 30, 2013 at 1:39 AM, Michael Paquier > <michael.paquier@gmail.com> wrote: > > +1. > > Here's a more serious patch for MVCC catalog access. This one > involves more data copying than the last one, I think, because the > previous version did not register the snapshots it took, which I think > is not safe. So this needs to be re-tested for performance, which I > have so far made no attempt to do. Ok, I am starting to take a bit more serious look. Minor issues I noticed: * index.c:index_constraint_create() - comments need to get updated * index.c:IndexCheckExclusion() - why do we still use a SnapshotNow? I'd rather not use *Now if it isn't necessary. * the *CONCURRENTLY infrastructure should be simplified once this has been applied, but I think it makes sense to keep that separate. * index.c:reindex_index() - SnapshotNow comment should be updated I still think that renaming SnapshotNow to something like SnapshotPerTuple to force everyone to reevaluate their usage would be good. So, the biggest issue with the patch seems to be performance worries. 
I tried to create a worst case scenario: postgres (patched and HEAD) running with: -c shared_buffers=4GB \ -c max_connections=2000 \ -c maintenance_work_mem=2GB \ -c checkpoint_segments=300 \ -c wal_buffers=64MB \ -c synchronous_commit=off \ -c autovacuum=off \ -p 5440 With one background pgbench running: pgbench -p 5440 -h /tmp -f /tmp/readonly-busy.sql -c 1000 -j 10 -T 100 postgres readonly-busy.sql: BEGIN; SELECT txid_current(); SELECT pg_sleep(0.0001); COMMIT; I measured the performance of one other pgbench: pgbench -h /tmp -p 5440 postgres -T 10 -c 100 -j 100 -n -f /tmp/simplequery.sql -C simplequery.sql: SELECT * FROM af1, af2 WHERE af1.x = af2.x; tables: create table af1 (x) as select g from generate_series(1,4) g; create table af2 (x) as select g from generate_series(4,7) g; With that setup one can create quite a noticeable overhead for the mvcc patch (best of 5): master-optimize: tps = 1261.629474 (including connections establishing) tps = 15121.648834 (excluding connections establishing) dev-optimize: tps = 773.719637 (including connections establishing) tps = 2804.239979 (excluding connections establishing) Most of the time in both, patched and unpatched is by far spent in GetSnapshotData. 
I think the reason this shows a far higher overhead than what you previously measured is that a) in your test the other backends were idle, in mine they actually modify PGXACT which causes noticeable cacheline bouncing b) I have a higher number of connections & #max_connections A quick test shows that even with max_connection=600, 400 background, and 100 foreground pgbenches there's noticeable overhead: master-optimize: tps = 2221.226711 (including connections establishing) tps = 31203.259472 (excluding connections establishing) dev-optimize: tps = 1629.734352 (including connections establishing) tps = 4754.449726 (excluding connections establishing) Now I grant that's a somewhat harsh test for postgres, but I don't think it's entirely unreasonable and the performance impact is quite stark. > It strikes me as rather unfortunate that the snapshot interface is > designed in such a way as to require so much data copying. It seems > we always take a snapshot by copying from PGXACT/PGPROC into > CurrentSnapshotData or SecondarySnapshotData, and then copying data a > second time from there to someplace more permanent. It would be nice > to avoid that, at least in common cases. Sounds doable. But let's do one thing at a time ;). That copy wasn't visible in the rather extreme workload from above btw... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jun 17, 2013 at 8:12 AM, Andres Freund <andres@2ndquadrant.com> wrote: > So, the biggest issue with the patch seems to be performance worries. I > tried to create a worst case scenario: > postgres (patched and HEAD) running with: > -c shared_buffers=4GB \ > -c max_connections=2000 \ > -c maintenance_work_mem=2GB \ > -c checkpoint_segments=300 \ > -c wal_buffers=64MB \ > -c synchronous_commit=off \ > -c autovacuum=off \ > -p 5440 > > With one background pgbench running: > pgbench -p 5440 -h /tmp -f /tmp/readonly-busy.sql -c 1000 -j 10 -T 100 postgres > readonly-busy.sql: > BEGIN; > SELECT txid_current(); > SELECT pg_sleep(0.0001); > COMMIT; > > I measured the performance of one other pgbench: > pgbench -h /tmp -p 5440 postgres -T 10 -c 100 -j 100 -n -f /tmp/simplequery.sql -C > simplequery.sql: > SELECT * FROM af1, af2 WHERE af1.x = af2.x; > tables: > create table af1 (x) as select g from generate_series(1,4) g; > create table af2 (x) as select g from generate_series(4,7) g; > > With that setup one can create quite a noticeable overhead for the mvcc > patch (best of 5): > > master-optimize: > tps = 1261.629474 (including connections establishing) > tps = 15121.648834 (excluding connections establishing) > > dev-optimize: > tps = 773.719637 (including connections establishing) > tps = 2804.239979 (excluding connections establishing) > > Most of the time in both, patched and unpatched is by far spent in > GetSnapshotData. 
I think the reason this shows a far higher overhead > than what you previously measured is that a) in your test the other > backends were idle, in mine they actually modify PGXACT which causes > noticeable cacheline bouncing b) I have a higher number of connections & > #max_connections > > A quick test shows that even with max_connection=600, 400 background, > and 100 foreground pgbenches there's noticeable overhead: > master-optimize: > tps = 2221.226711 (including connections establishing) > tps = 31203.259472 (excluding connections establishing) > dev-optimize: > tps = 1629.734352 (including connections establishing) > tps = 4754.449726 (excluding connections establishing) > > Now I grant that's a somewhat harsh test for postgres, but I don't > think it's entirely unreasonable and the performance impact is quite > stark. It's not entirely unreasonable, but it *is* mostly unreasonable. I mean, nobody is going to run 1000 connections in the background that do nothing but thrash PGXACT on a real system. I just can't get concerned about that. What I am concerned about is that there may be other, more realistic workloads that show similar regressions. But I don't know how to find out whether that's actually the case. On the IBM POWER box where I tested this, it's not even GetSnapshotData() that kills you; it's the system CPU scheduler. The thing about this particular test is that it's artificial - normally, any operation that wants to modify PGXACT will spend a lot more time fighting over WALInsertLock and maybe waiting for disk I/O than is the case here. Of course, with Heikki's WAL scaling patch and perhaps other optimizations we expect that other overhead to go down, which might make the problems here more visible; and some of Heikki's existing testing has shown significant contention around ProcArrayLock as things stand. But I'm still on the fence about whether this is really a valid test. 
-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-06-20 09:45:26 -0400, Robert Haas wrote: > > With that setup one can create quite a noticeable overhead for the mvcc > > patch (best of 5): > > > > master-optimize: > > tps = 1261.629474 (including connections establishing) > > tps = 15121.648834 (excluding connections establishing) > > > > dev-optimize: > > tps = 773.719637 (including connections establishing) > > tps = 2804.239979 (excluding connections establishing) > > > > Most of the time in both, patched and unpatched is by far spent in > > GetSnapshotData. I think the reason this shows a far higher overhead > > than what you previously measured is that a) in your test the other > > backends were idle, in mine they actually modify PGXACT which causes > > noticeable cacheline bouncing b) I have a higher number of connections & > > #max_connections > > > > A quick test shows that even with max_connection=600, 400 background, > > and 100 foreground pgbenches there's noticeable overhead: > > master-optimize: > > tps = 2221.226711 (including connections establishing) > > tps = 31203.259472 (excluding connections establishing) > > dev-optimize: > > tps = 1629.734352 (including connections establishing) > > tps = 4754.449726 (excluding connections establishing) > > > > Now I grant that's a somewhat harsh test for postgres, but I don't > > think it's entirely unreasonable and the performance impact is quite > > stark. > > It's not entirely unreasonable, but it *is* mostly unreasonable. Well, sure. So are the tests that you ran. But that's *completely* fine. In contrast to evaluating whether a performance improvement is worth its complexity, we're not trying to measure real world improvements. We're trying to test the worst cases we can think of, even if they aren't really interesting, by stressing potential pain points. If we can't find a relevant regression for those using something akin to microbenchmarks it's less likely that there are performance regressions. 
The "not entirely unreasonable" point is just about making sure you're not testing something entirely irrelevant. Say, performance of a 1TB database when shared_buffers is set to 64k. Or testing DDL performance while locking pg_class exclusively. The test was specifically chosen to: * do uncached syscache lookups (-C) to measure the impact of the added GetSnapshotData() calls * make individual GetSnapshotData() calls slower (all processes have an xid) * contend on ProcArrayLock but not much else (high number of clients in the background) > I > mean, nobody is going to run 1000 connections in the background that > do nothing but thrash PGXACT on a real system. I just can't get > concerned about that. In the original mail I did retry it with 400 and the regression is still pretty big. And the "background" things could also be doing something that's not that likely to be blocked by global locks. Say, operate on temporary or unlogged tables. Or just acquire a single row level lock and then continue to do readonly work in a read committed transaction. I think we both can come up with scenarios where at least part of the above scenario is present. But imo that doesn't really matter. > What I am concerned about is that there may be > other, more realistic workloads that show similar regressions. But I > don't know how to find out whether that's actually the case. So, given the results from that test and the profile I got where GetSnapshotData was by far the most expensive thing, a more representative test might be something like a readonly pgbench with a moderately high number of short lived connections. I wouldn't be surprised if that still showed performance problems. If that's not enough, something like: BEGIN; SELECT * FROM my_client WHERE client_id = :id FOR UPDATE; SELECT * FROM key_table WHERE key = :random ... SELECT * FROM key_table WHERE key = :random COMMIT; will surely still show the problem. 
> On the > IBM POWER box where I tested this, it's not even GetSnapshotData() > that kills you; it's the system CPU scheduler. I haven't tried yet, but I'd guess the above setup shows the difference with less than 400 clients. Might make it more reasonable to run there. > But I'm still on the fence about whether this is really a valid test. I think it shows that we need to be careful and do further performance evaluations and/or alleviate the pain by making things cheaper (say, a "ddl counter" in shared mem, allowing to cache snapshots for the syscache). If that artificial test hadn't shown problems I'd have voted for just going ahead and not worry further. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 20, 2013 at 10:35 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> But I'm still on the fence about whether this is really a valid test. > > I think it shows that we need to be careful and do further performance > evaluations and/or alleviate the pain by making things cheaper (say, a > "ddl counter" in shared mem, allowing to cache snapshots for the > syscache). If that artificial test hadn't shown problems I'd have voted > for just going ahead and not worry further. I tried a general snapshot counter that rolls over every time any transaction commits, but that doesn't help much. It's a small improvement on general workloads, but it's not effective against this kind of hammering. A DDL counter would be a bit more expensive because we'd have to insert an additional branch into GetSnapshotData() while ProcArrayLock is held, but it might be tolerable. Do you have code for this (or some portion of it) already? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-06-20 10:58:59 -0400, Robert Haas wrote: > On Thu, Jun 20, 2013 at 10:35 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> But I'm still on the fence about whether this is really a valid test. > > > > I think it shows that we need to be careful and do further performance > > evaluations and/or alleviate the pain by making things cheaper (say, a > > "ddl counter" in shared mem, allowing to cache snapshots for the > > syscache). If that artificial test hadn't shown problems I'd have voted > > for just going ahead and not worry further. > > I tried a general snapshot counter that rolls over every time any > transaction commits, but that doesn't help much. It's a small > improvement on general workloads, but it's not effective against this > kind of hammering. A DDL counter would be a bit more expensive > because we'd have to insert an additional branch into > GetSnapshotData() while ProcArrayLock is held, but it might be > tolerable. I actually wasn't thinking of adding it at that level. More like just not fetching a new snapshot in systable_* if a) the global ddl counter hasn't been increased b) our transaction hasn't performed any ddl. Instead, use the one we fetched the last time for that purpose. The global ddl counter could be increased every time we commit and the commit has invalidations queued. The local one every time we execute local cache invalidation messages (i.e. around CommandCounterIncrement). I think something roughly along those lines should be doable without adding new overhead to global paths. Except maybe some check in snapmgr.c for that new, longer living, snapshot. > Do you have code for this (or some portion of it) already? Nope, sorry. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Thu, Jun 20, 2013 at 11:13 AM, Andres Freund <andres@2ndquadrant.com> wrote: > I actually wasn't thinking of adding it at that level. More like just > not fetching a new snapshot in systable_* if a) the global ddl counter > hasn't been increased b) our transactions hasn't performed any > ddl. Instead use the one used the last time we fetched one for that > purpose. All right, so here's a patch that does something along those lines. We have to always take a new snapshot when somebody scans a catalog that has no syscache, because there won't be any invalidation messages to work off of in that case. The only catalog in that category that's accessed during backend startup (which is mostly what your awful test case is banging on) is pg_db_role_setting. We could add a syscache for that catalog or somehow force invalidation messages to be sent despite the lack of a syscache, but what I chose to do instead is refactor things slightly so that we use the same snapshot for all four scans of pg_db_role_setting, instead of taking a new one each time. I think that's unimpeachable on correctness grounds; it's no different than if we'd finished all four scans in the time it took us to finish the first one, and then gotten put to sleep by the OS scheduler for as long as it took us to scan the other three. Point being that there's no interlock there. Anyway, with that change, plus the general mechanism of the patch, each backend takes just two MVCC scans during startup. The first catalog access takes an MVCC snapshot for obvious reasons, and then there's one additional snapshot for the access to pg_db_role_setting for the reasons stated above. Everything else piggybacks on those snapshots, unless of course an invalidation intervenes. In my testing, this did not completely eliminate the performance hit on your test case, but it got it down to 3-4%. 
Best of five runs, with max_connections=2000 and 1000 connections running readonly-busy.sql: (patched) tps = 183.224651 (including connections establishing) tps = 1091.813178 (excluding connections establishing) (unpatched) tps = 190.598045 (including connections establishing) tps = 1129.422537 (excluding connections establishing) The difference is 3-4%, which is quite a lot less than what you measured before, although on different hardware, so results may vary. Now, I'm not sure this fix actually helps the other test scenarios very much; for example, it's not going to help the cases that pound on pg_depend, because that doesn't have a system cache either. As with pg_db_role_setting, we could optimize this mechanism by sending invalidation messages for pg_depend changes, but I'm not sure it's worth it. I'm also attaching a fixed version of pg_cxn.c; the last version had a few bugs. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
Robert Haas wrote: > All right, so here's a patch that does something along those lines. > We have to always take a new snapshot when somebody scans a catalog > that has no syscache, because there won't be any invalidation messages > to work off of in that case. The only catalog in that category that's > accessed during backend startup (which is mostly what your awful test > case is banging on) is pg_db_role_setting. We could add a syscache > for that catalog or somehow force invalidation messages to be sent > despite the lack of a syscache, but what I chose to do instead is > refactor things slightly so that we use the same snapshot for all four > scans of pg_db_role_setting, instead of taking a new one each time. I > think that's unimpeachable on correctness grounds; it's no different > than if we'd finished all four scans in the time it took us to finish > the first one, and then gotten put to sleep by the OS scheduler for as > long as it took us to scan the other three. Point being that there's > no interlock there. That seems perfectly acceptable to me, yeah. > The difference is 3-4%, which is quite a lot less than what you > measured before, although on different hardware, so results may vary. 3-4% on that synthetic benchmark sounds pretty acceptable to me, as well. -- Álvaro Herrera http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jun 28, 2013 at 12:22 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote: >> The difference is 3-4%, which is quite a lot less than what you >> measured before, although on different hardware, so results may vary. > > 3-4% on that synthetic benchmark sounds pretty acceptable to me, as > well. Here's a further update of this patch. In this version, I added some mechanism to send a new kind of sinval message that is sent when a catalog without catcaches is updated; it doesn't apply to all catalogs, just to whichever ones we want to have this treatment. That means we don't need to retake snapshots for those catalogs on every access, so backend startup requires just one extra MVCC snapshot as compared with current master. Assorted cleanup has been done, along with the removal of a few more SnapshotNow references. It's still possible to construct test cases that perform badly by pounding the server with 1000 clients running Andres's readonly-busy.sql. Consider the following test case: use a DO block to create a schema with 10,000 functions in it and then DROP .. CASCADE. When the server is unloaded, the extra MVCC overhead is pretty small. master Create: 1010.225 ms, Drop: 444.891 ms Create: 1001.237 ms, Drop: 444.084 ms Create: 979.621 ms, Drop: 446.091 ms patched Create: 992.366 ms, Drop: 459.334 ms Create: 992.436 ms, Drop: 459.921 ms Create: 990.971 ms, Drop: 459.573 ms The create case is actually running just a hair faster with the patch, and the drop case is about 3% slower. Now let's add 1000 clients running Andres's readonly-busy.sql in the background and retest: master Create: 21554.387 ms, Drop: 2594.534 ms Create: 32189.395 ms, Drop: 2493.213 ms Create: 30627.964 ms, Drop: 1813.160 ms patched Create: 44312.107 ms, Drop: 11718.305 ms Create: 46683.021 ms, Drop: 11732.284 ms Create: 50766.615 ms, Drop: 9363.742 ms Well, now the create is 52% slower and the drop is a whopping 4.7x slower. It's worth digging into the reasons just a bit. 
I was able to speed up this case quite a bit - it was 30x slower a few hours ago - by adding a few new relations to the switch in RelationInvalidatesSnapshotsOnly(). But the code still takes one MVCC snapshot per object dropped, because deleteOneObject() calls CommandCounterIncrement() and that, as it must, invalidates our previous snapshot. We could, if we were inclined to spend the effort, probably work out that although we need to change curcid, the rest of the snapshot is still OK, but I'm not too convinced that it's worth adding an even-more-complicated mechanism for this. We could probably also optimize the delete code to increment the command counter fewer times, but I'm not convinced that's worth doing either. I think my general feeling about this is that we're going to have to accept that there's no such thing as a free lunch, but maybe that's OK. The testing done to date shows that MVCC snapshots are not terribly expensive when PGXACT isn't heavily updated, even if you take an awful lot of them, but with enough concurrent activity on PGXACT they do get expensive enough to care about. And that's still not a big problem on normal workloads, but if you combine that with a workload that requires an abnormally high number of snapshots compared to the overall work it does (like a DROP CASCADE on a schema with many objects but no tables) then you can make it quite slow. That's not great, but on the other hand, if it had always been that slow, I'm not all that sure anyone would have complained. DDL performance is not something we've spent a lot of cycles on, and for good reason. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Attachment
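[The RelationInvalidatesSnapshotsOnly() idea referred to above can be sketched as follows. The OID constants here are placeholders chosen for illustration — not necessarily the relations the patch actually lists — and the function shape is a guess at the patch's intent: a switch reporting whether updates to a given catalog require only a catalog-snapshot refresh, rather than any catcache invalidation, because that catalog has no syscache.]

```c
#include <assert.h>

/*
 * Hedged sketch: catalogs with no catcache are only ever read through
 * snapshots, so an update to one of them need not flush cache entries;
 * it only needs to mark the cached catalog snapshot stale.  The OID
 * macros below are illustrative placeholders.
 */
typedef unsigned int Oid;

#define DependRelationId        2608
#define SharedDependRelationId  1214
#define DbRoleSettingRelationId 2964

static int
RelationInvalidatesSnapshotsOnly(Oid relid)
{
    switch (relid)
    {
        case DependRelationId:
        case SharedDependRelationId:
        case DbRoleSettingRelationId:
            return 1;
        default:
            return 0;
    }
}
```

This is why adding "a few new relations to the switch" helps the DROP CASCADE case: pg_depend traffic stops forcing full snapshot-plus-cache invalidation work.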
On 2013-06-28 23:14:23 -0400, Robert Haas wrote: > Here's a further update of this patch. In this version, I added some > mechanism to send a new kind of sinval message that is sent when a > catalog without catcaches is updated; it doesn't apply to all > catalogs, just to whichever ones we want to have this treatment. That > means we don't need to retake snapshots for those catalogs on every > access, so backend startup requires just one extra MVCC snapshot as > compared with current master. Assorted cleanup has been done, along > with the removal of a few more SnapshotNow references. This is really cool stuff. > It's still possible to construct test cases that perform badly by > pounding the server with 1000 clients running Andres's > readonly-busy.sql. Consider the following test case: use a DO block > to create a schema with 10,000 functions in it and then DROP .. > CASCADE. When the server is unloaded, the extra MVCC overhead is > pretty small. > Well, now the create is 52% slower and the drop is a whopping 4.7x > slower. It's worth digging into the reasons just a bit. I was able > to speed up this case quite a bit - it was 30x slower a few hours ago > - by adding a few new relations to the switch in > RelationInvalidatesSnapshotsOnly(). But the code still takes one MVCC > snapshot per object dropped, because deleteOneObject() calls > CommandCounterIncrement() and that, as it must, invalidates our > previous snapshot. I have to say, if the thing that primarily suffers is pretty extreme DDL in extreme situations I am not really worried. Anybody running anything close to the territory of such concurrency won't perform that much DDL. > We could, if we were inclined to spend the effort, > probably work out that although we need to change curcid, the rest of > the snapshot is still OK, but I'm not too convinced that it's worth > adding an even-more-complicated mechanism for this. 
We could probably > also optimize the delete code to increment the command counter fewer > times, but I'm not convinced that's worth doing either. I am pretty convinced we shouldn't do either for now. Something picked up when quickly scanning over the last version of the patch: > +/* > + * Staleness detection for CatalogSnapshot. > + */ > +static bool CatalogSnapshotStale = true; > > /* > * These are updated by GetSnapshotData. We initialize them this way > @@ -177,6 +188,9 @@ GetTransactionSnapshot(void) > else > CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); > > + /* Don't allow catalog snapshot to be older than xact snapshot. */ > + CatalogSnapshotStale = true; > + > FirstSnapshotSet = true; > return CurrentSnapshot; > } > @@ -184,6 +198,9 @@ GetTransactionSnapshot(void) > if (IsolationUsesXactSnapshot()) > return CurrentSnapshot; > > + /* Don't allow catalog snapshot to be older than xact snapshot. */ > + CatalogSnapshotStale = true; > + > CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); > > return CurrentSnapshot; > @@ -207,6 +224,54 @@ GetLatestSnapshot(void) > } Do we really need to invalidate snapshots in either situation? Isn't it implied, that if it's still valid, according to a) no invalidation via local invalidation messages b) no invalidations from other backends, there shouldn't be any possible differences when you only look at the catalog? And if it needs to change, we could copy the newly generated snapshot to the catalog snapshot if it's currently valid. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Mon, Jul 1, 2013 at 10:04 AM, Andres Freund <andres@2ndquadrant.com> wrote: > This is really cool stuff. Thanks. > I have to say, if the thing that primarily suffers is pretty extreme DDL > in extreme situations I am not really worried. Anybody running anything > close to the territory of such concurrency won't perform that much DDL. /me wipes brow. > Something picked up when quickly scanning over the last version of the > patch: > >> +/* >> + * Staleness detection for CatalogSnapshot. >> + */ >> +static bool CatalogSnapshotStale = true; >> >> /* >> * These are updated by GetSnapshotData. We initialize them this way >> @@ -177,6 +188,9 @@ GetTransactionSnapshot(void) >> else >> CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); >> >> + /* Don't allow catalog snapshot to be older than xact snapshot. */ >> + CatalogSnapshotStale = true; >> + >> FirstSnapshotSet = true; >> return CurrentSnapshot; >> } >> @@ -184,6 +198,9 @@ GetTransactionSnapshot(void) >> if (IsolationUsesXactSnapshot()) >> return CurrentSnapshot; >> >> + /* Don't allow catalog snapshot to be older than xact snapshot. */ >> + CatalogSnapshotStale = true; >> + >> CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); >> >> return CurrentSnapshot; >> @@ -207,6 +224,54 @@ GetLatestSnapshot(void) >> } > > Do we really need to invalidate snapshots in either situation? Isn't it > implied, that if it's still valid, according to a) no invalidation via local > invalidation messages b) no invalidations from other backends, there > shouldn't be any possible differences when you only look at the catalog? I had the same thought, removed that code, and then put it back. The problem is that if we revive an older snapshot "from the dead", so to speak, our backend's advertised xmin might need to go backwards, and that seems unsafe - e.g. suppose another backend has updated a tuple but not yet committed. We don't see any invalidation messages so decide to reuse our existing (old) snapshot and begin a scan. 
After we've looked at the page containing the new tuple (and decided not to see it), vacuum nukes the old tuple (which we then also don't see). Bad things ensue. It might be possible to avoid the majority of problems in this area via an appropriate set of grotty hacks, but I don't want to go there. > And if it needs to change, we could copy the newly generated snapshot > to the catalog snapshot if it's currently valid. Yeah, I think there's room for further fine-tuning there. But I think it would make sense to push the patch at this point, and then if we find cases that can be further improved, or things that it breaks, we can fix them. This area is complicated enough that I wouldn't be horribly surprised if we end up having to fix a few more problem cases or even revert the whole thing, but I think we've probably reached the point where further review has less value than getting the code out there in front of more people and seeing where (if anywhere) the wheels come off out in the wild. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-07-01 15:02:41 -0400, Robert Haas wrote: > >> * These are updated by GetSnapshotData. We initialize them this way > >> @@ -177,6 +188,9 @@ GetTransactionSnapshot(void) > >> else > >> CurrentSnapshot = GetSnapshotData(&CurrentSnapshotData); > >> > >> + /* Don't allow catalog snapshot to be older than xact snapshot. */ > >> + CatalogSnapshotStale = true; > >> + > > Do we really need to invalidate snapshots in either situation? Isn't it > > implied, that if it's still valid, according to a) no invalidation via local > > invalidation messages b) no invalidations from other backends, there > > shouldn't be any possible differences when you only look at the catalog? > > I had the same thought, removed that code, and then put it back. The > problem is that if we revive an older snapshot "from the dead", so to > speak, our backend's advertised xmin might need to go backwards, and > that seems unsafe - e.g. suppose another backend has updated a tuple > but not yet committed. We don't see any invalidation messages so > decide reuse our existing (old) snapshot and begin a scan. After > we've looked at the page containing the new tuple (and decided not to > see it), vacuum nukes the old tuple (which we then also don't see). > Bad things ensue. It might be possible to avoid the majority of > problems in this area via an appropriate set of grotty hacks, but I > don't want to go there. Yes, I thought about that and I think it's a problem that can be solved without too ugly hacks. But, as you say: > Yeah, I think there's room for further fine-tuning there. But I think > it would make sense to push the patch at this point, and then if we > find cases that can be further improved, or things that it breaks, we > can fix them. 
> This area is complicated enough that I wouldn't be > horribly surprised if we end up having to fix a few more problem cases > or even revert the whole thing, but I think we've probably reached the > point where further review has less value than getting the code out > there in front of more people and seeing where (if anywhere) the > wheels come off out in the wild. I am pretty sure that we will have to fix more stuff, but luckily we're at the beginning of the cycle. And while I'd prefer more eyes on the patch before it gets applied, especially ones knowledgeable about the implications this has, I don't really see that happening soon. So applying is more likely to lead to more review than waiting. So, from me: +1. Some things that might be worth changing when committing: * Could you add an Assert(!RelationHasSysCache(relid)) to RelationInvalidatesSnapshotsOnly? It's not unlikely that it will be missed by the next person adding a syscache and that seems like it could have ugly and hard to detect consequences. * maybe use bsearch(3) instead of open coding the binary search? We already use it in the backend. * possibly paranoid, but I'd add an Assert to heap_beginscan_catalog or GetCatalogSnapshot ensuring we're dealing with a catalog relation. The consistency mechanisms in GetCatalogSnapshot() only work for those, so there doesn't seem to be a valid usecase for non-catalog relations. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 2, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Some things that might be worth changing when committing: > * Could you add a Assert(!RelationHasSysCache(relid)) to > RelationInvalidatesSnapshotsOnly? It's not unlikely that it will be > missed by the next person adding a syscache and that seems like it > could have ugly and hard to detect consequences. There's a cross-check in InitCatalogCache() for that very issue. > * maybe use bsearch(3) instead of open coding the binary search? We > already use it in the backend. I found comments elsewhere indicating that bsearch() was slower than open-coding it, so I copied the logic used for ScanKeywordLookup(). > * possibly paranoid, but I'd add a Assert to heap_beginscan_catalog or > GetCatalogSnapshot ensuring we're dealing with a catalog relation. The > consistency mechanisms in GetCatalogSnapshot() only work for those, so > there doesn't seem to be a valid usecase for non-catalog relations. It'll actually work fine as things stand; it'll just take a new snapshot every time. I have a few ideas for getting rid of the remaining uses of SnapshotNow that I'd like to throw out there: - In pgrowlocks and pgstattuple, I think it would be fine to use SnapshotSelf instead of SnapshotNow. The only difference is that it includes changes made by the current command that wouldn't otherwise be visible until CommandCounterIncrement(). That, however, is not really a problem for their usage. - In genam.c and execUtils.c, we treat SnapshotNow as a kind of default snapshot. That seems like a crappy idea. I propose that we either set that pointer to NULL and let the server core dump if the snapshot doesn't get set or (maybe better) add a new special snapshot called SnapshotError which just errors out if you try to use it for anything, and initialize to that. - I'm not quite sure what to do about get_actual_variable_range(). Taking a new MVCC snapshot there seems like it might be pricey on some workloads. 
However, I wonder if we could use SnapshotDirty. Presumably, even uncommitted tuples affect the amount of index-scanning we have to do, so that approach seems to have some theoretical justification. But I'm worried there could be unfortunate consequences to looking at uncommitted data, either now or in the future. SnapshotSelf seems less likely to have that problem, but feels wrong somehow. - currtid_byreloid() and currtid_byrelname() use SnapshotNow as an argument to heap_get_latest_tid(). I don't know what these functions are supposed to be good for, but taking a new snapshot for every function call seems to guarantee that someone will find a way to use these functions as a poster child for how to brutalize PGXACT, so I don't particularly want to do that. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
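[The open-coded binary search Robert mentions above — modeled on ScanKeywordLookup() rather than calling bsearch(3) — looks roughly like this. The sorted OID array is invented for illustration; the real patch searches the list of catalogs whose updates invalidate only snapshots.]

```c
#include <assert.h>

/*
 * Sketch of an open-coded binary search over a sorted array of relation
 * OIDs, in the style of ScanKeywordLookup().  Array contents are
 * illustrative placeholders, not actual catalog OIDs from the patch.
 */
typedef unsigned int Oid;

static const Oid sorted_relids[] = {1213, 1214, 2608, 2610, 2964};

static int
relid_in_sorted_list(Oid relid)
{
    int low = 0;
    int high = (int) (sizeof(sorted_relids) / sizeof(sorted_relids[0])) - 1;

    while (low <= high)
    {
        int mid = low + (high - low) / 2;   /* avoids overflow on huge arrays */

        if (sorted_relids[mid] == relid)
            return 1;
        else if (sorted_relids[mid] < relid)
            low = mid + 1;
        else
            high = mid - 1;
    }
    return 0;
}
```

Open-coding avoids bsearch()'s indirect comparator calls, which is the usual argument for it on tiny, hot lookup tables like this one.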
On 2013-07-02 09:31:23 -0400, Robert Haas wrote: > On Tue, Jul 2, 2013 at 9:02 AM, Andres Freund <andres@2ndquadrant.com> wrote: > > Some things that might be worth changing when committing: > > * Could you add a Assert(!RelationHasSysCache(relid)) to > > RelationInvalidatesSnapshotsOnly? It's not unlikely that it will be > > missed by the next person adding a syscache and that seems like it > > could have ugly and hard to detect consequences. > > There's a cross-check in InitCatalogCache() for that very issue. Great. > > * maybe use bsearch(3) instead of open coding the binary search? We > > already use it in the backend. > > I found comments elsewhere indicating that bsearch() was slower than > open-coding it, so I copied the logic used for ScanKeywordLookup(). Hm. Ok. > > * possibly paranoid, but I'd add a Assert to heap_beginscan_catalog or > > GetCatalogSnapshot ensuring we're dealing with a catalog relation. The > > consistency mechanisms in GetCatalogSnapshot() only work for those, so > > there doesn't seem to be a valid usecase for non-catalog relations. > > It'll actually work find as things stand; it'll just take a new > snapshot every time. Ok. Doesn't really change my opinion that it's a crappy idea to use it otherwise ;) > - In genam.c and execUtils.c, we treat SnapshotNow as a kind of > default snapshot. That seems like a crappy idea. I propose that we > either set that pointer to NULL and let the server core dump if the > snapshot doesn't get set or (maybe better) add a new special snapshot > called SnapshotError which just errors out if you try to use it for > anything, and initialize to that. I vote for SnapshotError. > - I'm not quite sure what to do about get_actual_variable_range(). > Taking a new MVCC snapshot there seems like it might be pricey on some > workloads. However, I wonder if we could use SnapshotDirty. 
> Presumably, even uncommitted tuples affect the amount of > index-scanning we have to do, so that approach seems to have some > theoretical justification. But I'm worried there could be unfortunate > consequences to looking at uncommitted data, either now or in the > future. SnapshotSelf seems less likely to have that problem, but > feels wrong somehow. I don't like using SnapshotDirty either. Can't we just use the currently active snapshot? Unless I miss something this always will be called while we have one since when we plan we've done an explicit PushActiveSnapshot() and if we need to replan stuff during execution PortalRunSelect() will have pushed one. > - currtid_byreloid() and currtid_byrelname() use SnapshotNow as an > argument to heap_get_latest_tid(). I don't know what these functions > are supposed to be good for, but taking a new snapshot for every > function call seems to guarantee that someone will find a way to use > these functions as a poster child for how to brutalize PGXACT, so I > don't particularly want to do that. Heikki mentioned that at some point they were added for the odbc driver. I am not particularly inclined to worry about taking too many snapshots here. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On Tue, Jul 2, 2013 at 9:52 AM, Andres Freund <andres@2ndquadrant.com> wrote: >> > * possibly paranoid, but I'd add a Assert to heap_beginscan_catalog or >> > GetCatalogSnapshot ensuring we're dealing with a catalog relation. The >> > consistency mechanisms in GetCatalogSnapshot() only work for those, so >> > there doesn't seem to be a valid usecase for non-catalog relations. >> >> It'll actually work find as things stand; it'll just take a new >> snapshot every time. > > Ok. Doesn't really change my opinion that it's a crappy idea to use it > otherwise ;) I agree, but I don't see an easy way to write the assertion you want using only the OID. >> - In genam.c and execUtils.c, we treat SnapshotNow as a kind of >> default snapshot. That seems like a crappy idea. I propose that we >> either set that pointer to NULL and let the server core dump if the >> snapshot doesn't get set or (maybe better) add a new special snapshot >> called SnapshotError which just errors out if you try to use it for >> anything, and initialize to that. > > I vote for SnapshotError. OK. >> - I'm not quite sure what to do about get_actual_variable_range(). >> Taking a new MVCC snapshot there seems like it might be pricey on some >> workloads. However, I wonder if we could use SnapshotDirty. >> Presumably, even uncommitted tuples affect the amount of >> index-scanning we have to do, so that approach seems to have some >> theoretical justification. But I'm worried there could be unfortunate >> consequences to looking at uncommitted data, either now or in the >> future. SnapshotSelf seems less likely to have that problem, but >> feels wrong somehow. > > I don't like using SnapshotDirty either. Can't we just use the currently > active snapshot? Unless I miss something this always will be called > while we have one since when we plan we've done an explicit > PushActiveSnapshot() and if we need to replan stuff during execution > PortalRunSelect() will have pushed one. 
We could certainly do that, but I'd be a bit reluctant to do so without input from Tom. I imagine there might be cases where it could cause a regression. >> - currtid_byreloid() and currtid_byrelname() use SnapshotNow as an >> argument to heap_get_latest_tid(). I don't know what these functions >> are supposed to be good for, but taking a new snapshot for every >> function call seems to guarantee that someone will find a way to use >> these functions as a poster child for how to brutalize PGXACT, so I >> don't particularly want to do that. > > Heikki mentioned that at some point they were added for the odbc > driver. I am not particularly inclined to worry about taking too many > snapshots here. Well, if it uses it with any regularity I think it's a legitimate concern. We have plenty of customers who use ODBC and some of them allow frightening numbers of concurrent server connections. Now you may say that's a bad idea, but so is 1000 backends doing txid_current() in a loop. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2013-07-02 10:38:17 -0400, Robert Haas wrote: > On Tue, Jul 2, 2013 at 9:52 AM, Andres Freund <andres@2ndquadrant.com> wrote: > >> > * possibly paranoid, but I'd add a Assert to heap_beginscan_catalog or > >> > GetCatalogSnapshot ensuring we're dealing with a catalog relation. The > >> > consistency mechanisms in GetCatalogSnapshot() only work for those, so > >> > there doesn't seem to be a valid usecase for non-catalog relations. > >> > >> It'll actually work find as things stand; it'll just take a new > >> snapshot every time. > > > > Ok. Doesn't really change my opinion that it's a crappy idea to use it > > otherwise ;) > > I agree, but I don't see an easy way to write the assertion you want > using only the OID. Let's add

/*
 * IsSystemRelationId
 *		True iff the relation is a system catalog relation.
 */
bool
IsSystemRelationId(Oid relid)
{
	return relid < FirstNormalObjectId;
}

and change IsSystemRelation() to use that instead of what it does now... > >> - currtid_byreloid() and currtid_byrelname() use SnapshotNow as an > >> argument to heap_get_latest_tid(). I don't know what these functions > >> are supposed to be good for, but taking a new snapshot for every > >> function call seems to guarantee that someone will find a way to use > >> these functions as a poster child for how to brutalize PGXACT, so I > >> don't particularly want to do that. > > > > Heikki mentioned that at some point they were added for the odbc > > driver. I am not particularly inclined to worry about taking too many > > snapshots here. > > Well, if it uses it with any regularity I think it's a legitimate > concern. We have plenty of customers who use ODBC and some of them > allow frightening numbers of concurrent server connections. I've quickly verified that it indeed uses them. I wish I hadn't. Brrr. I can't even guess what that should do from the surrounding code/function names. 
Except that it looks broken under concurrency as long as SnapshotNow is used (because the query's snapshot won't be as new as SnapshotNow, even in read committed mode). Heikki, do you understand the code well enough to explain it without investing time? > Now you may say that's a bad idea, but so is 1000 backends doing > txid_current() in a loop. Hehe ;). Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training & Services
On 02.07.2013 18:24, Andres Freund wrote: > I've quickly verified that it indeed uses them. I wish I hadn't. Brrr. I > can't even guess what that should do from the surrounding code/function > names. Except that it looks broken under concurrency as long as > SnapshotNow is used (because the query's snapshot won't be as new as > SnapshotNow, even in read committed mode). > > Heikki, do you understand the code well enough to explain it without > investing time? No, sorry. I think it has something to do with updateable cursors, but I don't understand the details. - Heikki
Hi Robert, On 2013-07-02 09:31:23 -0400, Robert Haas wrote: > I have a few ideas for getting rid of the remaining uses of > SnapshotNow that I'd like to throw out there: Is your current plan to get rid of SnapshotNow entirely? I wonder because the changeset extraction needs to care, and how the proper fix for dealing with CatalogSnapshotData looks depends on it... Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services
On Fri, Jul 5, 2013 at 11:27 AM, Andres Freund <andres@2ndquadrant.com> wrote: > Hi Robert, > > On 2013-07-02 09:31:23 -0400, Robert Haas wrote: >> I have a few ideas for getting rid of the remaining uses of >> SnapshotNow that I'd like to throw out there: > > Is your current plan to get rid of SnapshotNow entirely? I am wonder > because the changeset extraction needs to care and how the proper fix > for dealing with CatalogSnapshotData looks depends on it... I would like to do that, but I haven't quite figured out how to get rid of the last few instances, per discussion upthread. I do plan to spend some more time on it, but likely not this week. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company