Thread: Read Uncommitted
I present a patch to allow READ UNCOMMITTED that is simple, useful and efficient. This was previously thought to have no useful definition within PostgreSQL, though I have identified a use case for diagnostics and recovery that merits adding a short patch to implement it.
My docs for this are copied here:
In <productname>PostgreSQL</productname>'s <acronym>MVCC</acronym>
architecture, readers are not blocked by writers, so in general
you should have no need for this transaction isolation level.
In general, read uncommitted will return inconsistent results and
wrong answers. If you look at the changes made by a transaction
while it continues to make changes then you may get partial results
from queries, or you may miss index entries that haven't yet been
written. However, if you are reading transactions that are paused
at the end of their execution for whatever reason then you can
see a consistent result.
The main use case for this transaction isolation level is for
investigating or recovering data. Examples of this would be when
inspecting the writes made by a locked or hanging transaction, when
you are running queries on a standby node that is currently paused,
such as when a standby node has halted at a recovery target with
<literal>recovery_target_inclusive = false</literal> or when you
need to inspect changes made by an in-doubt prepared transaction to
decide whether to commit or abort that transaction.
In <productname>PostgreSQL</productname> read uncommitted mode gives
a consistent snapshot of the currently running transactions at the
time the snapshot was taken. Transactions starting after that time
will not be visible, even though they are not yet committed.
This is a new and surprising thought, so please review the attached patch.
Please notice that almost all of the infrastructure already exists to support this, so this patch does very little. It avoids additional locking on main execution paths and as far as I am aware, does not break anything.
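For example, the in-doubt prepared transaction case might look roughly like this (the table name and GID are purely illustrative, and this assumes the standard syntax is accepted unchanged):

    -- list in-doubt prepared transactions
    SELECT gid, prepared, owner, database FROM pg_prepared_xacts;

    -- peek at the still-uncommitted changes
    BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED READ ONLY;
    SELECT * FROM accounts;
    COMMIT;

    -- then decide the transaction's fate
    COMMIT PREPARED 'batch_42';    -- or: ROLLBACK PREPARED 'batch_42'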
--
On 18.12.2019 13:01, Simon Riggs wrote:
> I present a patch to allow READ UNCOMMITTED that is simple, useful and
> efficient. This was previously thought to have no useful definition
> within PostgreSQL, though I have identified a use case for diagnostics
> and recovery that merits adding a short patch to implement it.
> [...]
> Please notice that almost all of the infrastructure already exists to
> support this, so this patch does very little. It avoids additional
> locking on main execution paths and as far as I am aware, does not
> break anything.
As far as I understand with "read uncommitted" policy we can see two versions of the same tuple if it was updated by two transactions both of which were started before us and committed during table traversal by transaction with "read uncommitted" policy. Certainly "read uncommitted" means that we are ready to get inconsistent results, but is it really acceptable to see multiple versions of the same tuple?
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Simon Riggs <simon@2ndquadrant.com> writes:
> I present a patch to allow READ UNCOMMITTED that is simple, useful and
> efficient.

Won't this break entirely the moment you try to read a tuple containing
toasted-out-of-line values? There's no guarantee that the toast-table
entries haven't been vacuumed away.

I suspect it can also be broken by cases involving, eg, dropped columns.
There are a lot of assumptions in the system that no one will ever try
to read dead tuples.

The fact that you can construct a use-case in which it's good for
something doesn't make it safe in general :-(

			regards, tom lane
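For readers less familiar with TOAST, the out-of-line case being described is easy to produce; a small illustration (the table is invented, and whether a given value goes out of line depends on its compressed size):

    CREATE TABLE notes (id int, body text);

    -- A large, poorly compressible value is moved out of line into the
    -- table's TOAST relation; the main tuple keeps only a pointer to it.
    INSERT INTO notes
    SELECT 1, string_agg(md5(i::text), '') FROM generate_series(1, 2000) AS i;

The concern is that a reader of uncommitted data could fetch the main tuple of an aborted insert after the corresponding TOAST rows have been removed, and then fail, or worse, when it follows that pointer.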
On Wed, 18 Dec 2019 at 12:11, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> As far as I understand with "read uncommitted" policy we can see two versions of the same tuple if it was updated by two transactions both of which were started before us and committed during table traversal by transaction with "read uncommitted" policy. Certainly "read uncommitted" means that we are ready to get inconsistent results, but is it really acceptable to see multiple versions of the same tuple?
"In general, read uncommitted will return inconsistent results and
wrong answers. If you look at the changes made by a transaction
while it continues to make changes then you may get partial results
from queries, or you may miss index entries that haven't yet been
written. However, if you are reading transactions that are paused
at the end of their execution for whatever reason then you can
see a consistent result."
I think I already covered your concerns in my suggested docs for this feature.
I'm not suggesting it for general use.
On Wed, 18 Dec 2019 at 14:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
> I present a patch to allow READ UNCOMMITTED that is simple, useful and
> efficient.
> Won't this break entirely the moment you try to read a tuple containing
> toasted-out-of-line values? There's no guarantee that the toast-table
> entries haven't been vacuumed away.
> I suspect it can also be broken by cases involving, eg, dropped columns.
> There are a lot of assumptions in the system that no one will ever try
> to read dead tuples.
This was my first concern when I thought about it, but I realised that by taking a snapshot and then calculating xmin normally, this problem would go away.
So this won't happen with the proposed patch.
> The fact that you can construct a use-case in which it's good for
> something doesn't make it safe in general :-(
I agree that safety is a concern, but I don't see any safety issues in the patch as proposed.
On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:
> This was my first concern when I thought about it, but I realised that
> by taking a snapshot and then calculating xmin normally, this problem
> would go away.

Why? As soon as a transaction aborts, the TOAST rows can be vacuumed
away, but the READ UNCOMMITTED transaction might've already seen the
main tuple. This is not even a particularly tight race, necessarily,
since for example the table might be scanned, feeding tuples into a
tuplesort, and then the detoasting might happen further up in the query
tree after the sort has completed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 18 Dec 2019 at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:
> > This was my first concern when I thought about it, but I realised that by taking a snapshot and then calculating xmin normally, this problem would go away.
>
> Why? As soon as a transaction aborts...
So this is the same discussion as elsewhere about potentially aborted transactions...
AFAIK, the worst that happens in that case is that the reading transaction will end with an ERROR, similar to a serializable error.
And that won't happen in the use cases I've explicitly described this as being useful for, which is where the writing transactions have completed executing.
I'm not claiming any useful behavior outside of those specific use cases; this is not some magic discovery that goes faster - the user has absolutely no reason to run this whatsoever unless they want to see uncommitted data. There is a very explicit warning advising against using it for anything else.
Just consider this part of the recovery toolkit.
On Wed, Dec 18, 2019 at 1:06 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> So this is the same discussion as elsewhere about potentially aborted transactions...

Yep.

> AFAIK, the worst that happens in that case is that the reading transaction will end with an ERROR, similar to a serializable error.

I'm not convinced of that. There's a big difference between a
serializable error, which is an error that is expected to be
user-facing and was designed with that in mind, and just failing a
bunch of random sanity checks all over the backend. If those sanity
checks happen to be less than comprehensive, which I suspect is likely,
there will probably be scenarios where you can crash a backend and
cause a system-wide restart. And you can probably also return just
plain wrong answers to queries in some scenarios.

> Just consider this part of the recovery toolkit.

I agree that it would be useful to have a recovery toolkit for reading
uncommitted data, but I think a lot more thought needs to be given to
how such a thing should be designed. If you just add something called
READ UNCOMMITTED, people are going to expect it to have *way* saner
semantics than this will. They'll use it routinely, not just as a
last-ditch mechanism to recover otherwise-lost data. And I'm
reasonably confident that will not work out well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndquadrant.com> writes:
> So this is the same discussion as elsewhere about potentially aborted
> transactions...
> AFAIK, the worst that happens in that case is that the reading transaction
> will end with an ERROR, similar to a serializable error.

No, the worst case is transactions trying to read invalid data, resulting
in either crashes or exploitable security breaches (in the usual vein of
what can go wrong if you can get the C code to follow an invalid pointer).

This seems possible, for example, if you can get a transaction to read
uncommitted data that was written according to some other rowtype than
what the reading transaction thinks the table rowtype is. Casting my eyes
through AlterTableGetLockLevel(), it looks like all the easy ways to break
it like that are safe (for now) because they require AccessExclusiveLock.
But I am quite afraid that we'd introduce security holes by future
reductions of required lock levels --- or else that this feature would be
the sole reason why we couldn't reduce the lock level for some DDL
operation. I'm doubtful that its use-case is worth that.

> And that won't happen in the use cases I've explicitly described this as
> being useful for, which is where the writing transactions have completed
> executing.

My concerns, at least, are not about whether this has any interesting
use-cases. They're about whether the feature can be abused to cause
security problems. I think the odds are fair that that'd be true
even today, and higher that it'd become true sometime in the future.

			regards, tom lane
On 12/18/19 10:06 AM, Simon Riggs wrote:
> On Wed, 18 Dec 2019 at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:
>>> This was my first concern when I thought about it, but I realised
>>> that by taking a snapshot and then calculating xmin normally, this
>>> problem would go away.
>>
>> Why? As soon as a transaction aborts...
>
> So this is the same discussion as elsewhere about potentially aborted
> transactions... AFAIK, the worst that happens in that case is that
> the reading transaction will end with an ERROR, similar to a
> serializable error.
>
> And that won't happen in the use cases I've explicitly described this
> as being useful for, which is where the writing transactions have
> completed executing.
>
> I'm not claiming any useful behavior outside of those specific use
> cases; this is not some magic discovery that goes faster - the user
> has absolutely no reason to run this whatsoever unless they want to
> see uncommitted data. There is a very explicit warning advising
> against using it for anything else.
>
> Just consider this part of the recovery toolkit.

In that case, don't call it "read uncommitted". Call it some other
thing entirely. Users coming from other databases may request
"read uncommitted" isolation expecting something that works.
Currently, that gets promoted to "read committed" and works. After
your change, that simply breaks and gives them an error.

I was about to write something about security and stability problems,
but Robert and Tom did elsewhere, so I'll just echo their concerns.

Looking at the regression tests, I'm surprised read uncommitted gets
so little test coverage. There's a test in src/test/isolation but
nothing at all in src/test/regression covering this isolation level.

The one in src/test/isolation doesn't look very comprehensive. I'd
at least expect a test that verifies you don't get a syntax error
when you request READ UNCOMMITTED isolation from SQL.

--
Mark Dilger
On Wed, 18 Dec 2019 at 18:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
> So this is the same discussion as elsewhere about potentially aborted
> transactions...
> AFAIK, the worst that happens in that case is that the reading transaction
> will end with an ERROR, similar to a serializable error.
> No, the worst case is transactions trying to read invalid data, resulting
> in either crashes or exploitable security breaches (in the usual vein of
> what can go wrong if you can get the C code to follow an invalid pointer).
Yes, but we're not following any pointers as a result of this. The output is just rows.
> This seems possible, for example, if you can get a transaction to read
> uncommitted data that was written according to some other rowtype than
> what the reading transaction thinks the table rowtype is. Casting my eyes
> through AlterTableGetLockLevel(), it looks like all the easy ways to break
> it like that are safe (for now) because they require AccessExclusiveLock.
> But I am quite afraid that we'd introduce security holes by future
> reductions of required lock levels --- or else that this feature would be
> the sole reason why we couldn't reduce the lock level for some DDL
> operation. I'm doubtful that its use-case is worth that.
I think we can limit it to Read Only transactions without any limitation on the proposed use cases.
But I'll think some more about that, just in case.
> And that won't happen in the use cases I've explicitly described this as
> being useful for, which is where the writing transactions have completed
> executing.
> My concerns, at least, are not about whether this has any interesting
> use-cases. They're about whether the feature can be abused to cause
> security problems. I think the odds are fair that that'd be true
> even today, and higher that it'd become true sometime in the future.
I share your concerns. We have no need or reason to make a quick decision on this patch.
I submit this patch only as a useful tool for recovering data.
On 18/12/2019 20:46, Mark Dilger wrote:
> On 12/18/19 10:06 AM, Simon Riggs wrote:
>> Just consider this part of the recovery toolkit.
>
> In that case, don't call it "read uncommitted". Call it some other
> thing entirely. Users coming from other databases may request
> "read uncommitted" isolation expecting something that works.
> Currently, that gets promoted to "read committed" and works. After
> your change, that simply breaks and gives them an error.

I agree that if we have a user-exposed READ UNCOMMITTED isolation level,
it shouldn't be just a recovery tool. For a recovery tool, I think a
set-returning function as part of contrib/pageinspect, for example,
would be more appropriate. Then it could also try to be more defensive
against corrupt pages, and be superuser-only.

- Heikki
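For reference, contrib/pageinspect already exposes raw tuple headers through superuser-only set-returning functions, so a recovery-oriented SRF would presumably follow the same pattern (the table name here is illustrative):

    CREATE EXTENSION pageinspect;

    -- Raw line pointers and xmin/xmax for every tuple on page 0 of the
    -- table, whether or not any snapshot can see it.
    SELECT lp, t_xmin, t_xmax, t_ctid
    FROM heap_page_items(get_raw_page('accounts', 0));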
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Dec 18, 2019 at 1:06 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> > Just consider this part of the recovery toolkit.
>
> I agree that it would be useful to have a recovery toolkit for reading
> uncommitted data, but I think a lot more thought needs to be given to
> how such a thing should be designed. If you just add something called
> READ UNCOMMITTED, people are going to expect it to have *way* saner
> semantics than this will. They'll use it routinely, not just as a
> last-ditch mechanism to recover otherwise-lost data. And I'm
> reasonably confident that will not work out well.

+1.

Thanks,

Stephen
Many will want to use it to do aggregation, e.g. a much more efficient
COUNT(*), because they want performance and don't care very much about
transaction consistency. E.g. they want to compute SUM(sales) by
salesperson, region for the past 5 years, and don't care very much if
some concurrent transaction aborted in the middle of computing this
result.

On 12/18/19, 2:35 PM, "Stephen Frost" <sfrost@snowman.net> wrote:

    * Robert Haas (robertmhaas@gmail.com) wrote:
    > I agree that it would be useful to have a recovery toolkit for reading
    > uncommitted data, but I think a lot more thought needs to be given to
    > how such a thing should be designed. If you just add something called
    > READ UNCOMMITTED, people are going to expect it to have *way* saner
    > semantics than this will. They'll use it routinely, not just as a
    > last-ditch mechanism to recover otherwise-lost data. And I'm
    > reasonably confident that will not work out well.

    +1.

    Thanks,

    Stephen

"Finnerty, Jim" <jfinnert@amazon.com> writes:
> Many will want to use it to do aggregation, e.g. a much more efficient
> COUNT(*), because they want performance and don't care very much about
> transaction consistency. E.g. they want to compute SUM(sales) by
> salesperson, region for the past 5 years, and don't care very much if
> some concurrent transaction aborted in the middle of computing this result.

It's fairly questionable whether there's any real advantage to be gained
by READ UNCOMMITTED in that sort of scenario --- almost all the tuples
you'd be looking at would be hinted as committed-good, ordinarily, so that
bypassing the relevant checks isn't going to save much. But I take your
point that people would *think* that READ UNCOMMITTED could be used that
way, if they come from some other DBMS. So this reinforces Mark's point
that if we provide something like this, it shouldn't be called READ
UNCOMMITTED. That should be reserved for something that has reasonably
consistent, standards-compliant behavior.

			regards, tom lane
On Wed, Dec 18, 2019 at 2:29 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I agree that if we have a user-exposed READ UNCOMMITTED isolation level,
> it shouldn't be just a recovery tool. For a recovery tool, I think a
> set-returning function as part of contrib/pageinspect, for example,
> would be more appropriate. Then it could also try to be more defensive
> against corrupt pages, and be superuser-only.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Over in [1], I became concerned that, although postgres supports Read
Uncommitted transaction isolation (by way of Read Committed mode), there
was very little test coverage for it:

On 12/18/19 10:46 AM, Mark Dilger wrote:
> Looking at the regression tests, I'm surprised read uncommitted gets
> so little test coverage. There's a test in src/test/isolation but
> nothing at all in src/test/regression covering this isolation level.
>
> The one in src/test/isolation doesn't look very comprehensive. I'd
> at least expect a test that verifies you don't get a syntax error
> when you request READ UNCOMMITTED isolation from SQL.

The attached patch set adds a modicum of test coverage for this.

Do others feel these tests are worth the small run time overhead they add?

--
Mark Dilger

[1] https://www.postgresql.org/message-id/CANP8%2Bj%2BmgWfcX9cTPsk7t%2B1kQCxgyGqHTR5R7suht7mCm_x_hA%40mail.gmail.com
Mark Dilger <hornschnorter@gmail.com> writes:
>> The one in src/test/isolation doesn't look very comprehensive. I'd
>> at least expect a test that verifies you don't get a syntax error
>> when you request READ UNCOMMITTED isolation from SQL.

> The attached patch set adds a modicum of test coverage for this.
> Do others feel these tests are worth the small run time overhead
> they add?

No. As you pointed out yourself, READ UNCOMMITTED is the same as READ
COMMITTED, so there's hardly any point in testing its semantic behavior.
One or two tests that check that it is accepted by the grammar seem
like plenty (and even there, what's there to break? If bison starts
failing us to that extent, we've got bigger problems.)

Obviously, if we made it behave differently from READ COMMITTED, then
it would need testing ... but the nature and extent of such testing
would depend a lot on what we did to it, so I'm not eager to try to
predict the need in advance.

			regards, tom lane
On 12/18/19 2:29 PM, Heikki Linnakangas wrote:
> On 18/12/2019 20:46, Mark Dilger wrote:
>> On 12/18/19 10:06 AM, Simon Riggs wrote:
>>> Just consider this part of the recovery toolkit.
>>
>> In that case, don't call it "read uncommitted". Call it some other
>> thing entirely. Users coming from other databases may request
>> "read uncommitted" isolation expecting something that works.
>> Currently, that gets promoted to "read committed" and works. After
>> your change, that simply breaks and gives them an error.
>
> I agree that if we have a user-exposed READ UNCOMMITTED isolation level,
> it shouldn't be just a recovery tool. For a recovery tool, I think a
> set-returning function as part of contrib/pageinspect, for example,
> would be more appropriate. Then it could also try to be more defensive
> against corrupt pages, and be superuser-only.

+1.

--
-David
david@pgmasters.net
On Wed, 18 Dec 2019 at 20:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Finnerty, Jim" <jfinnert@amazon.com> writes:
> Many will want to use it to do aggregation, e.g. a much more efficient COUNT(*), because they want performance and don't care very much about transaction consistency. E.g. they want to compute SUM(sales) by salesperson, region for the past 5 years, and don't care very much if some concurrent transaction aborted in the middle of computing this result.
> It's fairly questionable whether there's any real advantage to be gained
> by READ UNCOMMITTED in that sort of scenario --- almost all the tuples
> you'd be looking at would be hinted as committed-good, ordinarily, so that
> bypassing the relevant checks isn't going to save much.
Agreed; this was not intended to give any kind of backdoor benefit and I don't see any, just tears.
> But I take your
> point that people would *think* that READ UNCOMMITTED could be used that
> way, if they come from some other DBMS. So this reinforces Mark's point
> that if we provide something like this, it shouldn't be called READ
> UNCOMMITTED.
Seems like general agreement on that point from others on this thread.
> That should be reserved for something that has reasonably
> consistent, standards-compliant behavior.
Since we're discussing it, exactly what standard are we talking about here?
I'm not saying I care about that, just to complete the discussion.
On Wed, 18 Dec 2019 at 19:29, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 18/12/2019 20:46, Mark Dilger wrote:
> On 12/18/19 10:06 AM, Simon Riggs wrote:
>> Just consider this part of the recovery toolkit.
>
> In that case, don't call it "read uncommitted". Call it some other
> thing entirely. Users coming from other databases may request
> "read uncommitted" isolation expecting something that works.
> Currently, that gets promoted to "read committed" and works. After
> your change, that simply breaks and gives them an error.
> I agree that if we have a user-exposed READ UNCOMMITTED isolation level,
> it shouldn't be just a recovery tool. For a recovery tool, I think a
> set-returning function as part of contrib/pageinspect, for example,
> would be more appropriate. Then it could also try to be more defensive
> against corrupt pages, and be superuser-only.
So the consensus is for a more-specifically named facility.
I was aiming for something that would allow general SELECTs to run with a snapshot that can see uncommitted xacts, so making it a SRF wouldn't really allow that.
Not really sure where to go with the UI for this.
Hi,

On 2019-12-18 18:06:21 +0000, Simon Riggs wrote:
> On Wed, 18 Dec 2019 at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:
> > > This was my first concern when I thought about it, but I realised that
> > > by taking a snapshot and then calculating xmin normally, this problem
> > > would go away.
> >
> > Why? As soon as a transaction aborts...
>
> So this is the same discussion as elsewhere about potentially aborted
> transactions...
> AFAIK, the worst that happens in that case is that the reading transaction
> will end with an ERROR, similar to a serializable error.

I don't think that's all that can happen. E.g. the toast identifier
might have been reused, and there might be a different datum in
there. Which then means we'll end up calling operators on data that's
potentially for a different datatype - it's trivial to cause crashes
that way. And, albeit harder, possible to do more than that.

I think there's plenty other problems too, not just toast. There's
e.g. some parts of the system that access catalogs using a normal
snapshot - which might not actually be consistent, because we have
various places where we have to increment the command counter multiple
times as part of a larger catalog manipulation.

Greetings,

Andres Freund
On 12/18/19 2:17 PM, Tom Lane wrote:
> Mark Dilger <hornschnorter@gmail.com> writes:
>>> The one in src/test/isolation doesn't look very comprehensive. I'd
>>> at least expect a test that verifies you don't get a syntax error
>>> when you request READ UNCOMMITTED isolation from SQL.
>
>> The attached patch set adds a modicum of test coverage for this.
>> Do others feel these tests are worth the small run time overhead
>> they add?
>
> No. As you pointed out yourself, READ UNCOMMITTED is the same as READ
> COMMITTED, so there's hardly any point in testing its semantic behavior.
> One or two tests that check that it is accepted by the grammar seem
> like plenty (and even there, what's there to break? If bison starts
> failing us to that extent, we've got bigger problems.)

The lack of testing in the current system is so complete that if you go
into gram.y and remove READ UNCOMMITTED from the grammar, not one test
in check-world fails. Somebody doing something like what Simon is
suggesting might refactor the code in a way that unintentionally breaks
this isolation level, and we'd not know about it until users started
complaining.

The attached patch is pretty cheap. Perhaps you'll like it better?

--
Mark Dilger
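The kind of grammar-acceptance check being discussed is tiny; roughly the following (a sketch, not necessarily the exact test in the attached patch):

    -- Accepted by the grammar; today this behaves exactly like READ COMMITTED.
    BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    SHOW transaction_isolation;
    COMMIT;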
On Thu, 19 Dec 2019 at 02:22, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2019-12-18 18:06:21 +0000, Simon Riggs wrote:
> > On Wed, 18 Dec 2019 at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:
> > > > This was my first concern when I thought about it, but I realised that
> > > > by taking a snapshot and then calculating xmin normally, this problem
> > > > would go away.
> > >
> > > Why? As soon as a transaction aborts...
> >
> > So this is the same discussion as elsewhere about potentially aborted
> > transactions...
> > AFAIK, the worst that happens in that case is that the reading transaction
> > will end with an ERROR, similar to a serializable error.
>
> I don't think that's all that can happen. E.g. the toast identifier
> might have been reused, and there might be a different datum in
> there. Which then means we'll end up calling operators on data that's
> potentially for a different datatype - it's trivial to cause crashes
> that way. And, albeit harder, possible to do more than that.
On the patch as proposed this wouldn't be possible because a toast row can't be vacuumed and then reused while holding back xmin, at least as I understand it.
> I think there's plenty other problems too, not just toast. There's
> e.g. some parts of the system that access catalogs using a normal
> snapshot - which might not actually be consistent, because we have
> various places where we have to increment the command counter multiple
> times as part of a larger catalog manipulation.
It seems possible that catalog access would be the thing that makes this difficult. Cache invalidations wouldn't yet have been fired, so that would lead to rather weird errors, and as you say, potential issues from data type changes which would not be acceptable in a facility available to non-superusers.
We could limit that to xacts that don't do DDL, which is a very small % of xacts, but then those xacts are more likely to be ones you'd want to recover or investigate.
So I now withdraw this patch as submitted and won't be resubmitting.
Thanks everyone for your input.
On 2019-12-18 16:14, Simon Riggs wrote:
> On Wed, 18 Dec 2019 at 12:11, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
>> As far as I understand with "read uncommitted" policy we can see two
>> versions of the same tuple if it was updated by two transactions
>> both of which were started before us and committed during table
>> traversal by transaction with "read uncommitted" policy. Certainly
>> "read uncommitted" means that we are ready to get inconsistent
>> results, but is it really acceptable to see multiple versions of the
>> same tuple?
>
> "In general, read uncommitted will return inconsistent results and
> wrong answers. If you look at the changes made by a transaction
> while it continues to make changes then you may get partial results
> from queries, or you may miss index entries that haven't yet been
> written. However, if you are reading transactions that are paused
> at the end of their execution for whatever reason then you can
> see a consistent result."
>
> I think I already covered your concerns in my suggested docs for this
> feature.

Independent of the technical concerns, I don't think the SQL standard
allows the READ UNCOMMITTED level to behave in a way that violates the
logical requirements of the defined database schema. So if we wanted to
add this, we should probably name it something else.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thursday, 19.12.2019 at 00:13 +0000, Simon Riggs wrote:
> So the consensus is for a more-specifically named facility.
>
> I was aiming for something that would allow general SELECTs to run
> with a snapshot that can see uncommitted xacts, so making it a SRF
> wouldn't really allow that.

pg_dirtyread() [1] has been around for a while; it implements an SRF
for debugging access to data that has, under normal circumstances,
disappeared. It would be nice not to have to worry about anything when
you are faced with that kind of problem, but for such use cases I think
an SRF serves quite well.

[1] https://github.com/df7cb/pg_dirtyread

Bernd
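From that extension's documentation, usage looks roughly like this (the table and column list are illustrative; the alias must match the table's actual definition):

    CREATE EXTENSION pg_dirtyread;

    -- Returns tuples still physically present in the heap, including dead
    -- ones; the column list in the alias defines the output row shape.
    SELECT * FROM pg_dirtyread('accounts') AS t(id int, balance numeric);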
On Thu, 19 Dec 2019 at 12:42, Bernd Helmle <mailings@oopsware.de> wrote:
> On Thursday, 19.12.2019 at 00:13 +0000, Simon Riggs wrote:
> > So the consensus is for a more-specifically named facility.
> >
> > I was aiming for something that would allow general SELECTs to run
> > with a snapshot that can see uncommitted xacts, so making it a SRF
> > wouldn't really allow that.
>
> pg_dirtyread() [1] has been around for a while; it implements an SRF
> for debugging access to data that has, under normal circumstances,
> disappeared. It would be nice not to have to worry about anything when
> you are faced with that kind of problem, but for such use cases I think
> an SRF serves quite well.
>
> [1] https://github.com/df7cb/pg_dirtyread
As an example of an SRF for debugging purposes, sure, but then we already had the example of pageinspect, which I wrote, BTW, so I wasn't unfamiliar with the thought.
Note that pg_dirtyread has got nothing to do with the use cases I described.
On 12/19/19 1:50 AM, Simon Riggs wrote:
> It seems possible that catalog access would be the thing that makes this
> difficult. Cache invalidations wouldn't yet have been fired, so that
> would lead to rather weird errors, and as you say, potential issues from
> data type changes which would not be acceptable in a facility available
> to non-superusers.
>
> We could limit that to xacts that don't do DDL, which is a very small %
> of xacts, but then those xacts are more likely to be ones you'd want to
> recover or investigate.
>
> So I now withdraw this patch as submitted and won't be resubmitting.

Oh, I'm sorry to hear that. I thought this feature sounded useful, and
we were working out what its limitations were. What I gathered from the
discussion so far was:

- It should be called something other than READ UNCOMMITTED
- It should only be available to superusers, at least for the initial
  implementation
- Extra care might be required to lock catalogs to avoid unsafe
  operations that could lead to backends crashing or security
  vulnerabilities
- Toast tables need to be handled with care

For the record, in case we revisit this idea in the future, which were
the obstacles that killed this patch?

Tom's point on that third item:

> But I am quite afraid that we'd introduce security holes by future
> reductions of required lock levels --- or else that this feature would be
> the sole reason why we couldn't reduce the lock level for some DDL
> operation. I'm doubtful that its use-case is worth that.

Anybody running SET TRANSACTION ISOLATION LEVEL RECOVERY might have to
get ExclusiveLock on most of the catalog tables. But that would only be
done if somebody starts a transaction using this isolation level, which
is not normal, so it shouldn't be a problem under normal situations. If
the lock level reduction that Tom mentions was implemented, there would
be no problem, as long as the lock level you reduce to still blocks
against ExclusiveLock, which surely it must. If the transaction running
in RECOVERY level isolation can't get the locks, then it blocks and
doesn't help the administrator who is trying to use this feature, but
that is no worse than the present situation where the feature is
entirely absent. When no catalog changes are in flight, the
administrator gets the locks and can continue inspecting the in-process
changes of other transactions.

Robert's point on that fourth item:

> As soon as a transaction aborts, the TOAST rows can be vacuumed
> away, but the READ UNCOMMITTED transaction might've already seen the
> main tuple. This is not even a particularly tight race, necessarily,
> since for example the table might be scanned, feeding tuples into a
> tuplesort, and then the detoasting might happen further up in the query
> tree after the sort has completed.

I don't know if this could be fixed without adding overhead to toast
processing for non-RECOVERY transactions, but perhaps it doesn't need
to be fixed at all. Perhaps you just accept that in RECOVERY mode you
can't see toast data, and instead get NULLs for all such rows. Now,
that could have security implications if somebody defines a policy
where NULL in a toast column means "allow" rather than "deny" for some
issue, but if this RECOVERY mode is limited to superusers, that isn't
such a big objection.

There may be a number of other gotchas still to be resolved, but
abandoning the patch at this stage strikes me as premature.

--
Mark Dilger
On 12/19/19 7:08 AM, Mark Dilger wrote:
> and instead get NULLs for all such rows

To clarify, I mean the toasted column is null for rows where the value
was stored in the toast table rather than stored inline. I'd prefer
some special value that means "this datum unavailable" so that it could
be distinguished from an actual null, but no such value comes to mind.

--
Mark Dilger
Hi,

On 2019-12-19 09:50:44 +0000, Simon Riggs wrote:
> On the patch as proposed this wouldn't be possible because a toast row
> can't be vacuumed and then reused while holding back xmin, at least as I
> understand it.

Vacuum and pruning remove rows where xmin didn't commit, without testing
against the horizon. Which makes sense, because normally there's so far
no snapshot including them. Unless we were to weaken that logic -
which'd have bloat impacts - a snapshot wouldn't guarantee anything
about the non-removal of such tuples, unless I am missing something.

Greetings,

Andres Freund
Hi,

On 2019-12-19 07:08:06 -0800, Mark Dilger wrote:
> > As soon as a transaction aborts, the TOAST rows can be vacuumed
> > away, but the READ UNCOMMITTED transaction might've already seen the
> > main tuple. This is not even a particularly tight race, necessarily,
> > since for example the table might be scanned, feeding tuples into a
> > tuplesort, and then the detoasting might happen further up in the query
> > tree after the sort has completed.
>
> I don't know if this could be fixed without adding overhead to toast
> processing for non-RECOVERY transactions, but perhaps it doesn't need
> to be fixed at all. Perhaps you just accept that in RECOVERY mode you
> can't see toast data, and instead get NULLs for all such rows. Now,
> that could have security implications if somebody defines a policy
> where NULL in a toast column means "allow" rather than "deny" for
> some issue, but if this RECOVERY mode is limited to superusers, that
> isn't such a big objection.

I mean, that's just a small part of the issue. You can get *different*
data back for toast columns - incompatible with the datatype, leading
to crashes. You can get *different* data back for the same query,
running it twice, because data that was just inserted can get pruned
away if the inserting transaction aborted.

> There may be a number of other gotchas still to be resolved, but
> abandoning the patch at this stage strikes me as premature.

I think if we'd want this feature, you'd have to actually use a much
larger hammer, and change the snapshot logic to include information
about which aborted transactions are visible, and whose rows cannot be
removed. And then vacuuming/hot pruning need to be changed to respect
that. And note that'll affect *all* sessions, not just the one wanting
to use READ UNCOMMITTED.

Greetings,

Andres Freund
On Thu, 19 Dec 2019 at 23:36, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> > On the patch as proposed this wouldn't be possible because a toast row
> > can't be vacuumed and then reused while holding back xmin, at least as I
> > understand it.
>
> Vacuum and pruning remove rows where xmin didn't commit, without testing
> against the horizon. Which makes sense, because normally there's so far
> no snapshot including them. Unless we were to weaken that logic -
> which'd have bloat impacts - a snapshot wouldn't guarantee anything
> about the non-removal of such tuples, unless I am missing something.
My understanding from reading the above is that Simon didn't propose to make aborted txns visible, only in-progress uncommitted txns.
Vacuum only removes such rows if the xact is (a) explicitly aborted in clog or (b) provably not still running. It checks RecentXmin and the running xids arrays to handle xacts that went away after a crash. Per TransactionIdIsInProgress() as used by HeapTupleSatisfiesVacuum(). I see that it's not *quite* as simple as using the RecentXmin threshold, as xacts newer than RecentXmin may also be seen as not in-progress if they're absent in the shmem xact arrays and there's no overflow.
But that's OK so long as the only xacts that some sort of read-uncommitted feature allows to become visible are ones that satisfy TransactionIdIsInProgress(); they cannot have been vacuumed.
The bigger issue, and the one that IMO makes it impractical to spell this as "READ UNCOMMITTED", is that an uncommitted txn might've changed the catalogs so there is no one snapshot that is valid for interpreting all possible tuples. It'd have to see only txns that have no catalog changes, or be narrowed to see just *one specific txn* that had catalog changes. That makes it iffy to spell it as "READ UNCOMMITTED" since we can't actually make all uncommitted txns visible at once.
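To make that concrete, an uncommitted transaction of the following shape (table and columns invented) is the problem case: its new rows only make sense against its own, not-yet-visible catalog changes.

    BEGIN;
    ALTER TABLE accounts ADD COLUMN note text;
    INSERT INTO accounts (id, balance, note) VALUES (42, 0, 'new shape');
    -- still uncommitted here: the inserted row can only be decoded against
    -- the altered rowtype, which other sessions cannot yet see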
I think the suggestions for a SRF based approach might make sense. Perhaps a few functions:
* a function to list all in-progress xids
* a function to list in-progress xids with/without catalog changes (if possible, unsure if we know that until the commit record is written)
* a function (or procedure?) to execute a read-only SELECT or WITH query within a faked-up snapshot for some in-progress xact and return a SETOF RECORD with results. If the txn has catalog changes this would probably have to coalesce each result field with non-builtin data types to text, or do some fancy validation to compare the definition in the txn snapshot with the latest normal snapshot used by the calling session. Ideally this function could take an array of xids and would query with them all visible unless there were catalog changes in any of them, then it'd ERROR.
* a function to generate the SQL text for an alias clause that maps the RECORD returned by the above function, so you can semi-conveniently query it. (I don't think we have a way for a C callable function to supply a dynamic resultset type at plan-time to avoid the need for this, do we? Perhaps if we use a procedure not a function?)
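A rough sketch of how the pieces above might be invoked; the first query uses an existing facility, while the second function is purely hypothetical (its name, signature, and the table are invented):

    -- Transactions currently in progress, as seen by the current snapshot.
    SELECT txid_snapshot_xip(txid_current_snapshot());

    -- Hypothetical: run a read-only query with transaction 1234567's
    -- uncommitted changes visible, returning SETOF RECORD.
    SELECT *
    FROM pg_peek_in_progress_xact(1234567,
                                  'SELECT id, balance FROM accounts')
         AS t(id int, balance numeric);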
Craig Ringer <craig@2ndquadrant.com> writes:
> My understanding from reading the above is that Simon didn't propose to
> make aborted txns visible, only in-progress uncommitted txns.

Yeah, but an "in-progress uncommitted txn" can become an "aborted txn"
at any moment, and there's no interlock that would prevent its generated
data from being removed out from under you at any moment after that.
So there's a race condition, and as Robert observed, the window isn't
necessarily small.

> The bigger issue, and the one that IMO makes it impractical to spell this
> as "READ UNCOMMITTED", is that an uncommitted txn might've changed the
> catalogs so there is no one snapshot that is valid for interpreting all
> possible tuples.

In theory that should be okay, because any such tuples would be in tables
you can't access due to the in-progress txn having taken
AccessExclusiveLock on tables it changes the rowtype of. But we keep
looking for ways to reduce the locking requirements for ALTER TABLE
actions --- and as I said upthread, it feels like this feature might
someday be the sole reason why we can't safely reduce lock strength for
some form of ALTER. I can't make a concrete argument for that though ...
maybe it's not really any worse than the situation just after failure of
any DDL-performing txn. But that would bear closer study than I think
it's had.

> I think the suggestions for a SRF based approach might make sense.

Yeah, I'd rather do it that way than via a transaction isolation
level. The isolation-level approach seems like people will expect
stronger semantics than we could actually provide.

			regards, tom lane
On Fri, 20 Dec 2019 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
> > My understanding from reading the above is that Simon didn't propose to
> > make aborted txns visible, only in-progress uncommitted txns.
>
> Yeah, but an "in-progress uncommitted txn" can become an "aborted txn"
> at any moment, and there's no interlock that would prevent its generated
> data from being removed out from under you at any moment after that.
> So there's a race condition, and as Robert observed, the window isn't
> necessarily small.
Absolutely. Many of the same issues arise in the work on logical decoding of in-progress xacts for optimistic logical decoding.
Unless such an interlock is added (with all the problems that entails, again per the in-progress logical decoding thread) that limits this to:
* running in recovery when stopped at a recovery target; or
* peeking at the contents of individual prepared xacts that we can prevent someone else concurrently aborting/committing
That'd actually cover the only things I'd personally want a feature like this for anyway.
In any case, Simon's yanked the proposal. I'd like to have some way to peek at the contents of individual uncommited xacts, but it's clearly not going to be anything called READ UNCOMMITTED that applies to all uncommitted xacts at once...
> I think the suggestions for a SRF based approach might make sense.
> Yeah, I'd rather do it that way than via a transaction isolation
> level. The isolation-level approach seems like people will expect
> stronger semantics than we could actually provide.
Yeah. Definitely not an isolation level.
It'll be interesting to see if this sparks any more narrowly scoped and targeted ideas, anyway. Thanks for taking the time to think about it.