Thread: Read Uncommitted
I present a patch to allow READ UNCOMMITTED that is simple, useful and efficient. This was previously thought to have no useful definition within PostgreSQL, though I have identified a use case for diagnostics and recovery that merits adding a short patch to implement it.
My docs for this are copied here:
In <productname>PostgreSQL</productname>'s <acronym>MVCC</acronym>
architecture, readers are not blocked by writers, so in general
you should have no need for this transaction isolation level.
In general, read uncommitted will return inconsistent results and
wrong answers. If you look at the changes made by a transaction
while it continues to make changes then you may get partial results
from queries, or you may miss index entries that haven't yet been
written. However, if you are reading transactions that are paused
at the end of their execution for whatever reason then you can
see a consistent result.
The main use case for this transaction isolation level is for
investigating or recovering data. Examples of this would be when
inspecting the writes made by a locked or hanging transaction, when
you are running queries on a standby node that is currently paused,
such as when a standby node has halted at a recovery target with
<literal>recovery_target_inclusive = false</literal> or when you
need to inspect changes made by an in-doubt prepared transaction to
decide whether to commit or abort that transaction.
In <productname>PostgreSQL</productname> read uncommitted mode gives
a consistent snapshot of the currently running transactions at the
time the snapshot was taken. Transactions starting after that time
will not be visible, even though they are not yet committed.
This is a new and surprising thought, so please review the attached patch.
Please notice that almost all of the infrastructure already exists to support this, so this patch does very little. It avoids additional locking on main execution paths and as far as I am aware, does not break anything.
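For example, the in-doubt prepared transaction case might look roughly like this (the table name and GID are purely illustrative, and this assumes the standard syntax is accepted unchanged):

    -- list in-doubt prepared transactions
    SELECT gid, prepared, owner, database FROM pg_prepared_xacts;

    -- peek at the still-uncommitted changes
    BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED READ ONLY;
    SELECT * FROM accounts;
    COMMIT;

    -- then decide the transaction's fate
    COMMIT PREPARED 'batch_42';    -- or: ROLLBACK PREPARED 'batch_42'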
--
On 18.12.2019 13:01, Simon Riggs wrote:
> I present a patch to allow READ UNCOMMITTED that is simple, useful and
> efficient. This was previously thought to have no useful definition
> within PostgreSQL, though I have identified a use case for diagnostics
> and recovery that merits adding a short patch to implement it.
> [...]
> Please notice that almost all of the infrastructure already exists to
> support this, so this patch does very little. It avoids additional
> locking on main execution paths and as far as I am aware, does not
> break anything.
As far as I understand with "read uncommitted" policy we can see two versions of the same tuple if it was updated by two transactions both of which were started before us and committed during table traversal by transaction with "read uncommitted" policy. Certainly "read uncommitted" means that we are ready to get inconsistent results, but is it really acceptable to see multiple versions of the same tuple?
--
Konstantin Knizhnik
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Simon Riggs <simon@2ndquadrant.com> writes:
> I present a patch to allow READ UNCOMMITTED that is simple, useful and
> efficient.

Won't this break entirely the moment you try to read a tuple containing
toasted-out-of-line values? There's no guarantee that the toast-table
entries haven't been vacuumed away.

I suspect it can also be broken by cases involving, eg, dropped columns.
There are a lot of assumptions in the system that no one will ever try
to read dead tuples.

The fact that you can construct a use-case in which it's good for
something doesn't make it safe in general :-(

			regards, tom lane
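For readers less familiar with TOAST, the out-of-line case being described is easy to produce; a small illustration (the table is invented, and whether a given value goes out of line depends on its compressed size):

    CREATE TABLE notes (id int, body text);

    -- A large, poorly compressible value is moved out of line into the
    -- table's TOAST relation; the main tuple keeps only a pointer to it.
    INSERT INTO notes
    SELECT 1, string_agg(md5(i::text), '') FROM generate_series(1, 2000) AS i;

The concern is that a reader of uncommitted data could fetch the main tuple of an aborted insert after the corresponding TOAST rows have been removed, and then fail, or worse, when it follows that pointer.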
On Wed, 18 Dec 2019 at 12:11, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
> As far as I understand with "read uncommitted" policy we can see two versions of the same tuple if it was updated by two transactions both of which were started before us and committed during table traversal by transaction with "read uncommitted" policy. Certainly "read uncommitted" means that we are ready to get inconsistent results, but is it really acceptable to see multiple versions of the same tuple?
"In general, read uncommitted will return inconsistent results and
wrong answers. If you look at the changes made by a transaction
while it continues to make changes then you may get partial results
from queries, or you may miss index entries that haven't yet been
written. However, if you are reading transactions that are paused
at the end of their execution for whatever reason then you can
see a consistent result."
I think I already covered your concerns in my suggested docs for this feature.
I'm not suggesting it for general use.
On Wed, 18 Dec 2019 at 14:06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
> I present a patch to allow READ UNCOMMITTED that is simple, useful and
> efficient.
> Won't this break entirely the moment you try to read a tuple containing
> toasted-out-of-line values? There's no guarantee that the toast-table
> entries haven't been vacuumed away.
> I suspect it can also be broken by cases involving, eg, dropped columns.
> There are a lot of assumptions in the system that no one will ever try
> to read dead tuples.
This was my first concern when I thought about it, but I realised that by taking a snapshot and then calculating xmin normally, this problem would go away.
So this won't happen with the proposed patch.
> The fact that you can construct a use-case in which it's good for
> something doesn't make it safe in general :-(
I agree that safety is a concern, but I don't see any safety issues in the patch as proposed.
On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:
> This was my first concern when I thought about it, but I realised that
> by taking a snapshot and then calculating xmin normally, this problem
> would go away.

Why? As soon as a transaction aborts, the TOAST rows can be vacuumed
away, but the READ UNCOMMITTED transaction might've already seen the
main tuple. This is not even a particularly tight race, necessarily,
since for example the table might be scanned, feeding tuples into a
tuplesort, and then the detoasting might happen further up in the query
tree after the sort has completed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Wed, 18 Dec 2019 at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:
> > This was my first concern when I thought about it, but I realised that by taking a snapshot and then calculating xmin normally, this problem would go away.
>
> Why? As soon as a transaction aborts...
So this is the same discussion as elsewhere about potentially aborted transactions...
AFAIK, the worst that happens in that case is that the reading transaction will end with an ERROR, similar to a serializable error.
And that won't happen in the use cases I've explicitly described this as being useful for, which is where the writing transactions have completed executing.
I'm not claiming any useful behavior outside of those specific use cases; this is not some magic discovery that goes faster - the user has absolutely no reason to run this whatsoever unless they want to see uncommitted data. There is a very explicit warning advising against using it for anything else.
Just consider this part of the recovery toolkit.
On Wed, Dec 18, 2019 at 1:06 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> So this is the same discussion as elsewhere about potentially aborted transactions...

Yep.

> AFAIK, the worst that happens in that case is that the reading transaction will end with an ERROR, similar to a serializable error.

I'm not convinced of that. There's a big difference between a
serializable error, which is an error that is expected to be
user-facing and was designed with that in mind, and just failing a
bunch of random sanity checks all over the backend. If those sanity
checks happen to be less than comprehensive, which I suspect is likely,
there will probably be scenarios where you can crash a backend and
cause a system-wide restart. And you can probably also return just
plain wrong answers to queries in some scenarios.

> Just consider this part of the recovery toolkit.

I agree that it would be useful to have a recovery toolkit for reading
uncommitted data, but I think a lot more thought needs to be given to
how such a thing should be designed. If you just add something called
READ UNCOMMITTED, people are going to expect it to have *way* saner
semantics than this will. They'll use it routinely, not just as a
last-ditch mechanism to recover otherwise-lost data. And I'm
reasonably confident that will not work out well.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Simon Riggs <simon@2ndquadrant.com> writes:
> So this is the same discussion as elsewhere about potentially aborted
> transactions...
> AFAIK, the worst that happens in that case is that the reading transaction
> will end with an ERROR, similar to a serializable error.

No, the worst case is transactions trying to read invalid data, resulting
in either crashes or exploitable security breaches (in the usual vein of
what can go wrong if you can get the C code to follow an invalid pointer).

This seems possible, for example, if you can get a transaction to read
uncommitted data that was written according to some other rowtype than
what the reading transaction thinks the table rowtype is. Casting my eyes
through AlterTableGetLockLevel(), it looks like all the easy ways to break
it like that are safe (for now) because they require AccessExclusiveLock.
But I am quite afraid that we'd introduce security holes by future
reductions of required lock levels --- or else that this feature would be
the sole reason why we couldn't reduce the lock level for some DDL
operation. I'm doubtful that its use-case is worth that.

> And that won't happen in the use cases I've explicitly described this as
> being useful for, which is where the writing transactions have completed
> executing.

My concerns, at least, are not about whether this has any interesting
use-cases. They're about whether the feature can be abused to cause
security problems. I think the odds are fair that that'd be true
even today, and higher that it'd become true sometime in the future.

			regards, tom lane
On 12/18/19 10:06 AM, Simon Riggs wrote:
> On Wed, 18 Dec 2019 at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:
>>> This was my first concern when I thought about it, but I realised
>>> that by taking a snapshot and then calculating xmin normally, this
>>> problem would go away.
>>
>> Why? As soon as a transaction aborts...
>
> So this is the same discussion as elsewhere about potentially aborted
> transactions... AFAIK, the worst that happens in that case is that
> the reading transaction will end with an ERROR, similar to a
> serializable error.
>
> And that won't happen in the use cases I've explicitly described this
> as being useful for, which is where the writing transactions have
> completed executing.
>
> I'm not claiming any useful behavior outside of those specific use
> cases; this is not some magic discovery that goes faster - the user
> has absolutely no reason to run this whatsoever unless they want to
> see uncommitted data. There is a very explicit warning advising
> against using it for anything else.
>
> Just consider this part of the recovery toolkit.

In that case, don't call it "read uncommitted". Call it some other
thing entirely. Users coming from other databases may request
"read uncommitted" isolation expecting something that works.
Currently, that gets promoted to "read committed" and works. After
your change, that simply breaks and gives them an error.

I was about to write something about security and stability problems,
but Robert and Tom did elsewhere, so I'll just echo their concerns.

Looking at the regression tests, I'm surprised read uncommitted gets
so little test coverage. There's a test in src/test/isolation but
nothing at all in src/test/regression covering this isolation level.

The one in src/test/isolation doesn't look very comprehensive. I'd
at least expect a test that verifies you don't get a syntax error
when you request READ UNCOMMITTED isolation from SQL.

--
Mark Dilger
On Wed, 18 Dec 2019 at 18:37, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Simon Riggs <simon@2ndquadrant.com> writes:
> So this is the same discussion as elsewhere about potentially aborted
> transactions...
> AFAIK, the worst that happens in that case is that the reading transaction
> will end with an ERROR, similar to a serializable error.
> No, the worst case is transactions trying to read invalid data, resulting
> in either crashes or exploitable security breaches (in the usual vein of
> what can go wrong if you can get the C code to follow an invalid pointer).
Yes, but we're not following any pointers as a result of this. The output is just rows.
> This seems possible, for example, if you can get a transaction to read
> uncommitted data that was written according to some other rowtype than
> what the reading transaction thinks the table rowtype is. Casting my eyes
> through AlterTableGetLockLevel(), it looks like all the easy ways to break
> it like that are safe (for now) because they require AccessExclusiveLock.
> But I am quite afraid that we'd introduce security holes by future
> reductions of required lock levels --- or else that this feature would be
> the sole reason why we couldn't reduce the lock level for some DDL
> operation. I'm doubtful that its use-case is worth that.
I think we can limit it to Read Only transactions without any limitation on the proposed use cases.
But I'll think some more about that, just in case.
> And that won't happen in the use cases I've explicitly described this as
> being useful for, which is where the writing transactions have completed
> executing.
> My concerns, at least, are not about whether this has any interesting
> use-cases. They're about whether the feature can be abused to cause
> security problems. I think the odds are fair that that'd be true
> even today, and higher that it'd become true sometime in the future.
I share your concerns. We have no need or reason to make a quick decision on this patch.
I submit this patch only as a useful tool for recovering data.
On 18/12/2019 20:46, Mark Dilger wrote:
> On 12/18/19 10:06 AM, Simon Riggs wrote:
>> Just consider this part of the recovery toolkit.
>
> In that case, don't call it "read uncommitted". Call it some other
> thing entirely. Users coming from other databases may request
> "read uncommitted" isolation expecting something that works.
> Currently, that gets promoted to "read committed" and works. After
> your change, that simply breaks and gives them an error.

I agree that if we have a user-exposed READ UNCOMMITTED isolation level,
it shouldn't be just a recovery tool. For a recovery tool, I think a
set-returning function as part of contrib/pageinspect, for example,
would be more appropriate. Then it could also try to be more defensive
against corrupt pages, and be superuser-only.

- Heikki
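For reference, contrib/pageinspect already exposes raw tuple headers through superuser-only set-returning functions, so a recovery-oriented SRF would presumably follow the same pattern (the table name here is illustrative):

    CREATE EXTENSION pageinspect;

    -- Raw line pointers and xmin/xmax for every tuple on page 0 of the
    -- table, whether or not any snapshot can see it.
    SELECT lp, t_xmin, t_xmax, t_ctid
    FROM heap_page_items(get_raw_page('accounts', 0));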
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Wed, Dec 18, 2019 at 1:06 PM Simon Riggs <simon@2ndquadrant.com> wrote:
> > Just consider this part of the recovery toolkit.
>
> I agree that it would be useful to have a recovery toolkit for reading
> uncommitted data, but I think a lot more thought needs to be given to
> how such a thing should be designed. If you just add something called
> READ UNCOMMITTED, people are going to expect it to have *way* saner
> semantics than this will. They'll use it routinely, not just as a
> last-ditch mechanism to recover otherwise-lost data. And I'm
> reasonably confident that will not work out well.

+1.

Thanks,

Stephen
Many will want to use it to do aggregation, e.g. a much more efficient
COUNT(*), because they want performance and don't care very much about
transaction consistency. E.g. they want to compute SUM(sales) by
salesperson, region for the past 5 years, and don't care very much if
some concurrent transaction aborted in the middle of computing this
result.

On 12/18/19, 2:35 PM, "Stephen Frost" <sfrost@snowman.net> wrote:

    * Robert Haas (robertmhaas@gmail.com) wrote:
    > I agree that it would be useful to have a recovery toolkit for reading
    > uncommitted data, but I think a lot more thought needs to be given to
    > how such a thing should be designed. If you just add something called
    > READ UNCOMMITTED, people are going to expect it to have *way* saner
    > semantics than this will. They'll use it routinely, not just as a
    > last-ditch mechanism to recover otherwise-lost data. And I'm
    > reasonably confident that will not work out well.

    +1.

    Thanks,

    Stephen

"Finnerty, Jim" <jfinnert@amazon.com> writes:
> Many will want to use it to do aggregation, e.g. a much more efficient
> COUNT(*), because they want performance and don't care very much about
> transaction consistency. E.g. they want to compute SUM(sales) by
> salesperson, region for the past 5 years, and don't care very much if
> some concurrent transaction aborted in the middle of computing this result.

It's fairly questionable whether there's any real advantage to be gained
by READ UNCOMMITTED in that sort of scenario --- almost all the tuples
you'd be looking at would be hinted as committed-good, ordinarily, so that
bypassing the relevant checks isn't going to save much. But I take your
point that people would *think* that READ UNCOMMITTED could be used that
way, if they come from some other DBMS. So this reinforces Mark's point
that if we provide something like this, it shouldn't be called READ
UNCOMMITTED. That should be reserved for something that has reasonably
consistent, standards-compliant behavior.

			regards, tom lane
On Wed, Dec 18, 2019 at 2:29 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> I agree that if we have a user-exposed READ UNCOMMITTED isolation level,
> it shouldn't be just a recovery tool. For a recovery tool, I think a
> set-returning function as part of contrib/pageinspect, for example,
> would be more appropriate. Then it could also try to be more defensive
> against corrupt pages, and be superuser-only.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Over in [1], I became concerned that, although postgres supports Read
Uncommitted transaction isolation (by way of Read Committed mode), there
was very little test coverage for it:

On 12/18/19 10:46 AM, Mark Dilger wrote:
> Looking at the regression tests, I'm surprised read uncommitted gets
> so little test coverage. There's a test in src/test/isolation but
> nothing at all in src/test/regression covering this isolation level.
>
> The one in src/test/isolation doesn't look very comprehensive. I'd
> at least expect a test that verifies you don't get a syntax error
> when you request READ UNCOMMITTED isolation from SQL.

The attached patch set adds a modicum of test coverage for this.

Do others feel these tests are worth the small run time overhead they add?

--
Mark Dilger

[1] https://www.postgresql.org/message-id/CANP8%2Bj%2BmgWfcX9cTPsk7t%2B1kQCxgyGqHTR5R7suht7mCm_x_hA%40mail.gmail.com
Mark Dilger <hornschnorter@gmail.com> writes:
>> The one in src/test/isolation doesn't look very comprehensive. I'd
>> at least expect a test that verifies you don't get a syntax error
>> when you request READ UNCOMMITTED isolation from SQL.

> The attached patch set adds a modicum of test coverage for this.
> Do others feel these tests are worth the small run time overhead
> they add?

No. As you pointed out yourself, READ UNCOMMITTED is the same as READ
COMMITTED, so there's hardly any point in testing its semantic behavior.
One or two tests that check that it is accepted by the grammar seem
like plenty (and even there, what's there to break? If bison starts
failing us to that extent, we've got bigger problems.)

Obviously, if we made it behave differently from READ COMMITTED, then
it would need testing ... but the nature and extent of such testing
would depend a lot on what we did to it, so I'm not eager to try to
predict the need in advance.

			regards, tom lane
On 12/18/19 2:29 PM, Heikki Linnakangas wrote:
> On 18/12/2019 20:46, Mark Dilger wrote:
>> On 12/18/19 10:06 AM, Simon Riggs wrote:
>>> Just consider this part of the recovery toolkit.
>>
>> In that case, don't call it "read uncommitted". Call it some other
>> thing entirely. Users coming from other databases may request
>> "read uncommitted" isolation expecting something that works.
>> Currently, that gets promoted to "read committed" and works. After
>> your change, that simply breaks and gives them an error.
>
> I agree that if we have a user-exposed READ UNCOMMITTED isolation level,
> it shouldn't be just a recovery tool. For a recovery tool, I think a
> set-returning function as part of contrib/pageinspect, for example,
> would be more appropriate. Then it could also try to be more defensive
> against corrupt pages, and be superuser-only.

+1.

--
-David
david@pgmasters.net
On Wed, 18 Dec 2019 at 20:36, Tom Lane <tgl@sss.pgh.pa.us> wrote:
"Finnerty, Jim" <jfinnert@amazon.com> writes:
> Many will want to use it to do aggregation, e.g. a much more efficient COUNT(*), because they want performance and don't care very much about transaction consistency. E.g. they want to compute SUM(sales) by salesperson, region for the past 5 years, and don't care very much if some concurrent transaction aborted in the middle of computing this result.
> It's fairly questionable whether there's any real advantage to be gained
> by READ UNCOMMITTED in that sort of scenario --- almost all the tuples
> you'd be looking at would be hinted as committed-good, ordinarily, so that
> bypassing the relevant checks isn't going to save much.
Agreed; this was not intended to give any kind of backdoor benefit and I don't see any, just tears.
> But I take your
> point that people would *think* that READ UNCOMMITTED could be used that
> way, if they come from some other DBMS. So this reinforces Mark's point
> that if we provide something like this, it shouldn't be called READ
> UNCOMMITTED.
Seems like general agreement on that point from others on this thread.
> That should be reserved for something that has reasonably
> consistent, standards-compliant behavior.
Since we're discussing it, exactly what standard are we talking about here?
I'm not saying I care about that, just to complete the discussion.
On Wed, 18 Dec 2019 at 19:29, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 18/12/2019 20:46, Mark Dilger wrote:
> On 12/18/19 10:06 AM, Simon Riggs wrote:
>> Just consider this part of the recovery toolkit.
>
> In that case, don't call it "read uncommitted". Call it some other
> thing entirely. Users coming from other databases may request
> "read uncommitted" isolation expecting something that works.
> Currently, that gets promoted to "read committed" and works. After
> your change, that simply breaks and gives them an error.
> I agree that if we have a user-exposed READ UNCOMMITTED isolation level,
> it shouldn't be just a recovery tool. For a recovery tool, I think a
> set-returning function as part of contrib/pageinspect, for example,
> would be more appropriate. Then it could also try to be more defensive
> against corrupt pages, and be superuser-only.
So the consensus is for a more-specifically named facility.
I was aiming for something that would allow general SELECTs to run with a snapshot that can see uncommitted xacts, so making it a SRF wouldn't really allow that.
Not really sure where to go with the UI for this.
Hi,

On 2019-12-18 18:06:21 +0000, Simon Riggs wrote:
> On Wed, 18 Dec 2019 at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:
> > > This was my first concern when I thought about it, but I realised that
> > > by taking a snapshot and then calculating xmin normally, this problem
> > > would go away.
> >
> > Why? As soon as a transaction aborts...
>
> So this is the same discussion as elsewhere about potentially aborted
> transactions...
> AFAIK, the worst that happens in that case is that the reading transaction
> will end with an ERROR, similar to a serializable error.

I don't think that's all that can happen. E.g. the toast identifier
might have been reused, and there might be a different datum in
there. Which then means we'll end up calling operators on data that's
potentially for a different datatype - it's trivial to cause crashes
that way. And, albeit harder, possible to do more than that.

I think there's plenty other problems too, not just toast. There's
e.g. some parts of the system that access catalogs using a normal
snapshot - which might not actually be consistent, because we have
various places where we have to increment the command counter multiple
times as part of a larger catalog manipulation.

Greetings,

Andres Freund
On 12/18/19 2:17 PM, Tom Lane wrote:
> Mark Dilger <hornschnorter@gmail.com> writes:
>>> The one in src/test/isolation doesn't look very comprehensive. I'd
>>> at least expect a test that verifies you don't get a syntax error
>>> when you request READ UNCOMMITTED isolation from SQL.
>
>> The attached patch set adds a modicum of test coverage for this.
>> Do others feel these tests are worth the small run time overhead
>> they add?
>
> No. As you pointed out yourself, READ UNCOMMITTED is the same as READ
> COMMITTED, so there's hardly any point in testing its semantic behavior.
> One or two tests that check that it is accepted by the grammar seem
> like plenty (and even there, what's there to break? If bison starts
> failing us to that extent, we've got bigger problems.)

The lack of testing in the current system is so complete that if you go
into gram.y and remove READ UNCOMMITTED from the grammar, not one test
in check-world fails. Somebody doing something like what Simon is
suggesting might refactor the code in a way that unintentionally breaks
this isolation level, and we'd not know about it until users started
complaining.

The attached patch is pretty cheap. Perhaps you'll like it better?

--
Mark Dilger
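The kind of grammar-acceptance check being discussed is tiny; roughly the following (a sketch, not necessarily the exact test in the attached patch):

    -- Accepted by the grammar; today this behaves exactly like READ COMMITTED.
    BEGIN TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    SHOW transaction_isolation;
    COMMIT;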
On Thu, 19 Dec 2019 at 02:22, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2019-12-18 18:06:21 +0000, Simon Riggs wrote:
> > On Wed, 18 Dec 2019 at 17:35, Robert Haas <robertmhaas@gmail.com> wrote:
> > > On Wed, Dec 18, 2019 at 10:18 AM Simon Riggs <simon@2ndquadrant.com> wrote:
> > > > This was my first concern when I thought about it, but I realised that
> > > > by taking a snapshot and then calculating xmin normally, this problem
> > > > would go away.
> > >
> > > Why? As soon as a transaction aborts...
> >
> > So this is the same discussion as elsewhere about potentially aborted
> > transactions...
> > AFAIK, the worst that happens in that case is that the reading transaction
> > will end with an ERROR, similar to a serializable error.
>
> I don't think that's all that can happen. E.g. the toast identifier
> might have been reused, and there might be a different datum in
> there. Which then means we'll end up calling operators on data that's
> potentially for a different datatype - it's trivial to cause crashes
> that way. And, albeit harder, possible to do more than that.
On the patch as proposed this wouldn't be possible because a toast row can't be vacuumed and then reused while holding back xmin, at least as I understand it.
> I think there's plenty other problems too, not just toast. There's
> e.g. some parts of the system that access catalogs using a normal
> snapshot - which might not actually be consistent, because we have
> various places where we have to increment the command counter multiple
> times as part of a larger catalog manipulation.
It seems possible that catalog access would be the thing that makes this difficult. Cache invalidations wouldn't yet have been fired, so that would lead to rather weird errors, and as you say, potential issues from data type changes which would not be acceptable in a facility available to non-superusers.
We could limit that to xacts that don't do DDL, which is a very small % of xacts, but then those xacts are more likely to be ones you'd want to recover or investigate.
So I now withdraw this patch as submitted and won't be resubmitting.
Thanks everyone for your input.
On 2019-12-18 16:14, Simon Riggs wrote:
> On Wed, 18 Dec 2019 at 12:11, Konstantin Knizhnik <k.knizhnik@postgrespro.ru> wrote:
>> As far as I understand with "read uncommitted" policy we can see two
>> versions of the same tuple if it was updated by two transactions
>> both of which were started before us and committed during table
>> traversal by transaction with "read uncommitted" policy. Certainly
>> "read uncommitted" means that we are ready to get inconsistent
>> results, but is it really acceptable to see multiple versions of the
>> same tuple?
>
> "In general, read uncommitted will return inconsistent results and
> wrong answers. If you look at the changes made by a transaction
> while it continues to make changes then you may get partial results
> from queries, or you may miss index entries that haven't yet been
> written. However, if you are reading transactions that are paused
> at the end of their execution for whatever reason then you can
> see a consistent result."
>
> I think I already covered your concerns in my suggested docs for this
> feature.

Independent of the technical concerns, I don't think the SQL standard
allows the READ UNCOMMITTED level to behave in a way that violates the
logical requirements of the defined database schema. So if we wanted to
add this, we should probably name it something else.

--
Peter Eisentraut
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thursday, 19.12.2019 at 00:13 +0000, Simon Riggs wrote:
> So the consensus is for a more-specifically named facility.
>
> I was aiming for something that would allow general SELECTs to run
> with a snapshot that can see uncommitted xacts, so making it a SRF
> wouldn't really allow that.

pg_dirtyread() [1] has been around for a while; it implements an SRF
for debugging access to data that has, under normal circumstances,
disappeared. It would be nice not to have to worry about anything when
you are faced with that kind of problem, but for such use cases I think
an SRF serves quite well.

[1] https://github.com/df7cb/pg_dirtyread

Bernd
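From that extension's documentation, usage looks roughly like this (the table and column list are illustrative; the alias must match the table's actual definition):

    CREATE EXTENSION pg_dirtyread;

    -- Returns tuples still physically present in the heap, including dead
    -- ones; the column list in the alias defines the output row shape.
    SELECT * FROM pg_dirtyread('accounts') AS t(id int, balance numeric);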
On Thu, 19 Dec 2019 at 12:42, Bernd Helmle <mailings@oopsware.de> wrote:
> On Thursday, 19.12.2019 at 00:13 +0000, Simon Riggs wrote:
> > So the consensus is for a more-specifically named facility.
> >
> > I was aiming for something that would allow general SELECTs to run
> > with a snapshot that can see uncommitted xacts, so making it a SRF
> > wouldn't really allow that.
>
> pg_dirtyread() [1] has been around for a while; it implements an SRF
> for debugging access to data that has, under normal circumstances,
> disappeared. It would be nice not to have to worry about anything when
> you are faced with that kind of problem, but for such use cases I think
> an SRF serves quite well.
>
> [1] https://github.com/df7cb/pg_dirtyread
As an example of an SRF for debugging purposes, sure, but then we already had the example of pageinspect, which I wrote, BTW, so I wasn't unfamiliar with the thought.
Note that pg_dirtyread has got nothing to do with the use cases I described.
On 12/19/19 1:50 AM, Simon Riggs wrote:
> It seems possible that catalog access would be the thing that makes this
> difficult. Cache invalidations wouldn't yet have been fired, so that
> would lead to rather weird errors, and as you say, potential issues from
> data type changes which would not be acceptable in a facility available
> to non-superusers.
>
> We could limit that to xacts that don't do DDL, which is a very small %
> of xacts, but then those xacts are more likely to be ones you'd want to
> recover or investigate.
>
> So I now withdraw this patch as submitted and won't be resubmitting.

Oh, I'm sorry to hear that. I thought this feature sounded useful, and
we were working out what its limitations were. What I gathered from the
discussion so far was:

- It should be called something other than READ UNCOMMITTED
- It should only be available to superusers, at least for the initial
  implementation
- Extra care might be required to lock catalogs to avoid unsafe
  operations that could lead to backends crashing or security
  vulnerabilities
- Toast tables need to be handled with care

For the record, in case we revisit this idea in the future, which were
the obstacles that killed this patch?

Tom's point on that third item:

> But I am quite afraid that we'd introduce security holes by future
> reductions of required lock levels --- or else that this feature would be
> the sole reason why we couldn't reduce the lock level for some DDL
> operation. I'm doubtful that its use-case is worth that.

Anybody running SET TRANSACTION ISOLATION LEVEL RECOVERY might have to
get ExclusiveLock on most of the catalog tables. But that would only be
done if somebody starts a transaction using this isolation level, which
is not normal, so it shouldn't be a problem under normal situations. If
the lock level reduction that Tom mentions was implemented, there would
be no problem, as long as the lock level you reduce to still blocks
against ExclusiveLock, which surely it must. If the transaction running
in RECOVERY level isolation can't get the locks, then it blocks and
doesn't help the administrator who is trying to use this feature, but
that is no worse than the present situation where the feature is
entirely absent. When no catalog changes are in flight, the
administrator gets the locks and can continue inspecting the in-process
changes of other transactions.

Robert's point on that fourth item:

> As soon as a transaction aborts, the TOAST rows can be vacuumed
> away, but the READ UNCOMMITTED transaction might've already seen the
> main tuple. This is not even a particularly tight race, necessarily,
> since for example the table might be scanned, feeding tuples into a
> tuplesort, and then the detoasting might happen further up in the query
> tree after the sort has completed.

I don't know if this could be fixed without adding overhead to toast
processing for non-RECOVERY transactions, but perhaps it doesn't need
to be fixed at all. Perhaps you just accept that in RECOVERY mode you
can't see toast data, and instead get NULLs for all such rows. Now,
that could have security implications if somebody defines a policy
where NULL in a toast column means "allow" rather than "deny" for some
issue, but if this RECOVERY mode is limited to superusers, that isn't
such a big objection.

There may be a number of other gotchas still to be resolved, but
abandoning the patch at this stage strikes me as premature.

--
Mark Dilger
On 12/19/19 7:08 AM, Mark Dilger wrote:
> and instead get NULLs for all such rows

To clarify, I mean the toasted column is null for rows where the value
was stored in the toast table rather than stored inline. I'd prefer
some special value that means "this datum unavailable" so that it could
be distinguished from an actual null, but no such value comes to mind.

--
Mark Dilger
Hi,

On 2019-12-19 09:50:44 +0000, Simon Riggs wrote:
> On the patch as proposed this wouldn't be possible because a toast row
> can't be vacuumed and then reused while holding back xmin, at least as I
> understand it.

Vacuum and pruning remove rows where xmin didn't commit, without testing
against the horizon. Which makes sense, because normally there's so far
no snapshot including them. Unless we were to weaken that logic -
which'd have bloat impacts - a snapshot wouldn't guarantee anything
about the non-removal of such tuples, unless I am missing something.

Greetings,

Andres Freund
Hi,

On 2019-12-19 07:08:06 -0800, Mark Dilger wrote:
> > As soon as a transaction aborts, the TOAST rows can be vacuumed
> > away, but the READ UNCOMMITTED transaction might've already seen the
> > main tuple. This is not even a particularly tight race, necessarily,
> > since for example the table might be scanned, feeding tuples into a
> > tuplesort, and then the detoasting might happen further up in the query
> > tree after the sort has completed.
>
> I don't know if this could be fixed without adding overhead to toast
> processing for non-RECOVERY transactions, but perhaps it doesn't need
> to be fixed at all. Perhaps you just accept that in RECOVERY mode you
> can't see toast data, and instead get NULLs for all such rows. Now,
> that could have security implications if somebody defines a policy
> where NULL in a toast column means "allow" rather than "deny" for
> some issue, but if this RECOVERY mode is limited to superusers, that
> isn't such a big objection.

I mean, that's just a small part of the issue. You can get *different*
data back for toast columns - incompatible with the datatype, leading
to crashes. You can get *different* data back for the same query,
running it twice, because data that was just inserted can get pruned
away if the inserting transaction aborted.

> There may be a number of other gotchas still to be resolved, but
> abandoning the patch at this stage strikes me as premature.

I think if we'd want this feature, you'd have to actually use a much
larger hammer, and change the snapshot logic to include information
about which aborted transactions are visible, and whose rows cannot be
removed. And then vacuuming/hot pruning need to be changed to respect
that. And note that'll affect *all* sessions, not just the one wanting
to use READ UNCOMMITTED.

Greetings,

Andres Freund
On Thu, 19 Dec 2019 at 23:36, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> > On the patch as proposed this wouldn't be possible because a toast row
> > can't be vacuumed and then reused while holding back xmin, at least as I
> > understand it.
>
> Vacuum and pruning remove rows where xmin didn't commit, without testing
> against the horizon. Which makes sense, because normally there's so far
> no snapshot including them. Unless we were to weaken that logic -
> which'd have bloat impacts - a snapshot wouldn't guarantee anything
> about the non-removal of such tuples, unless I am missing something.
My understanding from reading the above is that Simon didn't propose to make aborted txns visible, only in-progress uncommitted txns.
Vacuum only removes such rows if the xact is (a) explicitly aborted in clog or (b) provably not still running. It checks RecentXmin and the running xids arrays to handle xacts that went away after a crash. Per TransactionIdIsInProgress() as used by HeapTupleSatisfiesVacuum(). I see that it's not *quite* as simple as using the RecentXmin threshold, as xacts newer than RecentXmin may also be seen as not in-progress if they're absent in the shmem xact arrays and there's no overflow.
But that's OK so long as the only xacts that some sort of read-uncommitted feature allows to become visible are ones that satisfy TransactionIdIsInProgress(); they cannot have been vacuumed.
The bigger issue, and the one that IMO makes it impractical to spell this as "READ UNCOMMITTED", is that an uncommitted txn might've changed the catalogs so there is no one snapshot that is valid for interpreting all possible tuples. It'd have to see only txns that have no catalog changes, or be narrowed to see just *one specific txn* that had catalog changes. That makes it iffy to spell it as "READ UNCOMMITTED" since we can't actually make all uncommitted txns visible at once.
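To make that concrete, an uncommitted transaction of the following shape (table and columns invented) is the problem case: its new rows only make sense against its own, not-yet-visible catalog changes.

    BEGIN;
    ALTER TABLE accounts ADD COLUMN note text;
    INSERT INTO accounts (id, balance, note) VALUES (42, 0, 'new shape');
    -- still uncommitted here: the inserted row can only be decoded against
    -- the altered rowtype, which other sessions cannot yet see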
I think the suggestions for a SRF based approach might make sense. Perhaps a few functions:
* a function to list all in-progress xids
* a function to list in-progress xids with/without catalog changes (if possible, unsure if we know that until the commit record is written)
* a function (or procedure?) to execute a read-only SELECT or WITH query within a faked-up snapshot for some in-progress xact and return a SETOF RECORD with results. If the txn has catalog changes this would probably have to coalesce each result field with non-builtin data types to text, or do some fancy validation to compare the definition in the txn snapshot with the latest normal snapshot used by the calling session. Ideally this function could take an array of xids and would query with them all visible unless there were catalog changes in any of them, then it'd ERROR.
* a function to generate the SQL text for an alias clause that maps the RECORD returned by the above function, so you can semi-conveniently query it. (I don't think we have a way for a C callable function to supply a dynamic resultset type at plan-time to avoid the need for this, do we? Perhaps if we use a procedure not a function?)
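A rough sketch of how the pieces above might be invoked; the first query uses an existing facility, while the second function is purely hypothetical (its name, signature, and the table are invented):

    -- Transactions currently in progress, as seen by the current snapshot.
    SELECT txid_snapshot_xip(txid_current_snapshot());

    -- Hypothetical: run a read-only query with transaction 1234567's
    -- uncommitted changes visible, returning SETOF RECORD.
    SELECT *
    FROM pg_peek_in_progress_xact(1234567,
                                  'SELECT id, balance FROM accounts')
         AS t(id int, balance numeric);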
Craig Ringer <craig@2ndquadrant.com> writes:
> My understanding from reading the above is that Simon didn't propose to
> make aborted txns visible, only in-progress uncommitted txns.

Yeah, but an "in-progress uncommitted txn" can become an "aborted txn"
at any moment, and there's no interlock that would prevent its generated
data from being removed out from under you at any moment after that.
So there's a race condition, and as Robert observed, the window isn't
necessarily small.

> The bigger issue, and the one that IMO makes it impractical to spell this
> as "READ UNCOMMITTED", is that an uncommitted txn might've changed the
> catalogs so there is no one snapshot that is valid for interpreting all
> possible tuples.

In theory that should be okay, because any such tuples would be in tables
you can't access due to the in-progress txn having taken
AccessExclusiveLock on tables it changes the rowtype of. But we keep
looking for ways to reduce the locking requirements for ALTER TABLE
actions --- and as I said upthread, it feels like this feature might
someday be the sole reason why we can't safely reduce lock strength for
some form of ALTER. I can't make a concrete argument for that though ...
maybe it's not really any worse than the situation just after failure of
any DDL-performing txn. But that would bear closer study than I think
it's had.

> I think the suggestions for a SRF based approach might make sense.

Yeah, I'd rather do it that way than via a transaction isolation
level. The isolation-level approach seems like people will expect
stronger semantics than we could actually provide.

			regards, tom lane
On Fri, 20 Dec 2019 at 12:18, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Craig Ringer <craig@2ndquadrant.com> writes:
> > My understanding from reading the above is that Simon didn't propose to
> > make aborted txns visible, only in-progress uncommitted txns.
>
> Yeah, but an "in-progress uncommitted txn" can become an "aborted txn"
> at any moment, and there's no interlock that would prevent its generated
> data from being removed out from under you at any moment after that.
> So there's a race condition, and as Robert observed, the window isn't
> necessarily small.
Absolutely. Many of the same issues arise in the work on logical decoding of in-progress xacts for optimistic logical decoding.
Unless such an interlock is added (with all the problems that entails, again per the in-progress logical decoding thread) that limits this to:
* running in recovery when stopped at a recovery target; or
* peeking at the contents of individual prepared xacts that we can prevent someone else concurrently aborting/committing
That'd actually cover the only things I'd personally want a feature like this for anyway.
In any case, Simon's yanked the proposal. I'd like to have some way to peek at the contents of individual uncommited xacts, but it's clearly not going to be anything called READ UNCOMMITTED that applies to all uncommitted xacts at once...
> I think the suggestions for a SRF based approach might make sense.
> Yeah, I'd rather do it that way than via a transaction isolation
> level. The isolation-level approach seems like people will expect
> stronger semantics than we could actually provide.
Yeah. Definitely not an isolation level.
It'll be interesting to see if this sparks any more narrowly scoped and targeted ideas, anyway. Thanks for taking the time to think about it.