Thread: Commits 8de72b and 5457a1 (COPY FREEZE)
http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8de72b66a2edcf12c812de0a73bd50b6b7d81d62

Part of that patch was reverted later:

http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=5457a130d3a66db807d1e0ee2b8e829321809b83

I assume that refers to the discussion here:

http://archives.postgresql.org/pgsql-hackers/2012-02/msg01177.php

But that was quite a while ago, so is there a more recent discussion that prompted this commit now?

I am a little confused about the case where setting HEAP_XMIN_COMMITTED when loading the table in the same transaction is wrong. There was some discussion about subtransactions, but those problems only seemed to appear when the CREATE and the INSERT/COPY happened in different subtransactions.

So, I guess my question is, why the partial revert?

Regards,
Jeff Davis
On 4 December 2012 01:34, Jeff Davis <pgsql@j-davis.com> wrote:
> I assume that refers to the discussion here:
>
> http://archives.postgresql.org/pgsql-hackers/2012-02/msg01177.php
>
> But that was quite a while ago, so is there a more recent discussion
> that prompted this commit now?

Yes, this was discussed within the last month, in the thread "TRUNCATE SERIALIZABLE...".

The patch for that was already posted, so I committed it.

> I am a little confused about the case where setting HEAP_XMIN_COMMITTED
> when loading the table in the same transaction is wrong. There was some
> discussion about subtransactions, but those problems only seemed to
> appear when the CREATE and the INSERT/COPY happened in different
> subtransactions.
>
> So, I guess my question is, why the partial revert?

Well, first, because I realised that it wasn't wanted. I was re-reading the threads trying to figure out a way to help the checksum patch further.

Second, because I realised why it wasn't wanted. Setting HEAP_XMIN_COMMITTED causes MVCC violations within the transaction issuing the COPY. Accepting the MVCC violation requires explicit consent from the user, and if we have that, we may as well set everything we can. So there's no middle ground: nothing, or all frozen.

We could change that, but it would require some complex tweaks to the tqual routines, which I tried and was not happy with, since they would need to get more complex. It's possible a route through that minefield exists; I'm inclined to believe other approaches are more likely to yield benefit.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Tue, 2012-12-04 at 10:15 +0000, Simon Riggs wrote:
> On 4 December 2012 01:34, Jeff Davis <pgsql@j-davis.com> wrote:
> > I assume that refers to the discussion here:
> >
> > http://archives.postgresql.org/pgsql-hackers/2012-02/msg01177.php
> >
> > But that was quite a while ago, so is there a more recent discussion
> > that prompted this commit now?
>
> Yes, this was discussed within the last month, in the thread "TRUNCATE SERIALIZABLE...".
>
> The patch for that was already posted, so I committed it.

Thank you for pointing me toward that thread.

While experimenting with COPY FREEZE, I noticed something: the simple case of

    BEGIN; CREATE TABLE ...; COPY ... WITH (FREEZE);

doesn't meet the pre-conditions. It only meets the conditions if preceded by a TRUNCATE, which all of the tests do. I looked into it, and I think the test:

    ... &&
    cstate->rel->rd_newRelfilenodeSubid == GetCurrentSubTransactionId()

should be:

    ... &&
    (cstate->rel->rd_newRelfilenodeSubid == GetCurrentSubTransactionId() ||
     cstate->rel->rd_createSubid == GetCurrentSubTransactionId())

(haven't tested). Another option to consider is that rd_newRelfilenodeSubid could be set on newly-created tables as well as newly-truncated tables.

Also, I recommend a hint along with the NOTICE when the pre-conditions aren't met to tell the user what they need to do. Perhaps something like:

    "Create or truncate the table in the same transaction as COPY FREEZE."

The documentation could expand on that, clarifying that a TRUNCATE in the same transaction (perhaps even ALTER?) allows a COPY FREEZE.

The code comment could be improved a little, while we're at it:

    "Note that the stronger test of exactly which subtransaction created
    it is crucial for correctness of this optimisation."

is a little vague; can you explain using the reasoning from the thread? I would say something like:

    "The transaction and subtransaction creating or truncating the table
    must match that of the COPY FREEZE. Otherwise, it could mix tuples
    from different transactions together, and it would be impossible to
    distinguish them for the purposes of atomicity."

> > I am a little confused about the case where setting HEAP_XMIN_COMMITTED
> > when loading the table in the same transaction is wrong. There was some
> > discussion about subtransactions, but those problems only seemed to
> > appear when the CREATE and the INSERT/COPY happened in different
> > subtransactions.
> >
> > So, I guess my question is, why the partial revert?
>
> Well, first, because I realised that it wasn't wanted. I was
> re-reading the threads trying to figure out a way to help the checksum
> patch further.
>
> Second, because I realised why it wasn't wanted. Setting
> HEAP_XMIN_COMMITTED causes MVCC violations within the transaction
> issuing the COPY.

After reading that thread, I still don't understand why it's unsafe to set HEAP_XMIN_COMMITTED in those conditions. Even if it is, I would think that a sufficiently narrow case -- such as CTAS outside of a transaction block -- would be safe, along with some slightly broader cases (like BEGIN; CREATE TABLE; INSERT/COPY). But perhaps that requires more discussion. I was just curious if there was a concrete problem that I was missing.

Regards,
Jeff Davis
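To make the two cases Jeff describes concrete, here is an untested sketch (table and file names are hypothetical):

    -- Does not qualify under the committed test, because CREATE alone
    -- does not set rd_newRelfilenodeSubid:
    BEGIN;
    CREATE TABLE t (a int);
    COPY t FROM '/path/to/data' WITH (FREEZE);  -- NOTICE: pre-conditions not met
    COMMIT;

    -- Qualifies, because TRUNCATE assigns a new relfilenode and records
    -- the current subtransaction in rd_newRelfilenodeSubid:
    BEGIN;
    TRUNCATE t;
    COPY t FROM '/path/to/data' WITH (FREEZE);  -- rows are written frozen
    COMMIT;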
On Tue, Dec 4, 2012 at 3:38 PM, Jeff Davis <pgsql@j-davis.com> wrote:
> After reading that thread, I still don't understand why it's unsafe to
> set HEAP_XMIN_COMMITTED in those conditions. Even if it is, I would
> think that a sufficiently narrow case -- such as CTAS outside of a
> transaction block -- would be safe, along with some slightly broader
> cases (like BEGIN; CREATE TABLE; INSERT/COPY).

I haven't looked at the committed patch - which seemed a bit precipitous to me given the stage the discussion was at - but I believe the general issue with HEAP_XMIN_COMMITTED is that there might be other snapshots in the same transaction, for example from open cursors.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
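For concreteness, one way a same-transaction snapshot can predate the load is an open cursor. A hypothetical, untested sketch (table and path invented) of the hazard Robert means:

    BEGIN;
    TRUNCATE t;                                -- meets the FREEZE pre-conditions
    DECLARE c CURSOR FOR SELECT * FROM t;      -- c's snapshot is taken here
    COPY t FROM '/path/to/data' WITH (FREEZE);
    FETCH ALL FROM c;  -- ordinary MVCC hides the new rows (their cmin is newer
                       -- than c's snapshot), but a pre-set committed/frozen
                       -- xmin defeats that test and the rows become visible
    COMMIT;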
Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Dec 4, 2012 at 3:38 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>> After reading that thread, I still don't understand why it's unsafe to
>> set HEAP_XMIN_COMMITTED in those conditions. Even if it is, I would
>> think that a sufficiently narrow case -- such as CTAS outside of a
>> transaction block -- would be safe, along with some slightly broader
>> cases (like BEGIN; CREATE TABLE; INSERT/COPY).

> I haven't looked at the committed patch - which seemed a bit
> precipitous to me given the stage the discussion was at - but I
> believe the general issue with HEAP_XMIN_COMMITTED is that there might
> be other snapshots in the same transaction, for example from open
> cursors.

From memory, the tqual.c code assumes that any tuple with XMIN_COMMITTED couldn't possibly be from its own transaction, and thus it doesn't make the tests that would be appropriate for a tuple that is from the current transaction. Maybe it's all right anyway (i.e. if we should always treat such a tuple as good), but I don't recall exactly what's tested in those paths.

regards, tom lane
On Wed, 2012-12-05 at 19:43 -0500, Tom Lane wrote:
> From memory, the tqual.c code assumes that any tuple with XMIN_COMMITTED
> couldn't possibly be from its own transaction, and thus it doesn't make
> the tests that would be appropriate for a tuple that is from the current
> transaction. Maybe it's all right anyway (i.e. if we should always treat
> such a tuple as good), but I don't recall exactly what's tested in those
> paths.

Oh, I see; it probably warrants some more discussion. It seems like we could solve these problems if we narrow the conditions enough.

Anyway, the partial revert looks good, and the commit message seems appropriate (essentially, the code got ahead of the discussion).

Regards,
Jeff Davis
On 6 December 2012 00:43, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Tue, Dec 4, 2012 at 3:38 PM, Jeff Davis <pgsql@j-davis.com> wrote:
>>> After reading that thread, I still don't understand why it's unsafe to
>>> set HEAP_XMIN_COMMITTED in those conditions. Even if it is, I would
>>> think that a sufficiently narrow case -- such as CTAS outside of a
>>> transaction block -- would be safe, along with some slightly broader
>>> cases (like BEGIN; CREATE TABLE; INSERT/COPY).
>
>> I haven't looked at the committed patch - which seemed a bit
>> precipitous to me given the stage the discussion was at - but I
>> believe the general issue with HEAP_XMIN_COMMITTED is that there might
>> be other snapshots in the same transaction, for example from open
>> cursors.
>
> From memory, the tqual.c code assumes that any tuple with XMIN_COMMITTED
> couldn't possibly be from its own transaction, and thus it doesn't make
> the tests that would be appropriate for a tuple that is from the current
> transaction. Maybe it's all right anyway (i.e. if we should always treat
> such a tuple as good), but I don't recall exactly what's tested in those
> paths.

Yes. We'd need to add in a call to TransactionIdIsCurrentTransactionId(xmin), which is basically just wasted path in 99% of use cases.

I've looked at optimising TransactionIdIsCurrentTransactionId(), with what appears to be some vague success. The attached patch gives a more consistent response.

The other thing to do is to have a backend-local flag that gets set when we use HEAP_XMIN_COMMITTED anywhere. When not set, we just skip past the TransactionIdIsCurrent test altogether.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Attachment
On 4 December 2012 20:38, Jeff Davis <pgsql@j-davis.com> wrote:
> The simple case of BEGIN; CREATE TABLE ...; COPY ... WITH (FREEZE);
> doesn't meet the pre-conditions. It only meets the conditions if
> preceded by a TRUNCATE, which all of the tests do. I looked into it, and
> I think the test:
>
>     ... &&
>     cstate->rel->rd_newRelfilenodeSubid == GetCurrentSubTransactionId()
>
> should be:
>
>     ... &&
>     (cstate->rel->rd_newRelfilenodeSubid == GetCurrentSubTransactionId() ||
>      cstate->rel->rd_createSubid == GetCurrentSubTransactionId())
>
> (haven't tested). Another option to consider is that
> rd_newRelfilenodeSubid could be set on newly-created tables as well as
> newly-truncated tables.

I'll expand the test above and add a new regression test. I focused on correctness ahead of a wide use case and missed this.

> Also, I recommend a hint along with the NOTICE when the pre-conditions
> aren't met to tell the user what they need to do. Perhaps something
> like:
>
>     "Create or truncate the table in the same transaction as COPY FREEZE."

OK, will add hint.

> The documentation could expand on that, clarifying that a TRUNCATE in
> the same transaction (perhaps even ALTER?) allows a COPY FREEZE.
>
> The code comment could be improved a little, while we're at it:
>
>     "Note that the stronger test of exactly which subtransaction created
>     it is crucial for correctness of this optimisation."
>
> is a little vague; can you explain using the reasoning from the thread?
> I would say something like:
>
>     "The transaction and subtransaction creating or truncating the table
>     must match that of the COPY FREEZE. Otherwise, it could mix tuples
>     from different transactions together, and it would be impossible to
>     distinguish them for the purposes of atomicity."

OK, will try.

> After reading that thread, I still don't understand why it's unsafe to
> set HEAP_XMIN_COMMITTED in those conditions. Even if it is, I would
> think that a sufficiently narrow case -- such as CTAS outside of a
> transaction block -- would be safe, along with some slightly broader
> cases (like BEGIN; CREATE TABLE; INSERT/COPY).

It *could* be safe, but it needs changes to the visibility rules, which, as explained, I had not been able to optimise sufficiently. The code is easy enough to add back in once it is sufficiently well optimised and we all agree.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Hi,

On 2012-12-03 17:34:01 -0800, Jeff Davis wrote:
> http://git.postgresql.org/gitweb/?p=postgresql.git;a=commitdiff;h=8de72b66a2edcf12c812de0a73bd50b6b7d81d62

On the subject of that patch: I am not a big fan of only emitting a NOTICE if FREEZE wasn't properly used:

+   if (cstate->freeze && (hi_options & HEAP_INSERT_FROZEN) == 0)
+       ereport(NOTICE,
+               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
+                errmsg("FREEZE option specified but pre-conditions not met")));

Imo it should fail. Imagine adding FREEZE to the loading part of your application and not noticing that the optimization broke because somebody else changed something in another part of the loading procedure.

Not sure whether that was discussed previously.

Greetings,

Andres Freund

--
Andres Freund
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 6 December 2012 13:12, Andres Freund <andres@2ndquadrant.com> wrote:
> On the subject of that patch: I am not a big fan of only emitting a
> NOTICE if FREEZE wasn't properly used:
>
> +   if (cstate->freeze && (hi_options & HEAP_INSERT_FROZEN) == 0)
> +       ereport(NOTICE,
> +               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                errmsg("FREEZE option specified but pre-conditions not met")));
>
> Imo it should fail. Imagine adding FREEZE to the loading part of your
> application and not noticing that the optimization broke because somebody
> else changed something in another part of the loading procedure.
>
> Not sure whether that was discussed previously.

It was. Only Robert and I spoke about that.

Also imagine having to analyze your code in detail to reevaluate the exact optimisation required each time. This way you get to add FREEZE and have it work always, with feedback if it's not optimized.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 2012-12-06 14:07:32 +0000, Simon Riggs wrote:
> On 6 December 2012 13:12, Andres Freund <andres@2ndquadrant.com> wrote:
> > Imo it should fail. Imagine adding FREEZE to the loading part of your
> > application and not noticing that the optimization broke because somebody
> > else changed something in another part of the loading procedure.
> >
> > Not sure whether that was discussed previously.
>
> It was. Only Robert and I spoke about that.
>
> Also imagine having to analyze your code in detail to reevaluate the
> exact optimisation required each time. This way you get to add FREEZE
> and have it work always, with feedback if it's not optimized.

I remain unconvinced by that argument, but if I am alone with this, ok. Could we at least make it a WARNING? Nobody ever reads NOTICEs because they contain so much noise. And this isn't noise. It's a bug on the client side.

Greetings,

Andres Freund

--
Andres Freund
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 6 December 2012 14:12, Andres Freund <andres@2ndquadrant.com> wrote:
> I remain unconvinced by that argument, but if I am alone with this, ok.
> Could we at least make it a WARNING? Nobody ever reads NOTICEs because
> they contain so much noise. And this isn't noise. It's a bug on the
> client side.

It's not a bug. Requesting a useful, but not critical, optimisation is just a hint. The preconditions are not easy to understand, so I see no reason to punish people that misunderstand, or to cause programs to fail in ways that need detailed understanding to make them work again.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
* Robert Haas (robertmhaas@gmail.com) wrote:
> I haven't looked at the committed patch - which seemed a bit

Disclaimer- neither have I, but..

When I last recall this discussion (likely in some bar in Europe), the problem was also that an independent session would be able to:

a) see that the table exists (due to SnapshotNow being used for lookups)
b) see rows in the table

Part 'a' is something which I think would be good to fix (but hard), and it sounds like this patch might make part 'b' happen, which I think makes the part 'a' problem worse. I'm guessing this isn't a problem for the TRUNCATE case because the second session would still be looking at the pre-TRUNCATE files on disk, right? Is that reference to exactly which files on disk to look at being used to address the CREATE problem too..?

That said, I love the idea of having a way to drop tuples in w/o having to go back and rewrite them three times..

Thanks,
Stephen
* Simon Riggs (simon@2ndQuadrant.com) wrote:
> It's not a bug. Requesting a useful, but not critical, optimisation is
> just a hint. The preconditions are not easy to understand, so I see no
> reason to punish people that misunderstand, or to cause programs to fail
> in ways that need detailed understanding to make them work again.

I tend to agree with Andres on this one. This feels a bit like accepting a command but then not actually following through on it if it turns out we can't actually do it. If it's truly an optimization (and I suspect my other email/question might provide insight into that), then it should be something we can 'just do' without needing to be asked to do it, along the same lines of not WAL'ing when the appropriate conditions are met (table created in this transaction, etc, etc).

Thanks,
Stephen
On Thu, 2012-12-06 at 11:55 -0500, Stephen Frost wrote:
> * Robert Haas (robertmhaas@gmail.com) wrote:
> When I last recall this discussion (likely in some bar in Europe), the
> problem was also that an independent session would be able to:
>
> a) see that the table exists (due to SnapshotNow being used for lookups)
> b) see rows in the table
>
> Part 'a' is something which I think would be good to fix (but hard),
> and it sounds like this patch might make part 'b' happen, which I think
> makes the part 'a' problem worse. I'm guessing this isn't a problem for
> the TRUNCATE case because the second session would still be looking at
> the pre-TRUNCATE files on disk, right? Is that reference to exactly
> which files on disk to look at being used to address the CREATE problem
> too..?

That isn't a problem, because the other session won't see the tuple in pg_class until the creating transaction commits, at which point the rows have committed, too (because this would only kick in when the rows are loaded in the same transaction as the CREATE). So, yes, it's like TRUNCATE in that regard.

Regards,
Jeff Davis
On 6 December 2012 17:02, Stephen Frost <sfrost@snowman.net> wrote:
> * Simon Riggs (simon@2ndQuadrant.com) wrote:
>> It's not a bug. Requesting a useful, but not critical, optimisation is
>> just a hint. The preconditions are not easy to understand, so I see no
>> reason to punish people that misunderstand, or to cause programs to fail
>> in ways that need detailed understanding to make them work again.
>
> I tend to agree with Andres on this one. This feels a bit like
> accepting a command but then not actually following through on it
> if it turns out we can't actually do it. If it's truly an optimization
> (and I suspect my other email/question might provide insight into that),
> then it should be something we can 'just do' without needing to be asked
> to do it, along the same lines of not WAL'ing when the appropriate
> conditions are met (table created in this transaction, etc, etc).

That depends on whether it's a command or a do-if-possible hint. It's documented as the latter.

Similar to the way VACUUM tries to truncate a relation, but gives up if it can't. And an even closer example is VACUUM FREEZE itself, which doesn't guarantee that all rows are frozen at the end of it...

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Thu, 2012-12-06 at 18:16 +0000, Simon Riggs wrote:
> > I tend to agree with Andres on this one. This feels a bit like
> > accepting a command but then not actually following through on it
> > if it turns out we can't actually do it. If it's truly an optimization
> > (and I suspect my other email/question might provide insight into that),
> > then it should be something we can 'just do' without needing to be asked
> > to do it, along the same lines of not WAL'ing when the appropriate
> > conditions are met (table created in this transaction, etc, etc).
>
> That depends on whether it's a command or a do-if-possible hint. It's
> documented as the latter.
>
> Similar to the way VACUUM tries to truncate a relation, but gives up
> if it can't. And an even closer example is VACUUM FREEZE itself, which
> doesn't guarantee that all rows are frozen at the end of it...

Also, if the set of conditions changes in the future, we would have a problem if that caused new errors to appear.

I think a WARNING might make more sense than a NOTICE, but I don't have a strong opinion about that.

Regards,
Jeff Davis
Jeff,

* Jeff Davis (pgsql@j-davis.com) wrote:
> That isn't a problem, because the other session won't see the tuple in
> pg_class until the creating transaction commits, at which point the rows
> have committed, too (because this would only kick in when the rows are
> loaded in the same transaction as the CREATE).

See, that's what I originally thought, but was corrected on it at one point (by Tom, iirc). I just did a simple test against 9.2.1 using serializable mode. In this mode, if you do something like this:

session a
---------
begin;

session b
---------
begin;
create table q (a integer);
insert into q values (1);
commit;

select * from q;

You'll get an empty table. That's not great, but it's life: once something is in pg_class, all sessions can see it because the table lookups are done using SnapshotNow and aren't truly transactional, but at least you can't see any rows in the table, because the individual rows are marked with the transaction ID which created them and we can't see them in our transaction that started before the table was created.

It sounds like, with this patch/change, this behavior would change. Now, I'm not necessarily against this, but it's clearly something different than what we had before and might be an issue for some. If, in the general case, we're actually 'ok' with this, I'd ask why we don't simply do it by default..? If we're *not* 'ok' with it, perhaps we shouldn't be doing it at all...

If we fix the bigger issue (which, as I understand it from various discussions with Tom and Robert, is very difficult), then this whole problem goes away entirely. I always figured/expected that to be how we'd fix this, not just punt on the issue and tell the user "sure, you can do this, hope you know what you're doing...".

Thanks,
Stephen
On Thu, 2012-12-06 at 14:18 -0500, Stephen Frost wrote:
> begin;

You need to do a SELECT here to actually get a snapshot.

> session b
> ---------
> begin;
> create table q (a integer);
> insert into q values (1);
> commit;
>
> select * from q;
>
> You'll get an empty table. That's not great, but it's life: once
> something is in pg_class, all sessions can see it because the table
> lookups are done using SnapshotNow and aren't truly transactional, but
> at least you can't see any rows in the table, because the individual rows
> are marked with the transaction ID which created them and we can't see
> them in our transaction that started before the table was created.
>
> It sounds like, with this patch/change, this behavior would change.

No, it would not change. Session A would see that the table exists and see that the rows' inserting transaction (in Session B) committed. That is correct because the inserting transaction *did* commit, and it's the same as we have now. However, the rows will *not* be visible, because the serializable snapshot doesn't contain the inserting transaction.

Think about the current behavior: right after the commit, another select could come along and set all those hint bits anyway. Even if the hint bits aren't set, it will do a CLOG lookup and find that the transaction committed. The change being proposed is just to set those hint bits preemptively, because the fate of the INSERT is identical to the fate of the CREATE (they are in the same transaction). There will be no observable problem outside of that CREATE+INSERT transaction. The only catch is what to do about visibility of the tuples when still inside the transaction (which is not a problem for transactions doing a simple load).

The interesting thing about HEAP_XMIN_COMMITTED is that it can be set preemptively if we know that the transaction will actually commit (aside from the visibility issues within the transaction). Even if the transaction doesn't commit, it would still be possible to clean out the dead tuples with a VACUUM, because no information has really been lost in the process. So there may yet be some kind of safe protocol to set these even during a load into an existing table...

Regards,
Jeff Davis
* Jeff Davis (pgsql@j-davis.com) wrote:
> However, the rows will *not* be visible, because the serializable
> snapshot doesn't contain the inserting transaction.

That's what we've got now and what would be expected, however...

> Think about the current behavior: right after the commit, another select
> could come along and set all those hint bits anyway. Even if the hint
> bits aren't set, it will do a CLOG lookup and find that the transaction
> committed.

The command is 'FREEZE', which sounded to me like the transaction ID would be set to FrozenXID, meaning that we wouldn't be able to tell if the inserting transaction was before or after ours...

Your analysis of the hint bits themselves sounds reasonable, but it seems independent of the issue regarding setting the actual transaction ID.

Thanks,
Stephen
On Thu, 2012-12-06 at 20:12 -0500, Stephen Frost wrote:
> The command is 'FREEZE', which sounded to me like the transaction ID
> would be set to FrozenXID, meaning that we wouldn't be able to tell if
> the inserting transaction was before or after ours...

Freezing does lose information, but I thought that this sub-thread was about the HEAP_XMIN_COMMITTED optimization that was in the first version of the commit but removed in a later commit. Setting HEAP_XMIN_COMMITTED does not lose information.

> Your analysis of the hint bits themselves sounds reasonable, but it seems
> independent of the issue regarding setting the actual transaction ID.

Upon re-reading, my last paragraph was worded a little too loosely:

    "The interesting thing about HEAP_XMIN_COMMITTED is that it can be set
    preemptively if we know that the transaction will actually commit
    (aside from the visibility issues within the transaction)."

That should really be: "aside from the visibility issues before it does commit".

Anyway, the HEAP_XMIN_COMMITTED loading optimizations require more discussion, but I think they are worth pursuing. The simpler form is when the table is created and loaded in the same transaction, but there may be some more sophisticated approaches, as well.

Regards,
Jeff Davis
Jeff,

* Jeff Davis (pgsql@j-davis.com) wrote:
> On Thu, 2012-12-06 at 20:12 -0500, Stephen Frost wrote:
> > The command is 'FREEZE', which sounded to me like the transaction ID
> > would be set to FrozenXID, meaning that we wouldn't be able to tell if
> > the inserting transaction was before or after ours...
>
> Freezing does lose information, but I thought that this sub-thread was
> about the HEAP_XMIN_COMMITTED optimization that was in the first version
> of the commit but removed in a later commit. Setting HEAP_XMIN_COMMITTED
> does not lose information.

I'm less concerned about the hint bits and more concerned about the implications of the FrozenXID being used, which would make the rows visible to other transactions even if they began before the rows were loaded.

> That should really be: "aside from the visibility issues before it does
> commit".

My concern is more about what happens to transactions which are opened before this transaction commits and whether they might be able to see data in the table.

As I mentioned before, I'm not *convinced* that this is really a big deal, or even a problem for this patch, but it's something to *consider* and think about because, as much as we like to say that DDL is transaction-safe, it's *not* completely MVCC, and things like creating tables and committing them will show up as visible even in transactions that shouldn't be able to see them. Due to that, we need to think about what happens with data in those tables, etc.

> Anyway, the HEAP_XMIN_COMMITTED loading optimizations require more
> discussion, but I think they are worth pursuing. The simpler form is
> when the table is created and loaded in the same transaction, but there
> may be some more sophisticated approaches, as well.

Sure.

Thanks,
Stephen
On Thu, 2012-12-06 at 20:49 -0500, Stephen Frost wrote:
> I'm less concerned about the hint bits and more concerned about the
> implications of the FrozenXID being used, which would make the rows
> visible to other transactions even if they began before the rows were
> loaded.

That is documented in the committed patch -- it's a trade, basically saying that you lose isolation but avoid extra writes. It seems reasonable that the user gets this behavior if specifically requested.

> My concern is more about what happens to transactions which are opened
> before this transaction commits and whether they might be able to see
> data in the table.

In the simple approach that only affects loads in the same transaction as the create, this is not an issue. The only issue there is for other reads in the same transaction. I think you already know that, but I am repeating for clarity.

> As I mentioned before, I'm not *convinced* that this is really a big
> deal, or even a problem for this patch, but it's something to *consider*
> and think about because, as much as we like to say that DDL is
> transaction-safe, it's *not* completely MVCC, and things like creating
> tables and committing them will show up as visible even in transactions
> that shouldn't be able to see them. Due to that, we need to think about
> what happens with data in those tables, etc.

Agreed. I don't think anyone is suggesting that we change the semantics of existing commands, unless it's to improve the serializability of DDL.

Regards,
Jeff Davis
* Jeff Davis (pgsql@j-davis.com) wrote:
> That is documented in the committed patch -- it's a trade, basically
> saying that you lose isolation but avoid extra writes. It seems
> reasonable that the user gets this behavior if specifically requested.

Strictly speaking, it could actually be two different users..

> In the simple approach that only affects loads in the same transaction
> as the create, this is not an issue. The only issue there is for other
> reads in the same transaction. I think you already know that, but I am
> repeating for clarity.

Actually, no, I'm not convinced of that. If a separate transaction starts before the create/insert, and is still open when the create/insert transaction commits, wouldn't that separate transaction see rows in the newly created table?

That's more-or-less the specific issue I'm bringing up as a potential concern. If it's just a FrozenXID stored in the heap and there isn't anything telling the 2nd transaction to not consider this table that exists through SnapshotNow, how are we going to know to not look at the tuples? Or do we accept that it's "ok" for those to be visible?

How I wish we could fix the table visibility and not have to worry about this issue at all, which would also remove the need for some special keyword to be used to tell us it's 'ok' to use this optimization...

Thanks,
Stephen
On 4 December 2012 20:38, Jeff Davis <pgsql@j-davis.com> wrote:
> Even if it is, I would
> think that a sufficiently narrow case -- such as CTAS outside of a
> transaction block -- would be safe

CTAS would be safe to do that with. CLUSTER and VACUUM FULL are already done. Outside of a transaction block, it would be automatic. Inside of a transaction block, we could use a parameter called initially_frozen, which would act the same as FREEZE on COPY. The parameter would not be recorded in the reloptions field on pg_class, since it is a one-time option and has no meaning after end of xact.

> along with some slightly broader
> cases (like BEGIN; CREATE TABLE; INSERT/COPY).

Historically we've chosen not to allow bulk-insert options on INSERT, so that is a separate discussion.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
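Sketched as SQL, the idea might look like the following (initially_frozen is hypothetical syntax being floated here, not an implemented option, and src is an invented source table):

    -- Outside a transaction block, CTAS could freeze automatically:
    CREATE TABLE t AS SELECT * FROM src;

    -- Inside a transaction block, an explicit one-time option would be
    -- needed, acting the same as FREEZE on COPY:
    BEGIN;
    CREATE TABLE t WITH (initially_frozen = true) AS SELECT * FROM src;
    COMMIT;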
On Thu, 2012-12-06 at 22:34 -0500, Stephen Frost wrote:
> * Jeff Davis (pgsql@j-davis.com) wrote:
> > That is documented in the committed patch -- it's a trade, basically
> > saying that you lose isolation but avoid extra writes. It seems
> > reasonable that the user gets this behavior if specifically requested.
>
> Strictly speaking, it could actually be two different users..

That's a good point. Somewhat like SERIALIZABLE, unless everyone is on board, it doesn't really work.

> > In the simple approach that only affects loads in the same transaction
> > as the create, this is not an issue. The only issue there is for other
> > reads in the same transaction. I think you already know that, but I am
> > repeating for clarity.
>
> Actually, no, I'm not convinced of that. If a separate transaction
> starts before the create/insert, and is still open when the create/insert
> transaction commits, wouldn't that separate transaction see rows in the
> newly created table?

Not if all you do is set HEAP_XMIN_COMMITTED. Committed <> visible.

> That's more-or-less the specific issue I'm bringing up as a potential
> concern. If it's just a FrozenXID stored in the heap and there isn't
> anything telling the 2nd transaction to not consider this table that
> exists through SnapshotNow, how are we going to know to not look at the
> tuples? Or do we accept that it's "ok" for those to be visible?

I am talking about HEAP_XMIN_COMMITTED. I think it's understood that freezing has much stricter requirements for safety in various situations, because it loses information.

> How I wish we could fix the table visibility and not have to worry about
> this issue at all, which would also remove the need for some special
> keyword to be used to tell us it's 'ok' to use this optimization...

I think we need to be clear about which one we're discussing: HEAP_XMIN_COMMITTED or FrozenXID. I believe discussing them together has caused a significant amount of confusion in this thread. Most of your concerns seem to be related to freezing, and I'm mostly interested in HEAP_XMIN_COMMITTED optimizations. So I think we're talking past each other.

Regards,
Jeff Davis
Jeff,

* Jeff Davis (pgsql@j-davis.com) wrote:
> Most of your concerns seem to be related to freezing, and I'm mostly
> interested in HEAP_XMIN_COMMITTED optimizations. So I think we're
> talking past each other.

My concern is about this patch/capability as a whole. I agree that the hint bit setting may be fine by itself; I'm not terribly concerned with that. Now, if you (and others) aren't concerned about the overall behavior that is being introduced here, that's fine, but it was my understanding from previous discussions that making tuples visible to all transactions, even those that started before the committing transaction and are running at an isolation level stricter than 'read committed', wasn't 'ok'.

Now, what I've honestly been hoping for on this thread was for someone to speak up and point out why I'm wrong about this concern and explain how this patch addresses that issue. If that's been done, I've missed it..

Thanks,
Stephen
On 7 December 2012 23:51, Stephen Frost <sfrost@snowman.net> wrote:
> Jeff,
>
> * Jeff Davis (pgsql@j-davis.com) wrote:
>> Most of your concerns seem to be related to freezing, and I'm mostly
>> interested in HEAP_XMIN_COMMITTED optimizations. So I think we're
>> talking past each other.
>
> My concern is about this patch/capability as a whole. I agree that the
> hint bit setting may be fine by itself; I'm not terribly concerned with
> that. Now, if you (and others) aren't concerned about the overall
> behavior that is being introduced here, that's fine, but it was my
> understanding from previous discussions that making tuples visible to
> all transactions, even those that started before the committing
> transaction and are running at an isolation level stricter than
> 'read committed', wasn't 'ok'.
>
> Now, what I've honestly been hoping for on this thread was for someone
> to speak up and point out why I'm wrong about this concern and explain
> how this patch addresses that issue. If that's been done, I've missed
> it..

Visibility of pre-hinted data is an issue, and because of that we previously agreed that it would be allowed only when explicitly requested by the user, which has been implemented as the FREEZE option on COPY. This then makes it identical to TRUNCATE, where a separate command differentiates MVCC-compliant row removal (DELETE) from non-MVCC row removal (TRUNCATE).

So the committed feature does address the visibility issue.

Jeff has been speaking about an extension to that behaviour, which would be to allow HEAP_XMIN_COMMITTED to be set in some cases also. The committed feature specifically does not do that.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Simon,

* Simon Riggs (simon@2ndQuadrant.com) wrote:
> Visibility of pre-hinted data is an issue, and because of that we
> previously agreed that it would be allowed only when explicitly
> requested by the user, which has been implemented as the FREEZE option
> on COPY. This then makes it identical to TRUNCATE, where a separate
> command differentiates MVCC-compliant row removal (DELETE) from
> non-MVCC row removal (TRUNCATE).

To be honest, I really don't buy off on this. Yes, TRUNCATE (a top-level, individually permissioned command) can violate MVCC and make things which might have been visible to a given transaction no longer visible to that transaction. I find that very different from making rows which should *not* be visible suddenly appear.

> So the committed feature does address the visibility issue.

Not hardly. It lets a user completely violate the basic rules of the overall database. The *correct* solution to this problem is to actually *fix* it, by setting it up such that tables created after a particular transaction starts aren't visible, and similar. Not by punting to the user with very little control over it (or so I'm guessing).

Will we accept a patch to do an INSERT or UPDATE FREEZE...? I realize this might be a bit of a stretch, but why not allow February 31st to be accepted as a date also, should the user request it? This is quite a slippery slope, in my view.

> Jeff has
> been speaking about an extension to that behaviour, which would be to
> allow HEAP_XMIN_COMMITTED to be set in some cases also. The committed
> feature specifically does not do that.

It's not clear to me why you *wouldn't* just go ahead and set everything you can at the same time...? It hardly seems like it could be worse than what is apparently going to happen with this command...

Thanks,
Stephen
On 8 December 2012 22:18, Stephen Frost <sfrost@snowman.net> wrote:
>> So the committed feature does address the visibility issue.
>
> Not hardly. It lets a user completely violate the basic rules of the
> overall database. The *correct* solution to this problem is to actually
> *fix* it, by setting it up such that tables created after a particular
> transaction starts aren't visible, and similar.

Agreed, but that would also be a silent change of current behaviour. And the above would only work for CREATE TABLE, not for TRUNCATE.

I've invested a lot of time in trying to improve the situation and investigated many routes forwards, none of them very satisfying. Until someone comes up with a better plan, FREEZE is a pragmatic way forwards that improves things now rather than waiting for the perfect solution. And if we want checksums anytime soon, we need ways to ameliorate the effect of hints on checksums, which this does, somewhat.

Better plans, with code, welcome.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Wed, Dec 05, 2012 at 07:43:08PM -0500, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
> > I haven't looked at the committed patch - which seemed a bit
> > precipitous to me given the stage the discussion was at - but I
> > believe the general issue with HEAP_XMIN_COMMITTED is that there might
> > be other snapshots in the same transaction, for example from open
> > cursors.
>
> From memory, the tqual.c code assumes that any tuple with XMIN_COMMITTED
> couldn't possibly be from its own transaction, and thus it doesn't make
> the tests that would be appropriate for a tuple that is from the current
> transaction. Maybe it's all right anyway (i.e. if we should always treat
> such a tuple as good), but I don't recall exactly what's tested in those
> paths.

I don't see semantics preservable by freezing, yet omitting HEAP_XMIN_COMMITTED. The "HeapTupleHeaderGetCmin(tuple) >= snapshot->curcid" test is the one at risk. HeapTupleSatisfiesMVCC() does skip that test for HEAP_XMIN_COMMITTED tuples, but seeing xmin == FrozenTransactionId hampers it all the more.

What if one of the preconditions for the optimization were the equivalent of CheckTableNotInUse()? I cannot immediately think of an older-cmin-scan source not caught thereby. Unmodified HeapTupleSatisfiesMVCC() will then suffice. Happily, it's not a restriction users will regularly encounter.

Thanks,
nm
On Fri, Dec 07, 2012 at 06:51:18PM -0500, Stephen Frost wrote:
> * Jeff Davis (pgsql@j-davis.com) wrote:
> > Most of your concerns seem to be related to freezing, and I'm mostly
> > interested in HEAP_XMIN_COMMITTED optimizations. So I think we're
> > talking past each other.
>
> My concern is about this patch/capability as a whole. I agree that the
> hint bit setting may be fine by itself; I'm not terribly concerned with
> that. Now, if you (and others) aren't concerned about the overall
> behavior that is being introduced here, that's fine, but it was my
> understanding from previous discussions that making tuples visible to
> all transactions, even those that started before the committing
> transaction and are running at an isolation level stricter than
> 'read committed', wasn't 'ok'.
>
> Now, what I've honestly been hoping for on this thread was for someone
> to speak up and point out why I'm wrong about this concern and explain
> how this patch addresses that issue. If that's been done, I've missed
> it..

I favor[1] unconditionally letting older snapshots see the new rows after the CREATE+COPY transaction commits. To recap, making affected scans see an empty table is as wrong as making them see those rows. Robert also listed[2] that as a credible option, and I don't recall anyone opining against it in previous discussions. I did perceive an undercurrent preference, all other things being equal, for an optimization free from semantic side-effects. I shared that preference, but investigations showed that we must compromise something.

Though I like the idea of making new relations appear nonexistent to older snapshots, let's keep that as an independent patch. Otherwise, the chance of having none of the above in PostgreSQL 9.3 escalates significantly.

Thanks,
nm

[1] http://archives.postgresql.org/message-id/20120302205834.GC29160@tornado.leadboat.com
[2] http://archives.postgresql.org/message-id/CA+TgmoYh_KXErp9eOejMV6EJJaczeZZcSj3kRtq=yg1SjiMidg@mail.gmail.com
Simon,

* Simon Riggs (simon@2ndQuadrant.com) wrote:
> Agreed, but that would also be a silent change of current behaviour.

Sure, proper MVCC for catalog entries would be a change, but I think it's one which would actually reduce the surprises for our users. Today we tell people we have transactional DDL, and that's somewhat true, but it's not MVCC-safe, and that can lead to surprises.

> And the above would only work for CREATE TABLE, not for TRUNCATE.

I'm trying to figure out why there are all the constraints around this command to begin with. If we're going to support this, why do we require the user to create or truncate the table in the same transaction? Clearly that's going to reduce the usefulness of this feature, and it's not clear to me why that constraint is needed, code-wise.

Also, what about adding FREEZE options to INSERT and maybe even UPDATE? Surely it would reduce the overhead associated with those commands as well.

> I've invested a lot of time in trying to improve the situation and
> investigated many routes forwards, none of them very satisfying. Until
> someone comes up with a better plan, FREEZE is a pragmatic way
> forwards that improves things now rather than waiting for the perfect
> solution.

I agree that the perfect can sometimes be the enemy of the good, but I still feel that this is quite a slippery slope that's going to end up getting ourselves into trouble- regardless of how much we document it or set up constraints around it.

> And if we want checksums anytime soon, we need ways to
> ameliorate the effect of hints on checksums, which this does,
> somewhat.

I don't buy into this argument in the least.

> Better plans, with code, welcome.

While I appreciate the mentality that those-who-code-win, I don't believe that it can be used in an argument based on *correctness*.

Thanks,
Stephen
* Noah Misch (noah@leadboat.com) wrote:
> On Fri, Dec 07, 2012 at 06:51:18PM -0500, Stephen Frost wrote:
> > Now, what I've honestly been hoping for on this thread was for someone
> > to speak up and point out why I'm wrong about this concern and explain
> > how this patch addresses that issue. If that's been done, I've missed
> > it..
>
> [...]

So, apparently I'm not wrong about my concern, but no one seems to share my feelings on this change.

I continue to hold that this could end up being a slippery slope for us to go down wrt 'correctness' vs. 'do whatever the user wants'. If we keep this to only COPY and where the table has to be truncated/created in the same transaction (which requires the user to have sufficient privileges to do non-MVCC-safe things on the table already), perhaps it's alright. It'll definitely reduce the interest in finding a real solution though, which is unfortunate.

Thanks,
Stephen
On Mon, Dec 10, 2012 at 08:32:53AM -0500, Stephen Frost wrote:
> So, apparently I'm not wrong about my concern, but no one seems to share
> my feelings on this change.
>
> I continue to hold that this could end up being a slippery slope for us
> to go down wrt 'correctness' vs. 'do whatever the user wants'.

I agree we should be reticent to compromise correctness for convenience. Compromising mere bug-compatibility, trading one incorrect behavior for another incorrect behavior, is not as bad. Furthermore, today's behavior in question is not something I can see applications deliberately and successfully relying upon.

> If we
> keep this to only COPY and where the table has to be truncated/created
> in the same transaction (which requires the user to have sufficient
> privileges to do non-MVCC-safe things on the table already), perhaps
> it's alright.

Extending it to cases not involving a just-created or just-truncated table really would compromise correctness; errors could leave the table in an otherwise-impossible state. Let's indeed not go there. I see no particular need to restrict this to COPY; that's just the most important case by far.

As a side note, the calculus for whether to extend the optimization to INSERT and UPDATE differs from that for WAL avoidance. WAL avoidance can be a substantial loss when the total change is small. For pre-hinting/freezing, the worst case is having needlessly checked a few local variables to rule out the optimization.

> It'll definitely reduce the interest in finding a real
> solution though, which is unfortunate.

That effect seems likely, but I do not find it unfortunate. The change variant I have advocated does not stand in contrast to some "real solution" to PostgreSQL's treatment of readers of tables created or truncated by a transaction not in the reader's snapshot. The two topics interact at arm's length. Bundling them into one patch, artificially making them stand or fall as a pair, is not a win for PostgreSQL.

That does raise another disadvantage of making the change syntax-controlled: if we someday implement the other improvement, COPY FREEZE will have minimal reason not to be the default. FREEZE then becomes a relic noise word.

Thanks,
nm
On Sun, Dec 9, 2012 at 3:06 PM, Noah Misch <noah@leadboat.com> wrote:
> I favor[1] unconditionally letting older snapshots see the new rows after the
> CREATE+COPY transaction commits. To recap, making affected scans see an empty
> table is as wrong as making them see those rows. Robert also listed[2] that
> as a credible option, and I don't recall anyone opining against it in previous
> discussions. I did perceive an undercurrent preference, all other things
> being equal, for an optimization free from semantic side-effects. I shared
> that preference, but investigations showed that we must compromise something.

You know, I hadn't been taking that option terribly seriously, but maybe we ought to reconsider it. It would certainly be simpler, and as you point out, it's not really any worse from an MVCC point of view than anything else we do. Moreover, it would make this available to clients like pg_dump without further hackery.

I think the current behavior, where we treat FREEZE as a hint, is just awful. Regardless of whether the behavior is automatic or manually requested, the idea that you might get the optimization or not depending on the timing of relcache flushes seems very much undesirable. I mean, if the optimization is actually important for performance, then you want to get it when you ask for it. If it isn't, then why bother having it at all? Let's say that COPY FREEZE normally doubles performance on a data load that therefore takes 8 hours - somebody who suddenly loses that benefit because of a relcache flush that they can't prevent or control, and ends up with a 16-hour data load, is going to pop a gasket.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Noah,

* Noah Misch (noah@leadboat.com) wrote:
> I agree we should be reticent to compromise correctness for convenience.
> Compromising mere bug-compatibility, trading one incorrect behavior for
> another incorrect behavior, is not as bad. Furthermore, today's behavior
> in question is not something I can see applications deliberately and
> successfully relying upon.

I actually don't agree with the notion that one bad bug should allow us to introduce additional such bugs. I agree that it's unlikely that applications are depending on today's behavior of TRUNCATE making concurrent transactions see an empty table, but it does *not* follow that applications *won't* start depending on this new behavior of COPY FREEZE.

> Extending it to cases not involving a just-created or just-truncated table
> really would compromise correctness; errors could leave the table in an
> otherwise-impossible state. Let's indeed not go there.

Even if we could fix that, I'd be against allowing it arbitrarily for any regular user INSERT or UPDATE; I'm still not particularly happy with this approach for COPY.

> That effect seems likely, but I do not find it unfortunate. The change
> variant I have advocated does not stand in contrast to some "real solution"
> to PostgreSQL's treatment of readers of tables created or truncated by a
> transaction not in the reader's snapshot. The two topics interact at arm's
> length. Bundling them into one patch, artificially making them stand or
> fall as a pair, is not a win for PostgreSQL.

Having proper MVCC support for DDL *would* be a win for PostgreSQL, and this *does* reduce the chances of that ever happening.

> That does raise another disadvantage of making the change syntax-controlled:
> if we someday implement the other improvement, COPY FREEZE will have minimal
> reason not to be the default. FREEZE then becomes a relic noise word.

Indeed, that's certainly unfortunate as well. Really, though, it just goes to show how much of a hack this is rather than a real solution.

Thanks,
Stephen
* Robert Haas (robertmhaas@gmail.com) wrote:
> You know, I hadn't been taking that option terribly seriously, but
> maybe we ought to reconsider it. It would certainly be simpler, and
> as you point out, it's not really any worse from an MVCC point of view
> than anything else we do. Moreover, it would make this available to
> clients like pg_dump without further hackery.

I really don't agree with this notion that the behavior of TRUNCATE, a top-level, separately permissioned command, makes it OK to introduce other busted behavior in existing commands.

> I think the current behavior, where we treat FREEZE as a hint, is just
> awful.

I agree that it's pretty grotty, but I had assumed it was at least deterministic, a la TRUNCATE/COPY and WAL... If it isn't, then this certainly gets really ugly really quickly. I don't think that means we should go ahead and try to always optimize it though- even when it isn't explicit, there will be an expectation that it's going to work when all the 'right' conditions are met. I know that's certainly how I feel about TRUNCATE/COPY and WAL'ing.

Thanks,
Stephen
On Mon, Dec 10, 2012 at 08:04:55PM -0500, Robert Haas wrote:
> I think the current behavior, where we treat FREEZE as a hint, is just
> awful. Regardless of whether the behavior is automatic or manually
> requested, the idea that you might get the optimization or not
> depending on the timing of relcache flushes seems very much
> undesirable. I mean, if the optimization is actually important for
> performance, then you want to get it when you ask for it. If it
> isn't, then why bother having it at all? Let's say that COPY FREEZE
> normally doubles performance on a data load that therefore takes 8
> hours - somebody who suddenly loses that benefit because of a relcache
> flush that they can't prevent or control, and ends up with a 16-hour
> data load, is going to pop a gasket.

Until these threads, I did not know that a relcache invalidation could trip up the WAL avoidance optimization, and now this. I poked at the relevant relcache.c code, and it already takes pains to preserve the needed facts. The header comment of RelationCacheInvalidate() indicates that entries bearing an rd_newRelfilenodeSubid can safely survive the invalidation, but the code does not implement that. I think the comment is right, and this is just an oversight in the code going back to its beginning (fba8113c).

I doubt the comment at the declaration of rd_createSubid in rel.h, though I can't presently say what the correct comment to replace it would be. CLUSTER does preserve the old value, at least for the main table relation. CLUSTER probably should *set* rd_newRelfilenodeSubid.

Thanks,
nm
On Mon, Dec 10, 2012 at 08:54:04PM -0500, Stephen Frost wrote:
> I agree that it's unlikely that
> applications are depending on today's behavior of TRUNCATE making
> concurrent transactions see an empty table, but it does *not* follow
> that applications *won't* start depending on this new behavior of COPY
> FREEZE.

That is a good point. I'm not sure whether that should worry us enough to implement an error in that scenario before letting COPY write frozen tuples, but it's the strongest argument I've seen for imposing that prerequisite.

> Even if we could fix that, I'd be against allowing it arbitrarily for
> any regular user INSERT or UPDATE; I'm still not particularly happy with
> this approach for COPY.

Sure; COPY is the 99% important case here.
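To make the hazard concrete, a sketch of the cross-session visibility that an application could come to rely on. Frozen tuples look committed to every snapshot, so the old snapshot sees them anyway; the table name and file path are illustrative:

-- Session 1:
BEGIN ISOLATION LEVEL REPEATABLE READ;
SELECT 1;                                   -- first query fixes this transaction's snapshot
-- Session 2, meanwhile:
BEGIN;
TRUNCATE t;
COPY t FROM '/tmp/data.csv' WITH (FREEZE);
COMMIT;
-- Session 1 again, same snapshot:
SELECT count(*) FROM t;                     -- sees the frozen rows even though its snapshot
                                            -- predates them: the MVCC violation at issue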
On Mon, Dec 10, 2012 at 7:02 PM, Stephen Frost <sfrost@snowman.net> wrote:
>
> I continue to hold that this could end up being a slippery slope for us
> to go down wrt 'correctness' vs. 'do whatever the user wants'. If we
> keep this to only COPY and where the table has to be truncated/created
> in the same transaction (which requires the user to have sufficient
> privileges to do non-MVCC-safe things on the table already), perhaps
> it's alright. It'll definitely reduce the interest in finding a real
> solution though, which is unfortunate.
>

I wonder if something more complete can be done by forcing COPY FREEZE, or whatever we call it, to take an exclusive lock on the table and run the load as an append-only operation. By taking a strong lock, we will block out any concurrent reads and writes on the table. If an error occurs while loading the data, the table will be truncated back to the previously recorded size. We may need some additional bookkeeping and WAL logging to handle crash recovery.

To solve the visibility issue for old snapshots that should not see the new data, we can store some additional visibility information in pg_class itself. For example, we can store the size of the table before the COPY FREEZE started and the XID of the COPY FREEZE operation. An old snapshot that cannot see the XID cannot see the tuples inserted in the new blocks either. Once the COPY FREEZE finishes and the lock on the relation is released, new transactions can start writing to the table and write past the old size of the table. But that should be OK. If an old snapshot can't see the tuples inserted by COPY FREEZE, AFAIK it can't see any of those other tuples as well.

I'm sure there will still be challenges with this approach. But I wonder if we can guarantee correctness by proper use of synchronization and still avoid multiple writes for most common data loading scenarios.

Thanks, Pavan

--
Pavan Deolasee
http://www.linkedin.com/in/pavandeolasee
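A hypothetical sketch of what Pavan's scheme would amount to at the SQL level. None of the proposed bookkeeping exists; the comments merely restate the proposal, and the table name and file path are illustrative:

BEGIN;
LOCK TABLE t IN ACCESS EXCLUSIVE MODE;       -- block all concurrent readers and writers
-- (proposal: record t's current size and this XID in pg_class here)
COPY t FROM '/tmp/data.csv' WITH (FREEZE);   -- append-only, into blocks past the old size
COMMIT;
-- (proposal: on error, truncate t back to the recorded pre-COPY size)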
On Mon, 2012-12-10 at 08:16 -0500, Stephen Frost wrote:
> I'm trying to figure out why there are all the constraints around this
> command to begin with. If we're going to support this, why do we
> require the user to create or truncate the table in the same
> transaction? Clearly that's going to reduce the usefulness of this
> feature and it's not clear to me why that constraint is needed,
> code-wise.

There is a very specific reason: if you insert frozen tuples into a table that already has tuples in it, then you no longer have just an isolation issue, you have an *atomicity* issue (and probably a number of other serious issues), because you can't roll back. Doing it in the same transaction in which the table was created leaves you with a way to roll back: just delete the table (which happens implicitly, because the creation is part of the same transaction anyway).

Perhaps we can take some partial steps toward MVCC-safe access? For instance, how about trying to recheck the xmin of a pg_class tuple when starting a scan?

Regards, Jeff Davis
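A sketch of the two cases Jeff contrasts, assuming for the sake of argument that COPY wrote frozen tuples in both (table name and file path are illustrative):

-- Safe: the frozen tuples live only in storage this transaction created.
BEGIN;
CREATE TABLE t (x int);
COPY t FROM '/tmp/data.csv' WITH (FREEZE);
ROLLBACK;                                    -- the whole table vanishes; nothing leaks

-- Unsafe, hence disallowed: frozen tuples added to a pre-existing heap.
BEGIN;
COPY t FROM '/tmp/data.csv' WITH (FREEZE);   -- hypothetically permitted
ROLLBACK;                                    -- frozen tuples look committed to every
                                             -- snapshot; the abort cannot hide them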
On Mon, Dec 10, 2012 at 08:04:55PM -0500, Robert Haas wrote:
> You know, I hadn't been taking that option terribly seriously, but
> maybe we ought to reconsider it. It would certainly be simpler, and
> as you point out, it's not really any worse from an MVCC point of view
> than anything else we do. Moreover, it would make this available to
> clients like pg_dump without further hackery.
>
> I think the current behavior, where we treat FREEZE as a hint, is just
> awful. Regardless of whether the behavior is automatic or manually
> requested, the idea that you might get the optimization or not
> depending on the timing of relcache flushes seems very much
> undesirable. I mean, if the optimization is actually important for
> performance, then you want to get it when you ask for it. If it
> isn't, then why bother having it at all? Let's say that COPY FREEZE
> normally doubles performance on a data load that therefore takes 8
> hours - somebody who suddenly loses that benefit because of a relcache
> flush that they can't prevent or control and ends up with a 16 hour
> data load is going to pop a gasket.

Why was this patch applied when there are obviously so many concerns about its behavior? Was that not clear at commit time?

--
Bruce Momjian <bruce@momjian.us>        http://momjian.us
EnterpriseDB                            http://enterprisedb.com

+ It's impossible for everything to be true. +
On 11 December 2012 03:01, Noah Misch <noah@leadboat.com> wrote:
> On Mon, Dec 10, 2012 at 08:04:55PM -0500, Robert Haas wrote:
>> I think the current behavior, where we treat FREEZE as a hint, is just
>> awful. Regardless of whether the behavior is automatic or manually
>> requested, the idea that you might get the optimization or not
>> depending on the timing of relcache flushes seems very much
>> undesirable. I mean, if the optimization is actually important for
>> performance, then you want to get it when you ask for it. If it
>> isn't, then why bother having it at all? Let's say that COPY FREEZE
>> normally doubles performance on a data load that therefore takes 8
>> hours - somebody who suddenly loses that benefit because of a relcache
>> flush that they can't prevent or control and ends up with a 16 hour
>> data load is going to pop a gasket.
>
> Until these threads, I did not know that a relcache invalidation could trip up
> the WAL avoidance optimization, and now this. I poked at the relevant
> relcache.c code, and it already takes pains to preserve the needed facts. The
> header comment of RelationCacheInvalidate() indicates that entries bearing an
> rd_newRelfilenodeSubid can safely survive the invalidation, but the code does
> not implement that. I think the comment is right, and this is just an
> oversight in the code going back to its beginning (fba8113c).

I think the comment is right also and so the patch is good. I will apply, barring objections.

The information is only ever non-zero inside a single backend. There isn't any reason I can see why we wouldn't be able to remember this information in all cases, perhaps with a few push-ups.

> I doubt the comment at the declaration of rd_createSubid in rel.h, though I
> can't presently say what correct comment should replace it.

rd_createSubid certainly does *not* get blown away by a message overflow, as copy.c claims. Nor can I see any other way for it to be reset. I can no longer see a reason for us to regard rd_createSubid as merely a hint, so we should change copy.c also.

> CLUSTER does
> preserve the old value, at least for the main table relation. CLUSTER
> probably should *set* rd_newRelfilenodeSubid.

I can't see a reason not to do this in terms of correctness. However, I can't see any reason why you'd want to CLUSTER a table and then load more data into it in the same transaction, so I'm tempted to just leave that as is to avoid introducing other bugs. But I dare say people will think we should fix it there also.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
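For reference, the usage being weighed here would look something like this, assuming CLUSTER were changed to set rd_newRelfilenodeSubid (the table and index names are illustrative):

BEGIN;
CLUSTER t USING t_pkey;                      -- rewrites t into a new relfilenode
COPY t FROM '/tmp/more.csv';                 -- could then qualify for the same-transaction
                                             -- optimizations discussed in this thread
COMMIT;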
On Fri, Dec 21, 2012 at 06:47:56PM +0000, Simon Riggs wrote:
> On 11 December 2012 03:01, Noah Misch <noah@leadboat.com> wrote:
> > Until these threads, I did not know that a relcache invalidation could trip up
> > the WAL avoidance optimization, and now this. I poked at the relevant
> > relcache.c code, and it already takes pains to preserve the needed facts. The
> > header comment of RelationCacheInvalidate() indicates that entries bearing an
> > rd_newRelfilenodeSubid can safely survive the invalidation, but the code does
> > not implement that. I think the comment is right, and this is just an
> > oversight in the code going back to its beginning (fba8113c).
>
> I think the comment is right also and so the patch is good. I will
> apply, barring objections.
>
> The information is only ever non-zero inside a single backend. There
> isn't any reason I can see why we wouldn't be able to remember this
> information in all cases, perhaps with a few push-ups.
>
> > I doubt the comment at the declaration of rd_createSubid in rel.h, though I
> > can't presently say what correct comment should replace it.
>
> rd_createSubid certainly does *not* get blown away by a message overflow,
> as copy.c claims. Nor can I see any other way for it to be reset.
>
> I can no longer see a reason for us to regard rd_createSubid as merely
> a hint, so we should change copy.c also.

I thought of one case where we do currently forget rd_newRelfilenodeSubid:

BEGIN;
TRUNCATE t;
SAVEPOINT save;
TRUNCATE t;
ROLLBACK TO save;

I don't mind that one too much.

> > CLUSTER does
> > preserve the old value, at least for the main table relation. CLUSTER
> > probably should *set* rd_newRelfilenodeSubid.
>
> I can't see a reason not to do this in terms of correctness.
>
> However, I can't see any reason why you'd want to CLUSTER a table and
> then load more data into it in the same transaction, so I'm tempted to
> just leave that as is to avoid introducing other bugs.
>
> But I dare say people will think we should fix it there also.

I could see using that capability occasionally, but I wouldn't mix such a change in with the goals of this thread.

Thanks, nm
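Spelling out the consequence of that sequence for COPY (a sketch; as discussed upthread, the unmet precondition currently draws a NOTICE and the load proceeds unfrozen rather than erroring):

BEGIN;
TRUNCATE t;
SAVEPOINT save;
TRUNCATE t;
ROLLBACK TO save;
COPY t FROM '/tmp/data.csv' WITH (FREEZE);   -- rd_newRelfilenodeSubid has been forgotten,
                                             -- so the rows are not written frozen
COMMIT;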
On 21 December 2012 20:10, Noah Misch <noah@leadboat.com> wrote:
> I thought of one case where we do currently forget rd_newRelfilenodeSubid:
>
> BEGIN;
> TRUNCATE t;
> SAVEPOINT save;
> TRUNCATE t;
> ROLLBACK TO save;

That's a weird one: aborting a subtransaction that sets it, when it was already set.

The loss of rd_newRelfilenodeSubid in that case is deterministic, but tracking the full complexity of multiple relations and multiple nested subxids isn't worth the trouble for such rare cases [assumption].

I'd go for just setting an its_too_complex flag (with a better name) that we can use to trigger a message in COPY saying that the FREEZE option won't be honoured. That would then be completely consistent, rather than exhibiting the nondeterministic behaviour that Robert rightly objects to.

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On Sat, Dec 22, 2012 at 12:42:43AM +0000, Simon Riggs wrote:
> On 21 December 2012 20:10, Noah Misch <noah@leadboat.com> wrote:
> > I thought of one case where we do currently forget rd_newRelfilenodeSubid:
> >
> > BEGIN;
> > TRUNCATE t;
> > SAVEPOINT save;
> > TRUNCATE t;
> > ROLLBACK TO save;
>
> That's a weird one: aborting a subtransaction that sets it, when it was
> already set.
>
> The loss of rd_newRelfilenodeSubid in that case is deterministic, but
> tracking the full complexity of multiple relations and multiple nested
> subxids isn't worth the trouble for such rare cases [assumption].
>
> I'd go for just setting an its_too_complex flag (with a better name)
> that we can use to trigger a message in COPY saying that the FREEZE option
> won't be honoured. That would then be completely consistent, rather than
> exhibiting the nondeterministic behaviour that Robert rightly objects to.

I wouldn't bother. The behavior here is deterministic, the cause clearly traceable to the specific commands issued. Stable software won't suddenly miss the optimization for no visible reason.
[ blast-from-the-past department ]

Simon Riggs <simon@2ndQuadrant.com> writes:
> On 11 December 2012 03:01, Noah Misch <noah@leadboat.com> wrote:
>> Until these threads, I did not know that a relcache invalidation could trip up
>> the WAL avoidance optimization, and now this. I poked at the relevant
>> relcache.c code, and it already takes pains to preserve the needed facts. The
>> header comment of RelationCacheInvalidate() indicates that entries bearing an
>> rd_newRelfilenodeSubid can safely survive the invalidation, but the code does
>> not implement that. I think the comment is right, and this is just an
>> oversight in the code going back to its beginning (fba8113c).

> I think the comment is right also and so the patch is good. I will
> apply, barring objections.

I'm busy looking into the REINDEX-on-pg_class mess, and one thing I found while testing HEAD is that with CLOBBER_CACHE_ALWAYS on, once you've gotten a failure, your session is hosed:

regression=# select 1;
 ?column?
----------
        1
(1 row)

regression=# reindex index pg_class_oid_index;
psql: ERROR: could not read block 0 in file "base/16384/35344": read only 0 of 8192 bytes
regression=# select 1;
psql: ERROR: could not open file "base/16384/35344": No such file or directory

The problem is that the nailed relcache entry for pg_class_oid_index got updated to point to the new relfilenode, and that change did not get undone by transaction abort. There seem to be several bugs contributing to that, but the one I'm asking about right now is that commit ae9aba69a caused RelationCacheInvalidate to skip over relations that have rd_newRelfilenodeSubid set, as indeed pg_class_oid_index will have at the instant of the failure.

This seems quite wrong, because it prevents us from rebuilding the entry with corrected values. In particular notice that the change causes us to skip the RelationInitPhysicalAddr call that would normally be applied to a nailed mapped index in that loop. That's completely fatal in this case --- it keeps us from restoring the correct relfilenode that the mapper would now tell us, if we only bothered to ask.

I think perhaps what we should do is revert ae9aba69a and instead do

 	relcacheInvalsReceived++;

-	if (RelationHasReferenceCountZero(relation))
+	if (RelationHasReferenceCountZero(relation) &&
+		relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
 	{
 		/* Delete this entry immediately */
 		Assert(!relation->rd_isnailed);

so that entries with rd_newRelfilenodeSubid nonzero are preserved but are rebuilt. There might be an argument for treating rd_createSubid the same way, not sure. That field wouldn't ever be set on a system catalog, so it's less of a hazard, but there's something to be said for treating the two fields alike.

Thoughts?

			regards, tom lane
Hi,

On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
> I'm busy looking into the REINDEX-on-pg_class mess, and one thing I found
> while testing HEAD is that with CLOBBER_CACHE_ALWAYS on, once you've
> gotten a failure, your session is hosed:
>
> regression=# select 1;
>  ?column?
> ----------
>         1
> (1 row)
>
> regression=# reindex index pg_class_oid_index;
> psql: ERROR: could not read block 0 in file "base/16384/35344": read only 0 of 8192 bytes
> regression=# select 1;
> psql: ERROR: could not open file "base/16384/35344": No such file or directory

Yea, I noticed that too. It led me astray for a while, because it triggered apparent REINDEX failures for pg_index too, even though they were not actually related to the reindex.

> The problem is that the nailed relcache entry for pg_class_oid_index got
> updated to point to the new relfilenode, and that change did not get
> undone by transaction abort. There seem to be several bugs contributing
> to that, but the one I'm asking about right now is that commit ae9aba69a
> caused RelationCacheInvalidate to skip over relations that have
> rd_newRelfilenodeSubid set, as indeed pg_class_oid_index will have at the
> instant of the failure.

That commit message is quite unhelpful about what exactly it was intended to fix. I assume it was intended to make COPY FREEZE usage predictable, but...

> This seems quite wrong, because it prevents us from rebuilding the
> entry with corrected values. In particular notice that the change
> causes us to skip the RelationInitPhysicalAddr call that would
> normally be applied to a nailed mapped index in that loop. That's
> completely fatal in this case --- it keeps us from restoring the
> correct relfilenode that the mapper would now tell us, if we only
> bothered to ask.

Indeed. I'm a bit surprised that doesn't lead to more problems.

Not sure I understand where the RelationCacheInvalidate() call is coming from in this case, though. Shouldn't the entry have been invalidated individually through ATEOXact_Inval(false)?

> I think perhaps what we should do is revert ae9aba69a and instead do
>
>  	relcacheInvalsReceived++;
>
> -	if (RelationHasReferenceCountZero(relation))
> +	if (RelationHasReferenceCountZero(relation) &&
> +		relation->rd_newRelfilenodeSubid == InvalidSubTransactionId)
>  	{
>  		/* Delete this entry immediately */
>  		Assert(!relation->rd_isnailed);
>
> so that entries with rd_newRelfilenodeSubid nonzero are preserved but are
> rebuilt.

Seems like a reasonable approach.

> There might be an argument for treating rd_createSubid the same way,
> not sure. That field wouldn't ever be set on a system catalog, so
> it's less of a hazard, but there's something to be said for treating
> the two fields alike.

Seems sensible to treat it the same way. I'm not sure there's an actual case where the current treatment causes a problem, but it seems worth improving.

Greetings,

Andres Freund
Andres Freund <andres@anarazel.de> writes:
> On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
>> This seems quite wrong, because it prevents us from rebuilding the
>> entry with corrected values. In particular notice that the change
>> causes us to skip the RelationInitPhysicalAddr call that would
>> normally be applied to a nailed mapped index in that loop. That's
>> completely fatal in this case --- it keeps us from restoring the
>> correct relfilenode that the mapper would now tell us, if we only
>> bothered to ask.

> Indeed. I'm a bit surprised that doesn't lead to more problems.

> Not sure I understand where the RelationCacheInvalidate() call is coming
> from in this case, though. Shouldn't the entry have been invalidated
> individually through ATEOXact_Inval(false)?

In CLOBBER_CACHE_ALWAYS mode it's likely that we get a RelationCacheInvalidate call first.

Note that the change I'm talking about here is not sufficient to fix the failure; there are more problems behind it. I just wanted to know if there was something I was missing about that old patch.

			regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
>> There might be an argument for treating rd_createSubid the same way,
>> not sure. That field wouldn't ever be set on a system catalog, so
>> it's less of a hazard, but there's something to be said for treating
>> the two fields alike.

> Seems sensible to treat it the same way. I'm not sure there's an actual
> case where the current treatment causes a problem, but it seems worth
> improving.

So if I try to treat rd_createSubid the same way, it passes regression in normal mode, but CLOBBER_CACHE_ALWAYS is a total disaster: all table creations fail, e.g.

regression=# CREATE TABLE BOOLTBL1 (f1 bool);
psql: ERROR: relation 92773 deleted while still in use

What is happening is that once heap_create has created a relcache entry for the new relation, any RelationCacheInvalidate call causes us to try to reload that relcache entry, and that fails because there are no pg_class or pg_attribute entries for the new relation yet.

This gets back to the thing we were poking at in the other thread, which is whether or not a relcache entry is *always* reconstructible from on-disk state. I had in the back of my mind "um, what about initial table creation?" but hadn't looked closely. It seems that the only reason it works is that RelationCacheInvalidate is unwilling to touch new-in-transaction relcache entries.

The ideal fix for this, perhaps, would be to rejigger things far enough that we would not try to create a relcache entry for a new rel until it was supported by some minimal set of catalog entries. But that's more change than I really want to undertake right now, and for sure I wouldn't want to back-patch it.

Anyway, I believe the original justification for skipping new-in-transaction entries here is that they couldn't possibly be the subjects of any cross-backend notification messages. I think the argument that that applies to rd_newRelfilenodeSubid is a whole lot weaker, though.

But I'm going to let this go for the moment, because it's not clear whether a patch like this is actually relevant to our immediate problem. There seem to be some other order-of-operations issues in invalidation processing, and maybe fixing those would suffice. More later.

			regards, tom lane
Andres Freund <andres@anarazel.de> writes:
> On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
>> I'm busy looking into the REINDEX-on-pg_class mess, and one thing I found
>> while testing HEAD is that with CLOBBER_CACHE_ALWAYS on, once you've
>> gotten a failure, your session is hosed:
>>
>> regression=# select 1;
>>  ?column?
>> ----------
>>         1
>> (1 row)
>>
>> regression=# reindex index pg_class_oid_index;
>> psql: ERROR: could not read block 0 in file "base/16384/35344": read only 0 of 8192 bytes
>> regression=# select 1;
>> psql: ERROR: could not open file "base/16384/35344": No such file or directory

> Yea, I noticed that too. It led me astray for a while, because it
> triggered apparent REINDEX failures for pg_index too, even though they
> were not actually related to the reindex.

It seems that the reason we fail to recover is that we detect the error during CommandEndInvalidationMessages, after we've updated the relcache entry for pg_class_oid_index; of course at that point, any additional pg_class access is going to fail because it'll try to use the not-yet-existent index. However, when we then abort the transaction, we fail to revert pg_class_oid_index's relcache entry because *there is not yet an invalidation queue entry telling us to do so*.

This is, therefore, an order-of-operations bug in CommandEndInvalidationMessages. It should attach the "current command" queue entries to PriorCmdInvalidMsgs *before* it processes them, not after. In this way they'll be available to be reprocessed by AtEOXact_Inval or AtEOSubXact_Inval if we fail during CCI.

(A different way we could attack this is to make AtEOXact_Inval and AtEOSubXact_Inval process "current" messages, as they do not today. But I think that's a relatively inefficient solution, as it would force those messages to be reprocessed even in the much-more-common case where we abort before reaching CommandEndInvalidationMessages.)

The attached quick-hack patch changes that around, and seems to survive testing (though I've not completed a CCA run with it). The basic change I had to make to make this work is to make AppendInvalidationMessageList actually append the current-commands list to the prior-cmds list, rather than prepending it as the code has really always done. (In that way, we can still scan the current-commands list just by starting at its head.) The comments for struct InvalidationChunk explicitly disclaim any significance to the order of the entries, so this should be okay, and I'm not seeing any problem in testing.

It's been a long time since I wrote that code, but I think the reason I did it like that was the thought that in a long transaction, the prior-cmds list might be much longer than the current-commands list, so iterating down the prior-cmds list could add an O(N^2) cost. The easy answer to that, of course, is to make the lists doubly-linked, and I'd do that before committing; this version is just for testing. (It's short on comments, too.)

The thing I was worried about in RelationCacheInvalidate does seem to be a red herring; at least, fixing it is not necessary to make the broken-session-state problem go away.

Comments?

			regards, tom lane

diff --git a/src/backend/utils/cache/inval.c b/src/backend/utils/cache/inval.c
index f09e3a9..8894bdf 100644
*** a/src/backend/utils/cache/inval.c
--- b/src/backend/utils/cache/inval.c
*************** AddInvalidationMessage(InvalidationChunk
*** 264,287 ****
  }

  /*
!  * Append one list of invalidation message chunks to another, resetting
!  * the source chunk-list pointer to NULL.
   */
  static void
  AppendInvalidationMessageList(InvalidationChunk **destHdr,
  							  InvalidationChunk **srcHdr)
  {
! 	InvalidationChunk *chunk = *srcHdr;

! 	if (chunk == NULL)
  		return;					/* nothing to do */

! 	while (chunk->next != NULL)
! 		chunk = chunk->next;

! 	chunk->next = *destHdr;

! 	*destHdr = *srcHdr;
  	*srcHdr = NULL;
  }
--- 264,294 ----
  }

  /*
!  * Append the "source" list of invalidation message chunks to the "dest"
!  * list, resetting the source chunk-list pointer to NULL.
!  * Note that if the caller hangs onto the previous source pointer,
!  * the source list is still separately traversable afterwards.
   */
  static void
  AppendInvalidationMessageList(InvalidationChunk **destHdr,
  							  InvalidationChunk **srcHdr)
  {
! 	InvalidationChunk *chunk;

! 	if (*srcHdr == NULL)
  		return;					/* nothing to do */

! 	chunk = *destHdr;
! 	if (chunk == NULL)
! 		*destHdr = *srcHdr;
! 	else
! 	{
! 		while (chunk->next != NULL)
! 			chunk = chunk->next;
! 		chunk->next = *srcHdr;
! 	}
  	*srcHdr = NULL;
  }
*************** xactGetCommittedInvalidationMessages(Sha
*** 858,867 ****
  	 */
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);

- 	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
- 									 MakeSharedInvalidMessagesArray);
  	ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs,
  									 MakeSharedInvalidMessagesArray);
  	MemoryContextSwitchTo(oldcontext);

  	Assert(!(numSharedInvalidMessagesArray > 0 &&
--- 865,874 ----
  	 */
  	oldcontext = MemoryContextSwitchTo(CurTransactionContext);

  	ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs,
  									 MakeSharedInvalidMessagesArray);
+ 	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
+ 									 MakeSharedInvalidMessagesArray);
  	MemoryContextSwitchTo(oldcontext);

  	Assert(!(numSharedInvalidMessagesArray > 0 &&
*************** AtEOSubXact_Inval(bool isCommit)
*** 1084,1089 ****
--- 1091,1098 ----
  void
  CommandEndInvalidationMessages(void)
  {
+ 	InvalidationListHeader currentCmdInvalidMsgs;
+
  	/*
  	 * You might think this shouldn't be called outside any transaction, but
  	 * bootstrap does it, and also ABORT issued when not in a transaction.  So
*************** CommandEndInvalidationMessages(void)
*** 1092,1101 ****
  	if (transInvalInfo == NULL)
  		return;

! 	ProcessInvalidationMessages(&transInvalInfo->CurrentCmdInvalidMsgs,
! 								LocalExecuteInvalidationMessage);

  	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
  							   &transInvalInfo->CurrentCmdInvalidMsgs);
  }
--- 1101,1113 ----
  	if (transInvalInfo == NULL)
  		return;

! 	currentCmdInvalidMsgs = transInvalInfo->CurrentCmdInvalidMsgs;
!
  	AppendInvalidationMessages(&transInvalInfo->PriorCmdInvalidMsgs,
  							   &transInvalInfo->CurrentCmdInvalidMsgs);

+ 	ProcessInvalidationMessages(&currentCmdInvalidMsgs,
+ 								LocalExecuteInvalidationMessage);
  }
On Wed, May 01, 2019 at 07:01:23PM -0400, Tom Lane wrote:
> > On 2019-05-01 14:44:12 -0400, Tom Lane wrote:
> >> I'm busy looking into the REINDEX-on-pg_class mess, and one thing I found
> >> while testing HEAD is that with CLOBBER_CACHE_ALWAYS on, once you've
> >> gotten a failure, your session is hosed:
> >>
> >> regression=# select 1;
> >>  ?column?
> >> ----------
> >>         1
> >> (1 row)
> >>
> >> regression=# reindex index pg_class_oid_index;
> >> psql: ERROR: could not read block 0 in file "base/16384/35344": read only 0 of 8192 bytes
> >> regression=# select 1;
> >> psql: ERROR: could not open file "base/16384/35344": No such file or directory

> It seems that the reason we fail to recover is that we detect the error
> during CommandEndInvalidationMessages, after we've updated the relcache
> entry for pg_class_oid_index; of course at that point, any additional
> pg_class access is going to fail because it'll try to use the
> not-yet-existent index. However, when we then abort the transaction,
> we fail to revert pg_class_oid_index's relcache entry because *there is
> not yet an invalidation queue entry telling us to do so*.
>
> This is, therefore, an order-of-operations bug in
> CommandEndInvalidationMessages. It should attach the "current command"
> queue entries to PriorCmdInvalidMsgs *before* it processes them, not
> after. In this way they'll be available to be reprocessed by
> AtEOXact_Inval or AtEOSubXact_Inval if we fail during CCI.

That makes sense.

> The attached quick-hack patch changes that around, and seems to survive
> testing (though I've not completed a CCA run with it). The basic change
> I had to make to make this work is to make AppendInvalidationMessageList
> actually append the current-commands list to the prior-cmds list, rather
> than prepending it as the code has really always done. (In that way, we
> can still scan the current-commands list just by starting at its head.)
> The comments for struct InvalidationChunk explicitly disclaim any
> significance to the order of the entries, so this should be okay, and
> I'm not seeing any problem in testing. It's been a long time since I
> wrote that code, but I think the reason I did it like that was the
> thought that in a long transaction, the prior-cmds list might be much
> longer than the current-commands list, so iterating down the prior-cmds
> list could add an O(N^2) cost. The easy answer to that, of course, is
> to make the lists doubly-linked, and I'd do that before committing;
> this version is just for testing. (It's short on comments, too.)

Looks reasonable so far.

> The thing I was worried about in RelationCacheInvalidate does seem
> to be a red herring; at least, fixing it is not necessary to make
> the broken-session-state problem go away.

Your earlier proposal would have made RelationCacheInvalidate() work more like RelationFlushRelation() when rd_newRelfilenodeSubid is set. That's a good direction, all else being equal, though I'm not aware of a specific bug reachable today. I think RelationCacheInvalidate() would then need the reference count bits that RelationFlushRelation() has.
> *************** xactGetCommittedInvalidationMessages(Sha
> *** 858,867 ****
>   	 */
>   	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
>
> - 	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
> - 									 MakeSharedInvalidMessagesArray);
>   	ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs,
>   									 MakeSharedInvalidMessagesArray);
>   	MemoryContextSwitchTo(oldcontext);
>
>   	Assert(!(numSharedInvalidMessagesArray > 0 &&
> --- 865,874 ----
>   	 */
>   	oldcontext = MemoryContextSwitchTo(CurTransactionContext);
>
>   	ProcessInvalidationMessagesMulti(&transInvalInfo->PriorCmdInvalidMsgs,
>   									 MakeSharedInvalidMessagesArray);
> + 	ProcessInvalidationMessagesMulti(&transInvalInfo->CurrentCmdInvalidMsgs,
> + 									 MakeSharedInvalidMessagesArray);

Why this order change?
Noah Misch <noah@leadboat.com> writes:
> On Wed, May 01, 2019 at 07:01:23PM -0400, Tom Lane wrote:
>> The thing I was worried about in RelationCacheInvalidate does seem
>> to be a red herring; at least, fixing it is not necessary to make
>> the broken-session-state problem go away.

> Your earlier proposal would have made RelationCacheInvalidate() work more like
> RelationFlushRelation() when rd_newRelfilenodeSubid is set. That's a good
> direction, all else being equal, though I'm not aware of a specific bug
> reachable today. I think RelationCacheInvalidate() would then need the
> reference count bits that RelationFlushRelation() has.

Yeah. I'm not actually convinced that treating rd_createSubid and rd_newRelfilenodeSubid alike here is appropriate, though. If rd_createSubid is set then we certainly can assume that no other sessions can see/modify the relation, but we cannot make the same assumption when rd_newRelfilenodeSubid is set. The comment argues, in essence, that it's okay if we have AEL on the relation, but I'm not 100% convinced about that ... still, I can't construct a counterexample at the moment.

> Why this order change?

Because of the comment just above:

	 * ... Maintain the order that they
	 * would be processed in by AtEOXact_Inval(), to ensure emulated behaviour
	 * in redo is as similar as possible to original.  We want the same bugs,
	 * if any, not new ones.

In principle the order of processing inval events should not matter (if it does, then this patch is much more dangerous than it looks). But I concur with this comment that it's best if standby servers apply the events in the same order the master would; and this patch does cause that order to change.

			regards, tom lane