Thread: segmentation fault using currtid and partitioned tables
Hi, Another one caught by sqlsmith, on the regression database run this query (using any non-partitioned table works fine): """ select currtid('pagg_tab'::regclass::oid, '(0,156)'::tid) >= '(1,158)'::tid; """ This works on 11 (well it gives an error because the file doesn't exists) but crash the server on 12+ attached the stack trace from master -- Jaime Casanova www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Jaime Casanova <jaime.casanova@2ndquadrant.com> writes: > Another one caught by sqlsmith, on the regression database run this > query (using any non-partitioned table works fine): > select currtid('pagg_tab'::regclass::oid, '(0,156)'::tid) >= '(1,158)'::tid; Hm, so (1) currtid_byreloid and currtid_byrelname lack any check to see if they're dealing with a relkind that lacks storage. (2) The proximate cause of the crash is that rd_tableam is zero, so that the interface functions in tableam.h just crash hard. This seems like a pretty bad thing; does anyone want to promise that there are no other oversights of the same ilk elsewhere, and never will be? I think it might be a good idea to make relations-without-storage set up rd_tableam as a vector of dummy functions that will throw some suitable complaint about "relation lacks storage". NULL is a horrible default for this. regards, tom lane
On Sun, Apr 05, 2020 at 12:51:56PM -0400, Tom Lane wrote: > I think it might be a good idea to make relations-without-storage > set up rd_tableam as a vector of dummy functions that will throw > some suitable complaint about "relation lacks storage". NULL is > a horrible default for this. Yeah, that's not good, but I am not really comfortable with the concept of implying that (pg_class.relam == InvalidOid) maps to a dummy AM callback set instead of NULL for rd_tableam. That feels less natural. As mentioned upthread, the error that we get in ~11 is confusing as well when using a relation that has no storage: ERROR: 58P01: could not open file "base/16384/16385": No such file or directory I have been looking at the tree and the use of the table AM APIs, and those TID lookups are really a particular case compared to the other callers of the table AM callbacks. Anyway, I have not spotted similar problems, so I find very tempting the option to just add some RELKIND_HAS_STORAGE() to tid.c where it matters and call it a day. Andres, what do you think? -- Michael
Attachment
On Wed, Apr 08, 2020 at 04:13:31PM +0900, Michael Paquier wrote: > I have been looking at the tree and the use of the table AM APIs, and > those TID lookups are really a particular case compared to the other > callers of the table AM callbacks. Anyway, I have not spotted similar > problems, so I find very tempting the option to just add some > RELKIND_HAS_STORAGE() to tid.c where it matters and call it a day. Playing more with this stuff, it happens that we have zero code coverage for currtid() and currtid2(), and the main user of those functions I can find around is the ODBC driver: https://coverage.postgresql.org/src/backend/utils/adt/tid.c.gcov.html There are multiple cases to consider, particularly for views: - Case of a view with ctid as attribute taken from table. - Case of a view with ctid as attribute with incorrect attribute type. It is worth noting that all those code paths can trigger various elog() errors, which is not something that a user should be able to do using a SQL-callable function. There are also two code paths for cases where a view has no or more-than-one SELECT rules, which cannot normally be reached. All in that, I propose something like the attached to patch the surroundings with tests to cover everything I could think of, which I guess had better be backpatched? While on it, I have noticed that we lack coverage for max(tid) and min(tid), so I have included a bonus test. Another issue is that we don't have any documentation for those functions, in which case the best fit is a subsection for TID operators under "Functions and Operators"? For now, I am adding a patch to next CF so as we don't forget about this set of issues. Any thoughts? -- Michael
Attachment
On 2020-Apr-09, Michael Paquier wrote: > Playing more with this stuff, it happens that we have zero code > coverage for currtid() and currtid2(), and the main user of those > functions I can find around is the ODBC driver: > https://coverage.postgresql.org/src/backend/utils/adt/tid.c.gcov.html Yeah, they're there solely for ODBC as far as I know. > There are multiple cases to consider, particularly for views: > - Case of a view with ctid as attribute taken from table. > - Case of a view with ctid as attribute with incorrect attribute > type. > It is worth noting that all those code paths can trigger various > elog() errors, which is not something that a user should be able to do > using a SQL-callable function. There are also two code paths for > cases where a view has no or more-than-one SELECT rules, which cannot > normally be reached. > All in that, I propose something like the attached to patch the > surroundings with tests to cover everything I could think of, which I > guess had better be backpatched? I don't know, but this stuff is so unused that your patch seems excessive ... and I think we'd rather not backpatch something so large. I propose we do something less invasive in the backbranches, like just throw elog() errors (nothing fancy) where necessary to avoid the crashes. Even for pg12 it seems that that should be sufficient. For pg13 and beyond, I liked Tom's idea of installing dummy functions for tables without storage -- that seems safer. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, May 22, 2020 at 07:32:57PM -0400, Alvaro Herrera wrote: > I don't know, but this stuff is so unused that your patch seems > excessive ... and I think we'd rather not backpatch something so large. > I propose we do something less invasive in the backbranches, like just > throw elog() errors (nothing fancy) where necessary to avoid the > crashes. Even for pg12 it seems that that should be sufficient. Even knowing that those trigger a bunch of elog()s which are not something that should be user-triggerable? :) Perhaps you are right though, and that we don't need to spend this much energy into improving the error messages so I am fine to discard this part. At the end, in order to remove the crashes, you just need to keep around the two RELKIND_HAS_STORAGE() checks. But I would rather keep these two to use ereport(ERRCODE_FEATURE_NOT_SUPPORTED) instead of elog(), and keep the test coverage of the previous patch (including the tests for the aggregates I noticed were missing). Would you be fine with that? > For pg13 and beyond, I liked Tom's idea of installing dummy functions > for tables without storage -- that seems safer. Not sure about that for v13. That would be invasive post-beta. -- Michael
Attachment
On Mon, May 25, 2020 at 06:29:10PM +0900, Michael Paquier wrote: > Perhaps you are right though, and that we don't need to spend this > much energy into improving the error messages so I am fine to discard > this part. At the end, in order to remove the crashes, you just need > to keep around the two RELKIND_HAS_STORAGE() checks. But I would > rather keep these two to use ereport(ERRCODE_FEATURE_NOT_SUPPORTED) > instead of elog(), and keep the test coverage of the previous patch > (including the tests for the aggregates I noticed were missing). > Would you be fine with that? And this means the attached. Thoughts are welcome. -- Michael
Attachment
On Mon, 25 May 2020 at 22:01, Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, May 25, 2020 at 06:29:10PM +0900, Michael Paquier wrote: > > Perhaps you are right though, and that we don't need to spend this > > much energy into improving the error messages so I am fine to discard > > this part. At the end, in order to remove the crashes, you just need > > to keep around the two RELKIND_HAS_STORAGE() checks. But I would > > rather keep these two to use ereport(ERRCODE_FEATURE_NOT_SUPPORTED) > > instead of elog(), and keep the test coverage of the previous patch > > (including the tests for the aggregates I noticed were missing). > > Would you be fine with that? > > And this means the attached. Thoughts are welcome. so, currently the patch just installs protections on both currtid_* functions and adds some tests... therefore we can consider it as a bug fix and let it go in 13? actually also backpatch in 12... patch works, server doesn't crash anymore only point to mention is a typo (a missing "l") in an added comment: + * currtid_byrename -- Jaime Casanova www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Wed, May 27, 2020 at 12:29:39AM -0500, Jaime Casanova wrote: > so, currently the patch just installs protections on both currtid_* > functions and adds some tests... therefore we can consider it as a bug > fix and let it go in 13? actually also backpatch in 12... Yes, and it has the advantage to be simple. > only point to mention is a typo (a missing "l") in an added comment: > > + * currtid_byrename Oops, thanks. -- Michael
Attachment
On 2020-May-26, Michael Paquier wrote: > On Mon, May 25, 2020 at 06:29:10PM +0900, Michael Paquier wrote: > > Perhaps you are right though, and that we don't need to spend this > > much energy into improving the error messages so I am fine to discard > > this part. At the end, in order to remove the crashes, you just need > > to keep around the two RELKIND_HAS_STORAGE() checks. But I would > > rather keep these two to use ereport(ERRCODE_FEATURE_NOT_SUPPORTED) > > instead of elog(), and keep the test coverage of the previous patch > > (including the tests for the aggregates I noticed were missing). > > Would you be fine with that? > > And this means the attached. Thoughts are welcome. Yeah, this looks good to me. I would have used elog() instead, but I don't care enough ... as a translator, I can come up with a message as undecipherable as the original without worrying too much, since I suspect nobody will ever see it in practice. -- Álvaro Herrera https://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi, On 2020-05-22 19:32:57 -0400, Alvaro Herrera wrote: > On 2020-Apr-09, Michael Paquier wrote: > > > Playing more with this stuff, it happens that we have zero code > > coverage for currtid() and currtid2(), and the main user of those > > functions I can find around is the ODBC driver: > > https://coverage.postgresql.org/src/backend/utils/adt/tid.c.gcov.html > > Yeah, they're there solely for ODBC as far as I know. And there only for very old servers (< 8.2), according to Hiroshi Inoue. Found that out post 12 freeze. I was planning to drop them for 13, but I unfortunately didn't get around to do so :( I guess we could decide to make a freeze exception to remove them now, although I'm not sure the reasons for doing so are strong enough. > > There are multiple cases to consider, particularly for views: > > - Case of a view with ctid as attribute taken from table. > > - Case of a view with ctid as attribute with incorrect attribute > > type. > > It is worth noting that all those code paths can trigger various > > elog() errors, which is not something that a user should be able to do > > using a SQL-callable function. There are also two code paths for > > cases where a view has no or more-than-one SELECT rules, which cannot > > normally be reached. > > > All in that, I propose something like the attached to patch the > > surroundings with tests to cover everything I could think of, which I > > guess had better be backpatched? > > I don't know, but this stuff is so unused that your patch seems > excessive ... and I think we'd rather not backpatch something so large. > I propose we do something less invasive in the backbranches, like just > throw elog() errors (nothing fancy) where necessary to avoid the > crashes. Even for pg12 it seems that that should be sufficient. > > For pg13 and beyond, I liked Tom's idea of installing dummy functions I concur that it seems unnecessary to make these translatable, even with the reduced scope from https://www.postgresql.org/message-id/20200526025959.GE6155%40paquier.xyz Greetings, Andres Freund
Hi, On 2020-04-05 12:51:56 -0400, Tom Lane wrote: > (2) The proximate cause of the crash is that rd_tableam is zero, > so that the interface functions in tableam.h just crash hard. > This seems like a pretty bad thing; does anyone want to promise > that there are no other oversights of the same ilk elsewhere, > and never will be? > > I think it might be a good idea to make relations-without-storage > set up rd_tableam as a vector of dummy functions that will throw > some suitable complaint about "relation lacks storage". NULL is > a horrible default for this. I don't have particularly strong views here. I can see a benefit to such a pseudo AM. I can even imagine that there might some cases where we would actually introduce some tableam functions for e.g. partitioned or views tables, to centralize their handling more, instead of having such considerations more distributed. Clearly not worth actively trying to do that for all existing code dealing with such relkinds, but there might be cases where it's worthwhile. OTOH, it's kinda annoying having to maintain a not insignificant number of functions that needs to be updated whenever the tableam interface evolves. I guess we could partially hack our way through that by having one such function, and just assigning it to all the mandatory callbacks by way of a void cast. But that'd be mighty ugly. Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2020-04-05 12:51:56 -0400, Tom Lane wrote: >> I think it might be a good idea to make relations-without-storage >> set up rd_tableam as a vector of dummy functions that will throw >> some suitable complaint about "relation lacks storage". NULL is >> a horrible default for this. > OTOH, it's kinda annoying having to maintain a not insignificant number > of functions that needs to be updated whenever the tableam interface > evolves. That argument sounds pretty weak. If you're making breaking changes in the tableam API, updating the signatures (not even any code) of some dummy functions seems like by far the easiest part. regards, tom lane
On Thu, May 28, 2020 at 05:55:59PM -0700, Andres Freund wrote: > And there only for very old servers (< 8.2), according to Hiroshi > Inoue. Found that out post 12 freeze. I was planning to drop them for > 13, but I unfortunately didn't get around to do so :( [... digging ...] Ah, I think I see your point from the code. That's related to the use of RETURNING for ctids. > I guess we could decide to make a freeze exception to remove them now, > although I'm not sure the reasons for doing so are strong enough. Not sure that's a good thing to do after beta1 for 13, but there is an argument for that in 14. FWIW, my company is a huge user of the ODBC driver (perhaps the biggest one?), and we have nothing even close to 8.2. > I concur that it seems unnecessary to make these translatable, even with > the reduced scope from > https://www.postgresql.org/message-id/20200526025959.GE6155%40paquier.xyz Okay, I have switched the patch to do that. Any comments or objections? -- Michael
Attachment
On Fri, May 29, 2020 at 03:48:40PM +0900, Michael Paquier wrote: > Okay, I have switched the patch to do that. Any comments or > objections? Applied this one then. I also got to check the ODBC driver in more details, and I am indeed not seeing those functions getting used. One extra thing to know is that the ODBC driver requires libpq from at least 9.2, which may give one more argument to just remove them. NB: prion has been failing, just looking into it. -- Michael
Attachment
On Mon, Jun 01, 2020 at 10:57:29AM +0900, Michael Paquier wrote: > Applied this one then. I also got to check the ODBC driver in more > details, and I am indeed not seeing those functions getting used. > One extra thing to know is that the ODBC driver requires libpq from at > least 9.2, which may give one more argument to just remove them. > > NB: prion has been failing, just looking into it. Woah. This one is old, good catch from -DRELCACHE_FORCE_RELEASE. It happens that since its introduction in a3519a2 from 2002, currtid_for_view() in tid.c closes the view and then looks at a RTE from it. I have reproduced the issue and the patch attached takes care of the problem. Would it be better to backpatch all the way down or is that not worth caring about? -- Michael
Attachment
Michael Paquier <michael@paquier.xyz> writes: > Woah. This one is old, good catch from -DRELCACHE_FORCE_RELEASE. It > happens that since its introduction in a3519a2 from 2002, > currtid_for_view() in tid.c closes the view and then looks at a RTE > from it. I have reproduced the issue and the patch attached takes > care of the problem. Would it be better to backpatch all the way down > or is that not worth caring about? Ugh. Aside from the stale-pointer-deref problem, once we drop the lock we can't even be sure the table still exists. +1 for back-patch. regards, tom lane
On Sun, May 31, 2020 at 10:26:54PM -0400, Tom Lane wrote: > Ugh. Aside from the stale-pointer-deref problem, once we drop the lock > we can't even be sure the table still exists. +1 for back-patch. Thanks. Fixed down to 9.5 then to make prion happier. -- Michael