Thread: systable_getnext_ordered
hi, after systable_getnext_ordered returned NULL, is it ok to call it again? if yes, will it always returns NULL again? i'm wondering because inv_truncate seems to do it and expecting NULL. YAMAMOTO Takashi
yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes: > after systable_getnext_ordered returned NULL, is it ok to call it again? I wouldn't rely on it working. > i'm wondering because inv_truncate seems to do it and expecting NULL. Hmm, that may well be a bug. Have you tested it? regards, tom lane
I wrote: > yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes: >> after systable_getnext_ordered returned NULL, is it ok to call it again? > I wouldn't rely on it working. >> i'm wondering because inv_truncate seems to do it and expecting NULL. > Hmm, that may well be a bug. Have you tested it? I looked at this a bit more closely, and basically the answer is that inv_truncate is accidentally failing to fail. What will actually happen if systable_getnext_ordered is called another time, at least when using a btree index, is that it will start the same requested scan over again, ie deliver all the tuples it did the first time (which is none, in this case). That's an implementation artifact, but since the behavior is undefined in the first place, it's not wrong. Now, if inv_truncate's initial call on systable_getnext_ordered returns NULL (ie, the truncation point is past the current EOF page), it will fall through to the "Write a brand new page" code, which will create and insert a partial page at the truncation point. It then goes to the delete-all-remaining-pages loop. Because that starts a fresh scan with the very same scan key conditions, you might expect that it would find and delete the page it just inserted --- causing the apparent EOF of the blob to be wrong afterwards. It accidentally fails to do that because the new tuple postdates the snapshot it's scanning with. So the loop terminates having found no matching tuples, and all is well. So this code is confusing, inefficient (performing a useless search of the index), only works because of an obscure consideration not explained in the comments, and sets a bad precedent for people to follow. I'm going to go change it to explicitly not do the final loop if the initial search failed. It's not a bug, exactly, but it's sure lousy coding. Thanks for pointing it out. regards, tom lane
hi, > I wrote: >> yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes: >>> after systable_getnext_ordered returned NULL, is it ok to call it again? > >> I wouldn't rely on it working. > >>> i'm wondering because inv_truncate seems to do it and expecting NULL. > >> Hmm, that may well be a bug. Have you tested it? > > I looked at this a bit more closely, and basically the answer is that > inv_truncate is accidentally failing to fail. What will actually happen > if systable_getnext_ordered is called another time, at least when using > a btree index, is that it will start the same requested scan over again, > ie deliver all the tuples it did the first time (which is none, in this > case). That's an implementation artifact, but since the behavior is > undefined in the first place, it's not wrong. > > Now, if inv_truncate's initial call on systable_getnext_ordered returns > NULL (ie, the truncation point is past the current EOF page), it will > fall through to the "Write a brand new page" code, which will create and > insert a partial page at the truncation point. It then goes to the > delete-all-remaining-pages loop. Because that starts a fresh scan with > the very same scan key conditions, you might expect that it would find > and delete the page it just inserted --- causing the apparent EOF of the > blob to be wrong afterwards. It accidentally fails to do that because > the new tuple postdates the snapshot it's scanning with. So the loop > terminates having found no matching tuples, and all is well. > > So this code is confusing, inefficient (performing a useless search of > the index), only works because of an obscure consideration not explained > in the comments, and sets a bad precedent for people to follow. I'm > going to go change it to explicitly not do the final loop if the initial > search failed. It's not a bug, exactly, but it's sure lousy coding. > Thanks for pointing it out. thanks for the quick investigation and fix. the attached patch is to avoid unnecessary detoast'ing and EOF marker pages when possible. does it make sense? YAMAMOTO Takashi > > regards, tom lane > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers
yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes: > the attached patch is to avoid unnecessary detoast'ing and EOF marker pages > when possible. does it make sense? The blob page size is already chosen not to allow for out-of-line storage, not to mention that pg_largeobject doesn't have a TOAST table. So I think avoiding detoasting is largely a waste of time. I'm unexcited about the other consideration too --- it looks to me like it just makes truncation slower, more complicated, and hence more bug-prone, in return for a possible speedup that probably nobody will ever notice. regards, tom lane
hi, thanks for taking a look. > yamt@mwd.biglobe.ne.jp (YAMAMOTO Takashi) writes: >> the attached patch is to avoid unnecessary detoast'ing and EOF marker pages >> when possible. does it make sense? > > The blob page size is already chosen not to allow for out-of-line > storage, not to mention that pg_largeobject doesn't have a TOAST table. > So I think avoiding detoasting is largely a waste of time. doesn't detoasting involve decompression? > I'm > unexcited about the other consideration too --- it looks to me like it > just makes truncation slower, more complicated, and hence more > bug-prone, in return for a possible speedup that probably nobody will > ever notice. slower? it depends, i guess. my primary motivation of that part of the patch was to save some space for certain workloads. (besides that, leaving unnecessary rows isn't neat, but it might be a matter of taste.) YAMAMOTO Takashi > > regards, tom lane > > -- > Sent via pgsql-novice mailing list (pgsql-novice@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-novice