Thread: Assert in pageinspect with NULL pages
Hi, hackers! If we trying to call pageinspect functions for pages which are filled with nulls, we will get core dump. It happens with null pages for all indexes in pageinspect and for page_checksum. This problem was founded firstly by Anastasia Lubennikova, and now I continue this task. For example, next script leads to fail. CREATE TABLE test1 ( x bigserial, y bigint DEFAULT 0 ); INSERT INTO test1(y) SELECT 0 FROM generate_series(1,1E6) AS x; SELECT page_checksum(repeat(E'\\000', 8192)::bytea, 1); server closed the connection unexpectedly This probably means the server terminated abnormally before or while processing the request. fatal: connection to server was lost LOG: server process (PID 16465) was terminated by signal 6 DETAIL: Failed process was running: select page_checksum(repeat(E'\\000', 8192)::bytea, 1); LOG: terminating any other active server processes LOG: all server processes terminated; reinitializing LOG: database system was interrupted; last known up at 2022-02-16 14:03:16 +05 LOG: database system was not properly shut down; automatic recovery in progress LOG: redo starts at 0/14F1B20 LOG: invalid record length at 0/5D40CD8: wanted 24, got 0 LOG: redo done at 0/5D40BC0 system usage: CPU: user: 0.98 s, system: 0.02 s, elapsed: 1.01 s LOG: checkpoint starting: end-of-recovery immediate wait LOG: checkpoint complete: wrote 5500 buffers (33.6%); 0 WAL file(s) added, 0 removed, 4 recycled; write=0.064 s, sync=0.007 s, total=0.080 s; sync files=45, longest=0.004 s, average=0.001 s; distance=74044 kB, estimate=74044 kB LOG: database system is ready to accept connections Also it is possible to use select brin_metapage_info(repeat(E'\\000', 8192)::bytea); or select bt_page_items(repeat(E'\\000', 8192)::bytea); for getting fail. I tried to make some additional checks for null pages into pageinspect' functions. The attached patch also contains tests for this case. What do you think? -- Daria Lepikhova Postgres Professional
Attachment
On Thu, Feb 17, 2022 at 01:46:40PM +0500, Daria Lepikhova wrote: > INSERT INTO test1(y) SELECT 0 FROM generate_series(1,1E6) AS x; > SELECT page_checksum(repeat(E'\\000', 8192)::bytea, 1); Indeed. Good catch, and that seems pretty old at quick glance for the checksum part. I'll try to look at all that tomorrow. -- Michael
Attachment
BRIN can also crash if passed a non-brin index. I've been sitting on this one for awhile. Feel free to include it in your patchset. commit 08010a6037fc4e24a9ba05e5386e766f4310d35e Author: Justin Pryzby <pryzbyj@telsasoft.com> Date: Tue Jan 19 00:25:15 2021 -0600 pageinspect: brin_page_items(): check that given relation is not only an index, but a brin one diff --git a/contrib/pageinspect/brinfuncs.c b/contrib/pageinspect/brinfuncs.c index f1e64a39ef2..3de6dd943ef 100644 --- a/contrib/pageinspect/brinfuncs.c +++ b/contrib/pageinspect/brinfuncs.c @@ -16,6 +16,7 @@ #include "access/brin_tuple.h" #include "access/htup_details.h" #include "catalog/index.h" +#include "catalog/pg_am.h" #include "catalog/pg_type.h" #include "funcapi.h" #include "lib/stringinfo.h" @@ -59,7 +60,7 @@ brin_page_type(PG_FUNCTION_ARGS) if (raw_page_size != BLCKSZ) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("input page too small"), + errmsg("input page wrong size"), errdetail("Expected size %d, got %d", BLCKSZ, raw_page_size))); @@ -97,7 +98,7 @@ verify_brin_page(bytea *raw_page, uint16 type, const char *strtype) if (raw_page_size != BLCKSZ) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), - errmsg("input page too small"), + errmsg("input page wrong size"), errdetail("Expected size %d, got %d", BLCKSZ, raw_page_size))); @@ -169,7 +170,14 @@ brin_page_items(PG_FUNCTION_ARGS) MemoryContextSwitchTo(oldcontext); indexRel = index_open(indexRelid, AccessShareLock); - bdesc = brin_build_desc(indexRel); + + /* Must be a BRIN index */ + if (indexRel->rd_rel->relkind != RELKIND_INDEX || + indexRel->rd_rel->relam != BRIN_AM_OID) + ereport(ERROR, + (errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("\"%s\" is not a BRIN index", + RelationGetRelationName(indexRel)))); /* minimally verify the page we got */ page = verify_brin_page(raw_page, BRIN_PAGETYPE_REGULAR, "regular"); @@ -178,6 +186,7 @@ brin_page_items(PG_FUNCTION_ARGS) * Initialize output functions for all indexed datatypes; simplifies * calling them later. */ + bdesc = brin_build_desc(indexRel); columns = palloc(sizeof(brin_column_state *) * RelationGetDescr(indexRel)->natts); for (attno = 1; attno <= bdesc->bd_tupdesc->natts; attno++) {
Hello Michael, 18.02.2022 06:07, Michael Paquier wrpte: > On Thu, Feb 17, 2022 at 09:00:20PM -0600, Justin Pryzby wrote: >> BRIN can also crash if passed a non-brin index. >> >> I've been sitting on this one for awhile. Feel free to include it in your >> patchset. > Ugh. Thanks! I am keeping a note about this one. Could you please confirm before committing the patchset that it fixes the bug #16527 [1]? Or maybe I could check it? (Original patch proposed by Daria doesn't cover that case, but if the patch going to be improved, probably it's worth fixing that bug too.) [1] https://www.postgresql.org/message-id/flat/16527-ef7606186f0610a1%40postgresql.org Best regards, Alexander
First of all, thanks for the review and remarks! 18.02.2022 08:02, Michael Paquier пишет: > On Thu, Feb 17, 2022 at 05:40:41PM +0800, Julien Rouhaud wrote: >> About the patch, it's incorrectly using a hardcoded 8192 block-size rather than >> using the computed :block_size (or computing one when it's not already the >> case). > Well, the tests of pageinspect fail would already fail when using a > different page size than 8k, like the checksum ones :) > > Anywa, I agree with your point that if this is not a reason to not > make the tests more portable if we can do easily. Fixed. >> I'm also wondering if it wouldn't be better to return NULL rather than throwing >> an error for uninitialized pages. > Agreed that this is a sensible choice. NULL would be helpful for the > case where one compiles all the checksums of a relation with a full > scan based on the relation size, for example. All these behaviors > ought to be documented properly, as well. For SRFs, this should just > return no rows rather than generating an error. So, I tried to implement this remark. However, further getting rid of throwing an error and replacing it with a NULL return led to reproduce the problem that Alexander Lakhin mentioned And here I need little more time to figure it out. And one more addition. In the previous version of the patch, I forgot to add tests for the gist index, but the described problem is also relevant for it. -- Daria Lepikhova Postgres Professional
Attachment
18.02.2022 08:00, Justin Pryzby пишет: > BRIN can also crash if passed a non-brin index. > > I've been sitting on this one for awhile. Feel free to include it in your > patchset. > > commit 08010a6037fc4e24a9ba05e5386e766f4310d35e > Author: Justin Pryzby <pryzbyj@telsasoft.com> > Date: Tue Jan 19 00:25:15 2021 -0600 > > pageinspect: brin_page_items(): check that given relation is not only an index, but a brin one > > diff --git a/contrib/pageinspect/brinfuncs.c b/contrib/pageinspect/brinfuncs.c > index f1e64a39ef2..3de6dd943ef 100644 > --- a/contrib/pageinspect/brinfuncs.c > +++ b/contrib/pageinspect/brinfuncs.c > @@ -16,6 +16,7 @@ > #include "access/brin_tuple.h" > #include "access/htup_details.h" > #include "catalog/index.h" > +#include "catalog/pg_am.h" > #include "catalog/pg_type.h" > #include "funcapi.h" > #include "lib/stringinfo.h" > @@ -59,7 +60,7 @@ brin_page_type(PG_FUNCTION_ARGS) > if (raw_page_size != BLCKSZ) > ereport(ERROR, > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > - errmsg("input page too small"), > + errmsg("input page wrong size"), > errdetail("Expected size %d, got %d", > BLCKSZ, raw_page_size))); > > @@ -97,7 +98,7 @@ verify_brin_page(bytea *raw_page, uint16 type, const char *strtype) > if (raw_page_size != BLCKSZ) > ereport(ERROR, > (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > - errmsg("input page too small"), > + errmsg("input page wrong size"), > errdetail("Expected size %d, got %d", > BLCKSZ, raw_page_size))); > > @@ -169,7 +170,14 @@ brin_page_items(PG_FUNCTION_ARGS) > MemoryContextSwitchTo(oldcontext); > > indexRel = index_open(indexRelid, AccessShareLock); > - bdesc = brin_build_desc(indexRel); > + > + /* Must be a BRIN index */ > + if (indexRel->rd_rel->relkind != RELKIND_INDEX || > + indexRel->rd_rel->relam != BRIN_AM_OID) > + ereport(ERROR, > + (errcode(ERRCODE_WRONG_OBJECT_TYPE), > + errmsg("\"%s\" is not a BRIN index", > + RelationGetRelationName(indexRel)))); > > /* minimally verify the page we got */ > page = verify_brin_page(raw_page, BRIN_PAGETYPE_REGULAR, "regular"); > @@ -178,6 +186,7 @@ brin_page_items(PG_FUNCTION_ARGS) > * Initialize output functions for all indexed datatypes; simplifies > * calling them later. > */ > + bdesc = brin_build_desc(indexRel); > columns = palloc(sizeof(brin_column_state *) * RelationGetDescr(indexRel)->natts); > for (attno = 1; attno <= bdesc->bd_tupdesc->natts; attno++) > { Thanks! This is a very valuable note. I think I will definitely add it to the next version of the patch, as soon as I deal with the error exception. -- Daria Lepikhova Postgres Professional
On Thu, Feb 17, 2022 at 09:00:20PM -0600, Justin Pryzby wrote: > BRIN can also crash if passed a non-brin index. > > I've been sitting on this one for awhile. Feel free to include it in your > patchset. So, I have begun a close lookup of this thread, and the problem you are reporting here is worth fixing on its own, before we do something for new pages as reported originally. There is more that caught my attention than the brin AM not being check properly, because a couple of code paths are fuzzy with the checks on the page sizes. My impression of the whole thing is that we'd better generalize the use of get_page_from_raw(), improving the code on many sides when the user can provide a raw page in input (also I'd like to think that it is a better practice anyway as any of those functions may finish to look 8-byte fields, but the current coding would silently break in alignment-picky environments): - Some code paths (hash, btree) would trigger elog(ERROR) but we cannot use that for errors that can be triggered by the user. - bt_page_items_bytea(), page_header() and fsm_page_contents() were fuzzy on the page size check, so it was rather easy to generate garbage with random data. - page_checksum_internal() used a PageHeader, where I would have expected a Page. - More consistency in the error strings, meaning less contents to translate in the long-term. This first batch leads me to the attached, with tests to stress all that for all the functions taking raw pages in input. -- Michael
Attachment
On Tue, Mar 15, 2022 at 06:32:44PM +0900, Michael Paquier wrote: > + if (!IS_BRIN(indexRel)) > + ereport(ERROR, > + (errcode(ERRCODE_WRONG_OBJECT_TYPE), > + errmsg("\"%s\" is not a %s index", > + RelationGetRelationName(indexRel), "BRIN"))); If it were me, I'd write this without the extra parens around (errcode()). > +-- Suppress the DETAIL message, to allow the tests to work across various > +-- default page sizes. I think you mean "various non-default page sizes" or "various page sizes". -- Justin
On Tue, Mar 15, 2022 at 06:56:46AM -0500, Justin Pryzby wrote: > On Tue, Mar 15, 2022 at 06:32:44PM +0900, Michael Paquier wrote: >> +-- Suppress the DETAIL message, to allow the tests to work across various >> +-- default page sizes. > > I think you mean "various non-default page sizes" or "various page sizes". Thanks. Indeed, this sounded a bit weird. I have been able to look at all that this morning and backpatched this first part. -- Michael
Attachment
On Mon, Feb 21, 2022 at 10:00:00AM +0300, Alexander Lakhin wrote: > Could you please confirm before committing the patchset that it fixes > the bug #16527 [1]? Or maybe I could check it? > (Original patch proposed by Daria doesn't cover that case, but if the > patch going to be improved, probably it's worth fixing that bug too.) > > [1] > https://www.postgresql.org/message-id/flat/16527-ef7606186f0610a1%40postgresql.org FWIW, I think that this one has been fixed by 076f4d9, where we make sure that the page is correctly aligned. Are you still seeing this problem? -- Michael
Attachment
On Wed, Feb 23, 2022 at 12:09:02PM +0500, Daria Lepikhova wrote: > And one more addition. In the previous version of the patch, I forgot to add > tests for the gist index, but the described problem is also relevant for it. So, I have looked at this second part of the thread, and concluded that we should not fail for empty pages. First, we fetch pages from the buffer pool in normal mode, where empty pages are valid. There is also a second point in favor of doing so: code paths dedicated to hash indexes already do that, marking such pages as simply "unused". The proposal from Julien upthread sounds cleaner to me though in the long run, as NULL gives the user the possibility to do a full-table scan with simple clauses to filter out anything found as NULL. Painting more PageIsNew() across the place requires a bit more work than a simple ereport(ERROR) in get_page_from_raw(), of course, but the gain is the portability of the functions. (One can have a lot of fun playing with random inputs and breaking most code paths, but that's not new.) -- Michael
Attachment
Hello Michael, 16.03.2022 10:39, Michael Paquier wrote: > On Mon, Feb 21, 2022 at 10:00:00AM +0300, Alexander Lakhin wrote: >> Could you please confirm before committing the patchset that it fixes >> the bug #16527 [1]? Or maybe I could check it? >> (Original patch proposed by Daria doesn't cover that case, but if the >> patch going to be improved, probably it's worth fixing that bug too.) >> >> [1] >> https://www.postgresql.org/message-id/flat/16527-ef7606186f0610a1%40postgresql.org > FWIW, I think that this one has been fixed by 076f4d9, where we make > sure that the page is correctly aligned. Are you still seeing this > problem? > Yes, I've reproduced that issue on master (46d9bfb0). pageinspect-zeros-v2.patch doesn't help too. Best regards. Alexander
On Wed, Mar 16, 2022 at 08:35:59PM +0300, Alexander Lakhin wrote: > Yes, I've reproduced that issue on master (46d9bfb0). Okay. I'll try to look more closely at that, then. It feels like we are just missing a MAXALIGN() in those code paths. Are you using any specific CFLAGS or something environment-sensitive (macos, 32b, etc.)? > pageinspect-zeros-v2.patch doesn't help too. Well, the scope is different, so that's not a surprise. -- Michael
Attachment
Hello Michael, 18.03.2022 05:06, Michael Paquier wrote: > Okay. I'll try to look more closely at that, then. It feels like we > are just missing a MAXALIGN() in those code paths. Are you using any > specific CFLAGS or something environment-sensitive (macos, 32b, etc.)? No, just x86_64, Ubuntu 20.04, gcc 11, valgrind 3.15. I just put that query in page.sql and see the server abort. Best regards, Alexander
On Fri, Mar 18, 2022 at 06:43:39AM +0300, Alexander Lakhin wrote: > Hello Michael, > No, just x86_64, Ubuntu 20.04, gcc 11, valgrind 3.15. I just put that query > in page.sql and see the server abort. Bah, of course, I have missed the valgrind part of the problem. The problem is that we attempt to verify a heap page here but its pd_special is BLCKSZ. This causes BrinPageType() to grab back a position of the area at the exact end of the page, via PageGetSpecialPointer(), hence the failure in reading two bytes outside the page. The result here is that the set of defenses in verify_brin_page() is poor: we should at least make sure that the special area is available for a read. As far as I can see, this is also possible in bt_page_items_bytea(), gist_page_opaque_info(), gin_metapage_info() and gin_page_opaque_info(). All those code paths should be protected with some checks on PageGetSpecialSize(), I guess, before attempting to read the special area of the page. Hash indexes are protected by checking the expected size of the special area, and one code path of btree relies on the opened relation to be a btree index. -- Michael
Attachment
Hi, On Wed, Mar 23, 2022 at 05:16:41PM +0900, Michael Paquier wrote: > On Fri, Mar 18, 2022 at 06:43:39AM +0300, Alexander Lakhin wrote: > > Hello Michael, > > No, just x86_64, Ubuntu 20.04, gcc 11, valgrind 3.15. I just put that query > > in page.sql and see the server abort. > > Bah, of course, I have missed the valgrind part of the problem. > > The problem is that we attempt to verify a heap page here but its > pd_special is BLCKSZ. This causes BrinPageType() to grab back a > position of the area at the exact end of the page, via > PageGetSpecialPointer(), hence the failure in reading two bytes > outside the page. The result here is that the set of defenses in > verify_brin_page() is poor: we should at least make sure that the > special area is available for a read. As far as I can see, this is > also possible in bt_page_items_bytea(), gist_page_opaque_info(), > gin_metapage_info() and gin_page_opaque_info(). All those code paths > should be protected with some checks on PageGetSpecialSize(), I > guess, before attempting to read the special area of the page. Hash > indexes are protected by checking the expected size of the special > area, and one code path of btree relies on the opened relation to be a > btree index. I worked on a patch to fix the problem. The functions you mention were indeed missing some check, but after digging a bit more I found that other problem existed. For instance, feeding a btree page to a gist_page_items_bytea() (both pages have the same special size) can be quite problematic too, as you can end up accessing bogus data when retrieving the items. I also added additional sanity checks with what is available (gist_page_id for gist, zero-level for btree leaf pages and so on), and tried to cover everything with tests. I'm a bit worried about the btree tests stability. I avoid emitting the level found to help with that, but it still depends on what other AM will put in their special page.
Attachment
On Thu, Mar 24, 2022 at 04:27:40PM +0800, Julien Rouhaud wrote: > I worked on a patch to fix the problem. The functions you mention were indeed > missing some check, but after digging a bit more I found that other problem > existed. For instance, feeding a btree page to a gist_page_items_bytea() (both > pages have the same special size) can be quite problematic too, as you can end > up accessing bogus data when retrieving the items. I also added additional > sanity checks with what is available (gist_page_id for gist, zero-level for > btree leaf pages and so on), and tried to cover everything with tests. Thanks for the patch! I have reviewed what you have sent, bumping on a couple of issues: - The tests of btree and BRIN failed with 32-bit builds, because MAXALIGN returns shorter special area sizes in those cases. This can be fixed by abusing of \set VERBOSITY to mask the error details. We already do that in some of the tests to make them portable. - Some of the tests were not necessary, overlapping with stuff that already existed. - Some checks missed a MAXALIGN(). - Did some tweaks with the error messages. - Some error messages used an incorrect term to define the index AM type, aka s/gist/GiST/ or s/brin/BRIN/. - errdetail() requires a sentence, finishing with a dot (some places of hashfuncs.c missed that. - Not your fault, but hash indexes would complain about corrupted indexes which could be misleading for the user if passing down a correct page from an incorrect index AM. - While on it, I have made the error messages more generic in the places where I could do so. What I have finished with seems to have a good balance. > I'm a bit worried about the btree tests stability. I avoid emitting the level > found to help with that, but it still depends on what other AM will put in > their special page. Well, the limit of the pageinspect model comes from the fact that it is possible to pass down any bytea and all those code paths would happily process the blobs as long as they are 8kB. Pages can be crafted as well to bypass some of the checks. This is superuser-only, so people have to be careful, but preventing out-of-bound reads is a different class of problem, as long as these come from valid pages. With all that in place, I get the attached. It is Friday, so I am not going to take my bets on the buildfarm today with a rather-risky patch. Monday/Tuesday will be fine. -- Michael
Attachment
On Fri, Mar 25, 2022 at 11:44:26AM +0900, Michael Paquier wrote: > I have reviewed what you have sent, bumping on a couple of issues: Thanks! I'm happy with all the changes, except: + if (P_ISLEAF(opaque) && opaque->btpo_level != 0) + ereport(ERROR, + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), + errmsg("block is not a valid leaf page"))); All other messages specify which kind of page it's about, so I think it would be better to specify "btree" leaf page here, especially since some other AMs also have leaf pages. > - The tests of btree and BRIN failed with 32-bit builds, because > MAXALIGN returns shorter special area sizes in those cases. This can > be fixed by abusing of \set VERBOSITY to mask the error details. We > already do that in some of the tests to make them portable. Yeah, that's the other stability problem I was worried about. I should have tried to compile with -m32. > > I'm a bit worried about the btree tests stability. I avoid emitting the level > > found to help with that, but it still depends on what other AM will put in > > their special page. > > Well, the limit of the pageinspect model comes from the fact that it > is possible to pass down any bytea and all those code paths would > happily process the blobs as long as they are 8kB. Pages can be > crafted as well to bypass some of the checks. This is superuser-only, > so people have to be careful, but preventing out-of-bound reads is a > different class of problem, as long as these come from valid pages. Agreed. Also pageinspect can be handy when debugging corruption, so I think it shouldn't try too hard to discard buggy pages.
On Fri, Mar 25, 2022 at 11:03:47AM +0800, Julien Rouhaud wrote: > I'm happy with all the changes, except: > > + if (P_ISLEAF(opaque) && opaque->btpo_level != 0) > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("block is not a valid leaf page"))); > > All other messages specify which kind of page it's about, so I think it would > be better to specify "btree" leaf page here, especially since some other AMs > also have leaf pages. Makes sense. Fine by me to stick an extra "btree" here. -- Michael
Attachment
On Wed, Mar 23, 2022 at 1:16 AM Michael Paquier <michael@paquier.xyz> wrote: > As far as I can see, this is > also possible in bt_page_items_bytea(), gist_page_opaque_info(), > gin_metapage_info() and gin_page_opaque_info(). All those code paths > should be protected with some checks on PageGetSpecialSize(), I > guess, before attempting to read the special area of the page. Hash > indexes are protected by checking the expected size of the special > area, and one code path of btree relies on the opened relation to be a > btree index. amcheck's palloc_btree_page() function validates that an 8KiB page is in fact an nbtree page, in a maximally paranoid way. Might be an example worth following here. -- Peter Geoghegan
On Thu, Mar 24, 2022 at 08:54:14PM -0700, Peter Geoghegan wrote: > amcheck's palloc_btree_page() function validates that an 8KiB page is > in fact an nbtree page, in a maximally paranoid way. Might be an > example worth following here. Sure, we could do that. However, I don't think that going down to that is something we have any need for in pageinspect, as the module is also useful to look at the contents of the full page, even if slightly corrupted, and too many checks would prevent a lookup at the internal contents of a page. -- Michael
Attachment
On Fri, Mar 25, 2022 at 12:27:24PM +0900, Michael Paquier wrote: > Makes sense. Fine by me to stick an extra "btree" here. I have been able to look at that again today (earlier than expected), and except for one incorrect thing in the GIN tests where we were not testing the correct pattern, the rest was correct. So applied and backpatched. The buildfarm is not complaining based on the first reports. -- Michael
Attachment
On Fri, Mar 25, 2022 at 12:57 AM Michael Paquier <michael@paquier.xyz> wrote: > On Thu, Mar 24, 2022 at 08:54:14PM -0700, Peter Geoghegan wrote: > > amcheck's palloc_btree_page() function validates that an 8KiB page is > > in fact an nbtree page, in a maximally paranoid way. Might be an > > example worth following here. > > Sure, we could do that. However, I don't think that going down to > that is something we have any need for in pageinspect, as the module > is also useful to look at the contents of the full page, even if > slightly corrupted, and too many checks would prevent a lookup at the > internal contents of a page. It's my impression that there are many ways of crashing the system using pageinspect right now. We aren't very careful about making sure that our code that reads from disk doesn't crash if the data on disk is corrupted, and all of that systemic weakness is inherited by pageinspect. But, with pageinspect, you can supply your own pages, not just use the ones that actually exist on disk. Fixing this seems like a lot of work and problematic for various reasons. And I do agree that if we can allow people to look at what's going on with a corrupt page, that's useful. On the other hand, if by "looking" they crash the system, that sucks a lot. So overall I think we need a lot more sanity checks here. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, 27 Mar 2022 at 16:24, Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Mar 25, 2022 at 12:57 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Mar 24, 2022 at 08:54:14PM -0700, Peter Geoghegan wrote: > > > amcheck's palloc_btree_page() function validates that an 8KiB page is > > > in fact an nbtree page, in a maximally paranoid way. Might be an > > > example worth following here. > > > > Sure, we could do that. However, I don't think that going down to > > that is something we have any need for in pageinspect, as the module > > is also useful to look at the contents of the full page, even if > > slightly corrupted, and too many checks would prevent a lookup at the > > internal contents of a page. > > It's my impression that there are many ways of crashing the system > using pageinspect right now. We aren't very careful about making sure > that our code that reads from disk doesn't crash if the data on disk > is corrupted, and all of that systemic weakness is inherited by > pageinspect. But, with pageinspect, you can supply your own pages, not > just use the ones that actually exist on disk. > > Fixing this seems like a lot of work and problematic for various > reasons. And I do agree that if we can allow people to look at what's > going on with a corrupt page, that's useful. On the other hand, if by > "looking" they crash the system, that sucks a lot. So overall I think > we need a lot more sanity checks here. I noticed this thread due to recent commit 291e517a, and noticed that it has some overlap with one of my patches [0], in which I fix the corresponding issue in core postgres by assuming (and in assert-enabled builds, verifying) the size and location of the special area. As such, you might be interested in that patch. Note that currently in core postgres a corrupted special area pointer will likely target neighbouring blocks in the buffer pool; resulting in spreading corruption when the special area is updated. This spreading corruption should be limited to only the corrupted block with my patch. Kind regards, Matthias van de Meent [0] https://commitfest.postgresql.org/37/3543/ https://www.postgresql.org/message-id/CAEze2WjE3+tGO9Fs9+iZMU+z6mMZKo54W1Zt98WKqbEUHbHOBg@mail.gmail.com
On Sun, Mar 27, 2022 at 7:24 AM Robert Haas <robertmhaas@gmail.com> wrote: > It's my impression that there are many ways of crashing the system > using pageinspect right now. We aren't very careful about making sure > that our code that reads from disk doesn't crash if the data on disk > is corrupted, and all of that systemic weakness is inherited by > pageinspect. It's significantly worse than the core code, I'd say (before we even consider the fact that you can supply your own pages). At least with index AMs, where functions like _bt_checkpage() and gistcheckpage() provide the most basic sanity checks for any page that is read from disk. > Fixing this seems like a lot of work and problematic for various > reasons. And I do agree that if we can allow people to look at what's > going on with a corrupt page, that's useful. On the other hand, if by > "looking" they crash the system, that sucks a lot. So overall I think > we need a lot more sanity checks here. I don't think it's particularly hard to do a little more. That's all it would take to prevent practically all crashes. We're not dealing with adversarial page images here. Sure, it would be overkill to fully adapt something like palloc_btree_page() for pageinspect, for the reason Michael gave. But there are a couple of checks that it does that would avoid practically all real world hard crashes, without any plausible downside: * The basic _bt_checkpage() call that every nbtree page uses in the core code (as well as in amcheck). * The maxoffset > MaxIndexTuplesPerPage check. You could even go a bit further, and care about individual index tuples. My commit 93ee38eade added some basic sanity checks for index tuples to bt_page_items(). That could go a bit further as well. In particular, the sanity checks from amcheck's PageGetItemIdCareful() function could be carried over. That would make it impossible for bt_page_items() to read past the end of the page when given a page that has an index tuple whose item pointer offset has some wildly unreasonable value. I'm not volunteering. Just saying that this is quite possible. -- Peter Geoghegan
On Sun, Mar 27, 2022 at 4:26 PM Peter Geoghegan <pg@bowt.ie> wrote: > We're not dealing > with adversarial page images here. I think it's bad that we have to make that assumption, considering that there's nothing whatever to keep users from supplying arbitrary page images to pageinspect. But I also agree that if we're unable or unwilling to make things perfect, it's still good to make them better. -- Robert Haas EDB: http://www.enterprisedb.com
On Sun, Mar 27, 2022 at 2:02 PM Robert Haas <robertmhaas@gmail.com> wrote: > On Sun, Mar 27, 2022 at 4:26 PM Peter Geoghegan <pg@bowt.ie> wrote: > > We're not dealing > > with adversarial page images here. > > I think it's bad that we have to make that assumption, considering > that there's nothing whatever to keep users from supplying arbitrary > page images to pageinspect. Maybe it isn't strictly necessary for bt_page_items(), but making that level of guarantee is really hard, and not particularly useful. And that's the easy case for pageinspect: gist_page_items() takes a raw bytea, and puts it through the underlying types output functions. I think that it might actually be fundamentally impossible to guarantee that that'll be safe, because we have no idea what the output function might be doing. It's arbitrary user-defined code that could easily be implemented in C. Combined with an arbitrary page image. > But I also agree that if we're unable or > unwilling to make things perfect, it's still good to make them better. I think that most of the functions can approach being perfectly robust, with a little work. In practical terms they can almost certainly be made so robust that no real user of bt_page_items() will ever crash the server. Somebody that goes out of their way to do that *might* find a way (even with the easier cases), but that doesn't particularly concern me. -- Peter Geoghegan
On Sun, Mar 27, 2022 at 02:36:54PM -0700, Peter Geoghegan wrote: > I think that it might actually be fundamentally impossible to > guarantee that that'll be safe, because we have no idea what the > output function might be doing. It's arbitrary user-defined code that > could easily be implemented in C. Combined with an arbitrary page > image. My guess that you'd bring in a lot more safety if we completely cut the capacity to fetch and pass down raw pages across the SQL calls because you can add checks for the wanted AM, replacing all that with a (regclass,blkno) pair. I am however ready to buy that this may not be worth rewriting 10~15 years after the fact. > I think that most of the functions can approach being perfectly > robust, with a little work. In practical terms they can almost > certainly be made so robust that no real user of bt_page_items() will > ever crash the server. Somebody that goes out of their way to do that > *might* find a way (even with the easier cases), but that doesn't > particularly concern me. That does not concern me either, and my own take is to take as realistic things that can be fetched from a get_raw_page() call rather than a bytea specifically crafted. Now, I have found myself in cases where it was useful to see the contents of a page, as long as you can go through the initial checks, particularly in cases where corruptions involved unaligned contents. Trigerring an error on a sanity check for a specific block may be fine, now I have also found myself doing some full scans to see in one shot the extent of a damaged relation file using the functions of pageinspect. Fun times. -- Michael
Attachment
I've suddenly found that the test in this patch is based on a fact that heap pages don't have PageSpecial or it is of different size with btree pages Special area:
CREATE TABLE test1 (a int8, b int4range);
SELECT bt_page_items(get_raw_page('test1', 0));
In the current state is is so, but it is not guaranteed. For example, in the proposed 64xid patch [1] we introduce PageSpecial into heap pages and its size on occasion equals to the size of btree page special so check from above is false. Even if we pass heap page into pageinspect pretending it is btree page. So the test will fail.
Generally it seems not only a wrong test but the consequence of a bigger scale fact that we can not always distinguish different AM's pages. So I'd propose at least that we don't come to a conclusion that a page is valid based on PageSpecial size is right.
Hi, On Mon, Mar 28, 2022 at 08:29:29PM +0300, Maxim Orlov wrote: > I've suddenly found that the test in this patch is based on a fact that > heap pages don't have PageSpecial or it is of different size with btree > pages Special area: > > CREATE TABLE test1 (a int8, b int4range); > > SELECT bt_page_items(get_raw_page('test1', 0)); > > > In the current state is is so, but it is not guaranteed. For example, in > the proposed 64xid patch [1] we introduce PageSpecial into heap pages and > its size on occasion equals to the size of btree page special so check from > above is false. Even if we pass heap page into pageinspect pretending it is > btree page. So the test will fail. > > > Generally it seems not only a wrong test but the consequence of a bigger > scale fact that we can not always distinguish different AM's pages. So I'd > propose at least that we don't come to a conclusion that a page is valid > based on PageSpecial size is right. We don't assume that the page is valid, or of the correct AM, just based on the special area size. We do reject it if the size is invalid, but we also have other tests based on what's actually possible to test (checking flag sanity, magic values and so on). Note that Peter G. suggested to add more checks for btree based on palloc_btree_page(), did you check if those were enough to fix this test with the 64bits xid patchset applied?
On Wed, Mar 16, 2022 at 05:43:48PM +0900, Michael Paquier wrote: > So, I have looked at this second part of the thread, and concluded > that we should not fail for empty pages. First, we fetch pages from > the buffer pool in normal mode, where empty pages are valid. There is > also a second point in favor of doing so: code paths dedicated to hash > indexes already do that, marking such pages as simply "unused". The > proposal from Julien upthread sounds cleaner to me though in the long > run, as NULL gives the user the possibility to do a full-table scan > with simple clauses to filter out anything found as NULL. It has been a couple of weeks, but I have been able to come back to this set of issues for all-zero pages, double-checked the whole and applied a set of fixes down to 10. So we should be completely done here. -- Michael