Thread: Online checksums verification in the backend
Hi,

This topic was discussed several times, with the most recent discussions found at [1] and [2]. Based on those discussions, my understanding is that the current approach in BASE_BACKUP has too many drawbacks and we should instead do this check in the backend. I've been working on such an approach at VMware, and I'm submitting it here to discuss the approach and rationale, and hopefully have such a feature integrated.

First, this was originally developed as an extension. It means that the check is performed using an SRF. That's maybe not the best approach, as a transaction has to be kept open for the total processing time. It can be mitigated by checking each relation independently, but that's still not ideal. Maybe using some utility commands (as part of VACUUM or a new CHECK command for instance) would be a better approach.

This brings the second consideration: how to report the list of corrupted blocks to end users. As I said, this is for now returned via the SRF, but this is clearly not ideal and should rather be made available more globally. One usage of this information could be block level recovery. I'm Cc-ing Sawada-san, as I know he's working on this and mentioned to me that he had ideas on passing the list of corrupted blocks using the stats collector.

Finally, the read and locking considerations. I tried to cover that extensively in the comments, but here are some details on how I tried to make the check safe while keeping the overhead as low as possible. First thing is that this is only doing buffered reads, without any attempt to discard the OS cache. Therefore, any discrepancy between the OS cache and the disk cannot be detected unless you take other actions, such as sync / drop_caches on GNU/Linux.

An access share lock on the currently checked relation is held, meaning that it can't get deleted/truncated. The total number of blocks for the given fork is retrieved first, so any new block will be ignored. Such new blocks are considered out of scope as they are written after the start of the check.

Each time a buffer is being checked, the target buffer mapping partition lock is acquired in shared mode, to prevent concurrent eviction. If the buffer is found in shared buffers, it's pinned and released immediately, just to get the state. If the buffer is found dirty, no check is performed, as it'll be written to disk by the checkpointer, or during recovery in case of unclean shutdown. Otherwise, an IO lock is held while the buffer is read into a private buffer. The IO lock and buffer mapping lock are then released and the check is performed.

If the buffer is not found in shared buffers, the buffer mapping partition lock is released immediately and the block is read from disk. It's therefore possible to get a false positive here, as the block could be concurrently read, modified and partially written to disk. So, if an error is detected in this case, the check is restarted from scratch, and if the buffer is still not found in shared buffers, the read is done while still holding the buffer mapping partition lock, to make sure that it can't get concurrently loaded and modified. This is an optimistic approach to avoid performance overhead, assuming that there shouldn't be many positives, and that the false positive window is very narrow.

The check consists of simply comparing the stored and computed checksums, with an additional check that the page is really new (using PageIsVerified) if it's found to be PageIsNew().
Since this is done after releasing all locks, we could definitely add more checks without causing much overhead, like pd_lower/pd_upper sanity. I prefer to keep the check simple for now and rather focus on the general approach. Finally, I also reused the vacuum costing GUCs (for simplicity) and approach to add some throttling.

I'm attaching a patch that adds a new pg_check_relation() SQL function to perform a check of one or all relations, and some simple regression tests.

[1] https://www.postgresql.org/message-id/flat/1532606373.3422.5.camel%40credativ.de
[2] https://www.postgresql.org/message-id/flat/20190326170820.6sylklg7eh6uhabd%40alap3.anarazel.de
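For readers who want a more concrete picture of the per-block logic described above, here is a rough, simplified sketch in C. It is not the submitted patch: sketch_block_is_dirty_in_shared_buffers() is a made-up helper standing in for the buffer mapping partition lock / pin / state-check dance, and only the optimistic (lock-less) read pass is shown.

/*
 * Simplified sketch of checking one block, under the assumptions laid out
 * above: an AccessShareLock is already held on "rel".  The helper
 * sketch_block_is_dirty_in_shared_buffers() is hypothetical; it stands for
 * the "acquire the buffer mapping partition lock, pin the buffer to look at
 * its state" part.  For brevity this always reads from disk without the IO
 * lock, i.e. it only shows the optimistic first pass; a failure would have
 * to be re-checked under stronger locking before being reported.
 */
static bool
sketch_check_one_block(Relation rel, ForkNumber forknum, BlockNumber blkno)
{
	PGAlignedBlock buf;
	PageHeader	hdr = (PageHeader) buf.data;

	CHECK_FOR_INTERRUPTS();

	/* A dirty buffer will be written out soon anyway: nothing to check. */
	if (sketch_block_is_dirty_in_shared_buffers(rel, forknum, blkno))
		return true;

	/* Clean or not cached: read the on-disk version of the block. */
	RelationOpenSmgr(rel);
	smgrread(rel->rd_smgr, forknum, blkno, buf.data);

	/* All-zero (never initialized) pages are valid and carry no checksum. */
	if (PageIsNew((Page) buf.data))
		return PageIsVerified((Page) buf.data, blkno);

	/* Compare the computed checksum with the one stored in the header. */
	return pg_checksum_page(buf.data, blkno) == hdr->pd_checksum;
}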
On Fri, Dec 6, 2019 at 9:51 AM Julien Rouhaud <rjuju123@gmail.com> wrote: > This topic was discussed several times, with the most recent > discussions found at [1] and [2]. Based on those discussions, my > understanding is that the current approach in BASE_BACKUP has too many > drawbacks and we should instead do this check in the backend. Good idea. > This brings the second consideration: how to report the list corrupted > blocks to end users. As I said this is for now returned via the SRF, > but this is clearly not ideal and should rather be made available more > globally. Some people might prefer notices, because you can get those while the thing is still running, rather than a result set, which you will only see when the query finishes. Other people might prefer an SRF, because they want to have the data in structured form so that they can postprocess it. Not sure what you mean by "more globally." I guess one idea would be to provide a way to kick this off in the background via a background worker or similar and then have it put the results in a table. But that might fail if there are checksum errors in the catalogs themselves. I don't really know what's best. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > > On Fri, Dec 6, 2019 at 9:51 AM Julien Rouhaud <rjuju123@gmail.com> wrote: > > > This brings the second consideration: how to report the list corrupted > > blocks to end users. As I said this is for now returned via the SRF, > > but this is clearly not ideal and should rather be made available more > > globally. > > Some people might prefer notices, because you can get those while the > thing is still running, rather than a result set, which you will only > see when the query finishes. Other people might prefer an SRF, because > they want to have the data in structured form so that they can > postprocess it. Not sure what you mean by "more globally."

I meant having the results available system-wide, not only to the caller. I think that emitting a log/notice-level message should always be done on top of whatever other communication facility we're using.

> I guess one > idea would be to provide a way to kick this off in the background via > a background worker or similar and then have it put the results in a > table. But that might fail if there are checksum errors in the > catalogs themselves.

Yes, that's a concern. We could maintain a list in (dynamic) shared memory with a simple SQL wrapper to read the data, but that would be lost with a crash/restart. Or use pgstat_report_checksum_failures_in_db(), modifying it to get a relfilenode, blocknum and forknum, and append that to some flat files, hoping that they won't get corrupted either.
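Whatever the transport ends up being (a dynamic shared memory list behind an SQL wrapper, or records spooled by the stats collector), an entry only needs a handful of fields. A hypothetical layout, just to make the discussion concrete; none of this is in the patch:

#include "postgres.h"
#include "common/relpath.h"
#include "storage/block.h"
#include "storage/relfilenode.h"
#include "utils/timestamp.h"

/* Hypothetical entry for a corrupted-block list. */
typedef struct CorruptedBlockEntry
{
	RelFileNode rnode;			/* tablespace / database / relfilenode */
	ForkNumber	forknum;		/* fork the block belongs to */
	BlockNumber blkno;			/* block number within that fork */
	TimestampTz detected_at;	/* when the corruption was detected */
} CorruptedBlockEntry;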
On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote: > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote: >> Some people might prefer notices, because you can get those while the >> thing is still running, rather than a result set, which you will only >> see when the query finishes. Other people might prefer an SRF, because >> they want to have the data in structured form so that they can >> postprocess it. Not sure what you mean by "more globally." > > I meant having the results available system-wide, not only to the > caller. I think that emitting a log/notice level should always be > done on top on whatever other communication facility we're using.

The problem with notices and logs is that they tend to be ignored. Now I don't see any problem either in adding something into the logs which can be found later on for parsing, on top of an SRF returned by the caller which includes all the corruption details, say with pgbadger or your friendly neighborhood grep. I think that any backend function should also make sure to call pgstat_report_checksum_failure() to make the failure visible at database level in the catalogs, so that it is possible to use that as a cheap high-level warning. The details of the failures could always be dug from the logs or the result of the function itself after finding out that something is wrong in pg_stat_database.

>> I guess one >> idea would be to provide a way to kick this off in the background via >> a background worker or similar and then have it put the results in a >> table. But that might fail if there are checksum errors in the >> catalogs themselves. > > Yes that's a concern. We could maintain a list in (dynamic) shared > memory with a simple SQL wrapper to read the data, but that would be > lost with a crash/restart. Or use > pgstat_report_checksum_failures_in_db(), modifying it to get an > relfilenode, bocknum and forknum and append that to some flat files, > hoping that it won't get corrupted either.

If a lot of blocks are corrupted, that could bloat things. Hence some retention policies would be necessary, and that's tricky to define and configure properly. I'd tend to be in the school of just logging the information and being done with it, because that's simple and because you won't need to worry about any more configuration. Doing the work in the background is still separate from a SQL-callable function though, no? In that case you need a connection to a database to allow the checksum verification to happen on a relfilenode based on the relation to check, also because you want the thing to be safe concurrently (a background job here would be a combo with a bgworker triggering dynamic children working on one database, not necessarily something that needs to be in core). -- Michael
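To make that two-level reporting concrete, the per-block failure path could look roughly like this. pgstat_report_checksum_failure(), ereport() and relpath() are real backend APIs, while the function itself and its arguments are only an illustration (it also assumes the relation's smgr is already opened):

/*
 * Hypothetical reporting path for one corrupted block: bump the per-database
 * counter shown in pg_stat_database.checksum_failures, and emit a WARNING
 * with the details for the logs; the caller would also return the block via
 * the SRF.
 */
static void
report_corrupted_block(Relation rel, ForkNumber forknum, BlockNumber blkno)
{
	pgstat_report_checksum_failure();

	ereport(WARNING,
			(errcode(ERRCODE_DATA_CORRUPTED),
			 errmsg("invalid page in block %u of relation %s",
					blkno,
					relpath(rel->rd_smgr->smgr_rnode, forknum))));
}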
On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote: > > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > >> Some people might prefer notices, because you can get those while the > >> thing is still running, rather than a result set, which you will only > >> see when the query finishes. Other people might prefer an SRF, because > >> they want to have the data in structured form so that they can > >> postprocess it. Not sure what you mean by "more globally." > > > > I meant having the results available system-wide, not only to the > > caller. I think that emitting a log/notice level should always be > > done on top on whatever other communication facility we're using. > > The problem of notice and logs is that they tend to be ignored. Now I > don't see no problems either in adding something into the logs which > can be found later on for parsing on top of a SRF returned by the > caller which includes all the corruption details, say with pgbadger > or your friendly neighborhood grep. I think that any backend function > should also make sure to call pgstat_report_checksum_failure() to > report a report visible at database-level in the catalogs, so as it is > possible to use that as a cheap high-level warning. The details of > the failures could always be dug from the logs or the result of the > function itself after finding out that something is wrong in > pg_stat_database. I agree that adding extra information in the logs and calling pgstat_report_checksum_failure is a must do, and I changed that locally. However, I doubt that the logs is the right place to find the details of corrupted blocks. There's no guarantee that the file will be accessible to the DBA, nor that the content won't get truncated by the time it's needed. I really think that corruption is important enough to justify more specific location. > >> I guess one > >> idea would be to provide a way to kick this off in the background via > >> a background worker or similar and then have it put the results in a > >> table. But that might fail if there are checksum errors in the > >> catalogs themselves. > > > > Yes that's a concern. We could maintain a list in (dynamic) shared > > memory with a simple SQL wrapper to read the data, but that would be > > lost with a crash/restart. Or use > > pgstat_report_checksum_failures_in_db(), modifying it to get an > > relfilenode, bocknum and forknum and append that to some flat files, > > hoping that it won't get corrupted either. > > If a lot of blocks are corrupted, that could bloat things. Hence some > retention policies would be necessary, and that's tricky to define and > configure properly. I'd tend to be in the school of just logging the > information and be done with it, because that's simple and because you > won't need to worry about any more configuration. If the number of corrupted blocks becomes high enough to excessively bloat things, it's likely that the instance is doomed anyway, so I'm not especially concerned about it.
On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > Hi, > > This topic was discussed several times, with the most recent > discussions found at [1] and [2]. Based on those discussions, my > understanding is that the current approach in BASE_BACKUP has too many > drawbacks and we should instead do this check in the backend. I've > been working using such approach at VMware, and I'm submitting it here > to discuss the approach and rationales, and hopefully have such a > feature integrated.

Thank you for working on this!

> > First, this was originally developed as an extension. It means that > the check is performed using an SRF. That's maybe not the best > approach, as a transaction has be kept for the total processing time. > It can be leveraged by checking each relation independently, but > that's still not ideal. Maybe using some utility commands (as part of > VACUUM or a new CHECK command for instance) would be a better > approach. > > This brings the second consideration: how to report the list corrupted > blocks to end users. As I said this is for now returned via the SRF, > but this is clearly not ideal and should rather be made available more > globally. One usage of this information could be block level > recovery. I'm Cc-ing Sawada-san, as I know he's working on this and > mentioned me that he had ideas on passing the list of corrupted blocks > using the stat collector.

Yes, the list of corrupted pages is necessary for single page recovery. Apart from single page recovery, I think it's helpful for DBAs if they can find the corrupted blocks in the server logs and in a system view.

I've also tried to report corrupted pages to the stats collector while researching single page recovery in PostgreSQL, but one problem is that the statistics in the stats collector are cleared during crash recovery. I want the information about block corruption to survive even when the server goes down. And we might want to add checksums to the permanent file holding the information about database corruption. The correctness of this information is important, because we can fix a database by restoring some tables from a logical backup or by reindexing etc., as long as we have non-broken information about the database corruption.

> > Finally, the read and locking considerations. I tried to cover that > extensively in the comments, but here are some details on how I tried > to make the check safe while trying to keep the overhead as low as > possible. First thing is that this is only doing buffered reads, > without any attempt to discard OS cache. Therefore, any discrepancy > between the OS cache and the disk cannot be detected unless you do > other actions, such as sync / drop_caches on GNU/Linux. > > An access share lock on the currently checked relation is held, > meaning that it can't get deleted/truncated. The total number of > blocks for the given fork is retrieved first, so any new block will be > ignored. Such new blocks are considered out of scope as being written > after the start of the check. > > Each time a buffer is being checked, the target buffer mapping > partition lock is acquired in shared mode, to prevent concurrent > eviction. If the buffer is found in shared buffers, it's pinned and > released immediately, just to get the state.

I wonder if there is a possibility that blocks on disk are corrupted even though they are loaded in shared buffers. ISTM the above method cannot detect such corruption.
Reading and checking blocks fast is attractive, but I thought it's also important to check blocks precisely, without overlooking any corruption. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
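One way to make such a permanent corruption-report file self-verifying, as suggested above, is to protect each record with its own CRC, so that a torn or damaged entry can be detected when reading the file back. This is only a sketch of the idea; nothing of it exists in the patch:

#include "postgres.h"
#include "common/relpath.h"
#include "port/pg_crc32c.h"
#include "storage/block.h"
#include "storage/relfilenode.h"

/* Hypothetical on-disk record for a corruption report that survives restarts. */
typedef struct CorruptionRecord
{
	RelFileNode rnode;
	ForkNumber	forknum;
	BlockNumber blkno;
	pg_crc32c	crc;			/* protects all the fields above */
} CorruptionRecord;

/* Compute the record's CRC over everything except the crc field itself. */
static void
corruption_record_set_crc(CorruptionRecord *rec)
{
	INIT_CRC32C(rec->crc);
	COMP_CRC32C(rec->crc, rec, offsetof(CorruptionRecord, crc));
	FIN_CRC32C(rec->crc);
}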
On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > This brings the second consideration: how to report the list corrupted > > blocks to end users. As I said this is for now returned via the SRF, > > but this is clearly not ideal and should rather be made available more > > globally. One usage of this information could be block level > > recovery. I'm Cc-ing Sawada-san, as I know he's working on this and > > mentioned me that he had ideas on passing the list of corrupted blocks > > using the stat collector. > > Yes it's necessary the list of corrupted pages for single page > recovery. Apart from single page recovery I think it's helpful for DBA > if they can find the corrupted blocks in the server logs and on a > system view. > > I've also tried to report corrupted pages to the stats collector > during I researching single page recovery in PostgreSQL but one > problem is that the statistics in the stats collector is cleared when > crash recovery. I want the information of block corruption to survive > even when the server down.

Yes, having the list of corrupted blocks survive a crash-and-restart cycle, and also be available after a clean shutdown, is definitely important.

> And we might want to add checksums to the > permanent file having information of database corruption. The > correctness of these information would be important because we can fix > a database by restoring some tables from a logical backup or by doing > reindex etc as long as we have a non-broken information of database > corruption.

Agreed.

> > Finally, the read and locking considerations. I tried to cover that > > extensively in the comments, but here are some details on how I tried > > to make the check safe while trying to keep the overhead as low as > > possible. First thing is that this is only doing buffered reads, > > without any attempt to discard OS cache. Therefore, any discrepancy > > between the OS cache and the disk cannot be detected unless you do > > other actions, such as sync / drop_caches on GNU/Linux. > > > > An access share lock on the currently checked relation is held, > > meaning that it can't get deleted/truncated. The total number of > > blocks for the given fork is retrieved first, so any new block will be > > ignored. Such new blocks are considered out of scope as being written > > after the start of the check. > > > > Each time a buffer is being checked, the target buffer mapping > > partition lock is acquired in shared mode, to prevent concurrent > > eviction. If the buffer is found in shared buffers, it's pinned and > > released immediately, just to get the state. > > I wonder if there is possibility that blocks on disk can be corrupted > even if these are loaded to the shared buffer. ISTM the above method > cannot detect such corruption. Reading and checking blocks fast is > attractive but I thought it's also important to check blocks precisely > without overlooking.

It can definitely happen, and it's the usual doomsday scenario: the database has been working fine for months, then postgres is restarted, say for a minor version upgrade, and then boom: the most popular blocks, which are constantly used read-only, were corrupted on disk but never evicted from shared buffers, and you have a major outage. I have unfortunately witnessed that too many times. This is especially bad as in this kind of scenario you typically discover the corruption once every backup contains only the corrupted blocks.
Note that in the approach I'm suggesting, I do verify blocks that are loaded in shared buffers; I only ignore the dirty blocks, as they'll be written by the checkpointer, or by the recovery process in case of unclean shutdown. A buffer pin isn't necessary to avoid torn page reads: an IO lock also guarantees that, and causes less overhead. The included TAP test should also detect the corruption of a present-in-shared-buffers-non-dirty block. It could however be improved, e.g. by calling pg_prewarm to make sure that the block is indeed in shared_buffers, and by doing the same test after a clean restart to make sure that it hits the not-in-shared-buffers case.
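For the record, the "skip dirty blocks" decision only needs a quick look at the buffer header once the buffer has been located. Here is a rough sketch against the PostgreSQL 13-era buffer manager internals (the surrounding buffer lookup, pinning and the IO-lock read are omitted, and this is not the patch's actual code):

#include "storage/buf_internals.h"

/*
 * Given the descriptor of a buffer found in shared buffers, decide whether
 * its on-disk version is worth verifying.  Dirty buffers are skipped: their
 * on-disk image will be overwritten at the next checkpoint (or during
 * recovery after a crash), so a mismatch there proves nothing.
 */
static bool
sketch_should_check_buffer(BufferDesc *bufHdr)
{
	uint32		buf_state;
	bool		check_it;

	buf_state = LockBufHdr(bufHdr);
	check_it = (buf_state & BM_VALID) != 0 && (buf_state & BM_DIRTY) == 0;
	UnlockBufHdr(bufHdr, buf_state);

	return check_it;
}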
On Tue, 24 Dec 2019 at 16:09, Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Tue, Dec 24, 2019 at 4:23 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote: > > > > On Fri, Dec 6, 2019 at 11:51 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > > > This brings the second consideration: how to report the list corrupted > > > blocks to end users. As I said this is for now returned via the SRF, > > > but this is clearly not ideal and should rather be made available more > > > globally. One usage of this information could be block level > > > recovery. I'm Cc-ing Sawada-san, as I know he's working on this and > > > mentioned me that he had ideas on passing the list of corrupted blocks > > > using the stat collector. > > > > Yes it's necessary the list of corrupted pages for single page > > recovery. Apart from single page recovery I think it's helpful for DBA > > if they can find the corrupted blocks in the server logs and on a > > system view. > > > > I've also tried to report corrupted pages to the stats collector > > during I researching single page recovery in PostgreSQL but one > > problem is that the statistics in the stats collector is cleared when > > crash recovery. I want the information of block corruption to survive > > even when the server down. > > Yes, having the list of corrupted blocks surviving a crash-and-restart > cycle, and also available after a clean shutdown is definitely > important. > > > And we might want to add checksums to the > > permanent file having information of database corruption. The > > correctness of these information would be important because we can fix > > a database by restoring some tables from a logical backup or by doing > > reindex etc as long as we have a non-broken information of database > > corruption. > > Agreed > > > > Finally, the read and locking considerations. I tried to cover that > > > extensively in the comments, but here are some details on how I tried > > > to make the check safe while trying to keep the overhead as low as > > > possible. First thing is that this is only doing buffered reads, > > > without any attempt to discard OS cache. Therefore, any discrepancy > > > between the OS cache and the disk cannot be detected unless you do > > > other actions, such as sync / drop_caches on GNU/Linux. > > > > > > An access share lock on the currently checked relation is held, > > > meaning that it can't get deleted/truncated. The total number of > > > blocks for the given fork is retrieved first, so any new block will be > > > ignored. Such new blocks are considered out of scope as being written > > > after the start of the check. > > > > > > Each time a buffer is being checked, the target buffer mapping > > > partition lock is acquired in shared mode, to prevent concurrent > > > eviction. If the buffer is found in shared buffers, it's pinned and > > > released immediately, just to get the state. > > > > I wonder if there is possibility that blocks on disk can be corrupted > > even if these are loaded to the shared buffer. ISTM the above method > > cannot detect such corruption. Reading and checking blocks fast is > > attractive but I thought it's also important to check blocks precisely > > without overlooking. 
> > It can definitely happen, and it's the usual doomsday scenario: > database is working fine for months, then postgres is restarted say > for a minor version upgrade and then boom the most populars blocks > that are constantly used in read only were corrupted on disk but never > evicted from shared buffers, and you have a major outage. I have > witnessed that unfortunately too many times. This is especially bad > as in this kind of scenario, you typically discover the corruption > once all backup only contains the corrupted blocks. > > Note that in the approach I'm suggesting, I do verify blocks that are > loaded in shared buffers, I only ignore the dirty blocks, as they'll > be written by the checkpointer or recovery process in case of unclean > shutdown. A bufferpin isn't necessary to avoid torn page read, an IO > lock also guarantees that and causes less overhead. The included TAP > test should also detect the corruption of a > present-in-shared-buffers-non-dirty block. It could however be > improved eg. by calling pg_prewarm to make sure that it's indeed in > shared_buffers, and also do the same test after a clean restart to > make sure that it's hitting the not-in-shared-buffers case. It reads blocks from disk even if they are loaded in shared buffer. Now I understand. Thanks! Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Dec 10, 2019 at 11:12:34AM +0100, Julien Rouhaud wrote: > On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote: > > > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > > >> Some people might prefer notices, because you can get those while the > > >> thing is still running, rather than a result set, which you will only > > >> see when the query finishes. Other people might prefer an SRF, because > > >> they want to have the data in structured form so that they can > > >> postprocess it. Not sure what you mean by "more globally." > > > > > > I meant having the results available system-wide, not only to the > > > caller. I think that emitting a log/notice level should always be > > > done on top on whatever other communication facility we're using. > > > > The problem of notice and logs is that they tend to be ignored. Now I > > don't see no problems either in adding something into the logs which > > can be found later on for parsing on top of a SRF returned by the > > caller which includes all the corruption details, say with pgbadger > > or your friendly neighborhood grep. I think that any backend function > > should also make sure to call pgstat_report_checksum_failure() to > > report a report visible at database-level in the catalogs, so as it is > > possible to use that as a cheap high-level warning. The details of > > the failures could always be dug from the logs or the result of the > > function itself after finding out that something is wrong in > > pg_stat_database. > > I agree that adding extra information in the logs and calling > pgstat_report_checksum_failure is a must do, and I changed that > locally. However, I doubt that the logs is the right place to find > the details of corrupted blocks. There's no guarantee that the file > will be accessible to the DBA, nor that the content won't get > truncated by the time it's needed. I really think that corruption is > important enough to justify more specific location. The cfbot reported a build failure, so here's a rebased v2 which also contains the pg_stat_report_failure() call and extra log info.
On Wed, Mar 11, 2020 at 08:18:23AM +0100, Julien Rouhaud wrote: > The cfbot reported a build failure, so here's a rebased v2 which also contains > the pg_stat_report_failure() call and extra log info. + * - if a block is not found in shared_buffers, the LWLock is relased and the + * block is read from disk without taking any lock. If an error is detected, + * the read block will be discarded and retrieved again while holding the + * LWLock. This is because an error due to concurrent write is possible but + * very unlikely, so it's better to have an optimistic approach to limit + * locking overhead This can be risky with false positives, no? With a large amount of shared buffer eviction you actually increase the risk of torn page reads. Instead of a logic relying on partition mapping locks, which could be unwise on performance grounds, did you consider different approaches? For example a kind of pre-emptive lock on the page in storage to prevent any shared buffer operation to happen while the block is read from storage, that would act like a barrier. + * Vacuum's GUCs are used to avoid consuming too much resources while running + * this tool. Shouldn't this involve separate GUCs instead of the VACUUM ones? I guess that this leads to the fact that this function may be better as a contrib module, with the addition of some better-suited APIs in core (see paragraph above). +Run + make check +or + make installcheck Why is installcheck mentioned here? I don't think that it is appropriate to place the SQL-callable part in the existing checksum.c. I would suggest instead a new file, say checksumfuncs.c in src/backend/utils/adt/, holding any SQL functions for checksums. -SUBDIRS = perl regress isolation modules authentication recovery subscription +SUBDIRS = perl regress isolation modules authentication check_relation \ + recovery subscription It seems to me that this test would be a good fit for src/test/modules/test_misc/. +static void +check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore, + ForkNumber forknum) Per the argument of bloat, I think that I would remove check_all_relation() as this function could take a very long time to run, and just make the SQL function strict. + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to + * disk either before the end of the next checkpoint or during recovery in + * case of unsafe shutdown Wouldn't it actually be a good thing to check that the page on storage is fine in this case? This depends on the system settings and the checkpoint frequency, but checkpoint_timeout can be extended up to 1 day. And plenty of things could happen to the storage in one day, including a base backup that includes a corrupted page on storage, that this function would not be able to detect. + * - if a block is otherwise found in shared_buffers, an IO lock is taken on + * the block and the block is then read from storage, ignoring the block in + * shared_buffers Yeah, I think that you are right here to check the page on storage anyway. + * we detect if a block is in shared_buffers or not. See get_buffer() + * comments for more details about the locking strategy. get_buffer() does not exist in your patch, check_get_buffer() does. + * - if a block is not found in shared_buffers, the LWLock is relased and the [...] + * To avoid torn page and possible false postives when reading data, and Typos. 
+ if (!DataChecksumsEnabled()) + elog(ERROR, "Data checksums are not enabled"); Note that elog() is for the class of errors which are never expected, and here a caller of pg_check_relation() with checksums disabled can trigger that. So you need to call ereport() with ERRCODE_FEATURE_NOT_SUPPORTED. + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to + * disk either before the end of the next checkpoint or during recovery in + * case of unsafe shutdown Not sure that the indentation is going to react well on that part of the patch, perhaps it would be better to add some "/*-------" at the beginning and end of the comment block to tell pgindent to ignore this part? Based on the feedback gathered on this thread, I guess that you should have a SRF returning the list of broken blocks, as well as NOTICE messages. Another thing to consider is the addition of a range argument to only check a certain portion of the blocks, say one segment file at a time, etc. Fine by me to not include in the first flavor of the patch. The patch needs documentation. -- Michael
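The user-facing error Michael suggests could simply be spelled as below; the wording of the message is only an example:

	if (!DataChecksumsEnabled())
		ereport(ERROR,
				(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
				 errmsg("data checksums are not enabled in this cluster")));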
On Wed, 11 Mar 2020 at 16:18, Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Tue, Dec 10, 2019 at 11:12:34AM +0100, Julien Rouhaud wrote: > > On Tue, Dec 10, 2019 at 3:26 AM Michael Paquier <michael@paquier.xyz> wrote: > > > > > > On Mon, Dec 09, 2019 at 07:02:43PM +0100, Julien Rouhaud wrote: > > > > On Mon, Dec 9, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote: > > > >> Some people might prefer notices, because you can get those while the > > > >> thing is still running, rather than a result set, which you will only > > > >> see when the query finishes. Other people might prefer an SRF, because > > > >> they want to have the data in structured form so that they can > > > >> postprocess it. Not sure what you mean by "more globally." > > > > > > > > I meant having the results available system-wide, not only to the > > > > caller. I think that emitting a log/notice level should always be > > > > done on top on whatever other communication facility we're using. > > > > > > The problem of notice and logs is that they tend to be ignored. Now I > > > don't see no problems either in adding something into the logs which > > > can be found later on for parsing on top of a SRF returned by the > > > caller which includes all the corruption details, say with pgbadger > > > or your friendly neighborhood grep. I think that any backend function > > > should also make sure to call pgstat_report_checksum_failure() to > > > report a report visible at database-level in the catalogs, so as it is > > > possible to use that as a cheap high-level warning. The details of > > > the failures could always be dug from the logs or the result of the > > > function itself after finding out that something is wrong in > > > pg_stat_database. > > > > I agree that adding extra information in the logs and calling > > pgstat_report_checksum_failure is a must do, and I changed that > > locally. However, I doubt that the logs is the right place to find > > the details of corrupted blocks. There's no guarantee that the file > > will be accessible to the DBA, nor that the content won't get > > truncated by the time it's needed. I really think that corruption is > > important enough to justify more specific location. > > > The cfbot reported a build failure, so here's a rebased v2 which also contains > the pg_stat_report_failure() call and extra log info. In addition to comments from Michael-san, here are my comments: 1. + if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES)) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("only superuser or a member of the pg_read_server_files role may use this function"))); + + if (!DataChecksumsEnabled()) + elog(ERROR, "Data checksums are not enabled"); I think it's better to reverse the order of the above checks. 2. +#define CRF_COLS 5 /* Number of output arguments in the SRF */ Should it be SRF_COLS? 3. 
+static void
+check_delay_point(void)
+{
+    /* Always check for interrupts */
+    CHECK_FOR_INTERRUPTS();
+
+    /* Nap if appropriate */
+    if (!InterruptPending && VacuumCostBalance >= VacuumCostLimit)
+    {
+        int     msec;
+
+        msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit;
+        if (msec > VacuumCostDelay * 4)
+            msec = VacuumCostDelay * 4;
+
+        pg_usleep(msec * 1000L);
+
+        VacuumCostBalance = 0;
+
+        /* Might have gotten an interrupt while sleeping */
+        CHECK_FOR_INTERRUPTS();
+    }
+}

Even if we use vacuum delay for this function, I think we need to set VacuumDelayActive and return if it's false, or it's better to just return if VacuumCostDelay == 0.

4.
+static void
+check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore,
+                    ForkNumber forknum)

I also agree with Michael-san to remove this function. Instead we can check all relations by:

select pg_check_relation(oid) from pg_class;

6.
Other typos
s/dirted/dirtied/
s/explictly/explicitly/

Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
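A possible shape for that early exit (the existing backend global is VacuumCostActive, declared in miscadmin.h alongside the other vacuum cost variables; this is only a sketch of the suggestion, not the updated patch):

static void
check_delay_point(void)
{
	/* Always check for interrupts */
	CHECK_FOR_INTERRUPTS();

	/* Cost-based throttling is disabled: nothing more to do. */
	if (!VacuumCostActive || VacuumCostDelay <= 0)
		return;

	/* ... nap logic as in the function quoted above ... */
}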
Thanks for the review Michael! On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: > On Wed, Mar 11, 2020 at 08:18:23AM +0100, Julien Rouhaud wrote: > > The cfbot reported a build failure, so here's a rebased v2 which also contains > > the pg_stat_report_failure() call and extra log info. > > + * - if a block is not found in shared_buffers, the LWLock is relased and the > + * block is read from disk without taking any lock. If an error is detected, > + * the read block will be discarded and retrieved again while holding the > + * LWLock. This is because an error due to concurrent write is possible but > + * very unlikely, so it's better to have an optimistic approach to limit > + * locking overhead > This can be risky with false positives, no? Do you mean high probability of false positive in the 1st iteration, so running frequently the recheck that can't have false positive, not that the 2nd check can lead to false positive? > With a large amount of > shared buffer eviction you actually increase the risk of torn page > reads. Instead of a logic relying on partition mapping locks, which > could be unwise on performance grounds, did you consider different > approaches? For example a kind of pre-emptive lock on the page in > storage to prevent any shared buffer operation to happen while the > block is read from storage, that would act like a barrier. Even with a workload having a large shared_buffers eviction pattern, I don't think that there's a high probability of hitting a torn page. Unless I'm mistaken it can only happen if all those steps happen concurrently to doing the block read just after releasing the LWLock: - postgres read the same block in shared_buffers (including all the locking) - dirties it - writes part of the page It's certainly possible, but it seems so unlikely that the optimistic lock-less approach seems like a very good tradeoff. > > + * Vacuum's GUCs are used to avoid consuming too much resources while running > + * this tool. > Shouldn't this involve separate GUCs instead of the VACUUM ones? We could but the access pattern looked so similar that it looked like a good idea to avoid adding 2 new GUC for that to keep configuration simple. Unless there are objections I'll add them in the next version. > I guess that this leads to the fact that this function may be better as > a contrib module, with the addition of some better-suited APIs in core > (see paragraph above). Below? > > +Run > + make check > +or > + make installcheck > Why is installcheck mentioned here? Oups, copy/pasto error from the original contrib module this stuff was initially implemented as, will fix. > > I don't think that it is appropriate to place the SQL-callable part in > the existing checksum.c. I would suggest instead a new file, say > checksumfuncs.c in src/backend/utils/adt/, holding any SQL functions > for checksums. Agreed. > > -SUBDIRS = perl regress isolation modules authentication recovery > subscription > +SUBDIRS = perl regress isolation modules authentication check_relation \ > + recovery subscription > It seems to me that this test would be a good fit for > src/test/modules/test_misc/. WFM. > > +static void > +check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore, > + ForkNumber forknum) > Per the argument of bloat, I think that I would remove > check_all_relation() as this function could take a very long time to > run, and just make the SQL function strict. No objection. 
> > + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to > + * disk either before the end of the next checkpoint or during recovery in > + * case of unsafe shutdown > Wouldn't it actually be a good thing to check that the page on storage > is fine in this case? This depends on the system settings and the > checkpoint frequency, but checkpoint_timeout can be extended up to 1 > day. And plenty of things could happen to the storage in one day, > including a base backup that includes a corrupted page on storage, > that this function would not be able to detect. How could that lead to data corruption? If postgres crashes before the checkpoint completion, the block will be overwritten during recovery, and if a base backup is taken the block will also be overwritten while replaying all the required WALs. Detecting a corrupted blocks in those cases would have the merit of possibly warning about possibly broken hardware sooner, but it would also make the check more expensive as the odds to prevent postgres from evicting a dirty block is way higher. Maybe an additional GUC for that? For the record when I first tested that feature I did try to check dirty blocks, and it seemed that dirty blocks of shared relation were sometimes wrongly reported as corrupted. I didn't try to investigate more though. > + * we detect if a block is in shared_buffers or not. See get_buffer() > + * comments for more details about the locking strategy. > get_buffer() does not exist in your patch, check_get_buffer() does. Oops, will fix. > > + * - if a block is not found in shared_buffers, the LWLock is relased and the > [...] > + * To avoid torn page and possible false postives when reading data, and > Typos. > > + if (!DataChecksumsEnabled()) > + elog(ERROR, "Data checksums are not enabled"); > Note that elog() is for the class of errors which are never expected, > and here a caller of pg_check_relation() with checksums disabled can > trigger that. So you need to call ereport() with > ERRCODE_FEATURE_NOT_SUPPORTED. Indeed, will fix. > > + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to > + * disk either before the end of the next checkpoint or during recovery in > + * case of unsafe shutdown > Not sure that the indentation is going to react well on that part of > the patch, perhaps it would be better to add some "/*-------" at the > beginning and end of the comment block to tell pgindent to ignore this > part? Ok. Although I think only the beginning comment is needed? > > Based on the feedback gathered on this thread, I guess that you should > have a SRF returning the list of broken blocks, as well as NOTICE > messages. The current patch has an SRF and a WARNING message, do you want an additional NOTICE message or downgrade the existing one? > Another thing to consider is the addition of a range > argument to only check a certain portion of the blocks, say one > segment file at a time, etc. Fine by me to not include in the first > flavor of the patch. Ok! > The patch needs documentation. I'll try to add some.
On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote: > > In addition to comments from Michael-san, here are my comments: > > 1. > + if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES)) > + ereport(ERROR, > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > + errmsg("only superuser or a member of the > pg_read_server_files role may use this function"))); Good point! I'll fix it. > + > + if (!DataChecksumsEnabled()) > + elog(ERROR, "Data checksums are not enabled"); > > I think it's better to reverse the order of the above checks. Indeed. > > 2. > +#define CRF_COLS 5 /* Number of output arguments in the SRF */ > > Should it be SRF_COLS? Oops, will fix. > > 3. > +static void > +check_delay_point(void) > +{ > + /* Always check for interrupts */ > + CHECK_FOR_INTERRUPTS(); > + > + /* Nap if appropriate */ > + if (!InterruptPending && VacuumCostBalance >= VacuumCostLimit) > + { > + int msec; > + > + msec = VacuumCostDelay * VacuumCostBalance / VacuumCostLimit; > + if (msec > VacuumCostDelay * 4) > + msec = VacuumCostDelay * 4; > + > + pg_usleep(msec * 1000L); > + > + VacuumCostBalance = 0; > + > + /* Might have gotten an interrupt while sleeping */ > + CHECK_FOR_INTERRUPTS(); > + } > +} > > Even if we use vacuum delay for this function, I think we need to set > VacuumDelayActive and return if it's false, or it's better to just > return if VacuumCostDelay == 0. Good point, I'll fix that. > > 4. > +static void > +check_all_relations(TupleDesc tupdesc, Tuplestorestate *tupstore, > + ForkNumber forknum) > > I also agree with Michael-san to remove this function. Instead we can > check all relations by: > > select pg_check_relation(oid) from pg_class; Sure, but ideally we should do that in a client program (eg. pg_checksums) that wouldn't maintain a transaction active for the whole execution. > 6. > Other typos > > s/dirted/dirtied/ > s/explictly/explicitly/ Will fix, thanks!
On Mon, Mar 16, 2020 at 09:42:39AM +0100, Julien Rouhaud wrote: > On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote: > > > > In addition to comments from Michael-san, here are my comments:

Thanks both for the reviews. I'm attaching a v3 with all comments addressed, except:

> It seems to me that this test would be a good fit for > src/test/modules/test_misc/.

AFAICT this is explicitly documented as tests for various extensions, and for now it's a core function, so I didn't move it.

> +Run > + make check > +or > + make installcheck > Why is installcheck mentioned here?

This is actually already mentioned in multiple other test READMEs.
On Mon, Mar 16, 2020 at 02:15:27PM +0100, Julien Rouhaud wrote: > On Mon, Mar 16, 2020 at 09:42:39AM +0100, Julien Rouhaud wrote: > > On Mon, Mar 16, 2020 at 01:53:35PM +0900, Masahiko Sawada wrote: > > > > > > In addition to comments from Michael-san, here are my comments: > > Thanks both for the reviews. I'm attaching a v3 with all comments addressed, > except: > > > It seems to me that this test would be a good fit for > > src/test/modules/test_misc/. > > > AFAICT this is explicitly documented as tests for various extensions, and for > now it's a core function, so I didn't move it. > > > > +Run > > + make check > > +or > > + make installcheck > > Why is installcheck mentioned here? > > > This is actually already used in multiple other test readme. Sorry I forgot to update the regression tests. v4 attached.
On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote: > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: >> With a large amount of >> shared buffer eviction you actually increase the risk of torn page >> reads. Instead of a logic relying on partition mapping locks, which >> could be unwise on performance grounds, did you consider different >> approaches? For example a kind of pre-emptive lock on the page in >> storage to prevent any shared buffer operation to happen while the >> block is read from storage, that would act like a barrier. > > Even with a workload having a large shared_buffers eviction pattern, I don't > think that there's a high probability of hitting a torn page. Unless I'm > mistaken it can only happen if all those steps happen concurrently to doing the > block read just after releasing the LWLock: > > - postgres read the same block in shared_buffers (including all the locking) > - dirties it > - writes part of the page > > It's certainly possible, but it seems so unlikely that the optimistic lock-less > approach seems like a very good tradeoff. Having false reports in this area could be very confusing for the user. That's for example possible now with checksum verification and base backups. >> I guess that this leads to the fact that this function may be better as >> a contrib module, with the addition of some better-suited APIs in core >> (see paragraph above). > > Below? Above. This thought more precisely: >> For example a kind of pre-emptive lock on the page in >> storage to prevent any shared buffer operation to happen while the >> block is read from storage, that would act like a barrier. > For the record when I first tested that feature I did try to check dirty > blocks, and it seemed that dirty blocks of shared relation were sometimes > wrongly reported as corrupted. I didn't try to investigate more though. Hmm. It would be good to look at that, correct verification of shared relations matter. >> + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to >> + * disk either before the end of the next checkpoint or during recovery in >> + * case of unsafe shutdown >> Not sure that the indentation is going to react well on that part of >> the patch, perhaps it would be better to add some "/*-------" at the >> beginning and end of the comment block to tell pgindent to ignore this >> part? > > Ok. Although I think only the beginning comment is needed? From src/tools/pgindent/README: "pgindent will reflow any comment block that's not at the left margin. If this messes up manual formatting that ought to be preserved, protect the comment block with some dashes:" /*---------- * Text here will not be touched by pgindent. *---------- */ >> Based on the feedback gathered on this thread, I guess that you should >> have a SRF returning the list of broken blocks, as well as NOTICE >> messages. > > The current patch has an SRF and a WARNING message, do you want an additional > NOTICE message or downgrade the existing one? Right, not sure which one is better, for zero_damaged_pages a WARNING is used. -- Michael
On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote: > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote: > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: > >> With a large amount of > >> shared buffer eviction you actually increase the risk of torn page > >> reads. Instead of a logic relying on partition mapping locks, which > >> could be unwise on performance grounds, did you consider different > >> approaches? For example a kind of pre-emptive lock on the page in > >> storage to prevent any shared buffer operation to happen while the > >> block is read from storage, that would act like a barrier. > > > > Even with a workload having a large shared_buffers eviction pattern, I don't > > think that there's a high probability of hitting a torn page. Unless I'm > > mistaken it can only happen if all those steps happen concurrently to doing the > > block read just after releasing the LWLock: > > > > - postgres read the same block in shared_buffers (including all the locking) > > - dirties it > > - writes part of the page > > > > It's certainly possible, but it seems so unlikely that the optimistic lock-less > > approach seems like a very good tradeoff. > > Having false reports in this area could be very confusing for the > user. That's for example possible now with checksum verification and > base backups. I agree, however this shouldn't be the case here, as the block will be rechecked while holding proper lock the 2nd time in case of possible false positive before being reported as corrupted. So the only downside is to check twice a corrupted block that's not found in shared buffers (or concurrently loaded/modified/half flushed). As the number of corrupted or concurrently loaded/modified/half flushed blocks should usually be close to zero, it seems worthwhile to have a lockless check first for performance reason. > > For the record when I first tested that feature I did try to check dirty > > blocks, and it seemed that dirty blocks of shared relation were sometimes > > wrongly reported as corrupted. I didn't try to investigate more though. > > Hmm. It would be good to look at that, correct verification of shared > relations matter. I'll try to investigate, but non-dirty shared relation blocks can be checked and work as intended. > > >> + * - if a block is dirty in shared_buffers, it's ignored as it'll be flushed to > >> + * disk either before the end of the next checkpoint or during recovery in > >> + * case of unsafe shutdown > >> Not sure that the indentation is going to react well on that part of > >> the patch, perhaps it would be better to add some "/*-------" at the > >> beginning and end of the comment block to tell pgindent to ignore this > >> part? > > > > Ok. Although I think only the beginning comment is needed? > > From src/tools/pgindent/README: > "pgindent will reflow any comment block that's not at the left margin. > If this messes up manual formatting that ought to be preserved, > protect the comment block with some dashes:" > > /*---------- > * Text here will not be touched by pgindent. > *---------- > */ For instance the block comment in gen_partprune_steps_internal() disagrees. Anyway I added both as all the nearby codes does that for overall function comments.
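To spell out the optimistic-then-locked strategy being defended here, the calling side could look roughly like this. check_get_buffer() uses the calling convention quoted elsewhere in this thread; check_buffer() and report_corrupted_block() are stand-in names for the patch's checksum comparison and reporting, so the details are assumptions:

/* Sketch of checking one block with a single locked retry on failure. */
static void
sketch_check_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
				   char *buffer)
{
	bool		force_lock = false;

	for (;;)
	{
		bool		checkit;
		bool		found_in_sb;
		uint16		chk_expected;
		uint16		chk_found;

		check_get_buffer(relation, forknum, blkno, buffer, force_lock,
						 &checkit, &found_in_sb);

		/* Dirty in shared buffers: it will be rewritten anyway, skip it. */
		if (!checkit)
			return;

		if (check_buffer(buffer, blkno, &chk_expected, &chk_found))
			return;				/* checksum matches */

		/*
		 * The block was read from disk without any lock, so it may have been
		 * loaded, modified and partially flushed concurrently.  Retry once,
		 * forcing the read to be done under the buffer mapping partition
		 * lock, before reporting a corruption.
		 */
		if (!found_in_sb && !force_lock)
		{
			force_lock = true;
			continue;
		}

		report_corrupted_block(relation, forknum, blkno,
							   chk_expected, chk_found);
		return;
	}
}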
On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote: > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote: > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: > > >> Based on the feedback gathered on this thread, I guess that you should > > >> have a SRF returning the list of broken blocks, as well as NOTICE > > >> messages. > > > > > > The current patch has an SRF and a WARNING message, do you want an additional > > > NOTICE message or downgrade the existing one? > > > > Right, not sure which one is better, for zero_damaged_pages a WARNING > > is used.

Sorry, I forgot to answer that. IMHO a WARNING is better here, as we're talking about data corruption. Also, a WARNING will be reported to both the client and the server logs, which sounds like a good thing.
On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote: > On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote: > > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote: > > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: > > >> With a large amount of > > >> shared buffer eviction you actually increase the risk of torn page > > >> reads. Instead of a logic relying on partition mapping locks, which > > >> could be unwise on performance grounds, did you consider different > > >> approaches? For example a kind of pre-emptive lock on the page in > > >> storage to prevent any shared buffer operation to happen while the > > >> block is read from storage, that would act like a barrier. > > > > > > Even with a workload having a large shared_buffers eviction pattern, I don't > > > think that there's a high probability of hitting a torn page. Unless I'm > > > mistaken it can only happen if all those steps happen concurrently to doing the > > > block read just after releasing the LWLock: > > > > > > - postgres read the same block in shared_buffers (including all the locking) > > > - dirties it > > > - writes part of the page > > > > > > It's certainly possible, but it seems so unlikely that the optimistic lock-less > > > approach seems like a very good tradeoff. > > > > Having false reports in this area could be very confusing for the > > user. That's for example possible now with checksum verification and > > base backups. > > > I agree, however this shouldn't be the case here, as the block will be > rechecked while holding proper lock the 2nd time in case of possible false > positive before being reported as corrupted. So the only downside is to check > twice a corrupted block that's not found in shared buffers (or concurrently > loaded/modified/half flushed). As the number of corrupted or concurrently > loaded/modified/half flushed blocks should usually be close to zero, it seems > worthwhile to have a lockless check first for performance reason. I just noticed some dumb mistakes while adding the new GUCs. v5 attached to fix that, no other changes.
On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Wed, Mar 18, 2020 at 07:06:19AM +0100, Julien Rouhaud wrote: > > On Wed, Mar 18, 2020 at 01:20:47PM +0900, Michael Paquier wrote: > > > On Mon, Mar 16, 2020 at 09:21:22AM +0100, Julien Rouhaud wrote: > > > > On Mon, Mar 16, 2020 at 12:29:28PM +0900, Michael Paquier wrote: > > > >> With a large amount of > > > >> shared buffer eviction you actually increase the risk of torn page > > > >> reads. Instead of a logic relying on partition mapping locks, which > > > >> could be unwise on performance grounds, did you consider different > > > >> approaches? For example a kind of pre-emptive lock on the page in > > > >> storage to prevent any shared buffer operation to happen while the > > > >> block is read from storage, that would act like a barrier. > > > > > > > > Even with a workload having a large shared_buffers eviction pattern, I don't > > > > think that there's a high probability of hitting a torn page. Unless I'm > > > > mistaken it can only happen if all those steps happen concurrently to doing the > > > > block read just after releasing the LWLock: > > > > > > > > - postgres read the same block in shared_buffers (including all the locking) > > > > - dirties it > > > > - writes part of the page > > > > > > > > It's certainly possible, but it seems so unlikely that the optimistic lock-less > > > > approach seems like a very good tradeoff. > > > > > > Having false reports in this area could be very confusing for the > > > user. That's for example possible now with checksum verification and > > > base backups. > > > > > > I agree, however this shouldn't be the case here, as the block will be > > rechecked while holding proper lock the 2nd time in case of possible false > > positive before being reported as corrupted. So the only downside is to check > > twice a corrupted block that's not found in shared buffers (or concurrently > > loaded/modified/half flushed). As the number of corrupted or concurrently > > loaded/modified/half flushed blocks should usually be close to zero, it seems > > worthwhile to have a lockless check first for performance reason. > > > I just noticed some dumb mistakes while adding the new GUCs. v5 attached to > fix that, no other changes.

Thank you for updating the patch. I have some comments:

1.
+ <entry>
+ <literal><function>pg_check_relation(<parameter>relation</parameter> <type>oid</type>, <parameter>fork</parameter> <type>text</type>)</function></literal>
+ </entry>

Looking at the declaration of pg_check_relation, 'relation' and 'fork' are optional arguments. So I think the above is not correct. But as I commented below, 'relation' should not be optional, so maybe the above line could be:

+ <literal><function>pg_check_relation(<parameter>relation</parameter> <type>oid</type>[, <parameter>fork</parameter> <type>text</type>])</function></literal>

2.
+ <indexterm>
+ <primary>pg_check_relation</primary>
+ </indexterm>
+ <para>
+ <function>pg_check_relation</function> iterates over all the blocks of all
+ or the specified fork of a given relation and verify their checksum. It
+ returns the list of blocks for which the found checksum doesn't match the
+ expected one. You must be a member of the
+ <literal>pg_read_all_stats</literal> role to use this function. It can
+ only be used if data checksums are enabled. See <xref
+ linkend="app-initdb-data-checksums"/> for more information.
+ </para> * I think we need a description about possible values for 'fork' (i.g., 'main', 'vm', 'fsm' and 'init'), and the behavior when 'fork' is omitted. * Do we need to explain about checksum cost-based delay here? 3. +CREATE OR REPLACE FUNCTION pg_check_relation( + IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text, + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, + OUT expected_checksum integer, OUT found_checksum integer) + RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation' + PARALLEL RESTRICTED; Now that pg_check_relation doesn't accept NULL as 'relation', I think we need to make 'relation' a mandatory argument. 4. + /* Check if the relation (still) exists */ + if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) + { + /* + * We use a standard relation_open() to acquire the initial lock. It + * means that this will block until the lock is acquired, or will + * raise an ERROR if lock_timeout has been set. If caller wants to + * check multiple tables while relying on a maximum wait time, it + * should process tables one by one instead of relying on a global + * processing with the main SRF. + */ + relation = relation_open(relid, AccessShareLock); + } IIUC the above was necessary because we used to have check_all_relations() which iterates all relations on the database to do checksum checks. But now that we don't have that function and pg_check_relation processes one relation. Can we just do relation_open() here? 5. I think we need to check if the relation is a temp relation. I'm not sure it's worth to check checksums of temp relations but at least we need not to check other's temp relations. 6. +/* + * Safely read the wanted buffer from disk, dealing with possible concurrency + * issue. Note that if a buffer is found dirty in shared_buffers, no read will + * be performed and the caller will be informed that no check should be done. + * We can safely ignore such buffers as they'll be written before next + * checkpoint's completion.. + * + * The following locks can be used in this function: + * + * - shared LWLock on the target buffer pool partition mapping. + * - IOLock on the buffer + * + * The IOLock is taken when reading the buffer from disk if it exists in + * shared_buffers, to avoid torn pages. + * + * If the buffer isn't in shared_buffers, it'll be read from disk without any + * lock unless caller asked otherwise, setting needlock. In this case, the + * read will be done while the buffer mapping partition LWLock is still being + * held. Reading with this lock is to avoid the unlikely but possible case + * that a buffer wasn't present in shared buffers when we checked but it then + * alloc'ed in shared_buffers, modified and flushed concurrently when we + * later try to read it, leading to false positive due to torn page. Caller + * can read first the buffer without holding the target buffer mapping + * partition LWLock to have an optimistic approach, and reread the buffer + * from disk in case of error. + * + * Caller should hold an AccessShareLock on the Relation + */ I think the above comment also needs some "/*-------" at the beginning and end. 7. +static void +check_get_buffer(Relation relation, ForkNumber forknum, + BlockNumber blkno, char *buffer, bool needlock, bool *checkit, + bool *found_in_sb) +{ Maybe we can make check_get_buffer() return a bool indicating we found a buffer to check, instead of having '*checkit'. 
That way, we can simplify the following code: + check_get_buffer(relation, forknum, blkno, buffer, force_lock, + &checkit, &found_in_sb); + + if (!checkit) + continue; to something like: + if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock, + &found_in_sb)) + continue; 8. + if (PageIsVerified(buffer, blkno)) + { + /* + * If the page is really new, there won't by any checksum to be + * computed or expected. + */ + *chk_expected = *chk_found = NoComputedChecksum; + return true; + } + else + { + /* + * There's a corruption, but since this affect PageIsNew, we + * can't compute a checksum, so set NoComputedChecksum for the + * expected checksum. + */ + *chk_expected = NoComputedChecksum; + *chk_found = hdr->pd_checksum; + } + return false; * I think the 'else' is not necessary here. * Setting *chk_expected and *chk_found seems useless when we return true. The caller doesn't use them. * Should we forcibly overwrite ignore_checksum_failure to off in pg_check_relation()? Otherwise, this logic seems not work fine. * I don't understand why we cannot compute a checksum in case where a page looks like a new page but is actually corrupted. Could you please elaborate on that? 8. + { + {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY, + gettext_noop("Checksum cost for a page found in the buffer cache."), + NULL + }, + &ChecksumCostPageHit, + 1, 0, 10000, + NULL, NULL, NULL + }, * There is no description about the newly added four GUC parameters in the doc. * We need to put new GUC parameters into postgresql.conf.sample as well. * The patch doesn't use checksum_cost_page_hit at all. 9. diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c index eb19644419..37f63e747c 100644 --- a/src/backend/utils/init/globals.c +++ b/src/backend/utils/init/globals.c @@ -134,6 +134,14 @@ int max_worker_processes = 8; int max_parallel_workers = 8; int MaxBackends = 0; +int ChecksumCostPageHit = 1; /* GUC parameters for checksum check */ +int ChecksumCostPageMiss = 10; +int ChecksumCostLimit = 200; +double ChecksumCostDelay = 0; + +int ChecksumCostBalance = 0; /* working state for checksums check */ +bool ChecksumCostActive = false; Can we declare them in checksum.c since these parameters are used only in checksum.c and it does I/O my itself. 10. + /* Report the failure to the stat collector and the logs. */ + pgstat_report_checksum_failure(); + ereport(WARNING, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("invalid page in block %u of relation %s", + blkno, + relpath(relation->rd_smgr->smgr_rnode, forknum)))); I think we could do pgstat_report_checksum_failure() and emit WARNING twice for the same page since PageIsVerified() also does the same. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sat, Mar 28, 2020 at 12:28:27PM +0900, Masahiko Sawada wrote: > On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > v5 attached > > Thank you for updating the patch. I have some comments: Thanks a lot for the review! > 1. > + <entry> > + <literal><function>pg_check_relation(<parameter>relation</parameter> > <type>oid</type>, <parameter>fork</parameter> > <type>text</type>)</function></literal> > + </entry> > > Looking at the declaration of pg_check_relation, 'relation' and 'fork' > are optional arguments. So I think the above is not correct. But as I > commented below, 'relation' should not be optional, so maybe the above > line could be: > > + <literal><function>pg_check_relation(<parameter>relation</parameter> > <type>oid</type>[, <parameter>fork</parameter> > <type>text</type>])</function></literal> Yes I missed that when making relation mandatory. Fixed. > 2. > + <indexterm> > + <primary>pg_check_relation</primary> > + </indexterm> > + <para> > + <function>pg_check_relation</function> iterates over all the blocks of all > + or the specified fork of a given relation and verify their checksum. It > + returns the list of blocks for which the found checksum doesn't match the > + expected one. You must be a member of the > + <literal>pg_read_all_stats</literal> role to use this function. It can > + only be used if data checksums are enabled. See <xref > + linkend="app-initdb-data-checksums"/> for more information. > + </para> > > * I think we need a description about possible values for 'fork' > (i.g., 'main', 'vm', 'fsm' and 'init'), and the behavior when 'fork' > is omitted. Done. > * Do we need to explain about checksum cost-based delay here? It's probably better in config.sgml, nearby vacuum cost-based delay, so done this way with a link to reference that part. > 3. > +CREATE OR REPLACE FUNCTION pg_check_relation( > + IN relation regclass DEFAULT NULL::regclass, IN fork text DEFAULT NULL::text, > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > + OUT expected_checksum integer, OUT found_checksum integer) > + RETURNS SETOF record VOLATILE LANGUAGE internal AS 'pg_check_relation' > + PARALLEL RESTRICTED; > > Now that pg_check_relation doesn't accept NULL as 'relation', I think > we need to make 'relation' a mandatory argument. Correct, fixed. > 4. > + /* Check if the relation (still) exists */ > + if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) > + { > + /* > + * We use a standard relation_open() to acquire the initial lock. It > + * means that this will block until the lock is acquired, or will > + * raise an ERROR if lock_timeout has been set. If caller wants to > + * check multiple tables while relying on a maximum wait time, it > + * should process tables one by one instead of relying on a global > + * processing with the main SRF. > + */ > + relation = relation_open(relid, AccessShareLock); > + } > > IIUC the above was necessary because we used to have > check_all_relations() which iterates all relations on the database to > do checksum checks. But now that we don't have that function and > pg_check_relation processes one relation. Can we just do > relation_open() here? Ah yes I missed that comment. I think only the comment needed to be updated to remove any part related to NULL-relation call. I ended up removign the whole comment since locking and lock_timeout behavior is inherent to relation_open and there's no need to document that any further now that we always only check one relation at a time. > 5. 
> I think we need to check if the relation is a temp relation. I'm not > sure it's worth to check checksums of temp relations but at least we > need not to check other's temp relations. Good point. I think it's still worthwhile to check the backend's temp relation, although typical usage should be a bgworker/cron job doing that check so there shouldn't be any. > 6. > +/* > + * Safely read the wanted buffer from disk, dealing with possible concurrency > + * issue. Note that if a buffer is found dirty in shared_buffers, no read will > + * be performed and the caller will be informed that no check should be done. > + * We can safely ignore such buffers as they'll be written before next > + * checkpoint's completion.. > [...] > + */ > > I think the above comment also needs some "/*-------" at the beginning and end. Fixed. > 7. > +static void > +check_get_buffer(Relation relation, ForkNumber forknum, > + BlockNumber blkno, char *buffer, bool needlock, bool *checkit, > + bool *found_in_sb) > +{ > > Maybe we can make check_get_buffer() return a bool indicating we found > a buffer to check, instead of having '*checkit'. That way, we can > simplify the following code: > > + check_get_buffer(relation, forknum, blkno, buffer, force_lock, > + &checkit, &found_in_sb); > + > + if (!checkit) > + continue; > > to something like: > > + if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock, > + &found_in_sb)) > + continue; Changed. > 8. > + if (PageIsVerified(buffer, blkno)) > + { > + /* > + * If the page is really new, there won't by any checksum to be > + * computed or expected. > + */ > + *chk_expected = *chk_found = NoComputedChecksum; > + return true; > + } > + else > + { > + /* > + * There's a corruption, but since this affect PageIsNew, we > + * can't compute a checksum, so set NoComputedChecksum for the > + * expected checksum. > + */ > + *chk_expected = NoComputedChecksum; > + *chk_found = hdr->pd_checksum; > + } > + return false; > > * I think the 'else' is not necessary here. AFAICT it's, see below. > * Setting *chk_expected and *chk_found seems useless when we return > true. The caller doesn't use them. Indeed, fixed. > * Should we forcibly overwrite ignore_checksum_failure to off in > pg_check_relation()? Otherwise, this logic seems not work fine. > > * I don't understand why we cannot compute a checksum in case where a > page looks like a new page but is actually corrupted. Could you please > elaborate on that? PageIsVerified has a different behavior depending on whether the page looks new or not. If the page looks like new, it only checks that it's indeed a new page, and otherwise try to verify the checksum. Also, pg_check_page() has an assert to make sure that the page isn't (or don't look like) new. So it seems to me that the 'else' is required to properly detect a real or fake PageIsNew, and try to compute checksums only when required. > 8. > + { > + {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY, > + gettext_noop("Checksum cost for a page found in the buffer cache."), > + NULL > + }, > + &ChecksumCostPageHit, > + 1, 0, 10000, > + NULL, NULL, NULL > + }, > > * There is no description about the newly added four GUC parameters in the doc. > > * We need to put new GUC parameters into postgresql.conf.sample as well. Fixed both. > * The patch doesn't use checksum_cost_page_hit at all. Indeed, I also realized that while working on previous issues. I removed it and renamed checksum_cost_page_miss to checksum_cost_page. > > 9. 
> diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c > index eb19644419..37f63e747c 100644 > --- a/src/backend/utils/init/globals.c > +++ b/src/backend/utils/init/globals.c > @@ -134,6 +134,14 @@ int max_worker_processes = 8; > int max_parallel_workers = 8; > int MaxBackends = 0; > > +int ChecksumCostPageHit = 1; /* GUC parameters for > checksum check */ > +int ChecksumCostPageMiss = 10; > +int ChecksumCostLimit = 200; > +double ChecksumCostDelay = 0; > + > +int ChecksumCostBalance = 0; /* working state for > checksums check */ > +bool ChecksumCostActive = false; > > Can we declare them in checksum.c since these parameters are used only > in checksum.c and it does I/O my itself. The GUC parameters would still need to be global, so for consistency I kept all the variables in globals.c. > > 10. > + /* Report the failure to the stat collector and the logs. */ > + pgstat_report_checksum_failure(); > + ereport(WARNING, > + (errcode(ERRCODE_DATA_CORRUPTED), > + errmsg("invalid page in block %u of relation %s", > + blkno, > + relpath(relation->rd_smgr->smgr_rnode, forknum)))); > > I think we could do pgstat_report_checksum_failure() and emit WARNING > twice for the same page since PageIsVerified() also does the same. As mentioned before, in this patch I only calls PageIsVerified() if the buffer looks like new, and in this case PageIsVerified() only verify that it's a true all-zero-page, and won't try to verify the checksum, so there's no possibility of duplicated report. I modified the comments to document all the interactions and expectations. v6 attached.
Attachment
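For reference while following the cost-based throttling discussion above, here is a minimal sketch of what a throttling point could look like, modeled on vacuum_delay_point(). The ChecksumCost* variables are the ones declared in the patch hunks quoted above; the function name and the exact arithmetic below are assumptions, not the patch's actual code.

    #include "postgres.h"
    #include "miscadmin.h"          /* InterruptPending, CHECK_FOR_INTERRUPTS() */

    /* GUCs and working state as declared by the patch (see the globals.c hunk) */
    extern int      ChecksumCostLimit;
    extern double   ChecksumCostDelay;
    extern int      ChecksumCostBalance;
    extern bool     ChecksumCostActive;

    /*
     * Hypothetical throttling point, modeled on vacuum_delay_point().
     * Callers would add the per-page cost to ChecksumCostBalance for every
     * block they read, then call this function.
     */
    static void
    checksum_delay_point(void)
    {
        if (!ChecksumCostActive || InterruptPending)
            return;

        if (ChecksumCostBalance >= ChecksumCostLimit)
        {
            double      msec;

            msec = ChecksumCostDelay * ChecksumCostBalance / ChecksumCostLimit;
            if (msec > ChecksumCostDelay * 4)
                msec = ChecksumCostDelay * 4;

            pg_usleep((long) (msec * 1000));

            ChecksumCostBalance = 0;

            /* Let a pending cancel or die interrupt be processed. */
            CHECK_FOR_INTERRUPTS();
        }
    }

The accounting itself is unchanged from the vacuum cost model: each checked block adds checksum_cost_page to the balance, and once the balance reaches checksum_cost_limit the backend sleeps for a time scaled by checksum_cost_delay and resets the balance.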
On Sat, 28 Mar 2020 at 21:19, Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Sat, Mar 28, 2020 at 12:28:27PM +0900, Masahiko Sawada wrote: > > On Wed, 18 Mar 2020 at 19:11, Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > > > v5 attached > > > > Thank you for updating the patch. I have some comments: > > Thanks a lot for the review! Thank you for updating the patch! > > 4. > > + /* Check if the relation (still) exists */ > > + if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) > > + { > > + /* > > + * We use a standard relation_open() to acquire the initial lock. It > > + * means that this will block until the lock is acquired, or will > > + * raise an ERROR if lock_timeout has been set. If caller wants to > > + * check multiple tables while relying on a maximum wait time, it > > + * should process tables one by one instead of relying on a global > > + * processing with the main SRF. > > + */ > > + relation = relation_open(relid, AccessShareLock); > > + } > > > > IIUC the above was necessary because we used to have > > check_all_relations() which iterates all relations on the database to > > do checksum checks. But now that we don't have that function and > > pg_check_relation processes one relation. Can we just do > > relation_open() here? > > Ah yes I missed that comment. I think only the comment needed to be updated to > remove any part related to NULL-relation call. I ended up removign the whole > comment since locking and lock_timeout behavior is inherent to relation_open > and there's no need to document that any further now that we always only check > one relation at a time. The current patch still checks SearchSysCacheExists1() before relation_open. Why do we need to call SearchSysCacheExists1() here? I think if the given relation doesn't exist, relation_open() will raise an error "could not open relation with OID %u". + /* Open the relation if it exists. */ + if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) + { + relation = relation_open(relid, AccessShareLock); + } > > 8. > > + if (PageIsVerified(buffer, blkno)) > > + { > > + /* > > + * If the page is really new, there won't by any checksum to be > > + * computed or expected. > > + */ > > + *chk_expected = *chk_found = NoComputedChecksum; > > + return true; > > + } > > + else > > + { > > + /* > > + * There's a corruption, but since this affect PageIsNew, we > > + * can't compute a checksum, so set NoComputedChecksum for the > > + * expected checksum. > > + */ > > + *chk_expected = NoComputedChecksum; > > + *chk_found = hdr->pd_checksum; > > + } > > + return false; > > > > * I think the 'else' is not necessary here. > > AFAICT it's, see below. > > > * Setting *chk_expected and *chk_found seems useless when we return > > true. The caller doesn't use them. > > Indeed, fixed. The patch still sets values to both? + if (PageIsVerified(buffer, blkno)) + { + /* No corruption. */ + *chk_expected = *chk_found = NoComputedChecksum; + return true; + } > > > * Should we forcibly overwrite ignore_checksum_failure to off in > > pg_check_relation()? Otherwise, this logic seems not work fine. > > > > * I don't understand why we cannot compute a checksum in case where a > > page looks like a new page but is actually corrupted. Could you please > > elaborate on that? > > PageIsVerified has a different behavior depending on whether the page looks new > or not. If the page looks like new, it only checks that it's indeed a new > page, and otherwise try to verify the checksum. 
> > Also, pg_check_page() has an assert to make sure that the page isn't (or don't > look like) new. > > So it seems to me that the 'else' is required to properly detect a real or fake > PageIsNew, and try to compute checksums only when required. Thank you for your explanation! I understand. I thought we can arrange the code to something like: if (PageIsNew(hdr)) { if (PageIsVerified(hdr)) { *chk_expected = *chk_found = NoComputedChecksum; return true; } *chk_expected = NoComputedChecksum; *chk_found = hdr->pd_checksum; return false; } But since it's not a critical problem you can ignore it. > > > 8. > > + { > > + {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY, > > + gettext_noop("Checksum cost for a page found in the buffer cache."), > > + NULL > > + }, > > + &ChecksumCostPageHit, > > + 1, 0, 10000, > > + NULL, NULL, NULL > > + }, > > > > * There is no description about the newly added four GUC parameters in the doc. > > > > * We need to put new GUC parameters into postgresql.conf.sample as well. > > Fixed both. > > > * The patch doesn't use checksum_cost_page_hit at all. > > Indeed, I also realized that while working on previous issues. I removed it > and renamed checksum_cost_page_miss to checksum_cost_page. Perhaps we can use checksum_cost_page_hit when we found the page in the shared buffer but it's marked as dirty? > > > > 9. > > diff --git a/src/backend/utils/init/globals.c b/src/backend/utils/init/globals.c > > index eb19644419..37f63e747c 100644 > > --- a/src/backend/utils/init/globals.c > > +++ b/src/backend/utils/init/globals.c > > @@ -134,6 +134,14 @@ int max_worker_processes = 8; > > int max_parallel_workers = 8; > > int MaxBackends = 0; > > > > +int ChecksumCostPageHit = 1; /* GUC parameters for > > checksum check */ > > +int ChecksumCostPageMiss = 10; > > +int ChecksumCostLimit = 200; > > +double ChecksumCostDelay = 0; > > + > > +int ChecksumCostBalance = 0; /* working state for > > checksums check */ > > +bool ChecksumCostActive = false; > > > > Can we declare them in checksum.c since these parameters are used only > > in checksum.c and it does I/O my itself. > > The GUC parameters would still need to be global, so for consistency I kept all > the variables in globals.c. Okay. > > > > 10. > > + /* Report the failure to the stat collector and the logs. */ > > + pgstat_report_checksum_failure(); > > + ereport(WARNING, > > + (errcode(ERRCODE_DATA_CORRUPTED), > > + errmsg("invalid page in block %u of relation %s", > > + blkno, > > + relpath(relation->rd_smgr->smgr_rnode, forknum)))); > > > > I think we could do pgstat_report_checksum_failure() and emit WARNING > > twice for the same page since PageIsVerified() also does the same. > > As mentioned before, in this patch I only calls PageIsVerified() if the buffer > looks like new, and in this case PageIsVerified() only verify that it's a true > all-zero-page, and won't try to verify the checksum, so there's no possibility > of duplicated report. I modified the comments to document all the interactions > and expectations. You're right. Thank you for the explanation! I've read the latest patch and here is random comments: 1. + /* + * Add a page miss cost, as we're always reading outside the shared + * buffers. + */ + /* Add a page cost. */ + ChecksumCostBalance += ChecksumCostPage; There are duplicate comments. 2. + /* Dirty pages are ignored as they'll be flushed soon. */ + if (buf_state & BM_DIRTY) + checkit = false; Should we check the buffer if it has BM_TAG_VALID as well here? 
I thought there might be a possibility that BufTableLookup() returns a buf_Id but its buffer tag is not valid for example when the previous read failed after inserting the buffer tag to the buffer table. 3. + /* Add a page cost. */ + ChecksumCostBalance += ChecksumCostPage; + + return checkit; +} The check_get_buffer() seems to be slightly complex to me but when we reached the end of this function we always return true. Similarly, in the case where we read the block while holding a partition lock we always return true as well. Is my understanding right? If so, it might be better to put some assertions. 4. @@ -10825,6 +10825,14 @@ proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}', proargnames => '{tablespace,name,size,modification}', prosrc => 'pg_ls_tmpdir_1arg' }, +{ oid => '9147', descr => 'check data integrity for one or all relations', + proname => 'pg_check_relation', proisstrict => 'f', procost => '10000', + prorows => '20', proretset => 't', proparallel => 'r', + provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text', + proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}', + proargmodes => '{i,i,o,o,o,o,o}', + proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}', + prosrc => 'pg_check_relation' }, Why is the pg_check_relation() is not a strict function? I think prostrict can be 'true' for this function and we can drop checking if the first argument is NULL. 5. + memset(values, 0, sizeof(values)); + memset(nulls, 0, sizeof(nulls)); I think we can do memset right before setting values to them, that is, after checking (!found_in_sb && !force_lock). 6. +static bool +check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected, + uint16 *chk_found) +{ + PageHeader hdr = (PageHeader) buffer; + + Assert(chk_expected && chk_found); + + if (PageIsNew(hdr)) + { + /* + * Check if the page is really new or if there's a corruption that + * affected PageIsNew detection. Note that PageIsVerified won't try to + * detect checksum corruption in this case, so there's no risk of + * duplicated corruption report. + */ + if (PageIsVerified(buffer, blkno)) How about using Page instead of PageHeader? Looking at other codes, ISTM we usually pass Page to both PageIsNew() and PageIsVerified(). 7. + <entry> + <literal><function>pg_check_relation(<parameter>relation</parameter> <type>oid</type>[, <parameter>fork</parameter> <type>text</type>])</function></literal>. + </entry> +{ oid => '9147', descr => 'check data integrity for one or all relations', + proname => 'pg_check_relation', proisstrict => 'f', procost => '10000', + prorows => '20', proretset => 't', proparallel => 'r', + provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text', + proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}', + proargmodes => '{i,i,o,o,o,o,o}', + proargnames => '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}', + prosrc => 'pg_check_relation' }, The function argument data types don't match in the doc and function declaretion. relation is 'oid' in the doc but is 'regclass' in the function declaretion. 8. +#define SRF_COLS 5 /* Number of output arguments in the SRF */ Looking at similar built-in functions that return set of records they use a more specific name for the number of returned columns such as PG_STAT_GET_WAL_SENDERS_COLS and PG_GET_SHMEM_SIZES_COLS. How about PG_CHECK_RELATION_COLS? 
check_relation_fork() seems to depend quite heavily on pg_check_relation() because the returned tuplestore is specified by pg_check_relation(). It's just an idea, but to improve reusability, how about moving check_relation_fork() to checksumfuncs.c? That is, in checksumfuncs.c, while iterating over all blocks we call a new function in checksum.c, say check_one_block(), which has the following part and is responsible for getting and checking the specified block and returning a boolean indicating whether the block has corruption or not, along with chk_found and chk_expected: /* * To avoid too much overhead, the buffer will be first read without * the locks that would guarantee the lack of false positive, as such * events should be quite rare. */ Retry: if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock, &found_in_sb)) continue; if (check_buffer(buffer, blkno, &chk_expected, &chk_found)) continue; /* * If we get a failure and the buffer wasn't found in shared buffers, * reread the buffer with suitable lock to avoid false positive. See * check_get_buffer for more details. */ if (!found_in_sb && !force_lock) { force_lock = true; goto Retry; } A new function in checksumfuncs.c or pg_check_relation will be responsible for storing the result to the tuplestore. That way, check_one_block() will be useful for other uses when we want to check whether a particular block has corruption with low overhead. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
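To make the proposed split a bit more concrete, here is a rough sketch of what such a check_one_block() could look like, built from the Retry loop quoted above. The signature, the convention that it returns true when a corruption is detected, and the reuse of check_get_buffer()/check_buffer() are assumptions based on this discussion, not code taken from the patch.

    #include "postgres.h"

    #include "common/relpath.h"     /* ForkNumber */
    #include "storage/block.h"      /* BlockNumber */
    #include "utils/relcache.h"     /* Relation */

    /* Assumed helpers from the patch, as discussed in this thread. */
    extern bool check_get_buffer(Relation relation, ForkNumber forknum,
                                 BlockNumber blkno, char *buffer, bool needlock,
                                 bool *found_in_sb);
    extern bool check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
                             uint16 *chk_found);

    /*
     * Hypothetical check_one_block(): returns true when a corruption is
     * detected for the given block, filling *chk_expected and *chk_found.
     */
    bool
    check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
                    uint16 *chk_expected, uint16 *chk_found)
    {
        PGAlignedBlock buf;
        bool        force_lock = false;
        bool        found_in_sb;

    Retry:
        /* Nothing to check, e.g. the block is dirty in shared buffers. */
        if (!check_get_buffer(relation, forknum, blkno, buf.data, force_lock,
                              &found_in_sb))
            return false;

        /* Checksum matches: the block is fine. */
        if (check_buffer(buf.data, blkno, chk_expected, chk_found))
            return false;

        /*
         * Possible false positive: the block was read from disk without the
         * buffer mapping partition lock.  Re-read it while holding that lock
         * before reporting a corruption.
         */
        if (!found_in_sb && !force_lock)
        {
            force_lock = true;
            goto Retry;
        }

        return true;
    }

With such an interface, the tuplestore handling stays entirely in checksumfuncs.c and the block-level logic remains reusable elsewhere.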
On Fri, Apr 03, 2020 at 12:24:50PM +0900, Masahiko Sawada wrote: > On Sat, 28 Mar 2020 at 21:19, Julien Rouhaud <rjuju123@gmail.com> wrote: > > > The current patch still checks SearchSysCacheExists1() before > relation_open. Why do we need to call SearchSysCacheExists1() here? I > think if the given relation doesn't exist, relation_open() will raise > an error "could not open relation with OID %u". > > + /* Open the relation if it exists. */ > + if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) > + { > + relation = relation_open(relid, AccessShareLock); > + } Oops yes sorry about that. Fixed. > > > 8. > > > + if (PageIsVerified(buffer, blkno)) > > > + { > > > + /* > > > + * If the page is really new, there won't by any checksum to be > > > + * computed or expected. > > > + */ > > > + *chk_expected = *chk_found = NoComputedChecksum; > > > + return true; > > > + } > > > + else > > > + { > > > + /* > > > + * There's a corruption, but since this affect PageIsNew, we > > > + * can't compute a checksum, so set NoComputedChecksum for the > > > + * expected checksum. > > > + */ > > > + *chk_expected = NoComputedChecksum; > > > + *chk_found = hdr->pd_checksum; > > > + } > > > + return false; > > > > > > * I think the 'else' is not necessary here. > > > > AFAICT it's, see below. > > > > > * Setting *chk_expected and *chk_found seems useless when we return > > > true. The caller doesn't use them. > > > > Indeed, fixed. > > The patch still sets values to both? > > + if (PageIsVerified(buffer, blkno)) > + { > + /* No corruption. */ > + *chk_expected = *chk_found = NoComputedChecksum; > + return true; > + } Sorry again, fixed. > > > * Should we forcibly overwrite ignore_checksum_failure to off in > > > pg_check_relation()? Otherwise, this logic seems not work fine. > > > > > > * I don't understand why we cannot compute a checksum in case where a > > > page looks like a new page but is actually corrupted. Could you please > > > elaborate on that? > > > > PageIsVerified has a different behavior depending on whether the page looks new > > or not. If the page looks like new, it only checks that it's indeed a new > > page, and otherwise try to verify the checksum. > > > > Also, pg_check_page() has an assert to make sure that the page isn't (or don't > > look like) new. > > > > So it seems to me that the 'else' is required to properly detect a real or fake > > PageIsNew, and try to compute checksums only when required. > > Thank you for your explanation! I understand. > > I thought we can arrange the code to something like: > > if (PageIsNew(hdr)) > { > if (PageIsVerified(hdr)) > { > *chk_expected = *chk_found = NoComputedChecksum; > return true; > } > > *chk_expected = NoComputedChecksum; > *chk_found = hdr->pd_checksum; > return false; > } > > But since it's not a critical problem you can ignore it. I like it, so done! > > > 8. > > > + { > > > + {"checksum_cost_page_hit", PGC_USERSET, RESOURCES_CHECKSUM_DELAY, > > > + gettext_noop("Checksum cost for a page found in the buffer cache."), > > > + NULL > > > + }, > > > + &ChecksumCostPageHit, > > > + 1, 0, 10000, > > > + NULL, NULL, NULL > > > + }, > > > > > > * There is no description about the newly added four GUC parameters in the doc. > > > > > > * We need to put new GUC parameters into postgresql.conf.sample as well. > > > > Fixed both. > > > > > * The patch doesn't use checksum_cost_page_hit at all. > > > > Indeed, I also realized that while working on previous issues. I removed it > > and renamed checksum_cost_page_miss to checksum_cost_page. 
> > Perhaps we can use checksum_cost_page_hit when we found the page in > the shared buffer but it's marked as dirty? The thing is that when the buffer is dirty, we won't do any additional check, thus not adding any overhead. What may be needed here is to account for the locking overhead (in all cases), so that if all (or almost all) the buffers are dirty and in shared buffers the execution can be throttled. I don't know how much an issue it can be, but if that's something to be fixes then page_hit doesn't look like the right answer for that. > I've read the latest patch and here is random comments: > > 1. > + /* > + * Add a page miss cost, as we're always reading outside the shared > + * buffers. > + */ > + /* Add a page cost. */ > + ChecksumCostBalance += ChecksumCostPage; > > There are duplicate comments. Fixed. > 2. > + /* Dirty pages are ignored as they'll be flushed soon. */ > + if (buf_state & BM_DIRTY) > + checkit = false; > > Should we check the buffer if it has BM_TAG_VALID as well here? I > thought there might be a possibility that BufTableLookup() returns a > buf_Id but its buffer tag is not valid for example when the previous > read failed after inserting the buffer tag to the buffer table. Good point, fixed. > 3. > + /* Add a page cost. */ > + ChecksumCostBalance += ChecksumCostPage; > + > + return checkit; > +} > > The check_get_buffer() seems to be slightly complex to me but when we > reached the end of this function we always return true. Similarly, in > the case where we read the block while holding a partition lock we > always return true as well. Is my understanding right? If so, it might > be better to put some assertions. Yes it's a little bit complex. I used this approach to avoid the need to release the locks all over the place, but maybe this doesn't really improve things. I added asserts and comments anyway as suggested, thanks. > 4. > @@ -10825,6 +10825,14 @@ > proallargtypes => '{oid,text,int8,timestamptz}', proargmodes => '{i,o,o,o}', > proargnames => '{tablespace,name,size,modification}', > prosrc => 'pg_ls_tmpdir_1arg' }, > +{ oid => '9147', descr => 'check data integrity for one or all relations', > + proname => 'pg_check_relation', proisstrict => 'f', procost => '10000', > + prorows => '20', proretset => 't', proparallel => 'r', > + provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text', > + proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}', > + proargmodes => '{i,i,o,o,o,o,o}', > + proargnames => > '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}', > + prosrc => 'pg_check_relation' }, > > Why is the pg_check_relation() is not a strict function? I think > prostrict can be 'true' for this function and we can drop checking if > the first argument is NULL. That's because the fork is still optional. While this could be made mandatory without much problems, I think we'll eventually want to add a way to check only a subset of a fork, so it seemed to me that is wasn't worth changing that now. > 5. > + memset(values, 0, sizeof(values)); > + memset(nulls, 0, sizeof(nulls)); > > I think we can do memset right before setting values to them, that is, > after checking (!found_in_sb && !force_lock). Indeed, done! > 6. 
> +static bool > +check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected, > + uint16 *chk_found) > +{ > + PageHeader hdr = (PageHeader) buffer; > + > + Assert(chk_expected && chk_found); > + > + if (PageIsNew(hdr)) > + { > + /* > + * Check if the page is really new or if there's a corruption that > + * affected PageIsNew detection. Note that PageIsVerified won't try to > + * detect checksum corruption in this case, so there's no risk of > + * duplicated corruption report. > + */ > + if (PageIsVerified(buffer, blkno)) > > How about using Page instead of PageHeader? Looking at other codes, > ISTM we usually pass Page to both PageIsNew() and PageIsVerified(). Agreed, done. > 7. > + <entry> > + <literal><function>pg_check_relation(<parameter>relation</parameter> > <type>oid</type>[, <parameter>fork</parameter> > <type>text</type>])</function></literal>. > + </entry> > > +{ oid => '9147', descr => 'check data integrity for one or all relations', > + proname => 'pg_check_relation', proisstrict => 'f', procost => '10000', > + prorows => '20', proretset => 't', proparallel => 'r', > + provolatile => 'v', prorettype => 'record', proargtypes => 'regclass text', > + proallargtypes => '{regclass,text,oid,int4,int8,int4,int4}', > + proargmodes => '{i,i,o,o,o,o,o}', > + proargnames => > '{relation,fork,relid,forknum,failed_blocknum,expected_checksum,found_checksum}', > + prosrc => 'pg_check_relation' }, > > The function argument data types don't match in the doc and function > declaretion. relation is 'oid' in the doc but is 'regclass' in the > function declaretion. Fixed. > 8. > +#define SRF_COLS 5 /* Number of output arguments in the SRF */ > > Looking at similar built-in functions that return set of records they > use a more specific name for the number of returned columns such as > PG_STAT_GET_WAL_SENDERS_COLS and PG_GET_SHMEM_SIZES_COLS. How about > PG_CHECK_RELATION_COLS? > > check_relation_fork() seems to quite depends on pg_check_relation() > because the returned tuplestore is specified by pg_check_relation(). > It's just an idea but to improve reusability, how about moving > check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c > while iterating all blocks we call a new function in checksum.c, say > check_one_block() function, which has the following part and is > responsible for getting, checking the specified block and returning a > boolean indicating whether the block has corruption or not, along with > chk_found and chk_expected: > > /* > * To avoid too much overhead, the buffer will be first read without > * the locks that would guarantee the lack of false positive, as such > * events should be quite rare. > */ > Retry: > if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock, > &found_in_sb)) > continue; > > if (check_buffer(buffer, blkno, &chk_expected, &chk_found)) > continue; > > /* > * If we get a failure and the buffer wasn't found in shared buffers, > * reread the buffer with suitable lock to avoid false positive. See > * check_get_buffer for more details. > */ > if (!found_in_sb && !force_lock) > { > force_lock = true; > goto Retry; > } > > A new function in checksumfuncs.c or pg_check_relation will be > responsible for storing the result to the tuplestore. That way, > check_one_block() will be useful for other use when we want to check > if the particular block has corruption with low overhead. Yes, I agree that passing the tuplestore isn't an ideal approach and some refactoring should probably happen. 
One thing is that this wouldn't be "check_one_block()" but "check_one_block_on_disk()" (which could also read from the OS cache). I'm not sure how useful that is on its own. It also raises some concerns about the throttling. I didn't change that for now, but I hope there'll be some other feedback about it.
Attachment
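For readers without the attachment at hand, here is a rough sketch of the check_buffer() shape once the rearrangement suggested above is applied. The PageIsNew() branch follows the hunks quoted in this thread; the non-new branch, which computes the checksum with pg_checksum_page() and compares it against the stored pd_checksum, is an assumption about the patch rather than a quote from it.

    #include "postgres.h"

    #include "storage/bufpage.h"    /* Page, PageHeader, PageIsNew, PageIsVerified */
    #include "storage/checksum.h"   /* pg_checksum_page */

    /* As in the patch: marker used when no checksum can be computed. */
    #define NoComputedChecksum 0

    /*
     * Hypothetical shape of check_buffer(): returns true if the block is
     * considered valid, otherwise fills the expected and found checksums.
     */
    static bool
    check_buffer(char *buffer, uint32 blkno, uint16 *chk_expected,
                 uint16 *chk_found)
    {
        Page        page = (Page) buffer;

        Assert(chk_expected && chk_found);

        if (PageIsNew(page))
        {
            if (PageIsVerified(page, blkno))
            {
                /* Genuinely new (all-zero) page: nothing to compare. */
                *chk_expected = *chk_found = NoComputedChecksum;
                return true;
            }

            /* Corruption that merely makes the page look new. */
            *chk_expected = NoComputedChecksum;
            *chk_found = ((PageHeader) page)->pd_checksum;
            return false;
        }

        /* Regular page: compare the freshly computed and stored checksums. */
        *chk_expected = pg_checksum_page(buffer, blkno);
        *chk_found = ((PageHeader) page)->pd_checksum;

        return *chk_expected == *chk_found;
    }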
On Fri, Apr 03, 2020 at 11:39:11AM +0200, Julien Rouhaud wrote: > On Fri, Apr 03, 2020 at 12:24:50PM +0900, Masahiko Sawada wrote: > > > > check_relation_fork() seems to quite depends on pg_check_relation() > > because the returned tuplestore is specified by pg_check_relation(). > > It's just an idea but to improve reusability, how about moving > > check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c > > while iterating all blocks we call a new function in checksum.c, say > > check_one_block() function, which has the following part and is > > responsible for getting, checking the specified block and returning a > > boolean indicating whether the block has corruption or not, along with > > chk_found and chk_expected: > > > > /* > > * To avoid too much overhead, the buffer will be first read without > > * the locks that would guarantee the lack of false positive, as such > > * events should be quite rare. > > */ > > Retry: > > if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock, > > &found_in_sb)) > > continue; > > > > if (check_buffer(buffer, blkno, &chk_expected, &chk_found)) > > continue; > > > > /* > > * If we get a failure and the buffer wasn't found in shared buffers, > > * reread the buffer with suitable lock to avoid false positive. See > > * check_get_buffer for more details. > > */ > > if (!found_in_sb && !force_lock) > > { > > force_lock = true; > > goto Retry; > > } > > > > A new function in checksumfuncs.c or pg_check_relation will be > > responsible for storing the result to the tuplestore. That way, > > check_one_block() will be useful for other use when we want to check > > if the particular block has corruption with low overhead. > > > Yes, I agree that passing the tuplestore isn't an ideal approach and some > refactoring should probably happen. One thing is that this wouldn't be > "check_one_block()" but "check_one_block_on_disk()" (which could also be from > the OS cache). I'm not sure how useful it's in itself. It also raises some > concerns about the throttling. I didn't change that for now, but I hope > there'll be some other feedback about it. > I had some time this morning, so I did the suggested refactoring as it seems like a way cleaner interface. I also kept the suggested check_one_block().
Attachment
On Sat, 4 Apr 2020 at 18:04, Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Fri, Apr 03, 2020 at 11:39:11AM +0200, Julien Rouhaud wrote: > > On Fri, Apr 03, 2020 at 12:24:50PM +0900, Masahiko Sawada wrote: > > > > > > check_relation_fork() seems to quite depends on pg_check_relation() > > > because the returned tuplestore is specified by pg_check_relation(). > > > It's just an idea but to improve reusability, how about moving > > > check_relation_fork() to checksumfunc.c? That is, in checksumfuncs.c > > > while iterating all blocks we call a new function in checksum.c, say > > > check_one_block() function, which has the following part and is > > > responsible for getting, checking the specified block and returning a > > > boolean indicating whether the block has corruption or not, along with > > > chk_found and chk_expected: > > > > > > /* > > > * To avoid too much overhead, the buffer will be first read without > > > * the locks that would guarantee the lack of false positive, as such > > > * events should be quite rare. > > > */ > > > Retry: > > > if (!check_get_buffer(relation, forknum, blkno, buffer, force_lock, > > > &found_in_sb)) > > > continue; > > > > > > if (check_buffer(buffer, blkno, &chk_expected, &chk_found)) > > > continue; > > > > > > /* > > > * If we get a failure and the buffer wasn't found in shared buffers, > > > * reread the buffer with suitable lock to avoid false positive. See > > > * check_get_buffer for more details. > > > */ > > > if (!found_in_sb && !force_lock) > > > { > > > force_lock = true; > > > goto Retry; > > > } > > > > > > A new function in checksumfuncs.c or pg_check_relation will be > > > responsible for storing the result to the tuplestore. That way, > > > check_one_block() will be useful for other use when we want to check > > > if the particular block has corruption with low overhead. > > > > > > Yes, I agree that passing the tuplestore isn't an ideal approach and some > > refactoring should probably happen. One thing is that this wouldn't be > > "check_one_block()" but "check_one_block_on_disk()" (which could also be from > > the OS cache). I'm not sure how useful it's in itself. It also raises some > > concerns about the throttling. I didn't change that for now, but I hope > > there'll be some other feedback about it. > > > > > I had some time this morning, so I did the suggested refactoring as it seems > like a way cleaner interface. I also kept the suggested check_one_block(). Thank you for updating the patch! The patch looks good to me. Here are some random comments mostly about cosmetic changes. 1. I think we can have two separate SQL functions: pg_check_relation(regclass, text) and pg_check_relation(regclass), instead of setting NULL by default to the second argument. 2. + * Check data sanity for a specific block in the given fork of the given + * relation, always retrieved locally with smgrred even if a version exists in s/smgrred/smgrread/ 3. + /* The buffer will have to check checked. */ + Assert(checkit); Should it be "The buffer will have to be checked"? 4. + if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES)) + ereport(ERROR, + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), + errmsg("only superuser or a member of the pg_read_server_files role may use this function"))); Looking at the definition of pg_stat_read_server_files role, this role seems to be for operations that could read non-database files such as csv files. Therefore, currently this role is used by file_fdw and COPY command. 
I personally think pg_stat_scan_tables would be more appropriate for this function but I'm not sure. 5. + /* Set cost-based vacuum delay */ + ChecksumCostActive = (ChecksumCostDelay > 0); + ChecksumCostBalance = 0; s/vacuum/checksum verification/ 6. + ereport(WARNING, + (errcode(ERRCODE_DATA_CORRUPTED), + errmsg("invalid page in block %u of relation %s", + blkno, + relpath(relation->rd_smgr->smgr_rnode, forknum)))); I think it's better to show the relation name instead of the relation path here. 7. + ereport(ERROR, + (errcode(ERRCODE_WRONG_OBJECT_TYPE), + errmsg("relation \"%s\" does not have storage to be checked", + quote_qualified_identifier( + get_namespace_name(get_rel_namespace(relid)), + get_rel_name(relid))))); Looking at other similar error messages we don't show qualified relation name but the relation name gotten by RelationGetRelationName(relation). Can we do that here as well for consistency? 8. + if (!(rsinfo->allowedModes & SFRM_Materialize)) + ereport(ERROR, + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), + errmsg("materialize mode required, but it is not " \ + "allowed in this context"))); I think it's better to have this error message in one line for easy grepping. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Apr 05, 2020 at 01:13:30PM +0900, Masahiko Sawada wrote: > > Thank you for updating the patch! The patch looks good to me. Here are > some random comments mostly about cosmetic changes. > Thanks a lot for the review! > > 1. > I think we can have two separate SQL functions: > pg_check_relation(regclass, text) and pg_check_relation(regclass), > instead of setting NULL by default to the second argument. > I'm fine with it, so implemented this way with the required documentation changes. > > 2. > + * Check data sanity for a specific block in the given fork of the given > + * relation, always retrieved locally with smgrred even if a version exists in > > s/smgrred/smgrread/ Fixed. > > 3. > + /* The buffer will have to check checked. */ > + Assert(checkit); > > Should it be "The buffer will have to be checked"? > Oops indeed, fixed. > > 4. > + if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES)) > + ereport(ERROR, > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > + errmsg("only superuser or a member of the > pg_read_server_files role may use this function"))); > > Looking at the definition of pg_stat_read_server_files role, this role > seems to be for operations that could read non-database files such as > csv files. Therefore, currently this role is used by file_fdw and COPY > command. I personally think pg_stat_scan_tables would be more > appropriate for this function but I'm not sure. > That's a very good point, especially since the documentation of this default role is quite relevant for those functions: "Execute monitoring functions that may take ACCESS SHARE locks on tables, potentially for a long time." So changed! > > 5. > + /* Set cost-based vacuum delay */ > + ChecksumCostActive = (ChecksumCostDelay > 0); > + ChecksumCostBalance = 0; > > s/vacuum/checksum verification/ > Fixed. > > 6. > + ereport(WARNING, > + (errcode(ERRCODE_DATA_CORRUPTED), > + errmsg("invalid page in block %u of relation %s", > + blkno, > + relpath(relation->rd_smgr->smgr_rnode, forknum)))); > > I think it's better to show the relation name instead of the relation path here. > I'm here using the same pattern as what ReadBuffer_common() would display if a corrupted block is read. I think it's better to keep the format for both, so any existing log analyzer will keep working with those new functions. > > 7. > + ereport(ERROR, > + (errcode(ERRCODE_WRONG_OBJECT_TYPE), > + errmsg("relation \"%s\" does not have storage to be checked", > + quote_qualified_identifier( > + get_namespace_name(get_rel_namespace(relid)), > + get_rel_name(relid))))); > > Looking at other similar error messages we don't show qualified > relation name but the relation name gotten by > RelationGetRelationName(relation). Can we do that here as well for > consistency? > Indeed, fixed. > > 8. > + if (!(rsinfo->allowedModes & SFRM_Materialize)) > + ereport(ERROR, > + (errcode(ERRCODE_FEATURE_NOT_SUPPORTED), > + errmsg("materialize mode required, but it is not " \ > + "allowed in this context"))); > > I think it's better to have this error message in one line for easy grepping. Fixed. I also fixed missing leading tab in the perl TAP tests
Attachment
On Sun, 5 Apr 2020 at 17:44, Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Sun, Apr 05, 2020 at 01:13:30PM +0900, Masahiko Sawada wrote: > > > > Thank you for updating the patch! The patch looks good to me. Here are > > some random comments mostly about cosmetic changes. > > > > Thanks a lot for the review! Thank you for updating the patch. > > > > > 1. > > I think we can have two separate SQL functions: > > pg_check_relation(regclass, text) and pg_check_relation(regclass), > > instead of setting NULL by default to the second argument. > > > > I'm fine with it, so implemented this way with the required documentation > changes. Why do we need two rows in the doc? For instance, replication slot functions have some optional arguments but there is only one row in the doc. So I think we don't need to change the doc from the previous version patch. And I think these are not necessary as we already defined in include/catalog/pg_proc.dat: +CREATE OR REPLACE FUNCTION pg_check_relation( + IN relation regclass, + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, + OUT expected_checksum integer, OUT found_checksum integer) + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation' + PARALLEL RESTRICTED; + +CREATE OR REPLACE FUNCTION pg_check_relation( + IN relation regclass, IN fork text, + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, + OUT expected_checksum integer, OUT found_checksum integer) + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal + AS 'pg_check_relation_fork' + PARALLEL RESTRICTED; > > > > > 2. > > + * Check data sanity for a specific block in the given fork of the given > > + * relation, always retrieved locally with smgrred even if a version exists in > > > > s/smgrred/smgrread/ > > Fixed. > > > > > 3. > > + /* The buffer will have to check checked. */ > > + Assert(checkit); > > > > Should it be "The buffer will have to be checked"? > > > > Oops indeed, fixed. > > > > > 4. > > + if (!is_member_of_role(GetUserId(), DEFAULT_ROLE_READ_SERVER_FILES)) > > + ereport(ERROR, > > + (errcode(ERRCODE_INSUFFICIENT_PRIVILEGE), > > + errmsg("only superuser or a member of the > > pg_read_server_files role may use this function"))); > > > > Looking at the definition of pg_stat_read_server_files role, this role > > seems to be for operations that could read non-database files such as > > csv files. Therefore, currently this role is used by file_fdw and COPY > > command. I personally think pg_stat_scan_tables would be more > > appropriate for this function but I'm not sure. > > > > That's a very good point, especially since the documentation of this default > role is quite relevant for those functions: > > "Execute monitoring functions that may take ACCESS SHARE locks on tables, > potentially for a long time." > > So changed! > > > > > 5. > > + /* Set cost-based vacuum delay */ > > + ChecksumCostActive = (ChecksumCostDelay > 0); > > + ChecksumCostBalance = 0; > > > > s/vacuum/checksum verification/ > > > > Fixed. > > > > > 6. > > + ereport(WARNING, > > + (errcode(ERRCODE_DATA_CORRUPTED), > > + errmsg("invalid page in block %u of relation %s", > > + blkno, > > + relpath(relation->rd_smgr->smgr_rnode, forknum)))); > > > > I think it's better to show the relation name instead of the relation path here. > > > > I'm here using the same pattern as what ReadBuffer_common() would display if a > corrupted block is read. I think it's better to keep the format for both, so > any existing log analyzer will keep working with those new functions. 
Ok, I agree with you. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote: > > Why do we need two rows in the doc? For instance, replication slot > functions have some optional arguments but there is only one row in > the doc. So I think we don't need to change the doc from the previous > version patch. > I thought that if we document the function as pg_check_relation(regclass [, fork]) users could think that the 2nd argument is optional, so that pg_check_relation('something', NULL) could be a valid alias for the 1-argument form, which it isn't. After checking, I see that e.g. current_setting has the same semantics and is documented the way you suggest, so fixed back to previous version. > And I think these are not necessary as we already defined in > include/catalog/pg_proc.dat: > > +CREATE OR REPLACE FUNCTION pg_check_relation( > + IN relation regclass, > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > + OUT expected_checksum integer, OUT found_checksum integer) > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation' > + PARALLEL RESTRICTED; > + > +CREATE OR REPLACE FUNCTION pg_check_relation( > + IN relation regclass, IN fork text, > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > + OUT expected_checksum integer, OUT found_checksum integer) > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal > + AS 'pg_check_relation_fork' > + PARALLEL RESTRICTED; > Oh right this isn't required since there's no default value anymore, fixed. v9 attached.
Attachment
On Sun, 5 Apr 2020 at 18:45, Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote: > > > > Why do we need two rows in the doc? For instance, replication slot > > functions have some optional arguments but there is only one row in > > the doc. So I think we don't need to change the doc from the previous > > version patch. > > > > I thought that if we document the function as pg_check_relation(regclass [, > fork]) users could think that the 2nd argument is optional, so that > pg_check_relation('something', NULL) could be a valid alias for the 1-argument > form, which it isn't. After checking, I see that e.g. current_setting has the > same semantics and is documented the way you suggest, so fixed back to previous > version. > > > And I think these are not necessary as we already defined in > > include/catalog/pg_proc.dat: > > > > +CREATE OR REPLACE FUNCTION pg_check_relation( > > + IN relation regclass, > > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > > + OUT expected_checksum integer, OUT found_checksum integer) > > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation' > > + PARALLEL RESTRICTED; > > + > > +CREATE OR REPLACE FUNCTION pg_check_relation( > > + IN relation regclass, IN fork text, > > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > > + OUT expected_checksum integer, OUT found_checksum integer) > > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal > > + AS 'pg_check_relation_fork' > > + PARALLEL RESTRICTED; > > > > Oh right this isn't required since there's no default value anymore, fixed. > > v9 attached. Thank you for updating the patch! The patch looks good to me. I've marked this patch as Ready for Committer. I hope this patch will get committed to PG13. Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Sun, Apr 05, 2020 at 08:01:36PM +0900, Masahiko Sawada wrote: > On Sun, 5 Apr 2020 at 18:45, Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > On Sun, Apr 05, 2020 at 06:08:06PM +0900, Masahiko Sawada wrote: > > > > > > Why do we need two rows in the doc? For instance, replication slot > > > functions have some optional arguments but there is only one row in > > > the doc. So I think we don't need to change the doc from the previous > > > version patch. > > > > > > > I thought that if we document the function as pg_check_relation(regclass [, > > fork]) users could think that the 2nd argument is optional, so that > > pg_check_relation('something', NULL) could be a valid alias for the 1-argument > > form, which it isn't. After checking, I see that e.g. current_setting has the > > same semantics and is documented the way you suggest, so fixed back to previous > > version. > > > > > And I think these are not necessary as we already defined in > > > include/catalog/pg_proc.dat: > > > > > > +CREATE OR REPLACE FUNCTION pg_check_relation( > > > + IN relation regclass, > > > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > > > + OUT expected_checksum integer, OUT found_checksum integer) > > > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal AS 'pg_check_relation' > > > + PARALLEL RESTRICTED; > > > + > > > +CREATE OR REPLACE FUNCTION pg_check_relation( > > > + IN relation regclass, IN fork text, > > > + OUT relid oid, OUT forknum integer, OUT failed_blocknum bigint, > > > + OUT expected_checksum integer, OUT found_checksum integer) > > > + RETURNS SETOF record STRICT VOLATILE LANGUAGE internal > > > + AS 'pg_check_relation_fork' > > > + PARALLEL RESTRICTED; > > > > > > > Oh right this isn't required since there's no default value anymore, fixed. > > > > v9 attached. > > Thank you for updating the patch! The patch looks good to me. > > I've marked this patch as Ready for Committer. I hope this patch will > get committed to PG13. > Thanks a lot!
> On 5 Apr 2020, at 13:17, Julien Rouhaud <rjuju123@gmail.com> wrote: > On Sun, Apr 05, 2020 at 08:01:36PM +0900, Masahiko Sawada wrote: >> Thank you for updating the patch! The patch looks good to me. >> >> I've marked this patch as Ready for Committer. I hope this patch will >> get committed to PG13. > Thanks a lot! This patch has been through quite thorough review, and skimming the thread all concerns raised have been addressed. It still applies and tests are green in the CF Patchtester. The feature in itself certainly gets my +1 for inclusion; it seems a good addition. Is any committer who has taken part in the thread (or anyone else for that matter) interested in seeing this to some form of closure in this CF? cheers ./daniel
On Mon, Sep 7, 2020 at 8:59 AM Michael Paquier <michael@paquier.xyz> wrote: > > +#include "postgres.h" > + > +#include "access/tupdesc.h" > +#include "common/relpath.h" > #include "storage/block.h" > +#include "utils/relcache.h" > +#include "utils/tuplestore.h" > [...] > +extern bool check_one_block(Relation relation, ForkNumber forknum, > + BlockNumber blkno, uint16 *chk_expected, > + uint16 *chk_found); > I don't think that it is a good idea to add this much to checksum.h > as these are includes coming mainly from the backend. Note that > pg_checksum_page() is a function designed to be also available for > frontend tools, with checksum.h something that can be included in > frontends. This would mean that we could move all the buffer lookup > APIs directly to checksumfuncs.c, or move that into a separate file > closer to the location. Did you mean creating a new checksumfuncs.c file? I don't find any such file in the current tree. > + * A zero checksum can never be computed, see pg_checksum_page() */ > +#define NoComputedChecksum 0 > Wouldn't it be better to rename that something like > InvalidPageChecksum, and make use of it in pg_checksum_page()? It > would be more consistent with the naming of transaction IDs, OIDs or > even XLogRecPtr. And that could be a separate patch. It seems quite ambiguous, as checksum validity usually has a different meaning. And in the code added here, the meaning isn't that the checksum is invalid but that there's no checksum as it cannot be computed due to PageIsNew(). > +++ b/src/test/check_relation/t/01_checksums_check.pl > @@ -0,0 +1,276 @@ > +use strict; > +use warnings; > It could be better to move that to src/test/modules/, so as it could > be picked more easily by MSVC scripts in the future. Note that if we > apply the normal naming convention here this should be named > 001_checksum_check.pl. > > +subdir = src/test/check_relation > +top_builddir = ../../.. > +include $(top_builddir)/src/Makefile.global > Let's use a Makefile shaped in a way similar to modules/test_misc that > makes use of TAP_TESTS = 1. There is the infra, let's rely on it for > the regression tests. Will fix. > + pg_usleep(msec * 1000L); > Could it be possible to add a wait event here? It would be nice to be > able to monitor that in pg_stat_activity. Sure, I missed that as this was first implemented as an extension. > +if (exists $ENV{MY_PG_REGRESS}) > +{ > + $ENV{PG_REGRESS} = $ENV{MY_PG_REGRESS}; > +} > What is MY_PG_REGRESS for? A remnant from an external makefile > perhaps? Indeed. > + /* > + * If we get a failure and the buffer wasn't found in shared buffers, > + * reread the buffer with suitable lock to avoid false positive. See > + * check_get_buffer for more details. > + */ > + if (!found_in_sb && !force_lock) > + { > + force_lock = true; > + goto Retry; > + } > As designed, we have a risk of false positives with a torn page in the > first loop when trying to look for a given buffer as we would try to > use smgrread() without a partition lock. This stresses me a bit, and > false positives could scare users easily. Could we consider first a > safer approach where we don't do that, and just read the page while > holding the partition lock? OK, that would be more expensive, but at > least that's safe in any case. My memory of this patch is a bit > fuzzy, but this is itching me and this is the heart of the problem > dealt with here :) I'm not sure I understand. Unless I missed something this approach *cannot* raise a false positive.
What it does is force a 2nd check with a stronger lock *to make sure it's actually a corruption*, so we don't raise a false positive. The only report that can happen in this 1st loop is if smgrread raises an error, which AFAICT can only happen (at least with mdread) if the whole block couldn't be read, which is a sign of a very bad problem. This should clearly be reported, as this cannot be caused by the locking heuristics used here.
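To make the two-pass locking logic easier to picture, here is a minimal C sketch of the flow described above; check_get_buffer() and verify_page_checksum() are hypothetical helpers standing in for the patch's internals and are only assumed for this example.

    /*
     * Illustrative sketch only: check_get_buffer() and verify_page_checksum()
     * are made-up helpers standing in for the patch's internals.
     */
    #include "postgres.h"
    #include "common/relpath.h"
    #include "storage/block.h"
    #include "storage/bufpage.h"
    #include "utils/relcache.h"

    static bool check_get_buffer(Relation rel, ForkNumber forknum,
                                 BlockNumber blkno, char *buffer,
                                 bool force_lock, bool *found_in_sb);
    static bool verify_page_checksum(const char *buffer, BlockNumber blkno,
                                     uint16 *chk_expected, uint16 *chk_found);

    static bool
    check_one_block_sketch(Relation rel, ForkNumber forknum, BlockNumber blkno,
                           uint16 *chk_expected, uint16 *chk_found)
    {
        char        buffer[BLCKSZ];
        bool        force_lock = false;
        bool        found_in_sb;

    Retry:
        /* Dirty or new pages are skipped, there is nothing to verify. */
        if (!check_get_buffer(rel, forknum, blkno, buffer, force_lock,
                              &found_in_sb))
            return true;

        if (verify_page_checksum(buffer, blkno, chk_expected, chk_found))
            return true;        /* checksum matches, nothing to report */

        /*
         * The first pass read the block from disk without holding the buffer
         * mapping partition lock, so a torn read is possible.  Retry once
         * while keeping the lock held; only a failure on that second pass is
         * reported as corruption.
         */
        if (!found_in_sb && !force_lock)
        {
            force_lock = true;
            goto Retry;
        }

        return false;           /* confirmed checksum mismatch */
    }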
On Mon, Sep 07, 2020 at 09:38:30AM +0200, Julien Rouhaud wrote: > Did you mean creating a new checksumfuncs.c file? I don't find any > such file in the current tree. Your patch adds checksumfuncs.c, so the subroutines grabbing a given block could just be moved there. > I'm not sure I understand. Unless I missed something this approach > *cannot* raise a false positive. What it does is force a 2nd check > with stronger lock *to make sure it's actually a corruption*, so we > don't raise false positive. The only report that can happen in this > 1st loop is if smgrread raises an error, which AFAICT can only happen > (at least with mdread) if the whole block couldn't be read, which is a > sign of a very bad problem. This should clearly be reported, as this > cannot be caused by the locking heuristics used here. We don't know how much this optimization matters though? Could it be possible to get an idea of that? For example, take the case of one relation with a fixed size in a read-only workload and a read-write workload (as long as autovacuum and updates make the number of relation blocks rather constant for the read-write case), doing a checksum verification in parallel with multiple clients working on the relation concurrently. Assuming that the relation is fully in the OS cache, we could get an idea of the impact with multiple (shared_buffers / relation size) rates to make the eviction more aggressive? The buffer partition locks, knowing that NUM_BUFFER_PARTITIONS caps that, should be the bottleneck; still, it seems to me that it would be good to see if we have a difference. What do you think? -- Michael
On Mon, 7 Sep 2020 at 15:59, Michael Paquier <michael@paquier.xyz> wrote: > > On Tue, Jul 14, 2020 at 11:08:08AM +0200, Julien Rouhaud wrote: > > On Sun, Jul 12, 2020 at 12:34:03PM -0500, Justin Pryzby wrote: > >> Small language fixes in comments and user-facing docs. > > > > Thanks a lot! I just fixed a small issue (see below), PFA updated v10. > > Sawada-san, you are registered as a reviewer of this patch. Are you > planning to look at it? If you are busy lately, that's fine as well > (congrats!). Thanks! > In this case it could be better to unregister from the > CF app for this entry. Well, I sent review comments on this patch and Julien addressed all of them. So I marked this as Ready for Committer since I didn't have further comments at that time, and I was waiting for the committer review. I'll look at this patch again, but should I remove my name from the reviewers after that if I have no further comments? Regards, -- Masahiko Sawada http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Tue, Sep 08, 2020 at 11:36:45AM +0900, Masahiko Sawada wrote: > On Mon, 7 Sep 2020 at 15:59, Michael Paquier <michael@paquier.xyz> wrote: >> In this case it could be better to unregister from the >> CF app for this entry. > > Well, I sent review comments on this patch and Julien fixed all > comments. So I’d marked this as Ready for Committer since I didn't > have further comments at that time, and I was waiting for the > committer review. I'll look at this patch again but should I remove my > name from the reviewer after that if no comments? Ah, sorry, I somewhat missed the previous status of the patch. Perhaps that's an overdose of CF. Keeping your name as a reviewer is fine, I guess. I have begun looking at the patch and spotted some issues, so let's see where we go from here. -- Michael
On Mon, Sep 07, 2020 at 05:50:38PM +0900, Michael Paquier wrote: > On Mon, Sep 07, 2020 at 09:38:30AM +0200, Julien Rouhaud wrote: > > Did you mean creating a new checksumfuncs.c file? I don't find any > > such file in the current tree. > > Your patch adds checksumfuncs.c, so the subroutines grabbing a given > block could just be moved there. > Sorry, I was in the middle of a rebase for another patch and missed the new files added in this one. I added a new checksumfuncs.h for the required includes that should not be seen by client code. I kept checksumfuncs.c and checksums.c so that the SQL-visible declarations are separated from the rest of the implementation, as this is what we already do elsewhere I think. If that's a problem I'll change and put everything in checksumfuncs.[ch]. I also moved the TAP tests to src/test/modules and renamed the file with a 3-digit prefix. For the record I initially copied src/test/modules/brin, and this is apparently the only subdir that has a 2-digit pattern. I also added a new WAIT_EVENT_CHECK_DELAY wait event. > > I'm not sure I understand. Unless I missed something this approach > > *cannot* raise a false positive. What it does is force a 2nd check > > with stronger lock *to make sure it's actually a corruption*, so we > > don't raise false positive. The only report that can happen in this > > 1st loop is if smgrread raises an error, which AFAICT can only happen > > (at least with mdread) if the whole block couldn't be read, which is a > > sign of a very bad problem. This should clearly be reported, as this > > cannot be caused by the locking heuristics used here. > > We don't know how much this optimization matters though? Could it be > possible to get an idea of that? For example, take the case of one > relation with a fixed size in a read-only workload and a read-write > workload (as long as autovacuum and updates make the number of > relation blocks rather constant for the read-write case), doing a > checksum verification in parallel of multiple clients working on the > relation concurrently. Assuming that the relation is fully in the OS > cache, we could get an idea of the impact with multiple > (shared_buffers / relation size) rates to make the eviction more > aggressive? The buffer partition locks, knowing that > NUM_BUFFER_PARTITIONS caps that, should be the bottleneck, still it > seems to me that it would be good to see if we have a difference. > What do you think? I assumed that the odds of having to check the buffer twice were low enough, and that avoiding holding a bufmapping lock while doing some IO was an uncontroversial enough optimisation, but maybe that's only wishful thinking. I'll do some benchmarking and see if I can get some figures, but it'll probably take some time. In the meantime I'm attaching v11 of the patch that should address all other comments.
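As a side note on the new wait event, the throttling sleep can be made visible in pg_stat_activity with the existing pgstat wait-event API; the sketch below only illustrates that pattern, passing vacuum-style cost variables as arguments to stay self-contained rather than mirroring the patch's actual code.

    #include "postgres.h"
    #include "miscadmin.h"
    #include "pgstat.h"

    /*
     * Sketch of a throttling point: WAIT_EVENT_CHECK_DELAY is the wait event
     * added by the patch, and the cost parameters are passed in here only to
     * keep the example standalone.
     */
    static void
    check_delay_point(double cost_delay, int *cost_balance, int cost_limit)
    {
        /* Nothing to do if throttling is disabled. */
        if (cost_delay <= 0)
            return;

        CHECK_FOR_INTERRUPTS();

        if (*cost_balance >= cost_limit)
        {
            double      msec = cost_delay * (*cost_balance) / cost_limit;

            /* Make the sleep show up in pg_stat_activity. */
            pgstat_report_wait_start(WAIT_EVENT_CHECK_DELAY);
            pg_usleep((long) (msec * 1000));
            pgstat_report_wait_end();

            *cost_balance = 0;
        }
    }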
On Wed, Sep 09, 2020 at 11:25:24AM +0200, Julien Rouhaud wrote: > I assumed that the odds of having to check the buffer twice were so low, and > avoiding to keep a bufmapping lock while doing some IO was an uncontroversial > enough optimisation, but maybe that's only wishful thinking. Perhaps it is worth it, so it would be good to make sure of it and see if that's actually worth the extra complication. This also depends on whether the page is in the OS cache when it is not in shared buffers, meaning that smgrread() is needed to fetch the page to check. I would be more curious to see if there is an actual difference if the page is in the OS cache. > I'll do some benchmarking and see if I can get some figures, but it'll probably > take some time. In the meantime I'm attaching v11 of the patch that should > address all other comments. Thanks. Another thing that was itching me is the introduction of 3 GUCs with one new category for the sake of two SQL functions. For VACUUM we have many things relying on the GUC delays, with autovacuum and manual vacuum. Perhaps it would make sense to have these if we some day have a background worker doing checksum verifications; still, that could perfectly be in contrib/, and that would be kind of hard to tune as well. The patch enabling checksums on-the-fly could also be a good enough reason. Another thing we could consider is to pass down those parameters as function arguments, at the cost of not being able to reload them. -- Michael
On Wed, Sep 9, 2020 at 2:37 PM Michael Paquier <michael@paquier.xyz> wrote: > > Another thing that was itching me is the introduction of 3 GUCs with > one new category for the sake of two SQL functions. For VACUUM we > have many things relying on the GUC delays, with autovacuum and manual > vacuum. Perhaps it would make sense to have these if we have some day > a background worker doing checksum verifications, still that could > perfectly be in contrib/, and that would be kind of hard to tune as > well. The patch enabling checksums on-the-fly could also be a reason > good enough. Another thing we could consider is to pass down those > parameters as function arguments, at the cost of not being able to > reload them. I'm not terribly happy with adding that for now, but it's quite clear that there'll eventually be a lot of new stuff added that will benefit from either the category or the GUC. FTR once we reach an agreement on how to do this check (I'm wondering if it'll stay an SQL function or become a plain backend command, in which case GUCs would be mandatory), I'll also be happy to work on a background worker to help people running the check regularly. So in my opinion it's better to add them now so we won't have to change the SQL function definition later when other facilities rely on them.
On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote: > > I'll do some benchmarking and see if I can get some figures, but it'll probably > take some time. In the meantime I'm attaching v11 of the patch that should > address all other comments. I just realized that I forgot to update one of the Makefiles when moving the TAP test folder. v12 attached should fix that.
On Wed, Sep 09, 2020 at 03:41:30PM +0200, Julien Rouhaud wrote: > On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote: > > > > I'll do some benchmarking and see if I can get some figures, but it'll probably > > take some time. In the meantime I'm attaching v11 of the patch that should > > address all other comments. > > I just realized that I forgot to update one of the Makefiles when moving the TAP > test folder. v12 attached should fix that. And the cfbot just reported a new error for the Windows build. Attached v13 should fix that.
On Thu, Sep 10, 2020 at 09:47:23AM +0200, Julien Rouhaud wrote: > On Wed, Sep 09, 2020 at 03:41:30PM +0200, Julien Rouhaud wrote: > > On Wed, Sep 09, 2020 at 11:25:29AM +0200, Julien Rouhaud wrote: > > > > > > I'll do some benchmarking and see if I can get some figures, but it'll probably > > > take some time. In the meantime I'm attaching v11 of the patch that should > > > address all other comments. > > > > I just realized that I forgot to update one of the Makefiles when moving the TAP > > test folder. v12 attached should fix that. > > > And the cfbot just reported a new error for the Windows build. Attached v13 should > fix that. I did some benchmarking using the following environment:
- 16MB shared buffers
- 490MB table (10M rows)
- synchronize_seqscans to off
- table in OS cache
I don't have a big machine so I went with a very small shared_buffers and a small table, to make sure that all data is in the OS cache but the table is more than an order of magnitude bigger than shared_buffers, to simulate some plausible environment. I used a simple read-only query that performs a sequential scan of the table (a simple SELECT * FROM table), run using 10 concurrent connections, 5 runs of 700 seconds. I did that without any other activity, then with the original pg_check_relation function running in a loop with \watch .1, and then with a modified version of that function without the optimisation, still with \watch .1. The TPS is obviously overall extremely bad, but I can see that the submitted version added an overhead of ~3.9% (average of 5 runs), while the version without the optimisation added an overhead of ~6.57%. This is supposed to be a relatively fair benchmark as all the data is cached on the OS side, so IOs done while holding the bufmapping lock aren't too long, but we can see that we already get a non-negligible benefit from this optimisation. Should I do additional benchmarking, like dropping the OS cache and/or adding some write activity? This would probably only make the unoptimized version perform even worse.
On Thu, Sep 10, 2020 at 08:06:10PM +0200, Julien Rouhaud wrote: > The TPS is obviously overall extremely bad, but I can see that the submitted > version added an overhead of ~3.9% (average of 5 runs), while the version > without the optimisation added an overhead of ~6.57%. Okay, so that stands as a difference. I am planning to run some benchmarks on my end as well, and see if I can see a clear difference. > This is supposed to be a relatively fair benchmark as all the data are cached > on the OS side, so IO done while holding the bufmapping lock aren't too long, > but we can see that we already get a non negligible benefit from this > optimisation. Should I do additional benchmarking, like dropping the OS cache > and/or adding some write activity? This would probably only make the > unoptimized version perform even worse. It would also be interesting to see the case where the pages are not in the OS cache and see how bad it can get. For the read-write case, I am not sure, as we may have some different overheads that hide the noise. Also, did you run your tests with the functions scanning at full speed, with (ChecksumCostDelay < 0) so that there is no throttling? -- Michael
On Fri, Sep 11, 2020 at 9:34 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Thu, Sep 10, 2020 at 08:06:10PM +0200, Julien Rouhaud wrote: > > The TPS is obviously overall extremely bad, but I can see that the submitted > > version added an overhead of ~3.9% (average of 5 runs), while the version > > without the optimisation added an overhead of ~6.57%. > > Okay, so that stands as a difference. I am planning to run some > benchmarks on my end as well, and see if I can see a clear > difference. Thanks! > > This is supposed to be a relatively fair benchmark as all the data are cached > > on the OS side, so IO done while holding the bufmapping lock aren't too long, > > but we can see that we already get a non negligible benefit from this > > optimisation. Should I do additional benchmarking, like dropping the OS cache > > and/or adding some write activity? This would probably only make the > > unoptimized version perform even worse. > > It would be also interesting to see the case where the pages are not > in the OS cache and see how bad it can get. For the read-write case, > I am not sure as we may have some different overhead that hide the > noise. Also, did you run your tests with the functions scanning at > full speed, with (ChecksumCostDelay < 0) so as there is no throttling? I used all default settings, but by default checksum_cost_delay is 0 so there shouldn't be any throttling.
On Fri, Sep 11, 2020 at 09:49:16AM +0200, Julien Rouhaud wrote: > Thanks! I got some numbers out of my pocket, using the following base configuration:
wal_level = minimal
fsync = off
shared_buffers = '300MB'   # also tested with 30MB and 60MB
checksum_cost_delay = 0    # default in patch
And for the test I have used pgbench initialized at a scale of 250, to have close to 3.5GB of data, so that it gives us a sort of 90% buffer eviction, with all the data cached in the OS (we may want to look as well at the case where the OS cache does not hold all the relation pages). I have also run some tests with 30MB and 60MB of shared buffers, for similar results. I also applied some prewarming on the cluster:
create extension pg_prewarm;
select pg_prewarm(oid) from pg_class where oid > 16000;
Then, I have done five runs using that:
pgbench -S -M prepared -c 64/128/256 -n -T 60
In parallel to that, I got this stuff running all the time:
select pg_check_relation('pgbench_accounts'); \watch 0.1
Here are some TPS numbers with the execution time of pg_check_relation. In the five runs, I removed the highest and lowest ones, then took an average of the remaining three. I have also tested two modes: with and without the optimization, which requires a one-liner in checksum.c as per your latest patch:
--- a/src/backend/storage/page/checksum.c
+++ b/src/backend/storage/page/checksum.c
@@ -84,7 +84,7 @@ check_one_block(Relation relation, ForkNumber forknum, BlockNumber blkno,
                 uint16 *chk_expected, uint16 *chk_found)
 {
     char buffer[BLCKSZ];
-    bool force_lock = false;
+    bool force_lock = true;
     bool found_in_sb;
Within parentheses is the amount of time taken by pg_check_relation() for a single check. This is not completely exact and I saw some variations, just to give an impression:
Conns            64              128                 256
force_lock=true  60590 (7~8s)    55652 (10.2~10.5s)  46812 (9~10s)
force_lock=false 58637 (5s)      54131 (6~7s)        37091 (1.1~1.2s)
For connections higher than 128, I was kind of surprised to see pg_check_relation being more aggressive without the optimization, with much less contention on the buffer mapping LWLock actually, but that's an impression from looking at pg_stat_activity. Looking at the wait events for each run, I saw many more hiccups with the buffer mapping LWLock when forcing the lock rather than not; still, I was able to see some contention when also not forcing the lock. Not surprising, as this evicts a bunch of pages from shared buffers. > I used all default settings, but by default checksum_cost_delay is 0 > so there shouldn't be any throttling. Thanks, so did I. From what I can see, there could also be benefits in not using the optimization by default, so that the relation check applies some natural throttling by making the checks actually slower (there is a link between the individual runtime of pg_check_relation() and the TPS). So it also seems to me that the throttling parameters would be beneficial, but it looks to me that there is as well a point in not including any throttling in a first version if the optimization to go full speed is not around. Using three new GUCs for those function calls is still too much IMO, so there is also the argument to move all this stuff into a new contrib/ module, and have a bgworker implementation as part of it as it would need shared_preload_libraries anyway. Also, I have been putting some thoughts into the APIs able to fetch a buffer without going through the shared buffers.
Those should live neither in checksum.c, because it should be a place that those APIs depend on, including only the most internal logic for the checksum algorithm and computation, nor in checksumfuncs.c, because we need to think about the case of a background worker as well (that could spawn a set of dynamic workers connecting to different databases able to do checksum verifications?). It would be good to keep the handling of the buffer mapping lock as well as the calls to smgrread() in a single place. ReadBuffer_common() is a natural place for that, though it means for our use case the addition of three new options:
- Being able to pass down directly a buffer pointer to save the page contents.
- Being able to not verify directly a page, leaving the verification to the caller upthread.
- Addition of a new mode, that I am calling here RBM_PRIVATE, where we actually read the page from disk if not yet in shared buffers, except that we fill in the contents of the page using the pointer given by the caller. That's different from the use of local buffers, as we don't need this much complication (as for temporary tables, of course) for per-page checks.
Another idea would be to just let ReadBuffer_common() do the check by itself, with a different mode like a kind of RBM_VALIDATE, where we just return a verification state of the page that can be consumed by callers. This also comes with some more advantages:
- Tracking of reads from disk with track_io_timing.
- Addition of some private stats dedicated to this private mode, with new fields in pgBufferUsage, all in a single place.
-- Michael
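Purely as an illustration of that second idea, a caller consuming such a verification state could look like the sketch below; PageVerificationState, ReadBufferValidate() and check_block() are names invented for this example, and RBM_VALIDATE itself is only floated above, not something that exists.

    /*
     * Hypothetical sketch only: none of these names exist, they illustrate
     * what consuming a "verification state" could look like.
     */
    #include "postgres.h"
    #include "common/relpath.h"
    #include "storage/block.h"
    #include "storage/bufpage.h"
    #include "utils/rel.h"

    typedef enum PageVerificationState
    {
        PAGE_VERIFIED_OK,           /* page passed the sanity checks */
        PAGE_VERIFIED_CORRUPTED,    /* checksum or header check failed */
        PAGE_VERIFIED_SKIPPED       /* dirty in shared buffers, not checked */
    } PageVerificationState;

    extern PageVerificationState ReadBufferValidate(Relation rel,
                                                    ForkNumber forknum,
                                                    BlockNumber blkno,
                                                    char *dest);

    /* A caller in the SQL-callable check function could then simply do: */
    static void
    check_block(Relation rel, ForkNumber forknum, BlockNumber blkno)
    {
        char        pagebuf[BLCKSZ];

        if (ReadBufferValidate(rel, forknum, blkno, pagebuf) ==
            PAGE_VERIFIED_CORRUPTED)
            ereport(WARNING,
                    (errmsg("invalid page in block %u of relation \"%s\"",
                            blkno, RelationGetRelationName(rel))));
    }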
On Wed, Sep 16, 2020 at 11:46 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Sep 11, 2020 at 09:49:16AM +0200, Julien Rouhaud wrote: > > Thanks! > > I got some numbers out of my pocket, using the following base > configuration: > [...] > > Within parenthesis is the amount of time taken by pg_relation_check() > for a single check. This is not completely exact and I saw some > variations, just to give an impression: > Conns 64 128 256 > force_lock=true 60590 (7~8s) 55652 (10.2~10.5s) 46812 (9~10s) > force_lock=false 58637 (5s) 54131 (6~7s) 37091 (1.1~1.2s) > > For connections higher than 128, I was kind of surprised to see > pg_relation_check being more aggressive without the optimization, with > much less contention on the buffer mapping LWLock actually, but that's > an impression from looking at pg_stat_activity. > > Looking at the wait events for each run, I saw much more hiccups with > the buffer mapping LWLock when forcing the lock rather than not, still > I was able to see some contention when also not forcing the lock. Not > surprising as this rejects a bunch of pages from shared buffers. > > > I used all default settings, but by default checksum_cost_delay is 0 > > so there shouldn't be any throttling. > > Thanks, so did I. From what I can see, there could be as well > benefits in not using the optimization by default so as the relation > check applies some natural throttling by making the checks actually > slower (there is a link between the individual runtime of > pg_relation_time and the TPS). Thanks a lot for the tests! I'm not surprised that forcing the lock will slow down the pg_check_relation() execution, but I'm a bit surprised that holding the buffer mapping lock longer in a workload that has a lot of evictions actually makes things faster. Do you have any idea why that's the case? > So it also seems to me that the > throttling parameters would be beneficial, but it looks to me that > there is as well a point to not include any throttling in a first > version if the optimization to go full speed is not around. Using > three new GUCs for those function calls is still too much IMO I'm assuming that you prefer to remove both the optimization and the throttling part? I'll do that with the next version unless there's objections. >, so > there is also the argument to move all this stuff into a new contrib/ > module, and have a bgworker implementation as part of it as it would > need shared_preload_libraries anyway. > > Also, I have been putting some thoughts into the APIs able to fetch a > buffer without going through the shared buffers. And neither > checksum.c, because it should be a place that those APIs depends on > and include only the most-internal logic for checksum algorithm and > computation, nor checksumfuncs.c, because we need to think about the > case of a background worker as well (that could spawn a set of dynamic > workers connecting to different databases able to do checksum > verifications?). It would be good to keep the handling of the buffer > mapping lock as well as the calls to smgrread() into a single place. > ReadBuffer_common() is a natural place for that, though it means for > our use case the addition of three new options: > - Being able to pass down directly a buffer pointer to save the page > contents. > - Being able to not verify directly a page, leaving the verification > to the caller upthread. 
> - Addition of a new mode, that I am calling here RBM_PRIVATE, where we > actually read the page from disk if not yet in shared buffers, except > that we fill in the contents of the page using the pointer given by > the caller. That's different than the use of local buffers, as we > don't need this much amount of complications like temporary tables of > course for per-page checks. > > Another idea would be to actually just let ReadBuffer_common just do > the check by itself, with a different mode like a kind of > RBM_VALIDATE, where we just return a verification state of the page > that can be consumed by callers. > > This also comes with some more advantages: > - Tracking of reads from disk with track_io_timing. > - Addition of some private stats dedicated to this private mode, with > new fields in pgBufferUsage, all in a single place I agree that putting the code nearby ReadBuffer_common() would be a good idea. However, that means that I can't move all the code to contrib/, so I'm wondering what you'd like to see going there. I can see some value in also having the SQL functions available in core rather than in contrib, e.g. if you need to quickly check a relation on a standby, without requiring the extension to be created on the primary node first. Then, I'm a bit worried about adding this code to ReadBuffer_common. What this code does is quite different, and I'm afraid that it'll make ReadBuffer_common more complex than needed, which is maybe not a good idea for something as critical as this function. What do you think?
On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote: > Thanks a lot for the tests! I'm not surprised that forcing the lock > will slow down the pg_check_relation() execution, but I'm a bit > surprised that holding the buffer mapping lock longer in a workload > that has a lot of evictions actually makes things faster. Do you have > any idea why that's the case? That's still a bit unclear to me, but I have not spent much time thinking about this particular point either. > I'm assuming that you prefer to remove both the optimization and the > throttling part? I'll do that with the next version unless there's > objections. Yeah, any tests I have done tend to show that. It would be good to also check some perf profiles here, at least for the process running the relation check in a loop. > I agree that putting the code nearby ReadBuffer_common() would be a > good idea. However, that means that I can't move all the code to > contrib/ I'm wondering what you'd like to see going there. I can see > some values in also having the SQL functions available in core rather > than contrib, e.g. if you need to quickly check a relation on a > standby, so without requiring to create the extension on the primary > node first. Good point. This could make the user experience worse. > Then, I'm a bit worried about adding this code in ReadBuffer_common. > What this code does is quite different, and I'm afraid that it'll make > ReadBuffer_common more complex than needed, which is maybe not a good > idea for something as critical as this function. > > What do you think? Yeah, I have been looking at ReadBuffer_common() and it is true that it is complicated enough so we may not really need an extra mode or more options, for a final logic that is actually different from what a buffer read does: we just want to know if a page has a valid checksum or not. An idea that I got here would be to add a new, separate function to do the page check directly in bufmgr.c, but perhaps that's what you mean. Now only the prefetch routine and ReadBuffer_common use partition locks, but getting that done in the same file looks like a good compromise to me. It would also be possible to keep the BLCKSZ buffer used to check the page directly in this routine, so that any caller willing to do a check doesn't need to worry about any allocation. -- Michael
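As a rough idea of the shape such a bufmgr.c routine could take, here is a sketch under the assumptions above (SMgrRelation argument, the BLCKSZ buffer kept inside the routine); the name CheckBuffer is not settled at this point in the thread, and the in-shared-buffers branch is deliberately elided.

    /*
     * Sketch only: the shared-buffers branch (pin the buffer, skip the check
     * if it is dirty) is elided, and the function name is illustrative.
     */
    #include "postgres.h"
    #include "storage/buf_internals.h"
    #include "storage/bufpage.h"
    #include "storage/lwlock.h"
    #include "storage/smgr.h"

    bool
    CheckBuffer(SMgrRelation smgr, ForkNumber forknum, BlockNumber blkno)
    {
        PGAlignedBlock buf;
        BufferTag   tag;
        uint32      hash;
        LWLock     *partLock;
        int         buf_id;

        INIT_BUFFERTAG(tag, smgr->smgr_rnode.node, forknum, blkno);
        hash = BufTableHashCode(&tag);
        partLock = BufMappingPartitionLock(hash);

        /* Hold the mapping partition lock to prevent concurrent eviction. */
        LWLockAcquire(partLock, LW_SHARED);
        buf_id = BufTableLookup(&tag, hash);

        if (buf_id >= 0)
        {
            /*
             * The block is in shared buffers: the real routine would pin it
             * and skip the check if it is dirty, since the checkpointer will
             * rewrite it with a fresh checksum anyway.  Elided here.
             */
            LWLockRelease(partLock);
            return true;
        }

        /*
         * Not in shared buffers: read it from disk while the partition lock
         * is held, so that it cannot be concurrently loaded and modified,
         * then run the usual page sanity checks on the private copy.
         */
        smgrread(smgr, forknum, blkno, buf.data);
        LWLockRelease(partLock);

        return PageIsVerified((Page) buf.data, blkno);
    }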
On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote: > > Thanks a lot for the tests! I'm not surprised that forcing the lock > > will slow down the pg_check_relation() execution, but I'm a bit > > surprised that holding the buffer mapping lock longer in a workload > > that has a lot of evictions actually makes things faster. Do you have > > any idea why that's the case? > > That's still a bit unclear to me, but I have not spent much time > thinking about this particular point either. > > > I'm assuming that you prefer to remove both the optimization and the > > throttling part? I'll do that with the next version unless there's > > objections. > > Yeah, any tests I have done tends to show that. It would be good to > also check some perf profiles here, at least for the process running > the relation check in a loop. > > > I agree that putting the code nearby ReadBuffer_common() would be a > > good idea. However, that means that I can't move all the code to > > contrib/ I'm wondering what you'd like to see going there. I can see > > some values in also having the SQL functions available in core rather > > than contrib, e.g. if you need to quickly check a relation on a > > standby, so without requiring to create the extension on the primary > > node first. > > Good point. This could make the user experience worse. > > > Then, I'm a bit worried about adding this code in ReadBuffer_common. > > What this code does is quite different, and I'm afraid that it'll make > > ReadBuffer_common more complex than needed, which is maybe not a good > > idea for something as critical as this function. > > > > What do you think? > > Yeah, I have been looking at ReadBuffer_common() and it is true that > it is complicated enough so we may not really need an extra mode or > more options, for a final logic that is actually different than what a > buffer read does: we just want to know if a page has a valid checksum > or not. An idea that I got here would be to add a new, separate > function to do the page check directly in bufmgr.c, but that's what > you mean. Now only the prefetch routine and ReadBuffer_common use > partition locks, but getting that done in the same file looks like a > good compromise to me. It would be also possible to keep the BLCKSZ > buffer used to check the page directly in this routine, so as any > caller willing to do a check don't need to worry about any > allocation. I made all the suggested modifications in attached v14: - moved the C code in bufmgr.c nearby ReadBuffer - removed the GUC and throttling options - removed the dubious optimization All documentation and comments are updated to reflect those changes. I also split the commit in two, one for the backend infrastructure and one for the SQL wrappers.
On Thu, Oct 15, 2020 at 1:37 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote: > > > Thanks a lot for the tests! I'm not surprised that forcing the lock > > > will slow down the pg_check_relation() execution, but I'm a bit > > > surprised that holding the buffer mapping lock longer in a workload > > > that has a lot of evictions actually makes things faster. Do you have > > > any idea why that's the case? > > > > That's still a bit unclear to me, but I have not spent much time > > thinking about this particular point either. > > > > > I'm assuming that you prefer to remove both the optimization and the > > > throttling part? I'll do that with the next version unless there's > > > objections. > > > > Yeah, any tests I have done tends to show that. It would be good to > > also check some perf profiles here, at least for the process running > > the relation check in a loop. > > > > > I agree that putting the code nearby ReadBuffer_common() would be a > > > good idea. However, that means that I can't move all the code to > > > contrib/ I'm wondering what you'd like to see going there. I can see > > > some values in also having the SQL functions available in core rather > > > than contrib, e.g. if you need to quickly check a relation on a > > > standby, so without requiring to create the extension on the primary > > > node first. > > > > Good point. This could make the user experience worse. > > > > > Then, I'm a bit worried about adding this code in ReadBuffer_common. > > > What this code does is quite different, and I'm afraid that it'll make > > > ReadBuffer_common more complex than needed, which is maybe not a good > > > idea for something as critical as this function. > > > > > > What do you think? > > > > Yeah, I have been looking at ReadBuffer_common() and it is true that > > it is complicated enough so we may not really need an extra mode or > > more options, for a final logic that is actually different than what a > > buffer read does: we just want to know if a page has a valid checksum > > or not. An idea that I got here would be to add a new, separate > > function to do the page check directly in bufmgr.c, but that's what > > you mean. Now only the prefetch routine and ReadBuffer_common use > > partition locks, but getting that done in the same file looks like a > > good compromise to me. It would be also possible to keep the BLCKSZ > > buffer used to check the page directly in this routine, so as any > > caller willing to do a check don't need to worry about any > > allocation. > > I made all the suggested modifications in attached v14: > > - moved the C code in bufmgr.c nearby ReadBuffer > - removed the GUC and throttling options > - removed the dubious optimization > > All documentation and comments are updated to reflect those changes. > I also split the commit in two, one for the backend infrastructure and > one for the SQL wrappers. And I did miss a reference in the sgml documentation, fixed in v15.
On Thu, Oct 15, 2020 at 3:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Thu, Oct 15, 2020 at 1:37 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > > > On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote: > > > > Thanks a lot for the tests! I'm not surprised that forcing the lock > > > > will slow down the pg_check_relation() execution, but I'm a bit > > > > surprised that holding the buffer mapping lock longer in a workload > > > > that has a lot of evictions actually makes things faster. Do you have > > > > any idea why that's the case? > > > > > > That's still a bit unclear to me, but I have not spent much time > > > thinking about this particular point either. > > > > > > > I'm assuming that you prefer to remove both the optimization and the > > > > throttling part? I'll do that with the next version unless there's > > > > objections. > > > > > > Yeah, any tests I have done tends to show that. It would be good to > > > also check some perf profiles here, at least for the process running > > > the relation check in a loop. > > > > > > > I agree that putting the code nearby ReadBuffer_common() would be a > > > > good idea. However, that means that I can't move all the code to > > > > contrib/ I'm wondering what you'd like to see going there. I can see > > > > some values in also having the SQL functions available in core rather > > > > than contrib, e.g. if you need to quickly check a relation on a > > > > standby, so without requiring to create the extension on the primary > > > > node first. > > > > > > Good point. This could make the user experience worse. > > > > > > > Then, I'm a bit worried about adding this code in ReadBuffer_common. > > > > What this code does is quite different, and I'm afraid that it'll make > > > > ReadBuffer_common more complex than needed, which is maybe not a good > > > > idea for something as critical as this function. > > > > > > > > What do you think? > > > > > > Yeah, I have been looking at ReadBuffer_common() and it is true that > > > it is complicated enough so we may not really need an extra mode or > > > more options, for a final logic that is actually different than what a > > > buffer read does: we just want to know if a page has a valid checksum > > > or not. An idea that I got here would be to add a new, separate > > > function to do the page check directly in bufmgr.c, but that's what > > > you mean. Now only the prefetch routine and ReadBuffer_common use > > > partition locks, but getting that done in the same file looks like a > > > good compromise to me. It would be also possible to keep the BLCKSZ > > > buffer used to check the page directly in this routine, so as any > > > caller willing to do a check don't need to worry about any > > > allocation. > > > > I made all the suggested modifications in attached v14: > > > > - moved the C code in bufmgr.c nearby ReadBuffer > > - removed the GUC and throttling options > > - removed the dubious optimization > > > > All documentation and comments are updated to reflect those changes. > > I also split the commit in two, one for the backend infrastructure and > > one for the SQL wrappers. > > And I did miss a reference in the sgml documentation, fixed in v15. I forgot to add the modified file in the previous attachment, sorry for the noise.
On Fri, Oct 16, 2020 at 8:59 AM Julien Rouhaud <rjuju123@gmail.com> wrote: > > On Thu, Oct 15, 2020 at 3:59 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > On Thu, Oct 15, 2020 at 1:37 PM Julien Rouhaud <rjuju123@gmail.com> wrote: > > > > > > On Thu, Oct 1, 2020 at 1:07 PM Michael Paquier <michael@paquier.xyz> wrote: > > > > > > > > On Fri, Sep 25, 2020 at 06:11:47PM +0800, Julien Rouhaud wrote: > > > > > Thanks a lot for the tests! I'm not surprised that forcing the lock > > > > > will slow down the pg_check_relation() execution, but I'm a bit > > > > > surprised that holding the buffer mapping lock longer in a workload > > > > > that has a lot of evictions actually makes things faster. Do you have > > > > > any idea why that's the case? > > > > > > > > That's still a bit unclear to me, but I have not spent much time > > > > thinking about this particular point either. > > > > > > > > > I'm assuming that you prefer to remove both the optimization and the > > > > > throttling part? I'll do that with the next version unless there's > > > > > objections. > > > > > > > > Yeah, any tests I have done tends to show that. It would be good to > > > > also check some perf profiles here, at least for the process running > > > > the relation check in a loop. > > > > > > > > > I agree that putting the code nearby ReadBuffer_common() would be a > > > > > good idea. However, that means that I can't move all the code to > > > > > contrib/ I'm wondering what you'd like to see going there. I can see > > > > > some values in also having the SQL functions available in core rather > > > > > than contrib, e.g. if you need to quickly check a relation on a > > > > > standby, so without requiring to create the extension on the primary > > > > > node first. > > > > > > > > Good point. This could make the user experience worse. > > > > > > > > > Then, I'm a bit worried about adding this code in ReadBuffer_common. > > > > > What this code does is quite different, and I'm afraid that it'll make > > > > > ReadBuffer_common more complex than needed, which is maybe not a good > > > > > idea for something as critical as this function. > > > > > > > > > > What do you think? > > > > > > > > Yeah, I have been looking at ReadBuffer_common() and it is true that > > > > it is complicated enough so we may not really need an extra mode or > > > > more options, for a final logic that is actually different than what a > > > > buffer read does: we just want to know if a page has a valid checksum > > > > or not. An idea that I got here would be to add a new, separate > > > > function to do the page check directly in bufmgr.c, but that's what > > > > you mean. Now only the prefetch routine and ReadBuffer_common use > > > > partition locks, but getting that done in the same file looks like a > > > > good compromise to me. It would be also possible to keep the BLCKSZ > > > > buffer used to check the page directly in this routine, so as any > > > > caller willing to do a check don't need to worry about any > > > > allocation. > > > > > > I made all the suggested modifications in attached v14: > > > > > > - moved the C code in bufmgr.c nearby ReadBuffer > > > - removed the GUC and throttling options > > > - removed the dubious optimization > > > > > > All documentation and comments are updated to reflect those changes. > > > I also split the commit in two, one for the backend infrastructure and > > > one for the SQL wrappers. > > > > And I did miss a reference in the sgml documentation, fixed in v15. 
> > I forgot to add the modified file in the previous attachment, sorry > for the noise. And Michael just told me that I also missed adding one of the C files while splitting the patch into two.
On Fri, Oct 16, 2020 at 09:22:02AM +0800, Julien Rouhaud wrote: > And Michael just told me that I also missed adding one of the C files > while splitting the patch into two. + if (PageIsNew(page)) + { + /* + * Check if the page is really new or if there's corruption that + * affected PageIsNew detection. Note that PageIsVerified won't try to + * detect checksum corruption in this case, so there's no risk of + * duplicated corruption report. + */ + if (PageIsVerified(page, blkno)) + { + /* No corruption. */ + return true; + } Please note that this part of your patch overlaps with a proposal for a bug fix related to zero-only pages with the checksum verification of base backups: https://www.postgresql.org/message-id/608f3476-0598-2514-2c03-e05c7d2b0cbd@postgrespro.ru Your patch is trying to adapt itself to the existing logic we have in PageIsVerified() so that you don't get a duplicated report, as does the base backup part. Note that I cannot find in the wild any open code making use of PageIsVerified(), but I'd like to believe that it is rather handy to use for table AMs at least (?), so if we can avoid any useless ABI breakage, it may be better to have a new PageIsVerifiedExtended() able to take additional options, one to report to pgstat and one to generate this elog(WARNING). And then this patch could just make use of it? + /* + * There's corruption, but since this affects PageIsNew, we + * can't compute a checksum, so set NoComputedChecksum for the + * expected checksum. + */ + *chk_expected = NoComputedChecksum; + *chk_found = hdr->pd_checksum; + return false; [...] + /* + * This can happen if corruption makes the block appears as + * PageIsNew() but isn't a new page. + */ + if (chk_expected == NoComputedChecksum) + nulls[i++] = true; + else + values[i++] = UInt16GetDatum(chk_expected); Somewhat related to the first point, NoComputedChecksum exists because, as the current patch is shaped, we need to report an existing checksum to the user even for the zero-only case. PageIsVerified() is not that flexible, so we could change it to report a status depending on the error faced (checksum, header or zero-only) on top of getting a checksum. Now, I am not completely sure either that it is worth the complication of returning the expected checksum in the SRF of the check function. So, wouldn't it be better to just rely on PageIsVerified() (well, its rather-soon-to-be-extended flavor) for the checksum check, the header sanity check and the zero-only check? My point is to keep a single entry point for all the page sanity checks, so that base backups, your patch, and the buffer manager apply the same things. Base backups got that partially wrong because the base backup code wants to keep control of the number of failures and the error reports. Your patch actually wishes to report a failure, but you want to add more context with the fork name and such. Another option we could use here is to add an error context so that PageIsVerified() reports the WARNING, but the SQL function provides more context with the block number and the relation involved in the check. -- Michael
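A minimal sketch of how such an extended entry point could be shaped while keeping existing callers unchanged; the PIV_* flag names are only illustrative here, nothing with these names is in the tree at this point.

    /*
     * Sketch of a possible PageIsVerifiedExtended() interface, shaped like a
     * bufpage.h excerpt; the PIV_* flag names are invented for illustration.
     */
    #define PIV_LOG_WARNING     (1 << 0)    /* emit the WARNING on failure */
    #define PIV_REPORT_STAT     (1 << 1)    /* report the failure to pgstat */

    extern bool PageIsVerifiedExtended(Page page, BlockNumber blkno, int flags);

    /* The existing function keeps its historical behavior as a wrapper. */
    #define PageIsVerified(page, blkno) \
        PageIsVerifiedExtended(page, blkno, PIV_LOG_WARNING | PIV_REPORT_STAT)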
On Mon, Oct 19, 2020 at 10:39 AM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Oct 16, 2020 at 09:22:02AM +0800, Julien Rouhaud wrote: > > And Michael just told me that I also missed adding one of the C files > > while splitting the patch into two. > > + if (PageIsNew(page)) > + { > + /* > + * Check if the page is really new or if there's corruption that > + * affected PageIsNew detection. Note that PageIsVerified won't try to > + * detect checksum corruption in this case, so there's no risk of > + * duplicated corruption report. > + */ > + if (PageIsVerified(page, blkno)) > + { > + /* No corruption. */ > + return true; > + } > Please note that this part of your patch overlaps with a proposal for > a bug fix related to zero-only pages with the checksum verification of > base backups: > https://www.postgresql.org/message-id/608f3476-0598-2514-2c03-e05c7d2b0cbd@postgrespro.ru > > Your patch is trying to adapt itself to the existing logic we have in > PageIsVerified() so as you don't get a duplicated report, as does the > base backup part. Note that I cannot find in the wild any open code > making use of PageIsVerified(), but I'd like to believe that it is > rather handy to use for table AMs at least (?), so if we can avoid any > useless ABI breakage, it may be better to have a new > PageIsVerifiedExtended() able to take additional options, one to > report to pgstat and one to generate this elog(WARNING). And then > this patch could just make use of it? Indeed, that would be great. > + /* > + * There's corruption, but since this affects PageIsNew, we > + * can't compute a checksum, so set NoComputedChecksum for the > + * expected checksum. > + */ > + *chk_expected = NoComputedChecksum; > + *chk_found = hdr->pd_checksum; > + return false; > [...] > + /* > + * This can happen if corruption makes the block appears as > + * PageIsNew() but isn't a new page. > + */ > + if (chk_expected == NoComputedChecksum) > + nulls[i++] = true; > + else > + values[i++] = UInt16GetDatum(chk_expected); > Somewhat related to the first point, NoComputedChecksum exists > because, as the current patch is shaped, we need to report an existing > checksum to the user even for the zero-only case. I'm not sure that I understand your point. The current patch only returns something to users when there's a corruption. If by "zero-only case" you mean "page corrupted in a way that PageIsNew() returns true while not being all zero", then it's a corrupted page and then obviously yes it needs to be returned to users. > PageIsVerified() is > not that flexible so we could change it to report a status depending > on the error faced (checksum, header or zero-only) on top of getting a > checksum. Now, I am not completely sure either that it is worth the > complication to return in the SRF of the check function the expected > checksum. It seemed to me that it could be something useful to get with this kind of tool. You may be able to recover a corrupted page from backup/WAL if the checksum itself wasn't corrupted so that you know what to look for. There would be a lot of caveats and low level work, but if you're desperate enough that may save you a bit of time. > So, wouldn't it be better to just rely on PageIsVerified() > (well it's rather-soon-to-be extended flavor) for the checksum check, > the header sanity check and the zero-only check? My point is to keep > a single entry point for all the page sanity checks, so as base > backups, your patch, and the buffer manager apply the same things. 
> Base backups got that partially wrong because the base backup code > wants to keep control of the number of failures and the error > reports. I'm fine with it. > Your patch actually wishes to report a failure, but you want > to add more context with the fork name and such. Another option we > could use here is to add an error context so as PageIsVerified() > reports the WARNING, but the SQL function provides more context with > the block number and the relation involved in the check. Also, returning actual data rather than a bunch of warnings is way easier to process for client code. And as mentioned previously having an API that returns a list of corrupted blocks could be useful for a single-page recovery feature.
On Mon, Oct 19, 2020 at 11:16:38AM +0800, Julien Rouhaud wrote: > On Mon, Oct 19, 2020 at 10:39 AM Michael Paquier <michael@paquier.xyz> wrote: >> Somewhat related to the first point, NoComputedChecksum exists >> because, as the current patch is shaped, we need to report an existing >> checksum to the user even for the zero-only case. > > I'm not sure that I understand your point. The current patch only > returns something to users when there's a corruption. If by > "zero-only case" you mean "page corrupted in a way that PageIsNew() > returns true while not being all zero", then it's a corrupted page and > then obviously yes it needs to be returned to users. Sorry for the confusion; that previous paragraph was not clear. I meant that the reason why NoComputedChecksum exists is that we give up on attempting to calculate the checksum if we detect that the page is new, but failed the zero-only test, and that we want the users to know about this special case by setting this expected checksum to NULL for the SRF. >> So, wouldn't it be better to just rely on PageIsVerified() >> (well it's rather-soon-to-be extended flavor) for the checksum check, >> the header sanity check and the zero-only check? My point is to keep >> a single entry point for all the page sanity checks, so as base >> backups, your patch, and the buffer manager apply the same things. >> Base backups got that partially wrong because the base backup code >> wants to keep control of the number of failures and the error >> reports. > > I'm fine with it. Thanks. >> Your patch actually wishes to report a failure, but you want >> to add more context with the fork name and such. Another option we >> could use here is to add an error context so as PageIsVerified() >> reports the WARNING, but the SQL function provides more context with >> the block number and the relation involved in the check. > > Also, returning actual data rather than a bunch of warnings is way > easier to process for client code. And as mentioned previously having > an API that returns a list of corrupted blocks could be useful for a > single-page recovery feature. No issues with reporting the block number and the fork type in the SRF, of course, as this is helpful to detect the position of the broken blocks. For the checksum found in the header and the one calculated (named expected in the patch), I am less sure how to put a clear definition on top of that, but we could always consider that later and extend the SRF as needed. Once the user knows that both do not match, he/she knows that there is a problem. -- Michael
On Mon, Oct 19, 2020 at 04:52:48PM +0900, Michael Paquier wrote: > No issues with reporting the block number and the fork type in the SRF > at least of course as this is helpful to detect the position of the > broken blocks. For the checksum found in the header and the one > calculated (named expected in the patch), I am less sure how to put a > clear definition on top of that but we could always consider that > later and extend the SRF as needed. Once the user knows that both do > not match, he/she knows that there is a problem. So, I have reviewed your patch set, and heavily reworked the logic to be more consistent on many things, resulting in a largely simplified patch without sacrificing its usefulness:
- The logic of the core routine of bufmgr.c is unchanged. I have simplified it a bit though by merging the subroutines that were part of the patch. SMgrRelation is used as argument instead of a Relation. That's more consistent with the surroundings. The initial read of a page without locks is still on the table as an extra optimization, though I am not completely sure if this should be part of CheckBuffer() or not. I also thought about the previous benchmarks and I think that not using the most-optimized version improved performance, because it reduced the successive runs of the SQL functions, reducing the back-pressure on the partition locks (we held one of them at the same time for a longer period, meaning that the other 127 ran free for a longer time). Please note that this part still referred to a "module", which was incorrect.
- Removal of the expected and found checksums from the SRF of the function. Based on my recent business with the page checks for base backups, I have arrived at the conclusion that the SRF should return data that we can absolutely trust, and the minimum I think we have to trust here is if a given page is thought as safe or not, considering all the sanity checks done by PageIsVerified() as the main entry point for everything. This has led to a bit of confusion with the addition of NoComputedChecksum for a page that was empty as of the initial version of the patch, so it happens that we don't need it anymore.
- Removal of the dependency with checksums for this feature. While simplifying the code, I have noticed that this feature can also be beneficial for clusters that do not have data checksums, as PageIsVerified() is perfectly able to run some page header checks and the zero-page case. That's of course less useful than having the checksums, but there is no need to add a dependency here. The file for the SQL functions is renamed from checksumfuncs.c to pagefuncs.c.
- The function is changed to return no tuples if the relkind is not supported, and the same applies for temporary relations. That's more consistent with other system functions like the ones in charge of partition information, and this makes full scans of pg_class much easier to work with. Temporary tables were not handled correctly anyway as these are in local buffers, but the use case of this function in this case is not really obvious to me.
- Having the forknum in the SRF is kind of confusing, as the user would need to map a number with the physical on-disk name. Instead, I have made the choice to return the *path* of the corrupted file with a block number. This way, an operator can know immediately where a problem comes from. The patch does not append the segment number, and I am not sure if we should actually do that, but adding it is straightforward as we have the block number.
There is a dependency with table AMs here as well, as this goes down to fd.c, explaining why I have not added it yet. - I really don't know what kind of default ACL should apply for such functions, but I am sure that SCAN_TABLES is not what we are looking for here, and there is nothing preventing us from having a safe default from the start, so I moved the function to be superuser-only by default, and GRANT can be used to allow its execution to other roles. We could relax that in the future, of course; this can be discussed separately. - The WARNING report for each block found as corrupted gains an error context, as a result of a switch to PageIsVerified(), giving a user all the information needed in the logs on top of the result in the SRF. That's useful as well if combined with CHECK_FOR_INTERRUPTS(), and I got to wonder if we should have some progress report for this stuff, though that's a separate discussion. - The function is renamed to something less generic, pg_relation_check_pages(), and I have reduced the number of functions from two to one, where the user can specify the fork name with a new option. The default of NULL means that all the forks of a relation are checked. - The TAP tests are rather bulky. I have moved all the non-corruption test cases into a new SQL test file. That's useful for people willing to do some basic sanity checks with a non-default table AM. At least it provides a minimum coverage. I have not completely finished its review, but I have done some work. Doing some debugging of corrupt_and_test_block() was proving to be rather difficult as the same test names are assigned multiple times. I am tempted to move this test suite to src/test/recovery/ instead. - Reworked the docs and some comments. That's quite a lot of changes, and I think that most of the C code, the main tests in src/test/regress/ and the docs are getting into a rather committable state. The TAP tests emulating corruptions still need a closer look (say, check_pg_stat_database_nb_error() should have an error prefix at least). The portions in bufmgr.c and the rest should of course be split into two separate commits, that can easily be done. And the code needs an indentation run and a catalog bump. -- Michael
Attachment
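(For illustration, a minimal C sketch of the error-context mechanism mentioned in the message above for the per-block WARNING; the callback and struct names are hypothetical, not the actual patch code, and the snippet assumes the usual backend headers such as postgres.h and miscadmin.h.)

    /* Hypothetical argument carrying the context of the block being checked */
    typedef struct CheckPagesErrorInfo
    {
        char       *path;       /* on-disk path of the file being checked */
        BlockNumber blkno;      /* block currently being verified */
    } CheckPagesErrorInfo;

    /* Callback adding context to any message emitted while a block is checked */
    static void
    check_pages_error_callback(void *arg)
    {
        CheckPagesErrorInfo *info = (CheckPagesErrorInfo *) arg;

        errcontext("while checking page %u of file \"%s\"", info->blkno, info->path);
    }

    /* Inside the per-fork loop, roughly: */
    ErrorContextCallback errcallback;
    CheckPagesErrorInfo errinfo;

    errinfo.path = path;
    errinfo.blkno = blkno;
    errcallback.callback = check_pages_error_callback;
    errcallback.arg = (void *) &errinfo;
    errcallback.previous = error_context_stack;
    error_context_stack = &errcallback;

    CHECK_FOR_INTERRUPTS();
    /* ... per-block verification happens here ... */

    error_context_stack = errcallback.previous;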
On Fri, Oct 23, 2020 at 3:28 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Mon, Oct 19, 2020 at 04:52:48PM +0900, Michael Paquier wrote: > > No issues with reporting the block number and the fork type in the SRF > > at least of course as this is helpful to detect the position of the > > broken blocks. For the checksum found in the header and the one > > calculated (named expected in the patch), I am less sure how to put a > > clear definition on top of that but we could always consider that > > later and extend the SRF as needed. Once the user knows that both do > > not match, he/she knows that there is a problem. > > So, I have reviewed your patch set, and heavily reworked the logic to > be more consistent on many thinks, resulting in a largely simplified > patch without sacrificing its usefulness: Thanks! > - Removal of the dependency with checksums for this feature. While > simplifying the code, I have noticed that this feature can also be > beneficial for clusters that do not have have data checksums, as > PageIsVerified() is perfectly able to run some page header checks and > the zero-page case. That's of course less useful than having the > checksums, but there is no need to add a dependency here. The file > for the SQL functions is renamed from checksumfuncs.c to pagefuncs.c. I agree. However I'm assuming that this refactor is relying on the not yet committed patch (in the nearby thread dealing with base backup checksums check) to also refactor PageIsVerified? As all the logic you removed was done to avoid spamming a lot of warnings when calling the function. > - The function is changed to return no tuples if the relkind is not > supported, and the same applies for temporary relations. That's more > consistent with other system functions like the ones in charge of > partition information, and this makes full scans of pg_class much > easier to work with. Temporary tables were not handled correctly > anyway as these are in local buffers, but the use case of this > function in this case is not really obvious to me. Agreed > - Having the forknum in the SRF is kind of confusing, as the user > would need to map a number with the physical on-disk name. Instead, I > have made the choice to return the *path* of the corrupted file with a > block number. This way, an operator can know immediately where a > problem comes from. The patch does not append the segment number, and > I am not sure if we should actually do that, but adding it is > straight-forward as we have the block number. There is a dependency > with table AMs here as well, as this goes down to fd.c, explaining why > I have not added it and just. That's a clear improvement, thanks! > - I really don't know what kind of default ACL should apply for such > functions, but I am sure that SCAN_TABLES is not what we are looking > for here, and there is nothing preventing us from having a safe > default from the start of times, so I moved the function to be > superuser-only by default, and GRANT can be used to allow its > execution to other roles. We could relax that in the future, of > course, this can be discussed separately. I don't have a strong opinion here, SCAN_TABLES was maybe not ideal. No objections. > - The WARNING report for each block found as corrupted gains an error > context, as a result of a switch to PageIsVerified(), giving a user > all the information needed in the logs on top of the result in the > SRF. 
That's useful as well if combined with CHECK_FOR_INTERRUPTS(), > and I got to wonder if we should have some progress report for this > stuff, though that's a separate discussion. Mmm, is it really an improvement to report warnings during this function execution? Note also that PageIsVerified as-is won't report a warning if a page is found as PageIsNew() but isn't actually all zero, while still being reported as corrupted by the SRF. Have you also considered that it's possible to execute pg_relation_check_pages with ignore_checksum_failure = on? That's evidently a bad idea, but doing so would report some of the data corruption as warnings while still not reporting anything in the SRF. Having some progress report would be nice to have, but +1 to have a separate discussion for that. > - The function is renamed to something less generic, > pg_relation_check_pages(), and I have reduced the number of functions > from two to one, where the user can specify the fork name with a new > option. The default of NULL means that all the forks of a relation > are checked. Ok. > - The TAP tests are rather bulky. I have moved all the non-corruption > test cases into a new SQL test file. That's useful for people willing > to do some basic sanity checks with a non-default table AM. At least > it provides a minimum coverage. Agreed
On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote: > I agree. However I'm assuming that this refactor is relying on the > not yet committed patch (in the nearby thread dealing with base backup > checksums check) to also refactor PageIsVerified? As all the logic > you removed was done to avoid spamming a lot of warnings when calling > the function. Yeah, it should use a refactored version, but I was as well in the mood of looking at a version based on what we have now on HEAD. Even if I am not completely clear where the patch for page verification and base backups will go, I was thinking as well of doing the refactoring introducing PageIsVerifiedExtended() first, before considering the next steps for this thread. It seems to me that the path where we generate no WARNINGs at all makes the whole experience more consistent for the user with this function. > Mmm, is it really an improvement to report warnings during this > function execution? Note also that PageIsVerified as-is won't report > a warning if a page is found as PageIsNew() but isn't actually all > zero, while still being reported as corrupted by the SRF. Yep, joining the point of above to just have no WARNINGs at all. > Have you also considered that it's possible to execute > pg_relation_check_pages with ignore_checksum_failure = on? That's > evidently a bad idea, but doing so would report some of the data > corruption as warnings while still not reporting anything in the SRF. Yeah, I thought about that as well, but I did not see a strong argument against preventing this behavior either, even if it sounds a bit strange. We could always tune that later depending on the feedback. -- Michael
Attachment
On Fri, Oct 23, 2020 at 06:06:30PM +0900, Michael Paquier wrote: > On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote: >> Mmm, is it really an improvement to report warnings during this >> function execution? Note also that PageIsVerified as-is won't report >> a warning if a page is found as PageIsNew() but isn't actually all >> zero, while still being reported as corrupted by the SRF. > > Yep, joining the point of above to just have no WARNINGs at all. Now that we have d401c57, I got to consider this one more, and opted for not generating a WARNING for now. Hence, the WARNING from PageIsVerifiedExtended() is disabled, but we still report a checksum failure in it. I have spent some time reviewing the tests, as I felt they were bulky. In the reworked version attached, I have reduced the number of tests by half, without reducing the coverage, mainly: - Removed all the stderr and the return code tests, as we always expected the commands to succeed, and safe_psql() can do the job already. - Merged the queries using pg_relation_check_pages into a single routine, with the expected output (set of broken pages returned in the SRF) in the arguments. - Added some prefixes to the tests, to generate unique test names. That makes debugging easier. - The query on pg_stat_database is run once at the beginning, once at the end with the number of checksum failures correctly updated. - Added comments to document all the routines, and renamed some of them mostly for consistency. - Skipped system relations in the scan of pg_class, as they were making the test more costly for nothing. - I ran some tests on Windows, just in case. I have also added a SearchSysCacheExists1() call to check that the relation still exists before opening it, added a CHECK_FOR_INTERRUPTS() within the main check loop (where the error context is really helpful), indented the code, bumped the catalogs (mostly a self-reminder), etc. So, what do you think? -- Michael
Attachment
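(A short sketch of what the message above describes, assuming the PIV_* flags introduced by d401c57; variable names are illustrative only.)

    /*
     * Verify a copy of the block read from disk without emitting a WARNING,
     * while still bumping the checksum failure counter of pg_stat_database:
     * PIV_LOG_WARNING is intentionally left out, PIV_REPORT_STAT is kept.
     */
    if (!PageIsVerifiedExtended((Page) buf.data, blkno, PIV_REPORT_STAT))
        found_corruption = true;    /* caller adds (path, blkno) to the SRF */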
On Tue, Oct 27, 2020 at 3:07 PM Michael Paquier <michael@paquier.xyz> wrote: > > On Fri, Oct 23, 2020 at 06:06:30PM +0900, Michael Paquier wrote: > > On Fri, Oct 23, 2020 at 04:31:56PM +0800, Julien Rouhaud wrote: > >> Mmm, is it really an improvement to report warnings during this > >> function execution? Note also that PageIsVerified as-is won't report > >> a warning if a page is found as PageIsNew() but isn't actually all > >> zero, while still being reported as corrupted by the SRF. > > > > Yep, joining the point of above to just have no WARNINGs at all. > > Now that we have d401c57, I got to consider more this one, and opted > for not generating a WARNING for now. Hence, PageisVerifiedExtended() > is disabled regarding that, but we still report a checksum failure in > it. Great, that's also what I had in mind. > I have spent some time reviewing the tests, and as I felt this was > bulky. In the reworked version attached, I have reduced the number of > tests by half, without reducing the coverage, mainly: > - Removed all the stderr and the return code tests, as we always > expected the commands to succeed, and safe_psql() can do the job > already. > - Merged of the queries using pg_relation_check_pages into a single > routine, with the expected output (set of broken pages returned in the > SRF) in the arguments. > - Added some prefixes to the tests, to generate unique test names. > That makes debug easier. > - The query on pg_stat_database is run once at the beginning, once at > the end with the number of checksum failures correctly updated. > - Added comments to document all the routines, and renamed some of > them mostly for consistency. > - Skipped system relations from the scan of pg_class, making the test > more costly for nothing. > - I ran some tests on Windows, just-in-case. > > I have also added a SearchSysCacheExists1() to double-check if the > relation is missing before opening it, added a > CHECK_FOR_INTERRUPTS() within the main check loop (where the error > context is really helpful), indented the code, bumped the catalogs > (mostly a self-reminder), etc. > > So, what do you think? I think it's also worth noting that the IOLock is now acquired just before getting the buffer state, and released after the read (or after finding that the buffer is dirty). This is consistent with how it's done elsewhere, so I'm fine. Other than that I'm quite happy with the changes you made, thanks a lot!
On Tue, Oct 27, 2020 at 07:47:19PM +0800, Julien Rouhaud wrote: > I think it's also worth noting that the IOLock is now acquired just > before getting the buffer state, and released after the read (or after > finding that the buffer is dirty). This is consistent with how it's > done elsewhere, so I'm fine. Consistency is the point. This API should be safe to use by design. I have done some extra performance tests similar to what I did upthread, and this version showed similar numbers. > Other than that I'm quite happy with the changes you made, thanks a lot! Thanks for confirming. I have gone through the whole set today, split the thing into two commits and applied them. We had buildfarm member florican complain about a mistake in one of the GetDatum() calls that I took care of already, and there is nothing else on my radar. -- Michael
Attachment
From: "Shinoda, Noriyoshi (PN Japan A&PS Delivery)"
Hi, I have tested this great feature in the latest commit environment on Red Hat Enterprise Linux 7.8. I modified a few blocks in a relation file to raise a checksum error. When I executed the pg_relation_check_pages function, the backend terminated abnormally. The attached file is the operation log. Regards, Noriyoshi Shinoda -----Original Message----- From: Michael Paquier [mailto:michael@paquier.xyz] Sent: Wednesday, October 28, 2020 2:09 PM To: Julien Rouhaud <rjuju123@gmail.com> Cc: Justin Pryzby <pryzby@telsasoft.com>; Masahiko Sawada <masahiko.sawada@2ndquadrant.com>; Robert Haas <robertmhaas@gmail.com>; PostgreSQL Hackers <pgsql-hackers@lists.postgresql.org>; Masahiko Sawada <sawada.mshk@gmail.com> Subject: Re: Online checksums verification in the backend On Tue, Oct 27, 2020 at 07:47:19PM +0800, Julien Rouhaud wrote: > I think it's also worth noting that the IOLock is now acquired just > before getting the buffer state, and released after the read (or after > finding that the buffer is dirty). This is consistent with how it's > done elsewhere, so I'm fine. Consistency is the point. This API should be safe to use by design. I have done some extra performance tests similar to what I did upthread, and this version showed similar numbers. > Other than that I'm quite happy with the changes you made, thanks a lot! Thanks for confirming. I have gone through the whole set today, splitted the thing into two commits and applied them. We had buildfarm member florican complain about a mistake in one of the GetDatum() calls that I took care of already, and there is nothing else on my radar. -- Michael
Attachment
Hello, On Thu, Oct 29, 2020 at 7:52 AM Shinoda, Noriyoshi (PN Japan A&PS Delivery) <noriyoshi.shinoda@hpe.com> wrote: > > Hi, > > I have tested this great feature in the latest commit environment on Red Hat Enterprise Linux 7.8. I modified a few blocksin a relation file to raise a checksum error. When I executed the pg_relation_check_pages function, the backend terminatedabnormally. The attached file is the operation log. Thanks for the report! As far as I can see the issue is that the pfree(path) in check_relation_fork() should be outside the for loop.
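(A minimal C sketch of the pattern being described; the helper names check_one_block() and report_corrupt_block() are made up for illustration and are not the actual code of check_relation_fork().)

    char       *path = pstrdup(relpathbackend(rnode, InvalidBackendId, forknum));
    BlockNumber blkno;

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        if (!check_one_block(smgr, forknum, blkno))
        {
            report_corrupt_block(path, blkno);  /* one row in the SRF */
            pfree(path);    /* BUG: a second corrupted block in the same fork
                             * then reads freed memory and can crash the backend */
        }
    }
    /* fix: a single pfree(path) belongs here, after the loop */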
On Thu, Oct 29, 2020 at 08:12:42AM +0800, Julien Rouhaud wrote: > As far as I can see the issue is that the pfree(path) in > check_relation_fork() should be outside the for loop. Yes, this would be triggered if more than one page is found as broken in a single SRF. Fixed, thanks Shinoda-san. -- Michael
Attachment
Hi, On 2020-10-28 14:08:52 +0900, Michael Paquier wrote: > Thanks for confirming. I have gone through the whole set today, > splitted the thing into two commits and applied them. We had > buildfarm member florican complain about a mistake in one of the > GetDatum() calls that I took care of already, and there is nothing > else on my radar. The code does IO while holding the buffer mapping lock. That seems *entirely* unacceptable to me. That basically locks 1/128 of shared buffers against concurrent mapping changes, while reading data that is likely not to be on disk? Seriously? /* see if the block is in the buffer pool or not */ LWLockAcquire(partLock, LW_SHARED); buf_id = BufTableLookup(&buf_tag, buf_hash); if (buf_id >= 0) { uint32 buf_state; /* * Found it. Now, retrieve its state to know what to do with it, and * release the pin immediately. We do so to limit overhead as much as * possible. We keep the shared LWLock on the target buffer mapping * partition for now, so this buffer cannot be evicted, and we acquire * an I/O Lock on the buffer as we may need to read its contents from * disk. */ bufdesc = GetBufferDescriptor(buf_id); LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED); buf_state = LockBufHdr(bufdesc); UnlockBufHdr(bufdesc, buf_state); /* If the page is dirty or invalid, skip it */ if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0) { LWLockRelease(BufferDescriptorGetIOLock(bufdesc)); LWLockRelease(partLock); return true; } /* Read the buffer from disk, with the I/O lock still held */ smgrread(smgr, forknum, blkno, buffer); LWLockRelease(BufferDescriptorGetIOLock(bufdesc)); } else { /* * Simply read the buffer. There's no risk of modification on it as * we are holding the buffer pool partition mapping lock. */ smgrread(smgr, forknum, blkno, buffer); } The justification in the in-shared-buffers case seems to completely mis-judge costs too: * Found it. Now, retrieve its state to know what to do with it, and * release the pin immediately. We do so to limit overhead as much as * possible. We keep the shared LWLock on the target buffer mapping * partition for now, so this buffer cannot be evicted, and we acquire * an I/O Lock on the buffer as we may need to read its contents from * disk. a pin is cheap. Holding the partition lock is not. Also, using char[BLCKSZ] as a buffer isn't ok. This should use PGAlignedBlock: /* * Use this, not "char buf[BLCKSZ]", to declare a field or local variable * holding a page buffer, if that page might be accessed as a page and not * just a string of bytes. Otherwise the variable might be under-aligned, * causing problems on alignment-picky hardware. (In some places, we use * this to declare buffers even though we only pass them to read() and * write(), because copying to/from aligned buffers is usually faster than * using unaligned buffers.) We include both "double" and "int64" in the * union to ensure that the compiler knows the value must be MAXALIGN'ed * (cf. configure's computation of MAXIMUM_ALIGNOF). */ typedef union PGAlignedBlock I think this needs to be quickly reworked or reverted. Greetings, Andres Freund
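(A minimal sketch of the suggested fix for the buffer declaration; smgr, forknum and blkno are assumed from the code quoted above.)

    PGAlignedBlock buf;     /* union guaranteeing MAXALIGN'ed storage of BLCKSZ bytes */
    bool        ok;

    /* instead of "char buffer[BLCKSZ]", which may be under-aligned */
    smgrread(smgr, forknum, blkno, buf.data);
    ok = PageIsVerified((Page) buf.data, blkno);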
Hi, On 2020-10-29 11:17:29 -0700, Andres Freund wrote: > LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED); > buf_state = LockBufHdr(bufdesc); > UnlockBufHdr(bufdesc, buf_state); > > /* If the page is dirty or invalid, skip it */ > if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0) This is weird as well. What is this supposed to do? Just locking and unlocking a buffer header doesn't do squat? There's no guarantee that the flags haven't changed by this point, so you could just as well not acquire the buffer header lock. Also, why are pages without a valid tag ignored? I can follow the argument for skipping it in the DIRTY case, but that doesn't apply for BM_TAG_VALID? Greetings, Andres Freund
Hi, On 2020-10-29 11:17:29 -0700, Andres Freund wrote: > The code does IO while holding the buffer mapping lock. That seems > *entirely* unacceptable to me. That basically locks 1/128 of shared > buffers against concurrent mapping changes, while reading data that is > likely not to be on disk? Seriously? Also, uh, I don't think the locking of the buffer table provides you with the full guarantees CheckBuffer() seems to assume: * Check the state of a buffer without loading it into the shared buffers. To * avoid torn pages and possible false positives when reading data, a shared * LWLock is taken on the target buffer pool partition mapping, and we check * if the page is in shared buffers or not. An I/O lock is taken on the block * to prevent any concurrent activity from happening. this doesn't actually prevent all concurrent write IO, unless you hold an appropriate lock on the relation. There's a few places that use smgrwrite()/smgrextend() to write out data bypassing shared buffers. Maybe that isn't a problem for the uses CheckBuffer() is envisioned for, but that'd need a pretty detailed explanation as to when it's safe to use CheckBuffer() for which blocks. Greetings, Andres Freund
Hi, On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres@anarazel.de> wrote: > The code does IO while holding the buffer mapping lock. That seems > *entirely* unacceptable to me. That basically locks 1/128 of shared > buffers against concurrent mapping changes, while reading data that is > likely not to be on disk? Seriously? The initial implementation had a different approach, reading the buffer once without holding the buffer mapping lock (which could lead to some false positive in some unlikely scenario), and only if a corruption is detected the read is done once again *while holding the buffer mapping lock* to ensure it's not a false positive. Some benchmarking showed that the performance was worse, so we dropped that optimisation. Should we go back to something like that or do you have a better way to ensure a consistent read of a buffer which isn't in shared buffers? > a pin is cheap. Holding the partition lock is not. > The justification in the in-shared-buffers case seems to completely > mis-judge costs too: > * Found it. Now, retrieve its state to know what to do with it, and > * release the pin immediately. We do so to limit overhead as much as > * possible. We keep the shared LWLock on the target buffer mapping > * partition for now, so this buffer cannot be evicted, and we acquire > * an I/O Lock on the buffer as we may need to read its contents from > * disk. > a pin is cheap. Holding the partition lock is not. I clearly did a poor job in that case. Will fix. > Also, using char[BLCKSZ] as a buffer isn't ok. This should use > PGAlignedBlock: I wasn't aware of it, I will fix. > > LWLockAcquire(BufferDescriptorGetIOLock(bufdesc), LW_SHARED); > > buf_state = LockBufHdr(bufdesc); > > UnlockBufHdr(bufdesc, buf_state); > > > > /* If the page is dirty or invalid, skip it */ > > if ((buf_state & BM_DIRTY) != 0 || (buf_state & BM_TAG_VALID) == 0) > > This is weird as well. What is this supposed to do? Just locking and > unlocking a buffer header doesn't do squat? There's no guarantee that > the flags haven't changed by this point, so you could just as well not > acquire the buffer header lock. This is using the same approach as e.g. WaitIO() to get the state. I agree that the state can change after the buffer header lock has been released, but I think that's something out of scope. The only guarantee that we can give is that the database (or subset of relations checked) was healthy at the time the check was started, provided that your cluster survive the checkpoint happening after the check ended. I don't see how we can do better than that. > Also, why are pages without a valid tag ignored? I can follow the > argument for skipping it in the DIRTY case, but that doesn't apply for > BM_TAG_VALID? AFAICT pages that aren't BM_TAG_VALID are pages newly allocated. Those shouldn't be entirely initialized yet, and they'll be eventually written and flushed. > Also, uh, I don't think the locking of the buffer table provides you > with the full guarantees CheckBuffer() seems to assume: > > * Check the state of a buffer without loading it into the shared buffers. To > * avoid torn pages and possible false positives when reading data, a shared > * LWLock is taken on the target buffer pool partition mapping, and we check > * if the page is in shared buffers or not. An I/O lock is taken on the block > * to prevent any concurrent activity from happening. > > this doesn't actually prevent all concurrent write IO, unless you hold > an appropriate lock on the relation. 
> There's a few places that use > smgrwrite()/smgrextend() to write out data bypassing shared buffers. > > Maybe that isn't a problem for the uses of CheckBuffer() is envisioned > for, but that'd need a pretty detailed explanation as to when it's safe > to use CheckBuffer() for which blocks. AFAICT, concurrent smgrwrite() can only happen for init forks of unlogged relations, during creation. Those relations shouldn't be visible to the caller's snapshot, so it should be safe. I can add a comment for that if I'm not mistaken. For concurrent smgrextend(), we read the relation size at the beginning of the function, so we shouldn't read newly allocated blocks. But you're right that it's still possible to get the size that includes a newly allocated block that can be concurrently written. We can avoid that by holding a LOCKTAG_RELATION_EXTEND lock when reading the relation size. Would that be ok?
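(A sketch of the LOCKTAG_RELATION_EXTEND idea mentioned just above, using the existing lmgr helpers; whether this lock mode and scope are sufficient is exactly the open question, so take it as an assumption rather than the proposed patch.)

    /* Prevent concurrent extension while sampling the size of the fork,
     * so that every block below nblocks has been fully written once. */
    LockRelationForExtension(rel, ExclusiveLock);
    nblocks = RelationGetNumberOfBlocksInFork(rel, forknum);
    UnlockRelationForExtension(rel, ExclusiveLock);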
Hi, On 2020-10-30 10:01:08 +0800, Julien Rouhaud wrote: > On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres@anarazel.de> wrote: > > The code does IO while holding the buffer mapping lock. That seems > > *entirely* unacceptable to me. That basically locks 1/128 of shared > > buffers against concurrent mapping changes, while reading data that is > > likely not to be on disk? Seriously? > > The initial implementation had a different approach, reading the buffer once > without holding the buffer mapping lock (which could lead to some false > positive in some unlikely scenario), and only if a corruption is detected the > read is done once again *while holding the buffer mapping lock* to ensure it's > not a false positive. Some benchmarking showed that the performance was worse, > so we dropped that optimisation. Should we go back to something like that or > do you have a better way to ensure a consistent read of a buffer which isn't in > shared buffers? I suspect that you're gonna need something quite different than what the function is doing right now. Not because such a method will be faster in isolation, but because there's a chance to have it correct and not have a significant performance impact on the rest of the system. I've not thought about it in detail yet. I suspect you'll need to ensure there is a valid entry in the buffer mapping table for the buffer you're processing. By virtue of setting BM_IO_IN_PROGRESS on that entry you're going to prevent concurrent IO from starting until your part is done. > > Also, why are pages without a valid tag ignored? I can follow the > > argument for skipping it in the DIRTY case, but that doesn't apply for > > BM_TAG_VALID? > > AFAICT pages that aren't BM_TAG_VALID are pages newly allocated. > Those shouldn't > be entirely initialized yet, and they'll be eventually written and flushed. When a page is being read there's a period when the buffer is without BM_TAG_VALID. It's quite possible that the locking prevents this case from being reachable - but in that case you shouldn't just accept it as something to be skipped... > > Also, uh, I don't think the locking of the buffer table provides you > > with the full guarantees CheckBuffer() seems to assume: > > > > * Check the state of a buffer without loading it into the shared buffers. To > > * avoid torn pages and possible false positives when reading data, a shared > > * LWLock is taken on the target buffer pool partition mapping, and we check > > * if the page is in shared buffers or not. An I/O lock is taken on the block > > * to prevent any concurrent activity from happening. > > > > this doesn't actually prevent all concurrent write IO, unless you hold > > an appropriate lock on the relation. There's a few places that use > > smgrwrite()/smgrextend() to write out data bypassing shared buffers. > > > > Maybe that isn't a problem for the uses of CheckBuffer() is envisioned > > for, but that'd need a pretty detailed explanation as to when it's safe > > to use CheckBuffer() for which blocks. > > AFAICT, concurrent smgrwrite() can only happen for init forks of unlogged > relation, during creation. That may be the case in core right now, but for one, there definitely are extensions going through smgrwrite() without using the buffer pool. Essentially, what you are saying is that the introduction of CheckBuffer() altered what smgrwrite() is allowed to be used for, without having discussed or documented that.
Before this an AM/extension could just use smgrwrite() to write data not in shared buffers, as long as a locking scheme is used that prevents multiple backends from doing that at the same time (trivially: AccessExclusiveLock). > Those relations shouldn't be visible to the caller > snapshot, so it should be safe. I can add a comment for that if I'm not > mistaken. There's no comment warning that you shouldn't use CheckBuffer() to check every buffer in shared buffers, or every relfilenode on disk. The latter would be quite a reasonable thing, given it'd avoid needing to connect to every database etc. > For concurrent smgrextend(), we read the relation size at the beginning of the > function, so we shouldn't read newly allocated blocks. But you're right that > it's still possible to get the size that includes a newly allocated block > that can be concurrently written. We can avoid that be holding a > LOCKTAG_RELATION_EXTEND lock when reading the relation size. Would that be ok? That could possibly work - but currently CheckBuffer() doesn't get a relation, nor are the comments explaining that it has to be a relation in the current database or anything. I hadn't yet looked at the caller - I just started looking at CheckBuffer() this because it caused compilation failures after rebasing my aio branch onto master (there's no IO locks anymore). Looking at the caller: - This is not a concurrency safe pattern: /* Check if relation exists. leaving if there is no such relation */ if (!SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid))) return; relation = relation_open(relid, AccessShareLock); there's a pretty obvious time-to-check-time-to-use danger here. - pg_relation_check_pages()'s docs say "valid enough to safely be loaded into the server's shared buffers". I think that's overpromising by a lot. It sounds like it verifies that the page cannot cause a crash or such when accessed - but it obviously does no such thing. - Why does check_one_relation() *silently* ignore when it's being passed a temporary table, or a relkind without storage? - I don't think it's good that check_one_relation() releases relation locks after access, but I know that others think that's fine (I think it's only fine for catalog relations). - I realize permission to pg_relation_check_pages() is not granted to non-superusers by default, but shouldn't it still perform relation access checks? - why does check_relation_fork() pstrdup the path? Greetings, Andres Freund
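(For the time-of-check-to-time-of-use point raised above, a minimal sketch of the usual way to avoid it; whether the lock should instead be kept until end of transaction is the separate question also raised above.)

    /* Let the open itself cope with a concurrently dropped relation,
     * instead of a syscache existence check followed by relation_open(). */
    Relation    rel = try_relation_open(relid, AccessShareLock);

    if (rel == NULL)
        return;                 /* relation is gone, nothing to check */

    /* ... run the checks ... */
    relation_close(rel, AccessShareLock);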
On Fri, Oct 30, 2020 at 10:58 AM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2020-10-30 10:01:08 +0800, Julien Rouhaud wrote: > > On Fri, Oct 30, 2020 at 2:17 AM Andres Freund <andres@anarazel.de> wrote: > > > The code does IO while holding the buffer mapping lock. That seems > > > *entirely* unacceptable to me. That basically locks 1/128 of shared > > > buffers against concurrent mapping changes, while reading data that is > > > likely not to be on disk? Seriously? > > > > The initial implementation had a different approach, reading the buffer once > > without holding the buffer mapping lock (which could lead to some false > > positive in some unlikely scenario), and only if a corruption is detected the > > read is done once again *while holding the buffer mapping lock* to ensure it's > > not a false positive. Some benchmarking showed that the performance was worse, > > so we dropped that optimisation. Should we go back to something like that or > > do you have a better way to ensure a consistent read of a buffer which isn't in > > shared buffers? > > I suspect that you're gonna need something quite different than what the > function is doing right now. Not because such a method will be faster in > isolation, but because there's a chance to have it correct and not have > a significant performance impact onto the rest of the system. > > I've not thought about it in detail yet. Is suspect you'll need to > ensure there is a valid entry in the buffer mapping table for the buffer > you're processing. By virtue of setting BM_IO_IN_PROGRESS on that entry > you're going to prevent concurrent IO from starting until your part is > done. So I'm assuming that the previous optimization to avoid almost every time doing an IO while holding a buffer mapping lock isn't an option? In that case, I don't see any other option than reverting the patch and discussing a new approach.
On 2020-10-30 11:58:13 +0800, Julien Rouhaud wrote: > So I'm assuming that the previous optimization to avoid almost every > time doing an IO while holding a buffer mapping lock isn't an option? > In that case, I don't see any other option than reverting the patch > and discussing a new approach. I think its pretty much *never* OK to do IO while holding a buffer mapping lock. You're locking a significant fraction of shared buffers over IO. That's just not OK. Don't think there's any place doing so currently either.
On Thu, Oct 29, 2020 at 10:08:52PM -0700, Andres Freund wrote: > I think its pretty much *never* OK to do IO while holding a buffer > mapping lock. You're locking a significant fraction of shared buffers > over IO. That's just not OK. Don't think there's any place doing so > currently either. There is no place doing that on HEAD. This specific point was mentioned in the first message of this thread, 7th paragraph. That was a long thread, so it is easy to miss: https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com I am wondering what you have in mind regarding the use of BM_IO_IN_PROGRESS or a similar flag. Wouldn't that imply some consequences for other existing buffers in the table, like a possible eviction? I'd like to think that we should not do any manipulation of the buffer tables in this case. Hence, in order to prevent concurrent activity from loading into shared buffers the page currently being checked on disk, I got to think that we would need something new here, like a filtering hash table that would be checked each time a backend tries to insert an entry into the buffer tables. That's something I was wondering here: https://www.postgresql.org/message-id/20200316030638.GA2331@paquier.xyz I called that a preemptive lock, but you could also call that a discard filter or a virtual pin, just something to mean that a page locked this way cannot be loaded into the shared buffers. I'd like to think that this should not touch the existing buffer table, but it would impact the performance when attempting to insert an entry in the tables, as anything would need to be pre-checked. Assuming that we could make this thing work without holding the partition lock, and assuming that we only hold a share lock on the relation, we have two cases: 1) If the buffer is in shared buffers, we have the APIs to solve that by using a pin, unlock the partition, and then do the I/O. (Still that's unsafe with the smgrwrite() argument?) 2) If the buffer is not in shared buffers, we don't have what it takes to solve the problem yet. But even if we solve this problem, we will never really be sure that this is entirely safe, as per the argument with concurrent smgrwrite() calls. Current in-core code assumes that this can happen only for init forks of unlogged relations which would not be visible yet in the backend doing a page check, still it can be really easy to break this assumption with any new code added by a new feature. These arguments boil down to reducing the scope of CheckBuffer() as follows: - Use an AEL on the relation, pass down a Relation instead of SMgrRelation, and add on the way an assertion to make sure that the caller holds an AEL on the relation. I wanted to study the possibility of using that stuff for base backups, but if you bring the concurrent smgrwrite() calls into the set of possibilities this shuts down the whole study from the start. - It is still useful to check whether a page is in shared buffers IMO, so that if it is dirty we just discard it from the checks and rely on the next checkpoint to do a flush. It is also useful to check whether the state of the on-disk data is good or not if the page is not dirty, as the page could have gone rogue on-disk while a system was up for weeks. -- Michael
Attachment
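(A sketch of how the narrowed-down scope proposed above could be encoded; the signature is hypothetical and the assertion only illustrates the "caller must hold an AEL" restriction.)

    /* Hypothetical narrowed-down entry point: the caller must hold an AEL,
     * so no concurrent smgrwrite()/smgrextend() can touch the relation. */
    void
    CheckBuffer(Relation rel, ForkNumber forknum, BlockNumber blkno)
    {
        Assert(CheckRelationLockedByMe(rel, AccessExclusiveLock, false));

        /* ... shared-buffer lookup, then on-disk read and verification ... */
    }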
Hi, I'm a bit limited writing - one handed for a while following an injury on Friday... On 2020-11-02 10:05:25 +0900, Michael Paquier wrote: > On Thu, Oct 29, 2020 at 10:08:52PM -0700, Andres Freund wrote: > > I think its pretty much *never* OK to do IO while holding a buffer > > mapping lock. You're locking a significant fraction of shared buffers > > over IO. That's just not OK. Don't think there's any place doing so > > currently either. > > There is no place doing that on HEAD. Err? /* see if the block is in the buffer pool or not */ LWLockAcquire(partLock, LW_SHARED); buf_id = BufTableLookup(&buf_tag, buf_hash); if (buf_id >= 0) { ... /* Read the buffer from disk, with the I/O lock still held */ smgrread(smgr, forknum, blkno, buffer); LWLockRelease(BufferDescriptorGetIOLock(bufdesc)); } else { /* * Simply read the buffer. There's no risk of modification on it as * we are holding the buffer pool partition mapping lock. */ smgrread(smgr, forknum, blkno, buffer); } /* buffer lookup done, so now do its check */ LWLockRelease(partLock); How is this not doing IO while holding a buffer mapping lock? > This specific point was mentioned in the first message of this thread, > 7th paragraph. That was a long thread, so it is easy to miss: > https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com The code clearly doesnt implement it that way. > I am wondering what you have in mind regarding the use of > BM_IO_IN_PROGRESS or a similar flag. Wouldn't that imply some > consequences for other existing buffers in the table, like a possible > eviction? You'd need exactly one empty buffer for that - it can be reused for the next to-be-checked buffer. > I'd like to think that we should not do any manipulation of > the buffer tables in this case. Why? Its the way we lock buffers - why is this so special that we need to do differently? > Hence, in order to prevent a > concurrent activity to load in shared buffers the page currently > checked on disk, I got to think that we would need something new here, > like a filtering hash table that would be checked each time a backend > tries to insert an entry into the buffer tables. Thats going to slow down everything a bit - the mapping already is a bottleneck. > 1) If the buffer is in shared buffers, we have the APIs to solve that > by using a pin, unlock the partition, and then do the I/O. (Still > that's unsafe with the smgrwrite() argument?) Thats why you need an appropriate relation lock... Something CheckBuffer didnt bother to document. Its a restriction, but one we probably can live with. > 2) If the buffer is not in shared buffers, we don't have what it takes > to solve the problem yet. We do. Set up enough state for the case to be otherwise the same as the in s_b case. > But even if we solve this problem, we will > never really be sure that this is entirely safe, as per the argument > with concurrent smgrwrite() calls. Current in-core code assumes that > this can happen only for init forks of unlogged relations which would > not be visible yet in the backend doing a page check, still it can be > really easy to break this assumption with any new code added by a new > feature. It also happens in a few other cases than just init forks. But visibility & relation locking can take care of that. But you need to document that. If the locking allows concurent readers - and especially concurrent writers, then you cant really use smgrwite for anything but relation extension. Greetings, Andres Freund
On Sun, Nov 01, 2020 at 05:39:40PM -0800, Andres Freund wrote: > I'm a bit limited writing - one handed for a while following an injury > on Friday... Oops. Take care. > On 2020-11-02 10:05:25 +0900, Michael Paquier wrote: > > There is no place doing that on HEAD. > > Err? > > /* see if the block is in the buffer pool or not */ > LWLockAcquire(partLock, LW_SHARED); > buf_id = BufTableLookup(&buf_tag, buf_hash); > > [...] > > How is this not doing IO while holding a buffer mapping lock? Well, other than the one we are discussing of course :) > > > > This specific point was mentioned in the first message of this thread, > > 7th paragraph. That was a long thread, so it is easy to miss: > > https://www.postgresql.org/message-id/CAOBaU_aVvMjQn=ge5qPiJOPMmOj5=ii3st5Q0Y+WuLML5sR17w@mail.gmail.com > > The code clearly doesnt implement it that way. > > > > I am wondering what you have in mind regarding the use of > > BM_IO_IN_PROGRESS or a similar flag. Wouldn't that imply some > > consequences for other existing buffers in the table, like a possible > > eviction? > > You'd need exactly one empty buffer for that - it can be reused for the > next to-be-checked buffer. > > > > I'd like to think that we should not do any manipulation of > > the buffer tables in this case. > > Why? Its the way we lock buffers - why is this so special that we need > to do differently? > > > > Hence, in order to prevent a > > concurrent activity to load in shared buffers the page currently > > checked on disk, I got to think that we would need something new here, > > like a filtering hash table that would be checked each time a backend > > tries to insert an entry into the buffer tables. > > Thats going to slow down everything a bit - the mapping already is a > bottleneck. > > > > 1) If the buffer is in shared buffers, we have the APIs to solve that > > by using a pin, unlock the partition, and then do the I/O. (Still > > that's unsafe with the smgrwrite() argument?) > > Thats why you need an appropriate relation lock... Something CheckBuffer > didnt bother to document. Its a restriction, but one we probably can > live with. > > > > 2) If the buffer is not in shared buffers, we don't have what it takes > > to solve the problem yet. > > We do. Set up enough state for the case to be otherwise the same as the > in s_b case. > > > But even if we solve this problem, we will > > never really be sure that this is entirely safe, as per the argument > > with concurrent smgrwrite() calls. Current in-core code assumes that > > this can happen only for init forks of unlogged relations which would > > not be visible yet in the backend doing a page check, still it can be > > really easy to break this assumption with any new code added by a new > > feature. > > It also happens in a few other cases than just init forks. But > visibility & relation locking can take care of that. But you need to > document that. If the locking allows concurent readers - and especially > concurrent writers, then you cant really use smgrwite for anything but > relation extension. -- Michael
Attachment
Hi On 2020-11-02 10:45:00 +0900, Michael Paquier wrote: > On Sun, Nov 01, 2020 at 05:39:40PM -0800, Andres Freund wrote: > > I'm a bit limited writing - one handed for a while following an injury > > on Friday... > > Oops. Take care. Thanks! > > On 2020-11-02 10:05:25 +0900, Michael Paquier wrote: > > > There is no place doing that on HEAD. > > > > Err? > > How is this not doing IO while holding a buffer mapping lock? > > Well, other than the one we are discussing of course :) I am not following? Were you just confirming that its not a thing we do? Greetings, Andres Freund
On Sun, Nov 01, 2020 at 05:50:06PM -0800, Andres Freund wrote: > On 2020-11-02 10:45:00 +0900, Michael Paquier wrote: > > On 2020-11-02 10:05:25 +0900, Michael Paquier wrote: > > > > There is no place doing that on HEAD. > > > > > > Err? > > > How is this not doing IO while holding a buffer mapping lock? > > > > Well, other than the one we are discussing of course :) > > I am not following? Were you just confirming that its not a thing we do? I meant that this is not done in any place other than the one introduced by c780a7a. So we have one place where it happens, and no places before c780a7a. -- Michael
Attachment
On Thu, Oct 29, 2020 at 2:17 PM Andres Freund <andres@anarazel.de> wrote: > I think this needs to be quickly reworked or reverted. I don't like this patch, either. In addition to what Andres mentioned, CheckBuffer() is a completely-special case mechanism which can't be reused by anything else. In particular, the amcheck for heap stuff which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1) would really like a way to examine a buffer without risking an error if PageIsVerified() should happen to fail, but this patch is of absolutely no use in getting there, because CheckBuffer() doesn't give the caller any way to access the contents of the buffer. It can only do the checks that it knows how to do, and that's it. That doesn't seem like a good design. I don't like the fact that CheckBuffer() silently skips dirty buffers, either. The comment should really say that it checks the state of a buffer without loading it into shared buffers, except sometimes it doesn't actually check it. That doesn't seem like the behavior users really want, and it's not clear that there is any really good reason for it. If the buffer is in shared buffers, we could take a share-lock on the buffer and copy the contents of the page as it exists on disk, and then still check it. It feels really confusing to me that the user-exposed function here is called pg_relation_check_pages(). How is the user supposed to understand the difference between what this function does and what the new verify_heapam() in amcheck does? The answer is that the latter does far more extensive checks, but this isn't obvious from the SGML documentation, which says only that the blocks are "verified," as if an end-user can reasonably be expected to know what that means. It seems likely to lead users to the belief that if this function passes, they are in good shape, which is extremely far from being true. Just look at what PageIsVerified() checks compared to what verify_heapam() checks. In fact, I would argue that this functionality ought to live in amcheck rather than core, though there could usefully be enabling functions in core. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
Hi, On 2020-11-02 12:35:30 -0500, Robert Haas wrote: > On Thu, Oct 29, 2020 at 2:17 PM Andres Freund <andres@anarazel.de> wrote: > > I think this needs to be quickly reworked or reverted. I think it's fairly clear by now that revert is appropriate for now. > I don't like this patch, either. In addition to what Andres mentioned, > CheckBuffer() is a completely-special case mechanism which can't be > reused by anything else. In particular, the amcheck for heap stuff > which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1) > would really like a way to examine a buffer without risking an error > if PageIsVerified() should happen to fail, but this patch is of > absolutely no use in getting there, because CheckBuffer() doesn't give > the caller any way to access the contents of the buffer. It can only > do the checks that it knows how to do, and that's it. That doesn't > seem like a good design. Wouldn't this be better served by having a ReadBufferExtended() flag, preventing erroring out and zeroing the buffer? I'm not sure that handling both the case where the buffer contents need to be valid and the one where it doesn't will make for a good API. > I don't like the fact that CheckBuffer() silently skips dirty buffers, > either. The comment should really say that it checks the state of a > buffer without loading it into shared buffers, except sometimes it > doesn't actually check it. Yea, I don't see a good reason for that either. There's reasons for dirty buffers that aren't WAL logged - so if the on-disk page is broken, a standby taken outside pg_basebackup would possibly still end up with a corrupt on-disk page. Similar with a crash restart. > If the buffer is in shared buffers, we could take a share-lock > on the buffer and copy the contents of the page as it exists on disk, > and then still check it. Don't think we need a share lock. That still allows the buffer to be written out (and thus a torn read). What we need is to set BM_IO_IN_PROGRESS on the buffer in question - only one backend can set that. And then unset that again, without unsetting BM_DIRTY/BM_JUST_DIRTIED. > It feels really confusing to me that the user-exposed function here is > called pg_relation_check_pages(). How is the user supposed to > understand the difference between what this function does and what the > new verify_heapam() in amcheck does? The answer is that the latter > does far more extensive checks, but this isn't obvious from the SGML > documentation, which says only that the blocks are "verified," as if > an end-user can reasonably be expected to know what that means. It > seems likely to lead users to the belief that if this function passes, > they are in good shape, which is extremely far from being true. Just > look at what PageIsVerified() checks compared to what verify_heapam() > checks. Yea I had similar thoughts, it should just be called pg_checksum_verify_relation() or something. > In fact, I would argue that this functionality ought to live in > amcheck rather than core, though there could usefully be enabling > functions in core. I'm not really convinced by this though. It's not really AM specific - works for all types of relations with storage; don't really object either... Greetings, Andres Freund
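(For reference, the closest existing knob to the ReadBufferExtended() idea above is the RBM_ZERO_ON_ERROR mode, sketched below; on a verification failure it logs a WARNING and admits an all-zeroes page into shared buffers, which is part of what is being debated here, so this is a point of comparison rather than a proposed solution.)

    Buffer  buf = ReadBufferExtended(rel, MAIN_FORKNUM, blkno,
                                     RBM_ZERO_ON_ERROR, NULL);

    /* A zeroed page here may mean "failed verification" or "genuinely new",
     * which is why a mode that neither errors nor zeroes looks attractive. */
    ReleaseBuffer(buf);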
On Mon, Nov 2, 2020 at 8:35 PM Andres Freund <andres@anarazel.de> wrote: > > Hi, > > On 2020-11-02 12:35:30 -0500, Robert Haas wrote: > > It feels really confusing to me that the user-exposed function here is > > called pg_relation_check_pages(). How is the user supposed to > > understand the difference between what this function does and what the > > new verify_heapam() in amcheck does? The answer is that the latter > > does far more extensive checks, but this isn't obvious from the SGML > > documentation, which says only that the blocks are "verified," as if > > an end-user can reasonably be expected to know what that means. It > > seems likely to lead users to the belief that if this function passes, > > they are in good shape, which is extremely far from being true. Just > > look at what PageIsVerified() checks compared to what verify_heapam() > > checks. > > Yea I had similar thoughts, it should just be called > pg_checksum_verify_relation() or something. > +1. > > In fact, I would argue that this functionality ought to live in > > amcheck rather than core, though there could usefully be enabling > > functions in core. > > I'm not really convinced by this though. It's not really AM > specific - works for all types of relations with storage; don't really > object either... Yeah, I'm not sure about that one either. Also what would happen if/when we get checksums on things that aren't even relations? (though maybe that goes for other parts of amcheck at some point as well?) -- Magnus Hagander Me: https://www.hagander.net/ Work: https://www.redpill-linpro.com/
On Tue, Nov 03, 2020 at 09:31:20AM +0100, Magnus Hagander wrote: > On Mon, Nov 2, 2020 at 8:35 PM Andres Freund <andres@anarazel.de> wrote: >> On 2020-11-02 12:35:30 -0500, Robert Haas wrote: >>> It feels really confusing to me that the user-exposed function here is >>> called pg_relation_check_pages(). How is the user supposed to >>> understand the difference between what this function does and what the >>> new verify_heapam() in amcheck does? The answer is that the latter >>> does far more extensive checks, but this isn't obvious from the SGML >>> documentation, which says only that the blocks are "verified," as if >>> an end-user can reasonably be expected to know what that means. It >>> seems likely to lead users to the belief that if this function passes, >>> they are in good shape, which is extremely far from being true. Just >>> look at what PageIsVerified() checks compared to what verify_heapam() >>> checks. The checks of verify_heapam() are much wider, as it targets only one AM, while this stuff should remain more general. There seems to be some overlap in terms of the basic checks done by bufmgr.c, and you may not want to be too intrusive with the existing buffer pool when running the AM checks. It also seems to me that the use cases are quite different for both: the original goal of this thread is to detect physical corruption for all AMs, while verify_heapam() looks for logical corruption in the way heap is handled. >> Yea I had similar thoughts, it should just be called >> pg_checksum_verify_relation() or something. > > +1. I mentioned that upthread: is there really a dependency on checksums here? There are two cases where we can still apply some checks on a page, without any need of checksums: - The state of the page header. - Zeroed page if pd_upper is 0. Those pages are valid, and don't have a checksum computed. So it seems to me that when it comes to relation pages, then the check of a page should answer the question: is this page loadable into shared buffers, or not? >>> In fact, I would argue that this functionality ought to live in >>> amcheck rather than core, though there could usefully be enabling >>> functions in core. >> >> I'm not really convinced by this though. It's not really AM >> specific - works for all types of relations with storage; don't really >> object either... > > Yeah, I'm not sure about that one either. Also what would happen > if/when we get checksums on things that aren't even relations? (though > maybe that goes for other parts of amcheck at some point as well?) I also thought about amcheck when looking at this thread, but it did not seem the right place as this applies to any AM able to load stuff into the shared buffers. -- Michael
Attachment
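(A simplified paraphrase in C of the checksum-independent checks mentioned above, roughly what PageIsVerified() applies to the page header and to the zero-page case; this is not the exact bufpage.c code.)

    PageHeader  p = (PageHeader) page;
    bool        header_sane;
    bool        all_zeroes = true;
    size_t     *pagewords = (size_t *) page;
    int         i;

    /* Header fields must be self-consistent, checksums or not. */
    header_sane = (p->pd_flags & ~PD_VALID_FLAG_BITS) == 0 &&
        p->pd_lower <= p->pd_upper &&
        p->pd_upper <= p->pd_special &&
        p->pd_special <= BLCKSZ &&
        p->pd_special == MAXALIGN(p->pd_special);

    /* A "new" page (pd_upper == 0) is only acceptable if it is all zeroes. */
    for (i = 0; i < BLCKSZ / sizeof(size_t); i++)
    {
        if (pagewords[i] != 0)
        {
            all_zeroes = false;
            break;
        }
    }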
On Mon, Nov 02, 2020 at 11:34:57AM -0800, Andres Freund wrote: > On 2020-11-02 12:35:30 -0500, Robert Haas wrote: >> On Thu, Oct 29, 2020 at 2:17 PM Andres Freund <andres@anarazel.de> wrote: >>> I think this needs to be quickly reworked or reverted. > > I think it's fairly clear by now that revert is appropriate for now. Yep, that's clear. I'll deal with that tomorrow. That's more than a simple rework. >> I don't like this patch, either. In addition to what Andres mentioned, >> CheckBuffer() is a completely-special case mechanism which can't be >> reused by anything else. In particular, the amcheck for heap stuff >> which I recently committed (866e24d47db1743dfcff5bd595b57e3a143f2cb1) >> would really like a way to examine a buffer without risking an error >> if PageIsVerified() should happen to fail, but this patch is of >> absolutely no use in getting there, because CheckBuffer() doesn't give >> the caller any way to access the contents of the buffer. It can only >> do the checks that it knows how to do, and that's it. That doesn't >> seem like a good design. > > Wouldn't this be better served by having a ReadBufferExtended() flag, > preventing erroring out and zeroing the buffer? I'm not sure that > handling both the case where the buffer contents need to be valid and > the one where it doesn't will make for a good API. If you grep for ReadBuffer_common() in some of the emails I sent.. I was rather interested in something like that. >> I don't like the fact that CheckBuffer() silently skips dirty buffers, >> either. The comment should really say that it checks the state of a >> buffer without loading it into shared buffers, except sometimes it >> doesn't actually check it. > > Yea, I don't see a good reason for that either. There's reasons for > dirty buffers that aren't WAL logged - so if the on-disk page is broken, > a standby taken outside pg_basebackup would possibly still end up with a > corrupt on-disk page. Similar with a crash restart. Er, if you don't skip dirty buffers, wouldn't you actually report some pages as broken if attempting to run those checks on a standby that may have some torn pages from a previous base backup? You could still run into problems post-promotion, until the first checkpoint post-recovery happens, no? >> If the buffer is in shared buffers, we could take a share-lock >> on the buffer and copy the contents of the page as it exists on disk, >> and then still check it. > > Don't think we need a share lock. That still allows the buffer to be > written out (and thus a torn read). What we need is to set > BM_IO_IN_PROGRESS on the buffer in question - only one backend can set > that. And then unset that again, without unsetting > BM_DIRTY/BM_JUST_DIRTIED. If that can work, we could make use of some of that for base backups for a single retry of a page that initially failed. -- Michael
Attachment
On Mon, Nov 2, 2020 at 2:35 PM Andres Freund <andres@anarazel.de> wrote: > Wouldn't this be better served by having a ReadBufferExtended() flag, > preventing erroring out and zeroing the buffer? I'm not sure that > handling both the case where the buffer contents need to be valid and > the one where it doesn't will make for a good API. I'm not sure. The goal I had in mind was giving a caller a way to get a copy of a buffer even if it's one we wouldn't normally admit into shared_buffers. I think it's probably a bad idea to allow for a back door where things that fail PageIsVerified() can nevertheless escape into the buffer, but that doesn't mean a checker or recovery tool shouldn't be allowed to see them. > > If the buffer is in shared buffers, we could take a share-lock > > on the buffer and copy the contents of the page as it exists on disk, > > and then still check it. > > Don't think we need a share lock. That still allows the buffer to be > written out (and thus a torn read). What we need is to set > BM_IO_IN_PROGRESS on the buffer in question - only one backend can set > that. And then unset that again, without unsetting > BM_DIRTY/BM_JUST_DIRTIED. OK. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Tue, Nov 03, 2020 at 06:46:12PM +0900, Michael Paquier wrote: > Yep, that's clear. I'll deal with that tomorrow. That's more than a > simple rework. This part is now done as of e152506a. -- Michael