Thread: Online enabling of checksums
Once more, here is an attempt to solve the problem of online enabling of checksums, which Daniel and I have been hacking on for a while. See for example https://www.postgresql.org/message-id/CABUevEx8KWhZE_XkZQpzEkZypZmBp3GbM9W90JLp%3D-7OJWBbcg%40mail.gmail.com and https://www.postgresql.org/message-id/flat/FF393672-5608-46D6-9224-6620EC532693%40endpoint.com#FF393672-5608-46D6-9224-6620EC532693@endpoint.com for some previous discussions.
Base design:
Change the checksum flag from a plain on/off boolean to a three-state enum: off/inprogress/on. When checksums are off or on, they work like today. When checksums are in progress, checksums are *written* but not verified. The state can go from “off” to “inprogress”, from “inprogress” to either “on” or “off”, or from “on” to “off”.
Two new functions are added, pg_enable_data_checksums() and pg_disable_data_checksums(). The disable one is easy -- it just changes the state to off. The enable one changes the state to inprogress and then starts a background worker (the “checksumhelper launcher”). This worker in turn starts one sub-worker (the “checksumhelper worker”) in each database (currently all done sequentially). The sub-worker enumerates all tables/indexes/etc in the database and validates their checksums. If there is no checksum, or the checksum is incorrect, it computes a new checksum and writes it out. When all databases have been processed, the checksum state changes to “on” and the launcher shuts down. At this point, the cluster has checksums enabled just as if it had been initdb’d with checksums turned on.
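In SQL terms the flow is roughly the following (a sketch based on the description above; the exact value the data_checksums setting reports while converting is our assumption):

    -- start converting the cluster to checksums, online
    SELECT pg_enable_data_checksums();

    -- the cluster-wide state is exposed as a read-only setting
    SHOW data_checksums;   -- 'inprogress' (assumed spelling) while the
                           -- workers run, 'on' once every database is done

    -- change your mind, or abort a conversion
    SELECT pg_disable_data_checksums();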
If the cluster shuts down while “inprogress”, the DBA will have to manually either restart the worker (by calling pg_enable_data_checksums() again) or turn checksums off again, since the “inprogress” state carries all of the cost of checksums and none of the benefit.
The change of the checksum state is WAL-logged with a new xlog record. All buffers written by the background worker have full-page writes forced, to make sure the checksum is fully updated on the standby even if the actual contents of the buffer did not change.
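In rough C terms, the per-block step in the worker looks something like this sketch (simplified; the exact locking and the already-valid-checksum test in the actual patch may differ, and page_has_valid_checksum() is a made-up helper):

    for (blkno = 0; blkno < nblocks; blkno++)
    {
        Buffer  buf = ReadBufferExtended(rel, forknum, blkno,
                                         RBM_NORMAL, strategy);

        LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

        /* Pages that already carry a correct checksum can be skipped. */
        if (!page_has_valid_checksum(BufferGetPage(buf)))
        {
            /* Dirty the buffer and emit a full-page image, so the
             * standby gets the freshly checksummed page even if nothing
             * else in it changed. */
            START_CRIT_SECTION();
            MarkBufferDirty(buf);
            log_newpage_buffer(buf, false);
            END_CRIT_SECTION();
        }

        UnlockReleaseBuffer(buf);
    }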
We’ve also included a small command-line tool, bin/pg_verify_checksums, that can be run against an offline cluster to validate all checksums. Future improvements include being able to use the background worker/launcher to perform an online check as well. Being able to run more parallel workers in the checksumhelper might also be of interest.
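For example, against a stopped cluster:

    $ pg_verify_checksums -D /path/to/datadir

It reports the number of files and blocks scanned, and any bad checksums found.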
The patch includes two sets of tests: an isolation test turning on checksums while one session is writing to the cluster and another is continuously reading, to simulate turning on checksums in a production database, and a TAP test which enables checksums with streaming replication turned on, to test the new xlog record. The isolation test ran into the 1024-character limit of the isolation test lexer; a separate patch and discussion are at https://www.postgresql.org/message-id/8D628BE4-6606-4FF6-A3FF-8B2B0E9B43D0@yesql.se
On 2/21/18 15:53, Magnus Hagander wrote:
> Two new functions are added, pg_enable_data_checksums() and
> pg_disable_data_checksums(). The disable one is easy -- it just changes
> to disable. The enable one will change the state to inprogress, and then
> start a background worker (the “checksumhelper launcher”). This worker
> in turn will start one sub-worker (“checksumhelper worker”) in each
> database (currently all done sequentially).

This is at least the fourth version of the pattern launcher plus worker background workers. I wonder whether we can do something to make this easier and less repetitive. Not in this patch, of course.

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 22 Feb 2018, at 18:22, Peter Eisentraut <peter.eisentraut@2ndquadrant.com> wrote:
> On 2/21/18 15:53, Magnus Hagander wrote:
>> Two new functions are added, pg_enable_data_checksums() and
>> pg_disable_data_checksums(). The disable one is easy -- it just changes
>> to disable. The enable one will change the state to inprogress, and then
>> start a background worker (the “checksumhelper launcher”). This worker
>> in turn will start one sub-worker (“checksumhelper worker”) in each
>> database (currently all done sequentially).
>
> This is at least the fourth version of the pattern launcher plus worker
> background workers. I wonder whether we can do something to make this
> easier and less repetitive. Not in this patch, of course.

Hello, Magnus, Peter!

I'm excited that this feature emerged, thanks for the patch. Hope it will help to fix some mistakes made during initdb a long time ago...

Peter, can I ask for some pointers in searching for previous versions? I want to review this patch, and some code comparison could be handy... So far I've found only this [0,1] (without code) and the threads mentioned by Magnus [2,3]. Or do you mean extracting "worker+launcher" for reuse for other purposes?
On 2018-02-22 08:22:48 -0500, Peter Eisentraut wrote:
> On 2/21/18 15:53, Magnus Hagander wrote:
> > *Two new functions are added, pg_enable_data_checksums() and
> > pg_disable_data_checksums(). The disable one is easy -- it just changes
> > to disable. The enable one will change the state to inprogress, and then
> > start a background worker (the “checksumhelper launcher”). This worker
> > in turn will start one sub-worker (“checksumhelper worker”) in each
> > database (currently all done sequentially).*
>
> This is at least the fourth version of the pattern launcher plus worker
> background workers. I wonder whether we can do something to make this
> easier and less repetitive. Not in this patch, of course.
I suspect I'm going to get some grief for this, but I think the time has
come to bite the bullet and support changing databases in the same
process...

Greetings,

Andres Freund
On 2018-02-22 20:30:52 +0100, Magnus Hagander wrote:
> On Thu, Feb 22, 2018 at 8:24 PM, Andres Freund <andres@anarazel.de> wrote:
> > I suspect I'm going to get some grief for this, but I think the time has
> > come to bite the bullet and support changing databases in the same
> > process...
> >
>
> Hey, I can't even see the goalposts anymore :P
Hah. I vote for making this a hard requirement :P
> Are you saying this should be done *in general*, or specifically for
> background workers? I'm assuming you mean the general case?
I'd say bgworkers first. It's a lot clearer how to exactly do it
there. Refactoring the mainloop handling in PostgresMain() would be a
bigger task.
> That would be very useful, but is probably a fairly non-trivial task
> (TM).
I'm not actually that sure it is. We have nearly all the code, I
think. Syscache inval, ProcKill(), and then you're nearly ready to do
the normal connection dance again.

Greetings,

Andres Freund
On 2/22/18 12:38, Magnus Hagander wrote:
> I'm not entirely sure which the other ones are. Autovacuum obviously
> is one, which doesn't use the worker infrastructure. But I'm not sure
> which the others are referring to?

autovacuum, subscription workers, auto prewarm

--
Peter Eisentraut  http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On February 22, 2018 11:44:17 AM PST, Magnus Hagander <magnus@hagander.net> wrote:
>On Thu, Feb 22, 2018 at 8:41 PM, Andres Freund <andres@anarazel.de>
>wrote:
>In this particular case that would at least phase 1 simplify it because
>we'd only need one process instead of worker/launcher. However, if we'd
>ever want to parallellize it -- or any other process of the style, like
>autovacuum -- you'd still need a launcher+worker combo. So making that
>particular scenario simpler might be worthwhile on it's own.
Why is that needed? You can just start two bgworkers and process a list of items stored in shared memory. Or even just check, I assume there'd be a catalog flag somewhere, whether a database / table / object of granularity has already been processed, and use locking to prevent concurrent access.

Andres
Hi,
On 2018-02-22 21:16:02 +0100, Magnus Hagander wrote:
> You could do that, but then you've moving the complexity to managing that
> list in shared memory instead.
Maybe I'm missing something, but how are you going to get quick parallel
processing if you don't have a shmem piece? You can't assign one
database per worker because commonly there's only one database. You
don't want to start/stop a worker for each relation because that'd be
extremely slow for databases with a lot of tables. Without shmem you
can't pass more than an oid to a bgworker. To me the combination of
these things imply that you need some other synchronization mechanism
*anyway*.
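(To illustrate the constraint: registering a dynamic bgworker hands over exactly one Datum. A sketch, not the patch's code -- the entry point name is made up:)

    BackgroundWorker        worker;
    BackgroundWorkerHandle *handle;

    memset(&worker, 0, sizeof(worker));
    worker.bgw_flags = BGWORKER_SHMEM_ACCESS |
        BGWORKER_BACKEND_DATABASE_CONNECTION;
    worker.bgw_start_time = BgWorkerStart_RecoveryFinished;
    worker.bgw_restart_time = BGW_NEVER_RESTART;
    snprintf(worker.bgw_library_name, BGW_MAXLEN, "postgres");
    snprintf(worker.bgw_function_name, BGW_MAXLEN, "ChecksumHelperWorkerMain");
    snprintf(worker.bgw_name, BGW_MAXLEN, "checksumhelper worker");
    worker.bgw_main_arg = ObjectIdGetDatum(dboid);  /* the single Datum */
    worker.bgw_notify_pid = MyProcPid;

    if (!RegisterDynamicBackgroundWorker(&worker, &handle))
        elog(ERROR, "could not start checksumhelper worker");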
> I'm not sure that's any easier... And
> certainly adding a catalog flag for a usecase like this one is not making
> it easier.
Hm, I imagined you'd need that anyway. Imagine a 10TB database that's
online converted to checksums. I assume you'd not want to reread 9TB if
you crash after processing most of the cluster already?

Regards,

Andres Freund
On Thu, Feb 22, 2018 at 11:24:37AM -0800, Andres Freund wrote:
> I suspect I'm going to get some grief for this, but I think the time has
> come to bite the bullet and support changing databases in the same
> process...

I'd like to see that. Last time this has been discussed, and Robert complained to me immediately when I suggested it, is that this is not worth it with the many complications around syscache handling and resource cleanup. It is in the long term more stable to use a model where a parent process handles a set of children and decides to which database each child should spawn, which is what autovacuum does.

--
Michael
How does:

On 2018-02-23 11:48:16 +0900, Michael Paquier wrote:
> On Thu, Feb 22, 2018 at 11:24:37AM -0800, Andres Freund wrote:
>> I suspect I'm going to get some grief for this, but I think the time has
>> come to bite the bullet and support changing databases in the same
>> process...
>
> I'd like to see that. Last time this has been discussed, and Robert
> complained to me immediately when I suggested it, is that this is not
> worth it with the many complications around syscache handling and
> resource cleanup.

relate to:

> It is in the long term more stable to use a model
> where a parent process handles a set of children and decides to which
> database each child should spawn, which is what autovacuum does.

?
On Thu, Feb 22, 2018 at 9:48 PM, Michael Paquier <michael@paquier.xyz> wrote:
> I'd like to see that. Last time this has been discussed, and Robert
> complained to me immediately when I suggested it, is that this is not
> worth it with the many complications around syscache handling and
> resource cleanup. It is in the long term more stable to use a model
> where a parent process handles a set of children and decides to which
> database each child should spawn, which is what autovacuum does.

My position is that allowing processes to change databases is a good idea, but (1) it will probably take some work to get correct and (2) it probably won't be super-fast, due to the need to flush absolutely every bit of state in sight that might've been influenced by the choice of database. I also agree with Andres that this email is not very easy to understand, although my complaint is not so much that I don't see how the parts relate as that you seem to be contradicting yourself.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Feb 22, 2018 at 3:28 PM, Magnus Hagander <magnus@hagander.net> wrote:
> I would prefer that yes. But having to re-read 9TB is still significantly
> better than not being able to turn on checksums at all (state today). And
> adding a catalog column for it will carry the cost of the migration
> *forever*, both for clusters that never have checksums and those that had
> it from the beginning.
>
> Accepting that the process will start over (but only read, not re-write,
> the blocks that have already been processed) in case of a crash does
> significantly simplify the process, and reduce the long-term cost of it in
> the form of entries in the catalogs. Since this is a one-time operation (or
> for many people, a zero-time operation), paying that cost that one time is
> probably better than paying a much smaller cost but constantly.

That's not totally illogical, but to be honest I'm kinda surprised that you're approaching it that way. I would have thought that relchecksums and datchecksums columns would have been a sort of automatic design choice for this feature. The thing to keep in mind is that nobody's going to notice the overhead of adding those columns in practice, but someone will surely notice the pain that comes from having to restart the whole operation. You're talking about trading an effectively invisible overhead for a very noticeable operational problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 02/24/2018 01:34 AM, Robert Haas wrote:
> That's not totally illogical, but to be honest I'm kinda surprised
> that you're approaching it that way. I would have thought that
> relchecksums and datchecksums columns would have been a sort of
> automatic design choice for this feature. The thing to keep in mind
> is that nobody's going to notice the overhead of adding those columns
> in practice, but someone will surely notice the pain that comes from
> having to restart the whole operation. You're talking about trading
> an effectively invisible overhead for a very noticeable operational
> problem.

I agree having to restart the whole operation after a crash is not ideal, but I don't see how adding a flag actually solves it. The problem is that large databases often store most of the data (>80%) in one or two central tables (think fact tables in star schema, etc.). So if you crash, it's likely half-way through processing such a table, and the whole table would still have relchecksums=false and would have to be processed from scratch.

But perhaps you meant something like "position" instead of just a simple true/false flag?

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2018-02-24 03:07:28 +0100, Tomas Vondra wrote:
> I agree having to restart the whole operation after a crash is not
> ideal, but I don't see how adding a flag actually solves it. The problem
> is the large databases often store most of the data (>80%) in one or two
> central tables (think fact tables in star schema, etc.). So if you
> crash, it's likely half-way while processing this table, so the whole
> table would still have relchecksums=false and would have to be processed
> from scratch.

I don't think it's quite as large a problem as you make it out to be. Even in those cases you'll usually have indexes, toast tables and so forth.

> But perhaps you meant something like "position" instead of just a simple
> true/false flag?

I think that'd incur a much larger complexity cost.

Greetings,

Andres Freund
On 02/24/2018 03:11 AM, Andres Freund wrote:
> On 2018-02-24 03:07:28 +0100, Tomas Vondra wrote:
>> I agree having to restart the whole operation after a crash is not
>> ideal, but I don't see how adding a flag actually solves it.
>
> I don't think it's quite as large a problem as you make it out to
> be. Even in those cases you'll usually have indexes, toast tables and so
> forth.

Hmmm, right. I've been focused on tables and kinda forgot that the other objects need to be transformed too ... :-/

>> But perhaps you meant something like "position" instead of just a simple
>> true/false flag?
>
> I think that'd incur a much larger complexity cost.

Yep, that was part of the point that I was getting to - that actually addressing the issue would be more expensive than simple flags. But as you pointed out, that was not quite ... well thought through.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 02/24/2018 03:11 AM, Andres Freund wrote:
>> I don't think it's quite as large a problem as you make it out to
>> be. Even in those cases you'll usually have indexes, toast tables and so
>> forth.
>
> Hmmm, right. I've been focused on tables and kinda forgot that the other
> objects need to be transformed too ... :-/

There's also something of a difference between just scanning a table or index, where you don't have to do much in the way of actual writes because most of the table already has valid checksums, and having to actually write out all the changes.

>> I think that'd incur a much larger complexity cost.
>
> Yep, that was part of the point that I was getting to - that actually
> addressing the issue would be more expensive than simple flags. But as
> you pointed out, that was not quite ... well thought through.

No, but it's also not entirely wrong. Huge tables aren't uncommon.

That said, I'm not entirely convinced that these new flags would be as unnoticed as is being suggested here, but rather than focus on either side of that, I'm thinking about what we want to have *next* - we know that enabling/disabling checksums is an issue that needs to be solved, and this patch is making progress towards that, but the next question is what does one do when a page has been detected as corrupted? Are there flag fields which would be useful to have at a per-relation level to support some kind of corrective action, or a setting that says "don't care about checksums on this table, even though the entire database is supposed to have valid checksums, but instead do X with failed pages" or similar?

Beyond dealing with corruption-recovery cases, are there other use cases for having a given table not have checksums?

Would it make sense to introduce a flag or field which indicates that an entire table's pages have some set of attributes, of which 'checksums' is just one attribute? Perhaps a page version, which potentially allows us to have a way to change page layouts in the future?

I'm happy to be told that we simply don't have enough information at this point to make anything larger than a relchecksums field-level decision, but perhaps these thoughts will spark an idea about how we could define something a bit broader with clear downstream usefulness that happens to also cover the "does this relation have checksums?" question.

Thanks!

Stephen
Hi,

I see the patch also does throttling by calling vacuum_delay_point(). Being able to throttle the checksum workers so as not to affect user activity definitely seems like a useful feature, no complaints here.

But perhaps binding it to vacuum_cost_limit/vacuum_cost_delay is not the best idea? I mean, enabling checksums seems rather unrelated to vacuum, so it seems a bit strange to configure it by vacuum_* options.

Also, how am I supposed to set the cost limit? Perhaps I'm missing something, but the vacuum_delay_point call happens in the bgworker, so setting the cost limit before running pg_enable_data_checksums() won't get there, right? I really don't want to be setting it in the config file, because then it will suddenly affect all user VACUUM commands.

And if this patch gets improved to use multiple parallel workers, we'll need a global limit (something like we have for autovacuum workers).

In any case, I suggest mentioning this in the docs.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On Wed, Feb 21, 2018 at 09:53:31PM +0100, Magnus Hagander wrote:
> We’ve also included a small commandline tool, bin/pg_verify_checksums,
> that can be run against an offline cluster to validate all checksums.

The way it is coded in the patch will make pg_verify_checksums fail for heap files with multiple segments, i.e. tables over 1 GB, because the block number is consecutive and you start over from 0:

$ pgbench -i -s 80 -h /tmp
[...]
$ pg_verify_checksums -D data1
pg_verify_checksums: data1/base/12364/16396.1, block 0, invalid checksum
in file 6D61, calculated 6D5F
pg_verify_checksums: data1/base/12364/16396.1, block 1, invalid checksum
in file 7BE5, calculated 7BE7
[...]
Checksum scan completed
Data checksum version: 1
Files scanned: 943
Blocks scanned: 155925
Bad checksums: 76

Michael

--
Michael Banck, credativ GmbH
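In other words, the block number passed to the checksum routine has to be absolute within the relation, not within the current 1 GB segment file; roughly (a sketch of the needed fix, variable names illustrative):

    /* segno parsed from the file suffix: "1" for 16396.1, 0 if none */
    BlockNumber blkno = blockno_in_file + segno * RELSEG_SIZE;
    uint16      expected = pg_checksum_page(buf, blkno);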
On 2018-02-24 22:45:09 +0100, Magnus Hagander wrote:
> Is it really that invisible? Given how much we argue over adding single
> counters to the stats system, I'm not sure it's quite that low.
That appears to be entirely unrelated. The stats stuff is expensive
because we currently have to essentially write out the stats for *all*
tables in a database, once a counter is updated. And those counters are
obviously constantly updated. Thus the overhead of adding one column is
essentially multiplied by the number of tables in the system. Whereas
here it's a single column that can be updated on a per-row basis, which
is barely ever going to be written to.
Am I missing something?
> We did consider doing it at a per-table basis as well. But this is also an
> overhead that has to be paid forever, whereas the risk of having to read
> the database files more than once (because it'd only have to read them on
> the second pass, not write anything) is a one-off operation. And for all
> those that have initialized with checksums in the first place don't have to
> pay any overhead at all in the current design.
Why does it have to be paid forever?
> I very strongly doubt it's a "very noticeable operational problem". People
> don't restart their databases very often... Let's say it takes 2-3 weeks to
> complete a run in a fairly large database. How many such large databases
> actually restart that frequently? I'm not sure I know of any. And the only
> effect of it is you have to start the process over (but read-only for the
> part you have already done). It's certainly not ideal, but I don't agree
> it's in any form a "very noticeable problem".
I definitely know large databases that fail over more frequently than
that.

Greetings,

Andres Freund
> The change of the checksum state is WAL logged with a new xlog record. All the buffers written by the background worker are forcibly enabled full page writes to make sure the checksum is fully updated on the standby even if no actual contents of the buffer changed.
Hm. That doesn't sound necessary to me. If you generate a checkpoint
(or just wait until a new checkpoint has started) then go through and
do a normal xlog record for every page (any xlog record, a noop record
even) then the normal logic for full page writes ought to be
sufficient. If the noop record doesn't need a full page write it's
because someone else has already come in and done one and that one
will set the checksum. In fact if any page has an lsn > the checkpoint
start lsn for the checkpoint after the flag was flipped then you
wouldn't need to issue any record at all.
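In code form, the skip being described is roughly (a sketch; redo_after_flip stands for the redo pointer of the first checkpoint started after the flag flip, an illustrative variable):

    /* A page stamped past that redo point has already been covered by a
     * full-page write carrying its checksum; nothing to do for it. */
    if (PageGetLSN(page) >= redo_after_flip)
        continue;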
Hi,
On 2018-02-24 22:56:57 +0100, Magnus Hagander wrote:
> On Sat, Feb 24, 2018 at 10:49 PM, Andres Freund <andres@anarazel.de> wrote:
> > > We did consider doing it at a per-table basis as well. But this is also
> > an
> > > overhead that has to be paid forever, whereas the risk of having to read
> > > the database files more than once (because it'd only have to read them on
> > > the second pass, not write anything) is a one-off operation. And for all
> > > those that have initialized with checksums in the first place don't have
> > to
> > > pay any overhead at all in the current design.
> >
> > Why does it have to be paid forever?
> >
>
> The size of the pg_class row would be there forever. Granted, it's not that
> big an overhead given that there are already plenty of columns there. But
> the point being you can never remove that column, and it will be there for
> users who never even considered running without checksums. It's certainly
> not a large overhead, but it's also not zero.
But it can be removed in the next major version, if we decide it's a
good idea? We're not bound on compatibility for catalog layout.
FWIW, there's some padding space available where we currently could
store two booleans without any space overhead. Also, if we decide that
the boolean columns matter (they don't, much, in comparison to the rest
of the data, particularly relname), we can compress them into a
bitfield.
I don't think this is a valid reason for not supporting
interrupability. You can make fair arguments about adding incremental
support incrementally and whatnot, but the catalog size argument doesn't
seem part of a valid argument.
> > > I very strongly doubt it's a "very noticeable operational problem". People
> > > don't restart their databases very often... Let's say it takes 2-3 weeks
> > to
> > > complete a run in a fairly large database. How many such large databases
> > > actually restart that frequently? I'm not sure I know of any. And the
> > only
> > > effect of it is you have to start the process over (but read-only for the
> > > part you have already done). It's certainly not ideal, but I don't agree
> > > it's in any form a "very noticeable problem".
> >
> > I definitely know large databases that fail over more frequently than
> > that.
> >
>
> I would argue that they have bigger issues than enabling checksums... By
> far.
In one case it's intentional, to make sure the overall system copes. Not
that insane.

Greetings,

Andres Freund
On Sat, Feb 24, 2018 at 4:29 AM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> I see the patch also does throttling by calling vacuum_delay_point().
> Being able to throttle the checksum workers not to affect user activity
> definitely seems like a useful feature, no complaints here.
>
> But perhaps binding it to vacuum_cost_limit/vacuum_cost_delay is not the
> best idea? I mean, enabling checksums seems rather unrelated to vacuum,
> so it seems a bit strange to configure it by vacuum_* options.
>
> Also, how am I supposed to set the cost limit? Perhaps I'm missing
> something, but the vacuum_delay_point call happens in the bgworker, so
> setting the cost limit before running pg_enable_data_checksums() won't
> get there, right? I really don't want to be setting it in the config
> file, because then it will suddenly affect all user VACUUM commands.
>
> And if this patch gets improved to use multiple parallel workers, we'll
> need a global limit (something like we have for autovacuum workers).
>
> In any case, I suggest mentioning this in the docs.

Ah yes. I actually have it on my TODO to work on that, but I forgot to put it in the email I sent out. Apologies for that, and thanks for pointing it out!

Right now you have to set the limit in the configuration file. That's of course not the way we want to have it long term (but as long as it is that way it should at least be documented). My plan is to either pick it up from the current session that calls pg_enable_data_checksums(), or to simply pass it down as parameters to the function directly. I'm thinking the second one (pass a cost_delay and a cost_limit as optional parameters to the function) is the best one because, as you say, actually overloading the user-visible GUCs seems a bit ugly. Once there, I think the easiest is to just pass it down to the workers through the shared memory segment.
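If it ends up as parameters, the call might look something like this (hypothetical signature, not in the current patch):

    -- throttle the rebuild without touching the global vacuum_* GUCs
    SELECT pg_enable_data_checksums(cost_delay => 20, cost_limit => 200);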
On 02/24/2018 03:51 AM, Stephen Frost wrote:
> That said, I'm not entirely convinced that these new flags would be
> as unnoticed as is being suggested here, but rather than focus on
> either side of that, I'm thinking about what we want to have *next* -
> we know that enabling/disabling checksums is an issue that needs to
> be solved, and this patch is making progress towards that, but the
> next question is what does one do when a page has been detected as
> corrupted? Are there flag fields which would be useful to have at a
> per-relation level to support some kind of corrective action, or a
> setting that says "don't care about checksums on this table, even
> though the entire database is supposed to have valid checksums, but
> instead do X with failed pages" or similar?

Those questions are definitely worth asking, and I agree our ability to respond to data corruption (incorrect checksums) needs improvement. But I don't really see how a single per-relation flag will make any of that easier. Perhaps there are other flags/fields that might help, like for example the maximum number of checksum errors per relation (although I don't consider that very useful in practice), but that seems rather unrelated to this patch.

> Beyond dealing with corruption-recovery cases, are there other use
> cases for having a given table not have checksums?

Well, I see checksums as a way to detect data corruption caused by storage, so if you have tablespaces backed by different storage systems, you could disable checksums for objects on the storage you 100% trust. That would limit the overhead of computing checksums. But then again, this seems entirely unrelated to the patch discussed here. It would obviously require flags in catalogs, and if the patch eventually adds flags those would need to be separate.

> Would it make sense to introduce a flag or field which indicates that
> an entire table's pages have some set of attributes, of which
> 'checksums' is just one attribute? Perhaps a page version, which
> potentially allows us to have a way to change page layouts in the
> future?
>
> I'm happy to be told that we simply don't have enough information at
> this point to make anything larger than a relchecksums field-level
> decision, but perhaps these thoughts will spark an idea about how we
> could define something a bit broader with clear downstream
> usefulness that happens to also cover the "does this relation have
> checksums?" question.

I don't know. But I think we need to stop moving the goalposts further and further away, otherwise we won't get anything until PostgreSQL 73.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 02/25/2018 03:57 PM, Magnus Hagander wrote:
>>>> I definitely know large databases that fail over more frequently than
>>>> that.
>>>
>>> I would argue that they have bigger issues than enabling checksums... By
>>> far.
>>
>> In one case it's intentional, to make sure the overall system copes. Not
>> that insane.
>
> That I can understand. But in a scenario like that, you can also stop
> doing that for the period of time when you're rebuilding checksums, if
> re-reading the database over and over again is an issue.
>
> Note, I'm not saying it wouldn't be nice to have the incremental
> functionality. I'm just saying it's not needed in a first version.

I agree with this sentiment. I don't think we can make each patch perfect for everyone - certainly not in v1 :-/

Sure, it would be great to allow resuming after a restart, but if that means we won't get anything in PG 11 then I think it's not a good service to our users. OTOH, if the patch without resume addresses the issue for 99% of users, and we can improve it in PG 12, why not? That seems exactly like the incremental approach we take for many other features.

So +1 to not making the "incremental resume" mandatory. If we can support it, great! But I think the patch may seem less complex than it actually is, and figuring out how the resume should work will take some time.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 02/24/2018 10:45 PM, Magnus Hagander wrote:
> On Sat, Feb 24, 2018 at 1:34 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> The thing to keep in mind
>> is that nobody's going to notice the overhead of adding those columns
>> in practice, but someone will surely notice the pain that comes from
>> having to restart the whole operation. You're talking about trading
>> an effectively invisible overhead for a very noticeable operational
>> problem.
>
> Is it really that invisible? Given how much we argue over adding
> single counters to the stats system, I'm not sure it's quite that
> low.

I'm a bit unsure where the flags would be stored - I initially assumed pg_database/pg_class, but now I see mentions of the stats system.

But I wonder: why should this be stored in a catalog at all? The info is only needed by the bgworker(s), so they could easily flush the current status to a file every now and then and fsync it. Then after a restart, if you find a valid file, use it to resume from the last OK position. If not, start from scratch.

FWIW this is pretty much what the stats collector does.

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
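A sketch of that idea - write the progress to a temporary file, fsync it, and rename it into place so a crash never leaves a half-written file behind (file name and format made up):

    /* Durably record "processed up to this database/relation/block". */
    static void
    checksum_progress_flush(Oid dboid, Oid relid, BlockNumber blkno)
    {
        FILE   *fp = fopen("pg_checksum_progress.tmp", "w");

        if (fp == NULL)
            return;                 /* progress saving is best-effort */
        fprintf(fp, "%u %u %u\n", dboid, relid, blkno);
        fflush(fp);
        fsync(fileno(fp));          /* make the contents durable */
        fclose(fp);
        /* atomic swap; readers see either the old or the new file */
        rename("pg_checksum_progress.tmp", "pg_checksum_progress");
    }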
> On 26 Feb 2018, at 05:48, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> But I wonder: why should this be stored in a catalog at all? The info is
> only needed by the bgworker(s), so they could easily flush the current
> status to a file every now and then and fsync it. Then after a restart, if
> you find a valid file, use it to resume from the last OK position. If
> not, start from scratch.

Since this allows checksums to be turned off as well, storing a flag in the catalog would mean issuing a fairly wide update in that case to switch it to false, which might not be ideal. Not that I expect (hope) turning checksums off on a cluster will be a common operation, but that's a fairly large side effect of doing so.

cheers ./daniel
Hi, Magnus!

> On 25 Feb 2018, at 21:17, Magnus Hagander <magnus@hagander.net> wrote:
>
> PFA an updated patch that adds this, and also fixes the problem in pg_verify_checksums spotted by Michael Banck.

I had to ALLOW_CONNECTIONS on template0 to make it work:

postgres=# 2018-02-27 21:29:05.993 +05 [57259] ERROR:  Database template0 does not allow connections.

Is it a problem with my installation or some regression in the patch?

2018-02-27 21:40:47.132 +05 [57402] HINT:  either disable or enable checksums by calling the pg_data_checksums_enable()/disable() functions

The function names are wrong in this hint: pg_enable_data_checksums()

The code is nice and clear. One minor spot in this comment:

> This option can only _be_ enabled when data checksums are enabled.

Is there any way we could provide this functionality for previous versions (9.6, 10)? Like implement a utility for offline checksum enabling, without WAL-logging, surely.

Thanks for the patch!

Best regards, Andrey Borodin.
> On 28 Feb 2018, at 00:51, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
> I had to ALLOW_CONNECTIONS on template0 to make it work
> postgres=# 2018-02-27 21:29:05.993 +05 [57259] ERROR:  Database template0 does not allow connections.
> Is it a problem with my installation or some regression in the patch?

This is due to a limitation that applies to bgworkers: in order to add checksums, the bgworker must connect to the database, and template0 does not allow connections by default. There is a discussion, and patch, for lifting this restriction, but until that makes it in (if it does), the user will have to allow connections manually.

For reference, the thread for allowing bypassing allowconn in bgworkers is here: https://www.postgresql.org/message-id/CABUevEwWT9ZmonBMRFF0owneoN3DAPgOVzwHAN0bUkxaqY3eNQ@mail.gmail.com

> 2018-02-27 21:40:47.132 +05 [57402] HINT:  either disable or enable checksums by calling the pg_data_checksums_enable()/disable() functions
> Function names are wrong in this hint: pg_enable_data_checksums()

Fixed. The format of this hint (and errmsg) was actually also incorrect, which I fixed while in there.

> The code is nice and clear. One minor spot in this comment:
>> This option can only _be_ enabled when data checksums are enabled.

Fixed.

> Is there any way we could provide this functionality for previous versions (9.6, 10)? Like implement a utility for offline checksum enabling, without WAL-logging, surely.

While outside the scope of the patch in question (since it deals with enabling checksums online), such a utility should be perfectly possible to write.

Thanks for reviewing!

cheers ./daniel
On Sun, Feb 25, 2018 at 9:54 AM, Magnus Hagander <magnus@hagander.net> wrote:
> Also if that wasn't clear -- we only do the full page write if there isn't
> already a checksum on the page and that checksum is correct.

Hmm.

Suppose that on the master there is a checksum on the page and that checksum is correct, but on the standby the page contents differ in some way that we don't always WAL-log, like as to hint bits, and there the checksum is incorrect. Then you'll enable checksums when the standby still has some pages without valid checksums, and disaster will ensue.

I think this could be hard to prevent if checksums are turned on and off multiple times.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
I noticed that pg_verify_checksum takes an "-o oid" argument to only check the relation with that OID; but that's wrong, because the number is a relfilenode, not an OID (since it's compared to the on-disk file name). I would suggest changing everything to clarify that it's a pg_class.relfilenode value, otherwise it's going to be very confusing. Maybe use "-f filenode" if -f is available?

--
Álvaro Herrera  https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 02/28/2018 08:42 PM, Alvaro Herrera wrote:
> I noticed that pg_verify_checksum takes an "-o oid" argument to only
> check the relation with that OID; but that's wrong, because the number
> is a relfilenode, not an OID (since it's compared to the on-disk file
> name). I would suggest changing everything to clarify that it's a
> pg_class.relfilenode value, otherwise it's going to be very confusing.
> Maybe use "-f filenode" if -f is available?

I'd argue this is merely a mistake in the --help text. Firstly, relfilenodes are OIDs too, so I don't think "-o" is incorrect. Secondly, the SGML docs actually say:

    <varlistentry>
     <term><option>-o <replaceable>relfilenode</replaceable></option></term>
     <listitem>
      <para>
       Only validate checksums in the relation with specified relfilenode.
      </para>
     </listitem>
    </varlistentry>

regards

--
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> On 01 Mar 2018, at 05:07, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
> On 02/28/2018 08:42 PM, Alvaro Herrera wrote:
>> I noticed that pg_verify_checksum takes an "-o oid" argument to only
>> check the relation with that OID; but that's wrong, because the number
>> is a relfilenode, not an OID (since it's compared to the on-disk file
>> name). I would suggest changing everything to clarify that it's a
>> pg_class.relfilenode value, otherwise it's going to be very confusing.
>> Maybe use "-f filenode" if -f is available?
>
> I'd argue this is merely a mistake in the --help text.

Agreed, the --help text isn't really clear in this case and should be updated to say something along the lines of:

    printf(_("  -o relfilenode         check only relation with specified relfilenode\n"));

cheers ./daniel
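As an aside, since the distinction matters here: the relfilenode can differ from the table's OID (for example after a TRUNCATE or a rewriting ALTER TABLE). An illustrative way to look up the value to pass, where 'my_table' is a placeholder name:

    -- Illustrative only; 'my_table' is a placeholder.
    SELECT pg_relation_filenode('my_table');
    -- or equivalently:
    SELECT relfilenode FROM pg_class WHERE relname = 'my_table';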
> On 28 Feb 2018, at 22:06, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Sun, Feb 25, 2018 at 9:54 AM, Magnus Hagander <magnus@hagander.net> wrote:
>> Also if that wasn't clear -- we only do the full page write if there isn't
>> already a checksum on the page and that checksum is correct.
>
> Hmm.
>
> Suppose that on the master there is a checksum on the page and that
> checksum is correct, but on the standby the page contents differ in
> some way that we don't always WAL-log, such as hint bits, and there
> the checksum is incorrect. Then you'll enable checksums when the
> standby still has some pages without valid checksums, and disaster
> will ensue.
>
> I think this could be hard to prevent if checksums are turned on and
> off multiple times.

This seems a 100% valid concern. If pages can be binary different on master and standby, we have to log the act of checksumming a page, and WAL replay would have to verify that its checksum is OK.

What is unclear to me is what the standby should do if it sees an incorrect checksum. Change its page? Request the page from the master? Shut down?

Or should we just WAL-log a page whenever it is checksummed by the worker, even if the checksum was correct?

Best regards, Andrey Borodin.
> On 28 Feb 2018, at 6:22, Daniel Gustafsson <daniel@yesql.se> wrote:
>
>> Is there any way we could provide this functionality for previous versions (9.6, 10)? Like implement a utility for offline checksum enabling, without WAL-logging, surely.
>
> While outside the scope of the patch in question (since it deals with enabling
> checksums online), such a utility should be perfectly possible to write.

I've tried to rebase this patch to 10 and, despite minor rebase issues (oids, bgw_type, changes to specscanner), the patch works fine. Do we provide backporting for such features?

Best regards, Andrey Borodin.
On Thu, Mar 01, 2018 at 12:56:35PM +0500, Andrey Borodin wrote:
> I've tried to rebase this patch to 10 and, despite minor rebase issues
> (oids, bgw_type, changes to specscanner), the patch works fine. Do we
> provide backporting for such features?

New features are not backported upstream. Project policy is to apply only bug fixes to the already-released branches, to keep them as stable as possible.

-- 
Michael
Attachment
On 2018-03-01 12:56:35 +0500, Andrey Borodin wrote:
> I've tried to rebase this patch to 10 and, despite minor rebase issues (oids, bgw_type, changes to specscanner), the patch works fine.
> Do we provide backporting for such features?

Definitely not. With very rare exceptions (OS compatibility and the like), features aren't backported.

- Andres
On 2018-03-01 16:18:48 +0100, Magnus Hagander wrote:
> On Thu, Mar 1, 2018 at 9:04 AM, Andres Freund <andres@anarazel.de> wrote:
>> On 2018-03-01 12:56:35 +0500, Andrey Borodin wrote:
>>> I've tried to rebase this patch to 10 and, despite minor rebase issues
>>> (oids, bgw_type, changes to specscanner), the patch works fine.
>>> Do we provide backporting for such features?
>>
>> Definitely not. With very rare exceptions (OS compatibility and the
>> like), features aren't backported.
>
> Yeah. And definitely not something that both changes the format of
> pg_control (by adding new possible values to the checksum field) *and* adds
> a new WAL record type...

And even more so, I'm not even sure it makes sense to try to get this into v11. This is a medium-large complicated feature, submitted to the last CF for v11. That's pretty late. Now, Magnus is a committer, but nevertheless...

Greetings,

Andres Freund
Magnus Hagander wrote:
> On Wed, Feb 28, 2018 at 10:07 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> I'd argue this is merely a mistake in the --help text. Firstly,
>> relfilenodes are OIDs too, so I don't think "-o" is incorrect. Secondly,
>> the SGML docs actually say:
>
> Yeah, that one is my fault. It used to say oid all over but I noticed and
> fixed it. Except I clearly missed the --help.

Obviously option names are completely arbitrary -- you could say "-P relfilenode" and it'd still be 'correct', since it works as documented. But we try to make these options mnemonic when we can, and I don't see any relation between "-o" and "relfilenode", so I suggest picking some other letter. There's a whole alphabet out there.

Either "-r" or "-f" works better for me than "-o".

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03/02/2018 03:22 PM, Magnus Hagander wrote:
> On Fri, Mar 2, 2018 at 3:17 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
>> Magnus Hagander wrote:
>>> On Wed, Feb 28, 2018 at 10:07 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>>>> I'd argue this is merely a mistake in the --help text. Firstly,
>>>> relfilenodes are OIDs too, so I don't think "-o" is incorrect. Secondly,
>>>> the SGML docs actually say:
>>>
>>> Yeah, that one is my fault. It used to say oid all over but I noticed and
>>> fixed it. Except I clearly missed the --help.
>>
>> Obviously option names are completely arbitrary -- you could say
>> "-P relfilenode" and it'd still be 'correct', since it works as
>> documented. But we try to make these options mnemonic when we can,
>> and I don't see any relation between "-o" and "relfilenode", so I
>> suggest picking some other letter. There's a whole alphabet out there.
>>
>> Either "-r" or "-f" works better for me than "-o".
>
> I have no problem with changing it to -r. -f seems a bit wrong to me, as
> it might read as a file. And in the future we might want to implement
> the ability to take a full filename (with path), in which case it would
> make sense to use -f for that.

+1 to -r

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
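For illustration, with the option renamed as agreed here (the later v3 patch adopts -r), a run against a stopped cluster limited to one relation would look roughly like the sketch below. Judging from the option handling quoted later in the thread, the data directory is given as a positional argument or via PGDATA; the path and relfilenode value are placeholders:

    $ pg_verify_checksums -r 16385 /path/to/data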
On Fri, Mar 2, 2018 at 8:35 AM, Magnus Hagander <magnus@hagander.net> wrote:
> Do we ever make hintbit changes on the standby for example? If so, it would
> definitely cause problems. I didn't realize we did, actually...

We do not.

> I guess we could get there even if we don't by:
> * All checksums are correct
> * Checksums are disabled (which replicates)
> * Non-WAL-logged change on the master, which updates the checksum but does
>   *not* replicate
> * Checksums re-enabled
> * Worker sees the checksum as correct, and thus does not force a full page
>   write.
> * Worker completes and flips checksums on, which replicates. At this point,
>   if the replica reads the page, boom.

Exactly.

> I guess we have to remove that optimisation. It's definitely a bummer, but I
> don't think it's an absolute dealbreaker.

I don't disagree.

> We could say that we keep the optimisation if wal_level=minimal for example,
> because then we know there is no replica. But I doubt that's worth it?

I don't have a strong feeling about this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Fri, Mar 2, 2018 at 2:44 AM, Andres Freund <andres@anarazel.de> wrote:
> And even more so, I'm not even sure it makes sense to try to get this
> into v11. This is a medium-large complicated feature, submitted to the
> last CF for v11. That's pretty late. Now, Magnus is a committer, but
> nevertheless...

Yeah, I would also favor bumping this one out to a later release. I think there is a significant risk either that the design is flawed -- and as evidence, I offer that I found a flaw in it which I noticed only because of a passing remark in an email, not because I opened the patch -- or that the design boxes us into a corner such that it will be hard to improve this later. I think there is a good chance that there are other serious problems with this patch, and to be honest I don't really want to go try to find them right this minute; I want to work on other patches that were submitted earlier and have been waiting for a long time.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 03/02/2018 02:35 PM, Magnus Hagander wrote:
> On Wed, Feb 28, 2018 at 6:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Sun, Feb 25, 2018 at 9:54 AM, Magnus Hagander <magnus@hagander.net> wrote:
>>> Also if that wasn't clear -- we only do the full page write if there isn't
>>> already a checksum on the page and that checksum is correct.
>>
>> Hmm.
>>
>> Suppose that on the master there is a checksum on the page and that
>> checksum is correct, but on the standby the page contents differ in
>> some way that we don't always WAL-log, such as hint bits, and there
>> the checksum is incorrect. Then you'll enable checksums when the
>> standby still has some pages without valid checksums, and disaster
>> will ensue.
>>
>> I think this could be hard to prevent if checksums are turned on and
>> off multiple times.
>
> Do we ever make hintbit changes on the standby for example? If so, it
> would definitely cause problems. I didn't realize we did, actually...

I don't think we do. SetHintBits does TransactionIdIsValid(xid) and AFAIK that can't be true on a standby.

> I guess we could get there even if we don't by:
> * All checksums are correct
> * Checksums are disabled (which replicates)
> * Non-WAL-logged change on the master, which updates the checksum but does
>   *not* replicate
> * Checksums re-enabled
> * Worker sees the checksum as correct, and thus does not force a full
>   page write.
> * Worker completes and flips checksums on, which replicates. At this
>   point, if the replica reads the page, boom.

Maybe.

My understanding of Robert's example is that you can start with an instance that has wal_log_hints=off, and so pages on master/standby may not be 100% identical. Then we do the online checksum thing, and the standby may get pages with incorrect checksums.

> I guess we have to remove that optimisation. It's definitely a
> bummer, but I don't think it's an absolute dealbreaker.

I agree it's not a deal-breaker. Or at least I don't see why it should be -- any other maintenance activity on the database (freezing etc.) will also generate full-page writes.

The good thing is the throttling also limits the amount of WAL, so it's possible to prevent generating too many checkpoints etc.

I suggest we simply:

1) set the checksums to in-progress
2) wait for a checkpoint
3) use the regular logic for full pages (i.e. the first change after a checkpoint does a FPW)

BTW speaking of checkpoints, I see ChecksumHelperLauncherMain does

    RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_IMMEDIATE);

I'm rather unhappy about that -- immediate checkpoints have a massive impact on production systems, so we try to avoid them (that's one of the reasons why CREATE DATABASE is somewhat painful). It usually requires a bit of thinking about when to issue such commands. But in this case it's unpredictable when exactly the checksumming completes, so it may easily be in the middle of peak activity.

Why not simply wait for a regular spread checkpoint, the way pg_basebackup does it?

> We could say that we keep the optimisation if wal_level=minimal for
> example, because then we know there is no replica. But I doubt
> that's worth it?

If it doesn't require a lot of code, why not? But I don't really see much point in doing that.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
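A minimal sketch (not from the patch itself) of the spread-checkpoint alternative suggested above: dropping CHECKPOINT_IMMEDIATE lets the checkpointer spread the writes according to checkpoint_completion_target, while CHECKPOINT_WAIT still blocks the caller until the checkpoint completes.

    /*
     * Sketch: request a non-immediate (spread) checkpoint and wait for it,
     * instead of forcing an immediate one.
     */
    RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT);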
On 03/02/2018 11:01 PM, Magnus Hagander wrote:
> On Fri, Mar 2, 2018 at 5:50 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> [...]
>>
>> My understanding of Robert's example is that you can start with an
>> instance that has wal_log_hints=off, and so pages on master/standby may
>> not be 100% identical. Then we do the online checksum thing, and the
>> standby may get pages with incorrect checksums.
>
> No, in that case the master will issue full page writes for *all* pages,
> since they didn't have a checksum. The current patch only avoids doing
> that if the checksum on the master is correct, which it isn't when you
> start from checksums=off. So this particular problem only shows up if
> you iterate between off/on/off multiple times.

Hmmm, OK. So we need to have a valid checksum on a page, disable checksums, set some hint bits on the page (which won't be WAL-logged), enable checksums again and still get a valid checksum even with the new hint bits? That's possible, albeit unlikely.

>> I agree it's not a deal-breaker. Or at least I don't see why it should
>> be -- any other maintenance activity on the database (freezing etc.)
>> will also generate full-page writes.
>
> Yes.

>> The good thing is the throttling also limits the amount of WAL, so it's
>> possible to prevent generating too many checkpoints etc.
>>
>> I suggest we simply:
>>
>> 1) set the checksums to in-progress
>> 2) wait for a checkpoint
>> 3) use the regular logic for full pages (i.e. the first change after a
>> checkpoint does a FPW)
>
> This is very close to what it does now, except it does not wait for a
> checkpoint in #2. Why does it need that?

To guarantee that the page has a FPW with all the hint bits, before we start messing with the checksums (or that setting the checksum itself triggers a FPW).

>> BTW speaking of checkpoints, I see ChecksumHelperLauncherMain does
>>
>>     RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_IMMEDIATE);
>>
>> I'm rather unhappy about that -- immediate checkpoints have a massive
>> impact on production systems. [...]
>>
>> Why not simply wait for a regular spread checkpoint, the way
>> pg_basebackup does it?
>
> Actually, that was my original idea. I changed it for testing, and should
> go change it back.

OK

>>> We could say that we keep the optimisation if wal_level=minimal for
>>> example, because then we know there is no replica. But I doubt
>>> that's worth it?
>>
>> If it doesn't require a lot of code, why not? But I don't really see
>> much point in doing that.
>
> Yeah, I doubt there are a lot of people using "minimal" these days, not
> since we changed the default.

Yeah. Although as I said, it depends on how much code would be needed to enable that optimization (I guess not much). If someone is running with wal_level=minimal intentionally, why not help them.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Fri, Mar 2, 2018 at 6:26 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> Hmmm, OK. So we need to have a valid checksum on a page, disable
> checksums, set some hint bits on the page (which won't be WAL-logged),
> enable checksums again and still get a valid checksum even with the new
> hint bits? That's possible, albeit unlikely.

No, the problem is if - as is much more likely - the checksum is not still valid.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sat, Mar 3, 2018 at 7:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 2, 2018 at 6:26 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> Hmmm, OK. So we need to have a valid checksum on a page, disable
>> checksums, set some hint bits on the page (which won't be WAL-logged),
>> enable checksums again and still get a valid checksum even with the new
>> hint bits? That's possible, albeit unlikely.
>
> No, the problem is if - as is much more likely - the checksum is not
> still valid.

Hmm, on second thought ... maybe I didn't think this through carefully enough. If the checksum matches on the master by chance, and the page is the same on the standby, then we're fine, right? It's a weird accident, but nothing is actually broken.

The failure scenario is where the standby has a version of the page with a bad checksum, but the master has a good checksum. So for example: checksums disabled, master modifies the page (which is replicated), master sets some hint bits (coincidentally making the checksum match), now we try to turn checksums on and don't re-replicate the page because the checksum already looks correct.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 03/03/2018 01:38 PM, Robert Haas wrote:
> On Sat, Mar 3, 2018 at 7:32 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Fri, Mar 2, 2018 at 6:26 PM, Tomas Vondra
>> <tomas.vondra@2ndquadrant.com> wrote:
>>> Hmmm, OK. So we need to have a valid checksum on a page, disable
>>> checksums, set some hint bits on the page (which won't be
>>> WAL-logged), enable checksums again and still get a valid
>>> checksum even with the new hint bits? That's possible, albeit
>>> unlikely.
>>
>> No, the problem is if - as is much more likely - the checksum is
>> not still valid.
>
> Hmm, on second thought ... maybe I didn't think this through
> carefully enough. If the checksum matches on the master by chance,
> and the page is the same on the standby, then we're fine, right? It's
> a weird accident, but nothing is actually broken. The failure
> scenario is where the standby has a version of the page with a bad
> checksum, but the master has a good checksum. So for example:
> checksums disabled, master modifies the page (which is replicated),
> master sets some hint bits (coincidentally making the checksum
> match), now we try to turn checksums on and don't re-replicate the
> page because the checksum already looks correct.

Yeah. Doesn't that pretty much mean we can't skip any pages that have a correct checksum, because we can't rely on the standby having the same page data? That is, this block in ProcessSingleRelationFork:

    /*
     * If checksum was not set or was invalid, mark the buffer as dirty
     * and force a full page write. If the checksum was already valid, we
     * can leave it since we know that any other process writing the
     * buffer will update the checksum.
     */
    if (checksum != pagehdr->pd_checksum)
    {
        START_CRIT_SECTION();
        MarkBufferDirty(buf);
        log_newpage_buffer(buf, false);
        END_CRIT_SECTION();
    }

That would mean this optimization - only doing the write when the checksum does not match - is broken.

If that's the case, it probably makes restarts/resume more expensive, because this optimization was why after a restart the already processed data was only read (and the checksums verified) but not written.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
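A sketch of what dropping the optimization means for that block (this also matches what the next patch revision does, per the later review): the checksum comparison goes away, and every page is unconditionally dirtied and WAL-logged, so any diverged copy on a standby is overwritten by a full page image.

    /* Sketch: always force a full page write, regardless of the current checksum. */
    START_CRIT_SECTION();
    MarkBufferDirty(buf);
    log_newpage_buffer(buf, false);
    END_CRIT_SECTION();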
On 03/03/2018 05:08 PM, Magnus Hagander wrote:
> On Sat, Mar 3, 2018 at 5:06 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>> [...]
>>
>> Yeah. Doesn't that pretty much mean we can't skip any pages that have a
>> correct checksum, because we can't rely on the standby having the same
>> page data? [...]
>>
>> That would mean this optimization - only doing the write when the
>> checksum does not match - is broken.
>
> Yes. I think that was the conclusion of this, as posted in
> https://www.postgresql.org/message-id/CABUevExDZu__5KweT8fr3Ox45YcuvTDEEu%3DaDpGBT8Sk0RQE_g%40mail.gmail.com
> :)

Oh, right. I did have a "deja vu" feeling when writing that. Good that I came to the same conclusion, though.

>> If that's the case, it probably makes restarts/resume more expensive,
>> because this optimization was why after a restart the already processed
>> data was only read (and the checksums verified) but not written.
>
> Yes, it definitely does. It's not a dealbreaker, but it's certainly
> a bit painful not to be able to resume as cheaply.

Yeah. It probably makes the more elaborate resuming more valuable, but I still think it's not a "must have" for PG11.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
Hi,

On Sat, Mar 03, 2018 at 07:23:31PM +0100, Magnus Hagander wrote:
> diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
> new file mode 100644
> index 0000000000..a4bfe7284d
> --- /dev/null
> +++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c
> @@ -0,0 +1,308 @@

[...]

> +/*
> + * pg_verify_checksums
> + *
> + * Verifies page level checksums in an offline cluster
> + *
> + * Copyright (c) 2010-2018, PostgreSQL Global Development Group

Weird copyright statement for a new file; did you base it off another one, or just copy-paste the boilerplate?

[...]

> +        csum = pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE);
> +        if (csum != header->pd_checksum)
> +        {
> +            if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION)
> +                fprintf(stderr, _("%s: %s, block %d, invalid checksum in file %X, calculated %X\n"),
> +                        progname, fn, blockno, header->pd_checksum, csum);

The error message sounds a bit strange to me: I would expect the filename after "in file [...]", but you print the expected checksum instead. Also, 'invalid' sounds a bit like we found a malformed checksum (no hex), so maybe "checksum mismatch in file, expected %X, found %X" or something?

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
> On 04 Mar 2018, at 15:24, Michael Banck <michael.banck@credativ.de> wrote:
>
>> +        csum = pg_checksum_page(buf, blockno + segmentno * RELSEG_SIZE);
>> +        if (csum != header->pd_checksum)
>> +        {
>> +            if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION)
>> +                fprintf(stderr, _("%s: %s, block %d, invalid checksum in file %X, calculated %X\n"),
>> +                        progname, fn, blockno, header->pd_checksum, csum);
>
> The error message sounds a bit strange to me, I would expect the
> filename after "in file [...]", but you print the expected checksum.
> Also, 'invalid' sounds a bit like we found a malformed checksum
> (no hex), so maybe "checksum mismatch in file, expected %X,
> found %X" or something?

Agreed. Looking at our current error messages, "in file" is conventionally followed by the filename. I do however think "calculated" is better than "expected", since it conveys clearly that the compared checksum is calculated by pg_verify_checksums and not read from somewhere.

How about something like this?

    _("%s: checksum mismatch in file \"%s\", block %d: calculated %X, found %X"),
    progname, fn, blockno, csum, header->pd_checksum);

cheers ./daniel
Hi,

On Sunday, 04.03.2018, at 23:30 +0100, Daniel Gustafsson wrote:
> Agreed. Looking at our current error messages, "in file" is conventionally
> followed by the filename. I do however think "calculated" is better than
> "expected", since it conveys clearly that the compared checksum is
> calculated by pg_verify_checksums and not read from somewhere.
>
> How about something like this?
>
>     _("%s: checksum mismatch in file \"%s\", block %d: calculated %X, found %X"),
>     progname, fn, blockno, csum, header->pd_checksum);

I still find that confusing, but maybe it's just me. I thought the one in the page header is the "expected" checksum, and we compare the "found" or "computed/calculated" one (from the page contents) against it.

I had the same conversation with an external tool author, by the way:

https://github.com/uptimejp/postgres-toolkit/issues/48

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Hi,

On Mon, Mar 05, 2018 at 11:09:02AM +0100, Magnus Hagander wrote:
> On Mon, Mar 5, 2018 at 10:43 AM, Michael Banck <michael.banck@credativ.de> wrote:
>> I still find that confusing, but maybe it's just me. I thought the one
>> in the page header is the "expected" checksum, and we compare the "found"
>> or "computed/calculated" one (from the page contents) against it.
>>
>> I had the same conversation with an external tool author, by the way:
>
> Maybe we should just say "on disk" for the one that's on disk, would that
> break the confusion? So "calculated %X, found %X on disk"?

I found that there is a precedent in bufpage.c:

| ereport(WARNING,
|     (ERRCODE_DATA_CORRUPTED,
|      errmsg("page verification failed, calculated checksum %u but expected %u",
|             checksum, p->pd_checksum)));

Apart from the fact that it doesn't print out the hex value (which I find strange), it sounds like a sensible message to me. But "found %X on disk" would work as well, I guess.

Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
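Combining that precedent with the hex output discussed here, one possible wording (a sketch only, not from any posted patch; the variables are the ones from the tool's loop quoted earlier) would be:

    fprintf(stderr, _("%s: checksum verification failed in file \"%s\", block %d: calculated checksum %X but expected %X\n"),
            progname, fn, blockno, csum, header->pd_checksum);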
Hi, I had a closer look at v3 of the patch now. On Sat, Mar 03, 2018 at 07:23:31PM +0100, Magnus Hagander wrote: > Attached is a rebased patch which removes this optimization, updates the > pg_proc entry for the new format, and changes pg_verify_checksums to use -r > instead of -o for relfilenode. The patch applies fine with minimal fuzz and compiles with no warnings; make check and the added isolation tests, as well as the added checksum tests pass. If I blindly run "SELECT pg_enable_data_checksums();" on new cluster, I get: |postgres=# SELECT pg_enable_data_checksums(); | pg_enable_data_checksums |-------------------------- | t |(1 row) | |postgres=# SHOW data_checksums ; | data_checksums |---------------- | inprogress |(1 row) However, inspecting the log one sees: |2018-03-10 14:15:57.702 CET [3313] ERROR: Database template0 does not allow connections. |2018-03-10 14:15:57.702 CET [3313] HINT: Allow connections using ALTER DATABASE and try again. |2018-03-10 14:15:57.702 CET [3152] LOG: background worker "checksum helper launcher" (PID 3313) exited with exit code 1 and the background worker is no longer running without any obvious hint to the client. I am aware that this is discussed already, but as-is the user experience is pretty bad, I think pg_enable_data_checksums() should either bail earlier if it cannot connect to all databases, or it should be better documented. Otherwise, if I allow connections to template0, the patch works as expected, I have not managed to break it so far. Some further review comments: > diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml > index f4bc2d4161..89afecb341 100644 > --- a/doc/src/sgml/wal.sgml > +++ b/doc/src/sgml/wal.sgml > @@ -230,6 +230,73 @@ [...] > + <para> > + Checksums can be enabled or disabled online, by calling the appropriate > + <link linkend="functions-admin-checksum">functions</link>. > + Disabling of checksums takes effect immediately when the function is called. > + </para> > + > + <para> > + Enabling checksums will put the cluster in <literal>inprogress</literal> mode. > + During this time, checksums will be written but not verified. In addition to > + this, a background worker process is started that enables checksums on all > + existing data in the cluster. Once this worker has completed processing all > + databases in the cluster, the checksum mode will automatically switch to > + <literal>on</literal>. > + </para> > + > + <para> > + If the cluster is stopped while in <literal>inprogress</literal> mode, for > + any reason, then this process must be restarted manually. To do this, > + re-execute the function <function>pg_enable_data_checksums()</function> > + once the cluster has been restarted. > + </para> Maybe document the above issue here, unless it is clear that the templat0-needs-to-allow-connections issue will go away before the patch is pushed. > diff --git a/src/backend/access/transam/xlog.c b/src/backend/access/transam/xlog.c > index 47a6c4d895..56aaa88de1 100644 > --- a/src/backend/access/transam/xlog.c > +++ b/src/backend/access/transam/xlog.c [...] > +void > +SetDataChecksumsOn(void) > +{ > + if (!DataChecksumsEnabledOrInProgress()) > + elog(ERROR, "Checksums not enabled or in progress"); > + > + LWLockAcquire(ControlFileLock, LW_EXCLUSIVE); > + > + if (ControlFile->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_VERSION) > + { > + LWLockRelease(ControlFileLock); > + elog(ERROR, "Checksums not in in_progress mode"); The string used in "SHOW data_checksums" is "inprogress", not "in_progress". [...] 
> @@ -7769,6 +7839,16 @@ StartupXLOG(void) > CompleteCommitTsInitialization(); > > /* > + * If we reach this point with checksums in inprogress state, we notify > + * the user that they need to manually restart the process to enable > + * checksums. > + */ > + if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION) > + ereport(WARNING, > + (errmsg("checksum state is \"in progress\" with no worker"), > + errhint("Either disable or enable checksums by calling the pg_disable_data_checksums() or pg_enable_data_checksums()functions."))); Again, string is "inprogress", not "in progress", not sure if that matters. > diff --git a/src/backend/postmaster/checksumhelper.c b/src/backend/postmaster/checksumhelper.c > new file mode 100644 > index 0000000000..6aa71bcf30 > --- /dev/null > +++ b/src/backend/postmaster/checksumhelper.c > @@ -0,0 +1,631 @@ > +/*------------------------------------------------------------------------- > + * > + * checksumhelper.c > + * Backend worker to walk the database and write checksums to pages Backend or Background? [...] > +typedef struct ChecksumHelperShmemStruct > +{ > + pg_atomic_flag launcher_started; > + bool success; > + bool process_shared_catalogs; > + /* Parameter values set on start */ double space. [...] > +static bool > +ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy) > +{ > + BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum); > + BlockNumber b; > + > + for (b = 0; b < numblocks; b++) > + { > + Buffer buf = ReadBufferExtended(reln, forkNum, b, RBM_NORMAL, strategy); > + Page page; > + PageHeader pagehdr; > + uint16 checksum; > + > + /* Need to get an exclusive lock before we can flag as dirty */ > + LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE); > + > + /* Do we already have a valid checksum? */ > + page = BufferGetPage(buf); > + pagehdr = (PageHeader) page; > + checksum = pg_checksum_page((char *) page, b); This looks like it does not take the segment number into account; however it is also unclear to me what the purpose of this is, as checksum is never validated against the pagehdr, and nothing is done with it. Indeed, I even get a compiler warning about pagehdr and checksum: git/postgresql/build/../src/backend/postmaster/checksumhelper.c: In function ‘ProcessSingleRelationFork’: git/postgresql/build/../src/backend/postmaster/checksumhelper.c:155:11: warning: variable ‘checksum’ set but not used [-Wunused-but-set-variable] uint16 checksum; ^~~~~~~~ git/postgresql/build/../src/backend/postmaster/checksumhelper.c:154:14: warning: variable ‘pagehdr’ set but not used [-Wunused-but-set-variable] PageHeader pagehdr; ^~~~~~~ I guess the above block running pg_checksum_page() is a leftover from previous versions of the patch and should be removed... > + /* > + * Mark the buffer as dirty and force a full page write. > + * We have to re-write the page to wal even if the checksum hasn't > + * changed, because if there is a replica it might have a slightly > + * different version of the page with an invalid checksum, caused > + * by unlogged changes (e.g. hintbits) on the master happening while > + * checksums were off. This can happen if there was a valid checksum > + * on the page at one point in the past, so only when checksums > + * are first on, then off, and then turned on again. > + */ > + START_CRIT_SECTION(); > + MarkBufferDirty(buf); > + log_newpage_buffer(buf, false); > + END_CRIT_SECTION(); ... seeing how MarkBufferDirty(buf) is now run unconditionally. [...] 
> + /* > + * Initialize a connection to shared catalogs only. > + */ > + BackgroundWorkerInitializeConnection(NULL, NULL); > + > + /* > + * Set up so first run processes shared catalogs, but not once in every db > + */ Comment should have a full stop like the above and below ones? > + ChecksumHelperShmem->process_shared_catalogs = true; > + > + /* > + * Create a database list. We don't need to concern ourselves with > + * rebuilding this list during runtime since any new created database will > + * be running with checksums turned on from the start. > + */ [...] > + DatabaseList = BuildDatabaseList(); > + > + /* > + * If there are no databases at all to checksum, we can exit immediately > + * as there is no work to do. > + */ > + if (DatabaseList == NIL || list_length(DatabaseList) == 0) > + return; > + > + while (true) > + { > + List *remaining = NIL; > + ListCell *lc, > + *lc2; > + List *CurrentDatabases = NIL; > + > + foreach(lc, DatabaseList) > + { > + ChecksumHelperDatabase *db = (ChecksumHelperDatabase *) lfirst(lc); > + > + if (ProcessDatabase(db)) > + { > + pfree(db->dbname); > + pfree(db); > + > + if (ChecksumHelperShmem->process_shared_catalogs) > + > + /* > + * Now that one database has completed shared catalogs, we > + * don't have to process them again . Stray space. > + */ > + ChecksumHelperShmem->process_shared_catalogs = false; > + } > + else > + { > + /* > + * Put failed databases on the remaining list. > + */ > + remaining = lappend(remaining, db); > + } > + } > + list_free(DatabaseList); > + > + DatabaseList = remaining; > + remaining = NIL; > + > + /* > + * DatabaseList now has all databases not yet processed. This can be > + * because they failed for some reason, or because the database was > + * DROPed between us getting the database list and trying to process DROPed looks wrong, and there's no other occurence of it in the source tree. DROPped looks even weirder, so maybe just "dropped"? > + * it. Get a fresh list of databases to detect the second case with. That sentence looks unfinished or at least is unclear to me. > + * Any database that still exists but failed we retry for a limited I'm not a native speaker, but this looks wrong to me as well, maybe "We retry any database that still exists but failed for a limited [...]"? > diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c > index 0fe98a550e..4bb2b7e6ec 100644 > --- a/src/bin/pg_upgrade/controldata.c > +++ b/src/bin/pg_upgrade/controldata.c > @@ -591,6 +591,15 @@ check_control_data(ControlData *oldctrl, > */ > > /* > + * If checksums have been turned on in the old cluster, but the > + * checksumhelper have yet to finish, then disallow upgrading. The user > + * should either let the process finish, or turn off checksums, before > + * retrying. > + */ > + if (oldctrl->data_checksum_version == 2) > + pg_fatal("transition to data checksums not completed in old cluster\n"); > + > + /* > * We might eventually allow upgrades from checksum to no-checksum > * clusters. > */ See below about src/bin/pg_upgrade/pg_upgrade.h having data_checksum_version be a bool. I checked pg_ugprade (from master to master though), and could find no off-hand issues, i.e. it reported all issues correctly. > --- /dev/null > +++ b/src/bin/pg_verify_checksums/Makefile [...] 
> +check:
> +	$(prove_check)
> +
> +installcheck:
> +	$(prove_installcheck)

If I run "make check" in src/bin/pg_verify_checksums, I get a fat perl error:

|src/bin/pg_verify_checksums$ LANG=C make check
|rm -rf '/home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build'/tmp_install
|/bin/mkdir -p '/home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build'/tmp_install/log
|make -C '../../..' DESTDIR='/home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build'/tmp_install install >'/home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build'/tmp_install/log/install.log 2>&1
|rm -rf '/home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/src/bin/pg_verify_checksums'/tmp_check
|/bin/mkdir -p '/home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/src/bin/pg_verify_checksums'/tmp_check
|cd /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/bin/pg_verify_checksums && TESTDIR='/home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/src/bin/pg_verify_checksums' PATH="/home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/tmp_install//bin:$PATH" LD_LIBRARY_PATH="/home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/tmp_install//lib" PGPORT='65432' PG_REGRESS='/home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/src/bin/pg_verify_checksums/../../../src/test/regress/pg_regress' /usr/bin/prove -I /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/test/perl/ -I /home/mba/Projekte/OSS/PostgreSQL/git/postgresql/build/../src/bin/pg_verify_checksums t/*.pl
|Cannot detect source of 't/*.pl'! at /usr/share/perl/5.24/TAP/Parser/IteratorFactory.pm line 261.
|	TAP::Parser::IteratorFactory::detect_source(TAP::Parser::IteratorFactory=HASH(0x55eed10df3e8), TAP::Parser::Source=HASH(0x55eed10bd358)) called at /usr/share/perl/5.24/TAP/Parser/IteratorFactory.pm line 211
|	TAP::Parser::IteratorFactory::make_iterator(TAP::Parser::IteratorFactory=HASH(0x55eed10df3e8), TAP::Parser::Source=HASH(0x55eed10bd358)) called at /usr/share/perl/5.24/TAP/Parser.pm line 472
|	TAP::Parser::_initialize(TAP::Parser=HASH(0x55eed10df328), HASH(0x55eed0ea2e90)) called at /usr/share/perl/5.24/TAP/Object.pm line 55
|	TAP::Object::new("TAP::Parser", HASH(0x55eed0ea2e90)) called at /usr/share/perl/5.24/TAP/Object.pm line 130
|	TAP::Object::_construct(TAP::Harness=HASH(0x55eed09176b0), "TAP::Parser", HASH(0x55eed0ea2e90)) called at /usr/share/perl/5.24/TAP/Harness.pm line 852
|	TAP::Harness::make_parser(TAP::Harness=HASH(0x55eed09176b0), TAP::Parser::Scheduler::Job=HASH(0x55eed0fdc708)) called at /usr/share/perl/5.24/TAP/Harness.pm line 651
|	TAP::Harness::_aggregate_single(TAP::Harness=HASH(0x55eed09176b0), TAP::Parser::Aggregator=HASH(0x55eed091e520), TAP::Parser::Scheduler=HASH(0x55eed0fdc6a8)) called at /usr/share/perl/5.24/TAP/Harness.pm line 743
|	TAP::Harness::aggregate_tests(TAP::Harness=HASH(0x55eed09176b0), TAP::Parser::Aggregator=HASH(0x55eed091e520), "t/*.pl") called at /usr/share/perl/5.24/TAP/Harness.pm line 558
|	TAP::Harness::__ANON__() called at /usr/share/perl/5.24/TAP/Harness.pm line 571
|	TAP::Harness::runtests(TAP::Harness=HASH(0x55eed09176b0), "t/*.pl") called at /usr/share/perl/5.24/App/Prove.pm line 546
|	App::Prove::_runtests(App::Prove=HASH(0x55eed090b0c8), HASH(0x55eed0d79cf0), "t/*.pl") called at /usr/share/perl/5.24/App/Prove.pm line 504
|	App::Prove::run(App::Prove=HASH(0x55eed090b0c8)) called at /usr/bin/prove line 13
|Makefile:39: recipe for target 'check' failed
|make: *** [check] Error 2

> diff --git a/src/bin/pg_verify_checksums/pg_verify_checksums.c b/src/bin/pg_verify_checksums/pg_verify_checksums.c
> new file mode 100644
> index 0000000000..a4bfe7284d
> --- /dev/null
> +++ b/src/bin/pg_verify_checksums/pg_verify_checksums.c

[...]

> +	if (DataDir == NULL)
> +	{
> +		if (optind < argc)
> +			DataDir = argv[optind++];
> +		else
> +			DataDir = getenv("PGDATA");
> +	}
> +
> +	/* Complain if any arguments remain */
> +	if (optind < argc)
> +	{
> +		fprintf(stderr, _("%s: too many command-line arguments (first is \"%s\")\n"),
> +				progname, argv[optind]);
> +		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
> +				progname);
> +		exit(1);
> +	}
> +
> +	if (DataDir == NULL)
> +	{
> +		fprintf(stderr, _("%s: no data directory specified\n"), progname);
> +		fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
> +		exit(1);
> +	}

Those two if (DataDir == NULL) checks could maybe be put together into one block.

> diff --git a/src/include/postmaster/checksumhelper.h b/src/include/postmaster/checksumhelper.h
> new file mode 100644
> index 0000000000..7f296264a9
> --- /dev/null
> +++ b/src/include/postmaster/checksumhelper.h
> @@ -0,0 +1,31 @@
> +/*-------------------------------------------------------------------------
> + *
> + * checksumhelper.h
> + *	  header file for checksum helper deamon

"deamon" is surely wrong (it'd be "daemon"), but maybe "(background) worker" is better?

> diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
> index 85dd10c45a..bd46bf2ce6 100644
> --- a/src/include/storage/bufpage.h
> +++ b/src/include/storage/bufpage.h
> @@ -194,6 +194,7 @@ typedef PageHeaderData *PageHeader;
>   */
>  #define PG_PAGE_LAYOUT_VERSION		4
>  #define PG_DATA_CHECKSUM_VERSION	1
> +#define PG_DATA_CHECKSUM_INPROGRESS_VERSION	2

I am not very sure about the semantics of PG_DATA_CHECKSUM_VERSION being 1, but I assumed it was a version, like, if we ever decide to use a different checksumming algorithm, we'd bump it to 2. Now PG_DATA_CHECKSUM_INPROGRESS_VERSION is defined to 2, which I agree is convenient, but is there some strategy for what to do in case PG_DATA_CHECKSUM_VERSION needs to be increased?

In any case, src/bin/pg_upgrade/pg_upgrade.h has

	bool		data_checksum_version;

in the ControlData struct, which might need updating?

That's all for now.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
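For illustration, the merged block could look something like this (a sketch of the suggested rearrangement only, not code from the patch; the variable names are taken from the quoted hunk):

	/* Resolve the data directory from the command line or PGDATA */
	if (DataDir == NULL)
	{
		if (optind < argc)
			DataDir = argv[optind++];
		else
			DataDir = getenv("PGDATA");

		/* Still NULL: neither a positional argument nor PGDATA was given */
		if (DataDir == NULL)
		{
			fprintf(stderr, _("%s: no data directory specified\n"), progname);
			fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
			exit(1);
		}
	}

	/* Complain if any arguments remain */
	if (optind < argc)
	{
		fprintf(stderr, _("%s: too many command-line arguments (first is \"%s\")\n"),
				progname, argv[optind]);
		fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
		exit(1);
	}

The two error cases cannot both fire (a remaining positional argument would have been consumed as the data directory), so the order of the error messages is preserved.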
> On 10 Mar 2018, at 16:09, Michael Banck <michael.banck@credativ.de> wrote:
>
> I had a closer look at v3 of the patch now.

Thanks, much appreciated! Sorry for the late response, I just came back from a conference and have had little time for hacking.

All whitespace, punctuation and capitalization comments have been addressed per your recommendations, so I took the liberty to trim them from the response.

> I am aware that this is discussed already, but as-is the user experience
> is pretty bad, I think pg_enable_data_checksums() should either bail
> earlier if it cannot connect to all databases, or it should be better
> documented.

Personally I think we should first attempt to solve the "allow-connections in background workers" issue which has been raised on another thread. For now I'm documenting this better.

>> diff --git a/doc/src/sgml/wal.sgml b/doc/src/sgml/wal.sgml
>> index f4bc2d4161..89afecb341 100644
>> --- a/doc/src/sgml/wal.sgml
>> +++ b/doc/src/sgml/wal.sgml
>> @@ -230,6 +230,73 @@
> [...]
>> +  <para>
>> +   Checksums can be enabled or disabled online, by calling the appropriate
>> +   <link linkend="functions-admin-checksum">functions</link>.
>> +   Disabling of checksums takes effect immediately when the function is called.
>> +  </para>
>> +
>> +  <para>
>> +   Enabling checksums will put the cluster in <literal>inprogress</literal> mode.
>> +   During this time, checksums will be written but not verified. In addition to
>> +   this, a background worker process is started that enables checksums on all
>> +   existing data in the cluster. Once this worker has completed processing all
>> +   databases in the cluster, the checksum mode will automatically switch to
>> +   <literal>on</literal>.
>> +  </para>
>> +
>> +  <para>
>> +   If the cluster is stopped while in <literal>inprogress</literal> mode, for
>> +   any reason, then this process must be restarted manually. To do this,
>> +   re-execute the function <function>pg_enable_data_checksums()</function>
>> +   once the cluster has been restarted.
>> +  </para>

> Maybe document the above issue here, unless it is clear that the
> template0-needs-to-allow-connections issue will go away before the patch
> is pushed.

I have added a paragraph on allowing connections here, as well as a note that template0 will need to be handled.

>> +void
>> +SetDataChecksumsOn(void)
>> +{
>> +	if (!DataChecksumsEnabledOrInProgress())
>> +		elog(ERROR, "Checksums not enabled or in progress");
>> +
>> +	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
>> +
>> +	if (ControlFile->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_VERSION)
>> +	{
>> +		LWLockRelease(ControlFileLock);
>> +		elog(ERROR, "Checksums not in in_progress mode");

> The string used in "SHOW data_checksums" is "inprogress", not
> "in_progress".

Fixed.

>> @@ -7769,6 +7839,16 @@ StartupXLOG(void)
>> 	CompleteCommitTsInitialization();
>>
>> 	/*
>> +	 * If we reach this point with checksums in inprogress state, we notify
>> +	 * the user that they need to manually restart the process to enable
>> +	 * checksums.
>> +	 */
>> +	if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION)
>> +		ereport(WARNING,
>> +				(errmsg("checksum state is \"in progress\" with no worker"),
>> +				 errhint("Either disable or enable checksums by calling the pg_disable_data_checksums() or pg_enable_data_checksums() functions.")));

> Again, string is "inprogress", not "in progress", not sure if that
> matters.

I think it does, we need to be consistent in user-facing naming. Updated.
>> diff --git a/src/backend/postmaster/checksumhelper.c b/src/backend/postmaster/checksumhelper.c
>> new file mode 100644
>> index 0000000000..6aa71bcf30
>> --- /dev/null
>> +++ b/src/backend/postmaster/checksumhelper.c
>> @@ -0,0 +1,631 @@
>> +/*-------------------------------------------------------------------------
>> + *
>> + * checksumhelper.c
>> + *	  Backend worker to walk the database and write checksums to pages

> Backend or Background?

"Background" is the right term here, fixed.

>> +static bool
>> +ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
>> +{
>> +	BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum);
>> +	BlockNumber b;
>> +
>> +	for (b = 0; b < numblocks; b++)
>> +	{
>> +		Buffer		buf = ReadBufferExtended(reln, forkNum, b, RBM_NORMAL, strategy);
>> +		Page		page;
>> +		PageHeader	pagehdr;
>> +		uint16		checksum;
>> +
>> +		/* Need to get an exclusive lock before we can flag as dirty */
>> +		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);
>> +
>> +		/* Do we already have a valid checksum? */
>> +		page = BufferGetPage(buf);
>> +		pagehdr = (PageHeader) page;
>> +		checksum = pg_checksum_page((char *) page, b);

> This looks like it does not take the segment number into account;
> however it is also unclear to me what the purpose of this is, as
> checksum is never validated against the pagehdr, and nothing is done
> with it.
> I guess the above block running pg_checksum_page() is a leftover from
> previous versions of the patch and should be removed…

Correct, the checksum and pagehdr were used in the previous optimization to skip forced writes where the checksum was valid. That optimization was however based on a faulty assumption and was removed in the v3 patch. The leftover variables are now removed.

>> + * it. Get a fresh list of databases to detect the second case with.

> That sentence looks unfinished or at least is unclear to me.

Updated to indicate what we mean by "second case".

>> + * Any database that still exists but failed we retry for a limited

> I'm not a native speaker, but this looks wrong to me as well, maybe "We
> retry any database that still exists but failed for a limited [...]"?

Updated and extended a bit for clarity. Fixing this review comment also made me realize that checksumhelper.c was using %s for outputting the database name in errmsg(), but \"%s\" is what we commonly use. Updated all errmsg() invocations to quote the database name.

>> diff --git a/src/bin/pg_upgrade/controldata.c b/src/bin/pg_upgrade/controldata.c
>> index 0fe98a550e..4bb2b7e6ec 100644
>> --- a/src/bin/pg_upgrade/controldata.c
>> +++ b/src/bin/pg_upgrade/controldata.c
>> @@ -591,6 +591,15 @@ check_control_data(ControlData *oldctrl,
>> 	 */
>>
>> 	/*
>> +	 * If checksums have been turned on in the old cluster, but the
>> +	 * checksumhelper have yet to finish, then disallow upgrading. The user
>> +	 * should either let the process finish, or turn off checksums, before
>> +	 * retrying.
>> +	 */
>> +	if (oldctrl->data_checksum_version == 2)
>> +		pg_fatal("transition to data checksums not completed in old cluster\n");
>> +
>> 	/*
>> 	 * We might eventually allow upgrades from checksum to no-checksum
>> 	 * clusters.
>> 	 */

> See below about src/bin/pg_upgrade/pg_upgrade.h having
> data_checksum_version be a bool.
>
> I checked pg_upgrade (from master to master though), and could find no
> off-hand issues, i.e. it reported all issues correctly.
data_checksum_version is indeed defined as a bool, but rather than being an actual bool it's a bool backed by a char via a typedef in c.h. This is why it works to assign 0, 1 or 2 without issues. That being said, I agree that it reads odd now that the checksum version isn't just 0 or 1, which mentally translate neatly to false or true. Since this is the patch introducing version 2, I changed data_checksum_version to a uint32 as that makes the intent much clearer.

>> +check:
>> +	$(prove_check)
>> +
>> +installcheck:
>> +	$(prove_installcheck)

> If I run "make check" in src/bin/pg_verify_checksums, I get a fat perl
> error:

That's because the check and installcheck targets are copy-pasteos; there are no tests for pg_verify_checksums currently, so the targets should not be in the Makefile. Fixed.

>> +	if (DataDir == NULL)
>> +	{
>> +		if (optind < argc)
>> +			DataDir = argv[optind++];
>> +		else
>> +			DataDir = getenv("PGDATA");
>> +	}
>> +
>> +	/* Complain if any arguments remain */
>> +	if (optind < argc)
>> +	{
>> +		fprintf(stderr, _("%s: too many command-line arguments (first is \"%s\")\n"),
>> +				progname, argv[optind]);
>> +		fprintf(stderr, _("Try \"%s --help\" for more information.\n"),
>> +				progname);
>> +		exit(1);
>> +	}
>> +
>> +	if (DataDir == NULL)
>> +	{
>> +		fprintf(stderr, _("%s: no data directory specified\n"), progname);
>> +		fprintf(stderr, _("Try \"%s --help\" for more information.\n"), progname);
>> +		exit(1);
>> +	}

> Those two if (DataDir == NULL) checks could maybe be put together into
> one block.

Moved the check into the first block, as it makes the code clearer and doesn't change the order in which the error messages for missing datadir and too-many-arguments will be output.

>> diff --git a/src/include/postmaster/checksumhelper.h b/src/include/postmaster/checksumhelper.h
>> new file mode 100644
>> index 0000000000..7f296264a9
>> --- /dev/null
>> +++ b/src/include/postmaster/checksumhelper.h
>> @@ -0,0 +1,31 @@
>> +/*-------------------------------------------------------------------------
>> + *
>> + * checksumhelper.h
>> + *	  header file for checksum helper deamon

> "deamon" is surely wrong (it'd be "daemon"), but maybe "(background)
> worker" is better?

Yes, "background worker" is better, updated.

>> diff --git a/src/include/storage/bufpage.h b/src/include/storage/bufpage.h
>> index 85dd10c45a..bd46bf2ce6 100644
>> --- a/src/include/storage/bufpage.h
>> +++ b/src/include/storage/bufpage.h
>> @@ -194,6 +194,7 @@ typedef PageHeaderData *PageHeader;
>>   */
>> #define PG_PAGE_LAYOUT_VERSION		4
>> #define PG_DATA_CHECKSUM_VERSION	1
>> +#define PG_DATA_CHECKSUM_INPROGRESS_VERSION	2

> I am not very sure about the semantics of PG_DATA_CHECKSUM_VERSION being
> 1, but I assumed it was a version, like, if we ever decide to use a
> different checksumming algorithm, we'd bump it to 2.
>
> Now PG_DATA_CHECKSUM_INPROGRESS_VERSION is defined to 2, which I agree
> is convenient, but is there some strategy for what to do in case
> PG_DATA_CHECKSUM_VERSION needs to be increased?

I don't think this has been discussed in any thread dealing with enabling checksums in an online cluster where using the version was mentioned (or my archive search is failing me). If another algorithm were added, I assume we'd need something like the proposed checksumhelper to allow migrations from version 1. Just as we now use version 2 to indicate that we are going from version 0 to version 1, the same can be used for going from version 1 to 3, as long as there is a source and destination version recorded.
I don't know what such a process would look like, but I don't see that we are blocking a future new checksum algorithm by using version 2 for inprogress (although I agree that using the version here is closer to convenient than elegant).

Attached is v4 of this patch, which addresses this review; I'm flipping the status in the CF app back to Needs Review.

cheers ./daniel
Attachment
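To make the version/state mapping discussed above concrete, the control file field can now hold three values (constants from the quoted bufpage.h hunk; the predicate below is illustrative only, not the patch's actual function):

	/* From src/include/storage/bufpage.h, as patched */
	#define PG_DATA_CHECKSUM_VERSION			1
	#define PG_DATA_CHECKSUM_INPROGRESS_VERSION	2

	/*
	 * pg_control's data_checksum_version then encodes three states:
	 *	 0 - checksums are off
	 *	 1 - checksums are on (current algorithm)
	 *	 2 - checksums are being enabled: written but not verified
	 *
	 * Hypothetical helper, for illustration of the semantics:
	 */
	static inline bool
	checksums_in_progress(uint32 data_checksum_version)
	{
		return data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION;
	}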
Hi,

On Thu, Mar 15, 2018 at 02:01:26PM +0100, Daniel Gustafsson wrote:
> > On 10 Mar 2018, at 16:09, Michael Banck <michael.banck@credativ.de> wrote:
> > I am aware that this is discussed already, but as-is the user experience
> > is pretty bad, I think pg_enable_data_checksums() should either bail
> > earlier if it cannot connect to all databases, or it should be better
> > documented.
>
> Personally I think we should first attempt to solve the "allow-connections in
> background workers" issue which has been raised on another thread. For now I'm
> documenting this better.

I had a look at that thread and it seems stalled, I am a bit worried that this will not be solved before the end of the CF.

So I think unless the above gets solved, pg_enable_data_checksums() should error out with the hint. I've had a quick look and it seems one can partly duplicate the check from BuildDatabaseList() (or optionally move it) to the beginning of StartChecksumHelperLauncher(), see attached. That results in:

postgres=# SELECT pg_enable_data_checksums();
ERROR:  Database "template0" does not allow connections.
HINT:  Allow connections using ALTER DATABASE and try again.
postgres=#

Which I think is much nicer than what we have right now:

postgres=# SELECT pg_enable_data_checksums();
 pg_enable_data_checksums
--------------------------
 t
(1 row)

postgres=# \q
postgres@fock:~$ tail -3 pg1.log
2018-03-18 14:00:08.512 CET [25514] ERROR:  Database "template0" does not allow connections.
2018-03-18 14:00:08.512 CET [25514] HINT:  Allow connections using ALTER DATABASE and try again.
2018-03-18 14:00:08.513 CET [24930] LOG:  background worker "checksum helper launcher" (PID 25514) exited with exit code 1

> Attached is v4 of this patch, which addresses this review; I'm flipping the
> status in the CF app back to Needs Review.

Thanks!

The following errmsg() calls capitalize the error message without the first word being a specific term, which I believe is not project style:

+			(errmsg("Checksum helper is currently running, cannot disable checksums"),
+				(errmsg("Database \"%s\" dropped, skipping", db->dbname)));
+			(errmsg("Checksums enabled, checksumhelper launcher shutting down")));
+				(errmsg("Database \"%s\" does not allow connections.", NameStr(pgdb->datname)),
+			(errmsg("Checksum worker starting for database oid %d", dboid)));
+			(errmsg("Checksum worker completed in database oid %d", dboid)));

Also, in src/backend/postmaster/checksumhelper.c there are a few multi-line comments (which are not function comments) that do not have a full stop at the end, which I think is also project style:

+	 * Failed to set means somebody else started

Could be changed to a one-line (/* ... */) comment?

+	 * Force a checkpoint to get everything out to disk

Should have a full stop.

+	 * checksummed, so skip

Should have a full stop.

+	 * Enable vacuum cost delay, if any

Could be changed to a one-line comment?

+	 * Create and set the vacuum strategy as our buffer strategy

Could be changed to a one-line comment?

Apart from that, I previously complained about the error in pg_verify_checksums:

+			fprintf(stderr, _("%s: %s, block %d, invalid checksum in file %X, calculated %X\n"),
+					progname, fn, blockno, header->pd_checksum, csum);

I still propose something like in backend/storage/page/bufpage.c's PageIsVerified(), e.g.:

|fprintf(stderr, _("%s: checksum verification failed in file \"%s\", block %d: calculated checksum %X but expected %X\n"),
|		progname, fn, blockno, csum, header->pd_checksum);

Otherwise, I had a quick look over v4 and found no further issues.
Hopefully I will be able to test it on some bigger test databases next week. I'm switching the state back to 'Waiting on Author'; if you think the above points are moot, maybe switch it back to 'Needs Review' as Andrey Borodin also marked himself down as reviewer and might want to have another look as well. Cheers, Michael -- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Attachment
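The shape of the early check being proposed, for reference (a sketch modeled on the description above and on the error messages shown; the attached patch may differ in details). It runs in the backend calling pg_enable_data_checksums(), before the launcher is registered:

	/*
	 * Sketch: verify that every database accepts connections, so we can
	 * fail in the calling backend rather than later in the launcher.
	 */
	Relation	rel;
	HeapScanDesc scan;
	HeapTuple	tup;

	rel = heap_open(DatabaseRelationId, AccessShareLock);
	scan = heap_beginscan_catalog(rel, 0, NULL);

	while (HeapTupleIsValid(tup = heap_getnext(scan, ForwardScanDirection)))
	{
		Form_pg_database pgdb = (Form_pg_database) GETSTRUCT(tup);

		if (!pgdb->datallowconn)
			ereport(ERROR,
					(errmsg("database \"%s\" does not allow connections",
							NameStr(pgdb->datname)),
					 errhint("Allow connections using ALTER DATABASE and try again.")));
	}

	heap_endscan(scan);
	heap_close(rel, AccessShareLock);

As noted later in the thread, this is only a best-effort check: a database can still be altered to reject connections after the scan but before the worker reaches it.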
Hi!

> On 18 Mar 2018, at 19:02, Michael Banck <michael.banck@credativ.de> wrote:
>
> Otherwise, I had a quick look over v4 and found no further issues.
> Hopefully I will be able to test it on some bigger test databases next
> week.
>
> I'm switching the state back to 'Waiting on Author'; if you think the
> above points are moot, maybe switch it back to 'Needs Review' as Andrey
> Borodin also marked himself down as reviewer and might want to have
> another look as well.

Yep, I'm already doing another pass on the code again. Hope to finish tomorrow.

My 2 cents, there's a typo in the word "connections":

+     <literal>template0</literal> is by default not accepting connetions, to

Best regards, Andrey Borodin.
> On 18 Mar 2018, at 15:02, Michael Banck <michael.banck@credativ.de> wrote:
> On Thu, Mar 15, 2018 at 02:01:26PM +0100, Daniel Gustafsson wrote:
>>> On 10 Mar 2018, at 16:09, Michael Banck <michael.banck@credativ.de> wrote:
>>> I am aware that this is discussed already, but as-is the user experience
>>> is pretty bad, I think pg_enable_data_checksums() should either bail
>>> earlier if it cannot connect to all databases, or it should be better
>>> documented.
>>
>> Personally I think we should first attempt to solve the "allow-connections in
>> background workers" issue which has been raised on another thread. For now I'm
>> documenting this better.
>
> I had a look at that thread and it seems stalled, I am a bit worried
> that this will not be solved before the end of the CF.
>
> So I think unless the above gets solved, pg_enable_data_checksums()
> should error out with the hint. I've had a quick look and it seems one
> can partly duplicate the check from BuildDatabaseList() (or optionally
> move it) to the beginning of StartChecksumHelperLauncher(), see
> attached.

I've incorporated a slightly massaged version of your patch. While it is a little ugly to duplicate the logic, it's hopefully a short-term fix, and I agree that silently failing is even uglier. Thanks for the proposal!

It should be noted that a database may well be altered to not allow connections between when the checksum helper starts (or gets the database list) and when it tries to checksum it, so we might still fail on this very issue with the checks bypassed. Still, it improves the UX to catch the low-hanging fruit, of course.

> The following errmsg() calls capitalize the error message without the first
> word being a specific term, which I believe is not project style:

Fixed.

> Also, in src/backend/postmaster/checksumhelper.c there are a few
> multi-line comments (which are not function comments) that do not have a
> full stop at the end, which I think is also project style:

I've addressed all of these, but I did leave one as a multi-line comment which I think looks better as that. I also did some spellchecking and general tidying up of the error messages and comments in the checksum helper.

> Apart from that, I previously complained about the error in
> pg_verify_checksums:

Updated to your suggestion. While in there I re-wrote a few other error messages to be consistent in message and quoting.

> Otherwise, I had a quick look over v4 and found no further issues.
> Hopefully I will be able to test it on some bigger test databases next
> week.

Thanks a lot for your reviews!

cheers ./daniel
Attachment
> On 18 Mar 2018, at 17:21, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
>> On 18 Mar 2018, at 19:02, Michael Banck <michael.banck@credativ.de> wrote:
>>
>> Otherwise, I had a quick look over v4 and found no further issues.
>> Hopefully I will be able to test it on some bigger test databases next
>> week.
>>
>> I'm switching the state back to 'Waiting on Author'; if you think the
>> above points are moot, maybe switch it back to 'Needs Review' as Andrey
>> Borodin also marked himself down as reviewer and might want to have
>> another look as well.
>
> Yep, I'm already doing another pass on the code again. Hope to finish tomorrow.
> My 2 cents, there's a typo in the word "connections":
> +     <literal>template0</literal> is by default not accepting connetions, to

Fixed in patch just posted in 84693D0C-772F-45C2-88A1-85B4983A5780@yesql.se (version 5). Thanks!

cheers ./daniel
Hi,

The patch looks good to me at a high level. Some notes below. I didn't read through the whole thread, so sorry if some of these have been discussed already.

> +void
> +ShutdownChecksumHelperIfRunning(void)
> +{
> +	if (pg_atomic_unlocked_test_flag(&ChecksumHelperShmem->launcher_started))
> +	{
> +		/* Launcher not started, so nothing to shut down */
> +		return;
> +	}
> +
> +	ereport(ERROR,
> +			(errmsg("checksum helper is currently running, cannot disable checksums"),
> +			 errhint("Restart the cluster or wait for the worker to finish.")));
> +}

Is there no way to stop the checksum helper once it's started? That seems rather user-unfriendly. I can imagine it being a pretty common mistake to call pg_enable_data_checksums() on a 10 TB cluster, only to realize that you forgot to set the cost limit, and that it's hurting queries too much. At that point, you want to abort.

> + * This is intended to create the worklist for the workers to go through, and
> + * as we are only concerned with already existing databases we need to ever
> + * rebuild this list, which simplifies the coding.

I can't parse this sentence.

> +extern bool DataChecksumsEnabledOrInProgress(void);
> +extern bool DataChecksumsInProgress(void);
> +extern bool DataChecksumsDisabled(void);

I find the name of the DataChecksumsEnabledOrInProgress() function a bit long. And doing this in PageIsVerified looks a bit weird:

> if (DataChecksumsEnabledOrInProgress() && !DataChecksumsInProgress())

I think I'd prefer functions like:

/* checksums should be computed on write? */
bool DataChecksumNeedWrite()
/* checksum should be verified on read? */
bool DataChecksumNeedVerify()

> +	<literal>template0</literal> is by default not accepting connections, to
> +	enable checksums you'll need to temporarily make it accept connections.

This was already discussed, and I agree with the other comments that it's bad.

> +CREATE OR REPLACE FUNCTION pg_enable_data_checksums (
> +    cost_delay int DEFAULT 0, cost_limit int DEFAULT 100)
> +  RETURNS boolean STRICT VOLATILE LANGUAGE internal AS 'enable_data_checksums'
> +  PARALLEL RESTRICTED;

pg_[enable|disable]_checksums() functions return a bool. It's not clear to me when they would return what. I'd suggest marking them as 'void' instead.

> --- a/src/include/storage/bufpage.h
> +++ b/src/include/storage/bufpage.h
> @@ -194,6 +194,7 @@ typedef PageHeaderData *PageHeader;
>   */
>  #define PG_PAGE_LAYOUT_VERSION		4
>  #define PG_DATA_CHECKSUM_VERSION	1
> +#define PG_DATA_CHECKSUM_INPROGRESS_VERSION	2

This seems like a weird place for these PG_DATA_CHECKSUM_* constants. They're not actually stored in the page header, as you might think. I think the idea was to keep PG_DATA_CHECKSUM_VERSION close to PG_PAGE_LAYOUT_VERSION, because the checksums affected the on-disk format. I think it was a bit weird even before this patch, but now it's worse. At least a better comment would be in order, or maybe move these elsewhere.

Feature request: it'd be nice to update the 'ps status' with set_ps_display, to show a rudimentary progress indicator. Like the name and block number of the relation being processed. It won't tell you how much is left to go, but at least it will give a warm fuzzy feeling to the DBA that something is happening.

I didn't review the offline tool, pg_verify_checksums.

- Heikki
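To make the suggestion concrete, the two predicates could be implemented along these lines (a sketch only: the function names are Heikki's proposal rather than the patch's, and the locking the patch uses around ControlFile accesses is elided here):

	bool
	DataChecksumNeedWrite(void)
	{
		/* write checksums while enabling them, as well as when fully enabled */
		return (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION ||
				ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION);
	}

	bool
	DataChecksumNeedVerify(void)
	{
		/* verify on read only once every page is known to carry a checksum */
		return ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION;
	}

The awkward test in PageIsVerified then becomes a single DataChecksumNeedVerify() call, which reads as intended.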
Hi, Daniel!

> On 19 Mar 2018, at 4:01, Daniel Gustafsson <daniel@yesql.se> wrote:
>
> Fixed in patch just posted in 84693D0C-772F-45C2-88A1-85B4983A5780@yesql.se
> (version 5). Thanks!

I've been hacking a bit in a neighboring thread, and came across one interesting thing. There was a patch in this CF on enabling checksums for SLRU. The thing is, CLOG is not protected with checksums right now. But the bad thing about it is that there's no reserved place for checksums in SLRU, and this conversion from a page without a checksum to a page with a checksum is quite impossible online.

If we commit online checksums before SLRU checksums, we will need very neat hacks if we decide to protect SLRU eventually.

What do you think about this problem?

Best regards, Andrey Borodin.
On Mon, Mar 19, 2018 at 11:40 AM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> If we commit online checksums before SLRU checksums, we will need very neat
> hacks if we decide to protect SLRU eventually.
> What do you think about this problem?

One would be adjusted to work with the other, yes. It makes no sense to now allow online enabling once SLRU protection is in there, and it doesn't make sense for either of these patches to be blocking the other one for commit, though it would of course be best to get both included.

Makes no sense to *not* allow it, of course. Meaning yes, that should be handled. We don't need to convert from "page format with no support for checksums" (pre-11) to "page format with support for checksums" (11+) online. We do need to convert from "page format with support for checksums but no checksums enabled" (11+) to "checksums enabled" online.
Hi!

> On 19 Mar 2018, at 11:27, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> Is there no way to stop the checksum helper once it's started? That seems
> rather user-unfriendly. I can imagine it being a pretty common mistake to
> call pg_enable_data_checksums() on a 10 TB cluster, only to realize that you
> forgot to set the cost limit, and that it's hurting queries too much. At
> that point, you want to abort.

I've tried to pg_cancel_backend() and it worked. But only if I cancel the "checksum helper launcher" and then the "checksum helper worker". If I cancel the helper first, it spawns a new one. Magnus, Daniel, is it safe to cancel the worker or launcher?

BTW, I have some questions on pg_verify_checksums. It does not check the catalog version. Is it true that it will work for any? Also, pg_verify_checksums will stop on short reads. But we do not stop on a wrong checksum, so maybe we should not stop on short reads either?

I agree with all of Heikki's comments. Besides these I have no other questions, patch looks good.

Best regards, Andrey Borodin.
> I've tried to pg_cancel_backend() and it worked. But only if I cancel the
> "checksum helper launcher" and then the "checksum helper worker". If I
> cancel the helper first, it spawns a new one.
> Magnus, Daniel, is it safe to cancel the worker or launcher?

Daniel is working on a proper way to stop things.

> BTW, I have some questions on pg_verify_checksums.
> It does not check the catalog version. Is it true that it will work for any?
> Also, pg_verify_checksums will stop on short reads. But we do not stop on a
> wrong checksum, so maybe we should not stop on short reads either?
Hi,

I see enable_data_checksums() does this:

	if (cost_limit <= 0)
		ereport(ERROR,
				(errmsg("cost limit must be a positive value")));

Is there a reason not to allow -1 (no limit), just like for vacuum_cost?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03/27/2018 08:56 AM, Magnus Hagander wrote:
> On Mon, Mar 26, 2018 at 10:09 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
>     Hi,
>
>     I see enable_data_checksums() does this:
>
>         if (cost_limit <= 0)
>             ereport(ERROR,
>                     (errmsg("cost limit must be a positive value")));
>
>     Is there a reason not to allow -1 (no limit), just like for vacuum_cost?
>
> Eh. vacuum_cost_limit cannot be set to -1 (1 is the lowest). Neither can
> vacuum_cost_delay -- it is set to *0* to disable it (which is how the
> cost_delay parameter is handled here as well).
>
> Are you thinking of autovacuum_vacuum_cost_limit, where -1 means "use
> vacuum_cost_limit"?
>
> The reason to disallow cost_limit=0 is to avoid divide-by-zero. We could
> allow -1 and have it mean "use vacuum_cost_limit", but I'm not sure how
> relevant that really would be in this context?

D'oh! You're right, of course.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
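For background, cost_limit ends up as a divisor in vacuum-style throttling, which is why 0 cannot be allowed. A sketch of what the delay logic looks like (hypothetical helper mirroring vacuum_delay_point()'s arithmetic; not the patch's actual code):

	static void
	checksum_delay_point(int cost_delay, int cost_limit)
	{
		/* nap once the accumulated cost balance exceeds the limit */
		if (cost_delay > 0 && VacuumCostBalance >= cost_limit)
		{
			int		msec;

			/* cost_limit is the divisor here, hence the > 0 requirement */
			msec = cost_delay * VacuumCostBalance / cost_limit;
			if (msec > cost_delay * 4)
				msec = cost_delay * 4;

			pg_usleep(msec * 1000L);
			VacuumCostBalance = 0;
		}
	}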
Hi, I've just noticed the patch does not update src/backend/storage/page/README which is in fact about checksums. Most of it remains valid, but it also mentions that currently it's an initdb-time choice. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

I've been looking at the patch a bit more, and I think there are a couple of fairly serious issues in the error handling.

Firstly, ChecksumHelperLauncherMain spends quite a bit of effort on skipping dropped databases, but ChecksumHelperWorkerMain does not do the same thing with tables. I'm not exactly sure why, but I'd say dropped tables are more likely than dropped databases (e.g. because of temporary tables) and it's strange to gracefully handle the rarer case.

Now, when a table gets dropped after BuildRelationList() does its work, we end up calling ProcessSingleRelationByOid() on that OID. Which calls relation_open(), which fails with elog(ERROR), terminating the whole bgworker with an error like this:

	ERROR:  could not open relation with OID 16632
	LOG:  background worker "checksumhelper worker" (PID 27152) exited
	      with exit code 1

Which however means the error handling in ChecksumHelperWorkerMain() has no chance to kick in, because the bgworker dies right away. The code looks like this:

	foreach(lc, RelationList)
	{
		ChecksumHelperRelation *rel
			= (ChecksumHelperRelation *) lfirst(lc);

		if (!ProcessSingleRelationByOid(rel->reloid, strategy))
		{
			ChecksumHelperShmem->success = ABORTED;
			break;
		}
		else
			ChecksumHelperShmem->success = SUCCESSFUL;
	}

	list_free_deep(RelationList);

Now, assume the first relation in the list still exists and gets processed correctly, so "success" ends up being SUCCESSFUL. Then the second OID is the dropped relation, which kills the bgworker ... The launcher however does not realize anything went wrong, because the flag still says SUCCESSFUL. And so it merrily switches checksums to "on", leading to this on the rest of the relations:

	WARNING:  page verification failed, calculated checksum 58644 but
	          expected 0
	ERROR:  invalid page in block 0 of relation base/16631/16653

Yikes!

IMHO this error handling is broken by design - two things need to happen, I think: (a) graceful handling of dropped relations and (b) proper error reporting from the bgworker.

(a) Should not be difficult to do, I think. We don't have relation_open with a missing_ok flag, but implementing something like that should not be difficult. Even a simple "does OID exist" should be enough.

(b) But just handling dropped relations is not enough, because I could simply kill the bgworker directly, and it would have exactly the same consequences. What needs to happen is something like this:

	ChecksumHelperResult local_success = SUCCESSFUL;

	foreach(lc, RelationList)
	{
		ChecksumHelperRelation *rel
			= (ChecksumHelperRelation *) lfirst(lc);

		if (!ProcessSingleRelationByOid(rel->reloid, strategy))
		{
			local_success = ABORTED;
			break;
		}
	}

	list_free_deep(RelationList);

	ChecksumHelperShmem->success = local_success;

That is, leave the flag in shared memory set to FAILED until the very last moment, and only when everything went fine set it to SUCCESSFUL.

BTW I don't think handling dropped relations by letting the bgworker crash and restart is an acceptable approach. That would pretty much mean any DDL changes are prohibited on the system while the checksum process is running, which is not quite possible (e.g. for systems doing stuff with temporary tables).

Which however reminds me I've also run into a bug in the automated retry system, because you may get messages like this:

	ERROR:  failed to enable checksums in "test", giving up (attempts
	639968292).

This happens because BuildDatabaseList() does just palloc() and does not initialize the 'attempts' field. It may get initialized to 0 by chance, but I'm running with -DRANDOMIZE_ALLOCATED_MEMORY, hence the insanely high value.

BTW both ChecksumHelperRelation and ChecksumHelperDatabase have a 'success' field which is actually unused (and uninitialized).

But wait - there is more ;-) BuildRelationList is using heap_beginscan with the regular snapshot, so it does not see uncommitted transactions. So if you do this:

	BEGIN;
	CREATE TABLE t AS SELECT i FROM generate_series(1,10000000) s(i);
	-- run pg_enable_data_checksums() from another session
	SELECT COUNT(*) FROM t;

then the table will be invisible to the checksum worker, it won't have checksums updated and the cluster will get checksums enabled. Which means this:

	test=# SELECT COUNT(*) FROM t;
	WARNING:  page verification failed, calculated checksum 27170 but
	          expected 0
	ERROR:  invalid page in block 0 of relation base/16677/16683

Not sure what's the best way to fix this - maybe we could wait for all running transactions to end before starting the work.

And if you try this with a temporary table (not hidden in a transaction, so the bgworker can see it), the worker will fail with this:

	ERROR:  cannot access temporary tables of other sessions

But of course, this is just another way to crash without updating the result for the launcher, so checksums may end up being enabled anyway. Not great, I guess :-(

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
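A sketch of what the graceful variant of (a) could look like with try_relation_open(), which returns NULL instead of erroring out when the OID has disappeared (illustrative body only; the patch's ProcessSingleRelationByOid() will differ in its details):

	static bool
	ProcessSingleRelationByOid(Oid relationId, BufferAccessStrategy strategy)
	{
		Relation	rel;

		/* NULL instead of elog(ERROR) if the relation is gone */
		rel = try_relation_open(relationId, AccessShareLock);
		if (rel == NULL)
		{
			/* Relation was dropped since we built the list: nothing to do */
			return true;
		}

		/* ... checksum all forks of the relation here ... */

		relation_close(rel, AccessShareLock);
		return true;
	}

With this, a concurrent DROP TABLE no longer kills the worker, and (b) then ensures a killed worker can never be mistaken for a successful one.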
Hi,

On 03/31/2018 02:02 PM, Magnus Hagander wrote:
> On Sat, Mar 31, 2018 at 2:08 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
> > ...
> >
> > (a) Should not be difficult to do, I think. We don't have relation_open
> > with a missing_ok flag, but implementing something like that should not
> > be difficult. Even a simple "does OID exist" should be enough.
>
> Not entirely sure what you mean with "even a simple does oid exist"? I
> mean, if we check for the file, that won't help us -- there will still
> be a tiny race between the check and us opening it, won't it?

I meant to say "even a simple check if the OID still exists", but it was a too-late / not-enough-caffeine issue. You're right there would be a tiny window of race condition - it'd be much shorter, possibly enough to make the error+restart approach acceptable.

> However, we have try_relation_open(). Which is documented as:
>  * Same as relation_open, except return NULL instead of failing
>  * if the relation does not exist.
>
> So I'm pretty sure it's just a matter of using try_relation_open()
> instead, and checking for NULL?

Oh, right. I thought we had a relation_open variant that handles this, but I had been looking for one with a missing_ok flag and so I missed this. try_relation_open should do the trick when it comes to dropped tables.

> > (b) But just handling dropped relations is not enough, because I could
> > simply kill the bgworker directly, and it would have exactly the same
> > consequences. What needs to happen is something like this:
> >
> > <snip>
>
> And now I see your code, which was below-fold when I first read. After
> having written a very similar fix myself. I'm glad this code looks
> mostly identical to what I suggested above, so I think that's a good
> solution.

Good ;-)

> > BTW I don't think handling dropped relations by letting the bgworker
> > crash and restart is an acceptable approach. That would pretty much mean
> > any DDL changes are prohibited on the system while the checksum process
> > is running, which is not quite possible (e.g. for systems doing stuff
> > with temporary tables).
>
> No, I don't like that at all. We need to handle them gracefully, by
> skipping them - crash and restart is not acceptable for something that
> common.

Yeah, I was just thinking aloud.

> > Which however reminds me I've also run into a bug in the automated retry
> > system, because you may get messages like this:
> >
> >     ERROR:  failed to enable checksums in "test", giving up (attempts
> >     639968292).
> >
> > This happens because BuildDatabaseList() does just palloc() and does not
> > initialize the 'attempts' field. It may get initialized to 0 by chance,
> > but I'm running with -DRANDOMIZE_ALLOCATED_MEMORY, hence the insanely
> > high value.
>
> Eh. I don't have that "(attempts" part in my code at all. Is that either
> from some earlier version of the patch, or did you add that yourself for
> testing?

Apologies, you're right, I tweaked the message a bit (just adding the number of attempts to it). The logic however remains the same, and the bug is real.

> > But wait - there is more ;-) BuildRelationList is using heap_beginscan
> > with the regular snapshot, so it does not see uncommitted transactions.
> > So if you do this:
> >
> >     BEGIN;
> >     CREATE TABLE t AS SELECT i FROM generate_series(1,10000000) s(i);
> >     -- run pg_enable_data_checksums() from another session
> >     SELECT COUNT(*) FROM t;
> >
> > then the table will be invisible to the checksum worker, it won't have
> > checksums updated and the cluster will get checksums enabled. Which
> > means this:
>
> Ugh. Interestingly enough I just put that on my TODO list *yesterday*
> that I forgot to check that specific case :/

But I was faster in reporting it ;-)

> >     test=# SELECT COUNT(*) FROM t;
> >     WARNING:  page verification failed, calculated checksum 27170 but
> >               expected 0
> >     ERROR:  invalid page in block 0 of relation base/16677/16683
> >
> > Not sure what's the best way to fix this - maybe we could wait for all
> > running transactions to end before starting the work.
>
> I was thinking of that as one way to deal with it, yes.
>
> I guess a reasonable way to do that would be to do it as part of
> BuildRelationList() -- basically have that one wait until there is no
> other running transaction in that specific database before we take the
> snapshot?
>
> A first thought I had was to try to just take an access exclusive lock
> on pg_class for a very short time, but a transaction that does create
> table doesn't actually keep its lock on that table so there is no conflict.

Yeah, I don't think that's going to work, because you don't even know you need to lock/wait for something.

I do think just waiting for all running transactions to complete is fine, and it's not the first place where we use it - CREATE SUBSCRIPTION does pretty much exactly the same thing (and CREATE INDEX CONCURRENTLY too, to some extent). So we have a precedent / working code we can copy.

> > And if you try this with a temporary table (not hidden in a transaction,
> > so the bgworker can see it), the worker will fail with this:
> >
> >     ERROR:  cannot access temporary tables of other sessions
> >
> > But of course, this is just another way to crash without updating
> > the result for the launcher, so checksums may end up being enabled
> > anyway.
>
> Yeah, there will be plenty of side-effect issues from that
> crash-with-wrong-status case. Fixing that will at least make things
> safer -- in that checksums won't be enabled when not put on all pages.

Sure, the outcome with checksums enabled incorrectly is a consequence of bogus status, and fixing that will prevent that. But that wasn't my main point here - not articulated very clearly, though.

The bigger question is how to handle temporary tables gracefully, so that it does not terminate the bgworker like this at all. This might be an even bigger issue than dropped relations, considering that temporary tables are a pretty common part of applications (and it also includes CREATE/DROP).

For some clusters it might mean the online checksum enabling would crash+restart infinitely (well, until reaching MAX_ATTEMPTS).

Unfortunately, try_relation_open() won't fix this, as the error comes from ReadBufferExtended. And it's not a matter of simply creating a ReadBuffer variant without that error check, because temporary tables use local buffers.

I wonder if we could just go and set the checksums anyway, ignoring the local buffers. If the other session does some changes, it'll overwrite our changes, this time with the correct checksums. But it seems pretty dangerous (I mean, what if they're writing stuff while we're updating the checksums? Considering the various short-cuts for temporary tables, I suspect that would be a boon for race conditions.)

Another option would be to do something similar to running transactions, i.e. wait until all temporary tables (that we've seen at the beginning) disappear. But we're starting to wait on more and more stuff.

If we do this, we should clearly log which backends we're waiting for, so that the admins can go and interrupt them manually.

> I have attached a patch that fixes the "easy" ones per your first
> comments. No solution for the open-transaction issue yet, but I wanted
> to put the rest out there -- especially if you have automated tests you
> can send it through.

I don't have automated tests, but I'll take a look.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
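For reference, the precedent boils down to something like this (a sketch assuming the launcher adapts the CREATE INDEX CONCURRENTLY-style wait; the variable names are made up for illustration):

	VirtualTransactionId *old_xacts;
	int			n_old_xacts;
	int			i;

	/*
	 * Collect the virtual xids of all transactions currently running,
	 * across all databases since the launcher covers the whole cluster,
	 * then block on each one in turn.  Anything that starts after this
	 * point already runs with checksums being written.
	 */
	old_xacts = GetCurrentVirtualXIDs(InvalidTransactionId,
									  true,		/* excludeXmin0 */
									  true,		/* allDbs */
									  0,		/* excludeVacuum */
									  &n_old_xacts);

	for (i = 0; i < n_old_xacts; i++)
		VirtualXactLock(old_xacts[i], true);	/* wait for it to finish */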
On 03/31/2018 05:05 PM, Magnus Hagander wrote:
> On Sat, Mar 31, 2018 at 4:21 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
> > ...
> >
> > I do think just waiting for all running transactions to complete is
> > fine, and it's not the first place where we use it - CREATE SUBSCRIPTION
> > does pretty much exactly the same thing (and CREATE INDEX CONCURRENTLY
> > too, to some extent). So we have a precedent / working code we can copy.
>
> Thinking again, I don't think it should be done as part of
> BuildRelationList(). We should just do it once in the launcher before
> starting, that'll be both easier and cleaner. Anything started after
> that will have checksums on it, so we should be fine.
>
> PFA one that does this.

Seems fine to me. I'd however log waitforxid, not the oldest one. If you're a DBA and you want to make the checksumming proceed, knowing the oldest running XID is useless for that. If we log waitforxid, it can be used to query pg_stat_activity and interrupt the sessions somehow.

> > > > And if you try this with a temporary table (not hidden in a transaction,
> > > > so the bgworker can see it), the worker will fail with this:
> > > >
> > > >     ERROR:  cannot access temporary tables of other sessions
> > > >
> > > > But of course, this is just another way to crash without updating
> > > > the result for the launcher, so checksums may end up being enabled
> > > > anyway.
> > >
> > > Yeah, there will be plenty of side-effect issues from that
> > > crash-with-wrong-status case. Fixing that will at least make things
> > > safer -- in that checksums won't be enabled when not put on all pages.
> >
> > Sure, the outcome with checksums enabled incorrectly is a consequence of
> > bogus status, and fixing that will prevent that. But that wasn't my main
> > point here - not articulated very clearly, though.
> >
> > The bigger question is how to handle temporary tables gracefully, so
> > that it does not terminate the bgworker like this at all. This might be
> > an even bigger issue than dropped relations, considering that temporary
> > tables are a pretty common part of applications (and it also includes
> > CREATE/DROP).
> >
> > For some clusters it might mean the online checksum enabling would
> > crash+restart infinitely (well, until reaching MAX_ATTEMPTS).
> >
> > Unfortunately, try_relation_open() won't fix this, as the error comes
> > from ReadBufferExtended. And it's not a matter of simply creating a
> > ReadBuffer variant without that error check, because temporary tables
> > use local buffers.
> >
> > I wonder if we could just go and set the checksums anyway, ignoring the
> > local buffers. If the other session does some changes, it'll overwrite
> > our changes, this time with the correct checksums. But it seems pretty
> > dangerous (I mean, what if they're writing stuff while we're updating
> > the checksums? Considering the various short-cuts for temporary tables,
> > I suspect that would be a boon for race conditions.)
> >
> > Another option would be to do something similar to running transactions,
> > i.e. wait until all temporary tables (that we've seen at the beginning)
> > disappear. But we're starting to wait on more and more stuff.
> >
> > If we do this, we should clearly log which backends we're waiting for,
> > so that the admins can go and interrupt them manually.
>
> Yeah, waiting for all transactions at the beginning is pretty simple.
>
> Making the worker simply ignore temporary tables would also be easy.
>
> One of the bigger issues here is temporary tables are *session* scope
> and not transaction, so we'd actually need the other session to finish,
> not just the transaction.
>
> I guess what we could do is something like this:
>
> 1. Don't process temporary tables in the checksumworker, period.
> Instead, build a list of any temporary tables that existed when the
> worker started in this particular database (basically anything that we
> got in our scan). Once we have processed the complete database, keep
> re-scanning pg_class until those particular tables are gone (search by oid).
>
> That means that any temporary tables that are created *while* we are
> processing a database are ignored, but they should already be receiving
> checksums.
>
> It definitely leads to a potential issue with long-running temp tables.
> But as long as we look at the *actual tables* (by oid), we should be
> able to handle long-running sessions once they have dropped their temp
> tables.
>
> Does that sound workable to you?

Yes, that's pretty much what I meant by 'wait until all temporary tables disappear'. Again, we need to make it easy to determine which OIDs we are waiting for, which sessions may need the DBA's attention.

I don't think it makes sense to log the OIDs of the temporary tables. There can be many of them, and in most cases the connection/session is managed by the application, so the only thing you can do is kill the connection.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 03/31/2018 05:05 PM, Magnus Hagander wrote:
> On Sat, Mar 31, 2018 at 4:21 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
> ...
>
> I do think just waiting for all running transactions to complete is
> fine, and it's not the first place where we use it - CREATE SUBSCRIPTION
> does pretty much exactly the same thing (and CREATE INDEX CONCURRENTLY
> too, to some extent). So we have a precedent / working code we can copy.
>
>
> Thinking again, I don't think it should be done as part of
> BuildRelationList(). We should just do it once in the launcher before
> starting, that'll be both easier and cleaner. Anything started after
> that will have checksums on it, so we should be fine.
>
> PFA one that does this.
>
Seems fine to me. I'd however log waitforxid, not the oldest one. If
you're a DBA and you want to make the checksumming proceed, knowing
the oldest running XID is useless for that. If we log waitforxid, it can
be used to query pg_stat_activity and interrupt the sessions somehow.
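For illustration, that wait might look roughly like this -- a minimal
sketch, not the patch's actual code (GetOldestActiveTransactionId() and
ReadNewTransactionId() are existing backend functions; the log message
wording is invented here):

/*
 * Wait for all transactions that started before us to finish, logging
 * the xid we wait for so a DBA can find the blocking sessions in
 * pg_stat_activity (backend_xid / backend_xmin).
 */
TransactionId waitforxid = ReadNewTransactionId();

for (;;)
{
	TransactionId oldestxid = GetOldestActiveTransactionId();

	if (!TransactionIdPrecedes(oldestxid, waitforxid))
		break;

	ereport(LOG,
			(errmsg("checksumhelper waiting for transactions older than %u to finish",
					waitforxid)));
	pg_usleep(5000000L);		/* recheck every 5 seconds */
}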
> > And if you try this with a temporary table (not hidden in transaction,
> > so the bgworker can see it), the worker will fail with this:
> >
> > ERROR: cannot access temporary tables of other sessions
> >
> > But of course, this is just another way how to crash without updating
> > the result for the launcher, so checksums may end up being enabled
> > anyway.
> >
> >
> > Yeah, there will be plenty of side-effect issues from that
> > crash-with-wrong-status case. Fixing that will at least make things
> > safer -- in that checksums won't be enabled when not put on all pages.
> >
>
> Sure, the outcome with checksums enabled incorrectly is a consequence of
> bogus status, and fixing that will prevent that. But that wasn't my main
> point here - not articulated very clearly, though.
>
> The bigger question is how to handle temporary tables gracefully, so
> that it does not terminate the bgworker like this at all. This might be
> even bigger issue than dropped relations, considering that temporary
> tables are pretty common part of applications (and it also includes
> CREATE/DROP).
>
> For some clusters it might mean the online checksum enabling would
> crash+restart infinitely (well, until reaching MAX_ATTEMPTS).
>
> Unfortunately, try_relation_open() won't fix this, as the error comes
> from ReadBufferExtended. And it's not a matter of simply creating a
> ReadBuffer variant without that error check, because temporary tables
> use local buffers.
>
> I wonder if we could just go and set the checksums anyway, ignoring the
> local buffers. If the other session does some changes, it'll overwrite
> our changes, this time with the correct checksums. But it seems pretty
> dangerous (I mean, what if they're writing stuff while we're updating
> the checksums? Considering the various short-cuts for temporary tables,
> I suspect that would be a boon for race conditions.)
>
> Another option would be to do something similar to running transactions,
> i.e. wait until all temporary tables (that we've seen at the beginning)
> disappear. But we're starting to wait on more and more stuff.
>
> If we do this, we should clearly log which backends we're waiting for,
> so that the admins can go and interrupt them manually.
>
>
>
> Yeah, waiting for all transactions at the beginning is pretty simple.
>
> Making the worker simply ignore temporary tables would also be easy.
>
> One of the bigger issues here is temporary tables are *session* scope
> and not transaction, so we'd actually need the other session to finish,
> not just the transaction.
>
> I guess what we could do is something like this:
>
> 1. Don't process temporary tables in the checksumworker, period.
> Instead, build a list of any temporary tables that existed when the
> worker started in this particular database (basically anything that we
> got in our scan). Once we have processed the complete database, keep
> re-scanning pg_class until those particular tables are gone (search by oid).
>
> That means that any temporary tables that are created *while* we are
> processing a database are ignored, but they should already be receiving
> checksums.
>
> It definitely leads to a potential issue with long running temp tables.
> But as long as we look at the *actual tables* (by oid), we should be
> able to handle long-running sessions once they have dropped their temp
> tables.
>
> Does that sound workable to you?
>
Yes, that's pretty much what I meant by 'wait until all temporary tables
disappear'. Again, we need to make it easy to determine which OIDs we are
waiting for, and which sessions may need the DBA's attention.
I don't think it makes sense to log OIDs of the temporary tables. There
can be many of them, and in most cases the connection/session is managed
by the application, so the only thing you can do is kill the connection.
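To make the proposed wait concrete, here is a minimal sketch of what the
per-database worker could do after finishing its scan -- the function
name, list name and naptime are all invented for illustration, and it
assumes the loop runs inside a transaction so the syscache is usable:

/*
 * Wait until every temp table OID that existed when our scan of this
 * database started has disappeared from pg_class, logging only the
 * count of remaining tables, per the discussion above.
 */
static void
WaitForInitialTempTables(List *initial_temp_oids)
{
	while (initial_temp_oids != NIL)
	{
		List	   *still_alive = NIL;
		ListCell   *lc;

		foreach(lc, initial_temp_oids)
		{
			Oid			relid = lfirst_oid(lc);

			/* gone from the syscache once the owning session drops it */
			if (SearchSysCacheExists1(RELOID, ObjectIdGetDatum(relid)))
				still_alive = lappend_oid(still_alive, relid);
		}

		list_free(initial_temp_oids);
		initial_temp_oids = still_alive;

		if (initial_temp_oids != NIL)
		{
			ereport(LOG,
					(errmsg("checksumhelper waiting for %d temporary tables to be dropped",
							list_length(initial_temp_oids))));
			pg_usleep(5000000L);	/* recheck every 5 seconds */
		}
	}
}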
Attachment
On Sat, Mar 31, 2018 at 5:38 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 03/31/2018 05:05 PM, Magnus Hagander wrote:
> On Sat, Mar 31, 2018 at 4:21 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
> ...
>
> I do think just waiting for all running transactions to complete is
> fine, and it's not the first place where we use it - CREATE SUBSCRIPTION
> does pretty much exactly the same thing (and CREATE INDEX CONCURRENTLY
> too, to some extent). So we have a precedent / working code we can copy.
>
>
> Thinking again, I don't think it should be done as part of
> BuildRelationList(). We should just do it once in the launcher before
> starting, that'll be both easier and cleaner. Anything started after
> that will have checksums on it, so we should be fine.
>
> PFA one that does this.
>
Seems fine to me. I'd however log waitforxid, not the oldest one. If
you're a DBA and you want to make the checksumming proceed, knowing
the oldest running XID is useless for that. If we log waitforxid, it can
be used to query pg_stat_activity and interrupt the sessions somehow.
Yeah, makes sense. Updated.
> > And if you try this with a temporary table (not hidden in transaction,
> > so the bgworker can see it), the worker will fail with this:
> >
> > ERROR: cannot access temporary tables of other sessions
> >
> > But of course, this is just another way how to crash without updating
> > the result for the launcher, so checksums may end up being enabled
> > anyway.
> >
> >
> > Yeah, there will be plenty of side-effect issues from that
> > crash-with-wrong-status case. Fixing that will at least make things
> > safer -- in that checksums won't be enabled when not put on all pages.
> >
>
> Sure, the outcome with checksums enabled incorrectly is a consequence of
> bogus status, and fixing that will prevent that. But that wasn't my main
> point here - not articulated very clearly, though.
>
> The bigger question is how to handle temporary tables gracefully, so
> that it does not terminate the bgworker like this at all. This might be
> even bigger issue than dropped relations, considering that temporary
> tables are pretty common part of applications (and it also includes
> CREATE/DROP).
>
> For some clusters it might mean the online checksum enabling would
> crash+restart infinitely (well, until reaching MAX_ATTEMPTS).
>
> Unfortunately, try_relation_open() won't fix this, as the error comes
> from ReadBufferExtended. And it's not a matter of simply creating a
> ReadBuffer variant without that error check, because temporary tables
> use local buffers.
>
> I wonder if we could just go and set the checksums anyway, ignoring the
> local buffers. If the other session does some changes, it'll overwrite
> our changes, this time with the correct checksums. But it seems pretty
> dangerous (I mean, what if they're writing stuff while we're updating
> the checksums? Considering the various short-cuts for temporary tables,
> I suspect that would be a boon for race conditions.)
>
> Another option would be to do something similar to running transactions,
> i.e. wait until all temporary tables (that we've seen at the beginning)
> disappear. But we're starting to wait on more and more stuff.
>
> If we do this, we should clearly log which backends we're waiting for,
> so that the admins can go and interrupt them manually.
>
>
>
> Yeah, waiting for all transactions at the beginning is pretty simple.
>
> Making the worker simply ignore temporary tables would also be easy.
>
> One of the bigger issues here is temporary tables are *session* scope
> and not transaction, so we'd actually need the other session to finish,
> not just the transaction.
>
> I guess what we could do is something like this:
>
> 1. Don't process temporary tables in the checksumworker, period.
> Instead, build a list of any temporary tables that existed when the
> worker started in this particular database (basically anything that we
> got in our scan). Once we have processed the complete database, keep
> re-scanning pg_class until those particular tables are gone (search by oid).
>
> That means that any temporary tables that are created *while* we are
> processing a database are ignored, but they should already be receiving
> checksums.
>
> It definitely leads to a potential issue with long running temp tables.
> But as long as we look at the *actual tables* (by oid), we should be
> able to handle long-running sessions once they have dropped their temp
> tables.
>
> Does that sound workable to you?
>
Yes, that's pretty much what I meant by 'wait until all temporary tables
disappear'. Again, we need to make it easy to determine which OIDs we are
waiting for, and which sessions may need the DBA's attention.
I don't think it makes sense to log OIDs of the temporary tables. There
can be many of them, and in most cases the connection/session is managed
by the application, so the only thing you can do is kill the connection.
Yeah, agreed. I think it makes sense to show the *number* of temp tables.
That's also a predictable amount of information -- logging all temp tables
may, as you say, lead to an insane amount of data.
PFA a patch that does this. I've also added some docs for it.
And I also noticed pg_verify_checksums wasn't installed, so fixed that too.
Attachment
Attachment
Hi,
On Tue, Apr 03, 2018 at 02:05:04PM +0200, Magnus Hagander wrote:
> PFA a rebase on top of the just committed verify-checksums patch.
For the record, I am on vacation this week and won't be able to do further
in-depth review or testing of this patch before the end of the commitfest,
sorry.
Michael
--
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax: +49 2166 9901-100
Email: michael.banck@credativ.de
credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
On 04/03/2018 02:05 PM, Magnus Hagander wrote:
> On Sun, Apr 1, 2018 at 2:04 PM, Magnus Hagander <magnus@hagander.net
> <mailto:magnus@hagander.net>> wrote:
>
> On Sat, Mar 31, 2018 at 5:38 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>>
> wrote:
> > And if you try this with a temporary table (not
> hidden in transaction,
> > > so the bgworker can see it), the worker will fail
> with this:
> > >
> > > ERROR: cannot access temporary tables of other
> sessions
> > >
> > > But of course, this is just another way how to crash
> without updating
> > > the result for the launcher, so checksums may end up
> being enabled
> > > anyway.
> > >
> > >
> > > Yeah, there will be plenty of side-effect issues from that
> > > crash-with-wrong-status case. Fixing that will at least
> make things
> > > safer -- in that checksums won't be enabled when not put
> on all pages.
> > >
> >
> > Sure, the outcome with checksums enabled incorrectly is a
> consequence of
> > bogus status, and fixing that will prevent that. But that
> wasn't my main
> > point here - not articulated very clearly, though.
> >
> > The bigger question is how to handle temporary tables
> gracefully, so
> > that it does not terminate the bgworker like this at all.
> This might be
> > even bigger issue than dropped relations, considering that
> temporary
> > tables are pretty common part of applications (and it also
> includes
> > CREATE/DROP).
> >
> > For some clusters it might mean the online checksum
> enabling would
> > crash+restart infinitely (well, until reaching MAX_ATTEMPTS).
> >
> > Unfortunately, try_relation_open() won't fix this, as the
> error comes
> > from ReadBufferExtended. And it's not a matter of simply
> creating a
> > ReadBuffer variant without that error check, because
> temporary tables
> > use local buffers.
> >
> > I wonder if we could just go and set the checksums anyway,
> ignoring the
> > local buffers. If the other session does some changes,
> it'll overwrite
> > our changes, this time with the correct checksums. But it
> seems pretty
> > dangerous (I mean, what if they're writing stuff while
> we're updating
> > the checksums? Considering the various short-cuts for
> temporary tables,
> > I suspect that would be a boon for race conditions.)
> >
> > Another option would be to do something similar to running
> transactions,
> > i.e. wait until all temporary tables (that we've seen at
> the beginning)
> > disappear. But we're starting to wait on more and more stuff.
> >
> > If we do this, we should clearly log which backends we're
> waiting for,
> > so that the admins can go and interrupt them manually.
> >
> >
> >
> > Yeah, waiting for all transactions at the beginning is pretty
> simple.
> >
> > Making the worker simply ignore temporary tables would also be
> easy.
> >
> > One of the bigger issues here is temporary tables are
> *session* scope
> > and not transaction, so we'd actually need the other session
> to finish,
> > not just the transaction.
> >
> > I guess what we could do is something like this:
> >
> > 1. Don't process temporary tables in the checksumworker, period.
> > Instead, build a list of any temporary tables that existed
> when the
> > worker started in this particular database (basically anything
> that we
> > got in our scan). Once we have processed the complete
> database, keep
> > re-scanning pg_class until those particular tables are gone
> (search by oid).
> >
> > That means that any temporary tables that are created *while*
> we are
> > processing a database are ignored, but they should already be
> receiving
> > checksums.
> >
> > It definitely leads to a potential issue with long running
> temp tables.
> > But as long as we look at the *actual tables* (by oid), we
> should be
> > able to handle long-running sessions once they have dropped
> their temp
> > tables.
> >
> > Does that sound workable to you?
> >
>
> Yes, that's pretty much what I meant by 'wait until all
> temporary tables
> disappear'. Again, we need to make it easy to determine which
> OIDs are
> we waiting for, which sessions may need DBA's attention.
>
> I don't think it makes sense to log OIDs of the temporary
> tables. There
> can be many of them, and in most cases the connection/session is
> managed
> by the application, so the only thing you can do is kill the
> connection.
>
>
> Yeah, agreed. I think it makes sense to show the *number* of temp
> tables. That's also a predictable amount of information -- logging
> all temp tables may as you say lead to an insane amount of data.
>
> PFA a patch that does this. I've also added some docs for it.
>
> And I also noticed pg_verify_checksums wasn't installed, so fixed
> that too.
>
>
> PFA a rebase on top of the just committed verify-checksums patch.
>
This seems OK in terms of handling errors in the worker and passing it
to the launcher. I haven't managed to do any crash testing today, but
code-wise it seems sane.
It however still fails to initialize the attempts field after allocating
the db entry in BuildDatabaseList, so if you try running with
-DRANDOMIZE_ALLOCATED_MEMORY it'll get initialized to values like this:
WARNING: attempts = -1684366952
WARNING: attempts = 1010514489
WARNING: attempts = -1145390664
WARNING: attempts = 1162101570
I guess those are not the droids we're looking for?
Likewise, I don't see where ChecksumHelperShmemStruct->abort gets
initialized. I think it only ever gets set in launcher_exit(), but that
does not seem sufficient. I suspect it's the reason for this behavior:
test=# select pg_enable_data_checksums(10, 10);
ERROR: database "template0" does not allow connections
HINT: Allow connections using ALTER DATABASE and try again.
test=# alter database template0 allow_connections = true;
ALTER DATABASE
test=# select pg_enable_data_checksums(10, 10);
ERROR: could not start checksumhelper: already running
test=# select pg_disable_data_checksums();
pg_disable_data_checksums
---------------------------
(1 row)
test=# select pg_enable_data_checksums(10, 10);
ERROR: could not start checksumhelper: has been cancelled
At which point the only thing you can do is restarting the cluster,
which seems somewhat unnecessary. But perhaps it's intentional?
Attached is a diff with a couple of minor comment tweaks, and correct
initialization of the attempts field.
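For reference, the two initializations under discussion would look
roughly like this -- a sketch using the thread's struct names, not the
actual patch hunks:

/* In ChecksumHelperShmemInit(): zero everything, including the abort
 * flag, when the struct is first created in shared memory. */
ChecksumHelperShmem = (ChecksumHelperShmemStruct *)
	ShmemInitStruct("ChecksumHelper Data",
					sizeof(ChecksumHelperShmemStruct),
					&found);
if (!found)
	memset(ChecksumHelperShmem, 0, sizeof(ChecksumHelperShmemStruct));

/* In BuildDatabaseList(): palloc() does not zero memory, which is
 * exactly what -DRANDOMIZE_ALLOCATED_MEMORY exposes, so every field
 * must be set explicitly. */
db = (ChecksumHelperDatabase *) palloc(sizeof(ChecksumHelperDatabase));
db->dboid = HeapTupleGetOid(tup);
db->attempts = 0;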
Attachment
On 4/5/18 11:07 AM, Magnus Hagander wrote:
> On Wed, Apr 4, 2018 at 12:11 AM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com <mailto:tomas.vondra@2ndquadrant.com>> wrote:
>
> ...
>
> It however still fails to initialize the attempts field after allocating
> the db entry in BuildDatabaseList, so if you try running with
> -DRANDOMIZE_ALLOCATED_MEMORY it'll get initialized to values like this:
>
> WARNING: attempts = -1684366952
> WARNING: attempts = 1010514489
> WARNING: attempts = -1145390664
> WARNING: attempts = 1162101570
>
> I guess those are not the droids we're looking for?
>
> When looking at that and after a quick discussion, we just decided it's
> better to completely remove the retry logic. As you mentioned in some
> earlier mail, we had all this logic to retry databases (unlikely) but
> not relations (likely). Attached patch simplifies it to only detect the
> "database was dropped" case (which is fine), and consider every other
> kind of failure a permanent one and just not turn on checksums in those
> cases.
OK, works for me.
> Likewise, I don't see where ChecksumHelperShmemStruct->abort gets
> initialized. I think it only ever gets set in launcher_exit(), but that
> does not seem sufficient. I suspect it's the reason for this behavior:
>
> It's supposed to get initialized in ChecksumHelperShmemInit() -- fixed.
> (The whole memset-to-zero)
OK, seems fine now.
> test=# select pg_enable_data_checksums(10, 10);
> ERROR: database "template0" does not allow connections
> HINT: Allow connections using ALTER DATABASE and try again.
> test=# alter database template0 allow_connections = true;
> ALTER DATABASE
> test=# select pg_enable_data_checksums(10, 10);
> ERROR: could not start checksumhelper: already running
> test=# select pg_disable_data_checksums();
> pg_disable_data_checksums
> ---------------------------
>
> (1 row)
>
> test=# select pg_enable_data_checksums(10, 10);
> ERROR: could not start checksumhelper: has been cancelled
>
> Turns out that wasn't the problem. The problem was that we *set* it
> before erroring out with the "does not allow connections", but never
> cleared it. In that case, it would be listed as launcher-is-running even
> though the launcher was never started.
>
> Basically the check for datallowconn was put in the wrong place. That
> check should go away completely once we merge (because we should also
> merge the part that allows us to bypass it), but for now I have moved
> the check to the correct place.
Ah, OK. I was just guessing.
> Attached is a diff with a couple of minor comment tweaks, and correct
> initialization of the attempts field.
>
> Thanks. This is included in the attached update, along with the above
> fixes and some other small touches from Daniel.
This patch version seems fine to me. I'm inclined to mark it RFC.
regards
--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
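The simplified launcher logic described above could be sketched like this
(all names are illustrative; ProcessDatabase() and DatabaseStillExists()
are hypothetical helpers, not the patch's API):

foreach(lc, DatabaseList)
{
	ChecksumHelperDatabase *db = (ChecksumHelperDatabase *) lfirst(lc);

	if (ProcessDatabase(db))
		continue;				/* database fully checksummed */

	if (DatabaseStillExists(db->dboid))
	{
		/* permanent failure: stay in "inprogress", never flip to on */
		ereport(WARNING,
				(errmsg("failed to enable checksums in database %u, aborting",
						db->dboid)));
		return;
	}

	/* database was dropped concurrently -- nothing left to checksum */
}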
> 5 апр. 2018 г., в 14:33, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а):
>
> This patch version seems fine to me. I'm inclined to mark it RFC.
+1
The patch works fine for me. I've tried different combinations of backend cancellation, and the only suspicious thing I found is that you can start multiple workers by cancelling the launcher and not cancelling the worker. Is that problematic behavior? If we run pg_enable_data_checksums() it checks for an existing launcher for a reason; maybe it should check for workers too?
I think it is worth capitalizing WAL in "re-write the page to wal".
> 5 апр. 2018 г., в 19:58, Magnus Hagander <magnus@hagander.net> написал(а):
>
>
>
> On Thu, Apr 5, 2018 at 4:55 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
>
> > 5 апр. 2018 г., в 14:33, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а):
> >
> > This patch version seems fine to me. I'm inclined to mark it RFC.
> +1
> The patch works fine for me. I've tried different combinations of backend cancellation, and the only suspicious thing I found is that you can start multiple workers by cancelling the launcher and not cancelling the worker. Is that problematic behavior? If we run pg_enable_data_checksums() it checks for an existing launcher for a reason; maybe it should check for workers too?
>
> I don't think it's a problem in itself -- it will cause pointless work, but not actually cause any problems I think (whereas duplicate launchers could cause interesting things to happen).
>
> How did you actually cancel the launcher to end up in this situation?
select pg_enable_data_checksums(10000,1);
select pg_sleep(0.1);
select pg_cancel_backend(pid),backend_type from pg_stat_activity where backend_type ~ 'checksumhelper launcher' ;
select pg_enable_data_checksums(10000,1);
select pg_sleep(0.1);
select pg_cancel_backend(pid),backend_type from pg_stat_activity where backend_type ~ 'checksumhelper launcher' ;
select pg_enable_data_checksums(10000,1);
select pg_sleep(0.1);
select pg_cancel_backend(pid),backend_type from pg_stat_activity where backend_type ~ 'checksumhelper launcher' ;
select pid,backend_type from pg_stat_activity where backend_type ~'checks';
pid | backend_type
-------+-----------------------
98587 | checksumhelper worker
98589 | checksumhelper worker
98591 | checksumhelper worker
(3 rows)
There is a way to shoot yourself in the foot then by calling pg_disable_data_checksums(), but this is extremely stupid for a user.
Attachment
On Thu, Apr 5, 2018 at 5:08 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
> 5 апр. 2018 г., в 19:58, Magnus Hagander <magnus@hagander.net> написал(а):
>
>
>
> On Thu, Apr 5, 2018 at 4:55 PM, Andrey Borodin <x4mmm@yandex-team.ru> wrote:
>
>
> > 5 апр. 2018 г., в 14:33, Tomas Vondra <tomas.vondra@2ndquadrant.com> написал(а):
> >
> > This patch version seems fine to me. I'm inclined to mark it RFC.
> +1
> The patch works fine for me. I've tried different combinations of backend cancellation, and the only suspicious thing I found is that you can start multiple workers by cancelling the launcher and not cancelling the worker. Is that problematic behavior? If we run pg_enable_data_checksums() it checks for an existing launcher for a reason; maybe it should check for workers too?
>
> I don't think it's a problem in itself -- it will cause pointless work, but not actually cause any problems I think (whereas duplicate launchers could cause interesting things to happen).
>
> How did you actually cancel the launcher to end up in this situation?
select pg_enable_data_checksums(10000,1);
select pg_sleep(0.1);
select pg_cancel_backend(pid),backend_type from pg_stat_activity where backend_type ~ 'checksumhelper launcher' ;
select pg_enable_data_checksums(10000,1);
select pg_sleep(0.1);
select pg_cancel_backend(pid),backend_type from pg_stat_activity where backend_type ~ 'checksumhelper launcher' ;
select pg_enable_data_checksums(10000,1);
select pg_sleep(0.1);
select pg_cancel_backend(pid),backend_type from pg_stat_activity where backend_type ~ 'checksumhelper launcher' ;
select pid,backend_type from pg_stat_activity where backend_type ~'checks';
pid | backend_type
-------+-----------------------
98587 | checksumhelper worker
98589 | checksumhelper worker
98591 | checksumhelper worker
(3 rows)
There is a way to shoot yourself in the foot then by calling pg_disable_data_checksums(), but this is extremely stupid for a user.
Ah, didn't consider query cancel. I'm not sure how much we should actually
care about it, but it's easy enough to trap that signal and just do a
clean shutdown on it, so I've done that.
PFA a patch that does that, and also rebased over the datallowconn patch
just landed (which also removes some docs).
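Trapping the cancel could be as simple as the following sketch (the
handler and flag names are invented; the pqsignal/SetLatch pattern is the
usual one for background workers):

static volatile sig_atomic_t launcher_got_cancel = false;

static void
launcher_sigint_handler(SIGNAL_ARGS)
{
	launcher_got_cancel = true;
	SetLatch(MyLatch);
}

/* in the launcher's main function, before unblocking signals: */
pqsignal(SIGINT, launcher_sigint_handler);
BackgroundWorkerUnblockSignals();

/* the main loop then checks launcher_got_cancel and shuts down cleanly,
 * terminating any worker it has started */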
On 2018-04-05 22:06:36 +0200, Magnus Hagander wrote:
> I have now pushed this latest version with some minor text adjustments
> and a catversion bump.
>
> Thanks for all the reviews!
I want to be on the record that I think merging a nontrivial feature that
got submitted 2018-02-21, just before the start of the last CF, is an
abuse of process, and not cool. We've other people working hard to follow
the process, and circumventing it like this just signals to people trying
to follow the rules that they're fools.
Merging ~2kloc patches like that is going to cause pain. And even if not,
it causes procedural damage.
- Andres
On 2018-04-05 13:12:08 -0700, Andres Freund wrote:
> On 2018-04-05 22:06:36 +0200, Magnus Hagander wrote:
> > I have now pushed this latest version with some minor text adjustments and
> > a catversion bump.
> >
> > Thanks for all the reviews!
>
> I want to be on the record that I think merging a nontrivial feature that
> got submitted 2018-02-21, just before the start of the last CF, is
> an abuse of process, and not cool. We've other people working hard to
> follow the process, and circumventing it like this just signals to
> people trying to follow the rules that they're fools.
>
> Merging ~2kloc patches like that is going to cause pain. And even if
> not, it causes procedural damage.
And even worse, without even announcing an intent to commit and giving
people a chance to object.
At least this patch was posted on the lists before commit, unlike many others from many different people. And AFAIK there has never been such a rule.
On April 5, 2018 1:20:52 PM PDT, Magnus Hagander <magnus@hagander.net> wrote:
> On Thu, Apr 5, 2018 at 10:14 PM, Andres Freund <andres@anarazel.de> wrote:
>
>> And even worse, without even announcing an intent to commit and giving
>> people a chance to object.
>
> At least this patch was posted on the lists before commit, unlike many
> others from many different people. And AFAIK there has never been such a
> rule.
The more debatable a decision is, the more important it IMO becomes to
give people a chance to object. I don't think there needs to be a hard
rule to always announce an intent to commit.
Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.
On 04/05/2018 01:12 PM, Andres Freund wrote:
> On 2018-04-05 22:06:36 +0200, Magnus Hagander wrote:
>> I have now pushed this latest version with some minor text adjustments
>> and a catversion bump.
>>
>> Thanks for all the reviews!
> I want to be on the record that I think merging a nontrivial feature that
> got submitted 2018-02-21, just before the start of the last CF, is an
> abuse of process, and not cool. We've other people working hard to follow
> the process, and circumventing it like this just signals to people trying
> to follow the rules that they're fools.
>
> Merging ~2kloc patches like that is going to cause pain. And even if not,
> it causes procedural damage.
Perhaps I am missing something, but there has been a lot of public
discussion on this feature for the last 7 weeks, in which you barely
participated. I certainly understand wanting some notice before commit,
but there has been lots of discussion, multiple people publicly commenting
on the patch, and Magnus has been very receptive to all feedback (that I
have seen). Perhaps we are being a bit sensitive because of another patch
that is actually ramrodding the process, and we need to take a step back?
Thanks,
JD
--
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
***  A fault and talent of mine is to tell it exactly how it is.  ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
*****     Unless otherwise stated, opinions are my own.   *****
On 2018-04-05 13:51:41 -0700, Joshua D. Drake wrote:
> Perhaps I am missing something, but there has been a lot of public
> discussion on this feature for the last 7 weeks, in which you barely
> participated.
I commented weeks ago about my doubts, and Robert concurred:
http://archives.postgresql.org/message-id/CA%2BTgmoZPRfMqZoK_Fbo_tD9OH9PdPFcPBsi-sdGZ6Jg8OMM2PA%40mail.gmail.com
> I certainly understand wanting some notice before commit, but there has
> been lots of discussion, multiple people publicly commenting on the
> patch, and Magnus has been very receptive to all feedback (that I have
> seen).
It's perfectly reasonable to continue review / improvement cycles of a
patch, even if it's not going to get in the current release. What does
that have to do with what I am concerned about?
> Perhaps we are being a bit sensitive because of another patch that is
> actually ramrodding the process, and we need to take a step back?
No. See link above.
Please don't use "we" in this childishness-implying fashion.
- Andres
On 04/05/2018 02:01 PM, Andres Freund wrote:
> No. See link above.
>
> Please don't use "we" in this childishness-implying fashion.
The term "we" was used on purpose because I too was annoyed and I was
trying to be objective, non-combative and productive.
JD
On Thu, Apr 5, 2018 at 1:27 PM, Andres Freund <andres@anarazel.de> wrote:
>>At least this patch was posted on the lists before commit, unlike many
>>others from many different people. And AFAIK there has never been such
>>a
>>rule.
The rules cannot possibly anticipate every situation or subtlety. The
letter of the law is a slightly distinct thing to its spirit.
> The more debatable a decision is, the more important it IMO becomes to give people a chance to object. Don't think there needs to be a hard rule to always announce an intent to commit.
+1
Andres' remarks need to be seen in the context of the past couple of
weeks, and in the context of this being a relatively high risk patch
that was submitted quite late in the cycle.
On Thu, Apr 5, 2018 at 1:51 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> Perhaps I am missing something, but there has been a lot of public
> discussion on this feature for the last 7 weeks, in which you barely
> participated. I certainly understand wanting some notice before commit,
> but there has been lots of discussion, multiple people publicly
> commenting on the patch, and Magnus has been very receptive to all
> feedback (that I have seen). Perhaps we are being a bit sensitive
> because of another patch that is actually ramrodding the process, and
> we need to take a step back?
I wish it was just one patch. I can think of several.
--
Peter Geoghegan
Hi,
On 2018-04-05 22:06:36 +0200, Magnus Hagander wrote:
> I have now pushed this latest version with some minor text adjustments and
> a catversion bump.
Is there any sort of locking that guarantees that worker processes see
an up2date value of
DataChecksumsNeedWrite()/ControlFile->data_checksum_version? Afaict
there's not. So you can afaict end up with checksums being computed by
the worker, but concurrent writes missing them. The window is going to
be at most one missed checksum per process (as the unlocking of the page
is a barrier) and is probably not easy to hit, but that's dangerous
enough.
Just plonking a barrier into DataChecksumsNeedWrite() etc is a
possibility, but it's also not free...
Greetings,
Andres Freund
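"Plonking a barrier in there" would amount to something like this sketch,
assuming DataChecksumsNeedWrite() reads the shared ControlFile copy as the
thread suggests -- illustrative only, not the committed code:

bool
DataChecksumsNeedWrite(void)
{
	/*
	 * Force this backend to see a current value instead of one cached
	 * from before the state changed to "inprogress".  This is the
	 * not-free cost: a full memory barrier on a hot path.
	 */
	pg_memory_barrier();
	return (ControlFile->data_checksum_version > 0);
}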
Magnus Hagander <magnus@hagander.net> writes:
> I have now pushed this latest version with some minor text adjustments and
> a catversion bump.
crake is not happy --- it's failing cross-version upgrade tests because:
Performing Consistency Checks
-----------------------------
Checking cluster versions ok
old cluster uses data checksums but the new one does not
Failure, exiting
This seems to indicate that you broke pg_upgrade's detection of
checksumming status, or that this patch changed the default
checksum state (which it surely isn't described as doing).
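For context, pg_upgrade's consistency check compares the checksum state of
both clusters' control data, roughly like this (paraphrased from
src/bin/pg_upgrade/controldata.c, not quoted verbatim):

/* in check_control_data(), comparing old and new pg_controldata output */
if (oldctrl->data_checksum_version == 0 &&
	newctrl->data_checksum_version != 0)
	pg_fatal("old cluster does not use data checksums but the new one does\n");
else if (oldctrl->data_checksum_version != 0 &&
		 newctrl->data_checksum_version == 0)
	pg_fatal("old cluster uses data checksums but the new one does not\n");
else if (oldctrl->data_checksum_version != newctrl->data_checksum_version)
	pg_fatal("old and new cluster pg_controldata checksum versions do not match\n");

So either the value parsed out of pg_controldata changed with the new
enum, or the new cluster really was initdb'd without checksums where the
old one had them.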
> On 05 Apr 2018, at 23:13, Magnus Hagander <magnus@hagander.net> wrote:
>
> (And yes, we've noticed it's failing in isolationtester on a number of
> boxes -- Daniel is currently investigating)
Looking into the isolationtester failure on piculet, which builds using
--disable-atomics, and locust which doesn't have atomics, the code for
pg_atomic_test_set_flag seems a bit odd.
TAS() is defined to return zero if successful, and
pg_atomic_test_set_flag() is defined to return true if it could set the
flag. When running without atomics, don't we need to do something like the
below diff to make these APIs match?
--- a/src/backend/port/atomics.c
+++ b/src/backend/port/atomics.c
@@ -73,7 +73,7 @@ pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
 bool
 pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
 {
-	return TAS((slock_t *) &ptr->sema);
+	return TAS((slock_t *) &ptr->sema) == 0;
 }
Applying this makes the _cancel test pass, moving the failure instead to
the following _enable test (which matches what coypu and mylodon are
seeing). AFAICT there are no other callers of this than the online
checksum code, and this is not executed by the tests when running without
atomics, which could explain why nothing else is broken.
Before continuing the debugging, does this theory hold any water? This
isn't code I'm deeply familiar with so would appreciate any pointers.
cheers ./daniel
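The contract callers rely on, per src/include/port/atomics.h, is that
pg_atomic_test_set_flag() returns true only if the flag was not already
set and we set it. A hypothetical launcher-startup check like the one
below (launcher_running is an invented field name) silently inverts if
the fallback returns the raw TAS() value:

/* hypothetical caller illustrating why the polarity matters */
if (!pg_atomic_test_set_flag(&ChecksumHelperShmem->launcher_running))
	ereport(ERROR,
			(errmsg("could not start checksumhelper: already running")));

/* we set the flag, so we are the only launcher */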
Hi,
On 2018-04-05 23:32:19 +0200, Magnus Hagander wrote:
> On Thu, Apr 5, 2018 at 11:23 PM, Andres Freund <andres@anarazel.de> wrote:
> > Is there any sort of locking that guarantees that worker processes see
> > an up2date value of
> > DataChecksumsNeedWrite()/ControlFile->data_checksum_version? Afaict
> > there's not. So you can afaict end up with checksums being computed by
> > the worker, but concurrent writes missing them. The window is going to
> > be at most one missed checksum per process (as the unlocking of the page
> > is a barrier) and is probably not easy to hit, but that's dangerous
> > enough.
> >
>
> So just to be clear of the case you're worried about. It's basically:
> Session #1 - sets checksums to inprogress
> Session #1 - starts dynamic background worker ("launcher")
> Launcher reads and enumerates pg_database
> Launcher starts worker in first database
> Worker processes first block of data in database
> And at this point, Session #2 has still not seen the "checksums inprogress"
> flag and continues to write without checksums?
Yes. I think there are some variations of that, but yes, that's pretty
much it.
> That seems like quite a long time to me -- is that really a problem?
We don't generally build locking models that are only correct based on
likelihood. Especially not without a lengthy comment explaining that
analysis.
> I'm guessing you're seeing a shorter path between the two that I can't
> see right now (I'll blame the late evening...)?
I don't think it matters terribly much how long that path is.
Greetings,
Andres Freund
On Thu, Apr 5, 2018 at 11:41 PM, Andres Freund <andres@anarazel.de> wrote:
> We don't generally build locking models that are only correct based on
> likelihood. Especially not without a lengthy comment explaining that
> analysis.
Oh, that's not my intention either -- I just wanted to make sure I was
thinking about the same issue you were.
Since you know a lot more about that type of interlocks than I do :) We
already wait for all running transactions to finish before we start doing
anything. Obviously transactions != buffer writes (and we have things like
the checkpointer/bgwriter to consider). Is there something else that we
could safely just *wait* for? I have no problem whatsoever if this is a
long wait (given the total time). I mean to the point of "what if we just
stick a sleep(10) in there" level waiting.
Or can that somehow be cleanly solved using some of the new atomic
operators? Or is that likely to cause the same kind of overhead as
throwing a barrier in there?
On 04/06/2018 11:25 AM, Magnus Hagander wrote:
> On Thu, Apr 5, 2018 at 11:41 PM, Andres Freund <andres@anarazel.de
> <mailto:andres@anarazel.de>> wrote:
>
>     ...
>
>     We don't generally build locking models that are only correct based on
>     likelihood. Especially not without a lengthy comment explaining that
>     analysis.
>
> Oh, that's not my intention either -- I just wanted to make sure I
> was thinking about the same issue you were.
I agree we shouldn't rely on chance here - if we might read a stale
value, we need to fix that of course.
I'm not quite sure I fully understand the issue, though. I assume both
LockBufHdr and UnlockBufHdr are memory barriers, so for bad things to
happen the process would need to be already past LockBufHdr when the
checksum version is updated. In which case it can use a stale version
when writing the buffer out. Correct?
I wonder if that's actually a problem, considering the checksum worker
will then overwrite all data with correct checksums anyway. So the other
process would have to overwrite the buffer after the checksum worker, at
which point it'll have to go through LockBufHdr. So I'm not sure I see
the problem here. But perhaps LockBufHdr is not a memory barrier?
> Since you know a lot more about that type of interlocks than I do :) We
> already wait for all running transactions to finish before we start
> doing anything. Obviously transactions != buffer writes (and we have
> things like the checkpointer/bgwriter to consider). Is there something
> else that we could safely just *wait* for? I have no problem whatsoever
> if this is a long wait (given the total time). I mean to the point of
> "what if we just stick a sleep(10) in there" level waiting.
>
> Or can that somehow be cleanly solved using some of the new atomic
> operators? Or is that likely to cause the same kind of overhead as
> throwing a barrier in there?
Perhaps the easiest thing we could do is walk shared buffers and do
LockBufHdr/UnlockBufHdr, which would guarantee no session is in the
process of writing out a buffer with a possibly stale checksum flag. Of
course, it's a bit brute-force-ish, but it's not that different from the
waits for running transactions and temporary tables.
regards
--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-04-06 11:25:59 +0200, Magnus Hagander wrote:
> Since you know a lot more about that type of interlocks than I do :) We
> already wait for all running transactions to finish before we start
> doing anything. Obviously transactions != buffer writes (and we have
> things like the checkpointer/bgwriter to consider). Is there something
> else that we could safely just *wait* for? I have no problem whatsoever
> if this is a long wait (given the total time). I mean to the point of
> "what if we just stick a sleep(10) in there" level waiting.
I don't think anything just related to "time" is reasonable in any sort
of way. On an overloaded system you can see long, long stalls of
processes that have done a lot of work. Locking protocols should be
correct, and that's that.
> Or can that somehow be cleanly solved using some of the new atomic
> operators? Or is that likely to cause the same kind of overhead as
> throwing a barrier in there?
Worse.
I wonder if we could introduce "MegaExpensiveRareMemoryBarrier()" that
goes through pgproc and signals every process with a signal that requires
the other side to do an operation implying a memory barrier. That's
actually not hard to do (e.g. every latch operation qualifies), the
problem is that signal delivery isn't synchronous. So you need some
acknowledgement protocol.
I think you could introduce a procsignal message that does a memory
barrier and then sets PGPROC->barrierGeneration to
ProcArrayStruct->barrierGeneration. MegaExpensiveRareMemoryBarrier()
increments ProcArrayStruct->barrierGeneration, signals everyone, and then
waits till every PGPROC->barrierGeneration has surpassed
ProcArrayStruct->barrierGeneration.
Greetings,
Andres Freund
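As a sketch, that acknowledgement protocol could look like the following
-- none of it exists in PostgreSQL today: PROCSIG_GLOBAL_BARRIER and the
two barrierGeneration fields are invented per the description above, and
real code would also have to handle auxiliary processes and exiting
backends:

void
MegaExpensiveRareMemoryBarrier(void)
{
	uint64		gen;
	int			i;

	/* bump the global generation; the atomic add implies a full barrier */
	gen = pg_atomic_add_fetch_u64(&ProcGlobal->barrierGeneration, 1);

	/* ask every live backend to run a barrier and acknowledge it */
	for (i = 0; i < ProcGlobal->allProcCount; i++)
	{
		PGPROC	   *proc = &ProcGlobal->allProcs[i];

		if (proc->pid != 0)
			SendProcSignal(proc->pid, PROCSIG_GLOBAL_BARRIER,
						   InvalidBackendId);
	}

	/* wait until everyone has caught up to our generation */
	for (i = 0; i < ProcGlobal->allProcCount; i++)
	{
		PGPROC	   *proc = &ProcGlobal->allProcs[i];

		while (proc->pid != 0 &&
			   pg_atomic_read_u64(&proc->barrierGeneration) < gen)
			pg_usleep(1000L);
	}
}

The handler side would execute pg_memory_barrier() and then advertise the
new generation in its own PGPROC entry.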
Hi,

On 2018-04-06 14:34:43 +0200, Tomas Vondra wrote:
> > Oh, that's not my intention either -- I just wanted to make sure I
> > was thinking about the same issue you were.
>
> I agree we shouldn't rely on chance here - if we might read a stale
> value, we need to fix that of course.

It's perfectly possible that some side-conditions mitigate this. What concerns me is that

a) Nobody appears to have raised this issue beforehand, besides an unlocked read of a critical variable being a fairly obvious issue. This kind of thing needs to be carefully thought about.

b) If there's some "side channel" interlock, it's not documented.

I noticed the issue because of an IM question about the general feature, and I did a three minute skim and saw the read without a comment.

> I'm not quite sure I fully understand the issue, though. I assume both
> LockBufHdr and UnlockBufHdr are memory barriers, so for bad things to
> happen the process would need to be already past LockBufHdr when the
> checksum version is updated. In which case it can use a stale version
> when writing the buffer out. Correct?

Yes, they are memory barriers.

> I wonder if that's actually a problem, considering the checksum worker
> will then overwrite all data with correct checksums anyway. So the other
> process would have to overwrite the buffer after the checksum worker, at
> which point it'll have to go through LockBufHdr.

Again, I'm not sure if there's some combination of issues that make this not a problem in practice. I just *asked* if there's something preventing this from being a problem.

The really problematic case would be if it is possible for some process to wait long enough, without executing a barrier-implying operation, that it'd try to write out a page that the checksum worker has already passed over.

Greetings, Andres Freund
On 04/06/2018 07:22 PM, Andres Freund wrote:
> Hi,
>
> On 2018-04-06 14:34:43 +0200, Tomas Vondra wrote:
>>> Oh, that's not my intention either -- I just wanted to make sure I
>>> was thinking about the same issue you were.
>
>> I agree we shouldn't rely on chance here - if we might read a stale
>> value, we need to fix that of course.
>
> It's perfectly possible that some side-conditions mitigate this. What
> concerns me is that
> a) Nobody appears to have raised this issue beforehand, besides an
>    unlocked read of a critical variable being a fairly obvious
>    issue. This kind of thing needs to be carefully thought about.
> b) If there's some "side channel" interlock, it's not documented.
>
> I noticed the issue because of an IM question about the general feature,
> and I did a three minute skim and saw the read without a comment.

All I can say is that I did consider this issue while reviewing the patch, and I've managed to convince myself it's not an issue (using the logic that I've just outlined here). Which is why I haven't raised it as an issue, because I don't think it is. You're right that it might have been mentioned explicitly, of course.

In any case, I wouldn't call LockBufHdr/UnlockBufHdr a "side channel" interlock. It's a pretty direct and intentional interlock, I think.

>> I'm not quite sure I fully understand the issue, though. I assume both
>> LockBufHdr and UnlockBufHdr are memory barriers, so for bad things to
>> happen the process would need to be already past LockBufHdr when the
>> checksum version is updated. In which case it can use a stale version
>> when writing the buffer out. Correct?
>
> Yes, they are memory barriers.

Phew! ;-)

>> I wonder if that's actually a problem, considering the checksum worker
>> will then overwrite all data with correct checksums anyway. So the other
>> process would have to overwrite the buffer after the checksum worker, at
>> which point it'll have to go through LockBufHdr.
>
> Again, I'm not sure if there's some combination of issues that make this
> not a problem in practice. I just *asked* if there's something
> preventing this from being a problem.
>
> The really problematic case would be if it is possible for some process
> to wait long enough, without executing a barrier-implying operation,
> that it'd try to write out a page that the checksum worker has already
> passed over.

Sure. But what would that be? I can't think of anything. A process that modifies a buffer (or any other piece of shared state) without holding some sort of lock seems broken by default.

regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-04-06 19:40:59 +0200, Tomas Vondra wrote: > In any case, I wouldn't call LockBufHdr/UnlockBufHdr a "side channel" > interlock. It's a pretty direct and intentional interlock, I think. I mean it's a side-channel as far as DataChecksumsNeedWrite() is concerned. You're banking on all callers using a barrier implying operation around it. > Sure. But what would that be? I can't think of anything. A process that > modifies a buffer (or any other piece of shared state) without holding > some sort of lock seems broken by default. You can quite possibly already *hold* a lock if it's not an exclusive one. Greetings, Andres Freund
On 04/06/2018 07:46 PM, Andres Freund wrote: > On 2018-04-06 19:40:59 +0200, Tomas Vondra wrote: >> In any case, I wouldn't call LockBufHdr/UnlockBufHdr a "side channel" >> interlock. It's a pretty direct and intentional interlock, I think. > > I mean it's a side-channel as far as DataChecksumsNeedWrite() is > concerned. You're banking on all callers using a barrier implying > operation around it. Ah, OK. > >> Sure. But what would that be? I can't think of anything. A process that >> modifies a buffer (or any other piece of shared state) without holding >> some sort of lock seems broken by default. > > You can quite possibly already *hold* a lock if it's not an exclusive > one. > Sure, but if you're holding the buffer lock when the checksum version is changed, then the checksumhelper is obviously not running yet. In which case it will update the checksum on the buffer later. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-04-06 19:59:17 +0200, Tomas Vondra wrote: > On 04/06/2018 07:46 PM, Andres Freund wrote: > >> Sure. But what would that be? I can't think of anything. A process that > >> modifies a buffer (or any other piece of shared state) without holding > >> some sort of lock seems broken by default. > > > > You can quite possibly already *hold* a lock if it's not an exclusive > > one. > > > > Sure, but if you're holding the buffer lock when the checksum version is > changed, then the checksumhelper is obviously not running yet. In which > case it will update the checksum on the buffer later. The buffer content lock itself doesn't generally give any such guarantee afaict, as it's required that the content lock is held in shared mode during IO. ProcessSingleRelationFork() happens to use exclusive mode (which could and possibly should be optimized), so that's probably sufficient from that end though. I'm mainly disconcerted this isn't well discussed & documented. Greetings, Andres Freund
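To illustrate the kind of explicit documentation being asked for here, the unlocked read could carry a comment along these lines (the wording is invented; the function body is the patch's):

/*
 * DataChecksumsNeedWrite
 *		Returns whether data checksums must be computed when writing a page.
 *
 * NB: ControlFile->data_checksum_version is read here without holding
 * ControlFileLock.  Callers must execute a barrier-implying operation
 * (e.g. taking the buffer header spinlock via LockBufHdr()) between
 * reading this value and writing out a page; otherwise the write may be
 * based on a stale value and miss a concurrent switch to "inprogress".
 */
bool
DataChecksumsNeedWrite(void)
{
	Assert(ControlFile != NULL);
	return (ControlFile->data_checksum_version > 0);
}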
On Thu, Apr 5, 2018 at 5:01 PM, Andres Freund <andres@anarazel.de> wrote: > I've commented weeks ago about my doubts, and Robert concurred: > http://archives.postgresql.org/message-id/CA%2BTgmoZPRfMqZoK_Fbo_tD9OH9PdPFcPBsi-sdGZ6Jg8OMM2PA%40mail.gmail.com Yes, and I expressed some previous reservations as well. Granted, my reservations about this patch are less than for MERGE, and granted, too, what is Magnus supposed to do about a couple of committers expressing doubts about whether something really ought to be committed? Is that an absolute bar? It wasn't phrased as such, nor do we really have the authority. At the same time, those concerns didn't generate much discussion and, at least in my case, are not withdrawn merely because time has passed. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On 2018-04-06 14:14:40 -0400, Robert Haas wrote:
> and granted, too, what is Magnus supposed to do about a couple of
> committers expressing doubts about whether something really ought to
> be committed? Is that an absolute bar? It wasn't phrased as such,
> nor do we really have the authority. At the same time, those concerns
> didn't generate much discussion and, at least in my case, are not
> withdrawn merely because time has passed.

Yea, I don't think they're an absolute blocker. But imo more than sufficient reason to give the list a heads up of a day or two that the patch is intended to be committed. I'd only pointed the message out because JD said something about me not having participated in the earlier discussion.

Greetings, Andres Freund
On 04/06/2018 08:13 PM, Andres Freund wrote: > On 2018-04-06 19:59:17 +0200, Tomas Vondra wrote: >> On 04/06/2018 07:46 PM, Andres Freund wrote: >>>> Sure. But what would that be? I can't think of anything. A process that >>>> modifies a buffer (or any other piece of shared state) without holding >>>> some sort of lock seems broken by default. >>> >>> You can quite possibly already *hold* a lock if it's not an exclusive >>> one. >>> >> >> Sure, but if you're holding the buffer lock when the checksum version is >> changed, then the checksumhelper is obviously not running yet. In which >> case it will update the checksum on the buffer later. > > The buffer content lock itself doesn't generally give any such > guarantee afaict, as it's required that the content lock is held in > shared mode during IO. ProcessSingleRelationFork() happens to use > exclusive mode (which could and possibly should be optimized), so > that's probably sufficient from that end though. > Yes. > I'm mainly disconcerted this isn't well discussed & documented. > Agreed, no argument here. -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/06/2018 08:13 PM, Andres Freund wrote: > On 2018-04-06 19:59:17 +0200, Tomas Vondra wrote: >> On 04/06/2018 07:46 PM, Andres Freund wrote: >>>> Sure. But what would that be? I can't think of anything. A process that >>>> modifies a buffer (or any other piece of shared state) without holding >>>> some sort of lock seems broken by default. >>> >>> You can quite possibly already *hold* a lock if it's not an exclusive >>> one. >>> >> >> Sure, but if you're holding the buffer lock when the checksum version is >> changed, then the checksumhelper is obviously not running yet. In which >> case it will update the checksum on the buffer later. > > The buffer content lock itself doesn't generally give any such guarantee > afaict, as it's required that the content lock is held in shared mode > during IO. ProcessSingleRelationFork() happens to use exclusive mode > (which could and possibly should be optimized), so that's probably > sufficient from that end though. > Oh, I've just realized the phrasing of my previous message was rather confusing. What I meant to say is this: Sure, but the checksum version is changed before the checksumhelper launcher/worker is even started. So if you're holding the buffer lock at that time, then the buffer is essentially guaranteed to be updated by the worker later. Sorry if it seemed I'm suggesting the buffer lock itself guarantees something about the worker startup. regards -- Tomas Vondra http://www.2ndQuadrant.com PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-04-06 02:28:17 +0200, Daniel Gustafsson wrote: > Looking into the isolationtester failure on piculet, which builds using > --disable-atomics, and locust which doesn’t have atomics, the code for > pg_atomic_test_set_flag seems a bit odd. > > TAS() is defined to return zero if successful, and pg_atomic_test_set_flag() > defined to return True if it could set. When running without atomics, don’t we > need to do something like the below diff to make these APIs match? : > > --- a/src/backend/port/atomics.c > +++ b/src/backend/port/atomics.c > @@ -73,7 +73,7 @@ pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr) > bool > pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr) > { > - return TAS((slock_t *) &ptr->sema); > + return TAS((slock_t *) &ptr->sema) == 0; > } Yes, this looks wrong. Greetings, Andres Freund
Hi,

On 2018-04-06 02:28:17 +0200, Daniel Gustafsson wrote:
> Applying this makes the _cancel test pass, moving the failure instead to the
> following _enable test (which matches what coypu and mylodon are seeing).

FWIW, I'm somewhat annoyed that I'm now spending time debugging this to get the buildfarm green again.

I'm fairly certain that the bug here is a simple race condition in the test (not the main code!): The flag informing whether the worker has started is cleared via an on_shmem_exit() hook:

static void
launcher_exit(int code, Datum arg)
{
	ChecksumHelperShmem->abort = false;
	pg_atomic_clear_flag(&ChecksumHelperShmem->launcher_started);
}

but the wait in the test is done via functions like:

CREATE OR REPLACE FUNCTION test_checksums_on() RETURNS boolean AS $$
DECLARE
	enabled boolean;
BEGIN
	LOOP
		SELECT setting = 'on' INTO enabled
		  FROM pg_catalog.pg_settings WHERE name = 'data_checksums';
		IF enabled THEN
			EXIT;
		END IF;
		PERFORM pg_sleep(1);
	END LOOP;
	RETURN enabled;
END;
$$ LANGUAGE plpgsql;

INSERT INTO t1 (b, c) VALUES (generate_series(1,10000), 'starting values');

CREATE OR REPLACE FUNCTION test_checksums_off() RETURNS boolean AS $$
DECLARE
	enabled boolean;
BEGIN
	PERFORM pg_sleep(1);
	SELECT setting = 'off' INTO enabled
	  FROM pg_catalog.pg_settings WHERE name = 'data_checksums';
	RETURN enabled;
END;
$$ LANGUAGE plpgsql;

which just waits for setting checksums to have finished. It's exceedingly unsurprising that a 'pg_sleep(1)' is not a reliable way to make sure that a process has finished exiting. Then followup tests fail because the process is still running.

Also:

CREATE OR REPLACE FUNCTION reader_loop() RETURNS boolean AS $$
DECLARE
	counter integer;
BEGIN
	FOR counter IN 1..30 LOOP
		PERFORM count(a) FROM t1;
		PERFORM pg_sleep(0.2);
	END LOOP;
	RETURN True;
END;
$$ LANGUAGE plpgsql;

really? Let's just force the test to take at least 6s purely from sleeping?

Greetings, Andres Freund
On 2018-04-06 14:33:48 -0700, Andres Freund wrote:
> On 2018-04-06 02:28:17 +0200, Daniel Gustafsson wrote:
> > Looking into the isolationtester failure on piculet, which builds using
> > --disable-atomics, and locust which doesn’t have atomics, the code for
> > pg_atomic_test_set_flag seems a bit odd.
> >
> > TAS() is defined to return zero if successful, and pg_atomic_test_set_flag()
> > defined to return True if it could set. When running without atomics, don’t we
> > need to do something like the below diff to make these APIs match?:
> >
> > --- a/src/backend/port/atomics.c
> > +++ b/src/backend/port/atomics.c
> > @@ -73,7 +73,7 @@ pg_atomic_init_flag_impl(volatile pg_atomic_flag *ptr)
> >  bool
> >  pg_atomic_test_set_flag_impl(volatile pg_atomic_flag *ptr)
> >  {
> > -	return TAS((slock_t *) &ptr->sema);
> > +	return TAS((slock_t *) &ptr->sema) == 0;
> >  }
>
> Yes, this looks wrong.

And the reason the tests fail reliably afterwards is because the locking model around ChecksumHelperShmem->launcher_started arguably is broken:

	/* If the launcher isn't started, there is nothing to shut down */
	if (pg_atomic_unlocked_test_flag(&ChecksumHelperShmem->launcher_started))
		return;

This uses a non-concurrency-safe primitive. Which then spuriously triggers:

#define PG_HAVE_ATOMIC_UNLOCKED_TEST_FLAG
static inline bool
pg_atomic_unlocked_test_flag_impl(volatile pg_atomic_flag *ptr)
{
	/*
	 * Can't do this efficiently in the semaphore based implementation - we'd
	 * have to try to acquire the semaphore - so always return true. That's
	 * correct, because this is only an unlocked test anyway. Do this in the
	 * header so compilers can optimize the test away.
	 */
	return true;
}

Now, one can entirely quibble with the rationale that this is ok (I'll post a patch cleaning up the atomics simulation of flags in a bit), but this is certainly not a correct locking strategy.

Greetings, Andres Freund
On Fri, Apr 6, 2018 at 6:56 PM, Andres Freund <andres@anarazel.de> wrote:
> Now, one can entirely quibble with the rationale that this is ok (I'll
> post a patch cleaning up the atomics simulation of flags in a bit), but
> this is certainly not a correct locking strategy.

I think we have enough evidence at this point to conclude that this patch, along with MERGE, should be reverted.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> On 07 Apr 2018, at 00:23, Andres Freund <andres@anarazel.de> wrote:
> On 2018-04-06 02:28:17 +0200, Daniel Gustafsson wrote:
>> Applying this makes the _cancel test pass, moving the failure instead to the
>> following _enable test (which matches what coypu and mylodon are seeing).
>
> FWIW, I'm somewhat annoyed that I'm now spending time debugging this to
> get the buildfarm green again.

Sorry about that, I’m a bit slow due to various $family situations at the moment.

> I'm fairly certain that the bug here is a simple race condition in the
> test (not the main code!):

I wonder if it may perhaps be a case of both?

> It's
> exceedingly unsurprising that a 'pg_sleep(1)' is not a reliable way to
> make sure that a process has finished exiting. Then followup tests fail
> because the process is still running

I can reproduce the error when building with --disable-atomics, and it seems that all the failing members either do that, lack atomic.h, lack atomics, or a combination.

The checksumhelper cancellation uses pg_atomic_unlocked_test_flag() to test if it’s running when asked to abort, something which seems unsafe to do in the semaphore simulation as it always returns true. If I for debugging purposes synthesize a flag test with test-set/clear, the tests pass green (with the upstream patch for pg_atomic_test_set_flag_impl() applied). Cancelling with the semaphore simulation is thus doomed to never work, IIUC. Or it’s a red herring.

As Magnus mentioned upthread, rewriting this to not use an atomic flag is probably the best option, once the current failure is understood.

> really? Let's just force the test to take at least 6s purely from
> sleeping?

The test needs continuous reading in a session to try and trigger any bugs in read access on the cluster during checksumming, is there a good way to do that in the isolationtester? I have failed to find a good way to repeat a step like that, but I might be missing something.

cheers ./daniel
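To sketch what such a rewrite might look like (names and layout invented here, not the patch's code): protect the launcher state with a plain spinlock, so the "is it running?" test is an ordinary locked read that behaves identically under the semaphore-based atomics simulation.

#include "postgres.h"
#include "storage/spin.h"

typedef struct ChecksumHelperState
{
	slock_t		mutex;			/* protects the fields below */
	bool		launcher_started;
	bool		abort;
} ChecksumHelperState;

static ChecksumHelperState *ChecksumHelper;	/* points into shared memory */

/* Returns true if we were the ones to mark the launcher as started. */
static bool
ChecksumHelperTrySetStarted(void)
{
	bool		ok = false;

	SpinLockAcquire(&ChecksumHelper->mutex);
	if (!ChecksumHelper->launcher_started)
	{
		ChecksumHelper->launcher_started = true;
		ok = true;
	}
	SpinLockRelease(&ChecksumHelper->mutex);

	return ok;
}

static bool
ChecksumHelperIsStarted(void)
{
	bool		started;

	SpinLockAcquire(&ChecksumHelper->mutex);
	started = ChecksumHelper->launcher_started;
	SpinLockRelease(&ChecksumHelper->mutex);

	return started;
}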
On 2018-04-07 01:04:50 +0200, Daniel Gustafsson wrote:
> > I'm fairly certain that the bug here is a simple race condition in the
> > test (not the main code!):
>
> I wonder if it may perhaps be a case of both?

See my other message about the atomics fallback bit.

> > It's exceedingly unsurprising that a 'pg_sleep(1)' is not a reliable way to
> > make sure that a process has finished exiting. Then followup tests fail
> > because the process is still running
>
> I can reproduce the error when building with --disable-atomics, and it seems
> that all the failing members either do that, lack atomic.h, lack atomics or a
> combination.

atomics.h isn't important, it's just relevant for solaris (IIRC). Only one of the failing ones lacks atomics afaict. See:

On 2018-04-06 14:19:09 -0700, Andres Freund wrote:
> Is that an explanation for
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=gharial&dt=2018-04-06%2019%3A18%3A11
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=lousyjack&dt=2018-04-06%2016%3A03%3A01
> https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=sungazer&dt=2018-04-06%2015%3A46%3A16
> ?

Those all don't seem to fall under that? Having proper atomics? So there it's the timing. Note that they didn't always fail either.

> > really? Let's just force the test to take at least 6s purely from
> > sleeping?
>
> The test needs continuous reading in a session to try and trigger any bugs in
> read access on the cluster during checksumming, is there a good way to do that
> in the isolationtester? I have failed to find a good way to repeat a step like
> that, but I might be missing something.

IDK, I know this isn't right.

Greetings, Andres Freund
> On 07 Apr 2018, at 01:13, Andres Freund <andres@anarazel.de> wrote: > > On 2018-04-07 01:04:50 +0200, Daniel Gustafsson wrote: >>> I'm fairly certain that the bug here is a simple race condition in the >>> test (not the main code!): >> >> I wonder if it may perhaps be a case of both? > > See my other message about the atomic fallback bit. Yep, my MUA pulled it down just as I had sent this. Thanks for confirming my suspicion. cheers ./daniel
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Fri, Apr 6, 2018 at 6:56 PM, Andres Freund <andres@anarazel.de> wrote:
> > Now, one can entirely quibble with the rationale that this is ok (I'll
> > post a patch cleaning up the atomics simulation of flags in a bit), but
> > this is certainly not a correct locking strategy.
>
> I think we have enough evidence at this point to conclude that this
> patch, along with MERGE, should be reverted.

I'm not sure that I see how some issues around getting the locking correct when starting/stopping the process are really evidence of a major problem with the patch - yes, it obviously needs to be fixed and it would have been unfortunate if we hadn't caught it, but a good bit of effort appears to have been taken to ensure that exactly this is tested (which is in part why the buildfarm is failing) and this evidently found an existing bug, which is hardly this patch's fault. In short, I don't agree (yet..) that this needs reverting.

I'm quite sure that bringing up MERGE in this thread and saying it needs to be reverted without even having the committer of that feature on the CC list isn't terribly useful and conflates two otherwise unrelated patches and efforts. Let's try to use the threads the way they're intended and keep our responses to each on their respective threads.

Thanks!

Stephen
Attachment
On 2018-04-06 19:31:56 -0400, Stephen Frost wrote:
> Greetings,
>
> * Robert Haas (robertmhaas@gmail.com) wrote:
> > On Fri, Apr 6, 2018 at 6:56 PM, Andres Freund <andres@anarazel.de> wrote:
> > > Now, one can entirely quibble with the rationale that this is ok (I'll
> > > post a patch cleaning up the atomics simulation of flags in a bit), but
> > > this is certainly not a correct locking strategy.
> >
> > I think we have enough evidence at this point to conclude that this
> > patch, along with MERGE, should be reverted.
>
> I'm not sure that I see how some issues around getting the locking correct
> when starting/stopping the process are really evidence of a major problem
> with the patch -

Note that there've been several other things mentioned in the thread. I'll add some more in a bit.

> yes, it obviously needs to be fixed and it would have been unfortunate
> if we hadn't caught it, but a good bit of effort appears to have been
> taken to ensure that exactly this is tested (which is in part why the
> buildfarm is failing) and this evidently found an existing bug, which
> is hardly this patch's fault.

THAT is the problem. It costs people that haven't been involved in the feature their time. I've friggin started debugging this because nobody else could be bothered. Even though I'd planned to spend that time on other patches that have been submitted far ahead in time.

> I'm quite sure that bringing up MERGE in this thread and saying it needs
> to be reverted without even having the committer of that feature on the
> CC list isn't terribly useful and conflates two otherwise unrelated
> patches and efforts.

Robert also mentioned it on the other thread, so... And no, they're not unrelated matters, in that it's pushing half-baked stuff.

Greetings, Andres Freund
On 2018-04-07 01:27:13 +0200, Daniel Gustafsson wrote:
> > On 07 Apr 2018, at 01:13, Andres Freund <andres@anarazel.de> wrote:
> >
> > On 2018-04-07 01:04:50 +0200, Daniel Gustafsson wrote:
> >>> I'm fairly certain that the bug here is a simple race condition in the
> >>> test (not the main code!):
> >>
> >> I wonder if it may perhaps be a case of both?
> >
> > See my other message about the atomic fallback bit.
>
> Yep, my MUA pulled it down just as I had sent this. Thanks for confirming my
> suspicion.

But note it fails because the code using it is WRONG. There's a reason there's "unlocked" in the name.

But even leaving that aside, it'd probably *still* be wrong if it were locked. It seems *extremely* dubious that we'll allow checksums to be re-enabled while a worker is still doing stuff for the old cycle in the background. Consider what happens if the checksum helper is currently doing RequestCheckpoint() (something that can certainly take a *LONG* while). Another process disables checksums. Pages get written out without checksums. Yet another process re-enables checksums. The helper process does SetDataChecksumsOn(). Which succeeds because

	if (ControlFile->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_VERSION)
	{
		LWLockRelease(ControlFileLock);
		elog(ERROR, "Checksums not in inprogress mode");
	}

does not fire. Boom. Cluster with partially set checksums but marked as valid.

Greetings, Andres Freund
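One conceivable way to close that hole, purely a sketch: the data_checksum_cycle field and the enable_cycle argument are invented here, while SetDataChecksumsOn, XlogChecksums and PG_DATA_CHECKSUM_INPROGRESS_VERSION are the patch's. The idea is to stamp every enable attempt with a cycle number, so a worker left over from a cancelled run can no longer flip the state to "on":

void
SetDataChecksumsOn(uint64 enable_cycle)
{
	Assert(ControlFile != NULL);

	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);

	/*
	 * Refuse to complete if checksums were disabled and/or re-enabled since
	 * this worker's run began: the cycle number would no longer match the
	 * one handed to the worker at launch.
	 */
	if (ControlFile->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_VERSION ||
		ControlFile->data_checksum_cycle != enable_cycle)
	{
		LWLockRelease(ControlFileLock);
		elog(ERROR, "checksum enabling was cancelled or restarted concurrently");
	}

	ControlFile->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
	UpdateControlFile();
	LWLockRelease(ControlFileLock);

	XlogChecksums(PG_DATA_CHECKSUM_VERSION);
}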
Andres, * Andres Freund (andres@anarazel.de) wrote: > On 2018-04-06 19:31:56 -0400, Stephen Frost wrote: > > I'm quite sure that bringing up MERGE in this thread and saying it needs > > to be reverted without even having the committer of that feature on the > > CC list isn't terribly useful and conflates two otherwise unrelated > > patches and efforts. > > Robert also mentioned it on the other thread, so... And no, they're not > unrelated matters, in that it's pushing half baked stuff. Apparently I've missed where he specifically called for it to be reverted then, which is fine, and my apologies for missing it amongst the depth of that particular thread. I do think that specifically asking for it to be reverted is distinct from expressing concerns about it. Thanks! Stephen
Attachment
Here's a pass through the patch:

@@ -1033,7 +1034,7 @@ XLogInsertRecord(XLogRecData *rdata,
 		Assert(RedoRecPtr < Insert->RedoRecPtr);
 		RedoRecPtr = Insert->RedoRecPtr;
 	}
-	doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites);
+	doPageWrites = (Insert->fullPageWrites || Insert->forcePageWrites || DataChecksumsInProgress());

 	if (fpw_lsn != InvalidXLogRecPtr && fpw_lsn <= RedoRecPtr && doPageWrites)

Why does this need an in-progress specific addition? Given that we unconditionally log FPWs for all pages, and

#define XLogHintBitIsNeeded() (DataChecksumsNeedWrite() || wal_log_hints)

it's not clear what this achieves? At the very least it needs a comment.

@@ -4748,12 +4745,90 @@ GetMockAuthenticationNonce(void)
  * Are checksums enabled for data pages?
  */
 bool
-DataChecksumsEnabled(void)
+DataChecksumsNeedWrite(void)
 {
 	Assert(ControlFile != NULL);
 	return (ControlFile->data_checksum_version > 0);
 }

+bool
+DataChecksumsNeedVerify(void)
+{
+	Assert(ControlFile != NULL);
+
+	/*
+	 * Only verify checksums if they are fully enabled in the cluster. In
+	 * inprogress state they are only updated, not verified.
+	 */
+	return (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_VERSION);
+}
+
+bool
+DataChecksumsInProgress(void)
+{
+	Assert(ControlFile != NULL);
+	return (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION);
+}

As previously mentioned, the locking model around this is unclear. It's probably fine due to the surrounding memory barriers, but that needs to be very very explicitly documented.

+void
+SetDataChecksumsOn(void)
+{
+	Assert(ControlFile != NULL);
+
+	LWLockAcquire(ControlFileLock, LW_EXCLUSIVE);
+
+	if (ControlFile->data_checksum_version != PG_DATA_CHECKSUM_INPROGRESS_VERSION)
+	{
+		LWLockRelease(ControlFileLock);
+		elog(ERROR, "Checksums not in inprogress mode");
+	}
+
+	ControlFile->data_checksum_version = PG_DATA_CHECKSUM_VERSION;
+	UpdateControlFile();
+	LWLockRelease(ControlFileLock);
+
+	XlogChecksums(PG_DATA_CHECKSUM_VERSION);

As I've explained in https://www.postgresql.org/message-id/20180406235126.d4sg4dtgicdpucnj@alap3.anarazel.de this appears to be unsafe. There's no guarantee that ControlFile->data_checksum_version hasn't intermittently been set to 0.

@@ -7788,6 +7863,16 @@ StartupXLOG(void)
 	 */
 	CompleteCommitTsInitialization();

+	/*
+	 * If we reach this point with checksums in inprogress state, we notify
+	 * the user that they need to manually restart the process to enable
+	 * checksums.
+	 */
+	if (ControlFile->data_checksum_version == PG_DATA_CHECKSUM_INPROGRESS_VERSION)
+		ereport(WARNING,
+				(errmsg("checksum state is \"inprogress\" with no worker"),
+				 errhint("Either disable or enable checksums by calling the pg_disable_data_checksums() or pg_enable_data_checksums() functions.")));
+

Hm, so one manually has to take action here. Any reason we can't just re-start the worker? Also, this'll be issued on standbys etc too, that seems misleading?

+/*
+ * Disables checksums for the cluster, unless already disabled.
+ *
+ * Has immediate effect - the checksums are set to off right away.
+ */
+Datum
+disable_data_checksums(PG_FUNCTION_ARGS)
+{
+	/*
+	 * If we don't need to write new checksums, then clearly they are already
+	 * disabled.
+	 */
+	if (!DataChecksumsNeedWrite())
+		ereport(ERROR,
+				(errmsg("data checksums already disabled")));
+
+	ShutdownChecksumHelperIfRunning();
+
+	SetDataChecksumsOff();
+
+	PG_RETURN_VOID();

Unsafe, see the SetDataChecksumsOn comment above. Shouldn't this be named with a pg_?
We normally do that for SQL callable functions, no? See the preceding functions.

This function is marked PROPARALLEL_SAFE. That can't be right? Also, shouldn't this refuse to work if called in recovery mode? Erroring out with

ERROR: XX000: cannot make new WAL entries during recovery

doesn't seem right.

+/*
+ * Enables checksums for the cluster, unless already enabled.
+ *
+ * Supports vacuum-like cost-based throttling, to limit system load.
+ * Starts a background worker that updates checksums on existing data.
+ */
+Datum
+enable_data_checksums(PG_FUNCTION_ARGS)

This is PROPARALLEL_RESTRICTED. That doesn't strike me right, shouldn't they be PROPARALLEL_UNSAFE? It might be fine, but I'd not want to rely on it.

+/*
+ * Main entry point for checksumhelper launcher process.
+ */
+bool
+StartChecksumHelperLauncher(int cost_delay, int cost_limit)

"entry point" sounds a bit like it's the bgw-invoked routine...

+/*
+ * ShutdownChecksumHelperIfRunning
+ *		Request shutdown of the checksumhelper
+ *
+ * This does not turn off processing immediately, it signals the checksum
+ * process to end when done with the current block.
+ */
+void
+ShutdownChecksumHelperIfRunning(void)
+{
+	/* If the launcher isn't started, there is nothing to shut down */
+	if (pg_atomic_unlocked_test_flag(&ChecksumHelperShmem->launcher_started))
+		return;

Using an unlocked op here without docs seems wrong. See also the mail referenced above.

+static bool
+ProcessSingleRelationFork(Relation reln, ForkNumber forkNum, BufferAccessStrategy strategy)
...
+	BlockNumber numblocks = RelationGetNumberOfBlocksInFork(reln, forkNum);

Shouldn't this have a comment explaining that this is safe because all pages that concurrently get added to the end are going to be checksummed by the extender?

+		/* Need to get an exclusive lock before we can flag as dirty */
+		LockBuffer(buf, BUFFER_LOCK_EXCLUSIVE);

Hm. So we'll need exclusive locks on all pages in the database. On busy inner tables that's going to be painful. It shouldn't be too hard to reduce this to share locks.

+		START_CRIT_SECTION();
+		MarkBufferDirty(buf);
+		log_newpage_buffer(buf, false);

Hm. So we always log buffers as non-standard ones? That's going to cause quite the increase in FPW space.

+	elog(DEBUG2, "Checksumhelper done with relation %d: %s",
+		 relationId, (aborted ? "aborted" : "finished"));

Hm. Wouldn't it be advisable to include actual relation names? It's pretty annoying to keep track this way.

+void
+ChecksumHelperLauncherMain(Datum arg)
....
+	/*
+	 * Create a database list. We don't need to concern ourselves with
+	 * rebuilding this list during runtime since any database created after
+	 * this process started will be running with checksums turned on from the
+	 * start.
+	 */

Why is this true? What if somebody runs CREATE DATABASE while the launcher / worker are processing a different database? It'll copy the template database on the filesystem level, and it very well might not yet have checksums set? Afaict the second time we go through this list that's not cought.

+		if (processing == SUCCESSFUL)
+		{
+			pfree(db->dbname);
+			pfree(db);
...

Why bother with that and other deallocations here and in other places? Launcher's going to quit anyway...

+		else if (processing == FAILED)
+		{
+			/*
+			 * Put failed databases on the remaining list.
+			 */
+			remaining = lappend(remaining, db);

Uh. So we continue checksumming the entire cluster if one database failed? Given there's no restart capability at all, that seems hard to defend?
+	CurrentDatabases = BuildDatabaseList();
+
+	foreach(lc, remaining)
+	{
+		ChecksumHelperDatabase *db = (ChecksumHelperDatabase *) lfirst(lc);
+		bool		found = false;
+
+		foreach(lc2, CurrentDatabases)
+		{
+			ChecksumHelperDatabase *db2 = (ChecksumHelperDatabase *) lfirst(lc2);
+
+			if (db->dboid == db2->dboid)
+			{
+				found = true;

This is O(N^2) comparison logic? There are clusters with tens of thousands of databases... The comparison costs are low, but still.

+		/*
+		 * Foreign tables have by definition no local storage that can be
+		 * checksummed, so skip.
+		 */
+		if (pgc->relkind == RELKIND_FOREIGN_TABLE)
+			continue;
+

That strikes me as a dangerous form of test. Shouldn't we instead check whether a relfilenode exists? I'll note that this test currently includes plain views. It's just the smgrexist() test that makes it work.

- Andres
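For the record, one way to avoid the quadratic scan - a sketch only: ChecksumHelperDatabase and its dboid field are the patch's, the rest is invented - is to build a hash table over the current databases once and probe it per remaining entry:

#include "postgres.h"
#include "nodes/pg_list.h"
#include "utils/hsearch.h"

/* Build a hash of database oids; membership tests are then O(1). */
static HTAB *
BuildDatabaseOidHash(List *databases)
{
	HASHCTL		ctl;
	HTAB	   *htab;
	ListCell   *lc;

	memset(&ctl, 0, sizeof(ctl));
	ctl.keysize = sizeof(Oid);
	ctl.entrysize = sizeof(Oid);

	htab = hash_create("checksumhelper database oids", 64, &ctl,
					   HASH_ELEM | HASH_BLOBS);

	foreach(lc, databases)
	{
		ChecksumHelperDatabase *db = (ChecksumHelperDatabase *) lfirst(lc);

		(void) hash_search(htab, &db->dboid, HASH_ENTER, NULL);
	}

	return htab;
}

The inner foreach then collapses to a single probe:

	found = (hash_search(dbhash, &db->dboid, HASH_FIND, NULL) != NULL);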
On 2018-04-06 17:59:28 -0700, Andres Freund wrote:
> + /*
> + * Create a database list. We don't need to concern ourselves with
> + * rebuilding this list during runtime since any database created after
> + * this process started will be running with checksums turned on from the
> + * start.
> + */
>
> Why is this true? What if somebody runs CREATE DATABASE while the
> launcher / worker are processing a different database? It'll copy the
> template database on the filesystem level, and it very well might not
> yet have checksums set? Afaict the second time we go through this list
> that's not cought.

*caught

It's indeed trivial to reproduce this, just slowing down a checksum run and copying the database yields:

./pg_verify_checksums -D /srv/dev/pgdev-dev
pg_verify_checksums: checksum verification failed in file "/srv/dev/pgdev-dev/base/16385/2703", block 0: calculated checksum 45A7 but expected 0
pg_verify_checksums: checksum verification failed in file "/srv/dev/pgdev-dev/base/16385/2703", block 1: calculated checksum 8C7D but expected 0

further complaints:

The new isolation test cannot be re-run on an existing cluster. That's because the first test expects checksums to be disabled. As even remarked upon:

# The checksum_enable suite will enable checksums for the cluster so should
# not run before anything expecting the cluster to have checksums turned off

How's that ok? You can leave database-wide objects around, but the cluster-wide stuff needs to be cleaned up.

The tests don't actually make sure that no checksum launcher / worker is running anymore. They just assume that it's gone once the GUC shows checksums have been set. If you wanted to make the tests stable, you'd need to wait for that to show true *and* then check that no workers are around anymore.

If it's not obvious: This isn't ready, should be reverted, cleaned up, and re-submitted for v12.

Greetings, Andres Freund
Hi,

On Sat, Apr 07, 2018 at 08:57:03AM +0200, Magnus Hagander wrote:
> On Sat, Apr 7, 2018 at 6:26 AM, Andres Freund <andres@anarazel.de> wrote:
> > If it's not obvious: This isn't ready, should be reverted, cleaned up,
> > and re-submitted for v12.
>
> While I do think that it's still definitely fixable in time for 11, I won't
> argue for it. Will revert.

:(

Can the pg_verify_checksums command be kept at least, please?

AFAICT this one is not contentious, the code is isolated, it's really useful, orthogonal to online checksum activation and arguably could've been committed as a separate patch anyway.

Michael

-- Michael Banck Projektleiter / Senior Berater Tel.: +49 2166 9901-171 Fax: +49 2166 9901-100 Email: michael.banck@credativ.de credativ GmbH, HRB Mönchengladbach 12080 USt-ID-Nummer: DE204566209 Trompeterallee 108, 41189 Mönchengladbach Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer
Hi, On 2018-04-07 10:14:49 +0200, Michael Banck wrote: > Can the pg_verify_checksums command be kept at least, please? > > AFAICT this one is not contentious, the code is isolated, it's really > useful, orthogonal to online checksum activation and argueably could've > been committed as a separate patch anyway. I've not looked at it in any meaningful amount of detail, but it does seem a lot lower risk from here. Greetings, Andres Freund
Hi,

On 2018-04-07 08:57:03 +0200, Magnus Hagander wrote:
> Note however that I'm sans-laptop until Sunday, so I will revert it then or
> possibly Monday.

I'll deactivate the isolationtester tests until then. They've been intermittently broken for days now and prevent other tests from being exercised.

Greetings, Andres Freund
On Fri, Apr 6, 2018 at 8:59 PM, Andres Freund <andres@anarazel.de> wrote:
> This is PROPARALLEL_RESTRICTED. That doesn't strike me right, shouldn't
> they be PROPARALLEL_UNSAFE? It might be fine, but I'd not want to rely
> on it.

Just a fine-grained note on this particular point:

It's totally fine for parallel-restricted operations to write WAL, write to the filesystem, or launch nukes at ${ENEMY_NATION}. Well, I mean, the last one might be a bad idea for geopolitical reasons, but it's not a problem for parallel query. It is a problem to insert or update heap tuples because it might extend the relation; mutual exclusion doesn't work properly there yet (there was a patch to fix that, but you had some concerns and it didn't go in). It is a problem to update or delete heap tuples which might create new combo CIDs; not all workers will have the same view (there's no patch for this yet AFAIK, but the fix probably doesn't look that different from cc5f81366c36b3dd8f02bd9be1cf75b2cc8482bd and could probably use most of the same infrastructure).

TL;DR: Writing pages (e.g. to set a checksum) doesn't make something non-parallel-safe. Writing heap tuples makes it parallel-unsafe.

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Sat, Apr 7, 2018 at 6:22 PM, Andres Freund <andres@anarazel.de> wrote:
> Hi,
>
> On 2018-04-07 08:57:03 +0200, Magnus Hagander wrote:
> > Note however that I'm sans-laptop until Sunday, so I will revert it then or
> > possibly Monday.
>
> I'll deactivate the isolationtester tests until then. They've been
> intermittently broken for days now and prevent other tests from being
> exercised.

Thanks.

I've pushed the revert now, and left the pg_verify_checksums tool in place for the time being.
Attachment
Hello

I tried to build this patch and got an error during make docs:

> postgres.sgml:19626: element xref: validity error : IDREF attribute linkend references an unknown ID "runtime-checksumhelper-cost-limit"
> postgres.sgml:19625: element xref: validity error : IDREF attribute linkend references an unknown ID "runtime-checksumhelper-cost-delay"

Both new GUCs checksumhelper_cost_delay and checksumhelper_cost_limit are mentioned in postgresql.conf with the special value -1 (-1 to use vacuum_cost_limit), but this value was not mentioned in the docs. I noticed that the code and documentation describe different defaults.

Also I found one "<literal>in progress</literal>" in the pg_enable_data_checksums() description. In other places the status is called "inprogress" (without a space).

> VacuumPageHit = 0;
> VacuumPageMiss = 0;
> VacuumPageDirty = 0;

Hm, why are these settings set to 0 in the checksumhelper process?

> /*
>  * Force a checkpoint to get everything out to disk. XXX: this should
>  * probably not be an IMMEDIATE checkpoint, but leave it there for now for
>  * testing
>  */
> RequestCheckpoint(CHECKPOINT_FORCE | CHECKPOINT_WAIT | CHECKPOINT_IMMEDIATE);

We should not forget that.

regards, Sergei
The following review has been posted through the commitfest application:
make installcheck-world: tested, failed
Implements feature: not tested
Spec compliant: not tested
Documentation: tested, failed

Hello

As I wrote a few weeks ago, I cannot build the documentation due to these errors:

> postgres.sgml:19625: element xref: validity error : IDREF attribute linkend references an unknown ID "runtime-checksumhelper-cost-delay"
> postgres.sgml:19626: element xref: validity error : IDREF attribute linkend references an unknown ID "runtime-checksumhelper-cost-limit"

After removing these xrefs for test purposes, the patch passes check-world.

regards, Sergei

The new status of this patch is: Waiting on Author
> On 24 Jul 2018, at 11:05, Sergei Kornilov <sk@zsrv.org> wrote:
>
> The following review has been posted through the commitfest application:
> make installcheck-world: tested, failed
> Implements feature: not tested
> Spec compliant: not tested
> Documentation: tested, failed
>
> Hello
> As I wrote a few weeks ago, I cannot build the documentation due to these errors:
>> postgres.sgml:19625: element xref: validity error : IDREF attribute linkend references an unknown ID "runtime-checksumhelper-cost-delay"
>> postgres.sgml:19626: element xref: validity error : IDREF attribute linkend references an unknown ID "runtime-checksumhelper-cost-limit"
>
> After removing these xrefs for test purposes, the patch passes check-world.

Hi!

Thanks for reviewing. I’ve updated the patch to fix the above mentioned incorrect linkends as well as the comments you made in a previous review.

The CF-builder-bot is red, but it’s because it’s trying to apply the already committed patch which is in the attached datallowconn thread.

cheers ./daniel
Attachment
Hello

Thank you for the update! I did only a quick test so far: the patch applied and built cleanly. But I get a reproducible error during check-world:

> t/001_standby_checksum.pl .. 6/10
> # Failed test 'ensure checksums are enabled on standby'
> # at t/001_standby_checksum.pl line 84.
> # got: 'inprogress'
> # expected: 'on'

In the standby log I found this error:

> 2018-07-25 13:13:05.463 MSK [16544] FATAL: could not receive data from WAL stream: ERROR: requested WAL segment 000000010000000000000003 has already been removed

The checksumhelper obviously writes a lot of WAL. The test passes if I change the restart order to standby first:

> $node_standby_1->restart();
> $node_master->restart();

Or we need a replication slot in the test setup.

Also, we log this record after start:

> data checksums in pending state, starting background worker to enable

even in recovery, although the background worker actually starts only at the end of recovery (or at promote). I think it would be better to use a DEBUG ereport in the postmaster and LOG in the checksumhelper.

regards, Sergei

25.07.2018, 12:35, "Daniel Gustafsson" <daniel@yesql.se>:
>> On 24 Jul 2018, at 11:05, Sergei Kornilov <sk@zsrv.org> wrote:
>>
>> The following review has been posted through the commitfest application:
>> make installcheck-world: tested, failed
>> Implements feature: not tested
>> Spec compliant: not tested
>> Documentation: tested, failed
>>
>> Hello
>> As I wrote a few weeks ago, I cannot build the documentation due to these errors:
>>> postgres.sgml:19625: element xref: validity error : IDREF attribute linkend references an unknown ID "runtime-checksumhelper-cost-delay"
>>> postgres.sgml:19626: element xref: validity error : IDREF attribute linkend references an unknown ID "runtime-checksumhelper-cost-limit"
>>
>> After removing these xrefs for test purposes, the patch passes check-world.
>
> Hi!
>
> Thanks for reviewing. I’ve updated the patch to fix the above mentioned incorrect
> linkends as well as the comments you made in a previous review.
>
> The CF-builder-bot is red, but it’s because it’s trying to apply the already
> committed patch which is in the attached datallowconn thread.
>
> cheers ./daniel
On Tue, Jun 26, 2018 at 7:45 AM, Magnus Hagander <magnus@hagander.net> wrote: > PFA an updated version of the patch for the next CF. We believe this one > takes care of all the things pointed out so far. > > For this version, we "implemented" the MegaExpensiveRareMemoryBarrier() by > simply requiring a restart of PostgreSQL to initiate the conversion > background. That is definitely going to guarantee a memory barrier. It's > certainly not ideal, but restarting the cluster is still a *lot* better than > having to do the entire conversion offline. This can of course be improved > upon in the future, but for now we stuck to the safe way. Honestly, I feel like the bar for this feature ought to be higher than that. (I half-expect a vigorous discussion of whether I have set the bar for the features I've developed in the right place or not, but I think that's not really a fair response. If somebody thinks some feature I implemented should've been more baked, they might be right, but that's not what this thread is about. I'm giving you MY opinion about THIS patch, nothing more or less.) Why can't we do better? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On July 26, 2018 10:03:39 AM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jun 26, 2018 at 7:45 AM, Magnus Hagander <magnus@hagander.net> wrote:
> > PFA an updated version of the patch for the next CF. We believe this one
> > takes care of all the things pointed out so far.
> >
> > For this version, we "implemented" the MegaExpensiveRareMemoryBarrier() by
> > simply requiring a restart of PostgreSQL to initiate the conversion
> > background. That is definitely going to guarantee a memory barrier. It's
> > certainly not ideal, but restarting the cluster is still a *lot* better than
> > having to do the entire conversion offline. This can of course be improved
> > upon in the future, but for now we stuck to the safe way.
>
> Honestly, I feel like the bar for this feature ought to be higher than that.
>
> (I half-expect a vigorous discussion of whether I have set the bar for
> the features I've developed in the right place or not, but I think
> that's not really a fair response. If somebody thinks some feature I
> implemented should've been more baked, they might be right, but that's
> not what this thread is about. I'm giving you MY opinion about THIS
> patch, nothing more or less.)

+1

> Why can't we do better?

I don't think it's that hard to do better. IIRC I even outlined something before the freeze. If not, I certainly can (sketch: use a procsignal based acknowledgment protocol, using a 64 bit integer. Useful for plenty of other things).

Andres
-- Sent from my Android device with K-9 Mail. Please excuse my brevity.
On Wed, Jul 25, 2018 at 11:35:31AM +0200, Daniel Gustafsson wrote:
> > On 24 Jul 2018, at 11:05, Sergei Kornilov <sk@zsrv.org> wrote:
> >
> > The following review has been posted through the commitfest application:
> > make installcheck-world: tested, failed
> > Implements feature: not tested
> > Spec compliant: not tested
> > Documentation: tested, failed
> >
> > Hello
> > As I wrote a few weeks ago, I cannot build the documentation due to these errors:
> >> postgres.sgml:19625: element xref: validity error : IDREF attribute linkend references an unknown ID "runtime-checksumhelper-cost-delay"
> >> postgres.sgml:19626: element xref: validity error : IDREF attribute linkend references an unknown ID "runtime-checksumhelper-cost-limit"
> >
> > After removing these xrefs for test purposes, the patch passes check-world.
>
> Hi!
>
> Thanks for reviewing, I’ve updated the patch with the above mentioned incorrect
> linkends as well as fixed the comments you made in a previous review.
>
> The CF-builder-bot is red, but it’s because it’s trying to apply the already
> committed patch which is in the attached datallowconn thread.

I think checksumhelper_cost_delay should be checksum_helper_cost_delay.

Is "helper" the right word?

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
On 07/31/2018 12:45 PM, Bruce Momjian wrote:
>> Hi!
>>
>> Thanks for reviewing, I’ve updated the patch with the above mentioned incorrect
>> linkends as well as fixed the comments you made in a previous review.
>>
>> The CF-builder-bot is red, but it’s because it’s trying to apply the already
>> committed patch which is in the attached datallowconn thread.
>
> I think checksumhelper_cost_delay should be checksum_helper_cost_delay.
>
> Is "helper" the right word?

Based on other terminology within postgresql.conf, should it be "checksum_worker_cost_delay"?

JD

-- Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
*** A fault and talent of mine is to tell it exactly how it is. ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
***** Unless otherwise stated, opinions are my own. *****
On Tue, Jul 31, 2018 at 12:52:40PM -0700, Joshua Drake wrote:
> On 07/31/2018 12:45 PM, Bruce Momjian wrote:
> >> Hi!
> >>
> >> Thanks for reviewing, I’ve updated the patch with the above mentioned incorrect
> >> linkends as well as fixed the comments you made in a previous review.
> >>
> >> The CF-builder-bot is red, but it’s because it’s trying to apply the already
> >> committed patch which is in the attached datallowconn thread.
> >
> > I think checksumhelper_cost_delay should be checksum_helper_cost_delay.
> >
> > Is "helper" the right word?
>
> Based on other terminology within postgresql.conf, should it be
> "checksum_worker_cost_delay"?

+1

-- Bruce Momjian <bruce@momjian.us> http://momjian.us EnterpriseDB http://enterprisedb.com
+ As you are, so once was I. As I am, so you will be. +
+ Ancient Roman grave inscription +
> On 31 Jul 2018, at 21:52, Joshua D. Drake <jd@commandprompt.com> wrote:
>
> On 07/31/2018 12:45 PM, Bruce Momjian wrote:
>>> Thanks for reviewing, I’ve updated the patch with the above mentioned incorrect
>>> linkends as well as fixed the comments you made in a previous review.
>>>
>>> The CF-builder-bot is red, but it’s because it’s trying to apply the already
>>> committed patch which is in the attached datallowconn thread.
>> I think checksumhelper_cost_delay should be checksum_helper_cost_delay.
>>
>> Is "helper" the right word?

IIRC, “helper” was chosen to signal that it’s a single process, whereas “worker” may be thought of as a process of which there can be many.

> Based on other terminology within postgresql.conf, should it be "checksum_worker_cost_delay”?

Yes, I think it makes sense to rename it “worker” to align better with the postgres nomenclature. Will fix.

cheers ./daniel
> On 26 Jul 2018, at 19:35, Andres Freund <andres@anarazel.de> wrote:
> On July 26, 2018 10:03:39 AM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
>> Why can't we do better?
>
> I don't think it's that hard to do better. IIRC I even outlined something before the freeze. If not, I certainly can (sketch: use a procsignal based acknowledgment protocol, using a 64 bit integer. Useful for plenty of other things).

Not really arguing for or against, but just to understand the reasoning before starting hacking. Why do we feel that a restart (intended for safety here) in this case is a burden on a use-once process? Is it from a usability or technical point of view? Just want to make sure we are on the same page before digging in, to not hack on this patch in a direction which isn’t what is requested.

cheers ./daniel
Hi,

On 2018-07-31 23:20:27 +0200, Daniel Gustafsson wrote:
> > On 26 Jul 2018, at 19:35, Andres Freund <andres@anarazel.de> wrote:
> > On July 26, 2018 10:03:39 AM PDT, Robert Haas <robertmhaas@gmail.com> wrote:
>
> >> Why can't we do better?
> >
> > I don't think it's that hard to do better. IIRC I even outlined something before the freeze. If not, I certainly can (sketch: use a procsignal based acknowledgment protocol, using a 64 bit integer. Useful for plenty of other things).
>
> Not really arguing for or against, but just to understand the reasoning before
> starting hacking. Why do we feel that a restart (intended for safety here) in
> this case is a burden on a use-once process? Is it from a usability or
> technical point of view? Just want to make sure we are on the same page before
> digging in, to not hack on this patch in a direction which isn’t what is
> requested.

Having to restart the server at some arbitrary-seeming point in the middle of enabling checksums makes it harder to use and to schedule. The restart is only needed to fix a relatively small issue, and doesn't save that much code.

Greetings, Andres Freund
Andres Freund <andres@anarazel.de> writes: > On 2018-07-31 23:20:27 +0200, Daniel Gustafsson wrote: >> Not really arguing for or against, but just to understand the reasoning before >> starting hacking. Why do we feel that a restart (intended for safety here) in >> this case is a burden on a use-once process? Is it from a usability or >> technical point of view? Just want to make sure we are on the same page before >> digging in to not hack on this patch in a direction which isn’t what is >> requested. > Having, at some arbitrary seeming point in the middle of enabling > checksums to restart the server makes it harder to use and to schedule. > The restart is only needed to fix a relatively small issue, and doesn't > save that much code. Without taking a position on the merits ... I don't see how you can claim "it doesn't save that much code" when we don't have a patch to compare to that doesn't require the restart. Maybe it will turn out not to be much code, but we don't know that now. regards, tom lane
On 2018-07-31 17:28:41 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2018-07-31 23:20:27 +0200, Daniel Gustafsson wrote:
> >> Not really arguing for or against, but just trying to understand the
> >> reasoning before starting hacking. Why do we feel that a restart
> >> (intended for safety here) is in this case a burden on a use-once
> >> process? Is it from a usability or a technical point of view? I just
> >> want to make sure we are on the same page before digging in, so as not
> >> to hack on this patch in a direction which isn’t what is requested.
>
> > Having, to restart the server at some arbitrary-seeming point in the
> > middle of enabling checksums makes the feature harder to use and to
> > schedule. The restart is only needed to fix a relatively small issue,
> > and doesn't save that much code.
>
> Without taking a position on the merits ... I don't see how you can
> claim "it doesn't save that much code" when we don't have a patch to
> compare to that doesn't require the restart. Maybe it will turn out
> not to be much code, but we don't know that now.

IIRC I outlined a solution around the feature freeze, and I've since offered
to go into further depth if needed. And I'd pointed out the issue at hand. So
while I'd obviously not want to predict a specific line count, I'm fairly sure
I have a reasonable guesstimate about the complexity.

- Andres
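In outline, the acknowledgment protocol sketched above can be modeled in a few
dozen lines. Below is a standalone toy in plain C (threads stand in for
backends; none of these names are PostgreSQL's or the patch's actual API),
assuming a coordinator that bumps a shared 64-bit generation and waits until
every backend has acknowledged it:

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

#define NBACKENDS 4

/* Shared generation counter: bumped once per cluster-wide state change. */
static _Atomic uint64_t barrier_generation;
/* Per-backend record of the newest generation it has absorbed. */
static _Atomic uint64_t acked[NBACKENDS];
static _Atomic int      shutting_down;

/* Backend side: absorb the new state, then acknowledge the generation. */
static void backend_absorb(int id)
{
    uint64_t gen = atomic_load(&barrier_generation);
    /* ... re-read the checksum state from shared memory here ... */
    atomic_store(&acked[id], gen);
}

static void *backend_main(void *arg)
{
    int id = (int) (intptr_t) arg;

    /* A real backend would do this in its procsignal handler instead. */
    while (!atomic_load(&shutting_down))
        backend_absorb(id);
    return NULL;
}

/* Coordinator side: bump the generation, then wait for every backend. */
static uint64_t emit_barrier(void)
{
    return atomic_fetch_add(&barrier_generation, 1) + 1;
}

static void wait_for_barrier(uint64_t gen)
{
    for (int i = 0; i < NBACKENDS; i++)
        while (atomic_load(&acked[i]) < gen)
            sched_yield();      /* real code would sleep on a latch */
}

int main(void)
{
    pthread_t th[NBACKENDS];

    for (int i = 0; i < NBACKENDS; i++)
        pthread_create(&th[i], NULL, backend_main, (void *) (intptr_t) i);

    uint64_t gen = emit_barrier();      /* e.g. switching off -> inprogress */
    wait_for_barrier(gen);
    printf("all backends acknowledged generation %llu\n",
           (unsigned long long) gen);

    atomic_store(&shutting_down, 1);
    for (int i = 0; i < NBACKENDS; i++)
        pthread_join(th[i], NULL);
    return 0;
}

In the real server the wakeup would ride on the existing procsignal machinery
and the waits would sleep on latches rather than spinning; the 64-bit
generation is what would make the same mechanism reusable for other
cluster-wide state changes.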
On 2018-Jul-31, Tom Lane wrote:

> Andres Freund <andres@anarazel.de> writes:
> > Having to restart the server at some arbitrary-seeming point in the
> > middle of enabling checksums makes the feature harder to use and to
> > schedule. The restart is only needed to fix a relatively small issue,
> > and doesn't save that much code.
>
> Without taking a position on the merits ... I don't see how you can
> claim "it doesn't save that much code" when we don't have a patch to
> compare to that doesn't require the restart. Maybe it will turn out
> not to be much code, but we don't know that now.

The ability to get checksums enabled is a killer feature; the ability to do it
with no restart ... okay, it's better than requiring a restart, but it's not
*that* big a deal.

In the spirit of supporting incremental development, I think it's quite
sensible to get the current thing done, then see what it takes to get the next
thing done. Each is an improvement on its own merits. And it doesn't have to
be made by the same people.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

On 2018-07-31 18:56:29 -0400, Alvaro Herrera wrote:
> In the spirit of supporting incremental development, I think it's quite
> sensible to get the current thing done, then see what it takes to get
> the next thing done. Each is an improvement on its own merits. And it
> doesn't have to be made by the same people.

I just don't buy this. An earlier version of this feature was committed to v11
without the restart, over objections. There's now extra state in the control
file to support the restart-based system, there's extra tests, extra docs. And
it'd not be much code to just make it work without the restart. The process
around this patchset is just plain weird.

Greetings,

Andres Freund
Hi,

On Tuesday, 31.07.2018, at 18:56 -0400, Alvaro Herrera wrote:
> The ability to get checksums enabled is a killer feature; the ability to
> do it with no restart ... okay, it's better than requiring a restart,
> but it's not *that* big a deal.

Well, it's downtime and a service interruption from the client's POV, and
arguably we should remove the 'online' from the patch subject then. You can
already activate checksums on an offline instance via the pg_checksums
extensions of pg_verify_checksums that Michael Paquier and I wrote
independently; of course, that downtime will grow linearly with the amount of
data you have.

If this was one week before feature freeze, I would agree with you that it
makes sense to ship it with the restart requirement rather than not shipping
it at all. But we're several commitfests away from v12, so making an effort to
have this work without a downtime looks like a reasonable requirement to me.


Michael

-- 
Michael Banck
Projektleiter / Senior Berater
Tel.: +49 2166 9901-171
Fax:  +49 2166 9901-100
Email: michael.banck@credativ.de

credativ GmbH, HRB Mönchengladbach 12080
USt-ID-Nummer: DE204566209
Trompeterallee 108, 41189 Mönchengladbach
Geschäftsführung: Dr. Michael Meskes, Jörg Folz, Sascha Heuer

Our handling of personal data is subject to the following
provisions: https://www.credativ.de/datenschutz
On 08/01/2018 10:40 AM, Michael Banck wrote:
> Hi,
>
> On Tuesday, 31.07.2018, at 18:56 -0400, Alvaro Herrera wrote:
>> The ability to get checksums enabled is a killer feature; the
>> ability to do it with no restart ... okay, it's better than
>> requiring a restart, but it's not *that* big a deal.
>
> Well, it's downtime and a service interruption from the client's POV,
> and arguably we should remove the 'online' from the patch subject
> then. You can already activate checksums on an offline instance via
> the pg_checksums extensions of pg_verify_checksums that Michael
> Paquier and I wrote independently; of course, that downtime will grow
> linearly with the amount of data you have.

IMHO the main work still happens while the instance is running, so I don't see
why the restart would make it "not online". But keeping or not keeping
"online" is not the main dilemma faced by this patch, I think. That is, if we
drop "online" from the name, I doubt it'll make it any more acceptable to
those objecting to having to restart the cluster.

> If this was one week before feature freeze, I would agree with you
> that it makes sense to ship it with the restart requirement rather
> than not shipping it at all. But we're several commitfests away from
> v12, so making an effort to have this work without a downtime
> looks like a reasonable requirement to me.

Why would all those pieces have to be committed at once? Why not commit what
we have now (with the restart) and then remove the restriction in a later
commit? I understand the desire to be able to enable checksums without a
restart, but I kinda agree with Alvaro regarding incremental development.

In a way, the question is how far we can reasonably push the patch author(s)
to implement stuff we consider desirable, but which he/she/they decided is not
worth the time investment at this point.

To me, it seems like an immensely useful feature even with the restart, and I
don't think the restart is a major burden for most systems (it can be, if your
system has no maintenance windows, of course).

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hello

I think one restart is acceptable for such a feature. I doubt users will want
to disable and enable checksums often; in many cases checksums will be enabled
once for the whole life of the cluster. We already need more downtime for
minor updates (4/year) and for changes to PGC_POSTMASTER settings
(max_connections or autovacuum workers, for example).

regards, Sergei
Hi,

On 2018-08-01 10:40:24 +0200, Michael Banck wrote:
> If this was one week before feature freeze, I would agree with you that it
> makes sense to ship it with the restart requirement rather than not shipping
> it at all. But we're several commitfests away from v12, so making an effort
> to have this work without a downtime looks like a reasonable requirement to
> me.

My problem isn't just that I shouldn't think this should be committed without
at least a firm commitment to do better; my problem is that I think the
"restart" approach is just using the entirely wrong hammer to solve the
problem at hand. At the very least it's very problematic in respect to
replicas, which need to know about the setting too, and can have problems
similar to the ones the restart on the primary is supposed to prevent.

Greetings,

Andres Freund
Hi,

On 2018-08-01 11:15:38 +0200, Tomas Vondra wrote:
> On 08/01/2018 10:40 AM, Michael Banck wrote:
> > If this was one week before feature freeze, I would agree with you
> > that it makes sense to ship it with the restart requirement rather
> > than not shipping it at all. But we're several commitfests away from
> > v12, so making an effort to have this work without a downtime
> > looks like a reasonable requirement to me.
>
> Why would all those pieces have to be committed at once? Why not
> commit what we have now (with the restart) and then remove the
> restriction in a later commit?

Sure, if all the pieces existed in various degrees of solidness (with the
earlier pieces committable, but later ones needing work), I'd feel *much* less
concerned about it.

> In a way, the question is how far we can reasonably push the patch
> author(s) to implement stuff we consider desirable, but which
> he/she/they decided is not worth the time investment at this point.

We push people to only implement something really consistent all the time.

> To me, it seems like an immensely useful feature even with the restart,
> and I don't think the restart is a major burden for most systems (it can
> be, if your system has no maintenance windows, of course).

I do think it's a problem, but my problem is more that I don't think it's
really a solution for the problem at hand.

Greetings,

Andres Freund
Hello

On 2018-Aug-01, Andres Freund wrote:

> My problem isn't just that I shouldn't think this should be committed
> without at least a firm commitment to do better,

I take "I think this shouldn't be committed" to be what you meant.

I'm not sure I agree with this line of argument. The reality is that real life
or diverging priorities preclude people from working on $stuff. This is a
useful feature-1 we have here, and if we stall it until we have feature-2, we
may not get either until a year later. That's not a great outcome. We didn't
wait for partitioning, parallel query, DDL progress reporting, logical
replication, JIT, or wait events (to name but a few) to solve world hunger
before they started getting committed. We move forward step by step, and
that's a good thing.

Firm commitments are great things to have, and if the firmness leads to
feature-2 being part of the same release, great; but if it's not firm enough,
we can have feature-2 in the next release (or whenever). Even if there's no
such commitment, feature-1 is useful on its own.

> my problem is that I think the "restart" approach is just using the
> entirely wrong hammer to solve the problem at hand. At the very least
> it's very problematic in respect to replicas, which need to know about
> the setting too, and can have problems similar to the ones the restart
> on the primary is supposed to prevent.

If we define "restart" to mean taking all the servers down simultaneously,
that can be planned. For users that cannot do that, that's too bad; they'll
have to wait for the next release in order to enable checksums (assuming they
fund the necessary development). But there are many systems where it *is*
possible to take everything down for five seconds, then back up. They can
definitely take advantage of checksummed data.

Currently, the only way to enable checksums is to initdb and create a new copy
of the data from a logical backup, which could take hours or even days if the
data is large, or to use logical replication.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 08/01/2018 05:58 PM, Andres Freund wrote:
> Hi,
>
> On 2018-08-01 11:15:38 +0200, Tomas Vondra wrote:
>> On 08/01/2018 10:40 AM, Michael Banck wrote:
>>> If this was one week before feature freeze, I would agree with you
>>> that it makes sense to ship it with the restart requirement rather
>>> than not shipping it at all. But we're several commitfests away from
>>> v12, so making an effort to have this work without a downtime
>>> looks like a reasonable requirement to me.
>>
>> Why would all those pieces have to be committed at once? Why not
>> commit what we have now (with the restart) and then remove the
>> restriction in a later commit?
>
> Sure, if all the pieces existed in various degrees of solidness (with
> the earlier pieces committable, but later ones needing work), I'd feel
> *much* less concerned about it.

That's not what I meant, sorry for not being clearer. My point was that I see
the "without restart" part as desirable but optional, and would not mind
treating it as a future improvement.

>> In a way, the question is how far we can reasonably push the patch
>> author(s) to implement stuff we consider desirable, but which
>> he/she/they decided is not worth the time investment at this point.
>
> We push people to only implement something really consistent all the
> time.

Sure, but it's a somewhat subjective matter - to me this limitation does not
make this particular patch inconsistent. If we can remove it, great. If not,
it's still an immensely useful improvement.

>> To me, it seems like an immensely useful feature even with the restart,
>> and I don't think the restart is a major burden for most systems (it can
>> be, if your system has no maintenance windows, of course).
>
> I do think it's a problem, but my problem is more that I don't think
> it's really a solution for the problem at hand.

Sure, if there are issues with this approach, that would make it unacceptable.
I'm not sure why it would be an issue for replicas (which is what you mention
elsewhere), considering those don't write data and so can't fail to update a
checksum?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-08-01 12:20:12 -0400, Alvaro Herrera wrote:
> Hello
>
> On 2018-Aug-01, Andres Freund wrote:
>
> > My problem isn't just that I shouldn't think this should be committed
> > without at least a firm commitment to do better,
>
> I take "I think this shouldn't be committed" to be what you meant.

Yep.

> I'm not sure I agree with this line of argument. The reality is that
> real life or diverging priorities preclude people from working on
> $stuff.

Right.

> This is a useful feature-1 we have here, and if we stall it
> until we have feature-2, we may not get either until a year later.
> That's not a great outcome. We didn't wait for partitioning, parallel
> query, DDL progress reporting, logical replication, JIT, or wait events
> (to name but a few) to solve world hunger before they started getting
> committed. We move forward step by step, and that's a good thing.

But we asked that they implement something consistent, and rejected many that
were not deemed to be that.

> > my problem is that I think the "restart" approach is just using the
> > entirely wrong hammer to solve the problem at hand. At the very least
> > it's very problematic in respect to replicas, which need to know about
> > the setting too, and can have problems similar to the ones the restart
> > on the primary is supposed to prevent.
>
> If we define "restart" to mean taking all the servers down
> simultaneously, that can be planned.

It really can't, realistically. That'd essentially mean breaking PITR. You'd
have to schedule the restart of any replicas to happen after a specific
record. And what if there are replicas that are on a delay? What if there are
data centers that are currently offline?

And again, this isn't hard to do properly. I don't get why we're talking about
an at least operationally complex workaround when the proper solution isn't
hard.

Greetings,

Andres Freund
Hi,

On 2018-08-01 18:25:48 +0200, Tomas Vondra wrote:
> Sure, if there are issues with this approach, that would make it
> unacceptable. I'm not sure why it would be an issue for replicas (which is
> what you mention elsewhere), considering those don't write data and so
> can't fail to update a checksum?

Standbys compute checksums on writeout as well, no? We compute checksums not
at buffer modification, but at writeout time. And replay just marks buffers
dirty, it doesn't directly write to disk.

Architecturally there'd also be hint bits as a source, but I think we probably
neutered them enough for that not to be a problem during replay. And then
there are also promotions.

Greetings,

Andres Freund
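To make the writeout point concrete: the checksum is stamped onto the page
image only when a buffer is flushed, never when it is modified in shared
buffers, which is why a standby (whose replay dirties buffers that its
checkpointer or bgwriter later writes out) participates in checksumming too.
A standalone toy of that division of labor follows; the checksum function is a
stand-in rather than PostgreSQL's pg_checksum_page(), and the page layout is
heavily simplified:

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define BLCKSZ 8192

typedef struct ToyPage
{
    uint16_t checksum;              /* only valid on the on-disk image */
    char     payload[BLCKSZ - 2];
} ToyPage;

/* Stand-in checksum; PostgreSQL uses an FNV-derived pg_checksum_page(). */
static uint16_t toy_checksum(const ToyPage *p, uint32_t blkno)
{
    uint32_t h = 2166136261u ^ blkno;
    for (size_t i = 0; i < sizeof(p->payload); i++)
        h = (h ^ (uint8_t) p->payload[i]) * 16777619u;
    return (uint16_t) (h ^ (h >> 16));
}

/* Buffer modification: no checksum work at all, matching PostgreSQL. */
static void modify_buffer(ToyPage *buf)
{
    memcpy(buf->payload, "hello", 5);
}

/* Writeout: the checksum is computed here, just before the write. */
static void flush_buffer(const ToyPage *buf, uint32_t blkno, ToyPage *disk)
{
    *disk = *buf;
    disk->checksum = toy_checksum(disk, blkno);
}

int main(void)
{
    ToyPage shared = {0}, ondisk;

    modify_buffer(&shared);             /* replay: buffer dirtied, no checksum */
    flush_buffer(&shared, 42, &ondisk); /* checkpointer/bgwriter on the standby */
    printf("checksum stamped at writeout: 0x%04x\n", ondisk.checksum);
    return 0;
}

In current sources the real stamping happens in PageSetChecksumCopy(), called
from the buffer-flush path, which is what makes the standby a full participant
in the "inprogress" dance.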
On Tue, Jul 31, 2018 at 04:05:23PM -0700, Andres Freund wrote:
> Hi,
>
> On 2018-07-31 18:56:29 -0400, Alvaro Herrera wrote:
> > In the spirit of supporting incremental development, I think it's quite
> > sensible to get the current thing done, then see what it takes to get
> > the next thing done. Each is an improvement on its own merits. And it
> > doesn't have to be made by the same people.
>
> I just don't buy this. An earlier version of this feature was committed
> to v11 without the restart, over objections. There's now extra state in
> the control file to support the restart-based system, there's extra
> tests, extra docs. And it'd not be much code to just make it work
> without the restart. The process around this patchset is just plain
> weird.

This patchset is weird because it is perhaps our first case of trying to
change the state of the server while it is running. We just don't have an
established protocol for how to orchestrate that, so we are limping along
toward a solution. Forcing a restart is probably part of that primitive
orchestration.

We will probably have similar challenges if we ever allowed Postgres to change
its data format on the fly. These challenges are one reason pg_upgrade only
modifies the new cluster, never the old one.

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
On 2018-08-01 12:36:13 -0400, Bruce Momjian wrote:
> This patchset is weird because it is perhaps our first case of trying to
> change the state of the server while it is running. We just don't have
> an established protocol for how to orchestrate that, so we are limping
> along toward a solution.

There's a number of GUCs that do this? Even in related areas,
cf. full_page_writes.

Greetings,

Andres Freund
On Wed, Aug 1, 2018 at 09:39:43AM -0700, Andres Freund wrote:
> On 2018-08-01 12:36:13 -0400, Bruce Momjian wrote:
> > This patchset is weird because it is perhaps our first case of trying to
> > change the state of the server while it is running. We just don't have
> > an established protocol for how to orchestrate that, so we are limping
> > along toward a solution.
>
> There's a number of GUCs that do this? Even in related areas,
> cf. full_page_writes.

How is that taking the server from one state to the next in a
non-instantaneous way?

-- 
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

+ As you are, so once was I.  As I am, so you will be. +
+                      Ancient Roman grave inscription +
Hi,

On 2018-08-01 12:45:27 -0400, Bruce Momjian wrote:
> On Wed, Aug 1, 2018 at 09:39:43AM -0700, Andres Freund wrote:
> > On 2018-08-01 12:36:13 -0400, Bruce Momjian wrote:
> > > This patchset is weird because it is perhaps our first case of trying
> > > to change the state of the server while it is running. We just don't
> > > have an established protocol for how to orchestrate that, so we are
> > > limping along toward a solution.
> >
> > There's a number of GUCs that do this? Even in related areas,
> > cf. full_page_writes.
>
> How is that taking the server from one state to the next in a
> non-instantaneous way?

Because it requires coordination around checkpoints, it can only really fully
take effect after the start of a checkpoint. And it even has both primary and
replica differences...

Greetings,

Andres Freund
On 08/01/2018 09:20 AM, Alvaro Herrera wrote:
>
>> my problem is that I think the "restart" approach is just using the
>> entirely wrong hammer to solve the problem at hand. At the very least
>> it's very problematic in respect to replicas, which need to know about
>> the setting too, and can have problems similar to the ones the restart
>> on the primary is supposed to prevent.
> If we define "restart" to mean taking all the servers down
> simultaneously, that can be planned.

People in mission critical environments do not "restart all servers". They
fail over to a secondary to do maintenance on a primary. When you have a
system where you literally lose thousands of dollars every minute the database
is down, you can't do what you are proposing. When you have a system where, if
the database is down for longer than X minutes, you actually lose a whole day
because all of the fabricators have to revalidate before they begin work, you
can't do that. Granted, that is not the majority (which you mention) but let's
not forget them.

The one place where a restart does happen, and will continue to happen for
around 5 (3 if you incorporate pg_logical and 9.6) more years, is upgrades.
Although we have logical replication for upgrades now, we are 5 years away
from the majority of users being on a version of PostgreSQL that supports
logical replication for upgrades. So, I can see an argument for an incremental
approach because people could enable checksums as part of their upgrade
restart.

> For users that cannot do that,
> that's too bad, they'll have to wait to the next release in order to
> enable checksums (assuming they fund the necessary development).

I have to say, as a proponent of funded development for longer than most, I
like to see this refreshing take on the fact that this all does take money.

> But there are many systems where it *is* possible to take everything down
> for five seconds, then back up. They can definitely take advantage of
> checksummed data.

This is a good point.

> Currently, the only way to enable checksums is to initdb and create a
> new copy of the data from a logical backup, which could take hours or
> even days if data is large, or use logical replication.

Originally, I was going to -1 how this is being implemented. I too wish we had
"ALTER DATABASE ENABLE CHECKSUM" or equivalent without a restart. However,
being able to just restart is a huge step forward from what we have now.

Lastly, I think Alvaro has a point with the incremental development, and I
also think some others on this thread need to "show me the patch" instead of
being armchair directors of development.

JD

-- 
Command Prompt, Inc. || http://the.postgres.company/ || @cmdpromptinc
*** A fault and talent of mine is to tell it exactly how it is. ***
PostgreSQL centered full stack support, consulting and development.
Advocate: @amplifypostgres || Learn: https://postgresconf.org
***** Unless otherwise stated, opinions are my own. *****
Hi,

On 2018-08-01 10:34:55 -0700, Joshua D. Drake wrote:
> Lastly, I think Alvaro has a point with the incremental development, and I
> also think some others on this thread need to "show me the patch" instead
> of being armchair directors of development.

Oh, FFS. I pointed out the issue that led to the restart being introduced
(reminder: in the committed-but-then-reverted version it didn't exist). I
explained how the much less intrusive version would work. I think it's absurd
to describe that as "armchair directors of development".

Greetings,

Andres Freund
Hi

> They fail over to a secondary to do maintenance on a primary.

But this is not a problem even in the current patch state. We can restart the
replica before failover and it works; I tested this behavior during my review.
We can:
- call pg_enable_data_checksums() on the master
- wait for data_checksums to change to "inprogress" on the replica
- restart the replica (we can restart the replica before promote, right?)
- promote this replica
- the checksum helper is then launched and working on the promoted cluster

regards, Sergei
Hi,

On 2018-08-01 21:03:22 +0300, Sergei Kornilov wrote:
> > They fail over to a secondary to do maintenance on a primary.
> But this is not a problem even in the current patch state. We can restart
> the replica before failover and it works; I tested this behavior during my
> review.
> We can:
> - call pg_enable_data_checksums() on the master
> - wait for data_checksums to change to "inprogress" on the replica

That's *precisely* the problem. What if your replicas are delayed
(e.g. recovery_min_apply_delay)? How would you schedule that restart properly?
What if you later need to do PITR?

> - restart the replica (we can restart the replica before promote, right?)
> - promote this replica
> - the checksum helper is then launched and working on the promoted cluster

This doesn't test the consequences of the restart being skipped, nor does it
review the correctness on a code level.

Greetings,

Andres Freund
Hi

> This doesn't test the consequences of the restart being skipped, nor
> does it review the correctness on a code level.

I checked more than one thing during the review; I looked at the code too. The
checksumhelper.c bgworker is registered with:

> bgw.bgw_start_time = BgWorkerStart_RecoveryFinished;

and then processes the whole cluster (even if we ran the checksumhelper
before, but exited before it completed). Or does
BgWorkerStart_RecoveryFinished not guarantee a start only after recovery has
finished?

Before starting any real work (and after recovery ends), the checksumhelper
checks the current cluster status again:

> + * If a standby was restarted when in pending state, a background worker
> + * was registered to start. If it's later promoted after the master has
> + * completed enabling checksums, we need to terminate immediately and not
> + * do anything. If the cluster is still in pending state when promoted,
> + * the background worker should start to complete the job.

> What if your replicas are delayed (e.g. recovery_min_apply_delay)?
> What if you later need to do PITR?

If we start after replaying pg_enable_data_checksums but before it completed,
we plan to start the bgworker when recovery finishes. If we replay the
checksumhelper finishing, we _can_ start the checksumhelper again, and this is
handled during checksumhelper start. The behavior seems correct to me. Am I
missing something very wrong?

regards, Sergei
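For readers skimming, the startup re-check described in that quoted comment
boils down to something like the following toy (all names here are
hypothetical illustrations, not the patch's actual identifiers):

#include <stdio.h>
#include <stdlib.h>

/* Possible cluster checksum states, mirroring the patch's off/inprogress/on. */
typedef enum
{
    CHECKSUMS_OFF,
    CHECKSUMS_INPROGRESS,
    CHECKSUMS_ON
} ChecksumState;

/*
 * Toy model of the startup re-check: a worker registered while the standby
 * was "inprogress" may only get to run after promotion, by which point the
 * primary may already have finished (or checksums may have been disabled).
 */
static void checksumhelper_start(ChecksumState state_at_start)
{
    if (state_at_start != CHECKSUMS_INPROGRESS)
    {
        printf("nothing to do, exiting\n");
        exit(0);
    }
    /* Still pending: (re)scan every relation from scratch. */
    printf("starting full scan\n");
}

int main(void)
{
    checksumhelper_start(CHECKSUMS_INPROGRESS);
    return 0;
}

The "(re)scan from scratch" branch is what makes the worker restartable at
the cost of redoing work, which is the trade-off the thread is debating.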
Hi,

While looking at the online checksum verification patch (which I guess will
get committed before this one), it occurred to me that disabling checksums may
need to be more elaborate, to protect against someone using the stale flag
value (instead of simply switching to "off" assuming that's fine).

The signals etc. seem good enough for our internal stuff, but what if someone
uses the flag in a different way? E.g. the online checksum verification runs
as an independent process (i.e. not a backend) and reads the control file to
find out if the checksums are enabled or not. So if we just switch from "on"
to "off" that will break.

Of course, we may also say "Don't disable checksums while online verification
is running!" but that's not ideal.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> While looking at the online checksum verification patch (which I guess
> will get committed before this one), it occurred to me that disabling
> checksums may need to be more elaborate, to protect against someone
> using the stale flag value (instead of simply switching to "off"
> assuming that's fine).
>
> The signals etc. seem good enough for our internal stuff, but what if
> someone uses the flag in a different way? E.g. the online checksum
> verification runs as an independent process (i.e. not a backend) and
> reads the control file to find out if the checksums are enabled or not.
> So if we just switch from "on" to "off" that will break.
>
> Of course, we may also say "Don't disable checksums while online
> verification is running!" but that's not ideal.

I'm not really sure what else we could say here..? I don't particularly see an
issue with telling people that if they disable checksums while they're running
a tool that's checking the checksums that they're going to get odd results.

Thanks!

Stephen
On 09/29/2018 02:19 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> While looking at the online checksum verification patch (which I guess
>> will get committed before this one), it occurred to me that disabling
>> checksums may need to be more elaborate, to protect against someone
>> using the stale flag value (instead of simply switching to "off"
>> assuming that's fine).
>>
>> The signals etc. seem good enough for our internal stuff, but what if
>> someone uses the flag in a different way? E.g. the online checksum
>> verification runs as an independent process (i.e. not a backend) and
>> reads the control file to find out if the checksums are enabled or not.
>> So if we just switch from "on" to "off" that will break.
>>
>> Of course, we may also say "Don't disable checksums while online
>> verification is running!" but that's not ideal.
>
> I'm not really sure what else we could say here..? I don't particularly
> see an issue with telling people that if they disable checksums while
> they're running a tool that's checking the checksums that they're going
> to get odd results.

I don't know, to be honest. I was merely looking at the online verification
patch and realized that if someone disables checksums it won't notice it
(because it only reads the flag once, at the very beginning) and will likely
produce bogus errors.

Although, maybe it won't - it now uses a checkpoint LSN, so that might fix it.
The checkpoint LSN is read from the same controlfile as the flag, so we know
the checksums were enabled during that checkpoint. So if we ignore failures
with a newer LSN, that should do the trick, no?

So perhaps that's the right "protocol" to handle this?

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
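That LSN rule is easy to state in code. A standalone toy of the verification
protocol follows (simplified page layout and a stand-in checksum; a real
verifier would compare pd_checksum against pg_checksum_page() from
storage/checksum.h):

#include <stdint.h>
#include <stdio.h>

/* Toy page: just an LSN and a stored checksum (cf. PageHeaderData). */
typedef struct ToyPage
{
    uint64_t lsn;        /* last WAL position that touched this page */
    uint16_t checksum;   /* cf. pd_checksum */
} ToyPage;

/* Stand-in for pg_checksum_page(); the algorithm doesn't matter here. */
static uint16_t toy_checksum(const ToyPage *p, uint32_t blkno)
{
    return (uint16_t) (p->lsn * 2654435761u + blkno);
}

/*
 * Verify one page, ignoring pages newer than the checkpoint from whose
 * control file we learned that checksums were enabled: such pages may
 * straddle a checksum state change, so a mismatch there proves nothing.
 */
static void verify_page(const ToyPage *p, uint32_t blkno, uint64_t ckpt_lsn)
{
    if (p->lsn >= ckpt_lsn)
    {
        printf("block %u: skipped (LSN too new to judge)\n", blkno);
        return;
    }
    if (toy_checksum(p, blkno) != p->checksum)
        printf("block %u: checksum mismatch\n", blkno);
    else
        printf("block %u: ok\n", blkno);
}

int main(void)
{
    uint64_t ckpt_lsn = 1000;
    ToyPage  old_ok  = { .lsn = 900 };
    ToyPage  old_bad = { .lsn = 900,  .checksum = 0xdead };
    ToyPage  too_new = { .lsn = 1500, .checksum = 0xbeef };

    old_ok.checksum = toy_checksum(&old_ok, 0);
    verify_page(&old_ok, 0, ckpt_lsn);
    verify_page(&old_bad, 1, ckpt_lsn);
    verify_page(&too_new, 2, ckpt_lsn);
    return 0;
}

The "skip if newer" branch is the entire protocol being proposed: a page whose
LSN postdates the checkpoint may legitimately have been written before or
after a checksum state change, so only older pages are safe to report on.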
Greetings,

* Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> On 09/29/2018 02:19 PM, Stephen Frost wrote:
> > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> >> While looking at the online checksum verification patch (which I guess
> >> will get committed before this one), it occurred to me that disabling
> >> checksums may need to be more elaborate, to protect against someone
> >> using the stale flag value (instead of simply switching to "off"
> >> assuming that's fine).
> >>
> >> The signals etc. seem good enough for our internal stuff, but what if
> >> someone uses the flag in a different way? E.g. the online checksum
> >> verification runs as an independent process (i.e. not a backend) and
> >> reads the control file to find out if the checksums are enabled or not.
> >> So if we just switch from "on" to "off" that will break.
> >>
> >> Of course, we may also say "Don't disable checksums while online
> >> verification is running!" but that's not ideal.
> >
> > I'm not really sure what else we could say here..? I don't particularly
> > see an issue with telling people that if they disable checksums while
> > they're running a tool that's checking the checksums that they're going
> > to get odd results.
>
> I don't know, to be honest. I was merely looking at the online
> verification patch and realized that if someone disables checksums it
> won't notice it (because it only reads the flag once, at the very
> beginning) and will likely produce bogus errors.
>
> Although, maybe it won't - it now uses a checkpoint LSN, so that might
> fix it. The checkpoint LSN is read from the same controlfile as the
> flag, so we know the checksums were enabled during that checkpoint. So
> if we ignore failures with a newer LSN, that should do the trick, no?
>
> So perhaps that's the right "protocol" to handle this?

I certainly don't think we need to do anything more.

Thanks!

Stephen
On 09/29/2018 06:51 PM, Stephen Frost wrote:
> Greetings,
>
> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>> On 09/29/2018 02:19 PM, Stephen Frost wrote:
>>> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
>>>> While looking at the online checksum verification patch (which I guess
>>>> will get committed before this one), it occurred to me that disabling
>>>> checksums may need to be more elaborate, to protect against someone
>>>> using the stale flag value (instead of simply switching to "off"
>>>> assuming that's fine).
>>>>
>>>> The signals etc. seem good enough for our internal stuff, but what if
>>>> someone uses the flag in a different way? E.g. the online checksum
>>>> verification runs as an independent process (i.e. not a backend) and
>>>> reads the control file to find out if the checksums are enabled or not.
>>>> So if we just switch from "on" to "off" that will break.
>>>>
>>>> Of course, we may also say "Don't disable checksums while online
>>>> verification is running!" but that's not ideal.
>>>
>>> I'm not really sure what else we could say here..? I don't particularly
>>> see an issue with telling people that if they disable checksums while
>>> they're running a tool that's checking the checksums that they're going
>>> to get odd results.
>>
>> I don't know, to be honest. I was merely looking at the online
>> verification patch and realized that if someone disables checksums it
>> won't notice it (because it only reads the flag once, at the very
>> beginning) and will likely produce bogus errors.
>>
>> Although, maybe it won't - it now uses a checkpoint LSN, so that might
>> fix it. The checkpoint LSN is read from the same controlfile as the
>> flag, so we know the checksums were enabled during that checkpoint. So
>> if we ignore failures with a newer LSN, that should do the trick, no?
>>
>> So perhaps that's the right "protocol" to handle this?
>
> I certainly don't think we need to do anything more.

Not sure I agree. I'm not suggesting we absolutely have to write a huge amount
of code to deal with this issue, but I hope we agree we need to at least
understand the issue so that we can put warnings into docs.

FWIW pg_basebackup (in the default "verify checksums") has this issue too
AFAICS, and it seems rather unfriendly to just start reporting checksum errors
during backup in that case.

But as I mentioned, maybe there's no problem at all and using the checkpoint
LSN deals with it automatically.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 2018-09-30 10:48:36 +0200, Tomas Vondra wrote:
>
>
> On 09/29/2018 06:51 PM, Stephen Frost wrote:
> > Greetings,
> >
> > * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> >> On 09/29/2018 02:19 PM, Stephen Frost wrote:
> >>> * Tomas Vondra (tomas.vondra@2ndquadrant.com) wrote:
> >>>> While looking at the online checksum verification patch (which I guess
> >>>> will get committed before this one), it occurred to me that disabling
> >>>> checksums may need to be more elaborate, to protect against someone
> >>>> using the stale flag value (instead of simply switching to "off"
> >>>> assuming that's fine).
> >>>>
> >>>> The signals etc. seem good enough for our internal stuff, but what if
> >>>> someone uses the flag in a different way? E.g. the online checksum
> >>>> verification runs as an independent process (i.e. not a backend) and
> >>>> reads the control file to find out if the checksums are enabled or not.
> >>>> So if we just switch from "on" to "off" that will break.
> >>>>
> >>>> Of course, we may also say "Don't disable checksums while online
> >>>> verification is running!" but that's not ideal.
> >>>
> >>> I'm not really sure what else we could say here..? I don't particularly
> >>> see an issue with telling people that if they disable checksums while
> >>> they're running a tool that's checking the checksums that they're going
> >>> to get odd results.
> >>
> >> I don't know, to be honest. I was merely looking at the online
> >> verification patch and realized that if someone disables checksums it
> >> won't notice it (because it only reads the flag once, at the very
> >> beginning) and will likely produce bogus errors.
> >>
> >> Although, maybe it won't - it now uses a checkpoint LSN, so that might
> >> fix it. The checkpoint LSN is read from the same controlfile as the
> >> flag, so we know the checksums were enabled during that checkpoint. So
> >> if we ignore failures with a newer LSN, that should do the trick, no?
> >>
> >> So perhaps that's the right "protocol" to handle this?
> >
> > I certainly don't think we need to do anything more.
> >
>
> Not sure I agree. I'm not suggesting we absolutely have to write a huge
> amount of code to deal with this issue, but I hope we agree we need to
> at least understand the issue so that we can put warnings into docs.
>
> FWIW pg_basebackup (in the default "verify checksums") has this issue
> too AFAICS, and it seems rather unfriendly to just start reporting
> checksum errors during backup in that case.
>
> But as I mentioned, maybe there's no problem at all and using the
> checkpoint LSN deals with it automatically.
Given that this patch has not been developed in a few months, I don't
see why this has an active 2019-01 CF entry? I think we should mark this
as Returned With Feedback.
https://commitfest.postgresql.org/21/1535/

Greetings,

Andres Freund