Thread: Re: Does PostgreSQL check database integrity at startup?
Stephen Frost wrote:
> * Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:
> > Anyway, instead digging into rsync functionality (or bugs - I doubt, but who knows?), I do prefer to have a script I can run to check if there is obvious failures in standby servers.
>
> As mentioned, zero-byte files can be perfectly valid. PostgreSQL does have page-level CRCs, if you initialized your database with them (which I would strongly recommend).

Page-level checksums would not detect the problem being complained about in this thread, however.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro,

* Alvaro Herrera (alvherre@2ndquadrant.com) wrote:
> Stephen Frost wrote:
> > * Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:
> > > Anyway, instead digging into rsync functionality (or bugs - I doubt, but who knows?), I do prefer to have a script I can run to check if there is obvious failures in standby servers.
> >
> > As mentioned, zero-byte files can be perfectly valid. PostgreSQL does have page-level CRCs, if you initialized your database with them (which I would strongly recommend).
>
> Page-level checksums would not detect the problem being complained in this thread, however.

It's entirely unclear to me what the problem being complained about in this thread actually is. The complaint so far was about zero-byte files, but those are entirely valid, so that isn't a problem that anyone can solve..

Given the thread subject, if someone actually wanted to do a database integrity check before startup, they could use pgBackRest to perform a backup with a CRC-enabled database and at least verify that all of the checksums are valid.

We could possibly look into adding some set of additional checks for files which can't actually be zero-byte, perhaps.. I know we have some other one-off checks already.

Thanks!

Stephen
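(For readers following along: whether a cluster was initialized with the page-level checksums mentioned above can be confirmed from SQL. This is only a quick aside, not something from the thread; the setting is read-only and simply reflects whether initdb was run with --data-checksums.)

SHOW data_checksums;
-- or, equivalently:
SELECT current_setting('data_checksums');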
On 27/12/2017 15:02, Stephen Frost wrote:
> Alvaro,
>
> * Alvaro Herrera (alvherre@2ndquadrant.com) wrote:
>> Stephen Frost wrote:
>>
>>> * Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:
>>>> Anyway, instead digging into rsync functionality (or bugs - I doubt, but who knows?), I do prefer to have a script I can run to check if there is obvious failures in standby servers.
>>> As mentioned, zero-byte files can be perfectly valid. PostgreSQL does have page-level CRCs, if you initialized your database with them (which I would strongly recommend).
>> Page-level checksums would not detect the problem being complained in this thread, however.
> It's entirely unclear to me what the problem being complained about in this thread actually is. The complaint so far was about zero-byte files, but those are entirely valid, so that isn't a problem that anyone can solve..
>
> Given the thread subject, if someone actually wanted to do a database integrity check before startup, they could use pgBackRest to perform a backup with a CRC-enabled database and at least verify that all of the checksums are valid.
>
> We could possibly look into adding some set of additional checks for files which can't actually be zero-byte, perhaps.. I know we have some other one-off checks already.
>
> Thanks!
>
> Stephen

Actually, the problem is:

Master => Slave => Backup

In the master server everything is fine.

But at some point in time, the slave became corrupt (one of the base files is zero size where it should be 16Mb in size), and IMHO a "red alert" should arise - the slave server should not even start up at all.

Since backups are taken from the slave server, all backups are also corrupt. I've detected the problem just because I've restored a backup (an excellent practice - nobody should take backups without testing them with the restore procedure).

In the slave server there is no indication that the database is corrupt (nothing in the logs; it starts normally and shows it is applying streamed changes regularly).

So that is the point: how to detect that a database is corrupt so the cluster doesn't even start...

Regards,

Edson
Stephen Frost wrote:
> It's entirely unclear to me what the problem being complained about in this thread actually is.

As Edson explained, a relfilenode in the standby server is zero bytes long when it is not that size in the primary server, and it corresponds to a persistent table. I don't have any satisfactory explanation for that.

> Given the thread subject, if someone actually wanted to do a database integrity check before startup, they could use pgBackRest to perform a backup with a CRC-enabled database and at least verify that all of the checksums are valid.

That's not a complete solution, because a zero-byte file does not contain any CRC. CRCs may detect some problems, but they will not detect this particular kind of corruption.

> We could possibly look into adding some set of additional checks for files which can't actually be zero-byte, perhaps.. I know we have some other one-off checks already.

Some possibilities, from the realm of haven't-had-coffee-yet-after-lunch:

* the length of the FSM fork can let you infer something about the length that the main fork ought to have. Maybe the VM fork too? not sure. (Easy to check: just some math on the size of the FSM/VM forks)

* the largest block number in any item pointer in any index of a table can tell you what's the latest page that should appear in the table. (Expensive: need to scan the indexes completely)

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
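(A rough sketch of the first idea above, runnable on the standby itself: flag permanent tables whose main fork is empty on disk even though a free space map or visibility map fork exists for them. This is only a heuristic illustration, not code from the thread; pg_relation_size() per fork is the only building block assumed.)

SELECT c.oid::regclass AS relation,
       pg_relation_size(c.oid, 'main') AS main_bytes,
       pg_relation_size(c.oid, 'fsm')  AS fsm_bytes,
       pg_relation_size(c.oid, 'vm')   AS vm_bytes
FROM pg_class c
WHERE c.relkind = 'r'                      -- ordinary tables
  AND c.relpersistence = 'p'               -- permanent (logged) relations only
  AND pg_relation_size(c.oid, 'main') = 0  -- empty main fork ...
  AND (pg_relation_size(c.oid, 'fsm') > 0  -- ... but a non-empty FSM or VM fork
       OR pg_relation_size(c.oid, 'vm') > 0);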
Alvaro,

* Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> Stephen Frost wrote:
> > It's entirely unclear to me what the problem being complained about in this thread actually is.
>
> As Edson explained, a relfilenode in the standby server is zero bytes long when it is not that size in the primary server, and it corresponds to a persistent table. I don't have any satisfactory explanation for that.

Somehow I missed the discussion of it being zero bytes for a long time after it's non-zero on the primary (I take that to mean that it's longer than whatever checkpoint timeout is set to..?). That does sound like there might be an actual issue/bug here.

> > Given the thread subject, if someone actually wanted to do a database integrity check before startup, they could use pgBackRest to perform a backup with a CRC-enabled database and at least verify that all of the checksums are valid.
>
> That's not a complete solution, because a zero-byte file does not contain any CRC. CRCs may detect some problems, but they will not detect this particular kind of corruption.

I agree that we can certainly have bugs which will cause corruption that's not detectable by our CRCs. Validating CRCs is just checking for one kind of corruption; using amcheck would provide another level, though it's unclear if that would help in this specific case if it's really just files on the standby (but not the primary) being zero-length long after they should be written out.

> > We could possibly look into adding some set of additional checks for files which can't actually be zero-byte, perhaps.. I know we have some other one-off checks already.
>
> Some possibilities, from the realm of haven't-had-coffee-yet-after-lunch:
>
> * the length of the FSM fork can let you infer something about the length that the main fork ought to have. Maybe the VM fork too? not sure. (Easy to check: just some math on the size of the FSM/VM forks)

I was just chatting with David over lunch specifically about checking that the number of forks was valid and looking at their lengths also makes sense, but the VM is only created/updated based on VACUUM runs, so checking that is a bit tricky. David has ideas about how to handle various races when it comes to forks (in particular by building a manifest once and then seeing if anything in the relation has changed since the manifest was built- if so, assume that WAL has the necessary info and if not then raise a flag) but it's all a bit tricky when the system is running. We might be able to do more checks if we're serious about supporting pre-startup checks though. We were also thinking about having a set of catalog tables that must not be zero and checking for those.

> * the largest block number in any item pointer in any index of a table can tell you what's the latest page that should appear in the table. (Expensive: need to scan the indexes completely)

This is something that I'd expect amcheck to handle; I'm not sure it makes sense to invent something else. David and I have chatted a little bit a while ago about trying to incorporate amcheck but it's rather complicated and not something we plan to do in the near-term. Still, it does seem like it'd be a nice capability to have. I wish it was possible to do off-line though. Of course, a user could do a restore and then run amcheck on the result themselves.

Thanks!

Stephen
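(For readers unfamiliar with amcheck: a minimal sketch of running its btree verification over every permanent btree index in the current database. This assumes PostgreSQL 10 or later with the contrib extension available; bt_index_check() takes only AccessShareLock, so it should also be runnable on a hot standby if the extension was created on the primary. It is offered only as an illustration of the kind of check being discussed.)

-- Must be created on the primary if the check is to run on a standby.
CREATE EXTENSION IF NOT EXISTS amcheck;

-- Verify basic btree invariants for every permanent btree index.
SELECT c.oid::regclass AS index_name, bt_index_check(c.oid::regclass)
FROM pg_class c
JOIN pg_am am ON am.oid = c.relam
WHERE c.relkind = 'i'
  AND am.amname = 'btree'
  AND c.relpersistence = 'p';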
Edson,

* Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:
> On 27/12/2017 15:02, Stephen Frost wrote:
> >* Alvaro Herrera (alvherre@2ndquadrant.com) wrote:
> >>Stephen Frost wrote:
> >>>* Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:
> >>>>Anyway, instead digging into rsync functionality (or bugs - I doubt, but who knows?), I do prefer to have a script I can run to check if there is obvious failures in standby servers.
> >>>As mentioned, zero-byte files can be perfectly valid. PostgreSQL does have page-level CRCs, if you initialized your database with them (which I would strongly recommend).
> >>Page-level checksums would not detect the problem being complained in this thread, however.
> >It's entirely unclear to me what the problem being complained about in this thread actually is. The complaint so far was about zero-byte files, but those are entirely valid, so that isn't a problem that anyone can solve..
> >
> >Given the thread subject, if someone actually wanted to do a database integrity check before startup, they could use pgBackRest to perform a backup with a CRC-enabled database and at least verify that all of the checksums are valid.
> >
> >We could possibly look into adding some set of additional checks for files which can't actually be zero-byte, perhaps.. I know we have some other one-off checks already.
>
> Actually, the problem is:
>
> Master => Slave => Backup
>
> In the master server everything is fine.

Alright..

> But at some point in time, slave became corrupt (one of the base files are zero size where it should be 16Mb in size), and IMHO a "red alert" should arise - Slave server shall not even startup at all.

How do you know it should be 16Mb in size...? That sounds like you're describing a WAL file, but you should be archiving your WAL files during a backup, not just using whatever is in pg_xlog/pg_wal..

> Since backups are taken from slave server, all backups are also corrupt.

If you aren't following the appropriate process to perform a backup then, yes, you're going to end up with corrupt and useless/bad backups. Backing up from a replica has only been officially supported using the pg_start/stop_backup methods as of 9.6 and only when doing a non-exclusive backup. Note that the wiki page you're talking about (I think, anyway...) is describing *exclusive* backup, not non-exclusive, and the two are not the same.

> I've detected the problem just because I've restored a backup (excellent practice perhaps - nobody should take backups if not testing it with the restore procedure).

Yes, restoring a backup is excellent practice and something that everyone really should be doing. In my view, at least, everyone should also be using well tested backup tools instead of trying to write their own.

> In slave server there is no indication that the database is corrupt (not in logs, it starts normally and show it is applying stream changes regularly).

This sounds like what's happening is that you're ending up with PG thinking that a crash happened because backup_label is missing, which will happen if you do pg_start/stop_backup on the replica and don't make sure to take the results from pg_stop_backup and create the backup_label file before starting PostgreSQL after the restore.

This isn't something that would happen if you used a backup tool that knew how to perform a non-exclusive backup or how to properly back up using a replica, instead of trying to write your own without understanding how all these pieces play together.

> So that is the point: how to detect that a database is corrupt so cluster doesn't even start...

I'm not sure that the database is actually corrupt in this specific case- it sounds like everything is actually fine, but you didn't include the backup_label file when restoring and therefore PG thinks there is crash recovery happening when it should be replaying WAL from the start of the backup, but how is PG going to know that? Well, it'd know that from the backup_label file, if it was there.. Otherwise, it seems pretty difficult for us to know that we're not doing crash recovery. I'm certainly open to ideas on how to detect that, but nothing springs to mind off-hand.

Again, using a backup tool instead of trying to roll your own would make this much less likely to happen. This isn't something which should happen when you're using pg_basebackup or the other PG-specific backup tools; rather, it's only happening because you're trying to do your own with pg_start/stop_backup and rsync and didn't completely read the documentation on non-exclusive backups from replicas (that said, those docs could certainly use improvement...).

Of course, perhaps I'm misunderstanding exactly what you're doing or what file you're referring to, but this is my best guess based on the information you've provided so far.

Thanks!

Stephen
For context: this was first reported in the Barman forum here:
https://groups.google.com/forum/#!msg/pgbarman/3aXWpaKWRFI/weUIZxspDAAJ
They are using Barman for the backups.

Stephen Frost wrote:
> > But at some point in time, slave became corrupt (one of the base files are zero size where it should be 16Mb in size), and IMHO a "red alert" should arise - Slave server shall not even startup at all.
>
> How do you know it should be 16Mb in size...? That sounds like you're describing a WAL file, but you should be archiving your WAL files during a backup, not just using whatever is in pg_xlog/pg_wal..

It's not a WAL file -- it's a file backing a table.

> > Since backups are taken from slave server, all backups are also corrupt.
>
> If you aren't following the appropriate process to perform a backup then, yes, you're going to end up with corrupt and useless/bad backups.

A few guys went over the backup-taking protocol upthread already.

But anyway the backup tool is a moot point. The problem doesn't originate in the backup -- it originates in the standby, from where the backup is taken. The file can be seen as size 0 in the standby. Edson's question is: why wasn't the problem detected in the standby? It seems a valid question to me, to which we currently don't have any good answer.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro,

* Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> For context: this was first reported in the Barman forum here:
> https://groups.google.com/forum/#!msg/pgbarman/3aXWpaKWRFI/weUIZxspDAAJ
> They are using Barman for the backups.

Ahhhh, I see. I wasn't aware of that history.

> Stephen Frost wrote:
> > > But at some point in time, slave became corrupt (one of the base files are zero size where it should be 16Mb in size), and IMHO a "red alert" should arise - Slave server shall not even startup at all.
> >
> > How do you know it should be 16Mb in size...? That sounds like you're describing a WAL file, but you should be archiving your WAL files during a backup, not just using whatever is in pg_xlog/pg_wal..
>
> It's not a WAL file -- it's a file backing a table.

Interesting.

> > > Since backups are taken from slave server, all backups are also corrupt.
> >
> > If you aren't following the appropriate process to perform a backup then, yes, you're going to end up with corrupt and useless/bad backups.
>
> A few guys went over the backup-taking protocol upthread already.
>
> But anyway the backup tool is a moot point. The problem doesn't originate in the backup -- it originates in the standby, from where the backup is taken. The file can be seen as size 0 in the standby. Edson's question is: why wasn't the problem detected in the standby? It seems a valid question to me, to which we currently we don't have any good answer.

The last message on that thread seems pretty clear to me- the comment is "I think this is a failure in standby build." It's not clear what that failure was but I agree it doesn't appear related to the backup tool (the comment there is "I'm using rsync"), or, really, PostgreSQL at all (a failure during the build of the replica isn't something we're necessarily going to pick up on..).

As discussed on this thread, zero-byte files are entirely valid to appear in the PostgreSQL data directory.

To try and dig into what happened, I'd probably look at what forks there are of that relation, the entry in pg_class, and try to figure out how it is that replication isn't complaining when the file on the primary appeared to be modified well after the last modify timestamp on the replica. If it's possible to replicate this into a test environment, maybe even do a no-op update of a row of that table and see what happens with replication. One thing I wonder is if this table used to be unlogged and it was later turned into a logged table but something didn't quite happen correctly with that. I'd also suggest looking for other file size mismatches between the primary and the replica.

Thanks!

Stephen
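(A sketch of how the first part of that investigation could look from SQL, run on both the primary and the standby. 'some_table' is a placeholder for the affected relation; pg_stat_file() is superuser-only on these releases, and for relations over 1GB it only reports the first segment.)

-- The catalog entry for the relation (relfilenode names the file on disk).
SELECT relname, relfilenode, relpages, reltuples, relpersistence
FROM pg_class
WHERE oid = 'some_table'::regclass;

-- Path, on-disk size and last modification time of the main fork,
-- as the OS sees it (the path is relative to the data directory).
SELECT pg_relation_filepath('some_table') AS main_fork,
       (pg_stat_file(pg_relation_filepath('some_table'))).size AS size_bytes,
       (pg_stat_file(pg_relation_filepath('some_table'))).modification AS last_modified;

-- Per-fork sizes, which also shows which forks exist and are non-empty.
SELECT fork, pg_relation_size('some_table', fork) AS bytes
FROM unnest(ARRAY['main', 'fsm', 'vm', 'init']) AS fork;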
On 28/12/2017 10:16, Stephen Frost wrote:
> Alvaro,
>
> * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
>> For context: this was first reported in the Barman forum here: https://groups.google.com/forum/#!msg/pgbarman/3aXWpaKWRFI/weUIZxspDAAJ
>> They are using Barman for the backups.
> Ahhhh, I see. I wasn't aware of that history.
>
>> Stephen Frost wrote:
>>>> But at some point in time, slave became corrupt (one of the base files are zero size where it should be 16Mb in size), and IMHO a "red alert" should arise - Slave server shall not even startup at all.
>>> How do you know it should be 16Mb in size...? That sounds like you're describing a WAL file, but you should be archiving your WAL files during a backup, not just using whatever is in pg_xlog/pg_wal..
>> It's not a WAL file -- it's a file backing a table.
> Interesting.
>
>>>> Since backups are taken from slave server, all backups are also corrupt.
>>> If you aren't following the appropriate process to perform a backup then, yes, you're going to end up with corrupt and useless/bad backups.
>> A few guys went over the backup-taking protocol upthread already.
>>
>> But anyway the backup tool is a moot point. The problem doesn't originate in the backup -- it originates in the standby, from where the backup is taken. The file can be seen as size 0 in the standby. Edson's question is: why wasn't the problem detected in the standby? It seems a valid question to me, to which we currently we don't have any good answer.
> The last message on that thread seems pretty clear to me- the comment is "I think this is a failure in standby build." It's not clear what that failure was but I agree it doesn't appear related to the backup tool (the comment there is "I'm using rsync"), or, really, PostgreSQL at all (a failure during the build of the replica isn't something we're necessairly going to pick up on..).
>
> As discussed on this thread, zero-byte files are entirely valid to appear in the PostgreSQL data directory.
>
> To try and dig into what happened, I'd probably look at what forks there are of that relation, the entry in pg_class, and try to figure out how it is that replication isn't complaining when the file on the primary appeared to be modified well after the last modify timestamp on the replica. If it's possible to replica this into a test environment, maybe even do a no-op update of a row of that table and see what happens with replication. One thing I wonder is if this table used to be unlogged and it was later turned into a logged table but something didn't quite happen correctly with that. I'd also suggest looking for other file size mismatches between the primary and the replica.
>
> Thanks!
>
> Stephen

The table was never unlogged. From the very beginning, it was always logged. I have dozens of databases with exactly the same setup - and right now, I'm rebuilding the slave server. Instead of investigating something whose cause I probably will not find, I would like to have an alert for the future.

Would it be possible to include in future versions:

1) After start standby, standby run all WAL files until it is synchronized with master (current behavior)

3) Before getting into "accept read only queries", check if all base files have same size as master server (new behavior). In case something is different, throw an error and stop database startup?

4) Then start "accept read only queries" (current behavior)

???

Thanks,

Edson
Some thoughts....
A tool to calculate a checksum of sorts based on the table (file) content would provide a better surety of duplication than simply checking file size - like differently vacuumed tables in each copy could have the same content but be different file sizes.
Something like these could be adapted to compare database content by filesystem checks rather than db queries. Following tablespaces, etc as well.
http://www.commandlinefu.com/commands/view/3555/find-duplicate-files-based-on-size-first-then-md5-hash
or other similar tools
Yes, there is some overhead, especially for large databases but it would be worth that to robustly ensure genuine and complete duplication.
I do wonder though - given the order of records in a table (file) is not necessarily identical (or is it?) even this may be problematic. Perhaps a checksum based on the result of a query output ordered by primary key could work?
Brent Wood
From: Edson Carlos Ericksson Richter <richter@simkorp.com.br>
To: pgsql-general@lists.postgresql.org
Sent: Friday, December 29, 2017 6:47 AM
Subject: Re: Does PostgreSQL check database integrity at startup?
On 28/12/2017 10:16, Stephen Frost wrote:
> Alvaro,
>
> * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
>> For context: this was first reported in the Barman forum here:
>> https://groups.google.com/forum/#!msg/pgbarman/3aXWpaKWRFI/weUIZxspDAAJ
>> They are using Barman for the backups.
> Ahhhh, I see. I wasn't aware of that history.
>
>> Stephen Frost wrote:
>>
>>>> But at some point in time, slave became corrupt (one of the base
>>>> files are zero size where it should be 16Mb in size), and IMHO a
>>>> "red alert" should arise - Slave server shall not even startup at
>>>> all.
>>> How do you know it should be 16Mb in size...? That sounds like you're
>>> describing a WAL file, but you should be archiving your WAL files during
>>> a backup, not just using whatever is in pg_xlog/pg_wal..
>> It's not a WAL file -- it's a file backing a table.
> Interesting.
>
>>>> Since backups are taken from slave server, all backups are also corrupt.
>>> If you aren't following the appropriate process to perform a backup
>>> then, yes, you're going to end up with corrupt and useless/bad backups.
>> A few guys went over the backup-taking protocol upthread already.
>>
>> But anyway the backup tool is a moot point. The problem doesn't
>> originate in the backup -- it originates in the standby, from where the
>> backup is taken. The file can be seen as size 0 in the standby.
>> Edson's question is: why wasn't the problem detected in the standby?
>> It seems a valid question to me, to which we currently we don't have any
>> good answer.
> The last message on that thread seems pretty clear to me- the comment is
> "I think this is a failure in standby build." It's not clear what that
> failure was but I agree it doesn't appear related to the backup tool
> (the comment there is "I'm using rsync"), or, really, PostgreSQL at all
> (a failure during the build of the replica isn't something we're
> necessairly going to pick up on..).
>
> As discussed on this thread, zero-byte files are entirely valid to
> appear in the PostgreSQL data directory.
>
> To try and dig into what happened, I'd probably look at what forks there
> are of that relation, the entry in pg_class, and try to figure out how
> it is that replication isn't complaining when the file on the primary
> appeared to be modified well after the last modify timestamp on the
> replica. If it's possible to replica this into a test environment,
> maybe even do a no-op update of a row of that table and see what happens
> with replication. One thing I wonder is if this table used to be
> unlogged and it was later turned into a logged table but something
> didn't quite happen correctly with that. I'd also suggest looking for
> other file size mismatches between the primary and the replica.
>
> Thanks!
>
> Stephen
The table was never unlogged. From very beginning, it was always logged.
I've dozens of databases with exactly same setup - and right now, I'm
rebuilding the slave server. Instead of investigating something probably
I will not find the cause, I would like to have a alert for the future.
Would be possible to include in future versions:
1) After start standby, standby run all WAL files until it is
synchronized with master (current behavior)
3) Before getting into "accept read only queries", check if all base
files have same size as master server (new behavior). In case something
is different, throw an error and stop database startup?
4) Then start "accept read only queries" (current behavior)
???
Thanks,
Edson
Greetings,

* Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:
> Would be possible to include in future versions:
> 1) After start standby, standby run all WAL files until it is synchronized with master (current behavior)
> 3) Before getting into "accept read only queries", check if all base files have same size as master server (new behavior). In case something is different, throw an error and stop database startup?
> 4) Then start "accept read only queries" (current behavior)

I'm afraid it wouldn't be nearly that simple with a running system.

If you're that concerned about it, you could shut down the primary, verify that the replica has fully caught up to where the primary was, and then compare the file sizes on disk using a script yourself (excluding the temporary files and unlogged relations, etc, of course). I'm not completely sure that there isn't a valid reason for them to be different even with everything shut down and fully caught up, but that seems like the best chance of matching things up.

On a running system where lots of processes are potentially writing to different files across the system, getting a point-in-time snapshot of the file sizes across the entire system as of a certain WAL point (so you could replay to there and then compare on the replica) seems unlikely to be possible, at least with just PG. Perhaps with help from the OS/filesystem, you might be able to get such a snapshot.

Do you have some way of knowing that such a check would have actually caught this..? Isn't it possible that the file became zero'd out sometime after the replica started up and was running fine for a while? One of the questions I asked earlier drove at exactly this question- why weren't there any errors thrown during WAL replay on the replica, given that the relation clearly appeared to be modified after the replica was built? Another thing which could be done would be to look through the WAL for references to that file and see what the WAL included for it.

Also, really, if this is happening then there's something pretty wrong with some part of the process and that simply needs to be fixed- just throwing an error saying the replica isn't valid isn't really good for anyone. Where the issue is, I don't think we have any idea, because there's a lot of details missing here, which is why I was asking questions about the other data and information since it might help us figure out what happened. Perhaps it's an issue in the OS or filesystem or another layer (though there should really be logs if that's the case) or maybe it really is a PG issue, but if so, we need a lot more info to debug and address it.

Thanks!

Stephen
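(One possible sketch of the size comparison described above: run the same query on the primary and on the fully caught-up standby, save both outputs, and diff them. This only compares relation sizes via pg_relation_size(); it is not the stricter filesystem-level comparison with the primary shut down, and caveats about legitimate differences still apply.)

-- Run on both clusters, e.g. with psql -Atc, and diff the two result files.
SELECT c.oid::regclass AS relation,
       pg_relation_size(c.oid, 'main') AS main_bytes
FROM pg_class c
WHERE c.relkind IN ('r', 'i', 'm', 't')   -- tables, indexes, matviews, toast
  AND c.relpersistence = 'p'              -- skip temporary and unlogged relations
ORDER BY 1;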
Greetings Brent,

* Brent Wood (pcreso@yahoo.com) wrote:
> A tool to calculate a checksum of sorts based on the table (file) content would provide a better surety of duplication than simply checking file size - like differently vacuumed tables in each copy could have the same content but be different file sizes.

PG has support for checksums and there are tools out there to validate that the checksum is correct for all pages which have one, but that wouldn't help in this case because the file is zero'd out (and a zero'd out file is actually a valid file in a PG data directory).

Also, the files on the primary and the replica actually can be different when looked at with a complete-file checksum due to hint bits being set differently (at least, possibly other ways too). That doesn't make them invalid or incorrect though.

Rather than trying to compare a running primary to a running replica, such a check to verify that the files backed up during a backup have the same checksum as the files being restored from that backup can be done, and that *is* done in at least some of the PG backup tools already (pgBackRest has an independent manifest that it stores for each backup which contains the checksum of each file as-backed-up, and it verifies that checksum when performing a restore to make sure that the backed up file wasn't corrupted in place; other tools hopefully have similar).

> I do wonder though - given the order of records in a table (file) is not necessarily identical (or is it?) even this may be problematic. Perhaps a checksum based on the result of a query output ordered by primary key could work?

The order of records in a *file* should be the same in the heap on the primary as they are on the replica, but that doesn't mean the contents of those files will be exactly the same (as mentioned above, hint bits can differ). We used to have cases where the indexes could also be different, but I believe that was changed so they should match.

I've used the approach of doing a checksum across the results of an ordered query to compare between systems and that generally does work, but it's a bit tricky if you're trying to compare a table that's heavily modified- you need to determine the point in the WAL stream that you're at on the primary when you run the query and then replay the replica to that point in the WAL and then run the query on the replica, otherwise you could end up with differences that are just because of the ongoing transactions being run to update the table.

Thanks!

Stephen
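(For concreteness, the ordered-query checksum described above might look like the sketch below. 'some_table' and its primary key column 'id' are placeholders, and the caveat about comparing at the same WAL position still applies.)

-- Run on the primary, then on a replica that has replayed to the same
-- WAL position, and compare the two digests.
SELECT md5(string_agg(t::text, E'\n' ORDER BY t.id)) AS table_digest
FROM some_table AS t;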
On 28/12/2017 16:06, Brent Wood wrote:
> Some thoughts....
>
> A tool to calculate a checksum of sorts based on the table (file) content would provide a better surety of duplication than simply checking file size - like differently vacuumed tables in each copy could have the same content but be different file sizes.
>
> Something like these could be adapted to compare database content by filesystem checks rather than db queries. Following tablespaces, etc as well.
>
> http://www.commandlinefu.com/commands/view/3555/find-duplicate-files-based-on-size-first-then-md5-hash
> or other similar tools
>
> Yes, there is some overhead, especially for large databases but it would be worth that to robustly ensure genuine and complete duplication.
>
> I do wonder though - given the order of records in a table (file) is not necessarily identical (or is it?) even this may be problematic. Perhaps a checksum based on the result of a query output ordered by primary key could work?
>
> Brent Wood
>
> ------------------------------------------------------------------------
> *From:* Edson Carlos Ericksson Richter <richter@simkorp.com.br>
> *To:* pgsql-general@lists.postgresql.org
> *Sent:* Friday, December 29, 2017 6:47 AM
> *Subject:* Re: Does PostgreSQL check database integrity at startup?
>
> On 28/12/2017 10:16, Stephen Frost wrote:
> > Alvaro,
> >
> > * Alvaro Herrera (alvherre@alvh.no-ip.org) wrote:
> >> For context: this was first reported in the Barman forum here: https://groups.google.com/forum/#!msg/pgbarman/3aXWpaKWRFI/weUIZxspDAAJ
> >> They are using Barman for the backups.
> > Ahhhh, I see. I wasn't aware of that history.
> >
> >> Stephen Frost wrote:
> >>>> But at some point in time, slave became corrupt (one of the base files are zero size where it should be 16Mb in size), and IMHO a "red alert" should arise - Slave server shall not even startup at all.
> >>> How do you know it should be 16Mb in size...? That sounds like you're describing a WAL file, but you should be archiving your WAL files during a backup, not just using whatever is in pg_xlog/pg_wal..
> >> It's not a WAL file -- it's a file backing a table.
> > Interesting.
> >
> >>>> Since backups are taken from slave server, all backups are also corrupt.
> >>> If you aren't following the appropriate process to perform a backup then, yes, you're going to end up with corrupt and useless/bad backups.
> >> A few guys went over the backup-taking protocol upthread already.
> >>
> >> But anyway the backup tool is a moot point. The problem doesn't originate in the backup -- it originates in the standby, from where the backup is taken. The file can be seen as size 0 in the standby. Edson's question is: why wasn't the problem detected in the standby? It seems a valid question to me, to which we currently we don't have any good answer.
> > The last message on that thread seems pretty clear to me- the comment is "I think this is a failure in standby build." It's not clear what that failure was but I agree it doesn't appear related to the backup tool (the comment there is "I'm using rsync"), or, really, PostgreSQL at all (a failure during the build of the replica isn't something we're necessairly going to pick up on..).
> >
> > As discussed on this thread, zero-byte files are entirely valid to appear in the PostgreSQL data directory.
> >
> > To try and dig into what happened, I'd probably look at what forks there are of that relation, the entry in pg_class, and try to figure out how it is that replication isn't complaining when the file on the primary appeared to be modified well after the last modify timestamp on the replica. If it's possible to replica this into a test environment, maybe even do a no-op update of a row of that table and see what happens with replication. One thing I wonder is if this table used to be unlogged and it was later turned into a logged table but something didn't quite happen correctly with that. I'd also suggest looking for other file size mismatches between the primary and the replica.
> >
> > Thanks!
> >
> > Stephen
> The table was never unlogged. From very beginning, it was always logged. I've dozens of databases with exactly same setup - and right now, I'm rebuilding the slave server. Instead of investigating something probably I will not find the cause, I would like to have a alert for the future.
>
> Would be possible to include in future versions:
> 1) After start standby, standby run all WAL files until it is synchronized with master (current behavior)
> 3) Before getting into "accept read only queries", check if all base files have same size as master server (new behavior). In case something is different, throw an error and stop database startup?
> 4) Then start "accept read only queries" (current behavior)
> ???
>
> Thanks,
>
> Edson

Thanks, Brent. I'll think about it.

Regards,

Edson
On 28/12/2017 16:26, Stephen Frost wrote:
> Greetings Brent,
>
> * Brent Wood (pcreso@yahoo.com) wrote:
>> A tool to calculate a checksum of sorts based on the table (file) content would provide a better surety of duplication than simply checking file size - like differently vacuumed tables in each copy could have the same content but be different file sizes.
> PG has support for checksums and there are tools out there to validate that the checksum is correct for all pages which have one, but that wouldn't help in this case because the file is zero'd out (and a zero'd out file is actually a valid file in a PG data directory).
>
> Also, the files on the primary and the replica actually can be different when looked at with a complete-file checksum due to hint bits being set differently (at least, possibly other ways too). That doesn't make them invalid or incorrect though.
>
> Rather than trying to compare a running primary to a running replica, such a check to verify that the files backed up during a backup have the same checksum as the files being restored from that backup can be done, and that *is* done in at least some of the PG backup tools already (pgBackRest has an independent manifest that it stores for each backup which contains the checksum of each file as-backed-up, and it verifies that checksum when performing a restore to make sure that the backed up file wasn't corrupted in place, other tools hopefully have similar).
>
>> I do wonder though - given the order of records in a table (file) is not necessarily identical (or is it?) even this may be problematic. Perhaps a checksum based on the result of a query output ordered by primary key could work?
> The order of records in a *file* should be the same in the heap on the primary as they are on the replica, but that doesn't mean the contents of those files will be exactly the same (as mentioned above, hint bits can differ). We used to have cases where the indexes could also be different, but I believe that was changed so they should match.
>
> I've used the approach of doing a checksum across the results of an ordered query to compare between systems and that generally does work, but it's a bit tricky if you're trying to compare a table that's heavily modified- you need to determine the point in the WAL stream that you're at on the primary when you run the query and then replay the replica to that point in the WAL and then run the query on the replica, otherwise you could end up with differences that are just because of the ongoing transactions being run to update the table.
>
> Thanks!
>
> Stephen

Stephen, thanks for your detailed reasoning on why this would not be so simple. I'll take all of that into consideration.

I hope I find something that gives me a bit more confidence that my replicas are reliable enough. At this moment, I just don't trust them.

Regards,

Edson
On Thu, Dec 28, 2017 at 1:26 PM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings Brent,
* Brent Wood (pcreso@yahoo.com) wrote:
> A tool to calculate a checksum of sorts based on the table (file) content would provide a better surety of duplication than simply checking file size - like differently vacuumed tables in each copy could have the same content but be different file sizes.
PG has support for checksums and there are tools out there to validate
that the checksum is correct for all pages which have one, but that
wouldn't help in this case because the file is zero'd out (and a zero'd
out file is actually a valid file in a PG data directory).
Also, the files on the primary and the replica actually can be different
when looked at with a complete-file checksum due to hint bits being set
differently (at least, possibly other ways too). That doesn't make them
invalid or incorrect though.
In addition to what Stephen and everyone else said, it is impossible to get a valid snapshot of the whole "file" on a running server without locking the relation and reading it through the PostgreSQL buffer cache. On data files such as heap and index, PostgreSQL does extensive write caching. Preventing data loss from write caching is a primary purpose of WAL. Write caching in the application (PostgreSQL in this case) prevents the OS from actually knowing the correct "logical" state of the file at any given point in time. This means that even a LVM snapshot will not give you consistent data files of a running server, because the not yet written changes (in shared buffers) waiting for a checkpoint to force them into OS buffers won't be visible from outside PostgreSQL.
Regards, Jan
such a check to verify that the files backed up during a backup have the
same checksum as the files being restored from that backup can be done,
and that *is* done in at least some of the PG backup tools already
(pgBackRest has an independent manifest that it stores for each backup
which contains the checksum of each file as-backed-up, and it verifies
that checksum when performing a restore to make sure that the backed up
file wasn't corrupted in place, other tools hopefully have similar).
> I do wonder though - given the order of records in a table (file) is not necessarily identical (or is it?) event this may be problematic. Perhaps a checksum based on the result of a query output ordered by primary key could work?
The order of records in a *file* should be the same in the heap on the
primary as they are on the replica, but that doesn't mean the contents
of those files will be exactly the same (as mentioned above, hint bits
can differ). We used to have cases where the indexes could also be
different, but I believe that was changed so they should match.
I've used the approach of doing a checksum across the results of an
ordered query to compare between systems and that generally does work,
but it's a bit tricky if you're trying to compare a table that's heavily
modified- you need to determine the point in the WAL stream that you're
at on the primary when you run the query and then replay the replica to
that point in the WAL and then run the query on the replica, otherwise
you could end up with differences that are just because of the ongoing
transactions being run to update the table.
Thanks!
Stephen
Jan Wieck
Senior Postgres Architect
http://pgblog.wi3ck.info
On 29/12/2017 22:14, Jan Wieck wrote:
> On Thu, Dec 28, 2017 at 1:26 PM, Stephen Frost <sfrost@snowman.net> wrote:
>
>     Greetings Brent,
>
>     * Brent Wood (pcreso@yahoo.com) wrote:
>     > A tool to calculate a checksum of sorts based on the table (file) content would provide a better surety of duplication than simply checking file size - like differently vacuumed tables in each copy could have the same content but be different file sizes.
>
>     PG has support for checksums and there are tools out there to validate that the checksum is correct for all pages which have one, but that wouldn't help in this case because the file is zero'd out (and a zero'd out file is actually a valid file in a PG data directory).
>
>     Also, the files on the primary and the replica actually can be different when looked at with a complete-file checksum due to hint bits being set differently (at least, possibly other ways too). That doesn't make them invalid or incorrect though.
>
> In addition to what Stephen and everyone else said, it is impossible to get a valid snapshot of the whole "file" on a running server without locking the relation and reading it through the PostgreSQL buffer cache. On data files such as heap and index, PostgreSQL does extensive write caching. Preventing data loss from write caching is a primary purpose of WAL. Write caching in the application (PostgreSQL in this case) prevents the OS from actually knowing the correct "logical" state of the file at any given point in time. This means that even a LVM snapshot will not give you consistent data files of a running server, because the not yet written changes (in shared buffers) waiting for a checkpoint to force them into OS buffers won't be visible from outside PostgreSQL.
>
> Regards, Jan
>
>     such a check to verify that the files backed up during a backup have the same checksum as the files being restored from that backup can be done, and that *is* done in at least some of the PG backup tools already (pgBackRest has an independent manifest that it stores for each backup which contains the checksum of each file as-backed-up, and it verifies that checksum when performing a restore to make sure that the backed up file wasn't corrupted in place, other tools hopefully have similar).
>
>     > I do wonder though - given the order of records in a table (file) is not necessarily identical (or is it?) even this may be problematic. Perhaps a checksum based on the result of a query output ordered by primary key could work?
>
>     The order of records in a *file* should be the same in the heap on the primary as they are on the replica, but that doesn't mean the contents of those files will be exactly the same (as mentioned above, hint bits can differ). We used to have cases where the indexes could also be different, but I believe that was changed so they should match.
>
>     I've used the approach of doing a checksum across the results of an ordered query to compare between systems and that generally does work, but it's a bit tricky if you're trying to compare a table that's heavily modified- you need to determine the point in the WAL stream that you're at on the primary when you run the query and then replay the replica to that point in the WAL and then run the query on the replica, otherwise you could end up with differences that are just because of the ongoing transactions being run to update the table.
>
>     Thanks!
>
>     Stephen
>
> --
> Jan Wieck
> Senior Postgres Architect
> http://pgblog.wi3ck.info

One last question:

There should be a "catalog" that points to where tables are stored in physical files (I think, at least, because at some point PostgreSQL needs to know where to read the data from).

Based on information from this catalog, could I have a tool (perhaps a C function) that checks that the data is really there?

Thanks,

Edson
Jan, all,

* Jan Wieck (jan@wi3ck.info) wrote:
> On Thu, Dec 28, 2017 at 1:26 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > * Brent Wood (pcreso@yahoo.com) wrote:
> > > A tool to calculate a checksum of sorts based on the table (file) content would provide a better surety of duplication than simply checking file size - like differently vacuumed tables in each copy could have the same content but be different file sizes.
> >
> > PG has support for checksums and there are tools out there to validate that the checksum is correct for all pages which have one, but that wouldn't help in this case because the file is zero'd out (and a zero'd out file is actually a valid file in a PG data directory).
> >
> > Also, the files on the primary and the replica actually can be different when looked at with a complete-file checksum due to hint bits being set differently (at least, possibly other ways too). That doesn't make them invalid or incorrect though.
>
> In addition to what Stephen and everyone else said, it is impossible to get a valid snapshot of the whole "file" on a running server without locking the relation and reading it through the PostgreSQL buffer cache. On data files such as heap and index, PostgreSQL does extensive write caching. Preventing data loss from write caching is a primary purpose of WAL. Write caching in the application (PostgreSQL in this case) prevents the OS from actually knowing the correct "logical" state of the file at any given point in time. This means that even a LVM snapshot will not give you consistent data files of a running server, because the not yet written changes (in shared buffers) waiting for a checkpoint to force them into OS buffers won't be visible from outside PostgreSQL.

Indeed, and just to follow that through- that's what WAL exists to correct in the event of a crash. Anything cache'd in PostgreSQL but actually committed will have already been written into the WAL and sync'd to disk, so when PostgreSQL restarts it'll realize that a crash had happened and go back to the last checkpoint in the WAL stream and replay the WAL from there to get back to consistency.

This is also how online backups of PostgreSQL happen, but PG can't then start the replay from the *latest* checkpoint, it has to start from the checkpoint where the backup started, and it can't just finish at any point in the WAL stream- it must reach consistency (where the backup finished). The starting location is provided by PostgreSQL when doing a non-exclusive backup to the process issuing the 'pg_stop_backup()' call in its result and must then be put into a file called 'backup_label' when PG is restored from that backup, so that PG knows that a backup was done and where to start WAL replay from.

The stopping location is written into the WAL stream when doing a backup from a primary, but if you're doing a backup from a replica then we use the latest checkpoint written into pg_control as the stopping point- but this means that pg_control *must* be copied last when doing a backup from a standby. Sadly, we don't seem to have gotten around to actually documenting this fact as yet, which is really a bug that we should fix and back-patch because it's absolutely critical.

The various backup tools written by folks who have delved into all of this deeply handle these requirements, thankfully, but it just goes to show how difficult this all is to get right. :/

Thanks!

Stephen
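(To make the backup_label mechanics above concrete, a minimal sketch of a non-exclusive base backup driven from SQL on PostgreSQL 9.6 or later; 'nightly' is just a placeholder label, and the actual file copy is done with whatever tool you use.)

-- All of this must happen in one session that stays connected for the
-- duration of the copy (that is what makes the backup non-exclusive).
SELECT pg_start_backup('nightly', false, false);

-- ... copy the data directory with your tool of choice; when backing up
--     from a standby, copy pg_control last, per the note above ...

-- Returns the stop LSN plus the backup_label and tablespace_map contents.
SELECT lsn, labelfile, spcmapfile FROM pg_stop_backup(false);

-- Write the returned labelfile text into a file named backup_label (and
-- spcmapfile into tablespace_map, if it is not empty) at the top of the
-- restored data directory *before* starting PostgreSQL from that backup.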
Greetings,

* Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:
> There should be a "catalog" that point where tables are stored in physical files (I think, at least, because at some point PostgreSQL need to know from where to read the data).

Yes, it's pg_class. Specifically, the relfilenode.

> Based on information from this catalog, would I have a tool (perhaps, a C function) that check that data is really there?

You could write such a function, but it wouldn't be able to be general purpose as a zero-byte file is, in fact, a valid file. You could just as easily do a 'select 1 from table limit 1;' and make sure that you get back a successful, single-row, result, if you wish to verify that certain tables in your database always have at least 1 row.

Thanks!

Stephen
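(One possible heuristic along those lines, offered only as a sketch: list permanent tables whose main fork is empty on disk even though the catalog last saw them with pages. relpages is merely an estimate maintained by VACUUM/ANALYZE, so a hit here is a reason to investigate, not proof of corruption.)

SELECT c.oid::regclass AS relation,
       c.relpages,
       c.reltuples
FROM pg_class c
WHERE c.relkind = 'r'
  AND c.relpersistence = 'p'
  AND c.relpages > 0                        -- catalog thinks there were pages
  AND pg_relation_size(c.oid, 'main') = 0;  -- but the main fork is empty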
>There should be a "catalog" that point where tables are stored in physical files...
Here is the query that gives you that information.

SELECT c.oid,
n.nspname as schema,
c.relname as table,
(SELECT oid FROM pg_database WHERE datname = current_database() ) as db_dir,
c.relfilenode as filename
FROM pg_class c
JOIN pg_namespace n ON (n.oid = c.relnamespace)
WHERE relname NOT LIKE 'pg_%'
AND relname NOT LIKE 'information%'
AND relname NOT LIKE 'sql_%'
AND relkind = 'r'
ORDER BY 2, relname;
On Fri, Dec 29, 2017 at 8:13 PM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:
> There should be a "catalog" that point where tables are stored in physical
> files (I think, at least, because at some point PostgreSQL need to know from
> where to read the data).
Yes, it's pg_class. Specifically, the relfilenode.
> Based on information from this catalog, would I have a tool (perhaps, a C
> function) that check that data is really there?
You could write such a function, but it wouldn't be able to be general
purpose as a zero-byte file is, in fact, a valid file. You could just
as easily do a 'select 1 from table limit 1;' and make sure that you get
back a successful, single-row, result, if you wish to verify that certain
tables in your database always have at least 1 row.
Thanks!
Stephen
--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.
Greetings,

* Melvin Davidson (melvin6925@gmail.com) wrote:
> >There should be a "catalog" that point where tables are stored in physical files...
>
> Here is the query that gives you that information.
>
> SELECT c.oid,
>        n.nspname as schema,
>        c.relname as table,
>        (SELECT oid FROM pg_database WHERE datname = current_database() ) as db_dir,
>        c.relfilenode as filename
>   FROM pg_class c
>   JOIN pg_namespace n ON (n.oid = c.relnamespace)
>  WHERE relname NOT LIKE 'pg_%'
>    AND relname NOT LIKE 'information%'
>    AND relname NOT LIKE 'sql_%'
>    AND relkind = 'r'
> ORDER BY 2, relname;

This isn't a terribly good query- it's entirely valid to have 'pg_blah' and 'informationblah' tables in user schemas. If you'd like to filter out the catalogs/internal schemas, filter based on schema name instead.

Also, this doesn't actually provide what Edson is asking for. Edson's asking for a query that uses pg_ls_dir() or some custom function which will run 'stat' on each file and return the size, according to the OS.

Thanks!

Stephen
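(Putting those two corrections together, a sketch of what such a query might look like. It is not an official recipe from the thread: pg_stat_file() is superuser-only on older releases, it only reports the first 1GB segment of a large relation, and the schema filter shown is just one reasonable choice.)

SELECT n.nspname AS schema,
       c.relname AS table,
       pg_relation_filepath(c.oid) AS main_fork_path,
       (pg_stat_file(pg_relation_filepath(c.oid))).size AS bytes_on_disk
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema')  -- filter by schema,
  AND n.nspname NOT LIKE 'pg_temp%'                          -- not by relname
  AND n.nspname NOT LIKE 'pg_toast%'
ORDER BY 1, 2;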
On Fri, Dec 29, 2017 at 9:07 PM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Melvin Davidson (melvin6925@gmail.com) wrote:
> >There should be a "catalog" that point where tables are stored in physical
> files...
>
> Here is the query that gives you that information.
>
> SELECT c.oid,
> n.nspname as schema,
> c.relname as table,
> (SELECT oid FROM pg_database WHERE datname =
> current_database() ) as db_dir,
> c.relfilenode as filename
> FROM pg_class c
> JOIN pg_namespace n ON (n.oid = c.relnamespace)
> WHERE relname NOT LIKE 'pg_%'
> AND relname NOT LIKE 'information%'
> AND relname NOT LIKE 'sql_%'
> AND relkind = 'r'
> ORDER BY 2, relname;
This isn't a terribly good query- it's entirely valid to have 'pg_blah'
and 'informationblah' tables in user schemas. If you'd like to filter
out the catalogs/internal schemas, filter based on schema name instead.
Also, this doesn't actually provide what Edson is asking for. Edson's
asking for a query that uses pg_ls_dir() or some custom function which
will run 'stat' on each file and return the size, according to the OS.
Thanks!
Stephen
Edson's original request was for a query that shows the FILENAMEs for the table.
As for "qood" query, that is entirely an opinion. The query WILL show all files associated
with ALL tables. You are free to edit and reconstruct as you choose.
--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.
Melvin, * Melvin Davidson (melvin6925@gmail.com) wrote: > On Fri, Dec 29, 2017 at 9:07 PM, Stephen Frost <sfrost@snowman.net> wrote: > > * Melvin Davidson (melvin6925@gmail.com) wrote: > > > >There should be a "catalog" that point where tables are stored in > > physical > > > files... > > > > > > Here is the query that gives you that information. > > > > > > SELECT c.oid, > > > n.nspname as schema, > > > c.relname as table, > > > (SELECT oid FROM pg_database WHERE datname = > > > current_database() ) as db_dir, > > > c.relfilenode as filename > > > FROM pg_class c > > > JOIN pg_namespace n ON (n.oid = c.relnamespace) > > > WHERE relname NOT LIKE 'pg_%' > > > AND relname NOT LIKE 'information%' > > > AND relname NOT LIKE 'sql_%' > > > AND relkind = 'r' > > > ORDER BY 2, relname; > > > > This isn't a terribly good query- it's entirely valid to have 'pg_blah' > > and 'informationblah' tables in user schemas. If you'd like to filter > > out the catalogs/internal schemas, filter based on schema name instead. > > > > Also, this doesn't actually provide what Edson is asking for. Edson's > > asking for a query that uses pg_ls_dir() or some custom function which > > will run 'stat' on each file and return the size, according to the OS. > > *Edson's original request was for a query that shows the FILENAMEs for the > table.* As quoted previously, his request included: --- > Based on information from this catalog, would I have a tool (perhaps, a C > function) that check that data is really there? --- Which is asking about having a function to 'stat' the files and check their length ('data is really there'). > *As for "good" query, that is entirely an opinion. The query WILL show all > files associated* > *with ALL tables. You are free to edit and reconstruct as you choose. * No, it won't, it'll filter out tables which exist in user schemas that happen to start with one of the strings that the query includes ('pg_', 'information', and 'sql_'). I encourage you to test it- create a table in the public schema called 'pg_whatever' and see if your query picks it up or not. This isn't a minor complaint about style, the query is outright wrong. Thanks! Stephen
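The test suggested above is easy to reproduce; a quick sketch (the table name is only for illustration):
CREATE TABLE public.pg_whatever (id int);
-- the earlier query filters on relname NOT LIKE 'pg_%', so this perfectly
-- ordinary user table would be silently missing from its output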
>the query is outright wrong.
Really? I submit a query to help and all you can do is criticize? If you don't like the query, write a better one yourself, but
kindly do not try to degrade the efforts of others.
On Fri, Dec 29, 2017 at 9:22 PM, Stephen Frost <sfrost@snowman.net> wrote:
Melvin,
* Melvin Davidson (melvin6925@gmail.com) wrote:
> On Fri, Dec 29, 2017 at 9:07 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > * Melvin Davidson (melvin6925@gmail.com) wrote:
> > > >There should be a "catalog" that point where tables are stored in
> > physical
> > > files...
> > >
> > > Here is the query that gives you that information.
> > >
> > > SELECT c.oid,
> > > n.nspname as schema,
> > > c.relname as table,
> > > (SELECT oid FROM pg_database WHERE datname =
> > > current_database() ) as db_dir,
> > > c.relfilenode as filename
> > > FROM pg_class c
> > > JOIN pg_namespace n ON (n.oid = c.relnamespace)
> > > WHERE relname NOT LIKE 'pg_%'
> > > AND relname NOT LIKE 'information%'
> > > AND relname NOT LIKE 'sql_%'
> > > AND relkind = 'r'
> > > ORDER BY 2, relname;
> >
> > This isn't a terribly good query- it's entirely valid to have 'pg_blah'
> > and 'informationblah' tables in user schemas. If you'd like to filter
> > out the catalogs/internal schemas, filter based on schema name instead.
> >
> > Also, this doesn't actually provide what Edson is asking for. Edson's
> > asking for a query that uses pg_ls_dir() or some custom function which
> > will run 'stat' on each file and return the size, according to the OS.
>
> *Edson's original request was for a query that shows the FILENAMEs for the
> table.*
As quoted previously, his request included:
---
> Based on information from this catalog, would I have a tool (perhaps, a C
> function) that check that data is really there?
---
Which is asking about having a function to 'stat' the files and check
their length ('data is really there').
> *As for "qood" query, that is entirely an opinion. The query WILL show all
> files associated*
> *with ALL tables. You are free to edit and reconstruct as you choose. *
No, it won't, it'll filter out tables which exist in user schemas that
happen to start with one of the strings that the query includes ('pg_',
'information', and 'sql_').
I encourage you to test it- create a table in the public schema called
'pg_whatever' and see if your query picks it up or not. This isn't a
minor complaint about style, the query is outright wrong.
Thanks!
Stephen
--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.
> On Dec 29, 2017, at 6:14 PM, Melvin Davidson <melvin6925@gmail.com> wrote: > > Edson's original request was for a query that shows the FILENAMEs for the table. Which the query you provide does not do. > As for "qood" query, that is entirely an opinion. But a well informed one. Your query may work well enough for you, but when you're suggesting others rely on it you should expect more informed users to point out that it has some critical flaws - especially ones that might not be obvious to new users - lest others naively rely on it. > The query WILL show all files associated > with ALL tables. No, it won't. You're filtering out a bunch of things via the name of the table. That doesn't do the right thing. You're heading in the right direction, but the query you gave gets some things badly wrong. Listen to the criticism from others and you'll improve your knowledge and skills. There's absolutely nothing wrong with posting information that's not quite right, nor with getting feedback from others on what's wrong with it. Copping an attitude in response to that feedback is where things go downhill. Cheers, Steve
Melvin, * Melvin Davidson (melvin6925@gmail.com) wrote: > >the query is outright wrong. > Really? I submit a query to help and all you can do is criticize? I explained exactly what was wrong with it, and how to write a proper query to answer this question, with the thought that you'd appreciate the help and would create a proper query. I'm a bit mystified as to why you seem to be unhappy with this. > Yes it is legal to create a table that starts with pg_, but any PostgreSQL > developer should > know that is not a good idea. That's really not an excuse for providing an incorrect query when you could simply adjust the query to properly filter on nspname instead of relname, further, while perhaps 'pg_' isn't a good prefix to use in a PG database (note that other databases, of course, wouldn't have any issue using such a prefix), using a table name starting with 'information' or 'sql_' isn't nearly as questionable, yet your query also filtered those out. > If you don't like the query, write a better one yourself, but > kindly do not try to degrade the efforts of others. No, pointing out that a query is wrong and doesn't accurately answer the question is something which should be done, just as someone should call out a bug in code that's been submitted for inclusion in PostgreSQL. I've tried to be kind and point out exactly what's wrong, along with encouraging words of how to adjust the query to be correct. Having misleading and incorrect information in our list archives isn't being helpful- those archives are preserved for more-or-less forever, end up being returned in google search results, and have caused confusion and even data loss in multiple cases that I'm personally aware of (thank you to pg_resetxlog...). We should all try to work to minimize that, and to correct it when necessary as best we can through responses. Thanks! Stephen
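A minimal rework of the earlier query along the lines suggested above might look like this (a sketch only, filtering on schema name rather than table name):
SELECT c.oid,
       n.nspname as schema,
       c.relname as table,
       (SELECT oid FROM pg_database WHERE datname = current_database()) as db_dir,
       c.relfilenode as filename
  FROM pg_class c
  JOIN pg_namespace n ON (n.oid = c.relnamespace)
 WHERE n.nspname !~ '^pg_'
   AND n.nspname <> 'information_schema'
   AND c.relkind = 'r'
 ORDER BY 2, c.relname;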
This whole discussion left out the critical point of
how on earth did that file end up having zero bytes?
There is the possibility that this is due to a bug in PostgreSQL. But over the 20+ years I've been using and hacking it, that has never happened.
OP never told us what type of OS, hardware and specifically storage subsystem that "slave" was running on. Filesystems zapping files to zero after a catastrophic IO failure (no longer throwing them into lost+found) isn't unheard of.
Regards, Jan
On Fri, Dec 29, 2017 at 8:13 PM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Edson Carlos Ericksson Richter (richter@simkorp.com.br) wrote:
> There should be a "catalog" that point where tables are stored in physical
> files (I think, at least, because at some point PostgreSQL need to know from
> where to read the data).
Yes, it's pg_class. Specifically, the relfilenode.
> Based on information from this catalog, would I have a tool (perhaps, a C
> function) that check that data is really there?
You could write such a function, but it wouldn't be able to be general
purpose as a zero-byte file is, in fact, a valid file. You could just
as easily do a 'select 1 from table limit 1;' and make sure that you get
back a successful, single-row, result, if you wish to verify that certain
tables in your database always have at least 1 row.
Thanks!
Stephen
Jan Wieck
Senior Postgres Architect
> On 30 Dec 2017, at 6:09, Jan Wieck <jan@wi3ck.info> wrote: > > This whole discussion left out the critical point of > > how on earth did that file end up having zero bytes? > > There is the possibility that this is due to a bug in PostgreSQL. But over the 20+ years I've been using and hacking it, that has never happened. > > OP never told us what type of OS, hardware and specifically storage subsystem that "slave" was running on. Filesystems zapping files to zero after a catastrophic IO failure (no longer throwing them into lost+found) isn't unheard of. If this is indeed a case of a corrupt database, then it's possible that we're looking at a system that claims to have the disk cache flushed when it hasn't yet done that. It gets even more fun if there are multiple table-spaces in use with different behaviour in "whatever is causing this" (fsync, file system zappery, etc.); the WAL log can get out of synch with the table files in that case. The WAL would reflect a situation where more has been flushed to the respective files than is actually the case - data got lost. But of course it's also entirely possible that these empty files are not in fact a problem, as suggested up-thread. No reason to panic - yet. In fact, I don't see it mentioned explicitly anywhere, but are we actually looking at a problem? The standby server started up just fine, from what I'm reading. Is there any evidence of corruption? So far, the database has been labeled corrupt by Edson because a base file was 0 bytes, but that's apparently not a definitive indication of corruption. What would be definitive is if querying certain relations causes errors or does give results that were never in that database. I do think that those 0 byte files are a good starting point, by querying the relations that those belong to (if any). Otherwise we may be searching for a needle in a haystack that doesn't necessarily contain one. I do realise that not finding any evidence of corruption does not necessarily mean there is none. I'm not convinced by the argument that a CRC check of a 0 byte file on a standby would not detect corruption. At the least, the CRC would be different or we would be left with a CRC on the master that we can't match any CRC's on the slave to if the file is larger on the master. If CRC's can be relied on to detect corruption (which they were designed to do), then that answers Edson's question. Alban Hertroys -- If you can't see the forest for the trees, cut the trees and you'll find there is no forest.
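For the "query the relations those files belong to" step suggested above, one hedged sketch (the filenode number here is hypothetical; 0 means the default tablespace, and the query has to be run in the database the file lives in):
SELECT pg_filenode_relation(0, 16384);
-- returns the relation as regclass, or NULL if no relation currently uses that filenode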
Greetings Alban, * Alban Hertroys (haramrae@gmail.com) wrote: > In fact, I don't see it mentioned explicitly anywhere, but are we actually looking at a problem? From the discussion in the google barman group, it's clear that the file shouldn't be zero bytes in this specific case. > I'm not convinced by the argument that a CRC check of a 0 byte file on a standby would not detect corruption. At the least, the CRC would be different or we would be left with a CRC on the master that we can't match any CRC's on the slave to if the file is larger on the master. As discussed, files on the primary can be different at a byte level from those on replicas and still be perfectly valid and correct, for a variety of reasons from hint bit differences to differences due to the replica not being at exactly the same point as the primary. > If CRC's can be relied on to detect corruption (which they were designed to do), then that answers Edson's question. The checksums included in PG are page-level and therefore there simply isn't one to look at if the file is zero bytes. Thanks! Stephen
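For completeness, a quick way to confirm whether a cluster was initialized with page-level checksums at all (data_checksums is a read-only setting, so this is safe to run anywhere):
SHOW data_checksums;
-- 'on' only if the cluster was created with initdb --data-checksums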
On Sat, Dec 30, 2017 at 10:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
Greetings Alban,
* Alban Hertroys (haramrae@gmail.com) wrote:
> In fact, I don't see it mentioned explicitly anywhere, but are we actually looking at a problem?
From the discussion in the google barman group, it's clear that the file
shouldn't be zero bytes in this specific case.
> I'm not convinced by the argument that a CRC check of a 0 byte file on a standby would not detect corruption. At the least, the CRC would be different or we would be left with a CRC on the master that we can't match any CRC's on the slave to if the file is larger on the master.
As discussed, files on the primary can be different at a byte level from
those on replicas and still be perfectly valid and correct, for a
variety of reasons from hint bit differences to differences due to the
replica not being at exactly the same point as the primary.
> If CRC's can be relied on to detect corruption (which they were designed to do), then that answers Edson's question.
The checksums included in PG are page-level and therefore there simply
isn't one to look at if the file is zero bytes.
Thanks!
Stephen
Steve Atkins and Stephen Frost,
You both remind me of gawkers at a disaster, pointing out "mistakes" that
first responders are making, but doing nothing yourselves to help.
My query works as designed and has done so for two years. It shows the
filenames for schemas and tables in the database. If you don't like it,
TOO BAD! Shame on you! Use your brains and write your own query.
Nuff said, I shall reply no more.
--
Melvin Davidson
I reserve the right to fantasize. Whether or not you
wish to share my fantasy is entirely up to you.
Greetings, * Melvin Davidson (melvin6925@gmail.com) wrote: > My query works as designed and has done so > for two years. It shows the filenames for schemas and tables in the > database. I'm glad to hear that it works in your specific use-case. Unfortunately, it doesn't work in the general case and therefore isn't a good example. A proper query to return the filename for each user table in the current database is: SELECT quote_ident(nsp.nspname) || '.' || quote_ident(c.relname), s.setting || '/base/' || db.oid || '/' || c.relfilenode FROM pg_settings s JOIN pg_database db on (s.name = 'data_directory') JOIN pg_class c on (datname = current_database()) JOIN pg_namespace nsp on (c.relnamespace = nsp.oid) WHERE relfilenode <> 0 AND nsp.nspname !~ '^pg_' AND nsp.nspname <> 'information_schema'; Note that, as discussed earlier in this thread, this doesn't actually answer what Edson was asking for. Here's the query that would answer his original request: SELECT quote_ident(nsp.nspname) || '.' || quote_ident(c.relname), s.setting || '/base/' || db.oid || '/' || c.relfilenode, (pg_stat_file(s.setting || '/base/' || db.oid || '/' || c.relfilenode)).size as size FROM pg_settings s JOIN pg_database db on (s.name = 'data_directory') JOIN pg_class c on (datname = current_database()) JOIN pg_namespace nsp on (c.relnamespace = nsp.oid) WHERE relfilenode <> 0 AND nsp.nspname !~ '^pg_' AND nsp.nspname <> 'information_schema'; Technically speaking, while these queries are correct for PG10, in prior versions of PostgreSQL it's possible to have user schemas that begin with 'pg_' and therefore the filtering in the WHERE clause would have to be more specific. Note that both of these need to be run as a superuser in older versions of PG. In PostgreSQL 10, a user could be GRANT'd 'pg_read_all_settings' and be able to run the first query. We don't currently support being able to GRANT a non-superuser the ability to run pg_stat_file(), but that will likely be coming in PG 11. Thanks! Stephen
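As a small illustration of the grant mentioned above for PostgreSQL 10 (the role name is hypothetical):
GRANT pg_read_all_settings TO some_monitoring_role;
-- members of pg_read_all_settings can see data_directory and thus run the first query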
On Saturday, 30 December 2017 at 23:06:52, Stephen Frost <sfrost@snowman.net> wrote:
Greetings,
* Melvin Davidson (melvin6925@gmail.com) wrote:
> My query works as designed and has done so
> for two years. It shows the filenames for schemas and tables in the
> database.
I'm glad to hear that it works in your specific use-case.
Unfortunately, it doesn't work in the general case and therefore isn't a
good example. A proper query to return the filename for each user table
in the current database is:
SELECT
quote_ident(nsp.nspname) || '.' || quote_ident(c.relname),
s.setting || '/base/' || db.oid || '/' || c.relfilenode
FROM
pg_settings s
JOIN pg_database db on (s.name = 'data_directory')
JOIN pg_class c on (datname = current_database())
JOIN pg_namespace nsp on (c.relnamespace = nsp.oid)
WHERE
relfilenode <> 0
AND nsp.nspname !~ '^pg_'
AND nsp.nspname <> 'information_schema';
Note that, as discussed earlier in this thread, this doesn't actually
answer what Edson was asking for. Here's the query that would answer
his original request:
SELECT
quote_ident(nsp.nspname) || '.' || quote_ident(c.relname),
s.setting || '/base/' || db.oid || '/' || c.relfilenode,
(pg_stat_file(s.setting || '/base/' || db.oid || '/' || c.relfilenode)).size as size
FROM
pg_settings s
JOIN pg_database db on (s.name = 'data_directory')
JOIN pg_class c on (datname = current_database())
JOIN pg_namespace nsp on (c.relnamespace = nsp.oid)
WHERE
relfilenode <> 0
AND nsp.nspname !~ '^pg_'
AND nsp.nspname <> 'information_schema';
Technically speaking, while these queries are correct for PG10, in prior
versions of PostgreSQL it's possible to have user schemas that begin
with 'pg_' and therefore the filtering in the WHERE clause would have to
be more specific.
Note that both of these need to be run as a superuser in older versions
of PG. In PostgreSQL 10, a user could be GRANT'd 'pg_read_all_settings'
and be able to run the first query. We don't currently support being
able to GRANT a non-superuser the ability to run pg_stat_file(), but
that will likely be coming in PG 11.
Thanks!
That doesn't seem to work with custom types:
andreak@[local]:5433 10.1 andreak=# SELECT
quote_ident(nsp.nspname) || '.' || quote_ident(c.relname),
s.setting || '/base/' || db.oid || '/' || c.relfilenode,
(pg_stat_file(s.setting || '/base/' || db.oid || '/' || c.relfilenode)).size as size
FROM
pg_settings s
JOIN pg_database db on (s.name = 'data_directory')
JOIN pg_class c on (datname = current_database())
JOIN pg_namespace nsp on (c.relnamespace = nsp.oid)
WHERE
relfilenode <> 0
AND nsp.nspname !~ '^pg_'
AND nsp.nspname <> 'information_schema';
ERROR: could not stat file "/home/andreak/programs/postgresql-10/data/base/22039391/22039392": No such file or directory
│ public.biginttuple2 │ /home/andreak/programs/postgresql-10/data/base/22039391/22039392 │
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
* Andreas Joseph Krogh (andreas@visena.com) wrote: > SELECT > quote_ident(nsp.nspname) || '.' || quote_ident(c.relname), > s.setting || '/base/' || db.oid || '/' || c.relfilenode, > (pg_stat_file(s.setting || '/base/' || db.oid || '/' || > c.relfilenode)).size as size > FROM > pg_settings s > JOIN pg_database db on (s.name = 'data_directory') > JOIN pg_class c on (datname = current_database()) > JOIN pg_namespace nsp on (c.relnamespace = nsp.oid) > WHERE > relfilenode <> 0 > AND nsp.nspname !~ '^pg_' > AND nsp.nspname <> 'information_schema'; > > Technically speaking, while these queries are correct for PG10, in prior > versions of PostgreSQL it's possible to have user schemas that begin > with 'pg_' and therefore the filtering in the WHERE clause would have to > be more specific. > > Note that both of these need to be run as a superuser in older versions > of PG. In PostgreSQL 10, a user could be GRANT'd 'pg_read_all_settings' > and be able to run the first query. We don't currently support being > able to GRANT a non-superuser the ability to run pg_stat_file(), but > that will likely be coming in PG 11. > > Thanks! > > That doesn't seem to work with custom types: Nothing in this query referred to types, so I'm not sure what custom types would have to do with it..? > andreak@[local]:5433 10.1 andreak=# SELECT > quote_ident(nsp.nspname) || '.' || quote_ident(c.relname), > s.setting || '/base/' || db.oid || '/' || c.relfilenode, > (pg_stat_file(s.setting || '/base/' || db.oid || '/' || c.relfilenode)).size > as size > FROM > pg_settings s > JOIN pg_database db on (s.name = 'data_directory') > JOIN pg_class c on (datname = current_database()) > JOIN pg_namespace nsp on (c.relnamespace = nsp.oid) > WHERE > relfilenode <> 0 > AND nsp.nspname !~ '^pg_' > AND nsp.nspname <> 'information_schema'; > ERROR: could not stat file > "/home/andreak/programs/postgresql-10/data/base/22039391/22039392": No such > file or directory > > │ public.biginttuple2 │ > /home/andreak/programs/postgresql-10/data/base/22039391/22039392 │ Considering this is saying 'no such file or directory', I'm guessing that somehow your data directory isn't what is listed in pg_settings..? Alternatively, perhaps that table was concurrently dropped? Are you able to provide any specifics about your system? Does the database directory exist? Does that path look reasonable? I find it kind of interesting that the OID of the database and the relfilenode are so close together- exactly what did you do to test this query? Thanks! Stephen
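A quick way to check the first possibility raised above, i.e. whether the server's actual data directory matches what the query pulled from pg_settings:
SHOW data_directory;
-- compare this against the directory prefix in the failing path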
On Sunday, 31 December 2017 at 00:49:31, Stephen Frost <sfrost@snowman.net> wrote:
* Andreas Joseph Krogh (andreas@visena.com) wrote:
> SELECT
> quote_ident(nsp.nspname) || '.' || quote_ident(c.relname),
> s.setting || '/base/' || db.oid || '/' || c.relfilenode,
> (pg_stat_file(s.setting || '/base/' || db.oid || '/' ||
> c.relfilenode)).size as size
> FROM
> pg_settings s
> JOIN pg_database db on (s.name = 'data_directory')
> JOIN pg_class c on (datname = current_database())
> JOIN pg_namespace nsp on (c.relnamespace = nsp.oid)
> WHERE
> relfilenode <> 0
> AND nsp.nspname !~ '^pg_'
> AND nsp.nspname <> 'information_schema';
>
> Technically speaking, while these queries are correct for PG10, in prior
> versions of PostgreSQL it's possible to have user schemas that begin
> with 'pg_' and therefore the filtering in the WHERE clause would have to
> be more specific.
>
> Note that both of these need to be run as a superuser in older versions
> of PG. In PostgreSQL 10, a user could be GRANT'd 'pg_read_all_settings'
> and be able to run the first query. We don't currently support being
> able to GRANT a non-superuser the ability to run pg_stat_file(), but
> that will likely be coming in PG 11.
>
> Thanks!
>
> That doesn't seem to work with custom types:
Nothing in this query referred to types, so I'm not sure what custom
types would have to do with it..?
> andreak@[local]:5433 10.1 andreak=# SELECT
> quote_ident(nsp.nspname) || '.' || quote_ident(c.relname),
> s.setting || '/base/' || db.oid || '/' || c.relfilenode,
> (pg_stat_file(s.setting || '/base/' || db.oid || '/' || c.relfilenode)).size
> as size
> FROM
> pg_settings s
> JOIN pg_database db on (s.name = 'data_directory')
> JOIN pg_class c on (datname = current_database())
> JOIN pg_namespace nsp on (c.relnamespace = nsp.oid)
> WHERE
> relfilenode <> 0
> AND nsp.nspname !~ '^pg_'
> AND nsp.nspname <> 'information_schema';
> ERROR: could not stat file
> "/home/andreak/programs/postgresql-10/data/base/22039391/22039392": No such
> file or directory
>
> │ public.biginttuple2 │
> /home/andreak/programs/postgresql-10/data/base/22039391/22039392 │
Considering this is saying 'no such file or directory', I'm guessing
that somehow your data directory isn't what is listed in pg_settings..?
Alternatively, perhaps that table was concurrently dropped?
Are you able to provide any specifics about your system? Does the
database directory exist? Does that path look reasonable? I find it
kind of interesting that the OID of the database and the relfilenode are
so close together- exactly what did you do to test this query?
Here's a simple test-case:
createdb test
test=# create table foo(id serial primary key, name varchar not null);
test=# CREATE TYPE BigIntTuple2 AS (f1 bigint, f2 bigint);
SELECT
quote_ident(nsp.nspname) || '.' || quote_ident(c.relname),
s.setting || '/base/' || db.oid || '/' || c.relfilenode
FROM
pg_settings s
JOIN pg_database db on (s.name = 'data_directory')
JOIN pg_class c on (datname = current_database())
JOIN pg_namespace nsp on (c.relnamespace = nsp.oid)
WHERE
relfilenode <> 0
AND nsp.nspname !~ '^pg_'
AND nsp.nspname <> 'information_schema';
┌─────────────────────┬──────────────────────────────────────────────────────────────────┐
│ ?column? │ ?column? │
├─────────────────────┼──────────────────────────────────────────────────────────────────┤
│ public.foo_id_seq │ /home/andreak/programs/postgresql-10/data/base/22058766/22058767 │
│ public.foo │ /home/andreak/programs/postgresql-10/data/base/22058766/22058769 │
│ public.foo_pkey │ /home/andreak/programs/postgresql-10/data/base/22058766/22058776 │
│ public.biginttuple2 │ /home/andreak/programs/postgresql-10/data/base/22058766/22058778 │
└─────────────────────┴──────────────────────────────────────────────────────────────────┘
SELECT
quote_ident(nsp.nspname) || '.' || quote_ident(c.relname),
s.setting || '/base/' || db.oid || '/' || c.relfilenode,
(pg_stat_file(s.setting || '/base/' || db.oid || '/' || c.relfilenode)).size as size
FROM
pg_settings s
JOIN pg_database db on (s.name = 'data_directory')
JOIN pg_class c on (datname = current_database())
JOIN pg_namespace nsp on (c.relnamespace = nsp.oid)
WHERE
relfilenode <> 0
AND nsp.nspname !~ '^pg_'
AND nsp.nspname <> 'information_schema';
ERROR: could not stat file "/home/andreak/programs/postgresql-10/data/base/22058766/22058778": No such file or directory
$ file /home/andreak/programs/postgresql-10/data/base/22058766/22058776
/home/andreak/programs/postgresql-10/data/base/22058766/22058776: lif file
--
Andreas Joseph Krogh
CTO / Partner - Visena AS
Mobile: +47 909 56 963
Greetings Andreas, * Andreas Joseph Krogh (andreas@visena.com) wrote: > Here's a simple test-case: > > createdb test > test=# create table foo(id serial primary key, name varchar not null); > test=# CREATE TYPE BigIntTuple2 AS (f1 bigint, f2 bigint); Hah, appears we allocate a relfilenode to types too, though it's not entirely clear to me why (evidently, they're not used..). Anyhow, adding a filter on relkind addresses it (though it's not very future-proof, unfortunately): SELECT quote_ident(nsp.nspname) || '.' || quote_ident(c.relname), s.setting || '/base/' || db.oid || '/' || c.relfilenode, (pg_stat_file(s.setting || '/base/' || db.oid || '/' || c.relfilenode)).size as size FROM pg_settings s JOIN pg_database db on (s.name = 'data_directory') JOIN pg_class c on (datname = current_database()) JOIN pg_namespace nsp on (c.relnamespace = nsp.oid) WHERE relfilenode <> 0 AND nsp.nspname !~ '^pg_' AND nsp.nspname <> 'information_schema' AND c.relkind in ('r','i','m','p'); Thanks! Stephen
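For what it's worth, a variant of the same idea that leans on the built-in path helper rather than assembling the path by hand is sketched below; it keeps the relkind filter (composite types have a pg_class entry with relkind 'c' and nothing on disk, which is what tripped the earlier run), and pg_relation_filepath() also copes with relations placed in non-default tablespaces:
SELECT n.nspname, c.relname,
       pg_relation_filepath(c.oid::regclass) as path,
       (pg_stat_file(pg_relation_filepath(c.oid::regclass))).size as size
  FROM pg_class c
  JOIN pg_namespace n ON (n.oid = c.relnamespace)
 WHERE c.relfilenode <> 0
   AND n.nspname !~ '^pg_'
   AND n.nspname <> 'information_schema'
   AND c.relkind IN ('r', 'i', 'm');
-- pg_relation_filepath() returns a path relative to the data directory, which
-- pg_stat_file() accepts; add 'S' to the relkind list to cover sequences as well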
On Sat, Dec 30, 2017 at 10:27 AM, Stephen Frost <sfrost@snowman.net> wrote:
The checksums included in PG are page-level and therefore there simply
isn't one to look at if the file is zero bytes.
And even if the file wasn't zero bytes you can't tell from the per page CRCs if you have all the pages you should have. You could have extra pages that aren't supposed to be there or missing some (or any mix of the two). A per page CRC is useless for those cases.
Regards, Jan
Jan Wieck
Senior Postgres Architect
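On that point, one rough heuristic that needs no file access at all is to compare the page count recorded at the last VACUUM/ANALYZE with the current on-disk size; this is only a sketch and relpages is merely an estimate, but a relation that was known to be large and now reports zero pages on disk is worth a closer look:
SELECT c.oid::regclass AS relation,
       c.relpages AS pages_at_last_vacuum,
       pg_relation_size(c.oid::regclass) / current_setting('block_size')::int AS pages_on_disk_now
  FROM pg_class c
  JOIN pg_namespace n ON (n.oid = c.relnamespace)
 WHERE c.relkind = 'r'
   AND c.relfilenode <> 0
   AND n.nspname !~ '^pg_'
   AND n.nspname <> 'information_schema'
 ORDER BY c.relpages DESC;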
Jan, all, * Jan Wieck (jan@wi3ck.info) wrote: > On Sat, Dec 30, 2017 at 10:27 AM, Stephen Frost <sfrost@snowman.net> wrote: > > The checksums included in PG are page-level and therefore there simply > > isn't one to look at if the file is zero bytes. > > And even if the file wasn't zero bytes you can't tell from the per page > CRCs if you have all the pages you should have. You could have extra pages > that aren't supposed to be there or missing some (or any mix of the two). > A per page CRC is useless for those cases. Right, which is why it makes sense to have whole-file checksums when doing things like backups, when you're already reading all of the file and can grab a checksum in stream, to detect if anything bad happens to the backed-up file after the backup has completed. It's unclear if that would have helped here or not since we don't know when the file ended up being zero'd out, as I understand it. Thanks! Stephen