Thanks Tomas for the explanation.
This all still points to a bug then, with WAL being affected, and
possibly the processing of 'Planet' data in the secondary database that
did not show an issue, being responsible for the enormous amount of WAL
being written due to checkpoint failures. Although the latter is still
speculation, it may be caused by another as of yet to determine cause.
To be absolutely clear about this: I have never seen so much WAL being
generated with Planet processing in ordinary situations when this issue
doesn't rear its head. In fact, I am currently processing Planet again,
and the total size of files on "file system root" has not been over
about 250GB, with still 750GB free space. This is what I see under
ordinary situations.
So I am also pretty much convinced there is no misconfiguration of the
WAL settings in my 'posgresql.conf' file.
Any suggestions for a further course of action, or do you people have at
least enough info for now to give it a first try to find out what might
be wrong?
Marco
Op 25-7-2022 om 15:39 schreef Tomas Vondra:
>
> On 7/25/22 08:04, Marco Boeringa wrote:
>> To extend on this, two interesting questions that come to mind are:
>>
>> - Does running SELECT COUNT(*) create WAL?
>>
> Yes. An obvious example is updating the visibility map (which is always
> logged to WAL) or hint bits (which may be WAL logged). I'd also bet we
> may generate WAL for indexes, e.g. to kill deleted tuples.
>
>> - Is it potentially conceivable that there is a kind of cross-database
>> vulnerability *within one and the same PostgreSQL cluster*, where an
>> issue in one database causes the WAL in another database to no longer
>> successfully be written to disk during checkpoints? I have never seen
>> processing errors where PostgreSQL emitted true PostgreSQL errors with
>> error numbers cause issues like that and affect a second database in the
>> same cluster, but since no error is generated here, and there might be
>> some uncatched error, I wonder?
>>
>> I am especially asking the second question since, although I wrote there
>> is no edit activity going on potentially generating WAL in the affected
>> small database, which is true, there *was* processing on Planet sized
>> data going on in a second database in the same cluster. That certainly
>> *is* capable of generating 890GB of WAL if nothing is cleaned up during
>> checkpoints due to checkpoints failing.
>>
> WAL is a resource shared by all the databases in the cluster, so if that
> gets broken it's broken for everyone.
>
>
> regards
>