Thread: Issue with WAL logs temporarily not replaying

Issue with WAL logs temporarily not replaying

From
Eugene Pirogov
Date:
Hi,

We've experienced a DB incident related to WAL. The issue is now somewhat resolved, and I'm trying to understand what could have caused the problem in the first place. Below is a full chronicle of the incident we ran into.

At some point we realized that almost no new data was coming into the DB during the day. We quickly investigated and found the following:
  • our application was still writing data, unaware of any issues on the DB end, as if nothing had happened at all,
  • the database was accepting connections (we were able to connect and query),
  • the database was writing regular text logs,
  • the database was not writing new data to the tables,
  • the logical DB replica actually had the expected records replicated and present in the replica tables,
  • so we assumed the original source WAL data existed and was correct (given that the logical replica relies on WAL).
We tried to restart the DB, but it would not come up. Instead, a few core dumps in the postgres data folder were generated during our attempts.

This is what we saw in the postgres logs:

LOG:  invalid resource manager ID 112 at 3/89005B10
LOG:  invalid primary checkpoint record
LOG:  invalid resource manager ID 138 at 3/89005208
LOG:  invalid secondary checkpoint record
PANIC:  could not locate a valid checkpoint record

To work around this we used pg_resetxlog. After running the pg_resetxlog command, the DB successfully started and slowly began recovering the data from WAL, automatically. During the recovery process, we ran VACUUM ANALYSE a couple of times, which seems to have somehow sped the recovery up (though that could have been confirmation bias).
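
(For context: pg_resetxlog operates on the data directory. A minimal invocation looks roughly like the lines below; the path is just an example, and these are not necessarily the exact flags we used.)

pg_resetxlog /var/lib/postgresql/data
# or, if the server was not shut down cleanly and pg_resetxlog refuses to proceed:
pg_resetxlog -f /var/lib/postgresql/data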

The DB was fully recovered in 30 minutes.

Upon closer inspection, however, we realized we had lost two records from a particular table. The records had been replicated to the logical replica though, and further inspection showed their insertion timestamps fell sometime during the day, long after the database supposedly stopped applying new data from WAL to tables.

We don't know why these two records were gone (no DELETE statements ever happen in the table in question). Luckily, the records existed in the replica table.

The logs for the entire day of the incident are available here: https://gist.githubusercontent.com/gmile/6df9ce393b11fb369856b84c4e9f755a/raw/5ab86c10332f9aef01abf16cb4989334bc7b02a2/postgresql.log (1.6 MB). Each log line is prefixed with a timestamp. We also have the core dumps that were generated during the attempts to restart the database.

My questions are:
  • what would be some typical reasons for primary/secondary WAL checkpoints to fail? Disk issues? File system issues? OS issues? I'm running out of ideas here.

    Is there any way I could separate the primary and secondary checkpoints in space, so as to increase Postgres's chances of restoring one of them?

  • how come PostgreSQL would stop applying WAL logs to tables, and yet keep silent? There was no indication that WAL was not being applied; the postgres log stayed silent (even though it kept writing other routine messages, such as "incomplete startup packet" and malformed query attempts).

    Due to this, we noticed the issue only hours after postgres supposedly stopped applying the WAL logs to tables.

  • Is there any other log (like some lower-level system log?) where I could see the status of writing / reading / applying WAL segments? (Some concrete checks I already know of are sketched right after this list.)

  • What measures could be done to prevent this in future, or at least make the situation easier to recover from (e.g. without data loss)?
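
To make the question about seeing WAL status a bit more concrete, these are the kinds of checks I already know how to run (the function and view names are the 9.6 ones; connection parameters are placeholders), though they only show current WAL positions and replication progress rather than a history:

psql -U postgres -c "SELECT pg_current_xlog_location(), pg_current_xlog_insert_location();"
psql -U postgres -c "SELECT application_name, state, sent_location, replay_location FROM pg_stat_replication;"
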
Thank you in advance.

Re: Issue with WAL logs temporarily not replaying

From
Tomas Vondra
Date:
Hi,

unfortunately your message is missing some pretty important information,
like for example which OS you're running, what PostgreSQL version are
you using (and if you built it on your own or installed from some
package repository).

Also, you mention replicas. What type of replication are you using? The
built-in streaming replication, or something else?

On 01/31/2018 06:56 PM, Eugene Pirogov wrote:
> Hi,
> 
> We've experienced a DB incident related to WAL. The issue is now
> somewhat resolved, and I'm trying to understand what could have caused
> the problem to occur in the first place. Below is a full chronicle of an
> incident we've ran into.
> 
> At some point we realized almost no new data was coming in the DB during
> the day. We quickly looked and realized the following:
> 
>   * our application was still writing data, unaware that there any
>     issues on the DB end, like if nothing happened at all,
>   * the database was accepting connections (we were able to connect and
>     query),
>   * the database was writing regular text logs,
>   * the database was not writing new data to the tables,
>   * the logical DB replica actually had the expected records replicated
>     and present in the replica tables,
>   * so we made assumption original source WAL data existed and was
>     correct (given the fact logical DB is relying on WAL).
> 
> We tried to restart the DB, but it would not come up. Instead, a few
> core dumps in the postgres data folder were generated during our attempts.
> 

Please inspect the core dumps using gdb and share the backtraces. There
are instructions how to do that on the wiki:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD
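
In short, it boils down to something like this (the binary and core file paths are just examples, adjust them to your installation):

gdb /usr/local/bin/postgres /var/lib/postgresql/data/core.1234
# then, at the (gdb) prompt:
bt full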

> This is what we saw in the postgres logs:
> 
>     LOG:  invalid resource manager ID 112 at 3/89005B10
>     LOG:  invalid primary checkpoint record
>     LOG:  invalid resource manager ID 138 at 3/89005208
>     LOG:  invalid secondary checkpoint record
>     PANIC:  could not locate a valid checkpoint record
> 
> 
> To aid this we used pg_resetxlog. After running the pg_resetxlog
> command, the DB successfully started and slowly began recovering the
> data from WAL, automatically. During the recovery process, we ran VACUUM
> ANALYSE couple of time, which seems to have somehow aided the speed of
> recovery process (though I could have been a confirmation bias).
> 

Firstly, I very much doubt using pg_resetxlog is a good idea, especially
when it's rather unclear what was the actual root cause.

Secondly, I don't quite understand what you mean by "recovery" here. The
regular recovery process (say, after a crash) makes it impossible to
connect to the database, so I'm not sure what you mean by running vacuum
analyze (not to mention how it could have improved the recovery speed).

> The DB was fully recovered in 30 minutes.
> 
> Upon closer inspection, however, we realized we've lost two records from
> a particular table. The records were replicated to logical replica
> though, and further inspection showed their insertion timestamp sometime
> during the day, long after the database supposedly stopped applying new
> data from WAL to tables.
> 
> We don't know the reason why these two records were gone (no DELETE
> statements ever happen in the table in question). The records in
> question existed in replica table, luckly.
> 

Well, considering you did pg_resetxlog ... I'm surprised you only lost
two records, but it's a question what state the indexes are in, etc.

> The logs for the entire day when the incident happened are available
> here
>
<https://gist.githubusercontent.com/gmile/6df9ce393b11fb369856b84c4e9f755a/raw/5ab86c10332f9aef01abf16cb4989334bc7b02a2/postgresql.log> (1.6
> Mb). Each log line was prepended with a timestamp. Also, we have the
> core dumps that were generated during attempts to restart the database.
> 

Man, that's an extremely noisy log file. There are errors caused by
broken pg_hba.conf entries and so on, which makes it extremely difficult
to look for possible other errors ... I'm not going to waste time by
doing that.

> My questions are:
> 
>   * what would be some typical reasons for primary/secondary WAL
>     checkpoints to fail? Disk issues? File system issues? OS issues? I'm
>     running out of ideas here.
> 
>     Is there any way I could separate primary and secondary checkpoints
>     in space, so to increase postgres's chances to restore one of them?

I don't know. You have not provided any information about the storage
system, and we don't have access to system logs that might show if
something happened at this level. But in general yes, a storage issue
might cause this.

> 
>   * how come postgresql would stop applying WAL logs to tables, and yet
>     keep silence? There was no indication WAL is not being applied,
>     postgres log would be silent (even though it kept writing other
>     commodity messages, such as "incomplete startup packet" and
>     malformed query attempts).
> 
>     Due to this, we noticed the issue only hours after postgres
>     supposedly stopped applying the WAL logs to tables.
> 

It wouldn't. But it's unclear how you came to that conclusion.

>   * Is there any other log (like some lower level system log?) where I
>     could see the status of writing / reading / applying WAL segments?
> 

There's only one log for PostgreSQL, so it would have to be in that file
depending on the log level.
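
If you need more detail about checkpoints and recovery in that one log, you can increase the relevant settings, for example (these are ordinary postgresql.conf settings, shown here via ALTER SYSTEM only as an illustration):

psql -U postgres -c "ALTER SYSTEM SET log_checkpoints = on;"
psql -U postgres -c "ALTER SYSTEM SET log_min_messages = 'debug1';"
psql -U postgres -c "SELECT pg_reload_conf();"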

>   * What measures could be done to prevent this in future, or at least
>     make the situation easier to recover from (e.g. without data loss)?
> 

The jury is still out on the root cause, so it's too soon to say.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Issue with WAL logs temporarily not replaying

From
Adrian Klaver
Date:
On 01/31/2018 09:56 AM, Eugene Pirogov wrote:
> Hi,
> 
> We've experienced a DB incident related to WAL. The issue is now 
> somewhat resolved, and I'm trying to understand what could have caused 
> the problem to occur in the first place. Below is a full chronicle of an 
> incident we've ran into.
> 
> At some point we realized almost no new data was coming in the DB during 
> the day. We quickly looked and realized the following:
> 
>   * our application was still writing data, unaware that there any
>     issues on the DB end, like if nothing happened at all,
>   * the database was accepting connections (we were able to connect and
>     query),
>   * the database was writing regular text logs,
>   * the database was not writing new data to the tables,
>   * the logical DB replica actually had the expected records replicated
>     and present in the replica tables,
>   * so we made assumption original source WAL data existed and was
>     correct (given the fact logical DB is relying on WAL).
> 
> We tried to restart the DB, but it would not come up. Instead, a few 
> core dumps in the postgres data folder were generated during our attempts.
> 
> This is what we saw in the postgres logs:
> 
>     LOG:  invalid resource manager ID 112 at 3/89005B10
>     LOG:  invalid primary checkpoint record
>     LOG:  invalid resource manager ID 138 at 3/89005208
>     LOG:  invalid secondary checkpoint record
>     PANIC:  could not locate a valid checkpoint record
> 
> 
> To aid this we used pg_resetxlog. After running the pg_resetxlog 
> command, the DB successfully started and slowly began recovering the 
> data from WAL, automatically. During the recovery process, we ran VACUUM 
> ANALYSE couple of time, which seems to have somehow aided the speed of 
> recovery process (though I could have been a confirmation bias).
> 
> The DB was fully recovered in 30 minutes.
> 
> Upon closer inspection, however, we realized we've lost two records from 
> a particular table. The records were replicated to logical replica 
> though, and further inspection showed their insertion timestamp sometime 
> during the day, long after the database supposedly stopped applying new 
> data from WAL to tables.
> 
> We don't know the reason why these two records were gone (no DELETE 
> statements ever happen in the table in question). The records in 
> question existed in replica table, luckly.
> 
> The logs for the entire day when the incident happened are available 
> here 
>
<https://gist.githubusercontent.com/gmile/6df9ce393b11fb369856b84c4e9f755a/raw/5ab86c10332f9aef01abf16cb4989334bc7b02a2/postgresql.log> (1.6

> Mb). Each log line was prepended with a timestamp. Also, we have the 
> core dumps that were generated during attempts to restart the database.

What is the role of the user 'db'?

Any idea what is causing this?:

WARNING:  concurrent insert in progress within table "audit_log"
...repeated... many times

WARNING:  concurrent insert in progress within table 
"declarations_status_hstr"
...repeated many times

Are either of these tables the one that had the two missing records?

> 
> My questions are:
> 
>   * what would be some typical reasons for primary/secondary WAL
>     checkpoints to fail? Disk issues? File system issues? OS issues? I'm
>     running out of ideas here.
> 
>     Is there any way I could separate primary and secondary checkpoints
>     in space, so to increase postgres's chances to restore one of them?
> 
>   * how come postgresql would stop applying WAL logs to tables, and yet
>     keep silence? There was no indication WAL is not being applied,
>     postgres log would be silent (even though it kept writing other
>     commodity messages, such as "incomplete startup packet" and
>     malformed query attempts).
> 
>     Due to this, we noticed the issue only hours after postgres
>     supposedly stopped applying the WAL logs to tables.
> 
>   * Is there any other log (like some lower level system log?) where I
>     could see the status of writing / reading / applying WAL segments?
> 
>   * What measures could be done to prevent this in future, or at least
>     make the situation easier to recover from (e.g. without data loss)?
> 
> Thank you in advance.
> 
> -- 
> Eugene
> 
> http://www.gmile.me


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Issue with WAL logs temporarily not replaying

From
Eugene Pirogov
Date:
Tomas,

Indeed, I haven't put enough detail in the original post. That is a fair point, thank you. Let me follow up on each of your comments.

unfortunately your message is missing some pretty important information, like for example which OS you're running, what PostgreSQL version are you using (and if you built it on your own or installed from some package repository).

We are using a Docker container based on Alpine 3.5. It has a custom-built PostgreSQL 9.6.3 installation. The customization is negligible, i.e. we did not alter the configure and make process in any way; we only included the pglogical extension. Here are the sources for our image: https://github.com/edenlabllc/alpine-postgre/blob/master/pglogical/Dockerfile. The Dockerfile is based on the "official" postgres 9.6.3 image from the Docker library: https://github.com/docker-library/postgres/tree/a554d043a3b77937120d325b30fefaad2e3be12d/9.6/alpine.

Also, you mention replicas. What type of replication are you using? The built-in streaming replication, or something else?

We use logical replication. We are interested in replicating selected columns from tables across several databases in our Kubernetes cluster. For this we are using pglogical REL2.0.1, built with a patch that allows it to compile under Alpine. Here's the patch: https://github.com/alpinelinux/aports/commit/f4f856564dfb2589c1800bd51322f4eadfc67cf1
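
For illustration, the column selection happens on the provider side, roughly like this (the set, table and column names below are invented, not our real schema):

psql -d source_db -c "SELECT pglogical.create_replication_set('partial_set');"
psql -d source_db -c "SELECT pglogical.replication_set_add_table('partial_set', 'public.some_table', true, columns := ARRAY['id','status','inserted_at']);"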

> Firstly, I very much doubt using pg_resetxlog is a good idea, especially when it's rather unclear what was the actual root cause.

Noted. In hindsight I think our best course of action would have been to make a copy of the PGDATA directory first.

Secondly, I don't quite understand what you mean by "recovery" here. The regular recovery process (say, after a crash) makes it impossible to connect to the database, so I'm not sure what you mean by running vacuum analyze (not to mention how it could have improved the recovery speed).

By recovery I mean a sort of self-healing of the running database. Let me take a small step back and explain.

After several restart attempts, the DB still would not come up. Instead it kept complaining about being unable to "locate a valid checkpoint record". After we ran pg_resetxlog and tried to start the database again, the "locate a valid checkpoint record" error was gone, and we ended up with a 100% working database.

From this point onwards we began issuing SELECT statements, once every few minutes, to see if the database could see the missing records. The records indeed began to slowly appear in our tables. I _think_ this was happening because the database began reading them from WAL in the background. Again – at this point the database was (or seemed to be) fully operational. It was just catching up on data (this is my impression).

It seems as if pg_resetxlog removed some WAL segment (hence the mentioned "PANIC:  could not locate a valid checkpoint record" log), which initially caused the database to get stuck and not read further WAL segments.

Well, considering you did pg_resetxlog ... I'm surprised you only lost two records, but it's a question what state are the indexes in etc.

I never considered that indexes might play a part in the issue. Noted on that, thank you.

Man, that's an extremely noisy log file. There are errors caused by broken pg_hba.conf entries and so on, which makes it extremely difficult to look for possible other errors ... I'm not going to waste time by doing that.

Ugh, sorry about that. I wanted to provide a vanilla version of the log, i.e. "as is". Below is an updated variant. I removed duplicate lines. The log file went down from 14794 to 329 lines:

https://gist.githubusercontent.com/gmile/6df9ce393b11fb369856b84c4e9f755a/raw/7779c8341482d8df39257ab033e809b187227389/postgresql-short.log

The most interesting stuff, in my opinion, is around the PANIC lines. Also, there are a lot of the following lines:

ERROR:  requested WAL segment 000000010000000300000089 has already been removed

I have yet to find out why these happened and what they mean.
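
Two checks I intend to do around this (the data directory path below is an assumption): whether that segment file still physically exists under pg_xlog, and what restart_lsn our replication slots currently hold:

ls /var/lib/postgresql/data/pg_xlog/ | grep 000000010000000300000089
psql -U postgres -c "SELECT slot_name, plugin, active, restart_lsn FROM pg_replication_slots;"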

For the record, these are the lines that I removed from the log:

sed -i '' '/LOG:  incomplete startup packet/d' postgres.log
sed -i '' '/FATAL:  password authentication failed for user "db"/d' postgres.log
sed -i '' '/DETAIL:  Password does not match for user "db"./d' postgres.log
sed -i '' '/Connection matched pg_hba.conf line 95: "host all all all md5"/d' postgres.log
sed -i '' '/DETAIL:  Password does not match for user "pghoard"/d' postgres.log
sed -i '' '/FATAL:  password authentication failed for user "pghoard"/d' postgres.log
sed -i '' '/WARNING:  concurrent insert in progress within table "declarations_status_hstr"/d' postgres.log
sed -i '' '/WARNING:  concurrent insert in progress within table "audit_log"/d' postgres.log

I don't know. You have not provided any information about the storage system, and we don't have access to system logs that might show if something happened at this level. But in general yes, a storage issue might cause this.

We use volumes provided by Kubernetes which, in turn, are disks from Google Compute Engine. I don't know the exact characteristics of the SSDs. Would it make sense to try to dig into the SSD stats?

The file system used by the PGDATA folder is ext4.
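
To check for I/O trouble on the node itself I'm planning to start with the kernel messages, e.g. (the exact log location on the GKE node is a guess on my part):

dmesg | grep -iE 'i/o error|ext4'
grep -iE 'i/o error|ext4' /var/log/kern.log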

It wouldn't. But it's unclear how you came to that conclusion.

My logic is the following:
  1. db is up and responding to write requests,
  2. the data is not actually seen in the tables,
  3. the data is actually seen in a logical replica,
  4. we restart db, it refuses to start and raises an error,
  5. we mitigate the error,
  6. db restarts successfully,
  7. data slowly restores "by itself".
Given that the replica had good data, and since by design the replica takes its data from the original DB's WAL... we concluded that all inserted data made it to the WAL but no further, i.e. not into the actual tables. Again: since the records were missing from tables in the original database, somehow PostgreSQL was not applying data from WAL to tables.

It all looked as if some code in PostgreSQL that performs the "now, move data from WAL to table" action had gotten stuck, while at the same time the worker processes that deal with DB connections and queries continued to operate normally. Ultimately pg_resetxlog would have unstuck the "now, move data from WAL to table" code.

I may be totally making things up here. This is only my impression based on observation; I'd very much like to be corrected.

There's only one log for PostgreSQL, so it would have to be in that file depending on the log level.

Got it.

Please inspect the core dumps using gdb and share the backtraces. There are instructions how to do that on the wiki

I'm going to try and extract backtraces from the core dumps. I don't do this on a daily basis, so it will take some time. I will come back with results when I have something.

On Wed, Jan 31, 2018 at 8:18 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
Hi,

unfortunately your message is missing some pretty important information,
like for example which OS you're running, what PostgreSQL version are
you using (and if you built it on your own or installed from some
package repository).

Also, you mention replicas. What type of replication are you using? The
built-in streaming replication, or something else?

On 01/31/2018 06:56 PM, Eugene Pirogov wrote:
> Hi,
>
> We've experienced a DB incident related to WAL. The issue is now
> somewhat resolved, and I'm trying to understand what could have caused
> the problem to occur in the first place. Below is a full chronicle of an
> incident we've ran into.
>
> At some point we realized almost no new data was coming in the DB during
> the day. We quickly looked and realized the following:
>
>   * our application was still writing data, unaware that there any
>     issues on the DB end, like if nothing happened at all,
>   * the database was accepting connections (we were able to connect and
>     query),
>   * the database was writing regular text logs,
>   * the database was not writing new data to the tables,
>   * the logical DB replica actually had the expected records replicated
>     and present in the replica tables,
>   * so we made assumption original source WAL data existed and was
>     correct (given the fact logical DB is relying on WAL).
>
> We tried to restart the DB, but it would not come up. Instead, a few
> core dumps in the postgres data folder were generated during our attempts.
>

Please inspect the core dumps using gdb and share the backtraces. There
are instructions how to do that on the wiki:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

> This is what we saw in the postgres logs:
>
>     LOG:  invalid resource manager ID 112 at 3/89005B10
>     LOG:  invalid primary checkpoint record
>     LOG:  invalid resource manager ID 138 at 3/89005208
>     LOG:  invalid secondary checkpoint record
>     PANIC:  could not locate a valid checkpoint record
>
>
> To aid this we used pg_resetxlog. After running the pg_resetxlog
> command, the DB successfully started and slowly began recovering the
> data from WAL, automatically. During the recovery process, we ran VACUUM
> ANALYSE couple of time, which seems to have somehow aided the speed of
> recovery process (though I could have been a confirmation bias).
>

Firstly, I very much doubt using pg_resetxlog is a good idea, especially
when it's rather unclear what was the actual root cause.

Secondly, I don't quite understand what you mean by "recovery" here. The
regular recovery process (say, after a crash) makes it impossible to
connect to the database, so I'm not sure what you mean by running vacuum
analyze (not to mention how it could have improved the recovery speed).

> The DB was fully recovered in 30 minutes.
>
> Upon closer inspection, however, we realized we've lost two records from
> a particular table. The records were replicated to logical replica
> though, and further inspection showed their insertion timestamp sometime
> during the day, long after the database supposedly stopped applying new
> data from WAL to tables.
>
> We don't know the reason why these two records were gone (no DELETE
> statements ever happen in the table in question). The records in
> question existed in replica table, luckly.
>

Well, considering you did pg_resetxlog ... I'm surprised you only lost
two records, but it's a question what state are the indexes in etc.

> The logs for the entire day when the incident happened are available
> here
> <https://gist.githubusercontent.com/gmile/6df9ce393b11fb369856b84c4e9f755a/raw/5ab86c10332f9aef01abf16cb4989334bc7b02a2/postgresql.log> (1.6
> Mb). Each log line was prepended with a timestamp. Also, we have the
> core dumps that were generated during attempts to restart the database.
>

Man, that's an extremely noisy log file. There are errors caused by
broken pg_hba.conf entries and so on, which makes it extremely difficult
to look for possible other errors ... I'm not going to waste time by
doing that.

> My questions are:
>
>   * what would be some typical reasons for primary/secondary WAL
>     checkpoints to fail? Disk issues? File system issues? OS issues? I'm
>     running out of ideas here.
>
>     Is there any way I could separate primary and secondary checkpoints
>     in space, so to increase postgres's chances to restore one of them?

I don't know. You have not provided any information about the storage
system, and we don't have access to system logs that might show if
something happened at this level. But in general yes, a storage issue
might cause this.

>
>   * how come postgresql would stop applying WAL logs to tables, and yet
>     keep silence? There was no indication WAL is not being applied,
>     postgres log would be silent (even though it kept writing other
>     commodity messages, such as "incomplete startup packet" and
>     malformed query attempts).
>
>     Due to this, we noticed the issue only hours after postgres
>     supposedly stopped applying the WAL logs to tables.
>

It wouldn't. But it's unclear how you came to that conclusion.

>   * Is there any other log (like some lower level system log?) where I
>     could see the status of writing / reading / applying WAL segments?
>

There's only one log for PostgreSQL, so it would have to be in that file
depending on the log level.

>   * What measures could be done to prevent this in future, or at least
>     make the situation easier to recover from (e.g. without data loss)?
>

The jury is still out on the root cause, so it's too soon to say.


regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




Re: Issue with WAL logs temporarily not replaying

From
Tomas Vondra
Date:

On 02/01/2018 08:17 AM, Eugene Pirogov wrote:
> Tomas,
> 
> Indeed, I haven't put enough detail in the original post. That is a fair
> point, thank you. Let me follow up on each of your comments.
> 
>     unfortunately your message is missing some pretty important
>     information, like for example which OS you're running, what
>     PostgreSQL version are you using (and if you built it on your own or
>     installed from some package repository).
> 
> 
> We are using Docker container based on Alpine 3.5. It has a custom-built
> PostgreSQL 9.6.3 installation. The customization is negligible, e.g. we
> did not alter configure and make process in any way. We only included
> pglogical extension. Here are the very sources for our image
> – https://github.com/edenlabllc/alpine-postgre/blob/master/pglogical/Dockerfile
> <https://github.com/edenlabllc/alpine-postgre/blob/master/pglogical/Dockerfile>
> . The Dockerfile is based on the "official" postgres image 9.6.3 from
> docker library,
> e.g. https://github.com/docker-library/postgres/tree/a554d043a3b77937120d325b30fefaad2e3be12d/9.6/alpine
> <https://github.com/docker-library/postgres/tree/a554d043a3b77937120d325b30fefaad2e3be12d/9.6/alpine>
> .
> 
>     Also, you mention replicas. What type of replication are you using?
>     The built-in streaming replication, or something else?
> 
> 
> We use logical replication. We are interested in replicating selected
> columns from tables across several databases in our kubernetes cluster.
> For this we are using pglogical REL2.0.1, built with patch that allows
> it to compile under Alpine. Here's the
> patch: https://github.com/alpinelinux/aports/commit/f4f856564dfb2589c1800bd51322f4eadfc67cf1
> <https://github.com/alpinelinux/aports/commit/f4f856564dfb2589c1800bd51322f4eadfc67cf1>
> 
>     > Firstly, I very much doubt using pg_resetxlog is a good idea, especially when it's rather unclear what was the actual root cause.
> 
> 
> Noted. In hindsight I think our best course of action would have been to
> make a copy of PG_DATA table.

Yeah, that would be helpful, but it's too late to do that.

> 
>     Secondly, I don't quite understand what you mean by "recovery" here.
>     The regular recovery process (say, after a crash) makes it
>     impossible to connect to the database, so I'm not sure what you mean
>     by running vacuum analyze (not to mention how it could have improved
>     the recovery speed).
> 
> 
> By recovery I mean a sort of self-healing of a running database. Let me
> make a little step back and explain this.
> 
> After several restart attempts, the DB would still not get up. Instead
> it was complaining about inability to "locate a valid checkpoint
> record". After we ran pg_resetxlog and tried to start the database
> again, the "locate a valid checkpoint record" error was gone, and we
> ended up with a 100% working database.
> 

OK. So what pg_resetxlog command have you used exactly? There are
various options.

> From this point onwards we began issuing SELECT statements, once in
> several minutes, to see if the database sees the missing records. The
> records indeed began to slowly appear in our tables. I _think_ this was
> happening because the database began reading them from WAL, in the
> background. Again – at this point the database was (or seemed to be)
> fully operational. It was just catching up on data (this is my impression).
> 

That is not how WAL works. It's only read when the database is starting
after an unclean shutdown/crash. But it's not possible to connect to a
database in this recovery mode. And after the database gets into a
consistent state, it does not read WAL anymore.

Moreover, pg_resetxlog actually removes all the WAL segments, so there's
nothing to read ...

So the records were appearing for some other reason. Either they were
being inserted by some other process, or becoming visible for some other
reason (e.g. because their XIDs were initially from the "future" and the
database eventually generated them again).

> It seems as if pg_resetxlog removed some WAL segment (hence the
> mentioned "PANIC:  could not locate a valid checkpoint record" log),
> which initially caused the database to stuck and not read further WAL
> segments.
> 
>     Well, considering you did pg_resetxlog ... I'm surprised you only
>     lost two records, but it's a question what state are the indexes in etc.
> 
> 
> I never considered indexes to play part in the issue. Noted on that,
> thank you.
> 

Well, the pg_resetxlog docs actually say this:

    After running this command, it should be possible to start the
    server, but bear in mind that the database might contain
    inconsistent data due to partially-committed transactions. You
    should immediately dump your data, run initdb, and reload. After
    reload, check for inconsistencies and repair as needed.

and I strongly recommend doing exactly that - dump/restore. The sooner
the better.
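
The mechanics are the usual dump / initdb / restore cycle, roughly (all paths below are placeholders):

pg_dumpall -U postgres > /backup/full_dump.sql
# stop the server, move the old data directory aside, then:
initdb -D /var/lib/postgresql/data
# start the server again and load the dump:
psql -U postgres -d postgres -f /backup/full_dump.sql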

>     Man, that's an extremely noisy log file. There are errors caused by
>     broken pg_hba.conf entries and so on, which makes it extremely
>     difficult to look for possible other errors ... I'm not going to
>     waste time by doing that.
> 
> 
> Ugh, sorry about that. I wanted to provide a vanilla version of the log,
> e.g. "as is". Below is an updated variant. I removed duplicate lines.
> The log file went down from 14794 to 329 lines:
>
https://gist.githubusercontent.com/gmile/6df9ce393b11fb369856b84c4e9f755a/raw/7779c8341482d8df39257ab033e809b187227389/postgresql-short.log
>
<https://gist.githubusercontent.com/gmile/6df9ce393b11fb369856b84c4e9f755a/raw/7779c8341482d8df39257ab033e809b187227389/postgresql-short.log>
> 
> The most interesting stuff, in my opinion, is around PANIC lines. Also,
> there are a lot of the following lines:
> 
>     ERROR:  requested WAL segment 000000010000000300000089 has already
>     been removed
>
> 
> I'm yet to find out why these happened and what they mean.
> 

Well, I suppose those started after the pg_resetxlog, which removes all
the WAL segments. So if you have a replica (say, logical replica which
connects through a replication slot) that needs old WAL, you'll get
exactly this error. I.e. your pglogical replication was likely broken.

I find this much more interesting:

2018-01-30T14:49:31Z FATAL:  terminating connection due to unexpected
postmaster exit

There's a log message about smart shutdown request, which should wait
for all clients to regularly disconnect. So likely something crashed.

Can you provide some timeline of the events? I mean, at which time you
discovered the issues, at which time you did the pg_resetxlog etc?

> For the record, these are the lines that I removed from the log:
> 
>     sed -i '' '/LOG:  incomplete startup packet/d' postgres.log
>     sed -i '' '/FATAL:  password authentication failed for user "db"/d
>     postgres.log
>     sed -i '' '/DETAIL:  Password does not match for user "db"./d
>     postgres.log
>     sed -i '' '/Connection matched pg_hba.conf line 95: "host all all
>     all md5"/d' postgres.log
>     sed -i '' '/DETAIL:  Password does not match for user "pghoard"/d'
>     postgres.log
>     sed -i '' '/FATAL:  password authentication failed for user
>     "pghoard"/d' postgres.log
>     sed -i '' '/WARNING:  concurrent insert in progress within table
>     "declarations_status_hstr"/d' postgres.log
>     sed -i '' '/WARNING:  concurrent insert in progress within table
>     "audit_log"/d' postgres.log
> 
> 
>     I don't know. You have not provided any information about the
>     storage system, and we don't have access to system logs that might
>     show if something happened at this level. But in general yes, a
>     storage issue might cause this.
> 
> 
> We use volumes provided by Kubernetes which, in turn, are disks from
> Google Compute Engine. I don't know the exact characteristics of SSDs.
> Would it make sense to try and get to the bottom of the SSDs stats?
> 
> The file system used by PG_DATA folder is ext4.
> 

OK. I have little experience with GCE, but my guess is that there was some
sort of I/O issue and the control file (which tracks checkpoint records
etc.) got corrupted. And then it went downhill. Unfortunately, we'll likely
never know without a backup of the original DATA directory.
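
For what it's worth, those checkpoint pointers live in global/pg_control, and you can inspect them with pg_controldata (the data directory path is an example):

pg_controldata /var/lib/postgresql/data
# relevant lines: "Latest checkpoint location" and "Prior checkpoint location"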

>     It wouldn't. But it's unclear how you came to that conclusion.
> 
> 
> My logic is the following:
> 
>  1. db is up and responding to write requests,
>  2. the data is not actually seen in the tables,
>  3. the data is actually seen in a logical replica,
>  4. we restart db, it refuses to start and raises an error,
>  5. we mitigate the error,
>  6. db restarts successfully,
>  7. data slowly restores "by itself".
> 
> Given replica had good data and since by design replica takes its data
> from original DB's WAL... we concluded that all inserted data made it to
> the WAL but not further, e.g. not in the actual tables. Again: since the
> records were missing from tables in the original database, somehow
> PostgreSQL was not applying data from WAL to tables.
>
> It all looked as if some code in postgresql that does "now, move data
> from WAL to table" action would get stuck or something. While at the
> same time worker processes that deal with DB connections and queries
> would continue to operate normally. Ultimately pg_resetxlog would
> unstuck the "now, move data from WAL to table" code.
>
> I'm totally may be making things up here. This is only my impression
> based on observation. I'd very much like to be corrected.
> 


Again, that's not how WAL is used - your mental model of the inner
workings is incorrect. We do generate the WAL and then modify the table
data. We don't "apply" WAL unless a recovery is necessary, which only
happens after a crash etc.

So if you haven't seen the data in the tables, it was for some other
unknown reason.

>     There's only one log for PostgreSQL, so it would have to be in that
>     file depending on the log level.
> 
> 
> Got it.
> 
>     Please inspect the core dumps using gdb and share the backtraces.
>     There are instructions how to do that on the wiki
> 
> 
> I'm going to try and extract backtraces from the core dumps. I don't
> usually do this on a daily basis, so it will take some time. I will come
> back with results when I have something.
> 

OK. That would be useful.

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services