Thread: standby replication server throws invalid memory alloc request size ,does not start up
standby replication server throws invalid memory alloc request size ,does not start up
Hi All,
This is my first postgres query to admin list, so if I am not following the right standards for asking the question, pls let me know 😊
The problem:
I have a postgres cluster as
A (primary)-> streaming replication -> B(hot_standby=on)
We had a power outage in one of the data centers, and when we got back, one of the databases servers (B the standby node) seem to show weird errors and is not starting up. A recovered fine, and it running fine.
--------- logs
2018-06-25 10:57:04 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.
2018-06-25 10:57:04 UTC WARNING: terminating connection because of crash of another server process
2018-06-25 10:57:04 UTC DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2018-06-25 10:57:04 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command.
2018-06-25 10:57:04 UTC LOG: database system is shut down
2018-06-27 16:59:28 UTC LOG: listening on IPv4 address "0.0.0.0", port 5432
2018-06-27 16:59:28 UTC LOG: listening on IPv6 address "::", port 5432
2018-06-27 16:59:28 UTC LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2018-06-27 16:59:28 UTC LOG: database system was interrupted while in recovery at log time 2018-06-25 10:52:21 UTC
2018-06-27 16:59:28 UTC HINT: If this has occurred more than once some data might be corrupted and you might need to choose an earlier recovery target.
2018-06-27 16:59:28 UTC LOG: entering standby mode
2018-06-27 16:59:28 UTC LOG: recovered replication state of node 1 to 63/89B52A98
2018-06-27 16:59:28 UTC LOG: redo starts at 9A/6F1B3888
2018-06-27 16:59:28 UTC LOG: consistent recovery state reached at 9A/6F25BFF0
2018-06-27 16:59:28 UTC FATAL: invalid memory alloc request size 1768185856
2018-06-27 16:59:28 UTC LOG: database system is ready to accept read only connections
2018-06-27 16:59:28 UTC LOG: startup process (PID 11829) exited with exit code 1
2018-06-27 16:59:28 UTC LOG: terminating any other active server processes
2018-06-27 16:59:28 UTC LOG: database system is shut down
pg_ctl: could not start server
Examine the log output.
-------what I tried
I did a lot of searching around “invalid memory alloc request” it mostly points to some data error,
https://www.postgresql.org/message-id/AANLkTik88wGQ6h-xFHJBU_Fj_CnXK5LX7aBws_2oANKP%40mail.gmail.com
mem checks and disks are ok.
I took a filesystem snapshot, synced it onto another server and tried to reload it, it gave the same error.
Now to ensure there was not real data causing this issue, I took a full dump of the primary db (A) using pg_dumpall and restored in onto the test server. I was able to load the db and was able to query the tables without issues.
So I am confused why the standby is behaving weird after the power outage.
Just FYI, I spun up a new node ( C ) as a replacement for the problem standby(B) with pg_baseback and it is working fine. Can query all tables without problems. My only concern was do I need to worry about this error showing up again?
Most the googling points to some operation and finding out errors in some operation performed etc,
https://blog.dob.sk/2012/05/19/fixing-pg_dump-invalid-memory-alloc-request-size/ - 1
https://confluence.atlassian.com/jirakb/invalid-memory-alloc-request-size-440107132.html - 2
I ran that function (in 1), could not find bad data rows.
Also, I tried to bump up (memory to 32GB, cpu to 8 cpu and shmmax to 18446744073692774399 to figure out is there was any resource constraint that I could temp bump up and then debug what went wrong, but it did not start up.
$ /usr/lib/postgresql/10/bin/postgres -V
postgres (PostgreSQL) 10.4 (Ubuntu 10.4-2.pgdg16.04+1)
$ grep -v "#" /etc/postgresql/10/main/postgresql.conf | sed -e '/^$/d'
ssl = on
ssl_cert_file = '/etc/ssl/certs/ssl-cert-snakeoil.pem'
ssl_key_file = '/etc/ssl/private/ssl-cert-snakeoil.key'
log_timezone = UTC
stats_temp_directory = '/var/run/postgresql/10-main.pg_stat_tmp'
datestyle = 'iso, mdy'
timezone = UTC
default_text_search_config = 'pg_catalog.english'
log_directory = pg_log
archive_mode = on
maintenance_work_mem = 16MB
max_wal_size = 1GB
max_stack_depth = 2MB
autovacuum_analyze_scale_factor = 0.05
autovacuum_vacuum_scale_factor = 0.1
shared_preload_libraries = pg_stat_statements
work_mem = 4MB
max_wal_senders = 10
temp_buffers = 16MB
archive_command = '/bin/true'
effective_cache_size = 256MB
max_replication_slots = 10
hot_standby = on
wal_keep_segments = 64
wal_level = hot_standby
listen_addresses = '*'
log_min_duration_statement = 10000
max_worker_processes = 18
sysctl:
kernel.shmall = 18446744073692774399
kernel.shmmax = 18446744073692774399
$ free -m
total used free shared buff/cache available
Mem: 16047 134 14469 8 1442 15620
Swap: 975 0 975
$ nproc
4
Thanks,
Vijay
Re: standby replication server throws invalid memory alloc requestsize , does not start up
You need to worry, very much, that your setup did not preserve data write ordering when the power went out. - you might have fsync turned off in PG - you might be using a non-journaled filesystem - your disks might be cheating on sync commands (consumer grade disks are notorious for this) - you might be using consumer-grade flash that can't flush its cache on power loss This highly unlikely to be a PG bug, and you do need to worry that it could have been the master if you had been less lucky. -- Scott Ribe scott_ribe@elevated-dev.com https://www.linkedin.com/in/scottribe/ > On Jun 28, 2018, at 7:17 AM, Vijaykumar Jain <vjain@opentable.com> wrote: > > My only concern was do I need to worry about this error showing up again?
Re: standby replication server throws invalid memory alloc requestsize , does not start up
> On Jun 28, 2018, at 7:26 AM, Scott Ribe <scott_ribe@elevated-dev.com> wrote: > > you do need to worry that it could have been the master Correcting myself: you do need to worry that all 3 could have been corrupted if you had been unlucky, and that you couldhave experienced corruption that was silent and you wouldn't know it--in fact you still need to worry about that
Re: [External] Re: standby replication server throws invalid memoryalloc request size , does not start up
Thanks Scott. fsync is on. postgres=# show fsync; fsync ------- on (1 row) The is a storage in qa which is less powerful then prod. (it’s on a Ubuntu 16.04 VM, vsan / ssd and database is on lvm partition) It is an ext4 filesystem. /path/vgx-root ext4 108G 6.3G 97G 7% / Wrt: - your disks might be cheating on sync commands (consumer grade disks are notorious for this) - you might be using consumer-grade flash that can't flush its cache on power loss Ok I am not much aware of quality of hardware, i'll check with the team, we have not seen this issue in atleast 2 years ofsince I have been here. We do have multiple points of backup (s3 daily, barman and 2 standby for each primary). ----though not relevant to the bug but ... And this particular cluster is one of a kind with a multi master mesh (if you could call it) setup. Each region has one table that logically replicated to all the other regions db servers. And also we spin up only one db per cluster for the same reason as it is easy to isolate the problem and use cheap hardware(if that really is __) But I'll check with the team and what the expectation should be from the storage that we get. Thanks Scott. I just wanted to confirm if this is nothing to do with pg bug. Vijay On 6/28/18, 6:56 PM, "Scott Ribe" <scott_ribe@elevated-dev.com> wrote: You need to worry, very much, that your setup did not preserve data write ordering when the power went out. - you might have fsync turned off in PG - you might be using a non-journaled filesystem - your disks might be cheating on sync commands (consumer grade disks are notorious for this) - you might be using consumer-grade flash that can't flush its cache on power loss This highly unlikely to be a PG bug, and you do need to worry that it could have been the master if you had been lesslucky. -- Scott Ribe scott_ribe@elevated-dev.com https://www.linkedin.com/in/scottribe/ > On Jun 28, 2018, at 7:17 AM, Vijaykumar Jain <vjain@opentable.com> wrote: > > My only concern was do I need to worry about this error showing up again?
Re: [External] Re: standby replication server throws invalid memoryalloc request size , does not start up
> On Jun 28, 2018, at 9:47 AM, Vijaykumar Jain <vjain@opentable.com> wrote: > > Thanks Scott. > > fsync is on. > > postgres=# show fsync; > fsync > ------- > on > (1 row) > Yeah, I agree with Scott; it sounds like fsync() is being shorted somewhere in the I/O path. Are you using a RAID controller;if so is the cache setting — write through or write back? If it’s write back then writes are being cached anda power outage will lead to data corruption — the only saving grace at that point would be the battery backup logic onthe controller which may or nay not be trustworthy.
Re: [External] Re: standby replication server throws invalid memoryalloc request size , does not start up
Thanks guys. We are better with backups as we have split the dbs across different availability groups with diff power supply etc. So it is expected that servers will crash and we need to failover the db onto the other regions avg if required. but servers crash all the time and we have been asked to make arrangements thinking of that. So as long as this is a h/w failure i guess there is only one db server getting impacted and we should be good which is good If this was a pg bug all the nodes would get impacted and we would be in deep waters :) So thanks. Regards, Vijay ________________________________________ From: Rui DeSousa <rui@crazybean.net> Sent: Thursday, June 28, 2018 8:13:16 PM To: Vijaykumar Jain Cc: Scott Ribe; pgsql-admin@lists.postgresql.org Subject: Re: [External] Re: standby replication server throws invalid memory alloc request size , does not start up > On Jun 28, 2018, at 9:47 AM, Vijaykumar Jain <vjain@opentable.com> wrote: > > Thanks Scott. > > fsync is on. > > postgres=# show fsync; > fsync > ------- > on > (1 row) > Yeah, I agree with Scott; it sounds like fsync() is being shorted somewhere in the I/O path. Are you using a RAID controller;if so is the cache setting — write through or write back? If it’s write back then writes are being cached anda power outage will lead to data corruption — the only saving grace at that point would be the battery backup logic onthe controller which may or nay not be trustworthy.
Re: [External] Re: standby replication server throws invalid memoryalloc request size , does not start up
On 2018-06-28 10:43:16 -0400, Rui DeSousa wrote: > > > > On Jun 28, 2018, at 9:47 AM, Vijaykumar Jain <vjain@opentable.com> wrote: > > > > Thanks Scott. > > > > fsync is on. > > > > postgres=# show fsync; > > fsync > > ------- > > on > > (1 row) > > > > Yeah, I agree with Scott; it sounds like fsync() is being shorted > somewhere in the I/O path. Are you using a RAID controller; if so is > the cache setting — write through or write back? If it’s write back > then writes are being cached and a power outage will lead to data > corruption — the only saving grace at that point would be the battery > backup logic on the controller which may or nay not be trustworthy. How are you making the inference that that's the problem? It could very well be a postgres bug or somebody having used pg_resetxlog or.. ? Greetings, Andres Freund
Re: standby replication server throws invalid memory alloc requestsize , does not start up
Hi, On 2018-06-28 13:17:52 +0000, Vijaykumar Jain wrote: > Hi All, > > This is my first postgres query to admin list, so if I am not following the right standards for asking the question, plslet me know 😊 > > > The problem: > > I have a postgres cluster as > > A (primary)-> streaming replication -> B(hot_standby=on) > > We had a power outage in one of the data centers, and when we got back, one of the databases servers (B the standby node)seem to show weird errors and is not starting up. A recovered fine, and it running fine. > > --------- logs > > 2018-06-25 10:57:04 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command. > 2018-06-25 10:57:04 UTC WARNING: terminating connection because of crash of another server process > 2018-06-25 10:57:04 UTC DETAIL: The postmaster has commanded this server process to roll back the current transactionand exit, because another server process exited abnormally and possibly corrupted shared memory. > 2018-06-25 10:57:04 UTC HINT: In a moment you should be able to reconnect to the database and repeat your command. This sounds like either was a crash or you shutdown the server with force. Could you post the preceding lines? > 2018-06-25 10:57:04 UTC LOG: database system is shut down > 2018-06-27 16:59:28 UTC LOG: listening on IPv4 address "0.0.0.0", port 5432 > 2018-06-27 16:59:28 UTC LOG: listening on IPv6 address "::", port 5432 > 2018-06-27 16:59:28 UTC LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432" > 2018-06-27 16:59:28 UTC LOG: database system was interrupted while in recovery at log time 2018-06-25 10:52:21 UTC > 2018-06-27 16:59:28 UTC HINT: If this has occurred more than once some data might be corrupted and you might need to choosean earlier recovery target. > 2018-06-27 16:59:28 UTC LOG: entering standby mode > 2018-06-27 16:59:28 UTC LOG: recovered replication state of node 1 to 63/89B52A98 > 2018-06-27 16:59:28 UTC LOG: redo starts at 9A/6F1B3888 > 2018-06-27 16:59:28 UTC LOG: consistent recovery state reached at 9A/6F25BFF0 > 2018-06-27 16:59:28 UTC FATAL: invalid memory alloc request size 1768185856 > 2018-06-27 16:59:28 UTC LOG: database system is ready to accept read only connections > 2018-06-27 16:59:28 UTC LOG: startup process (PID 11829) exited with exit code 1 > 2018-06-27 16:59:28 UTC LOG: terminating any other active server processes > 2018-06-27 16:59:28 UTC LOG: database system is shut down Could you restart the server with log_min_messages set to debug3, and change log_line_prefix so it includs the pid? Greetings, Andres Freund
Re: [External] Re: standby replication server throws invalid memoryalloc request size , does not start up
> On Jun 28, 2018, at 10:36 AM, Andres Freund <andres@anarazel.de> wrote: > > How are you making the inference that that's the problem? It could very > well be a postgres bug or somebody having used pg_resetxlog or.. ? There could always be a postgres bug, however misconfigured cache layers are a very common cause of this kind of symptom,and postgres bugs very rare. I don't believe pg_resetxlog would cause this particular problem. Failure to recovery yes, but not illegal memory operation.
Re: [External] Re: standby replication server throws invalid memoryalloc request size , does not start up
Hi, On 2018-06-28 10:43:16 -0600, Scott Ribe wrote: > > On Jun 28, 2018, at 10:36 AM, Andres Freund <andres@anarazel.de> wrote: > > > > How are you making the inference that that's the problem? It could very > > well be a postgres bug or somebody having used pg_resetxlog or.. ? > > There could always be a postgres bug, however misconfigured cache > layers are a very common cause of this kind of symptom, and postgres > bugs very rare. Sure, but it's also easy to not find postgres bugs if you don't investigate. If the error comes from parsing WAL - which is likely but not guaranteed from the log output - then it's unlikely the error is caused by on-disk corruption. The WAL is protected by a CRC checksum, and if it fails, you don't get to the memory allocation bit. > I don't believe pg_resetxlog would cause this particular > problem. Failure to recovery yes, but not illegal memory operation. It can. Not as a result of reading WAL itself, but as a result of the resulting corruption. That's why I asked for debugging output. - Andres
Re: [External] Re: standby replication server throws invalid memoryalloc request size , does not start up
The server was destroyed. But did copy the pg_data folder and restored it on the test server. It failed with similar error. Will that do. I do have some very high (DEBUG5) level logs in my console along with strace (postgresql start) logs. Not sure if this is ok to paste in the mail (it will not be readable) On 6/28/18, 10:16 PM, "Andres Freund" <andres@anarazel.de> wrote: Hi, On 2018-06-28 10:43:16 -0600, Scott Ribe wrote: > > On Jun 28, 2018, at 10:36 AM, Andres Freund <andres@anarazel.de> wrote: > > > > How are you making the inference that that's the problem? It could very > > well be a postgres bug or somebody having used pg_resetxlog or.. ? > > There could always be a postgres bug, however misconfigured cache > layers are a very common cause of this kind of symptom, and postgres > bugs very rare. Sure, but it's also easy to not find postgres bugs if you don't investigate. If the error comes from parsing WAL - which is likely but not guaranteed from the log output - then it's unlikely the error is caused by on-disk corruption. The WAL is protected by a CRC checksum, and if it fails, you don't get to the memory allocation bit. > I don't believe pg_resetxlog would cause this particular > problem. Failure to recovery yes, but not illegal memory operation. It can. Not as a result of reading WAL itself, but as a result of the resulting corruption. That's why I asked for debugging output. - Andres
Re: [External] Re: standby replication server throws invalid memoryalloc request size , does not start up
On 2018-06-28 17:00:23 +0000, Vijaykumar Jain wrote: > The server was destroyed. > > But did copy the pg_data folder and restored it on the test server. > It failed with similar error. Will that do. > > I do have some very high (DEBUG5) level logs in my console along with strace (postgresql start) logs. > > Not sure if this is ok to paste in the mail (it will not be readable) Can't you attach files containing them? Thanks, Andres
Re: [External] standby replication server throws invalid memory allocrequest size , does not start up
On Jun 28, 2018, at 12:36 PM, Andres Freund <andres@anarazel.de> wrote:How are you making the inference that that's the problem? It could very
well be a postgres bug or somebody having used pg_resetxlog or.. ?
Re: [External] Re: standby replication server throws invalid memoryalloc request size , does not start up
I do not have earlier system logs, the machine was wiped. I only got access to db dump which I took from that machine. These logs are from when I restore the /var/lib/postgresql of the problem server on the test server. I have enabled most of the debugging flags, it may be verbose but I have them attached in this mail. Let me know if anything else is needed. Thanks, Vijay On 6/28/18, 10:34 PM, "Andres Freund" <andres@anarazel.de> wrote: On 2018-06-28 17:00:23 +0000, Vijaykumar Jain wrote: > The server was destroyed. > > But did copy the pg_data folder and restored it on the test server. > It failed with similar error. Will that do. > > I do have some very high (DEBUG5) level logs in my console along with strace (postgresql start) logs. > > Not sure if this is ok to paste in the mail (it will not be readable) Can't you attach files containing them? Thanks, Andres