Thread: Failing Multi-Job Restores, Missing Indexes on Restore

Failing Multi-Job Restores, Missing Indexes on Restore

From
Cea Stapleton
Date:
Hi!

We are having a baffling problem we hope you might be able to help with. We were hoping to speed up postgres restores
toour reporting server. First, we were seeing missing indexes with pg_restore to our reporting server for one of our
databaseswhen we did pg_restore with multiple jobs (a clean restore, we also tried dropping the database prior to
restore,just in case something was extant and amiss). The indexes missed were not consistent, and we were only ever
seeingerrors on import that indicated an index had not yet been built. For example: 

pg_restore: [archiver (db)] could not execute query: ERROR:  index "index_versions_on_item_type_and_item_id" does not
exist
   Command was: DROP INDEX public.index_versions_on_item_type_and_item_id;

Which seemed like a reasonable error to us. We had no errors on insertion to indicate that index creation was a
problem. 

We believed this might be a race condition, so we attempted to do a schema-only restore followed by a data-only restore
justfor this database. This worked a few times, and then began growing exponentially in completion time before it
becameunsustainable. We figured we were using too many jobs, so we decreased them. Nothing helped. 

We decided to move back to a multi-job regular restore, and then the restores began crashing thusly:
[2016-09-14 02:20:36 UTC]    LOG:  server process (PID 27624) was terminated by signal 9: Killed
[2016-09-14 02:20:36 UTC]    LOG:  terminating any other active server processes
[2016-09-14 02:20:36 UTC] postgres [local] DBNAME WARNING:  terminating connection because of crash of another server
process
[2016-09-14 02:20:36 UTC] postgres [local] DBNAME DETAIL:  The postmaster has commanded this server process to roll
backthe current transaction and exit, because another server process exited abnormally and possibly corrupted shared
memory.

The restore crashed this way for all job numbers except for one. We’re now stuck back where we were prior to increasing
jobnumbers, at one job for this restore in order to prevent errors and crashes.   

Background:
    • 3 ec2 instances with postgres
        • 1 used for reporting, on Postgresql 9.5.4
            • Reporting server is a c4.2xlarge, and should have been able to handle multiple jobs (8cpu /
https://aws.amazon.com/ec2/instance-types/) 
        • 2 production servers; one leader and one follower, both on Postgresql 9.5.3.

We have one very large database, 678GB, and several others, but the largest is our concern.

I have attached our postgresql.conf file. Thank you so much for your time.

Best,



Cea Stapleton
Operations Engineer
http://www.healthfinch.com





Attachment

Re: Failing Multi-Job Restores, Missing Indexes on Restore

From
Tom Lane
Date:
Cea Stapleton <cea@healthfinch.com> writes:
> We are having a baffling problem we hope you might be able to help with. We were hoping to speed up postgres restores
toour reporting server. First, we were seeing missing indexes with pg_restore to our reporting server for one of our
databaseswhen we did pg_restore with multiple jobs (a clean restore, we also tried dropping the database prior to
restore,just in case something was extant and amiss). The indexes missed were not consistent, and we were only ever
seeingerrors on import that indicated an index had not yet been built. For example: 

> pg_restore: [archiver (db)] could not execute query: ERROR:  index "index_versions_on_item_type_and_item_id" does not
exist
>    Command was: DROP INDEX public.index_versions_on_item_type_and_item_id;

Which PG version is that; particularly, which pg_restore version?
What's the exact pg_restore command you were issuing?

> We decided to move back to a multi-job regular restore, and then the restores began crashing thusly:
> [2016-09-14 02:20:36 UTC]    LOG:  server process (PID 27624) was terminated by signal 9: Killed

This is probably the dreaded Linux OOM killer.  Fix by reconfiguring your
system to disallow memory overcommit, or at least make it not apply to
Postgres, cf
https://www.postgresql.org/docs/9.5/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT

            regards, tom lane


Re: Failing Multi-Job Restores, Missing Indexes on Restore

From
Cea Stapleton
Date:
Thanks Tom!

We’re using pg_restore (PostgreSQL) 9.5.4 for the restores. We’ve used variations on the job number:

/usr/bin/pg_restore -j 6 -Fc -O -c -d DBNAME RESTORE_FILE”

We’ll take a look at the memory overcommit - would that also explain the index issues we were seeing before we were
seeingthe crashes? 

Cea Stapleton
Operations Engineer
http://www.healthfinch.com


> On Sep 29, 2016, at 7:52 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Cea Stapleton <cea@healthfinch.com> writes:
>> We are having a baffling problem we hope you might be able to help with. We were hoping to speed up postgres
restoresto our reporting server. First, we were seeing missing indexes with pg_restore to our reporting server for one
ofour databases when we did pg_restore with multiple jobs (a clean restore, we also tried dropping the database prior
torestore, just in case something was extant and amiss). The indexes missed were not consistent, and we were only ever
seeingerrors on import that indicated an index had not yet been built. For example: 
>
>> pg_restore: [archiver (db)] could not execute query: ERROR:  index "index_versions_on_item_type_and_item_id" does
notexist 
>>   Command was: DROP INDEX public.index_versions_on_item_type_and_item_id;
>
> Which PG version is that; particularly, which pg_restore version?
> What's the exact pg_restore command you were issuing?
>
>> We decided to move back to a multi-job regular restore, and then the restores began crashing thusly:
>> [2016-09-14 02:20:36 UTC]    LOG:  server process (PID 27624) was terminated by signal 9: Killed
>
> This is probably the dreaded Linux OOM killer.  Fix by reconfiguring your
> system to disallow memory overcommit, or at least make it not apply to
> Postgres, cf
> https://www.postgresql.org/docs/9.5/static/kernel-resources.html#LINUX-MEMORY-OVERCOMMIT
>
>             regards, tom lane



Re: Failing Multi-Job Restores, Missing Indexes on Restore

From
Tom Lane
Date:
Cea Stapleton <cea@healthfinch.com> writes:
> We’re using pg_restore (PostgreSQL) 9.5.4 for the restores. We’ve used variations on the job number:
> /usr/bin/pg_restore -j 6 -Fc -O -c -d DBNAME RESTORE_FILE”

OK ... do you actually need the -c, and if so why?

> We’ll take a look at the memory overcommit - would that also explain the index issues we were seeing before we were
seeingthe crashes? 

Unlikely.  I'm guessing that there's some sort of race condition involved
in parallel restore with -c, but it's not very clear what.

            regards, tom lane