Thread: RDS restore failed due to WAL log and disk space-- any tidy fixes?

RDS restore failed due to WAL log and disk space-- any tidy fixes?

From: Wells Oliver
Date: 17 November 2024, 03:33:10

I provisioned an RDS instance with 2500GB space and began the restore of a database I know to be about 1750 GB using 16 jobs.

Unfortunately, it died very near the end when it ran out of disk space due to WAL log usage. Lots of:

2024-11-17 00:07:09 UTC::@:[19861]:PANIC: could not write to file "pg_wal/xlogtemp.19861": No space left on device

And then kaboom.

I'm wondering what my course of action should be. Can I disable/reduce WAL during a restore? wal_level is set to replica, can this temporarily be set to minimal? Should I just eat the extra costs to add headroom for the WAL? Would using fewer jobs during a restore reduce the amount of WAL created?

I appreciate it.

Wells Oliver
wells.oliver@gmail.com
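
For scale, the headroom arithmetic behind the failure can be sketched as below. This is only a rough illustration using the sizes quoted in the message; the function name is made up, and it assumes the 1750 GB estimate already includes indexes.

```python
def wal_headroom_gb(allocated_gb: float, restored_db_gb: float) -> float:
    """Space left over for WAL (and anything else) once all data is restored."""
    return allocated_gb - restored_db_gb


allocated = 2500  # GB provisioned on the instance
database = 1750   # GB, estimated size of the restored database

headroom = wal_headroom_gb(allocated, database)
print(f"{headroom:.0f} GB of headroom, {headroom / allocated:.0%} of the volume")
```

Running out of space near the end of the restore therefore suggests that retained WAL had grown to something on the order of that headroom.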

Re: RDS restore failed due to WAL log and disk space-- any tidy fixes?

From: Laurenz Albe
Date: 17 November 2024, 18:41:38

On Sat, 2024-11-16 at 16:33 -0800, Wells Oliver wrote:
> I provisioned an RDS instance with 2500GB space and began the restore of a database I know to be about 1750 GB using 16 jobs.
>
> Unfortunately, it died very near the end when it ran out of disk space due to WAL log usage. Lots of:
>
> 2024-11-17 00:07:09 UTC::@:[19861]:PANIC:  could not write to file "pg_wal/xlogtemp.19861": No space left on device
>
>
> And then kaboom.
>
> I'm wondering what my course of action should be. Can I disable/reduce WAL during a restore?
> wal_level is set to replica, can this temporarily be set to minimal? Should I just eat the extra
> costs to add headroom for the WAL? Would using fewer jobs during a restore reduce the amount of WAL
> created?

If you are using minimal WAL logging and you restore the dump in a single transaction, you
should see way less WAL generated, because data inserted into the table in the same transaction
as the CREATE TABLE statement need not be WAL logged.

But you might more easily solve the problem by speeding up or disabling the WAL archiver,
so that PostgreSQL removes old WAL after the next checkpoint.

Yours,
Laurenz Albe
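
A minimal sketch of the single-transaction restore described above, assuming a server where wal_level can actually be set to minimal. The host, database, user, and dump path are hypothetical placeholders. Note that pg_restore does not allow --single-transaction together with --jobs, so this approach gives up the parallel restore.

```python
import shlex


def pg_restore_single_txn(dump_path: str, host: str, dbname: str, user: str) -> list[str]:
    """Build a pg_restore command line that loads the whole dump in one transaction."""
    return [
        "pg_restore",
        "--host", host,
        "--username", user,
        "--dbname", dbname,
        "--single-transaction",  # cannot be combined with --jobs
        dump_path,
    ]


# Hypothetical connection details, for illustration only.
cmd = pg_restore_single_txn("/backups/mydb.dump", "db.example.internal", "mydb", "postgres")
print(shlex.join(cmd))
```

The command is only printed here; it could be handed to subprocess.run(cmd, check=True) once the values are real.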

Re: RDS restore failed due to WAL log and disk space-- any tidy fixes?

From: Wells Oliver
Date: 17 November 2024, 20:12:05

Interesting. I am migrating a pg_dump archive to a new server, in a single go. Does it make sense to disable (or speed up?) WAL archiving during the restore, then reenable it after the restore so a future replica could work? What would be the steps here? Would disabling or "speeding up" be faster?

max_slot_wal_keep_size is -1 at the moment so I think that's why it kept a ton of WAL and ran out of space.

Wells Oliver
wells.oliver@gmail.com
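
One way to test the replication-slot theory is to compare each slot's restart_lsn (from pg_replication_slots) with the current WAL position (pg_current_wal_lsn()); the server can do the subtraction itself with pg_wal_lsn_diff(). The sketch below only reproduces that arithmetic in Python. The LSN values are made up for illustration, not taken from this instance.

```python
def lsn_to_int(lsn: str) -> int:
    """Convert a PostgreSQL LSN such as '16/B374D848' into a byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)


def retained_wal_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL between a slot's restart_lsn and the current write position."""
    return lsn_to_int(current_lsn) - lsn_to_int(restart_lsn)


# Made-up LSNs standing in for pg_current_wal_lsn() and a slot's restart_lsn.
retained = retained_wal_bytes("2F0/00000000", "1A0/00000000")
print(f"{retained / 1024**3:.0f} GiB of WAL held back by the slot")
```

If no slot shows a large gap, something other than max_slot_wal_keep_size is keeping the WAL around.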

Re: RDS restore failed due to WAL log and disk space-- any tidy fixes?

From: Ron Johnson
Date: 17 November 2024, 20:21:05

Doesn't RDS have its own replication?

Anyway, for pg_restore, I'd absolutely set archive_mode=off and wal_level=minimal, then set them to their production values when it's finished.

Death to <Redacted>, and butter sauce.

Don't boil me, I'm still alive.

<Redacted> lobster!
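
On a self-managed server, both settings above only change with a restart, and PostgreSQL refuses to start with wal_level=minimal unless archive_mode is off and max_wal_senders is 0. A small sketch of that consistency check follows; the function name is made up, and it covers only these two rules, not everything the server validates.

```python
def minimal_wal_problems(wal_level: str, archive_mode: str, max_wal_senders: int) -> list[str]:
    """List the settings that conflict with wal_level=minimal."""
    problems = []
    if wal_level == "minimal":
        if archive_mode != "off":
            problems.append("archive_mode must be off when wal_level is minimal")
        if max_wal_senders != 0:
            problems.append("max_wal_senders must be 0 when wal_level is minimal")
    return problems


# Changing wal_level alone while leaving the other settings at production values.
for problem in minimal_wal_problems("minimal", "on", 10):
    print(problem)
```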

Re: RDS restore failed due to WAL log and disk space-- any tidy fixes?

From: Wells Oliver
Date: 17 November 2024, 20:23:15

It does. I think it uses WAL behind the scenes. In RDS you unfortunately cannot set wal_level, but you can set archive_mode.

Wells Oliver
wells.oliver@gmail.com

Re: RDS restore failed due to WAL log and disk space-- any tidy fixes?

From: Wells Oliver
Date: 17 November 2024, 20:31:19

Actually, in RDS it seems you cannot set archive_mode either.

Wells Oliver
wells.oliver@gmail.com

Re: RDS restore failed due to WAL log and disk space-- any tidy fixes?

From: Wells Oliver
Date: 17 November 2024, 20:34:42

Would setting max_slot_wal_keep_size to something like 1GB ensure that WAL logs don't cause runaway disk use during restore? It's currently -1...

Wells Oliver
wells.oliver@gmail.com
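
For reference, max_slot_wal_keep_size is read in megabytes when no unit is given, so 1 GB is 1024, and it only caps WAL held back for replication slots; WAL retained for other reasons is not affected by it. The sketch below builds one parameter entry in the shape that boto3's modify_db_parameter_group expects. Whether RDS lets this parameter be modified, and whether it applies without a reboot, are assumptions here.

```python
def slot_wal_cap_parameter(cap_gb: int) -> dict[str, str]:
    """Build a parameter-group entry capping slot-retained WAL at cap_gb gigabytes."""
    return {
        "ParameterName": "max_slot_wal_keep_size",
        "ParameterValue": str(cap_gb * 1024),  # the parameter's default unit is MB
        "ApplyMethod": "immediate",  # assumption: RDS treats the parameter as dynamic
    }


print(slot_wal_cap_parameter(1))
```

The entry would be passed as Parameters=[slot_wal_cap_parameter(1)] to boto3.client("rds").modify_db_parameter_group(...), along with the DBParameterGroupName of the instance's own parameter group.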

Re: RDS restore failed due to WAL log and disk space-- any tidy fixes?

From: Laurenz Albe
Date: 17 November 2024, 21:18:58

On Sun, 2024-11-17 at 09:12 -0800, Wells Oliver wrote:
> Interesting. I am migrating a pg_dump archive to a new server, in a single go. Does it make sense
> to disable (or speed up?) WAL archiving during the restore, then reenable it after the restore so
> a future replica could work? What would be the steps here? Would disabling or "speeding up" be faster?

Ah, I overlooked that you were using a hosted database. Then you probably cannot configure WAL archiving.

Yours,
Laurenz Albe