Thread: filesystem full during vacuum - space recovery issues

filesystem full during vacuum - space recovery issues

From: Thomas Simpson

Hi

I have a large database (multi TB) which had a vacuum full running but the database ran out of space during the rebuild of one of the large data tables.

Cleaning down the WAL files got the database restarted (an archiving problem led to the initial disk full).

However, the disk space is still at 99% as it appears the large table rebuild files are still hanging around using space and have not been deleted.

My problem now is how do I get this space back to return my free space back to where it should be?

I tried some scripts to map the data files to relations but this didn't work as removing some files led to startup failure despite them appearing to be unrelated to anything in the database - I had to put them back and then startup worked.

Any suggestions here?

Thanks

Tom


Re: filesystem full during vacuum - space recovery issues

From: Laurenz Albe
On Mon, 2024-07-15 at 14:47 -0400, Thomas Simpson wrote:
> I have a large database (multi TB) which had a vacuum full running but the database
> ran out of space during the rebuild of one of the large data tables.
>
> Cleaning down the WAL files got the database restarted (an archiving problem led to
> the initial disk full).
>
> However, the disk space is still at 99% as it appears the large table rebuild files
> are still hanging around using space and have not been deleted.
>
> My problem now is how do I get this space back to return my free space back to where
> it should be?
>
> I tried some scripts to map the data files to relations but this didn't work as
> removing some files led to startup failure despite them appearing to be unrelated
> to anything in the database - I had to put them back and then startup worked.
>
> Any suggestions here?

That reads like the sad old story: "cleaning down" WAL files - you mean deleting the
very files that would have enabled PostgreSQL to recover from the crash that was
caused by the full file system.

Did you run "pg_resetwal"?  If yes, that probably led to data corruption.

The above are just guesses.  Anyway, there is no good way to get rid of the files
that were left behind after the crash.  The reliable way of doing so is also the way
to get rid of potential data corruption caused by "cleaning down" the database:
pg_dump the whole thing and restore the dump to a new, clean cluster.
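
For example, roughly (the host name and port of the new cluster are placeholders):

    pg_dumpall --clean | psql -d postgres -h newhost -p 5433

or per database, piping each one straight into a freshly created database on the new cluster:

    pg_dump bigdb | psql -d bigdb -h newhost -p 5433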

Yes, that will be a painfully long down time.  An alternative is to restore a backup
taken before the crash.

Yours,
Laurenz Albe



Re: filesystem full during vacuum - space recovery issues

From: Imran Khan
Also, you can use multi process dump and restore using pg_dump plus pigz utility for zipping.

Thanks

On Tue, Jul 16, 2024, 4:00 AM Laurenz Albe <laurenz.albe@cybertec.at> wrote:
On Mon, 2024-07-15 at 14:47 -0400, Thomas Simpson wrote:
> I have a large database (multi TB) which had a vacuum full running but the database
> ran out of space during the rebuild of one of the large data tables.
>
> Cleaning down the WAL files got the database restarted (an archiving problem led to
> the initial disk full).
>
> However, the disk space is still at 99% as it appears the large table rebuild files
> are still hanging around using space and have not been deleted.
>
> My problem now is how do I get this space back to return my free space back to where
> it should be?
>
> I tried some scripts to map the data files to relations but this didn't work as
> removing some files led to startup failure despite them appearing to be unrelated
> to anything in the database - I had to put them back and then startup worked.
>
> Any suggestions here?

That reads like the sad old story: "cleaning down" WAL files - you mean deleting the
very files that would have enabled PostgreSQL to recover from the crash that was
caused by the full file system.

Did you run "pg_resetwal"?  If yes, that probably led to data corruption.

The above are just guesses.  Anyway, there is no good way to get rid of the files
that were left behind after the crash.  The reliable way of doing so is also the way
to get rid of potential data corruption caused by "cleaning down" the database:
pg_dump the whole thing and restore the dump to a new, clean cluster.

Yes, that will be a painfully long down time.  An alternative is to restore a backup
taken before the crash.

Yours,
Laurenz Albe


Re: filesystem full during vacuum - space recovery issues

From: Thomas Simpson

Thanks Laurenz & Imran for your comments.

My responses inline below.

Thanks

Tom


On 15-Jul-2024 20:58, Laurenz Albe wrote:
On Mon, 2024-07-15 at 14:47 -0400, Thomas Simpson wrote:
I have a large database (multi TB) which had a vacuum full running but the database
ran out of space during the rebuild of one of the large data tables.

Cleaning down the WAL files got the database restarted (an archiving problem led to
the initial disk full).

However, the disk space is still at 99% as it appears the large table rebuild files
are still hanging around using space and have not been deleted.

My problem now is how do I get this space back to return my free space back to where
it should be?

I tried some scripts to map the data files to relations but this didn't work as
removing some files led to startup failure despite them appearing to be unrelated
to anything in the database - I had to put them back and then startup worked.

Any suggestions here?
That reads like the sad old story: "cleaning down" WAL files - you mean deleting the
very files that would have enabled PostgreSQL to recover from the crash that was
caused by the full file system.

Did you run "pg_resetwal"?  If yes, that probably led to data corruption.

No, I just removed the excess already-archived WALs to get space and restarted.  The vacuum full that was running had created files for the large table it was processing, and these are still hanging around eating space without doing anything useful.  The shutdown prevented the rollback from cleanly removing them, which seems to be the core problem.

The above are just guesses.  Anyway, there is no good way to get rid of the files
that were left behind after the crash.  The reliable way of doing so is also the way
to get rid of potential data corruption caused by "cleaning down" the database:
pg_dump the whole thing and restore the dump to a new, clean cluster.

Yes, that will be a painfully long down time.  An alternative is to restore a backup
taken before the crash.

My issue now is the dump & reload is taking a huge time; I know the hardware is capable of multi-GB/s throughput but the reload is taking a long time - projected to be about 10 days to reload at the current rate (about 30Mb/sec).  The old server and new server have a 10G link between them and storage is SSD backed, so the hardware is capable of much much more than it is doing now.

Is there a way to improve the reload performance?  Tuning of any type - even if I need to undo it later once the reload is done.

My backups were in progress when all the issues happened, so they're not such a good starting point and I'd actually prefer the clean reload since this DB has been through multiple upgrades (without reloads) until now so I know it's not especially clean.  The size has always prevented the full reload before but the database is relatively low traffic now so I can afford some time to reload, but ideally not 10 days.

Yours,
Laurenz Albe

Re: filesystem full during vacuum - space recovery issues

From: Ron Johnson
On Wed, Jul 17, 2024 at 9:26 AM Thomas Simpson <ts@talentstack.to> wrote:
[snip] 

My issue now is the dump & reload is taking a huge time; I know the hardware is capable of multi-GB/s throughput but the reload is taking a long time - projected to be about 10 days to reload at the current rate (about 30Mb/sec).  The old server and new server have a 10G link between them and storage is SSD backed, so the hardware is capable of much much more than it is doing now.

Is there a way to improve the reload performance?  Tuning of any type - even if I need to undo it later once the reload is done.

That would, of course, depend on what you're currently doing.  pg_dumpall of a Big Database is certainly suboptimal compared to "pg_dump -Fd --jobs=24".

This is what I run (which I got mostly from a databasesoup.com blog post) on the target instance before doing "pg_restore -Fd --jobs=24":
declare -i CheckPoint=30
declare -i SharedBuffs=32
declare -i MaintMem=3
declare -i MaxWalSize=36
declare -i WalBuffs=64
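# TimeOut (seconds for pg_ctl to wait) is assumed to be set elsewhere, e.g. declare -i TimeOut=300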
pg_ctl restart -wt$TimeOut -mfast \
        -o "-c hba_file=$PGDATA/pg_hba_maintmode.conf" \
        -o "-c fsync=off" \
        -o "-c log_statement=none" \
        -o "-c log_temp_files=100kB" \
        -o "-c log_checkpoints=on" \
        -o "-c log_min_duration_statement=120000" \
        -o "-c shared_buffers=${SharedBuffs}GB" \
        -o "-c maintenance_work_mem=${MaintMem}GB" \
        -o "-c synchronous_commit=off" \
        -o "-c archive_mode=off" \
        -o "-c full_page_writes=off" \
        -o "-c checkpoint_timeout=${CheckPoint}min" \
        -o "-c max_wal_size=${MaxWalSize}GB" \
        -o "-c wal_level=minimal" \
        -o "-c max_wal_senders=0" \
        -o "-c wal_buffers=${WalBuffs}MB" \
        -o "-c autovacuum=off" 

After the pg_restore -Fd --jobs=24 and vacuumdb --analyze-only --jobs=24:
pg_ctl stop -wt$TimeOut && pg_ctl start -wt$TimeOut

Of course, these parameter values were for my hardware.

My backups were in progress when all the issues happened, so they're not such a good starting point and I'd actually prefer the clean reload since this DB has been through multiple upgrades (without reloads) until now so I know it's not especially clean.  The size has always prevented the full reload before but the database is relatively low traffic now so I can afford some time to reload, but ideally not 10 days.

Yours,
Laurenz Albe

Re: filesystem full during vacuum - space recovery issues

From: Thomas Simpson

Thanks Ron for the suggestions - I applied some of the settings which helped throughput a little bit but were not an ideal solution for me - let me explain.

Due to the size, I do not have the option to use the directory mode (or anything that uses disk space) for dump as that creates multiple directories (hence why it can do multiple jobs).  I do not have the several hundred TB of space to hold the output and there is no practical way to get it, especially for a transient reload.

I have my original server plus my replica; as the replica also applied the WALs, it too filled up and went down.  I've basically recreated this as a primary server and am using a pipeline to dump from the original into this as I know that has enough space for the final loaded database and should have space left over from the clean rebuild (whereas the original server still has space exhausted due to the leftover files).
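
The pipeline itself is nothing exotic - roughly the following (host name is a placeholder):

    # dump straight from the old server into the rebuilt primary, no intermediate files
    pg_dumpall | psql -d postgres -h newpg
    # or, via netcat instead of a direct libpq connection:
    #   new server:  nc -l 9000 | psql -d postgres     (some netcat variants want: nc -l -p 9000)
    #   old server:  pg_dumpall | nc newpg 9000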

Incidentally, this state is also why going to a backup is not helpful either as the restore and then re-apply the WALs would just end up filling the disk and recreating the original problem.

Even with the improved throughput, current calculations are pointing to almost 30 days to recreate the database through dump and reload which is a pretty horrible state to be in.

I think this is perhaps an area of improvement - especially as larger PostgreSQL databases become more common, I'm not the only person who could face this issue.

Perhaps an additional dumpall mode that generates multiple output pipes (I'm piping via netcat to the other server) - it would need to combine with multiple listening streams too and some degree of ordering/feedback to get to the essentially serialized output from the current dumpall.  But this feels like PostgreSQL expert developer territory.

Thanks

Tom


On 17-Jul-2024 09:49, Ron Johnson wrote:
On Wed, Jul 17, 2024 at 9:26 AM Thomas Simpson <ts@talentstack.to> wrote:
---8<--snip,snip---8<---
That would, of course, depend on what you're currently doing.  pg_dumpall of a Big Database is certainly suboptimal compared to "pg_dump -Fd --jobs=24".

This is what I run (which I got mostly from a databasesoup.com blog post) on the target instance before doing "pg_restore -Fd --jobs=24":
declare -i CheckPoint=30
declare -i SharedBuffs=32
declare -i MaintMem=3
declare -i MaxWalSize=36
declare -i WalBuffs=64
pg_ctl restart -wt$TimeOut -mfast \
        -o "-c hba_file=$PGDATA/pg_hba_maintmode.conf" \
        -o "-c fsync=off" \
        -o "-c log_statement=none" \
        -o "-c log_temp_files=100kB" \
        -o "-c log_checkpoints=on" \
        -o "-c log_min_duration_statement=120000" \
        -o "-c shared_buffers=${SharedBuffs}GB" \
        -o "-c maintenance_work_mem=${MaintMem}GB" \
        -o "-c synchronous_commit=off" \
        -o "-c archive_mode=off" \
        -o "-c full_page_writes=off" \
        -o "-c checkpoint_timeout=${CheckPoint}min" \
        -o "-c max_wal_size=${MaxWalSize}GB" \
        -o "-c wal_level=minimal" \
        -o "-c max_wal_senders=0" \
        -o "-c wal_buffers=${WalBuffs}MB" \
        -o "-c autovacuum=off" 

After the pg_restore -Fd --jobs=24 and vacuumdb --analyze-only --jobs=24:
pg_ctl stop -wt$TimeOut && pg_ctl start -wt$TimeOut

Of course, these parameter values were for my hardware.

My backups were in progress when all the issues happened, so they're not such a good starting point and I'd actually prefer the clean reload since this DB has been through multiple upgrades (without reloads) until now so I know it's not especially clean.  The size has always prevented the full reload before but the database is relatively low traffic now so I can afford some time to reload, but ideally not 10 days.

Yours,
Laurenz Albe

Re: filesystem full during vacuum - space recovery issues

From: Ron Johnson
There's no free lunch, and you can't squeeze blood from a turnip.

Single-threading will ALWAYS be slow: if you want speed, temporarily throw more hardware at it: specifically another disk (and possibly more RAM and CPU).

On Thu, Jul 18, 2024 at 9:55 AM Thomas Simpson <ts@talentstack.to> wrote:

Thanks Ron for the suggestions - I applied some of the settings which helped throughput a little bit but were not an ideal solution for me - let me explain.

Due to the size, I do not have the option to use the directory mode (or anything that uses disk space) for dump as that creates multiple directories (hence why it can do multiple jobs).  I do not have the several hundred TB of space to hold the output and there is no practical way to get it, especially for a transient reload.

I have my original server plus my replica; as the replica also applied the WALs, it too filled up and went down.  I've basically recreated this as a primary server and am using a pipeline to dump from the original into this as I know that has enough space for the final loaded database and should have space left over from the clean rebuild (whereas the original server still has space exhausted due to the leftover files).

Incidentally, this state is also why going to a backup is not helpful either as the restore and then re-apply the WALs would just end up filling the disk and recreating the original problem.

Even with the improved throughput, current calculations are pointing to almost 30 days to recreate the database through dump and reload which is a pretty horrible state to be in.

I think this is perhaps an area of improvement - especially as larger PostgreSQL databases become more common, I'm not the only person who could face this issue.

Perhaps an additional dumpall mode that generates multiple output pipes (I'm piping via netcat to the other server) - it would need to combine with a multiple listening streams too and some degree of ordering/feedback to get to the essentially serialized output from the current dumpall.  But this feels like PostgreSQL expert developer territory.

Thanks

Tom


On 17-Jul-2024 09:49, Ron Johnson wrote:
On Wed, Jul 17, 2024 at 9:26 AM Thomas Simpson <ts@talentstack.to> wrote:
---8<--snip,snip---8<---
That would, of course, depend on what you're currently doing.  pg_dumpall of a Big Database is certainly suboptimal compared to "pg_dump -Fd --jobs=24".

This is what I run (which I got mostly from a databasesoup.com blog post) on the target instance before doing "pg_restore -Fd --jobs=24":
declare -i CheckPoint=30
declare -i SharedBuffs=32
declare -i MaintMem=3
declare -i MaxWalSize=36
declare -i WalBuffs=64
pg_ctl restart -wt$TimeOut -mfast \
        -o "-c hba_file=$PGDATA/pg_hba_maintmode.conf" \
        -o "-c fsync=off" \
        -o "-c log_statement=none" \
        -o "-c log_temp_files=100kB" \
        -o "-c log_checkpoints=on" \
        -o "-c log_min_duration_statement=120000" \
        -o "-c shared_buffers=${SharedBuffs}GB" \
        -o "-c maintenance_work_mem=${MaintMem}GB" \
        -o "-c synchronous_commit=off" \
        -o "-c archive_mode=off" \
        -o "-c full_page_writes=off" \
        -o "-c checkpoint_timeout=${CheckPoint}min" \
        -o "-c max_wal_size=${MaxWalSize}GB" \
        -o "-c wal_level=minimal" \
        -o "-c max_wal_senders=0" \
        -o "-c wal_buffers=${WalBuffs}MB" \
        -o "-c autovacuum=off" 

After the pg_restore -Fd --jobs=24 and vacuumdb --analyze-only --jobs=24:
pg_ctl stop -wt$TimeOut && pg_ctl start -wt$TimeOut

Of course, these parameter values were for my hardware.

My backups were in progress when all the issues happened, so they're not such a good starting point and I'd actually prefer the clean reload since this DB has been through multiple upgrades (without reloads) until now so I know it's not especially clean.  The size has always prevented the full reload before but the database is relatively low traffic now so I can afford some time to reload, but ideally not 10 days.

Yours,
Laurenz Albe

Re: filesystem full during vacuum - space recovery issues

From: Paul Smith*
On 15/07/2024 19:47, Thomas Simpson wrote:
>
> My problem now is how do I get this space back to return my free space 
> back to where it should be?
>
> I tried some scripts to map the data files to relations but this 
> didn't work as removing some files led to startup failure despite them 
> appearing to be unrelated to anything in the database - I had to put 
> them back and then startup worked.
>
I don't know what you tried to do

What would normally happen on a failed VACUUM FULL that fills up the 
disk so the server crashes is that there are loads of data files 
containing the partially rebuilt table. Nothing 'internal' to PostgreSQL 
will point to those files as the internal pointers all change to the new 
table in an ACID way, so you should be able to delete them.

You can usually find these relatively easily by looking in the relevant 
tablespace directory for the base filename for a new huge table (lots 
and lots of files with the same base name - eg looking for files called 
*.1000 will find you base filenames for relations over about 1TB) and 
checking to see if pg_filenode_relation() can't turn the filenode into a 
relation. If that's the case that they're not currently in use for a 
relation, then you should be able to just delete all those files
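
For example, something along these lines (the database name and the filenode number are just placeholders):

    # in the tablespace/database directory: base filenames of relations over ~1TB
    ls *.1000
    # ask PostgreSQL whether a suspect filenode still maps to a relation;
    # an empty result suggests an orphan left behind by the failed VACUUM FULL
    psql -d yourdb -Atc "SELECT pg_filenode_relation(0, 1234567)"   # 0 = the default tablespace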

Is this what you tried, or did your 'script to map data files to 
relations' do something else? You were a bit ambiguous about that part 
of things.

Paul




Re: filesystem full during vacuum - space recovery issues

From: Thomas Simpson


On 18-Jul-2024 11:19, Paul Smith* wrote:
On 15/07/2024 19:47, Thomas Simpson wrote:

My problem now is how do I get this space back to return my free space back to where it should be?

I tried some scripts to map the data files to relations but this didn't work as removing some files led to startup failure despite them appearing to be unrelated to anything in the database - I had to put them back and then startup worked.

I don't know what you tried to do

What would normally happen on a failed VACUUM FULL that fills up the disk so the server crashes is that there are loads of data files containing the partially rebuilt table. Nothing 'internal' to PostgreSQL will point to those files as the internal pointers all change to the new table in an ACID way, so you should be able to delete them.

You can usually find these relatively easily by looking in the relevant tablespace directory for the base filename for a new huge table (lots and lots of files with the same base name - eg looking for files called *.1000 will find you base filenames for relations over about 1TB) and checking to see if pg_filenode_relation() can't turn the filenode into a relation. If that's the case that they're not currently in use for a relation, then you should be able to just delete all those files

Is this what you tried, or did your 'script to map data files to relations' do something else? You were a bit ambiguous about that part of things.

[BTW, v9.6 which I know is old but this server is stuck there]

Yes, I was querying relfilenode from pg_class to get the filename (integer) and then comparing a directory listing for files which did not match the relfilenode as candidates to remove.

I moved these elsewhere (i.e. not delete, just move out the way so I could move them back in case of trouble).

Without these apparently unrelated files, the database did not start and complained about them being missing, so I had to put them back.  This was despite not finding any reference to the filename/number in pg_class.
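
The comparison was roughly the following (paths and database name are placeholders).  One likely gap: mapped system catalogs keep pg_class.relfilenode = 0 and store their real filenode in the relation mapper, so their files never match a query like this, which could explain why some "unmatched" files were still required.

    psql -d yourdb -Atc "SELECT relfilenode FROM pg_class WHERE relfilenode <> 0" | sort -u > known_filenodes
    ls /path/to/base/<dboid> | sed 's/_.*//; s/\..*//' | sort -u > files_on_disk
    comm -13 known_filenodes files_on_disk   # filenodes on disk with no pg_class match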

At that point I gave up since I cannot afford to make the problem worse!

I know I'm stuck with the slow rebuild at this point.  However, I doubt I am the only person in the world that needs to dump and reload a large database.  My thought is this is a weak point for PostgreSQL so it makes sense to consider ways to improve the dump reload process, especially as it's the last-resort upgrade path recommended in the upgrade guide and the general fail-safe route to get out of trouble.

Thanks

Tom


Paul



Re: filesystem full during vacuum - space recovery issues

From: Ron Johnson
On Thu, Jul 18, 2024 at 3:01 PM Thomas Simpson <ts@talentstack.to> wrote:
[snip]

[BTW, v9.6 which I know is old but this server is stuck there]

[snip] 

I know I'm stuck with the slow rebuild at this point.  However, I doubt I am the only person in the world that needs to dump and reload a large database.  My thought is this is a weak point for PostgreSQL so it makes sense to consider ways to improve the dump reload process, especially as it's the last-resort upgrade path recommended in the upgrade guide and the general fail-safe route to get out of trouble.

 No database does fast single-threaded backups.

Re: filesystem full during vacuum - space recovery issues

From: Thomas Simpson


On 18-Jul-2024 16:32, Ron Johnson wrote:
On Thu, Jul 18, 2024 at 3:01 PM Thomas Simpson <ts@talentstack.to> wrote:
[snip]

[BTW, v9.6 which I know is old but this server is stuck there]

[snip] 

I know I'm stuck with the slow rebuild at this point.  However, I doubt I am the only person in the world that needs to dump and reload a large database.  My thought is this is a weak point for PostgreSQL so it makes sense to consider ways to improve the dump reload process, especially as it's the last-resort upgrade path recommended in the upgrade guide and the general fail-safe route to get out of trouble.

 No database does fast single-threaded backups.

Agreed.  My thought is that it should be possible for a 'new dumpall' to be multi-threaded.

Something like :

* Set number of threads on 'source' (perhaps by querying a listening destination for how many threads it is prepared to accept via a control port)

* Select each database in turn

* Organize the tables which do not have references themselves

* Send each table separately in each thread (or queue them until a thread is available)  ('Stage 1')

* Rendezvous stage 1 completion (pause sending, wait until feedback from destination confirming all completed) so we have a known consistent state that is safe to proceed to subsequent tables

* Work through tables that do refer to the previously sent in the same way (since the tables they reference exist and have their data) ('Stage 2')

* Repeat progressively until all tables are done ('Stage 3', 4 etc. as necessary)

The current dumpall is essentially doing this table organization currently [minus stage checkpoints/multi-thread] otherwise the dump/load would not work.  It may even be doing a lot of this for 'directory' mode?  The change here is organizing n threads to process them concurrently where possible and coordinating the pipes so they only send data which can be accepted.

The destination would need to have a multi-thread listen and co-ordinate with the sender on some control channel so feed back completion of each stage.

Something like a destination host and control channel port to establish the pipes and create additional netcat pipes on incremental ports above the control port for each thread used.

Dumpall seems like it could be a reasonable start point since it is already doing the complicated bits of serializing the dump data so it can be consistently loaded.
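
A crude manual approximation of 'Stage 1' is possible today for tables with no dependencies between them - table names, port numbers and the host below are placeholders, and note that separate pg_dump invocations do not share one snapshot, so the source has to be quiet:

    # new server: one listener per stream (schema pre-created, e.g. from pg_dumpall --schema-only)
    nc -l 9001 | psql -d bigdb &
    nc -l 9002 | psql -d bigdb &
    # old server: one data-only stream per independent table
    pg_dump -d bigdb --data-only -t big_table_1 | nc newpg 9001 &
    pg_dump -d bigdb --data-only -t big_table_2 | nc newpg 9002 &
    wait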

Probably not really an admin question at this point, more a feature enhancement.

Is there anything fundamentally wrong that someone with more intimate knowledge of dumpall could point out?

Thanks

Tom


Re: filesystem full during vacuum - space recovery issues

From: Ron Johnson
Multi-threaded writing to the same giant text file won't work too well, when all the data for one table needs to be together.

Just temporarily add another disk for backups.

On Thu, Jul 18, 2024 at 4:55 PM Thomas Simpson <ts@talentstack.to> wrote:


On 18-Jul-2024 16:32, Ron Johnson wrote:
On Thu, Jul 18, 2024 at 3:01 PM Thomas Simpson <ts@talentstack.to> wrote:
[snip]

[BTW, v9.6 which I know is old but this server is stuck there]

[snip] 

I know I'm stuck with the slow rebuild at this point.  However, I doubt I am the only person in the world that needs to dump and reload a large database.  My thought is this is a weak point for PostgreSQL so it makes sense to consider ways to improve the dump reload process, especially as it's the last-resort upgrade path recommended in the upgrade guide and the general fail-safe route to get out of trouble.

 No database does fast single-threaded backups.

Agreed.  My thought is that is should be possible for a 'new dumpall' to be multi-threaded.

Something like :

* Set number of threads on 'source' (perhaps by querying a listening destination for how many threads it is prepared to accept via a control port)

* Select each database in turn

* Organize the tables which do not have references themselves

* Send each table separately in each thread (or queue them until a thread is available)  ('Stage 1')

* Rendezvous stage 1 completion (pause sending, wait until feedback from destination confirming all completed) so we have a known consistent state that is safe to proceed to subsequent tables

* Work through tables that do refer to the previously sent in the same way (since the tables they reference exist and have their data) ('Stage 2')

* Repeat progressively until all tables are done ('Stage 3', 4 etc. as necessary)

The current dumpall is essentially doing this table organization currently [minus stage checkpoints/multi-thread] otherwise the dump/load would not work.  It may even be doing a lot of this for 'directory' mode?  The change here is organizing n threads to process them concurrently where possible and coordinating the pipes so they only send data which can be accepted.

The destination would need to have a multi-thread listen and co-ordinate with the sender on some control channel so feed back completion of each stage.

Something like a destination host and control channel port to establish the pipes and create additional netcat pipes on incremental ports above the control port for each thread used.

Dumpall seems like it could be a reasonable start point since it is already doing the complicated bits of serializing the dump data so it can be consistently loaded.

Probably not really an admin question at this point, more a feature enhancement.

Is there anything fundamentally wrong that someone with more intimate knowledge of dumpall could point out?

Thanks

Tom


[Added cross post to pgsql-hackers@lists.postgresql.org - background is a multi-TB database that needs to be recovered via pg_dumpall & reload, thoughts on ways to make pg_dump scale to multi-thread to expedite loading to a new cluster.  Straight dump to a file is impractical as the dump will be >200TB; hackers may be a better home for the discussion than the current admin list]

Hi Ron

On 18-Jul-2024 18:41, Ron Johnson wrote:
Multi-threaded writing to the same giant text file won't work too well, when all the data for one table needs to be together.

Just temporarily add another disk for backups.

For clarity, I'm not proposing multi-threaded writing to one file; the proposal is a new special mode which specifically makes multiple output streams across *network sockets* to a listener on the other side.  The goal is avoiding any files at all and only using multiple network streams to gain multi-threaded processing, with some co-ordination to keep things organized and consistent.

This would really be specifically for the use-case of dump/reload upgrade or recreate rather than everyday use.  And particularly for very large databases.

Looking at pg_dump.c it's doing the baseline organization but the extension would be adding the required coordination with the destination.  So, for a huge table (I have many) these would go in different streams but if there is a dependency (FK relations etc) the checkpoint needs to ensure those are met before proceeding.  Worst case scenario it would end up using only 1 thread but it would be very unusual to have a database where every table depends on another table all the way down.

In theory at least, some gains should be achieved for typical databases where a degree of parallelism is possible.
Thanks

Tom



On Thu, Jul 18, 2024 at 4:55 PM Thomas Simpson <ts@talentstack.to> wrote:


On 18-Jul-2024 16:32, Ron Johnson wrote:
On Thu, Jul 18, 2024 at 3:01 PM Thomas Simpson <ts@talentstack.to> wrote:
[snip]

[BTW, v9.6 which I know is old but this server is stuck there]

[snip] 

I know I'm stuck with the slow rebuild at this point.  However, I doubt I am the only person in the world that needs to dump and reload a large database.  My thought is this is a weak point for PostgreSQL so it makes sense to consider ways to improve the dump reload process, especially as it's the last-resort upgrade path recommended in the upgrade guide and the general fail-safe route to get out of trouble.

 No database does fast single-threaded backups.

Agreed.  My thought is that is should be possible for a 'new dumpall' to be multi-threaded.

Something like :

* Set number of threads on 'source' (perhaps by querying a listening destination for how many threads it is prepared to accept via a control port)

* Select each database in turn

* Organize the tables which do not have references themselves

* Send each table separately in each thread (or queue them until a thread is available)  ('Stage 1')

* Rendezvous stage 1 completion (pause sending, wait until feedback from destination confirming all completed) so we have a known consistent state that is safe to proceed to subsequent tables

* Work through tables that do refer to the previously sent in the same way (since the tables they reference exist and have their data) ('Stage 2')

* Repeat progressively until all tables are done ('Stage 3', 4 etc. as necessary)

The current dumpall is essentially doing this table organization currently [minus stage checkpoints/multi-thread] otherwise the dump/load would not work.  It may even be doing a lot of this for 'directory' mode?  The change here is organizing n threads to process them concurrently where possible and coordinating the pipes so they only send data which can be accepted.

The destination would need to have a multi-thread listen and co-ordinate with the sender on some control channel so feed back completion of each stage.

Something like a destination host and control channel port to establish the pipes and create additional netcat pipes on incremental ports above the control port for each thread used.

Dumpall seems like it could be a reasonable start point since it is already doing the complicated bits of serializing the dump data so it can be consistently loaded.

Probably not really an admin question at this point, more a feature enhancement.

Is there anything fundamentally wrong that someone with more intimate knowledge of dumpall could point out?

Thanks

Tom


Re: filesystem full during vacuum - space recovery issues

From: Scott Ribe
1) Add new disk, use a new tablespace to move some big tables to it, to get back up and running
2) Replica server provisioned sufficiently for the db, pg_basebackup to it
3) Get streaming replication working
4) Switch over to new server

In other words, if you don't want terrible downtime, you need yet another server fully provisioned to be able to run your db.
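
Step 1 is roughly (paths and table names are placeholders):

    mkdir /newdisk/pg_spill && chown postgres:postgres /newdisk/pg_spill
    psql -d bigdb -c "CREATE TABLESPACE spill LOCATION '/newdisk/pg_spill'"
    psql -d bigdb -c "ALTER TABLE big_history SET TABLESPACE spill"   # rewrites the table onto the new disk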

--
Scott Ribe
scott_ribe@elevated-dev.com
https://www.linkedin.com/in/scottribe/



> On Jul 18, 2024, at 4:41 PM, Ron Johnson <ronljohnsonjr@gmail.com> wrote:
>
> Multi-threaded writing to the same giant text file won't work too well, when all the data for one table needs to be
together.
>
> Just temporarily add another disk for backups.
>
> On Thu, Jul 18, 2024 at 4:55 PM Thomas Simpson <ts@talentstack.to> wrote:
>
> On 18-Jul-2024 16:32, Ron Johnson wrote:
>> On Thu, Jul 18, 2024 at 3:01 PM Thomas Simpson <ts@talentstack.to> wrote:
>> [snip]
>> [BTW, v9.6 which I know is old but this server is stuck there]
>> [snip]
>> I know I'm stuck with the slow rebuild at this point.  However, I doubt I am the only person in the world that needs to dump and reload a large database.  My thought is this is a weak point for PostgreSQL so it makes sense to consider ways to improve the dump reload process, especially as it's the last-resort upgrade path recommended in the upgrade guide and the general fail-safe route to get out of trouble.
>>  No database does fast single-threaded backups.
> Agreed.  My thought is that it should be possible for a 'new dumpall' to be multi-threaded.
> Something like :
> * Set number of threads on 'source' (perhaps by querying a listening destination for how many threads it is prepared to accept via a control port)
> * Select each database in turn
> * Organize the tables which do not have references themselves
> * Send each table separately in each thread (or queue them until a thread is available)  ('Stage 1')
> * Rendezvous stage 1 completion (pause sending, wait until feedback from destination confirming all completed) so we have a known consistent state that is safe to proceed to subsequent tables
> * Work through tables that do refer to the previously sent in the same way (since the tables they reference exist and have their data) ('Stage 2')
> * Repeat progressively until all tables are done ('Stage 3', 4 etc. as necessary)
> The current dumpall is essentially doing this table organization currently [minus stage checkpoints/multi-thread] otherwise the dump/load would not work.  It may even be doing a lot of this for 'directory' mode?  The change here is organizing n threads to process them concurrently where possible and coordinating the pipes so they only send data which can be accepted.
> The destination would need to have a multi-thread listen and co-ordinate with the sender on some control channel so feed back completion of each stage.
> Something like a destination host and control channel port to establish the pipes and create additional netcat pipes on incremental ports above the control port for each thread used.
> Dumpall seems like it could be a reasonable start point since it is already doing the complicated bits of serializing the dump data so it can be consistently loaded.
> Probably not really an admin question at this point, more a feature enhancement.
> Is there anything fundamentally wrong that someone with more intimate knowledge of dumpall could point out?
> Thanks
> Tom
>




200TB... how do you currently back up your database?

On Fri, Jul 19, 2024 at 5:08 AM Thomas Simpson <ts@talentstack.to> wrote:

[Added cross post to pgsql-hackers@lists.postgresql.org - background is multi-TB database needs recovered via pgdumpall & reload, thoughts on ways to make pg_dump scale to multi-thread to expedite loading to a new cluster.  Straight dump to a file is impractical as the dump will be >200TB; hackers may be a better home for the discussion than current admin list]

Hi Ron

On 18-Jul-2024 18:41, Ron Johnson wrote:
Multi-threaded writing to the same giant text file won't work too well, when all the data for one table needs to be together.

Just temporarily add another disk for backups.

For clarity, I'm not proposing multi threaded writing to one file; the proposal is a new special mode which specifically makes multiple output streams across *network sockets* to a listener which is listening on the other side.  The goal is avoiding any files at all and only using multiple network streams to gain multi-threaded processing with some co-ordination to keep things organized and consistent.

This would really be specifically for the use-case of dump/reload upgrade or recreate rather than everyday use.  And particularly for very large databases.

Looking at pg_dump.c it's doing the baseline organization but the extension would be adding the required coordination with the destination.  So, for a huge table (I have many) these would go in different streams but if there is a dependency (FK relations etc) the checkpoint needs to ensure those are met before proceeding.  Worst case scenario it would end up using only 1 thread but it would be very unusual to have a database where every table depends on another table all the way down.

In theory at least, some gains should be achieved for typical databases where a degree of parallelism is possible.
Thanks

Tom



On Thu, Jul 18, 2024 at 4:55 PM Thomas Simpson <ts@talentstack.to> wrote:


On 18-Jul-2024 16:32, Ron Johnson wrote:
On Thu, Jul 18, 2024 at 3:01 PM Thomas Simpson <ts@talentstack.to> wrote:
[snip]

[BTW, v9.6 which I know is old but this server is stuck there]

[snip] 

I know I'm stuck with the slow rebuild at this point.  However, I doubt I am the only person in the world that needs to dump and reload a large database.  My thought is this is a weak point for PostgreSQL so it makes sense to consider ways to improve the dump reload process, especially as it's the last-resort upgrade path recommended in the upgrade guide and the general fail-safe route to get out of trouble.

 No database does fast single-threaded backups.

Agreed.  My thought is that is should be possible for a 'new dumpall' to be multi-threaded.

Something like :

* Set number of threads on 'source' (perhaps by querying a listening destination for how many threads it is prepared to accept via a control port)

* Select each database in turn

* Organize the tables which do not have references themselves

* Send each table separately in each thread (or queue them until a thread is available)  ('Stage 1')

* Rendezvous stage 1 completion (pause sending, wait until feedback from destination confirming all completed) so we have a known consistent state that is safe to proceed to subsequent tables

* Work through tables that do refer to the previously sent in the same way (since the tables they reference exist and have their data) ('Stage 2')

* Repeat progressively until all tables are done ('Stage 3', 4 etc. as necessary)

The current dumpall is essentially doing this table organization currently [minus stage checkpoints/multi-thread] otherwise the dump/load would not work.  It may even be doing a lot of this for 'directory' mode?  The change here is organizing n threads to process them concurrently where possible and coordinating the pipes so they only send data which can be accepted.

The destination would need to have a multi-thread listen and co-ordinate with the sender on some control channel so feed back completion of each stage.

Something like a destination host and control channel port to establish the pipes and create additional netcat pipes on incremental ports above the control port for each thread used.

Dumpall seems like it could be a reasonable start point since it is already doing the complicated bits of serializing the dump data so it can be consistently loaded.

Probably not really an admin question at this point, more a feature enhancement.

Is there anything fundamentally wrong that someone with more intimate knowledge of dumpall could point out?

Thanks

Tom


Do you actually have 100G networking between the nodes? Because if not, a single CPU should be able to saturate 10G.

Likewise the receiving end would need disk capable of keeping up. Which brings up the question, why not write to disk, but directly to the destination rather than write locally then copy?

Do you require dump-reload because of suspected corruption? That's a tough one. But if not, if the goal is just to get up and running on a new server, why not pg_basebackup, streaming replica, promote? That depends on the level of data modification activity being low enough that pg_basebackup can keep up with WAL as it's generated and apply it faster than new WAL comes in, but given that your server is currently keeping up with writing that much WAL and flushing that many changes, seems likely it would keep up as long as the network connection is fast enough. Anyway, in that scenario, you don't need to care how long pg_basebackup takes.

If you do need a dump/reload because of suspected corruption, the only thing I can think of is something like doing it a table at a time - partitioning would help here, if practical.
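
Roughly (host and directories are placeholders; on 9.6, -R writes a recovery.conf for streaming):

    pg_basebackup -h oldpg -D /pgdata/new -X stream -P -R
    pg_ctl -D /pgdata/new start      # standby catches up via streaming replication
    pg_ctl -D /pgdata/new promote    # cut over once it has caught up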




> On Jul 19, 2024, at 7:46 AM, Thomas Simpson <ts@talentstack.to> wrote:
>
> I realize some of the background was snipped on what I sent to the hacker list, I'll try to fill in the details.

I was gone from my computer for a day and lost track of the thread.

Perhaps logical replication could help you out here?


Hi Scott

On 19-Jul-2024 15:34, Scott Ribe wrote:
On Jul 19, 2024, at 7:46 AM, Thomas Simpson <ts@talentstack.to> wrote:

I realize some of the background was snipped on what I sent to the hacker list, I'll try to fill in the details.
I was gone from my computer for a day and lost track of the thread.

Perhaps logical replication could help you out here?

I'm not sure - perhaps, but at this point, I've got that dump/reload running and provided it completes ok (in about 20 days time at current rate), I'll be fine with this.

The database itself is essentially an archive of data and is no longer being added to at this point, so it's an annoyance for the rebuild time rather than a disaster.

[But incidentally, I am working on an even larger project which is likely to make this one seem small, so improvement around large databases is important to me.]

However, my thought is around how to avoid this issue in the future and to improve the experience for others faced with the dump-reload which is always the fall-back upgrade suggestion between versions.

Getting parallelism should be possible and the current pg_dump does that for directory mode from what I can see - making multiple threads etc.  According to parallel.c in pg_dump, it even looks like most of where my thought process was going is actually already there.

The extension should be adding synchronization/checkpointing between the generating dump and the receiving reload to ensure objects are not processed until all their requirements are already present in the new database.  This is all based around routing via network streams instead of the filesystem as currently happens.

Perhaps this is already in place since the restore can be done in parallel, so it must implement that ordering already?  If someone with a good understanding of dump is able to comment or even give suggestions, I'm not against making a first attempt at implementing something.

I see from git blame that Tom Lane did a bunch of work around the parallel dump back in 2020 - perhaps he could make suggestions either via private direct email or the list?

Thanks

Tom


Hi Doug

On 19-Jul-2024 17:21, Doug Reynolds wrote:
Thomas—

Why are you using logical backups for a database this large rather than a solution like pgBackRest?  Obviously you would if you are going to upgrade, but for operational use, that seems to be a slow choice.

In normal operation the server runs as a primary-replica pair and pgbackrest handles backups.  Right when disk space was used up, pgbackrest also took a backup during the failed vacuum, so going back to it (or anything earlier) would also roll forward the WALs for recovery to date and put me right back where I am now, running out of space part way through.

It's a pragmatic decision: trying various things short of the dump-reload would take a number of days for me to try and see if I could get them to work, with a high likelihood of needing to resort to dump-reload anyway.  I'd already tried a few file matching/moving exercises but they all prevented the database starting up, so I cut my losses and started the dump-reload this week instead of next week since there's a limited window before this becomes a larger problem.

My thoughts on improving pg_dump are to help make it a better tool for worst case scenarios like this for the future or for those that like the dump-reload as part of upgrades but have reasonable size databases.

Thanks

Tom



Doug

On Jul 19, 2024, at 4:26 PM, Thomas Simpson <ts@talentstack.to> wrote:



Hi Scott

On 19-Jul-2024 15:34, Scott Ribe wrote:
On Jul 19, 2024, at 7:46 AM, Thomas Simpson <ts@talentstack.to> wrote:

I realize some of the background was snipped on what I sent to the hacker list, I'll try to fill in the details.
I was gone from my computer for a day and lost track of the thread.

Perhaps logical replication could help you out here?

I'm not sure - perhaps, but at this point, I've got that dump/reload running and provided it completes ok (in about 20 days time at current rate), I'll be fine with this.

The database itself is essentially an archive of data so is no longer being added to at this point, so it's an annoyance for the rebuild time rather than a disaster.

[But incidentally, I am working on an even larger project which is likely to make this one seem small, so improvement around large databases is important to me.]

However, my thought is around how to avoid this issue in the future and to improve the experience for others faced with the dump-reload which is always the fall-back upgrade suggestion between versions.

Getting parallelism should be possible and the current pg_dump does that for directory mode from what I can see - making multiple threads etc.  according to parallel.c in pg_dump, it even looks like most of where my thought process was going is actually already there.

The extension should be adding synchronization/checkpointing between the generating dump and the receiving reload to ensure objects are not processed until all their requirements are already present in the new database.  This is all based around routing via network streams instead of the filesystem as currently happens.

Perhaps this is already in place since the restore can be done in parallel, so must need to implement that ordering already?  If someone with a good understanding of dump is able to comment or even give suggestions, I'm not against making an attempt to implement something as a first attempt.

I see Tom Lane from git blame did a bunch of work around the parallel dump back in 2020 - perhaps he could make suggestions either via private direct email or the list ?

Thanks

Tom


On Fri, Jul 19, 2024 at 10:19 PM Thomas Simpson <ts@talentstack.to> wrote:

Hi Doug

On 19-Jul-2024 17:21, Doug Reynolds wrote:
Thomas—

Why are you using logical backups for a database this large?  A solution like PgBackRest?  Obviously, if you are going to upgrade, but for operational use, that seems to be a slow choice.

In normal operation the server runs as a primary-replica and pgbackrest handles backups.

Expire the oldest pgbackrest backup set, so as to free up space for a multithreaded pg_dump.
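
For example (stanza name and retention are placeholders):

    pgbackrest --stanza=localhost expire --repo1-retention-full=1   # keep only the newest full backup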

  Right when disk space was used up, pgbackrest also took a backup during the failed vacuum so going back to it (or anything earlier) would also roll forward the WALs for recovery to date and put me right back where I am just now by running out of space part way through.


Who says you have to restore to the failure point?  That's what the "--target" option is for.

For example, if you took a full backup on 7/14 at midnight, and want to restore to 7/18 23:00, run:
declare LL=detail
declare PGData=/path/to/data
declare -i Threads=`nproc`-2
declare BackupSet=20240714-000003F
declare RestoreUntil="2024-07-18 23:00"
pgbackrest restore \
    --stanza=localhost \
    --log-level-file=$LL \
    --log-level-console=$LL \
    --process-max=${Threads} \
    --pg1-path=$PGData \
    --set=$BackupSet \
    --type=time --target="${RestoreUntil}"