Thread: Journal based VACUUM FULL

Journal based VACUUM FULL

From: Ryan David Sheasby
Date:
Hi Team.

New to contributing so hopefully this is the right place. I've searched the forum and it seems this is the place for feature requests/suggestions.

I was reading about VACUUM and VACUUM FULL and saw that the current implementation of VACUUM FULL writes to an entirely new file and then switches to it, as opposed to rewriting the current file in-place. I assume the reason for this is safety: if the database shuts down in the middle of the vacuum, the most you lose is the progress of the vacuum (and have to start from scratch), but otherwise the database retains its integrity and is not corrupted. This seems to be an effective but somewhat rudimentary system that could be improved. Most notably, rewriting almost the entire database takes up essentially double the storage for the duration of the rewrite, which can become expensive, or simply inconvenient, in IaaS (and probably other) installations where drive sizes are scaled on demand. Even if the VACUUM FULL doesn't need to run often, having to reallocate drive space for an effective duplication is not ideal. My suggestion is a journal-based in-place rewrite of the database files.

This means that VACUUM FULL would do a "pre-processing" pass over the database, figuring out at a fairly low level what operations need to be done to compact the database back to its "correct" size. These operations would be written in their entirety to a journal that records all the operations about to be performed, with some mechanism for checking whether they have already been performed, using the same principle described here: https://en.wikipedia.org/wiki/Journaling_file_system. This would allow an in-place rewrite of the file in a safe manner, such that you can recover from an unexpected shutdown by resuming the VACUUM FULL from the journal, or by detecting where the copy hole is in the file and recording/ignoring it.

The journal could be something as simple as a record of which byte ranges need to be copied into which other byte ranges. The journal should record whenever a byte-range copy completes, for the sake of error recovery. Obviously, each copied byte range would have a maximum size equal to the distance from its source to its destination, so that the destination never overwrites the source (otherwise recovery would be impossible: how would you know where in the copy you stopped?). However, this has a snowball effect: the further you are into the rewrite, the further apart the source and destination ranges become, so you can copy bigger chunks at a time and won't have to update the journal's completion flags as often.

In the case of a shutdown during a copy, you merely read the journal looking for the first copy that isn't completed yet, and continue rewriting from there. Even if some of the bytes have already been copied, there will be no corruption, as you haven't overwritten the source bytes at all. Finally, a simple file truncate can take place once all the copies are complete, and the journal can be deleted. This means the headroom required in the filesystem would be much smaller, and would pay for itself on any copy of 17 bytes or more (assuming two 8-byte pointers plus a bit as a completion flag). The only situation in which this system would consume more space than a total copy is if the database has more than 50% garbage, the garbage is perfectly spread out (i.e. isn't in large chunks that can be copied, and therefore journaled, at once), and each piece of garbage is smaller than 17 bytes. Obviously, the journal itself would need an error-checking mechanism to ensure it was correctly and completely written, but this could be as simple as a hash of the whole journal at its end.
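As a rough sketch of the planning and replay passes described above (Python for brevity; names like `plan_copies` and the 25-byte record layout are illustrative, not a proposed format):

```python
import os
import struct

# Illustrative on-disk record: src offset, dst offset, length (u64 each)
# plus a completion byte -- 25 bytes here rather than the 17 in the text,
# since this layout carries an explicit length field.
RECORD = struct.Struct("<QQQB")

def plan_copies(live_ranges):
    """Turn a sorted list of live (offset, length) ranges into journal
    records [src, dst, length, done] that slide each range down to its
    compacted position. Each record's length is capped at the current
    src-dst distance, so a copy's destination can never overlap (and
    clobber) its own not-yet-copied source bytes."""
    records = []
    dst = 0
    for src, length in live_ranges:
        while length > 0:
            gap = src - dst
            if gap == 0:            # this range is already in place
                dst += length
                break
            chunk = min(length, gap)
            records.append([src, dst, chunk, 0])
            src += chunk
            dst += chunk
            length -= chunk
    return records, dst             # dst == compacted file size

def apply_journal(path, records, new_size):
    """Replay every copy not yet flagged complete, then truncate. Safe to
    re-run after a crash: redoing a finished-but-unflagged copy is
    harmless because its source bytes are still intact."""
    with open(path, "r+b") as f:
        for rec in records:
            src, dst, length, done = rec
            if not done:
                f.seek(src)
                data = f.read(length)
                f.seek(dst)
                f.write(data)
                f.flush()
                os.fsync(f.fileno())
                rec[3] = 1          # in reality, persisted to the journal
        f.truncate(new_size)
```

For example, compacting a 16-byte file whose live data sits at offsets 0-3, 6-9 and 12-15 yields three journal records and a 12-byte result, and replaying the journal a second time changes nothing.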

An alternative to the completion flags is to compute a hash of the data to be copied and store it in the journal; during recovery you can then compare the destination with the hash. This has the advantage of not needing to write to the journal to keep it up to date during the operations, but the disadvantages of having to compute many hashes while recovering, and of storing the hashes in the journal, which takes up more space. It's also arguably less safe, as there is always the chance (albeit extremely unlikely) of a collision, which would mean the data is not actually validated. I would argue the risk of this is lower than the risk of bit-rot flipping the completion bit, however.
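A minimal sketch of that hash-based recovery check, assuming the journal stores a SHA-256 digest per source range (the function name is invented for illustration):

```python
import hashlib

def copy_is_done(f, dst, length, expected_digest):
    """Recovery check for the hash variant: treat a copy as complete iff
    the destination bytes already match the digest recorded in the
    journal, so no completion flags are written while vacuuming runs."""
    f.seek(dst)
    return hashlib.sha256(f.read(length)).digest() == expected_digest
```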

A journaling system like this *might* have performance benefits too, specifically on less intelligent filesystems like NTFS, which can become easily fragmented (causing potentially big performance issues on spinning disks). Rewriting the same file will never require a filesystem defragment. The other advantage, as mentioned before, is with auto-scaling drives if they are used as storage for the DB (as they often are in IaaS/PaaS services). Not having to scale up rapidly could be a performance boost in some cases.

Finally, a journaling system like this would also lend itself to stopping/resuming in the middle of the VACUUM FULL. Once the journal is created and the rewrites have started, assuming the journal's "completion" flags are kept up to date, you can stop the operation in the middle (presumably filling the current "gap" with null bytes, or otherwise indicating to the DB that there's a gap in the middle that should be ignored) and continue using the database as usual. This means you can do a "soft" recovery wherein the database is only halfway vacuumed but still perfectly operational and functional. You can also resume from this soft recovery by simply continuing from the last copy that was completed. Obviously you only regain disk space when you reach the end and truncate the file, but you are at least able to pause/resume the operation, waiting only for the current copy block to finish instead of for the entire VACUUM FULL.
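Assuming each journal record carries a completion flag, the resume point is simply the first unfinished copy (sketch; a record layout of [src, dst, length, done] is assumed):

```python
def resume_index(records):
    """Index of the first copy not yet flagged complete -- where a paused
    or interrupted VACUUM FULL picks up again."""
    for i, (_src, _dst, _length, done) in enumerate(records):
        if not done:
            return i
    return len(records)  # all copies done; only the final truncate remains
```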

I hope this was a coherent explanation of my suggestion. It's possible and maybe even likely that there's a glaring misunderstanding or assumption on my part that means this isn't practical, but I'd still love to get feedback on it.

Thanks
Ryan Sheasby

Re: Journal based VACUUM FULL

From: Andreas Karlsson
Date:
On 2/21/19 12:16 AM, Ryan David Sheasby wrote:
> I was reading on VACUUM and VACUUM FULL and saw that the current 
> implementation of VACUUM FULL writes to an entirely new file and then 
> switches to it, as opposed to rewriting the current file in-place. I 
> assume the reason for this is safety in the event the database shuts 
> down in the middle of the vacuum, the most you will lose is the progress 
> of the vacuum and have to start from scratch but otherwise the database 
> will retain its integrity and not become corrupted. This seems to be an 
> effective but somewhat rudimentary system that could be improved. Most 
> notably, the rewriting of almost the entire database takes up basically 
> double the storage during the duration of the rewrite which can become 
> expensive or even simply inconvenient in IaaS(and probably other) 
> installations where the drive sizes are scaled on demand. Even if the 
> VACUUM FULL doesn't need to run often, having to reallocate drive space 
> for an effective duplication is not ideal. My suggestion is a journal 
> based in-place rewrite of the database files.

Hi,

VACUUM FULL used to modify the table in-place in PostgreSQL 8.4 and 
earlier, but that solution was slow and often caused plenty of index 
bloat while moving the rows around in the table, which is why PostgreSQL 
9.0 switched it to rewriting the whole table and its indexes.

I have not heard many requests for bringing back the old behavior, but 
I could easily have missed them. Either way I do not think there would 
be much demand for an in-place VACUUM FULL unless the index bloat 
problem is also solved.

Additionally I do not think that the project would want a whole new kind 
of infrastructure just to solve this very narrow case. PostgreSQL 
already has its own journal (the write-ahead log) which is used to 
ensure crash safety, and I think any proposed solution for this would 
need to use the WAL.

Andreas


Re: Journal based VACUUM FULL

From: Ryan David Sheasby
Date:
Thanks for getting back to me. I had a small discussion with @sfrost on the Slack team and understand the issue better now. I must admit I didn't realize that the scope of the WAL extended to VACUUM operations, which is why I suggested a new journaling system. I realize now the issue is not safety (as the WAL already sorts that out), but performance. I will rethink my suggestion and let you know if I come up with a useful/performant way of doing this.

Thanks
Ryan Sheasby




Re: Journal based VACUUM FULL

From: Andres Freund
Date:
Hi,

On 2019-02-21 16:27:06 +0100, Andreas Karlsson wrote:
> I have not heard many requests for bringing back the old behavior, but I
> could easily have missed them. Either way I do not think there would be much
> demand for an in-place VACUUM FULL unless the index bloat problem is also
> solved.

Yea, I don't think there's much either. What I think there's PLENTY need
for is something like pg_repack in core.  And one could argue that the
trigger-based logging it does, to catch up with changes made to the old
table concurrently with the rewrite, is a form of journaling...

Greetings,

Andres Freund


Re: Journal based VACUUM FULL

From: Andrew Dunstan
Date:
On 2/22/19 2:15 PM, Andres Freund wrote:
> Hi,
>
> On 2019-02-21 16:27:06 +0100, Andreas Karlsson wrote:
>> I have not heard many requests for bringing back the old behavior, but I
>> could easily have missed them. Either way I do not think there would be much
>> demand for an in-place VACUUM FULL unless the index bloat problem is also
>> solved.
> Yea, I don't think there's much either. What I think there's PLENTY need
> for is something like pg_repack in core.  And could argue that the
> trigger based logging it does to catch up to changes made concurrently
> with the rewrite, to the old table, is a form of journaling...
>

+1. Maybe this is something that should be on the agenda of the next
developers' meeting.


cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Journal based VACUUM FULL

From: Stephen Frost
Date:
Greetings,

* Andrew Dunstan (andrew.dunstan@2ndquadrant.com) wrote:
> On 2/22/19 2:15 PM, Andres Freund wrote:
> > On 2019-02-21 16:27:06 +0100, Andreas Karlsson wrote:
> >> I have not heard many requests for bringing back the old behavior, but I
> >> could easily have missed them. Either way I do not think there would be much
> >> demand for an in-place VACUUM FULL unless the index bloat problem is also
> >> solved.
> > Yea, I don't think there's much either. What I think there's PLENTY need
> > for is something like pg_repack in core.  And could argue that the
> > trigger based logging it does to catch up to changes made concurrently
> > with the rewrite, to the old table, is a form of journaling...
>
> +1. Maybe this is something that should be on the agenda of the next
> developers' meeting.

Seems more appropriate to the developer unconference, though perhaps
that's what you meant..?

Thanks!

Stephen
