Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers

From Robert Haas
Subject Re: POC: Cleaning up orphaned files using undo logs
Date
Msg-id CA+TgmoZFco8ENtQv9wK47EHpL0XXCvbJ5D7YaFP4cMQUbCWW6w@mail.gmail.com
Whole thread Raw
In response to Re: POC: Cleaning up orphaned files using undo logs  (Thomas Munro <thomas.munro@gmail.com>)
List pgsql-hackers
On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> Here's a new version.

Here's a relatively complete review of 0019 and 0020 and a remark or
two on the beginning of 0003.

Regarding 0020:

The documentation claims that undo data exists in a 64-bit address
space divided into 2^34 undo logs, each with a theoretical capacity of
1TB, but that would require 74 bits.

I am mildly suspicious that, on a busy system, the use of 1MB segment
files could result in slowdowns due to frequent filesystem operations.
We just recently made it more convenient to change the WAL segment
size, mostly so that people on very busy systems could crank it up
from 16MB to, say, 64MB or 256MB. It's true that the considerations
are a bit different here, because undo logs don't have to be archived,
and because we might be using many undo logs simultaneously rather
than only 1 for the whole system, but it's still true that if you've
got a bunch of backends blasting out undo at top speed, you're going
to have to recycle files *extremely* quickly.  How much performance
testing have you done to assess the effect of segment size? Do you
think there's an argument for making this 1MB size configurable at
initdb-time? Or even variable at runtime, so that we use larger files
if we're filling them up in < 100ms or whatever?

I don't think the last paragraph is entirely accurate.  The access
method gets to control what records are written, but the general
format of the records is fixed by the undo system. Perhaps the undo
log code isn't what cares about that, but whether it's the undo log
code or the undo access code or the undo processing code isn't likely
to seem relevant to developers.

Regarding 0019:

I think there's a substantial amount of duplication between 0019 and
0020, and I'm not sure that we ought to have both.  They both talk
about the purpose of undo, the way the adddress space is divided, etc.
I understand that it would be a little weird to include all of the
information from 0019 in the user-facing documentation, and I also
understand that it won't work to have no user-facing documentation at
all, but it still seems a little odd to me.  Possibly 0019 could refer
to the SGML documentation for preliminaries and then add only those
details that are not covered there.

How could we avoid the limit on the total size of an active
transaction mentioned here? And what would be the cost of such a
scheme? If we've filled an undo log and moved on to another one, why
can't we evict the one that's full and reuse the shared memory slot,
bringing it back in later when required?  I suspect the answer is that
there is a locking rule involved. I think this README would be a good
place to document things like locking rules, or a least to refer to
where they are documented. I also think we should mull over whether we
could relax the rule without too much pain.  I expect that at least
part of the problem is that somebody might have a pointer to an
UndoLogSlot which could become stale if we recycle a slot, but that
can already happen at least when the log is fully discarded, so maybe
allowing it to happen in other cases wouldn't be too bad.

I know you're laughing at me on the inside, worrying about a
transaction that touches so many TB of data that it manages to exhaust
all the undo log slots, but I don't think that's a completely crazy
scenario. There are PB-scale databases out there, and it would be nice
to think that PostgreSQL could capture more of those workloads.  They
will probably become more common over time.

Reading the section on persistence levels and tablespaces makes me
wonder what happens to address space that gets allocated to temporary
and unlogged undo logs. It seems pretty important to make sure that we
at least don't leak anything significant, and maybe that we actually
recycle the address space or share it across backends. That is, if
several backends are all writing temporary undo, there's no intrinsic
reason why they can't all be using the same temporary undo logs, as
long as the file naming works OK for that (e.g. if it follows the same
pattern we use for relation names). Any undo logs that get allocated
to unlogged undo can be recycled - either for unlogged undo or
otherwise - after a crash, and any that are partially filled can be
rewound. I don't know how much effort we're expending on any of that
right now, but it seems like it would be worth discussing in this
README, and possibly improving.

When the undo log contents section mentions that "client code is
responsible for stepping over the page headers and advancing to the
next page," that's again a somewhat middle-of-the-patch stack
perspective. I am not sure exactly how this should be phrased, but the
point is that the client code we're talking about is not the AM but
the next patch in the stack. I think developers will view the AM as
the client and our wording probably ought to reflect that.

"keepign" is not spelled correctly. A little later on, "checkpoin" is
missing a letter.

I think it would be worth mentioning how you solved the problem of
inferring during recovery the position within the page where the
record needs to be placed.

The bit about checkpoint files written to pg_undo being potentially
inconsistent is confusing.  If the files are written before the
checkpoint is completed, fsync'd, and not modified afterwards, how can
they be inconsistent?

Regarding 0003:

UndoLogSharedData could use a more extensive comment.  It's not very
clear what low_logno and next_logno are, and it also seems like it
would be worth mentioning how the free lists are linked.  On a similar
note, I think the file header comment ought to reference the undo
README added by 0019 and perhaps also the documentation added by 0020,
and I think 0019 and 0020 ought to be flattened into 0003.

I meant to write more about 0003 before sending this, but I am out of
time and it seems more useful to send what I have now than to wait
until I have more...

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: POC: Cleaning up orphaned files using undo logs
Next
From: Laurenz Albe
Date:
Subject: Re: pg_receivewal documentation