Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: POC: Cleaning up orphaned files using undo logs |
Date | |
Msg-id | CA+TgmoZFco8ENtQv9wK47EHpL0XXCvbJ5D7YaFP4cMQUbCWW6w@mail.gmail.com Whole thread Raw |
In response to | Re: POC: Cleaning up orphaned files using undo logs (Thomas Munro <thomas.munro@gmail.com>) |
List | pgsql-hackers |
On Mon, Jul 1, 2019 at 3:54 AM Thomas Munro <thomas.munro@gmail.com> wrote: > Here's a new version. Here's a relatively complete review of 0019 and 0020 and a remark or two on the beginning of 0003. Regarding 0020: The documentation claims that undo data exists in a 64-bit address space divided into 2^34 undo logs, each with a theoretical capacity of 1TB, but that would require 74 bits. I am mildly suspicious that, on a busy system, the use of 1MB segment files could result in slowdowns due to frequent filesystem operations. We just recently made it more convenient to change the WAL segment size, mostly so that people on very busy systems could crank it up from 16MB to, say, 64MB or 256MB. It's true that the considerations are a bit different here, because undo logs don't have to be archived, and because we might be using many undo logs simultaneously rather than only 1 for the whole system, but it's still true that if you've got a bunch of backends blasting out undo at top speed, you're going to have to recycle files *extremely* quickly. How much performance testing have you done to assess the effect of segment size? Do you think there's an argument for making this 1MB size configurable at initdb-time? Or even variable at runtime, so that we use larger files if we're filling them up in < 100ms or whatever? I don't think the last paragraph is entirely accurate. The access method gets to control what records are written, but the general format of the records is fixed by the undo system. Perhaps the undo log code isn't what cares about that, but whether it's the undo log code or the undo access code or the undo processing code isn't likely to seem relevant to developers. Regarding 0019: I think there's a substantial amount of duplication between 0019 and 0020, and I'm not sure that we ought to have both. They both talk about the purpose of undo, the way the adddress space is divided, etc. I understand that it would be a little weird to include all of the information from 0019 in the user-facing documentation, and I also understand that it won't work to have no user-facing documentation at all, but it still seems a little odd to me. Possibly 0019 could refer to the SGML documentation for preliminaries and then add only those details that are not covered there. How could we avoid the limit on the total size of an active transaction mentioned here? And what would be the cost of such a scheme? If we've filled an undo log and moved on to another one, why can't we evict the one that's full and reuse the shared memory slot, bringing it back in later when required? I suspect the answer is that there is a locking rule involved. I think this README would be a good place to document things like locking rules, or a least to refer to where they are documented. I also think we should mull over whether we could relax the rule without too much pain. I expect that at least part of the problem is that somebody might have a pointer to an UndoLogSlot which could become stale if we recycle a slot, but that can already happen at least when the log is fully discarded, so maybe allowing it to happen in other cases wouldn't be too bad. I know you're laughing at me on the inside, worrying about a transaction that touches so many TB of data that it manages to exhaust all the undo log slots, but I don't think that's a completely crazy scenario. There are PB-scale databases out there, and it would be nice to think that PostgreSQL could capture more of those workloads. They will probably become more common over time. Reading the section on persistence levels and tablespaces makes me wonder what happens to address space that gets allocated to temporary and unlogged undo logs. It seems pretty important to make sure that we at least don't leak anything significant, and maybe that we actually recycle the address space or share it across backends. That is, if several backends are all writing temporary undo, there's no intrinsic reason why they can't all be using the same temporary undo logs, as long as the file naming works OK for that (e.g. if it follows the same pattern we use for relation names). Any undo logs that get allocated to unlogged undo can be recycled - either for unlogged undo or otherwise - after a crash, and any that are partially filled can be rewound. I don't know how much effort we're expending on any of that right now, but it seems like it would be worth discussing in this README, and possibly improving. When the undo log contents section mentions that "client code is responsible for stepping over the page headers and advancing to the next page," that's again a somewhat middle-of-the-patch stack perspective. I am not sure exactly how this should be phrased, but the point is that the client code we're talking about is not the AM but the next patch in the stack. I think developers will view the AM as the client and our wording probably ought to reflect that. "keepign" is not spelled correctly. A little later on, "checkpoin" is missing a letter. I think it would be worth mentioning how you solved the problem of inferring during recovery the position within the page where the record needs to be placed. The bit about checkpoint files written to pg_undo being potentially inconsistent is confusing. If the files are written before the checkpoint is completed, fsync'd, and not modified afterwards, how can they be inconsistent? Regarding 0003: UndoLogSharedData could use a more extensive comment. It's not very clear what low_logno and next_logno are, and it also seems like it would be worth mentioning how the free lists are linked. On a similar note, I think the file header comment ought to reference the undo README added by 0019 and perhaps also the documentation added by 0020, and I think 0019 and 0020 ought to be flattened into 0003. I meant to write more about 0003 before sending this, but I am out of time and it seems more useful to send what I have now than to wait until I have more... -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
pgsql-hackers by date: