Re: POC: Cleaning up orphaned files using undo logs - Mailing list pgsql-hackers
From: Robert Haas
Subject: Re: POC: Cleaning up orphaned files using undo logs
Msg-id: CA+TgmoYhDt=nQeshk_pOJvhvsgfXdxLD3+g2+__FVtU_Yb4x7Q@mail.gmail.com
In response to: Re: POC: Cleaning up orphaned files using undo logs (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers
On Fri, Aug 23, 2019 at 2:04 AM Thomas Munro <thomas.munro@gmail.com> wrote:
> 2. Strict self-update-only: We could update it as part of transaction
> cleanup. That is, when you commit or abort, probably at some time when
> your transaction is still advertised as running, you go and update your
> own transaction header with the size. If you never reach that stage, I
> think you can fix it during crash recovery, during the startup scan that
> feeds the rollback request queues. That is, if you encounter a
> transaction header with length == 0, it must be the final one and its
> length is therefore known and can be updated, before you allow new
> transactions to begin. There are some complications like backends that
> exit without crashing, which I haven't thought about. As Amit just
> pointed out to me, that means that the update is not folded into the
> same buffer access as the next transaction, but perhaps you can mitigate
> that by not updating your header if the next header will be on the same
> page -- the next transaction can do it safely then (this page with the
> insert pointer on it can't be discarded). As Dilip just pointed out to
> me, it means that you always do an update that you might never need to
> do if the transaction is discarded, to which I have no answer. Bleugh.

Andres and I have spent a lot of time on the phone over the last couple of days, and I think we both kind of like this option. I don't think the costs are likely to be very significant: you're talking about pinning, locking, dirtying, unlocking, and unpinning one buffer at commit time, or maybe two if your transaction touched both logged and unlogged tables. If the transaction is short enough for that overhead to matter, that buffer is probably already in shared_buffers, is probably already dirty, and is probably already in your CPU's cache. So I think the overhead will turn out to be low.

Moreover, I doubt that we want to separately discard every transaction anyway. If you have very lightweight transactions, you don't want to add an extra WAL record per transaction: increasing the number of separate WAL records per transaction from, say, 5 to 6 would be a significant additional cost. You probably want to perform a discard, say, every 5 seconds, or sooner if you can discard at least 64kB of undo, or something of that sort. So we're not going to save the overhead of updating the previous transaction header often enough to make much difference, unless we're discarding so aggressively that we incur a much larger overhead elsewhere. I think.
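To make that throttling idea concrete, here is a minimal sketch of the kind of check a discard worker might apply. The function name and constants below are hypothetical, not taken from the patch:

    /*
     * Hypothetical sketch of the discard-throttling policy described
     * above: discard roughly every 5 seconds, or sooner if at least
     * 64kB of undo could be reclaimed.  Illustrative only.
     */
    #include <stdbool.h>
    #include <stdint.h>
    #include <time.h>

    #define DISCARD_INTERVAL_SECS   5
    #define DISCARD_MIN_BYTES       (64 * 1024)

    static bool
    undo_should_discard(time_t last_discard_time, uint64_t discardable_bytes)
    {
        time_t  now = time(NULL);

        /* Enough undo has accumulated: discard right away. */
        if (discardable_bytes >= DISCARD_MIN_BYTES)
            return true;

        /* Otherwise, discard at most once per interval. */
        if (now - last_discard_time >= DISCARD_INTERVAL_SECS)
            return true;

        return false;
    }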
I am a little concerned about the backends that exit without crashing. Andres seems to want to treat that case as a bug to be fixed, but I doubt whether that's going to be practical. We're really only talking about extreme corner cases here, because before_shmem_exit(ShutdownPostgres, 0) means we'll AbortOutOfAnyTransaction(), which should RecordTransactionAbort(). Only if we fail in AbortTransaction() prior to reaching RecordTransactionAbort() will we manage to reach the later cleanup stages without having written an abort record. I haven't scrutinized that code lately to see exactly how things can go wrong there, but there shouldn't be a whole lot. However, there are probably a few things, like maybe a really poorly-timed malloc() failure.

A zero-order solution would be to install a deadman switch. At on_shmem_exit time, you must detach from any undo log to which you are connected, so that somebody else can attach to it later. We can stick in a cross-check there that you haven't written any undo bytes to that log, and PANIC if you have. Then the system must be water-tight. Perhaps it's possible to do better: if we could identify the cases in which such logic gets reached, we could try to guarantee that WAL is written and the undo log safely detached before we get there. But at the very least we can promote ERROR/FATAL to PANIC in the relevant cases.

A better solution would be to detect the problem and make sure we recover from it before reusing the undo log. Suppose each undo log has three states: (1) nobody's attached, (2) somebody's attached, and (3) nobody's attached but the last record might need a fixup. When we start up, all undo logs are in state 3, and the discard worker runs around and puts them into state 1. Subsequently, they alternate between states 1 and 2 for as long as the system remains up. But if, as an exceptional case, we reach on_shmem_exit without having detached the undo log because of cascading failures, then we put the undo log in state 3. The discard worker already knows how to move undo logs from state 3 to state 1, and it can do the same thing here. Until it does, nobody else can reuse that undo log. I might be missing something, but I think that would nail this down pretty tightly.
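For illustration only, here is a minimal sketch of that three-state scheme as a state machine. The enum values and helper names are made up for this example and do not correspond to anything in the actual patch:

    /*
     * Illustrative sketch of the three undo-log states discussed above.
     */
    #include <stdbool.h>

    typedef enum UndoLogState
    {
        UNDO_LOG_FREE,          /* (1) nobody's attached */
        UNDO_LOG_ATTACHED,      /* (2) somebody's attached */
        UNDO_LOG_NEEDS_FIXUP    /* (3) nobody's attached, but the last
                                 * record might need a fixup */
    } UndoLogState;

    /* At startup, every undo log begins in state 3. */
    static UndoLogState
    undo_log_state_at_startup(void)
    {
        return UNDO_LOG_NEEDS_FIXUP;
    }

    /*
     * Backend exit: a clean detach frees the log for reuse immediately.
     * In the exceptional case (cascading failures left us attached), park
     * the log in the fixup state; nobody may reuse it until the discard
     * worker has repaired the last record.
     */
    static UndoLogState
    undo_log_state_at_exit(bool detached_cleanly)
    {
        return detached_cleanly ? UNDO_LOG_FREE : UNDO_LOG_NEEDS_FIXUP;
    }

    /* Only the discard worker moves a log from state 3 back to state 1. */
    static UndoLogState
    undo_log_state_after_fixup(void)
    {
        return UNDO_LOG_FREE;
    }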
--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company