Thread: Idea for nested transactions / savepoints
I have been thinking about how to implement nested transactions / savepoints. As you may remember, Vadim wants to add UNDO to WAL and thus enable this feature. Some objected because of the added WAL complexity and the problem of long-running transactions requiring lots of WAL segments. I have not been able to come up with any solution that doesn't have some UNDO capability to mark aborted tuples of the current transaction.

My idea is that we not put UNDO information into WAL but keep a List of rel ids / tuple ids in the memory of each backend and do the undo inside the backend. We could go around and clear our transaction id from tuples that need to be undone. Basically, I am suggesting a per-backend UNDO segment. This seems to enable nested transactions without the disadvantages of putting it in WAL.

Am I missing something about why UNDO should be in WAL? I realize UNDO in WAL would allow UNDO of any transaction, but we don't need that in our current non-overwriting system. It is only nested transactions we need to undo, and I don't think we need WAL writing for that because we are always undoing something before we commit the main transaction. In crash recovery, the entire transaction is aborted anyway.

--
Bruce Momjian | http://candle.pha.pa.us
pgman@candle.pha.pa.us | (610) 853-3000
+ If your life is a hard drive, | 830 Blythe Avenue
+ Christ can be your backup. | Drexel Hill, Pennsylvania 19026
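A minimal sketch of the per-backend undo list proposed above, written as a toy Python model rather than backend code. Every name here (HeapTuple, Backend, INVALID_XID, the in-memory heap dict standing in for table storage) is hypothetical, and the sketch assumes only inserts need undoing; deletes and updates would need the same treatment for the deleting transaction id.

```python
from dataclasses import dataclass

INVALID_XID = 0  # hypothetical marker: no transaction owns this tuple


@dataclass
class HeapTuple:
    xmin: int  # id of the transaction that inserted the tuple
    data: str


class Backend:
    """Toy model of one backend keeping its own in-memory undo list."""

    def __init__(self, xid):
        self.xid = xid
        self.heap = {}          # (relid, tid) -> HeapTuple, stands in for table storage
        self.undo = []          # per-backend list of (relid, tid) touched while nested
        self.in_nested = False

    def begin_nested(self):
        self.in_nested = True

    def insert(self, relid, tid, data):
        self.heap[(relid, tid)] = HeapTuple(xmin=self.xid, data=data)
        # Only work done inside the nested transaction needs an undo entry.
        if self.in_nested:
            self.undo.append((relid, tid))

    def commit_nested(self):
        # Nothing to undo; the entries are no longer needed.
        self.undo.clear()
        self.in_nested = False

    def abort_nested(self):
        # Clear our transaction id from every tuple the nested transaction touched.
        for key in self.undo:
            self.heap[key].xmin = INVALID_XID
        self.undo.clear()
        self.in_nested = False
```

Nothing in this model is written to WAL: the list lives and dies with the backend, which matches the observation that a crash aborts the whole outer transaction anyway.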
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> My idea is that we not put UNDO information into WAL but keep a List of
> rel ids / tuple ids in the memory of each backend and do the undo inside
> the backend.

The complaints about WAL size amount to "we don't have the disk space to keep track of this, for long-running transactions". If it doesn't fit on disk, how likely is it that it will fit in memory?

regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > My idea is that we not put UNDO information into WAL but keep a List of
> > rel ids / tuple ids in the memory of each backend and do the undo inside
> > the backend.
>
> The complaints about WAL size amount to "we don't have the disk space
> to keep track of this, for long-running transactions". If it doesn't
> fit on disk, how likely is it that it will fit in memory?

Sure, we can put it on disk if that is better. I thought the problem with WAL undo is that you have to keep UNDO info around for everything newer than the oldest transaction still open. So, if I start a nested transaction, and then sit at a prompt for 8 hours, all WAL logs are kept for 8 hours.

We can create a WAL file for every backend, and record just the nested transaction information. In fact, once a nested transaction finishes, we don't need the info anymore. Certainly we don't need to flush these to disk.

--
Bruce Momjian
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> The complaints about WAL size amount to "we don't have the disk space
>> to keep track of this, for long-running transactions". If it doesn't
>> fit on disk, how likely is it that it will fit in memory?

> Sure, we can put it on disk if that is better.

I think you missed my point. Unless something can be done to make the log info a lot smaller than it is now, keeping it all around until transaction end is just not pleasant. Waving your hands and saying that we'll keep it in a different place doesn't affect the fundamental problem: if the transaction runs a long time, the log is too darn big.

There probably are things we can do --- for example, I bet an UNDO log kept in this way wouldn't need to include page images. But it's that sort of consideration that will make or break UNDO, not where we store the info.

regards, tom lane
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> The complaints about WAL size amount to "we don't have the disk space
> >> to keep track of this, for long-running transactions". If it doesn't
> >> fit on disk, how likely is it that it will fit in memory?
>
> > Sure, we can put it on disk if that is better.
>
> I think you missed my point. Unless something can be done to make the
> log info a lot smaller than it is now, keeping it all around until
> transaction end is just not pleasant. Waving your hands and saying
> that we'll keep it in a different place doesn't affect the fundamental
> problem: if the transaction runs a long time, the log is too darn big.

When you said long running, I thought you were concerned about long running in duration, not large transactions. A long duration in a one-WAL setup would cause all transaction logs to be kept. Large transactions are another issue.

One solution may be to store just the relid if many tuples are modified in the same table. If you stored the command counter for start/end of the nested transaction, it would be possible to sequentially scan the table and undo all the affected tuples. Does that help? Again, I am just throwing out ideas here, hoping something will catch.

--
Bruce Momjian
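The relid-plus-command-counter idea above can be sketched as follows. This is a toy Python model under stated assumptions, not backend code: Row, NestedUndo, undo_by_scan and INVALID_XID are hypothetical names, each row is assumed to carry the inserting transaction id (xmin) and the command counter value at insertion (cmin), and only inserts are handled.

```python
from dataclasses import dataclass

INVALID_XID = 0  # hypothetical marker: no transaction owns this row


@dataclass
class Row:
    xmin: int  # inserting transaction id
    cmin: int  # command counter value when the row was inserted
    data: str


@dataclass
class NestedUndo:
    """Compact undo record: touched tables plus a command counter range."""
    relids: set
    first_cid: int  # command counter at nested transaction start
    last_cid: int   # command counter at nested transaction end


def undo_by_scan(tables, xid, undo):
    """Sequentially scan each touched table and undo rows written by `xid`
    whose command counter falls inside the nested transaction's range."""
    undone = 0
    for relid in undo.relids:
        for row in tables[relid]:
            if row.xmin == xid and undo.first_cid <= row.cmin <= undo.last_cid:
                row.xmin = INVALID_XID
                undone += 1
    return undone
```

The undo record stays a fixed size per table no matter how many tuples were modified; the cost moves to abort time, which has to read the whole table.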
Tom Lane wrote:
>
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> The complaints about WAL size amount to "we don't have the disk space
> >> to keep track of this, for long-running transactions". If it doesn't
> >> fit on disk, how likely is it that it will fit in memory?
>
> > Sure, we can put it on disk if that is better.
>
> I think you missed my point. Unless something can be done to make the
> log info a lot smaller than it is now, keeping it all around until
> transaction end is just not pleasant. Waving your hands and saying
> that we'll keep it in a different place doesn't affect the fundamental
> problem: if the transaction runs a long time, the log is too darn big.

Keeping it in a different place does have other benefits - you can discard each subtransaction after it is committed/aborted regardless of what the WAL log does, so the chap who did a "begin transaction" 8 hours ago does not get subtransactions kept as well, thus postponing the problem a lot.

> There probably are things we can do --- for example, I bet an UNDO
> log kept in this way wouldn't need to include page images.

Not keeping something that does not need to be kept is always a good idea when preserving space is important.

> But it's that sort of consideration that will make or break UNDO,
> not where we store the info.

But "how long do we need to keep the info" _is_ an important consideration.

--------------
Hannu
> > Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > >> The complaints about WAL size amount to "we don't have the disk space
> > >> to keep track of this, for long-running transactions". If it doesn't
> > >> fit on disk, how likely is it that it will fit in memory?
> >
> > > Sure, we can put it on disk if that is better.
> >
> > I think you missed my point. Unless something can be done to make the
> > log info a lot smaller than it is now, keeping it all around until
> > transaction end is just not pleasant. Waving your hands and saying
> > that we'll keep it in a different place doesn't affect the fundamental
> > problem: if the transaction runs a long time, the log is too darn big.
>
> When you said long running, I thought you were concerned about long
> running in duration, not large transactions. A long duration in a one-WAL
> setup would cause all transaction logs to be kept. Large transactions
> are another issue.
>
> One solution may be to store just the relid if many tuples are modified
> in the same table. If you stored the command counter for start/end of
> the nested transaction, it would be possible to sequentially scan the
> table and undo all the affected tuples. Does that help? Again, I am
> just throwing out ideas here, hoping something will catch.

Actually, we need to keep around nested transaction UNDO information only until the nested transaction exits to the main transaction:

    BEGIN WORK;
        BEGIN WORK;
        COMMIT;
        -- we can throw away the UNDO here
        BEGIN WORK;
            BEGIN WORK;
            ...
            COMMIT;
        COMMIT;
        -- we can throw away the UNDO here
    COMMIT;

We are using the outside transaction for our ACID capabilities, and just using UNDO for nested transaction capability.

--
Bruce Momjian
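The discard rule in the example above can be sketched as a stack of undo lists, one per open nested transaction. This is a toy Python model with hypothetical names (UndoStack, record, entries_held). It rests on one assumption the thread does not spell out: when a nested transaction commits into another nested transaction, its entries are handed to the parent, because the parent may still abort.

```python
class UndoStack:
    """Toy model: one undo list per open nested transaction."""

    def __init__(self):
        self.levels = []  # levels[-1] belongs to the innermost nested transaction

    def begin(self):
        self.levels.append([])

    def record(self, relid, tid):
        # Work done directly in the main transaction needs no undo entry.
        if self.levels:
            self.levels[-1].append((relid, tid))

    def commit(self):
        done = self.levels.pop()
        if self.levels:
            # Still inside a nested transaction: the parent may abort later,
            # so it inherits the child's entries (assumption, see lead-in).
            self.levels[-1].extend(done)
        # Otherwise we are back in the main transaction and the entries
        # are simply dropped: "we can throw away the UNDO here".

    def abort(self):
        # Return the entries the caller must undo for this level.
        return self.levels.pop()

    def entries_held(self):
        return sum(len(level) for level in self.levels)
```

Replaying the example: after the first inner COMMIT the stack is empty again, and after the doubly nested pair commits back to the main transaction it is empty once more, however long the outer transaction stays open.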
Added to TODO.detail/transactions as a nested transaction idea.

--
Bruce Momjian