Thread: Remove xmin and cmin from frozen tuples
Hi Hackers, I think it would be a waste to retain xmin and cmin for frozen tuples because their values represent only 'visible for all transactions'. Additionally, most tuples in database can be frozen potentially. I wrote a makeshift patch to compress xmin and cmin (8bytes) to 1-bit flag, using tuple overlaping. Is this idea worth trying? Also, it will be useful to combine it and more aggressive freeze vacuum, for example, a freezer integrated with bgwriter. (The following is test of the attached patch) * Test query 1. create table test (a int); 2. insert into test select * from generate_series(1, 100000); 3. update test set a = a where a % 100 = 0; # to force defrag 4. select * from pgstattuple('test'); 5. vacuum freeze test; 6. select * from pgstattuple('test'); * Results of pgstattuple -[ before vacuum ]-+-------- table_len | 3645440 tuple_count | 100000 tuple_len | 3200000 tuple_percent | 87.78 dead_tuple_count | 1000 dead_tuple_len | 32000 dead_tuple_percent | 0.88 free_space | 536 free_percent | 0.01 -[ 8.1beta1 orig ]-+-------- table_len | 3645440 tuple_count | 100000 tuple_len | 3200000 tuple_percent | 87.78 dead_tuple_count | 0 dead_tuple_len | 0 dead_tuple_percent | 0 free_space | 30772 <-- about 32byte * 1000 (dead tuples) free_percent | 0.84 -[ patched ]-------+-------- table_len | 3645440 tuple_count | 100000 tuple_len | 3200000 tuple_percent | 87.78 dead_tuple_count | 0 dead_tuple_len | 0 dead_tuple_percent | 0 free_space | 823628 <-- + 8byte * 100000 (whole tuples) free_percent | 22.59 --- ITAGAKI Takahiro NTT Cyber Space Laboratories
Attachment
On Thu, Sep 01, 2005 at 10:45:44AM +0900, ITAGAKI Takahiro wrote: Hi, > I think it would be a waste to retain xmin and cmin for frozen tuples > because their values represent only 'visible for all transactions'. > Additionally, most tuples in database can be frozen potentially. I think this is an interesting idea. I was thinking that when the tuple needs to be obsoleted it would need to grow to accomodate the Xmax, but you are not actually proposing to remove that, so it seems sensible. In fact, it is perfectly reasonable to remove Xmin and Cmin, because after the tuple is frozen, the Xmin never changes again. Now, one thing of note is that you need to "compress" the page in order to actually be able to use the just-freed space. VACUUM could do that, but maybe it would be better to do it on-line -- the freezing process is going to have to write the page regardless. I wonder if with your patch the page is compressed on the same VACUUM execution that freezes the tuple? One thing that comes to mind is that this makes somewhat easier to build a tool to write pre-built tables, for bulk-loading purposes. You just construct the binary file with the HEAP_FROZEN bit set, and then attach the file to a dummy table. (Then again, you can do it today, using a Xmin of FrozenTransactionId. I wonder why the Bizgres people isn't advocating a tool to do that. It is very hard to do with user-defined types, but for BI/DW you mostly don't need those, do you?) -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "Cuando no hay humildad las personas se degradan" (A. Christie)
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes: > I think it would be a waste to retain xmin and cmin for frozen tuples > because their values represent only 'visible for all transactions'. True, but the hard part is getting rid of the storage for them. > I wrote a makeshift patch to compress xmin and cmin (8bytes) to > 1-bit flag, using tuple overlaping. > Is this idea worth trying? I think this is incredibly ugly :-(. It eliminates a fairly basic assumption which is that items on a page don't overlap. The space savings cannot be worth the loss in testability and reliability. To take just one problem, it is no longer possible to check an item offset for validity against pd_upper. If we're going to do this, we need a more invasive patch that changes the structure of heaptuple headers in a more fundamental way, and avoids breaking the page layout representation. (Something like the way Oids are now handled might work, although there are alignment issues to worry about, and it'd take more work on VACUUM's part to convert a tuple to frozen state.) I'm also less than enthused about using up our last infomask bit for a relatively unimportant purpose. We might need that for something bigger someday... though I can't presently guess what. regards, tom lane
Alvaro, > One thing that comes to mind is that this makes somewhat easier to build > a tool to write pre-built tables, for bulk-loading purposes. You just > construct the binary file with the HEAP_FROZEN bit set, and then attach > the file to a dummy table. (Then again, you can do it today, using a > Xmin of FrozenTransactionId. I wonder why the Bizgres people isn't > advocating a tool to do that. It is very hard to do with user-defined > types, but for BI/DW you mostly don't need those, do you?) Hmmm ... can you expand on this a little? We'd discussed "frozen partitions" but hadn't thought to get around to them for a while, expecting the kind of issues which Tom just raised. -- Josh Berkus Aglio Database Solutions San Francisco
Alvaro Herrera <alvherre@alvh.no-ip.org> wrote: > Now, one thing of note is that you need to "compress" the page in order > to actually be able to use the just-freed space. VACUUM could do that, > but maybe it would be better to do it on-line -- the freezing process is > going to have to write the page regardless. I agree. I think an good position of freezer is on bgwriter. My idea is: 1. Just before bgwriter writes an dirty page in LRU order, 2. Freeze tuples in the page and repair fragmentation.3. (Replace the fsm page that has least freespace.) 4. Flush the page. > I wonder if with your patch > the page is compressed on the same VACUUM execution that freezes the tuple? Yes, defragmentation is performed after freezing, but the page has at least one dead tuple. In current VACUUM implementation, pages that have no dead tuples will not be defraged. So you cannot "compress" just after bulk-load. --- ITAGAKI Takahiro NTT Cyber Space Laboratories
Tom Lane <tgl@sss.pgh.pa.us> wrote: > I think this is incredibly ugly :-(. Yes, I think so, too :-( My patch is product of the thought that I don't want to modify codes widely. So if we want to do it more cool way, lots of changes are needed as you said. > I'm also less than enthused about using up our last infomask bit for > a relatively unimportant purpose. We might need that for something > bigger someday... though I can't presently guess what. I think it is not a problem, because the header still has rooms for several bits. I assume that the combination of HEAP_XMIN_COMMITTED + HEAP_XMIN_INVALID has currently no meaning, right? If so, HEAP_FROZEN can be assigned here. Also, t_natts is currently 16-bits, but it can be cut to 11-bits because MaxTupleAttributeNumber is 1664 < 2^11. --- ITAGAKI Takahiro NTT Cyber Space Laboratories
ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes: > I agree. I think an good position of freezer is on bgwriter. > My idea is: > 1. Just before bgwriter writes an dirty page in LRU order, > 2. Freeze tuples in the page and repair fragmentation. > 3. (Replace the fsm page that has least freespace.) > 4. Flush the page. This is a bad idea. The bgwriter isn't the place to be doing freezing, because there is no reasonable way for it to guarantee that all old tuples in a table (or any larger unit) have been frozen. So you'd still need VACUUM to ensure no wraparound. Plus, you can't do such changes without emitting an XLOG record, which is something we don't want happening in the bgwriter's inner loop. Even more to the point, you can't do such changes without getting a superexclusive lock on the page (not only locked, but no one else has it pinned), which is a real nonstarter for the bgwriter, both for performance and possible deadlock issues. regards, tom lane
On Wed, Aug 31, 2005 at 09:14:42PM -0700, Josh Berkus wrote: > > One thing that comes to mind is that this makes somewhat easier to build > > a tool to write pre-built tables, for bulk-loading purposes. You just > > construct the binary file with the HEAP_FROZEN bit set, and then attach > > the file to a dummy table. (Then again, you can do it today, using a > > Xmin of FrozenTransactionId. I wonder why the Bizgres people isn't > > advocating a tool to do that. It is very hard to do with user-defined > > types, but for BI/DW you mostly don't need those, do you?) > > Hmmm ... can you expand on this a little? We'd discussed "frozen partitions" > but hadn't thought to get around to them for a while, expecting the kind of > issues which Tom just raised. What issues did he raise on this? What I'm saying is that you can write a heap file, on which the tuples would all have xmin=FrozenTransactionId, xmax=Invalid, and the corresponding bits set in the infomask. This ensures that no matter the state of the server, you can plug the file in and all tuples will be valid. The "only" problem is figuring out how to lay the data in the tuples themselves, w.r.t endianness and such. This is platform-dependent, so you have to write code to do it correctly. In absence of user-defined types, this should not be _too_ hard to do. Of course, such a program would in general also be Postgres-version-dependent. Note that this is a very different business from skipping the Xmin and Cmin from the tuple header -- in fact, there's no relation to that at all. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com FOO MANE PADME HUM
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > What I'm saying is that you can write a heap file, on which the tuples > would all have xmin=FrozenTransactionId, xmax=Invalid, and the > corresponding bits set in the infomask. This ensures that no matter the > state of the server, you can plug the file in and all tuples will be > valid. > The "only" problem is figuring out how to lay the data in the tuples > themselves, w.r.t endianness and such. This is platform-dependent, so > you have to write code to do it correctly. In absence of user-defined > types, this should not be _too_ hard to do. Of course, such a program > would in general also be Postgres-version-dependent. Of course, it's fair to ask whether such a program would be any faster than binary-mode COPY by the time you got done ... or enough faster to justify your effort, anyway. THe only fundamental disadvantage that COPY labors under is having to write WAL records. It might be interesting to do something similar to the recent hacks for CREATE TABLE AS, so that a COPY into a table just created in the current transaction would skip writing WAL and instead fsync the table at the end. regards, tom lane
On Wed, 2005-08-31 at 22:25 -0400, Alvaro Herrera wrote: > On Thu, Sep 01, 2005 at 10:45:44AM +0900, ITAGAKI Takahiro wrote: > > Hi, > > > I think it would be a waste to retain xmin and cmin for frozen tuples > > because their values represent only 'visible for all transactions'. > > Additionally, most tuples in database can be frozen potentially. > > I think this is an interesting idea. Agreed, especially since it would avoid the need to vacuum altogether. > I was thinking that when the tuple > needs to be obsoleted it would need to grow to accomodate the Xmax, but > you are not actually proposing to remove that, so it seems sensible. In > fact, it is perfectly reasonable to remove Xmin and Cmin, because after > the tuple is frozen, the Xmin never changes again. It's a good idea, but the Xmin is set to FrozenTransactionId, which is how we know it is frozen, so how can we remove Xmin? The way to do this is surely by using a row version id that is different for this format. Getting 8 or 16 bytes per row back would be a very useful gain. > Now, one thing of note is that you need to "compress" the page in order > to actually be able to use the just-freed space. VACUUM could do that, > but maybe it would be better to do it on-line -- the freezing process is > going to have to write the page regardless. I wonder if with your patch > the page is compressed on the same VACUUM execution that freezes the > tuple? Only if you do a FULL, which is currently incompatible with a FREEZE. There's no point in compressing a block if you can't also redistribute rows between blocks to fill up the spaces, so another reason why it has to be a FULL. Unless you do this at load time, which is why I guess you mention.... > One thing that comes to mind is that this makes somewhat easier to build > a tool to write pre-built tables, for bulk-loading purposes. You just > construct the binary file with the HEAP_FROZEN bit set, and then attach > the file to a dummy table. (Then again, you can do it today, using a > Xmin of FrozenTransactionId. I wonder why the Bizgres people isn't > advocating a tool to do that. It is very hard to do with user-defined > types, but for BI/DW you mostly don't need those, do you?) Loading a table using COPY with frozen bits set was suggested in May, so yeh... it was suggested. At that time it was rejected, since earlier transactions would then be able to see rows they ought not be able to see. Thinking some more about this, this is only the inverse situation of a TRUNCATE. With truncate we remove tuples that ought to still be visible to pre-existing transactions. So there shouldn't really be an issue with loading pre-frozen tuples - as long as you accept the consequences for row visibility. Externally writing blocks is possible, but it bypasses a lot of other features. My current preference would be to have bulk_heap_insert() function to add a whole page at a time rather than inserting rows one at at a time. The main objective for a load is to make it disk bound; once we've achieved that by some further tuning, writing an external file would cost around the same as writing it internally from the DBMS. Oracle (direct path loader) and Teradata (Fastload) load data in complete blocks using a reduced code pathway, so I guess I was just following on, but I'm genuinely open to further persuasion if there is a better way. Having a table marked as INSERT ONLY would allow us to save 8 bytes/row, loading it pre-frozen (in some way) would save another 8 bytes/row and allow us to permanently avoid VACUUMing the table. That would be even better when we have per-table XID wrap avoidance. Best Regards, Simon Riggs
On Thu, Sep 01, 2005 at 11:08:36AM -0400, Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > What I'm saying is that you can write a heap file, on which the tuples > > would all have xmin=FrozenTransactionId, xmax=Invalid, and the > > corresponding bits set in the infomask. This ensures that no matter the > > state of the server, you can plug the file in and all tuples will be > > valid. > > > The "only" problem is figuring out how to lay the data in the tuples > > themselves, w.r.t endianness and such. This is platform-dependent, so > > you have to write code to do it correctly. In absence of user-defined > > types, this should not be _too_ hard to do. Of course, such a program > > would in general also be Postgres-version-dependent. > > Of course, it's fair to ask whether such a program would be any faster > than binary-mode COPY by the time you got done ... or enough faster to > justify your effort, anyway. It may not be faster generating the data in the first place, but you don't have to vacuum the table, nor you are subject to hint bits changing, resulting in more unnecessary I/O. This can't be avoided with COPY, because there's always the chance that it will fail partway through, so you can't write frozen tuples. With an external program, you can just dump the invalid line somewhere else and continue with the rest. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "Just treat us the way you want to be treated + some extra allowancefor ignorance." (MichaelBrusser)
Alvaro, > What issues did he raise on this? On having no Xmin. > What I'm saying is that you can write a heap file, on which the tuples > would all have xmin=FrozenTransactionId, xmax=Invalid, and the > corresponding bits set in the infomask. This ensures that no matter the > state of the server, you can plug the file in and all tuples will be > valid. > > The "only" problem is figuring out how to lay the data in the tuples > themselves, w.r.t endianness and such. This is platform-dependent, so > you have to write code to do it correctly. In absence of user-defined > types, this should not be _too_ hard to do. Of course, such a program > would in general also be Postgres-version-dependent. So, bulk loading by file generation? So the idea is that you would generate a properly formatted PostgreSQL table file, and then in one transaction create the table and attach it? Seems like this would have the additional limitation of being useful only for loading new partitions/new tables. However, it would have some significant advantages for bulk loading ... chiefly that the data page generation and associated computations could be done *off* the database server. This might help considerably in getting around the 100mb/s data computation ceiling we're hitting ... -- Josh Berkus Aglio Database Solutions San Francisco
Tom, > THe only fundamental disadvantage that COPY labors under is having to > write WAL records. It might be interesting to do something similar to > the recent hacks for CREATE TABLE AS, so that a COPY into a table just > created in the current transaction would skip writing WAL and instead > fsync the table at the end. Yes, I thought we discussed doing this for empty tables -- it would be, per our tests, a +10% to +30% boost to COPY. But there was some problem the patch? -- Josh Berkus Aglio Database Solutions San Francisco
On Thu, Sep 01, 2005 at 09:20:48AM -0700, Josh Berkus wrote: > > What I'm saying is that you can write a heap file, on which the tuples > > would all have xmin=FrozenTransactionId, xmax=Invalid, and the > > corresponding bits set in the infomask. This ensures that no matter the > > state of the server, you can plug the file in and all tuples will be > > valid. > > So, bulk loading by file generation? So the idea is that you would generate > a properly formatted PostgreSQL table file, and then in one transaction > create the table and attach it? Exactly. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "Changing the world ... one keyboard at a time!" (www.DVzine.org)
On Thu, Sep 01, 2005 at 04:34:09PM +0100, Simon Riggs wrote: > On Wed, 2005-08-31 at 22:25 -0400, Alvaro Herrera wrote: > > I was thinking that when the tuple > > needs to be obsoleted it would need to grow to accomodate the Xmax, but > > you are not actually proposing to remove that, so it seems sensible. In > > fact, it is perfectly reasonable to remove Xmin and Cmin, because after > > the tuple is frozen, the Xmin never changes again. > > It's a good idea, but the Xmin is set to FrozenTransactionId, which is > how we know it is frozen, so how can we remove Xmin? The way to do this > is surely by using a row version id that is different for this format. Per Takahiro's patch, you don't need to set the Xmin to FrozenTransactionId -- what you do instead is set a bit in the infomask. > > Now, one thing of note is that you need to "compress" the page in order > > to actually be able to use the just-freed space. VACUUM could do that, > > but maybe it would be better to do it on-line -- the freezing process is > > going to have to write the page regardless. I wonder if with your patch > > the page is compressed on the same VACUUM execution that freezes the > > tuple? > > Only if you do a FULL, which is currently incompatible with a FREEZE. Well, if we are going to mess with what FREEZE is doing, we can as well make it compress the page. Note that to compress the page you don't need to touch the indexes. I don't remember the exact reason why FULL is incompatible with FREEZE, but AFAIR it's not fundamentally unsolvable (just very hard.) > There's no point in compressing a block if you can't also redistribute > rows between blocks to fill up the spaces, so another reason why it has > to be a FULL. That's a good point. > > One thing that comes to mind is that this makes somewhat easier to build > > a tool to write pre-built tables, for bulk-loading purposes. You just > > construct the binary file with the HEAP_FROZEN bit set, and then attach > > the file to a dummy table. > > Loading a table using COPY with frozen bits set was suggested in May, so > yeh... it was suggested. I'm not proposing to use COPY for that. It has loads of problems, which is why the patch was rejected. Using an external program is a different matter. > Externally writing blocks is possible, but it bypasses a lot of other > features. Like what? I don't really care for this feature, mind you -- I was merely mentioning the idea as it crossed my mind. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "The problem with the facetime model is not just that it's demoralizing, but that the people pretending to work interrupt the ones actually working." (Paul Graham)
Josh Berkus <josh@agliodbs.com> writes: >> THe only fundamental disadvantage that COPY labors under is having to >> write WAL records. It might be interesting to do something similar to >> the recent hacks for CREATE TABLE AS, so that a COPY into a table just >> created in the current transaction would skip writing WAL and instead >> fsync the table at the end. > Yes, I thought we discussed doing this for empty tables -- it would be, per > our tests, a +10% to +30% boost to COPY. > But there was some problem the patch? I have seen no such patch AFAIR. regards, tom lane
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > On Thu, Sep 01, 2005 at 04:34:09PM +0100, Simon Riggs wrote: >> On Wed, 2005-08-31 at 22:25 -0400, Alvaro Herrera wrote: >>> Now, one thing of note is that you need to "compress" the page in order >>> to actually be able to use the just-freed space. VACUUM could do that, >>> but maybe it would be better to do it on-line -- the freezing process is >>> going to have to write the page regardless. I wonder if with your patch >>> the page is compressed on the same VACUUM execution that freezes the >>> tuple? >> >> Only if you do a FULL, which is currently incompatible with a FREEZE. > Well, if we are going to mess with what FREEZE is doing, we can as well > make it compress the page. Anyone looked at the code lately??? PageRepairFragmentation is part of any kind of vacuum. As long as you don't reassign tuple IDs (which it doesn't) there's no impact on indexes. regards, tom lane
Additional background daemon (was: Remove xmin and cmin from frozen tuples)
From
"Jim C. Nasby"
Date:
On Thu, Sep 01, 2005 at 09:22:35AM -0400, Tom Lane wrote: > ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes: > > I agree. I think an good position of freezer is on bgwriter. > > My idea is: > > 1. Just before bgwriter writes an dirty page in LRU order, > > 2. Freeze tuples in the page and repair fragmentation. > > 3. (Replace the fsm page that has least freespace.) > > 4. Flush the page. > > This is a bad idea. The bgwriter isn't the place to be doing freezing, > because there is no reasonable way for it to guarantee that all old > tuples in a table (or any larger unit) have been frozen. So you'd still > need VACUUM to ensure no wraparound. Plus, you can't do such changes > without emitting an XLOG record, which is something we don't want > happening in the bgwriter's inner loop. Even more to the point, you > can't do such changes without getting a superexclusive lock on the page > (not only locked, but no one else has it pinned), which is a real > nonstarter for the bgwriter, both for performance and possible deadlock > issues. So is this something that another daemon could handle? Presumably one that would operate on pages before they were written out by bgwriter. Basically, right now any time someone thinks of something that could be done in the background, bgwriter is the automatic candidate because it's the only daemon in the backend. And it's often rejected for valid technical reasons, but that doesn't mean we can't have additional daemons that operate either before or after bgwriter. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > On Thu, Sep 01, 2005 at 09:22:35AM -0400, Tom Lane wrote: >> This is a bad idea. The bgwriter isn't the place to be doing freezing, > So is this something that another daemon could handle? Possibly, but I'd be inclined to think of it as autovacuum's problem. regards, tom lane
Re: Additional background daemon (was: Remove xmin and cmin from frozen tuples)
From
"Jim C. Nasby"
Date:
On Thu, Sep 01, 2005 at 11:22:07PM -0400, Tom Lane wrote: > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > On Thu, Sep 01, 2005 at 09:22:35AM -0400, Tom Lane wrote: > >> This is a bad idea. The bgwriter isn't the place to be doing freezing, > > > So is this something that another daemon could handle? > > Possibly, but I'd be inclined to think of it as autovacuum's problem. Possibly, although what tends to make bgwriter interesting for these things is that we want to perform some operation on pages between when they get modified and when they get written out to disk. AFAIK autovacuum wouldn't currently serve that purpose (though I could be wrong). In any case, the big point is that there are ideas out there that might warrant an additional daemon besides bgwriter (my recollection is that this isn't the first proposal that's been objected to on the basis of bgwriter being the wrong place to do something). -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com 512-569-9461
> [skip] > happening in the bgwriter's inner loop. Even more to the point, you > can't do such changes without getting a superexclusive lock on the > page > (not only locked, but no one else has it pinned), which is a real > nonstarter for the bgwriter, both for performance and possible > deadlock > issues. Hi Tom, I do not want to discuss in deep the place to do this job, it is really over my head. But, you said you need a "super-exclusive lock", and I wonder if a wait-free algorithm would be good for pg in a general manner. I have given some references about it already on the list. http://archives.postgresql.org/pgsql-hackers/2005-02/msg00263.php Cordialement, Jean-Gérard Pailloncy
Tom Lane wrote: > Of course, it's fair to ask whether such a program would be any faster > than binary-mode COPY by the time you got done ... or enough faster to > justify your effort, anyway. > > THe only fundamental disadvantage that COPY labors under is having to > write WAL records. It might be interesting to do something similar to > the recent hacks for CREATE TABLE AS, so that a COPY into a table just > created in the current transaction would skip writing WAL and instead > fsync the table at the end. Added to TODO: o Allow COPY into an empty table to skip WAL logging -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Tom Lane wrote: > ITAGAKI Takahiro <itagaki.takahiro@lab.ntt.co.jp> writes: > > I think it would be a waste to retain xmin and cmin for frozen tuples > > because their values represent only 'visible for all transactions'. > > True, but the hard part is getting rid of the storage for them. > > > I wrote a makeshift patch to compress xmin and cmin (8bytes) to > > 1-bit flag, using tuple overlaping. > > Is this idea worth trying? > > I think this is incredibly ugly :-(. It eliminates a fairly basic > assumption which is that items on a page don't overlap. The space > savings cannot be worth the loss in testability and reliability. > To take just one problem, it is no longer possible to check an item > offset for validity against pd_upper. If we're going to do this, > we need a more invasive patch that changes the structure of heaptuple > headers in a more fundamental way, and avoids breaking the page layout > representation. (Something like the way Oids are now handled might > work, although there are alignment issues to worry about, and it'd > take more work on VACUUM's part to convert a tuple to frozen state.) > > I'm also less than enthused about using up our last infomask bit for > a relatively unimportant purpose. We might need that for something > bigger someday... though I can't presently guess what. Considering the cost/benefits, rather than doing some optimization for long-lived tuples, I would like to see us merge the existing xmin/xmax/cmin/cmax values back into three storage fields like we had in 7.4 and had to expand to a full four in 8.0 to support subtransactions. The benefit is that every row would be reduced in size by 4 bytes or 14% for all rows: * Merge xmin/xmax/cmin/cmax back into three header fields Before subtransactions, there used to be only three fields neededto store these four values. This was possible because only the current transaction looks at the cmin/cmax values.If the current transaction created and expired the row the fields stored where xmin (same as xmax), cmin, cmax,and if the transaction was expiring a row from a another transaction, the fields stored were xmin (cmin was not needed),xmax, and cmax. Such a system worked because a transaction could only see rows from another completed transaction. However, subtransactions can see rows from outer transactions, and once the subtransaction completes, the outer transactioncontinues, requiring the storage of all four fields. With subtransactions, an outer transaction can create arow, a subtransaction expire it, and when the subtransaction completes, the outer transaction still has to have propervisibility of the row's cmin, for example, for cursors. One possible solution is to create a phantom cid which representsa cmin/cmax pair and is stored in local memory. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Tom Lane wrote: >> THe only fundamental disadvantage that COPY labors under is having to >> write WAL records. It might be interesting to do something similar to >> the recent hacks for CREATE TABLE AS, so that a COPY into a table just >> created in the current transaction would skip writing WAL and instead >> fsync the table at the end. > Added to TODO: > o Allow COPY into an empty table to skip WAL logging It has to be a *new* table, not an *empty* table. If it's already visible to other xacts then somebody else could insert into it in parallel with you, because COPY doesn't take an exclusive lock. Contrariwise, it doesn't really matter (I think) if there are WAL-logged records already in the table and COPY is adding more that aren't logged. (You might have to force COPY to start adding the rows on freshly added pages ... hmm ... all of a sudden I think we had this discussion already? I for sure remember the fresh-pages trick from some other thread.) regards, tom lane
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Considering the cost/benefits, rather than doing some optimization for > long-lived tuples, I would like to see us merge the existing > xmin/xmax/cmin/cmax values back into three storage fields like we had in > 7.4 and had to expand to a full four in 8.0 to support subtransactions. There is another reason for trying to do that rather than the frozen-row optimization, which is that to get it down to two visibility-related fields, we'd have to find another representation for tuples that are Datums in memory. The current Datum representation overlays three int32 fields where the visibility fields are for a tuple on-disk. This works fine now, and would still work fine if we could revert to the 7.4 approach, but it doesn't play nicely with a scheme to remove 2 of the 4 fields. regards, tom lane
On Fri, Sep 02, 2005 at 04:02:08PM -0400, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Tom Lane wrote: > >> THe only fundamental disadvantage that COPY labors under is having to > >> write WAL records. It might be interesting to do something similar to > >> the recent hacks for CREATE TABLE AS, so that a COPY into a table just > >> created in the current transaction would skip writing WAL and instead > >> fsync the table at the end. > > > Added to TODO: > > o Allow COPY into an empty table to skip WAL logging > > It has to be a *new* table, not an *empty* table. If it's already > visible to other xacts then somebody else could insert into it in > parallel with you, because COPY doesn't take an exclusive lock. What about the indexes? Logging one of the inserters and not the other is certain to corrupt the whole thing. (Logging index insertion but not the heap itself is silly, but perhaps an easy way out is to disable the feature for tables with indexes.) > Contrariwise, it doesn't really matter (I think) if there are WAL-logged > records already in the table and COPY is adding more that aren't logged. Only if the page is locked in a fashion that the bulk loader can't insert tuples into a page that the other transaction is using. (Not sure if this can happen in reality.) Else we risk both inserting a tuple in the same page, and on recovery finding out that somebody else used the tuple slot. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "Los románticos son seres que mueren de deseos de vida"
Tom, Alvaro, > > It has to be a *new* table, not an *empty* table. If it's already > > visible to other xacts then somebody else could insert into it in > > parallel with you, because COPY doesn't take an exclusive lock. There's still major gains to be had, for ETL, in being able to disable logging on new tables/partitions. *particularly* partitions. > Contrariwise, it doesn't really matter (I think) if there are WAL-logged > records already in the table and COPY is adding more that aren't logged. > (You might have to force COPY to start adding the rows on freshly added > pages ... hmm ... all of a sudden I think we had this discussion > already? I for sure remember the fresh-pages trick from some other > thread.) Yes, and that's what shot the proposal down before. But I don't think we devoted sufficient discussion to the "new table" case. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
People: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Considering the cost/benefits, rather than doing some optimization for > > long-lived tuples, I would like to see us merge the existing > > xmin/xmax/cmin/cmax values back into three storage fields like we had > > in 7.4 and had to expand to a full four in 8.0 to support > > subtransactions. Hmmm. I personally don't see a whole lot of value in trimming 4 bytes per row off an archive table, particularly if the table would need to go through some kind of I/O intensive operation to do it. Where I do see value is in enabling index-only access for "frozen" tables. That would be a *huge* gain, especially with bitmaps. I think we've discussed this before,though. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > On Fri, Sep 02, 2005 at 04:02:08PM -0400, Tom Lane wrote: >> It has to be a *new* table, not an *empty* table. If it's already >> visible to other xacts then somebody else could insert into it in >> parallel with you, because COPY doesn't take an exclusive lock. > What about the indexes? Logging one of the inserters and not the other > is certain to corrupt the whole thing. Good point, but that fits in just fine with the restriction to just-created tables. >> Contrariwise, it doesn't really matter (I think) if there are WAL-logged >> records already in the table and COPY is adding more that aren't logged. > Only if the page is locked in a fashion that the bulk loader can't > insert tuples into a page that the other transaction is using. What other transaction? The point I was making is thatBEGIN;CREATE TABLE ...INSERT ...COPY ... is still optimizable. There isn't going to be anyone competing with the COPY while it runs. regards, tom lane
On Fri, Sep 02, 2005 at 04:27:59PM -0400, Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > >> Contrariwise, it doesn't really matter (I think) if there are WAL-logged > >> records already in the table and COPY is adding more that aren't logged. > > > Only if the page is locked in a fashion that the bulk loader can't > > insert tuples into a page that the other transaction is using. > > What other transaction? The point I was making is that > BEGIN; > CREATE TABLE ... > INSERT ... > COPY ... > is still optimizable. There isn't going to be anyone competing with > the COPY while it runs. Sure. I was thinking that you were looking for a mechanism to relax the other restriction. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com Dios hizo a Adán, pero fue Eva quien lo hizo hombre.
On Fri, Sep 02, 2005 at 01:35:42PM -0700, Josh Berkus wrote: > > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > > Considering the cost/benefits, rather than doing some optimization for > > > long-lived tuples, I would like to see us merge the existing > > > xmin/xmax/cmin/cmax values back into three storage fields like we had > > > in 7.4 and had to expand to a full four in 8.0 to support > > > subtransactions. > > Hmmm. I personally don't see a whole lot of value in trimming 4 bytes per > row off an archive table, particularly if the table would need to go > through some kind of I/O intensive operation to do it. I think you are missing something. These 4 bytes are not trimmed by an I/O-intensive operation, they are not written in the first place. Now, I agree for a very wide table those 4 bytes per tuple may not be a lot. But the optimization could be significant for not-wide (uh, sorry, I don't remember the word) tables. > Where I do see value is in enabling index-only access for "frozen" tables. > That would be a *huge* gain, especially with bitmaps. I think we've > discussed this before, though. That's a completely different discussion. Btree-organized heaps may help you there. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "Having your biases confirmed independently is how scientific progress is made, and hence made our great society what it is today" (Mary Gardiner)
On Fri, Sep 02, 2005 at 01:30:58PM -0700, Josh Berkus wrote: > > Contrariwise, it doesn't really matter (I think) if there are WAL-logged > > records already in the table and COPY is adding more that aren't logged. > > (You might have to force COPY to start adding the rows on freshly added > > pages ... hmm ... all of a sudden I think we had this discussion > > already? I for sure remember the fresh-pages trick from some other > > thread.) > > Yes, and that's what shot the proposal down before. But I don't think we > devoted sufficient discussion to the "new table" case. If we are going to have real partitioning sometime soon, I don't think the restriction is a problem. You may have to load a whole partition again, which may be faster than using logged COPY to an already-filled partition. The point is, it's not the whole table, just a partition. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "Acepta los honores y aplausos y perderás tu libertad"
Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > On Fri, Sep 02, 2005 at 04:02:08PM -0400, Tom Lane wrote: > >> It has to be a *new* table, not an *empty* table. If it's already > >> visible to other xacts then somebody else could insert into it in > >> parallel with you, because COPY doesn't take an exclusive lock. > > > What about the indexes? Logging one of the inserters and not the other > > is certain to corrupt the whole thing. > > Good point, but that fits in just fine with the restriction to > just-created tables. Seem the newly created table could have an index, but we would skip logging on that too and create a zero-length file on crash restore. > >> Contrariwise, it doesn't really matter (I think) if there are WAL-logged > >> records already in the table and COPY is adding more that aren't logged. > > > Only if the page is locked in a fashion that the bulk loader can't > > insert tuples into a page that the other transaction is using. > > What other transaction? The point I was making is that > BEGIN; > CREATE TABLE ... > INSERT ... > COPY ... > is still optimizable. There isn't going to be anyone competing with > the COPY while it runs. Updated TODO: o Allow COPY on a newly-created table to skip WAL logging On crash recovery, the table involved in the COPY would have its heap and index files truncated. One issueis that no other backend should be able to add to the table at the same time, which is something thatis currently allowed. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
Bruce Momjian <pgman@candle.pha.pa.us> writes: > Updated TODO: > o Allow COPY on a newly-created table to skip WAL logging > On crash recovery, the table involved in the COPY would > have its heap and index files truncated. One issue is > that no other backend should be able to add to the table > at the same time, which is something that is currently > allowed. This is simply wrong. (1) a table created in the current transaction isn't visible to anyone else, (2) the correct rollback state is for it not to be there, rather than be there and empty. regards, tom lane
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > On Fri, Sep 02, 2005 at 01:35:42PM -0700, Josh Berkus wrote: >> Where I do see value is in enabling index-only access for "frozen" tables. >> That would be a *huge* gain, especially with bitmaps. I think we've >> discussed this before, though. > That's a completely different discussion. Btree-organized heaps may > help you there. There was some talk of using a spare bit in index entries to mark "known good" index entries (xmin committed and less than GlobalXmin, and xmax invalid) but the cost of maintaining such bits seems nontrivial. In any case I agree that's an independent issue. regards, tom lane
Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Updated TODO: > > > o Allow COPY on a newly-created table to skip WAL logging > > > On crash recovery, the table involved in the COPY would > > have its heap and index files truncated. One issue is > > that no other backend should be able to add to the table > > at the same time, which is something that is currently > > allowed. > > This is simply wrong. (1) a table created in the current transaction > isn't visible to anyone else, (2) the correct rollback state is for > it not to be there, rather than be there and empty. New text: o Allow COPY on a newly-created table to skip WAL logging On crash recovery, the table involved in the COPY would removed or have its heap and index files truncated. One issue is that no other backend should be able to add to the table at the same time, whichis something that is currently allowed. I think we can lock a zero-length table and do this optimization. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Fri, Sep 02, 2005 at 05:18:09PM -0400, Tom Lane wrote: > Bruce Momjian <pgman@candle.pha.pa.us> writes: > > Updated TODO: > > > o Allow COPY on a newly-created table to skip WAL logging > > > On crash recovery, the table involved in the COPY would > > have its heap and index files truncated. One issue is > > that no other backend should be able to add to the table > > at the same time, which is something that is currently > > allowed. > > This is simply wrong. (1) a table created in the current transaction > isn't visible to anyone else, (2) the correct rollback state is for > it not to be there, rather than be there and empty. As a related note: I remember somebody mentioned some time ago that if you create a table and then crash before ending the transaction, the tuple in pg_class is no longer valid, but the file remains. I think this will be a much worse problem if we allow a table that's being COPY'ed to remain after a crash. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com Oh, oh, las chicas galacianas, lo harán por las perlas, ¡Y las de Arrakis por el agua! Pero si buscas damas Que se consuman como llamas, ¡Prueba una hija de Caladan! (Gurney Halleck)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > I remember somebody mentioned some time ago that if you create a table > and then crash before ending the transaction, the tuple in pg_class is > no longer valid, but the file remains. Right --- it will be removed on transaction rollback, but not if the backend crashes first. There was a patch submitted earlier this year to try to clean out such files, but it got rejected (as too messy IIRC). I think we still have a TODO item about it. regards, tom lane
On Fri, 2 Sep 2005 15:51:15 -0400 (EDT), Bruce Momjian <pgman@candle.pha.pa.us> wrote: > * Merge xmin/xmax/cmin/cmax back into three header fields And don't forget xvac, please. > Before subtransactions, there used to be only three fields needed to > store these four values. ... five values. > This was possible because only the current > transaction looks at the cmin/cmax values. Which is a reason to get rid of cmin/cmax in tuple headers entirely. Once I had a patch based on 7.4 that stored cmin and cmax in backend-local memory. It passed make check and some volume tests, but I felt it was not ready to be applied without any spill-to-disk mechanism. Development stalled when I tried to eliminate xvac as well, which would have required deep cuts into VACUUM code :-( ServusManfred
Manfred Koizar wrote: > On Fri, 2 Sep 2005 15:51:15 -0400 (EDT), Bruce Momjian > <pgman@candle.pha.pa.us> wrote: > > * Merge xmin/xmax/cmin/cmax back into three header fields > > And don't forget xvac, please. > > > Before subtransactions, there used to be only three fields needed to > > store these four values. > > ... five values. > > > This was possible because only the current > > transaction looks at the cmin/cmax values. > > Which is a reason to get rid of cmin/cmax in tuple headers entirely. > Once I had a patch based on 7.4 that stored cmin and cmax in > backend-local memory. It passed make check and some volume tests, but > I felt it was not ready to be applied without any spill-to-disk > mechanism. Development stalled when I tried to eliminate xvac as > well, which would have required deep cuts into VACUUM code :-( Interesting idea, but how would you record the cmin/xmin values without requiring unlimited memory? -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Fri, 2 Sep 2005 20:41:48 -0400 (EDT), Bruce Momjian <pgman@candle.pha.pa.us> wrote: >> Once I had a patch based on 7.4 that stored cmin and cmax in >> backend-local memory. >Interesting idea, but how would you record the cmin/xmin values without >requiring unlimited memory? That's exactly the reason for not sending it to -patches. Without spilling to disk this is just not ready for real life. The problem is that -- unlike other data structures that build up during a transaction, e.g. trigger queues -- cmin/cmax lookup requires random access, so we'd need some form of tree or hash. Unfornunately I never got beyond brainstorming :-( BTW, is there anything else that'd need spilling to disk during long transactions? ServusManfred
On Sat, Sep 03, 2005 at 10:59:31AM +0200, Manfred Koizar wrote: > On Fri, 2 Sep 2005 20:41:48 -0400 (EDT), Bruce Momjian > <pgman@candle.pha.pa.us> wrote: > >> Once I had a patch based on 7.4 that stored cmin and cmax in > >> backend-local memory. > > >Interesting idea, but how would you record the cmin/xmin values without > >requiring unlimited memory? > > That's exactly the reason for not sending it to -patches. Without > spilling to disk this is just not ready for real life. The problem is > that -- unlike other data structures that build up during a > transaction, e.g. trigger queues -- cmin/cmax lookup requires random > access, so we'd need some form of tree or hash. Unfornunately I never > got beyond brainstorming :-( > > BTW, is there anything else that'd need spilling to disk during long > transactions? Yes, the queue of pending deferred triggers. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "El hombre nunca sabe de lo que es capaz hasta que lo intenta" (C. Dickens)
Manfred Koizar wrote: > On Fri, 2 Sep 2005 20:41:48 -0400 (EDT), Bruce Momjian > <pgman@candle.pha.pa.us> wrote: > >> Once I had a patch based on 7.4 that stored cmin and cmax in > >> backend-local memory. > > >Interesting idea, but how would you record the cmin/xmin values without > >requiring unlimited memory? > > That's exactly the reason for not sending it to -patches. Without > spilling to disk this is just not ready for real life. The problem is > that -- unlike other data structures that build up during a > transaction, e.g. trigger queues -- cmin/cmax lookup requires random > access, so we'd need some form of tree or hash. Unfornunately I never > got beyond brainstorming :-( What we do for shared row locks is much more compact because we create a phantom xid for combination of xids and store that in a hash. I think the problem with having cmin/cmax in local memory is that without a way to stamp a row with some fake cid (as is suggested in the TODO item now) there is really no way to _group_ rows together to store their values as a single item in local memory, meaning we would have to store ctids or something for each row, and that seems unscalable. I have added your memory-only idea to the TODO item because it highlights that cmin/cmax is never looked at outside the backend that is modifying the row. If there was a tuple header field we can use just while the row is being created or expired we could use that, but I don't see one. Could we recalculated one of the fields after it is created/expired? I don't know. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073
On Fri, Sep 02, 2005 at 03:51:15PM -0400, Bruce Momjian wrote: > One possible solution is to create a phantom cid which represents a > cmin/cmax pair and is stored in local memory. If we're going to look at doing that I think it would also be good to consider including xmin and xmax as well. This might require persisting to disk, but for transactions that touch a number of tuples it could potentially be a big win (imagine being able to shrink all 4 fields down to a single int; a 45% space reduction). -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Tue, Sep 06, 2005 at 03:58:28PM -0500, Jim C. Nasby wrote: > On Fri, Sep 02, 2005 at 03:51:15PM -0400, Bruce Momjian wrote: > > One possible solution is to create a phantom cid which represents a > > cmin/cmax pair and is stored in local memory. > > If we're going to look at doing that I think it would also be good to > consider including xmin and xmax as well. If you do that, you'll never be able to delete or update the tuple. > This might require persisting to disk, but for transactions that touch > a number of tuples it could potentially be a big win (imagine being > able to shrink all 4 fields down to a single int; a 45% space > reduction). Yeah, I've heard about compression algorithms that managed to fit megabytes of data in 8 bytes and even less. They were very cool. No one managed to write decompression algorithms however. Imagine a whole data warehouse could be stored in a single disk block!! I imagine the development of decompressors was boycotted by SAN vendors and the like. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "Si un desconocido se acerca y te regala un CD de Ubuntu ... Eso es ... Eau de Tux"
On Tue, Sep 06, 2005 at 05:37:06PM -0400, Alvaro Herrera wrote: > On Tue, Sep 06, 2005 at 03:58:28PM -0500, Jim C. Nasby wrote: > > On Fri, Sep 02, 2005 at 03:51:15PM -0400, Bruce Momjian wrote: > > > One possible solution is to create a phantom cid which represents a > > > cmin/cmax pair and is stored in local memory. > > > > If we're going to look at doing that I think it would also be good to > > consider including xmin and xmax as well. > > If you do that, you'll never be able to delete or update the tuple. My idea was to use an int to represent combinations of (c|x)(min|max), probably on a per-table basis. Essentially, it would normalize these values. I don't see how this would eliminate the ability to update or delete. Of course this would actually hurt on tables that had mostly single-row operations, so it would have to be a table-creation option. I believe it could substantially reduce the size of tables that don't see a lot of transactions. It would be good just to have a quick and dirty patch just to see if this is the case or not. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: >>> If we're going to look at doing that I think it would also be good to >>> consider including xmin and xmax as well. >> >> If you do that, you'll never be able to delete or update the tuple. > My idea was to use an int to represent combinations of (c|x)(min|max), > probably on a per-table basis. Essentially, it would normalize these > values. I don't see how this would eliminate the ability to update or > delete. How will other transactions know whether the tuple is good (yet) or not? How will you recover if the backend that does know this crashes before transaction end? How will you lock tuples for update/delete? regards, tom lane
On Tue, Sep 06, 2005 at 06:02:27PM -0400, Tom Lane wrote: > "Jim C. Nasby" <jnasby@pervasive.com> writes: > >>> If we're going to look at doing that I think it would also be good to > >>> consider including xmin and xmax as well. > >> > >> If you do that, you'll never be able to delete or update the tuple. > > > My idea was to use an int to represent combinations of (c|x)(min|max), > > probably on a per-table basis. Essentially, it would normalize these > > values. I don't see how this would eliminate the ability to update or > > delete. > > How will other transactions know whether the tuple is good (yet) or not? > How will you recover if the backend that does know this crashes before > transaction end? How will you lock tuples for update/delete? If the 4 header fields in question were just normalized out, wouldn't all the semantics continue to work the same? All I'm envisioning is replacing them in each tuple with a pointer (vis_id) to another datastore that would be roughly equivalent to: CREATE TABLE visibility ( vis_id SERIAL, xmin int, xmax int, cmin int, cmax_xmax int ) Of course you wouldn't use an actual table to do this, but hopefully this clarifies my idea. Any time the backend would update any of those fields it would now insert a new row into visibility containing the proper values and use vis_id in the tuples. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
"Jim C. Nasby" <jnasby@pervasive.com> writes: > If the 4 header fields in question were just normalized out, wouldn't > all the semantics continue to work the same? All I'm envisioning is > replacing them in each tuple with a pointer (vis_id) to another > datastore that would be roughly equivalent to: > CREATE TABLE visibility ( > vis_id SERIAL, > xmin int, > xmax int, > cmin int, > cmax_xmax int > ) > Of course you wouldn't use an actual table to do this, but hopefully > this clarifies my idea. Let's see ... replace every tuple access with TWO tuple accesses ... yes, that should improve performance nicely. And no, I'm not sure the semantics are the same, particularly w.r.t. atomicity of state changes. regards, tom lane
On Tue, Sep 06, 2005 at 07:02:20PM -0400, Tom Lane wrote: > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > If the 4 header fields in question were just normalized out, wouldn't > > all the semantics continue to work the same? All I'm envisioning is > > replacing them in each tuple with a pointer (vis_id) to another > > datastore that would be roughly equivalent to: > > > CREATE TABLE visibility ( > > vis_id SERIAL, > > xmin int, > > xmax int, > > cmin int, > > cmax_xmax int > > ) > > > Of course you wouldn't use an actual table to do this, but hopefully > > this clarifies my idea. > > Let's see ... replace every tuple access with TWO tuple accesses > ... yes, that should improve performance nicely. And no, I'm not > sure the semantics are the same, particularly w.r.t. atomicity of > state changes. Well, like I said, I'm not envisioning using a table to store that info. Since we'd be storing 4 fixed-length fields, you wouldn't need to actually store vis_id per entry, just use the offset into the page. Assuming you use one 'slot' to store the id of the first set, you could store 511 tuples per page, which doesn't sound very bad. But this is why I'm hoping a quick patch could be done so that this could be tested. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, Sep 07, 2005 at 12:31:01AM -0500, Jim C. Nasby wrote: > On Tue, Sep 06, 2005 at 07:02:20PM -0400, Tom Lane wrote: > > "Jim C. Nasby" <jnasby@pervasive.com> writes: > > > If the 4 header fields in question were just normalized out, wouldn't > > > all the semantics continue to work the same? All I'm envisioning is > > > replacing them in each tuple with a pointer (vis_id) to another > > > datastore that would be roughly equivalent to: > > > > > CREATE TABLE visibility ( > > > vis_id SERIAL, > > > xmin int, > > > xmax int, > > > cmin int, > > > cmax_xmax int > > > ) > > > > > Of course you wouldn't use an actual table to do this, but hopefully > > > this clarifies my idea. > Well, like I said, I'm not envisioning using a table to store that info. > Since we'd be storing 4 fixed-length fields, you wouldn't need to > actually store vis_id per entry, just use the offset into the page. > Assuming you use one 'slot' to store the id of the first set, you could > store 511 tuples per page, which doesn't sound very bad. I think this could be done with our "SLRU" mechanism, just like pg_clog, pg_subtrans and pg_multixact do. Whether it's really a good idea or not, it's another story. The problem is that you would be creating new ones all the time, it would become a huge source of contention, and it would use a lot of memory. Anyway you are just moving the storage somewhere else -- instead of having 4 fields in the tuple itself, you have one field which points the same 4 fields elsewhere. I don't see how is that a win; it's actually worse because you have to do two lookups. (And actually you have just enlarged the storage requirements because you need to store the "vis_id" twice.) -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "La realidad se compone de muchos sueños, todos ellos diferentes, pero en cierto aspecto, parecidos..." (Yo, hablando de sueños eróticos)
On Wed, Sep 07, 2005 at 01:05:52PM -0400, Alvaro Herrera wrote: > Anyway you are just moving the storage somewhere else -- instead of > having 4 fields in the tuple itself, you have one field which points > the same 4 fields elsewhere. I don't see how is that a win; it's > actually worse because you have to do two lookups. (And actually you > have just enlarged the storage requirements because you need to store > the "vis_id" twice.) It would only be of use if the table had few transactions in it; in other words, if it was mostly read-only. For a true read-only table there are other options people have suggested that are probably better. BTW, this becomes even more attractive if vis_id is int2; in that case you can keep the entire mapping in memory in ~1MB. -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Wed, Sep 07, 2005 at 01:20:27PM -0400, Tom Lane wrote: > Anyway the fundamental insight has been completely lost here. The > original point was that cmin and cmax are only interesting within the > originating transaction, and not to anyone else, and thus perhaps don't > need to be kept in permanent storage; while xmin/xmax are different > animals because they *are* of interest to other transactions. I'm curious to know how can you store the cmin/cmax pair completely out of the tuple. It's easy to see how to store a single identifier in each tuple that would be an index to a structure in local memory. However, to eliminate both you'd have to keep a list of all tuples you have created or obsoleted, with the cmin and cmax of each. This seems like an awful amount of memory. -- Alvaro Herrera -- Valdivia, Chile Architect, www.EnterpriseDB.com "I can't go to a restaurant and order food because I keep looking at the fonts on the menu. Five minutes later I realize that it's also talking about food" (Donald Knuth)
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > I'm curious to know how can you store the cmin/cmax pair completely out > of the tuple. It's easy to see how to store a single identifier in each > tuple that would be an index to a structure in local memory. However, > to eliminate both you'd have to keep a list of all tuples you have > created or obsoleted, with the cmin and cmax of each. This seems like > an awful amount of memory. Yeah. I think a reasonable compromise scheme is to try to get down to three fields per tuple: xmin same as nowxmax same as nowcid/xvac xvac can share storage with the command ID info as long as VACUUM FULL never tries to move a tuple whose originating or deleting transaction is still running ... which is pretty much the same restriction we had before. For the command IDs, I am imagining: if created in current transaction: use cid to store cmin if deleted in current transaction: use cid to store cmax if both created and deleted in current transaction: cid is an index into an in-memory data structure that contains cmin and cmax. "current transaction" would have to have the loose definition that includes any subxact of the current top xact, but still, I think that the case where you need both fields is relatively uncommon. The in-memory data structure would only need to contain an entry for each distinct combination of cmin and cmax used in the current xact, so I think we could assume that it would never get unreasonably large. The entries would be created "on demand" much like we do for multixact ids (I guess you'd want a hash table to map requested cmin/cmax to an existing entry ID quickly). regards, tom lane
Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > I think this could be done with our "SLRU" mechanism, just like pg_clog, > pg_subtrans and pg_multixact do. Whether it's really a good idea or > not, it's another story. The problem is that you would be creating new > ones all the time, it would become a huge source of contention, and it > would use a lot of memory. ... and you couldn't expire the data in a reasonable period of time. pg_subtrans and pg_multixact have only very short active ranges. pg_clog is longer-lived, but at only 2 bits per transaction, we can stand it. 16 bytes per tuple is a whole lot more data. Anyway the fundamental insight has been completely lost here. The original point was that cmin and cmax are only interesting within the originating transaction, and not to anyone else, and thus perhaps don't need to be kept in permanent storage; while xmin/xmax are different animals because they *are* of interest to other transactions. The storage scheme Jim proposes takes no advantage of that whatever. regards, tom lane
This is now in the TODO list: * Merge xmin/xmax/cmin/cmax back into three header fields Before subtransactions, there used to be only three fields needed to store these four values. This was possible becauseonly the current transaction looks at the cmin/cmax values. If the current transaction created and expired the rowthe fields stored where xmin (same as xmax), cmin, cmax, and if the transaction was expiring a row from a another transaction,the fields stored were xmin (cmin was not needed), xmax, and cmax. Such a system worked because a transactioncould only see rows from another completed transaction. However, subtransactions can see rows from outer transactions,and once the subtransaction completes, the outer transaction continues, requiring the storage of all four fields.With subtransactions, an outer transaction can create a row, a subtransaction expire it, and when the subtransactioncompletes, the outer transaction still has to have proper visibility of the row's cmin, for example, for cursors. One possible solution is to create a phantom cid which represents a cmin/cmax pair and is stored in local memory. Anotheridea is to store both cmin and cmax only in local memory. --------------------------------------------------------------------------- Tom Lane wrote: > Alvaro Herrera <alvherre@alvh.no-ip.org> writes: > > I'm curious to know how can you store the cmin/cmax pair completely out > > of the tuple. It's easy to see how to store a single identifier in each > > tuple that would be an index to a structure in local memory. However, > > to eliminate both you'd have to keep a list of all tuples you have > > created or obsoleted, with the cmin and cmax of each. This seems like > > an awful amount of memory. > > Yeah. I think a reasonable compromise scheme is to try to get down to > three fields per tuple: > > xmin same as now > xmax same as now > cid/xvac > > xvac can share storage with the command ID info as long as VACUUM FULL > never tries to move a tuple whose originating or deleting transaction > is still running ... which is pretty much the same restriction we had > before. > > For the command IDs, I am imagining: > > if created in current transaction: use cid to store cmin > > if deleted in current transaction: use cid to store cmax > > if both created and deleted in current transaction: cid is an index > into an in-memory data structure that contains cmin and cmax. > > "current transaction" would have to have the loose definition that > includes any subxact of the current top xact, but still, I think that > the case where you need both fields is relatively uncommon. > > The in-memory data structure would only need to contain an entry for > each distinct combination of cmin and cmax used in the current xact, > so I think we could assume that it would never get unreasonably large. > The entries would be created "on demand" much like we do for > multixact ids (I guess you'd want a hash table to map requested > cmin/cmax to an existing entry ID quickly). > > regards, tom lane > -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 359-1001+ If your life is a hard drive, | 13 Roberts Road + Christ can be your backup. | Newtown Square, Pennsylvania19073