Thread: Another idea for dealing with cmin/cmax
In addition to/instead of abstracting cmin/cmax to a phantom ID, what about allowing for two versions of the tuple header, one with cid info and one without? That would allow the cid info to be stripped out when pages are written to disk.

The downside is that we'd have to be able to deal with in-memory pages potentially being larger than on-disk pages. Since there's been discussion of separating the on-disk and in-memory page formats, maybe that doesn't kill the proposal outright.

--
Jim Nasby                                      jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
Jim C. Nasby wrote:
> In addition to/instead of abstracting cmin/cmax to a phantom ID, what
> about allowing for two versions of the tuple header, one with cid info
> and one without? That would allow for cid info to be stripped out when
> pages were written to disk.

How exactly would that help? You can't just strip out cid info when writing to disk if you don't want to lose the information. And it's certainly a lot more complicated than the phantom id thing.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Thu, Sep 28, 2006 at 05:13:11PM +0100, Heikki Linnakangas wrote:
> Jim C. Nasby wrote:
> > In addition to/instead of abstracting cmin/cmax to a phantom ID, what
> > about allowing for two versions of the tuple header, one with cid info
> > and one without? That would allow for cid info to be stripped out when
> > pages were written to disk.
>
> How exactly would that help? You can't just strip out cid info when
> writing to disk, if you don't want to lose the information.

Erm, sorry, brainfart... yeah, we'd need to be able to write the info out to disk.

The reason I thought of this is because once the transaction commits, we have no use for the cid info. So we could do something like have bgwriter look for tuples that belong to committed transactions before it writes a page, and strip the cid out of them.

The problem with *that* is that (AFAIK) you'll need cid info again once you go to update or delete that tuple. And that info might obviously need to spill to disk before the transaction commits.

Back to the drawing board...

--
Jim Nasby                                      jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
"Jim C. Nasby" <jim@nasby.net> wrote:
> The reason I thought of this is because once the transaction commits, we
> have no use for the cid info. So we could do something like have
> bgwriter look for tuples that belong to committed transactions before it
> writes a page, and strip the cid out of them.

Your concept is just like the experimental method I suggested before in
http://archives.postgresql.org/pgsql-hackers/2005-08/msg01193.php
We can remove cmin and cmax from committed tuples, and xmin from frozen tuples, and we might save some bytes in tuple headers.

However, I think our next goal in shrinking the headers is 16 bytes. The headers become 23 bytes using phantom cids, and we are limited by alignment, so we get no further advantage unless we remove an extra 7 bytes from the headers... and that seems to be very difficult.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro wrote:
> However, I think our next goal to shrink the headers is 16 bytes. The
> headers become 23 bytes using phantom cids and we are limited by
> alignments, so we will have no more advantages unless we delete extra
> 7 bytes in the headers. ...and it seems to be very difficult.

Yeah, I thought about that too earlier.

If we get rid of VACUUM FULL, or replace it with something that doesn't need xvac, and keep cmin and cmax in backend-private storage, we could get rid of the overlaid t_field4, which is 4 bytes. Then we're down to 19 bytes.

We could get rid of t_hoff, because we should always be able to calculate the header size. Then we're down to 18 bytes.

There are currently 15 bits in use in the infomask. After we remove the HEAP_MOVED_* bits that we don't need without VACUUM FULL, that's down to 13 bits. t_natts only needs 11 bits, because MaxHeapAttributeNumber is 1600. We could move 5 of the infomask bits to the high 5 bits of t_natts, and save one byte. We're now down to 17 bytes.

That's as far as I got. So it seems we could shave off some bytes, but we still can't get down to 16, and the changes needed would be quite massive in total.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, Sep 29, 2006 at 09:35:31AM +0100, Heikki Linnakangas wrote:
> We could get rid of t_hoff, because we should always be able to
> calculate the header size. Then we're down to 18 bytes.

Without t_hoff, how do you know the size of the null bitmap? You could probably do it only if you assume the null bitmap, if present, always covers all the columns...

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote:
> On Fri, Sep 29, 2006 at 09:35:31AM +0100, Heikki Linnakangas wrote:
> > We could get rid of t_hoff, because we should always be able to
> > calculate the header size. Then we're down to 18 bytes.
>
> Without t_hoff, how do you know the size of the null bitmap? You could
> probably do it only if you assume the null bitmap, if present, always
> covers all the columns...

I think we assume that already. heap_form_tuple reserves space for the bitmap like this:

    if (hasnull)
        len += BITMAPLEN(numberOfAttributes);

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, Sep 29, 2006 at 10:59:13AM +0100, Heikki Linnakangas wrote:
> Martijn van Oosterhout wrote:
> > On Fri, Sep 29, 2006 at 09:35:31AM +0100, Heikki Linnakangas wrote:
> > > We could get rid of t_hoff, because we should always be able to
> > > calculate the header size. Then we're down to 18 bytes.
> >
> > Without t_hoff, how do you know the size of the null bitmap? You could
> > probably do it only if you assume the null bitmap, if present, always
> > covers all the columns...
>
> I think we assume that already. heap_form_tuple reserves space for the
> bitmap like this:
>
>     if (hasnull)
>         len += BITMAPLEN(numberOfAttributes);

Ok, now we do an ALTER TABLE blah ADD COLUMN ..., and we have to expand the bitmaps for the entire table?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote:
> On Fri, Sep 29, 2006 at 10:59:13AM +0100, Heikki Linnakangas wrote:
> > I think we assume that already. heap_form_tuple reserves space for the
> > bitmap like this:
> >
> >     if (hasnull)
> >         len += BITMAPLEN(numberOfAttributes);
>
> Ok, now we do an ALTER TABLE blah ADD COLUMN ..., and we have to expand
> the bitmaps for the entire table?

No, you'd still have the number of attributes (t_natts) in the header.

--
Heikki Linnakangas
EnterpriseDB   http://www.enterprisedb.com
On Fri, Sep 29, 2006 at 01:15:06PM +0900, ITAGAKI Takahiro wrote:
> "Jim C. Nasby" <jim@nasby.net> wrote:
> > The reason I thought of this is because once the transaction commits, we
> > have no use for the cid info. So we could do something like have
> > bgwriter look for tuples that belong to committed transactions before it
> > writes a page, and strip the cid out of them.
>
> Your concept is just like the experimental method I suggested before
> in http://archives.postgresql.org/pgsql-hackers/2005-08/msg01193.php
> We can remove cmin and cmax from committed tuples and xmin from frozen
> tuples and we might save some bytes in tuple headers.
>
> However, I think our next goal to shrink the headers is 16 bytes. The
> headers become 23 bytes using phantom cids and we are limited by
> alignments, so we will have no more advantages unless we delete extra
> 7 bytes in the headers. ...and it seems to be very difficult.

Dumb question... wouldn't getting down to 20 bytes buy us something?

--
Jim Nasby                                      jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
"Jim C. Nasby" <jim@nasby.net> writes:
> Dumb question... wouldn't getting down to 20 bytes buy us something?

Only on 32-bit machines, which are getting less interesting as database servers every day. (Just last night I was reading somebody opining that the transition to 64-bit hardware would be effectively complete by 2008 ... and he was talking about desktop PCs, not serious iron.)

BTW, the apparently useless byte after the 27- or 23-byte header actually has some good use: in a table of up to 8 columns, you can fit a null bitmap there "for free". In a scheme that took us down to 20 rather than 19 bytes, even a narrow table would pay the full maxalign price for having a null.

I'm in favor of combining cmin/cmax/xvac to get us down to 23 bytes, but I think anything beyond that is going to face a serious problem of greatly increased cost for diminishing returns.

			regards, tom lane
Tom Lane <tgl@sss.pgh.pa.us> wrote:
> BTW, the apparently useless byte after the 27- or 23-byte header
> actually has some good use: in a table of up to 8 columns, you can
> fit a null bitmap there "for free". In a scheme that took us down
> to 20 rather than 19 bytes, even a narrow table would pay the full
> maxalign price for having a null.

We might use "free" bytes for other purposes. For example, if we increase the size of XIDs from 4 to 6 bytes, we can get rid of the transaction wraparound problem. There may be other uses.

> I'm in favor of combining cmin/cmax/xvac to get us down to 23 bytes,
> but I think anything beyond that is going to face a serious problem
> of greatly increased cost for diminishing returns.

If we really want to reduce the size of the headers to 16 bytes, we may need to do something with HeapTupleHeaderData.t_ctid.

Under the assumption that tuples are often updated in the same page, we only need offsets within the page to link an old tuple to the new one, so we could reduce the size of t_ctid from 6 to 2 bytes. When tuples are updated to other pages, we would probably need a "phantom ctid".

Under another assumption, that tuples are often updated after being frozen, we could overlap ctid and xmin into one physical field using a union. But "phantom xids" would be needed to update unfrozen tuples.

However, I think both assumptions are less probable than the one we assumed when we introduced phantom cids, so the above ideas probably do not work well.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
On Mon, Oct 02, 2006 at 10:52:48AM +0900, ITAGAKI Takahiro wrote:
> If we really want to reduce the size of headers to 16 bytes,
> we maybe need to do something with HeapTupleHeaderData.t_ctid.
>
> Under the assumption that tuples are often updated in the same page,
> we only need offsets in the page to link an old tuple to the new one.
> We can reduce the size of t_ctid from 6 to 2 bytes. When tuples
> are updated to other pages, we probably need "phantom ctid".
>
> In another assumption that tuples are often updated after frozen,
> we can overlap ctid and xmin to a physical field using union.
> But "phantom xids" are needed to update unfrozen tuples.
>
> However, I think both assumptions have less probability than
> the one assumed when we introduce phantom cids.
> The above ideas probably do not work well.

Well, for data warehousing, phantom XIDs of some sort would probably be extremely valuable. Instead of a number of 4-byte XIDs in each tuple, place a limit on the number of transactions that can be live in a table at once.

Granted, this means needing to vacuum-freeze more often, but since that can be done as a low-load, low-priority background process, it doesn't seem too horrible. If the same technique were applied to cmin/cmax, you could shrink all the visibility info to 1 byte if you wanted to.

--
Jim Nasby                                      jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
"Jim C. Nasby" <jim@nasby.net> writes:
> ... place a limit on the number of transactions that can be live in a table
> at once.

Urk, well maybe, but ...

> you could shrink all the visibility info to 1 byte if you
> wanted to.

... 256 of 'em is surely not an acceptable limit.

			regards, tom lane
On Mon, 2006-10-02 at 01:30, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > ... place a limit on the number of transactions that can be live in a table
> > at once.
>
> Urk, well maybe, but ...
>
> > you could shrink all the visibility info to 1 byte if you
> > wanted to.
>
> ... 256 of 'em is surely not an acceptable limit.

I have been thinking about this, and it seems that, especially for OLAP loads, it would be much better to keep tuple visibility info in a separate file; let's call it a Tuple Visibility Map (TVM).

A TVM would have the following benefits:

1) The TVM could be used for index-only lookups as well as heap-only lookups; other index lookups could also be filtered against it quickly before going to the heap.

2) The TVM could be heavily compressed; especially for bulk loads, something like a single (xmin, xmax, cmin, cmax) tuple plus an RLE-encoded list of pointers to it will do.

3) If TVM space is needed in any page, it would be easy to just throw away cmin/cmax from tuples of committed/aborted transactions.

4) The first pass of VACUUM would be much faster, as it would only have to scan the TVM. Pages with no expired tuples would not need to be touched.

If we can come up with a good design for the TVM, it may also be an overall win for many kinds of OLTP queries, as it may result in fewer writes to disk and almost the same amount of writing to WAL.

Maybe a bitmap or btree index would be something to use as a starting point when designing the TVM. Another idea to consider would be to merge the FSM and TVM and then use them also for keeping data in cluster order.

--
----------------
Hannu Krosing
Database Architect
Skype Technologies OÜ
Akadeemia tee 21 F, Tallinn, 12618, Estonia

Skype me: callto:hkrosing
Get Skype for free: http://www.skype.com
Tom Lane <tgl@sss.pgh.pa.us> writes:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > ... place a limit on the number of transactions that can be live in a table
> > at once.
>
> Urk, well maybe, but ...
>
> > you could shrink all the visibility info to 1 byte if you
> > wanted to.
>
> ... 256 of 'em is surely not an acceptable limit.

The plan Gavin Sherry and I were discussing at the Code Sprint was to store a single "most common xmin" in the per-page special area, then have a bit on each tuple indicating that the xmin isn't present in the tuple and the xmin from the page footer should be used instead. Any tuples with other values of xmin would have to store that xmin normally.

The use case here is primarily tables loaded in large batch jobs, where large swaths of the table, possibly the entire table, are loaded in the same transaction.

My thinking was that the "most common xmin" could be very approximate. In fact my plan was to just store the first xmin the page sees. This catches the common use cases of pg_restore or COPY loading many records, and even catches most cases of large inserts into existing tables whenever they extend the table.

A further refinement could be to have vacuum look for the actual most common xmin and rewrite the page using it. If there's enough free space, it may be worthwhile storing InvalidTransactionId and allowing the next insert to set the "most common xmin". However, I'm not convinced of the importance of this refinement.

The thing to remember is that shaving bits off tuples is not a goal in itself. It's a means to an end, namely packing more tuples on a page. If we shave 4 bytes off every tuple when we're loading thousands of tuples, then we get to put more of them on a page. If we save space on tuples when we're running vacuum, that just gives us more free space in the free space map, and we only see a benefit down the line if we eventually fill up that space.

--
greg
Greg Stark <gsstark@mit.edu> writes:
> The plan Gavin Sherry and I were discussing at the Code Sprint was to store a
> single "most common xmin" in the per-page special area. Then have a bit
> on each tuple indicating that the xmin isn't present in the tuple and instead
> to use the xmin from the page footer. Any tuples with other values of xmin
> would have to store that xmin normally.

This seems pretty unworkable --- anything that involves more than one field layout for tuple headers is going to be a huge PITA.

			regards, tom lane
Greg Stark <gsstark@mit.edu> wrote:
> The plan Gavin Sherry and I were discussing at the Code Sprint was to store a
> single "most common xmin" in the per-page special area. Then have a bit
> on each tuple indicating that the xmin isn't present in the tuple and instead
> to use the xmin from the page footer. Any tuples with other values of xmin
> would have to store that xmin normally.

Is this a similar approach to the Interested Transaction List (ITL) in Oracle?
http://www.dbazine.com/oracle/or-articles/nanda3

An ITL-like approach is more efficient than per-tuple XIDs unless all tuples in a page are locked at the same time. However, MAXTRANS and PCTFREE issues may bother us.

Regards,
---
ITAGAKI Takahiro
NTT Open Source Software Center
ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> writes:
> ITL-like approach is more efficient than per-tuple XIDs
> unless all tuples in a page are locked at the same time.
> However, MAXTRANS and PCTFREE issues may bother us.

I'm not sure how Oracle gets away with MAXTRANS. Somehow it never seems to arise as a practical problem, yet it seems like there must be instances where it would cause serious problems.

It's worse for Postgres since, as I understand it, Oracle only needs to track transaction ids that are actively locking the record. Postgres needs to track transaction ids for the inserting and deleting transaction even when there's no lock (or no lock remaining).

I can imagine a scheme where there's a short list of transaction ids in the footer and each tuple just stores an index into that list. If you had space for 16 transaction ids set aside, then you could store xmin and xmax in a single byte, for example.

If the space set aside for these transaction ids is full when you're inserting, I suppose you could just go back to the FSM for another page. But I don't see any way out when you're deleting. You have to mark xmax one way or another, and if there's no space left in the footer and you only have 4 bits in the tuple, what are you going to do?

As an aside, doing vacuum freeze more aggressively might reduce the pressure on these ITL slots. But I don't see any way to guarantee a slot is available for xmax when deleting. We would need some sort of scheme where the space for transaction ids is able to grow, but we're already growing from both ends of the page. We would either have to interleave transaction ids with line pointers or store them on another special page somewhere.

--
Gregory Stark
EnterpriseDB   http://www.enterprisedb.com
On Oct 3, 2006, at 2:23 PM, Gregory Stark wrote:
> If the space set aside for these transaction ids is full when you're
> inserting i suppose you could just go back to the FSM for another page.
> But I don't see any way out when you're deleting. You have to mark xmax
> one way or another and if there's no space left in the footer and you
> only have 4 bits in the tuple what are you going to do?
>
> As an aside doing vacuum freeze more aggressively might reduce the
> pressure on these ITL slots.
>
> But I don't see any way to guarantee a slot is available for xmax when
> deleting. We would need some sort of scheme where the space for
> transaction ids is able to grow but we're already growing from both
> ends of the page. We would either have to interleave transaction ids
> with line pointers or store them on another special page somewhere.

Well, worst-case you could just re-do the whole page if you need to expand the list of transaction slots; I don't think that's a huge deal. What did have me baffled was how to deal with xmax, since (as you mentioned) you can end up in a situation where you can't delete a tuple because there's no more room on the page for another xmax.

But I just thought of a way around that which might be better than a separate store for transaction info: allow a tuple to be moved off the current page by leaving behind a link to its new location, similar to how ctid works. We probably wouldn't want to try to cram that into the item list, but I think we should be able to create a special version of a tuple header (AddressForwardingHeader) that simply states "the tuple has moved to this new ctid; go there". Of course, any time you have to follow that link you pay a penalty, but this should only be needed when trying to delete a tuple on a page that's basically full.

Theoretically, there shouldn't be too many people trying to hit that deleted tuple, but to further reduce the number who do, we could include the visibility info (or a pointer to it) in the AddressForwardingHeader.

--
Jim Nasby                                      jim@nasby.net
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)
On 10/3/06, ITAGAKI Takahiro <itagaki.takahiro@oss.ntt.co.jp> wrote:
> ITL-like approach is more efficient than per-tuple XIDs
> unless all tuples in a page are locked at the same time.
> However, MAXTRANS and PCTFREE issues may bother us.

IIRC, the last time I checked Oracle's patents, they pretty much had the ITL covered, so any work in this area would need a significant amount of research.

While ITLs are generally better than our current approach, there are often tuning-related issues due to improper settings of initial ITL slots (INITRANS) and freespace. Many times a block doesn't have enough freespace to fulfill MAXTRANS, and you'll get contention issues. However, setting INITRANS too high wastes a significant amount of storage space... so you have to really know the application you plan to run on the database, and what hotspots may exist, to effectively tune these parameters on a table-by-table basis.

Again, it comes back to how much PostgreSQL expects people to become database tuning experts. Oracle lets you modify almost any parameter, which is both good and bad: in Oracle you can fix almost any bottleneck if you know enough about the system, whereas PostgreSQL's tuning options are fairly basic and oriented toward more generalized use-cases. There are obvious pros/cons to each type of system architecture which, in this case, pretty much come down to design complexity and the amount of end-user training required to make good use of the software.

If anyone ever wanted to play with Oracle and understand some of the issues related to their design considerations, running Quest's Spotlight on Oracle while concurrently running Benchmark Factory gives you a nice overview. I haven't played with either tool in about a year and a half now, but I think you can still get trial versions.

--
Jonah H. Harris, Software Architect | phone: 732.331.1300
EnterpriseDB Corporation            | fax: 732.331.1301
33 Wood Ave S, 2nd Floor            | jharris@enterprisedb.com
Iselin, New Jersey 08830            | http://www.enterprisedb.com/