Thread: Applying TOAST to CURRENT
Hi, now that we have the branch for 7.0, I could apply my actual work on TOAST to the CURRENT development tree. Before doing so, I'd like to discuss some related details. 1. In the actual version, the lztext datatype is stripped down to something more similar to text (does not compress on input). So it is kinda toastable base type for testing purposes created at initdb time. The pg_rules catalog still uses it, just that the toaster is now responsible to do the compression work. No problems so far with that. In the long run I think lztext will disappear completely again (it was supposed to be). Does anybody see aproblem with abuse of this type during development? 2. I've added another ALTER TABLE command to create the external storage table for a relation. The syntaxis ALTER TABLE tablename CREATE TOAST TABLE; Up to that, toastable types (lztext only yet) will be compressed, but the INSERT still fails if compression isn't enough to make a tuple fit. We haven't decided yet how/when to create the secondary relation and it's index. Since we intend to make base types like text and varchar by default toastable, I don't think that "if a tables schema contains toastable types" is a good enough reason to create them silently. There might exists tons of tables in a schema, that don't require it. OTOH I don't think it's a good thing to try creating these things on the fly the first time needed. The required catalog changes and file creations introduce all kinds of possible rollback/crash problems,that we don't want to have here - do we? 3. Tom, we don't have a consensus how to merge the TOAST related function changes with the fmgr changes upto now. Which base type specific functions will be touched due to fmgr changes right now? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
> Hi, > > now that we have the branch for 7.0, I could apply my actual > work on TOAST to the CURRENT development tree. Before doing > so, I'd like to discuss some related details. > > 1. In the actual version, the lztext datatype is stripped > down to something more similar to text (does not compress > on input). So it is kinda toastable base type for testing > purposes created at initdb time. > > The pg_rules catalog still uses it, just that the toaster > is now responsible to do the compression work. No > problems so far with that. > > In the long run I think lztext will disappear completely > again (it was supposed to be). Does anybody see a problem > with abuse of this type during development? Sounds fine. > 2. I've added another ALTER TABLE command to create the > external storage table for a relation. The syntax is > > ALTER TABLE tablename CREATE TOAST TABLE; > > Up to that, toastable types (lztext only yet) will be > compressed, but the INSERT still fails if compression > isn't enough to make a tuple fit. > > We haven't decided yet how/when to create the secondary > relation and it's index. Since we intend to make base > types like text and varchar by default toastable, I don't > think that "if a tables schema contains toastable types" > is a good enough reason to create them silently. There > might exists tons of tables in a schema, that don't > require it. > > OTOH I don't think it's a good thing to try creating > these things on the fly the first time needed. The > required catalog changes and file creations introduce all > kinds of possible rollback/crash problems, that we don't > want to have here - do we? Well, we could print the message suggesing ALTER TABLE when printing tuple too large. Frankly, I don't see a problem in creating the backup table automatically. If you are worried about performance, how about putting it in a subdirectory. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
Bruce Momjian wrote: > > OTOH I don't think it's a good thing to try creating > > these things on the fly the first time needed. The > > required catalog changes and file creations introduce all > > kinds of possible rollback/crash problems, that we don't > > want to have here - do we? > > Well, we could print the message suggesing ALTER TABLE when printing > tuple too large. Frankly, I don't see a problem in creating the backup > table automatically. If you are worried about performance, how about > putting it in a subdirectory. It's the toast-table and the index. So it's 2 Inodes and 16K per table. If the backend is compiled with -g, someoneneeds to create about 500 tables to waste the same amount of space. Well, I like the subdirectory idea. I only wonder how that should be implemented (actually the tablename is the filename - and that doesn't allow / in it). Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
JanWieck@t-online.de (Jan Wieck) writes: > 3. Tom, we don't have a consensus how to merge the TOAST > related function changes with the fmgr changes up to now. > Which base type specific functions will be touched due to > fmgr changes right now? For functions that need their inputs de-toasted, I think that the changes you need should be a free byproduct of the fmgr changes. I'd recommend we make those changes first, and then in a cleanup pass you can modify anything that is able to work on still-toasted input. I can't really do much with updating any varlena datatypes until there's a version of heap_tuple_untoast_attr() somewhere in the system --- if you look at src/include/fmgr.h, you'll see the call is already there: /* use this if you want the raw, possibly-toasted input datum: */ #define PG_GETARG_RAW_VARLENA_P(n) ((struct varlena *) PG_GETARG_POINTER(n)) /* use this if you want the input datum de-toasted: */ #define PG_GETARG_VARLENA_P(n) \(VARATT_IS_EXTENDED(PG_GETARG_RAW_VARLENA_P(n)) ? \ (struct varlena *) heap_tuple_untoast_attr((varattrib*) PG_GETARG_RAW_VARLENA_P(n)) : \ PG_GETARG_RAW_VARLENA_P(n)) /* GETARG macros for varlena types will typically look like this: */ #define PG_GETARG_TEXT_P(n) ((text *) PG_GETARG_VARLENA_P(n)) BTW, it would save some casting if heap_tuple_untoast_attr were declared to accept and return "struct varlena *" ... Anyway, as soon as that code links to something that works, let me know and I'll make a pass over the "text" functions. That should give you something to test with. regards, tom lane
> Bruce Momjian wrote: > > > OTOH I don't think it's a good thing to try creating > > > these things on the fly the first time needed. The > > > required catalog changes and file creations introduce all > > > kinds of possible rollback/crash problems, that we don't > > > want to have here - do we? > > > > Well, we could print the message suggesing ALTER TABLE when printing > > tuple too large. Frankly, I don't see a problem in creating the backup > > table automatically. If you are worried about performance, how about > > putting it in a subdirectory. > > It's the toast-table and the index. So it's 2 Inodes and 16K > per table. If the backend is compiled with -g, someone needs > to create about 500 tables to waste the same amount of space. > > Well, I like the subdirectory idea. I only wonder how that > should be implemented (actually the tablename is the filename > - and that doesn't allow / in it). Not sure. It will take some tricks, I am sure. How about if we add some TOAST option to CREATE TABLE, so they can create with TOAST support rather than having to use ALTER every time. Maybe that would work. -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
JanWieck@t-online.de (Jan Wieck) writes: >>>> OTOH I don't think it's a good thing to try creating >>>> these things on the fly the first time needed. The >>>> required catalog changes and file creations introduce all >>>> kinds of possible rollback/crash problems, that we don't >>>> want to have here - do we? AFAIK we are pretty solid on rolling back table creation, it's just rename/drop that have problems. A worse problem is what if two backends both decide they need to create the toast table at the same time. That might be fixable with appropriate locking but it seems like there'd be potential for deadlocks. > Bruce Momjian wrote: >> Well, we could print the message suggesing ALTER TABLE when printing >> tuple too large. Frankly, I don't see a problem in creating the backup >> table automatically. If you are worried about performance, how about >> putting it in a subdirectory. I agree with Bruce --- the toast table should be created automatically, at least if the table contains any potentially-toastable columns. We want this to be as transparent as possible. I'd rather have auto create on-the-fly when first needed, but if that seems too risky then let's just make the table when its owning table is created. If you want to control it with an ALTER TABLE function, let's add ALTER TABLE DROP TOAST so that admins who don't like the excess space usage can get rid of it. (Of course that should only succeed after verifying the toast table is empty...) regards, tom lane
On Tue, 30 May 2000, Tom Lane wrote: > JanWieck@t-online.de (Jan Wieck) writes: > >>>> OTOH I don't think it's a good thing to try creating > >>>> these things on the fly the first time needed. The > >>>> required catalog changes and file creations introduce all > >>>> kinds of possible rollback/crash problems, that we don't > >>>> want to have here - do we? > > AFAIK we are pretty solid on rolling back table creation, it's just > rename/drop that have problems. A worse problem is what if two > backends both decide they need to create the toast table at the same > time. That might be fixable with appropriate locking but it seems > like there'd be potential for deadlocks. > > > Bruce Momjian wrote: > >> Well, we could print the message suggesing ALTER TABLE when printing > >> tuple too large. Frankly, I don't see a problem in creating the backup > >> table automatically. If you are worried about performance, how about > >> putting it in a subdirectory. > > I agree with Bruce --- the toast table should be created automatically, > at least if the table contains any potentially-toastable columns. We > want this to be as transparent as possible. I'd rather have auto create > on-the-fly when first needed, but if that seems too risky then let's > just make the table when its owning table is created. have to third this one ... I think it should be totally transparent to the admin/user ... just create it when the table is created, what's the worst case scenario? it never gets used and you waste 16k of disk space?
> I agree with Bruce --- the toast table should be created automatically, > at least if the table contains any potentially-toastable columns. We > want this to be as transparent as possible. I'd rather have auto create > on-the-fly when first needed, but if that seems too risky then let's > just make the table when its owning table is created. > > If you want to control it with an ALTER TABLE function, let's add ALTER > TABLE DROP TOAST so that admins who don't like the excess space usage > can get rid of it. (Of course that should only succeed after verifying > the toast table is empty...) But when you vacuum a table, doesn't it get zero size? Sure works here: #$ cd /u/pg/data/base/test#$ ls -l kkk*-rw------- 1 postgres postgres 0 May 30 12:20 kkk#$ -- Bruce Momjian | http://www.op.net/~candle pgman@candle.pha.pa.us | (610) 853-3000+ If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania19026
The Hermit Hacker wrote: > > On Tue, 30 May 2000, Tom Lane wrote: > > > > I agree with Bruce --- the toast table should be created automatically, > > at least if the table contains any potentially-toastable columns. We > > want this to be as transparent as possible. I'd rather have auto create > > on-the-fly when first needed, but if that seems too risky then let's > > just make the table when its owning table is created. > > have to third this one ... I think it should be totally transparent to the > admin/user ... just create it when the table is created, what's the worst > case scenario? it never gets used and you waste 16k of disk space? You dont even use 16k if toast tables are like ordinary tables (which I guess they are). New empty tables seem to occupy 0k. So I'm also for immediate creation of tost tables for all tables that require them, either at create (if there are any toastable columns in the create clause) or at alter table time if first toestable column is added after initial create. The only drawback is bloating directories, but it was already suggested that TOAST tables could/should be kept in subdirectory toast (as should indexes too, imho). And the most widespread database in the world does it too ;) (dBASE and its derivates) -------- Hannu
On Tue, 30 May 2000, Hannu Krosing wrote: > The Hermit Hacker wrote: > > > > On Tue, 30 May 2000, Tom Lane wrote: > > > > > > I agree with Bruce --- the toast table should be created automatically, > > > at least if the table contains any potentially-toastable columns. We > > > want this to be as transparent as possible. I'd rather have auto create > > > on-the-fly when first needed, but if that seems too risky then let's > > > just make the table when its owning table is created. > > > > have to third this one ... I think it should be totally transparent to the > > admin/user ... just create it when the table is created, what's the worst > > case scenario? it never gets used and you waste 16k of disk space? > > You dont even use 16k if toast tables are like ordinary tables (which I > guess they are). New empty tables seem to occupy 0k. > > So I'm also for immediate creation of tost tables for all tables that > require them, either at create (if there are any toastable columns in > the create clause) or at alter table time if first toestable column is > added after initial create. > > The only drawback is bloating directories, but it was already suggested > that > TOAST tables could/should be kept in subdirectory toast (as should > indexes > too, imho). still say, simplest "fix": <dbname>/{system,db,toast,index}
The Hermit Hacker wrote: > > On Tue, 30 May 2000, Hannu Krosing wrote: > > > The only drawback is bloating directories, but it was already suggested > > that > > TOAST tables could/should be kept in subdirectory toast (as should > > indexes > > too, imho). > > still say, simplest "fix": > > <dbname>/{system,db,toast,index} Why can't we just add a column named "tablepath" to pg_table, that can either be a simple filename, or relative path with a filename or even full path (if we don't worry too much for security ;) That has came up before when discussing ways to make rename table rollbackable but it could be handy here two. AFAIK it has been a general principle in programming to keep separate things separate unless a very good reason not to do so is present. ----------- Hannu
The Hermit Hacker wrote: > > have to third this one ... I think it should be totally transparent to the > admin/user ... just create it when the table is created, what's the worst > case scenario? it never gets used and you waste 16k of disk space? > Not exactly. I've made some good experiences with having the toaster trying to keep the main tuple size below 1/4 of MaxTupleSize (BLKSIZE - block header). Remember that external stored attributes are only fetched from the secondary relation if really needed (when the result set is sent to the client or if explicitly used in the query).So in a usual case, where a relatively small amount of the entire data is retrieved and key attributes are small,it's a win. With this config more main tuples fit into one block, and if the attributes used in the WHERE clause aren't stored external, the result set (including sort and group actions) can be collected with fewerblock reads. Only those big values, that the client really wanted, have to be fetched at send time. If no external table exists, the toaster will try the <2K thing by compression only. If the resulting tuple fits into the 8K limit, it's OK. But if a secondary relation exists, it'll store external to make the tuple <2K. Thus, a 4K or 6K tuple, that actually fits and would be stored in the main table, will cause the toaster to jumpin if we allways create the secondary table. Hmmm - thinking about that it doesn't sound bad if we allways create a secondary relation at CREATE TABLE time, butNOT the index for it. And at VACUUM time we create the index if it doesn't exist AND there is external stored data. The table is prepared for external storage allways and we avoid the risks from creating tables in possiblylater aborting transactions or due to concurrency issues. But we don't waste the index space for reallyallways-small-tuple tables. Another benefit would be, that reloads should be faster because with this technique, the toaster doesn'tneed to insert index tuples during the load. The indices are created later at VACUUM after reload. The toaster needs to use sequential scans on the external table until the next vacuum run, but index usage allways depends on vacuum so that's not a real issue from my PoV. At least a transparent compromise - isn't it? Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
JanWieck@t-online.de (Jan Wieck) writes: > Hmmm - thinking about that it doesn't sound bad if we allways > create a secondary relation at CREATE TABLE time, but NOT the > index for it. And at VACUUM time we create the index if it > doesn't exist AND there is external stored data. Don't much like that --- what if the user doesn't run vacuum for a good long while? Could be doing a lot of sequential scans over a pretty large toast file... If the 16K for an empty btree index really bothers you, let's attack that head-on. I don't see why a freshly created index couldn't be zero bytes, and the metadata page gets created on first store into the index. > The toaster needs to use sequential scans on the external > table until the next vacuum run, but index usage allways > depends on vacuum so that's not a real issue from my PoV. What makes you say that? Indexes will be used on a never-vacuumed table with the current planner. regards, tom lane
Jan Wieck wrote: > > The Hermit Hacker wrote: > > > > have to third this one ... I think it should be totally transparent to the > > admin/user ... just create it when the table is created, what's the worst > > case scenario? it never gets used and you waste 16k of disk space? > > > > Not exactly. > > I've made some good experiences with having the toaster > trying to keep the main tuple size below 1/4 of MaxTupleSize > (BLKSIZE - block header). Can't _that_ behaviour be made modifyable by some setting ? > Remember that external stored > attributes are only fetched from the secondary relation if > really needed (when the result set is sent to the client or > if explicitly used in the query). So in a usual case, where a > relatively small amount of the entire data is retrieved and > key attributes are small, it's a win. With this config more > main tuples fit into one block, and if the attributes used in > the WHERE clause aren't stored external, the result set > (including sort and group actions) can be collected with > fewer block reads. Only those big values, that the client > really wanted, have to be fetched at send time. What is the priority of checks on indexed fetch? I mean if we do "SELECT * FROM ttable WHERE toasted LIKE 'ab%' " DO we first scan by index to 'ab%', then check if tuple is live and after that to the LIKE comparison ? Would it not be faster in toast case to use the already retrieved index data and check that first, before going to main table (not to mention the TOAST table) > If no external table exists, the toaster will try the <2K > thing by compression only. If the resulting tuple fits into > the 8K limit, it's OK. Would it not be faster/cleaner to check some configuration variable than the existance of toest table ? > But if a secondary relation exists, > it'll store external to make the tuple <2K. Thus, a 4K or 6K > tuple, that actually fits and would be stored in the main > table, will cause the toaster to jump in if we allways create > the secondary table. Do our current (btree/hash) indexes support toast ? If not, will they ? > > Hmmm - thinking about that it doesn't sound bad if we allways > create a secondary relation at CREATE TABLE time, but NOT the > index for it. And at VACUUM time we create the index if it > doesn't exist AND there is external stored data. Is there a plan to migrate to some combined index/database table for at least toast tables later ? For at least toast tables it seems feasible to start using the originally planned tuple-spanning mechanisms, unless we plan migrating LOs to toast table at some point which would make index-less tuple chaining a bad idea as it would make seeking on really large LOs slow. > The table is prepared for external storage allways and we > avoid the risks from creating tables in possibly later > aborting transactions or due to concurrency issues. But we > don't waste the index space for really allways-small-tuple > tables. That could perhaps be done for other tables too, ie CREATE INDEX would not actually create index until VACUUM notices that table is big enough to make use of that index ? On second thought that seems not a good idea to me ;( > > Another benefit would be, that reloads should be faster > because with this technique, the toaster doesn't need to > insert index tuples during the load. The indices are created > later at VACUUM after reload. AFAIK reloads (from pg_dump at least) create indexes after LOAD'ing data > The toaster needs to use sequential scans on the external > table until the next vacuum run, but index usage allways > depends on vacuum so that's not a real issue from my PoV. > > At least a transparent compromise - isn't it? But do we need it ? I suspect there are other issues that need your attention more than complicating table creation to save a few kb ;) Creating toast tables still wastes only 1MB per 64 tables _that have toastable columns_, which seems real cheap considering today's HD prices. You would need 6400 toast tables to consume 1% of the smallest currently available (10GB) disk. If that is a concern this can probably be cured by good docs that say in detail which datatypes cause toast tables an which don't. ----------- Hannu
Hannu Krosing wrote: > > I've made some good experiences with having the toaster > > trying to keep the main tuple size below 1/4 of MaxTupleSize > > (BLKSIZE - block header). > > Can't _that_ behaviour be made modifyable by some setting ? Good point. There is already a fine tuning option per table attribute, where someone can tell things like "forget about compression for this attribute" or "try keeping in main tuple and toast others first". Theres no utility command up to now to customize them, but an UPDATE pg_attribute does it already. Seems another value in pg_class, telling the toaster what max size to try, would be a good idea. > What is the priority of checks on indexed fetch? > > I mean if we do "SELECT * FROM ttable WHERE toasted LIKE 'ab%' " > > DO we first scan by index to 'ab%', then check if tuple is live and > after that to the LIKE comparison ? That's the current behaviour, and TOAST doesn't change it. There was discussion already about index tuple toasting. Indices have different size constraints and other featuresso they cannot share exactly the same toasting scheme as heap tuples. I'm still not sure if supporting indices on huge values is worth the efford. Many databases have some limit on the size of index entries, and noone seems to really care for that. > > If no external table exists, the toaster will try the <2K > > thing by compression only. If the resulting tuple fits into > > the 8K limit, it's OK. > > Would it not be faster/cleaner to check some configuration variable > than the existance of toest table ? The toast tables and indexes OID are stored in pg_class. An open Relation has reference to the pg_class row, so it's simply comparing that to INVALID_OID. No wasted time here. > Do our current (btree/hash) indexes support toast ? Not hard tested yet. At least, they don't support it if toasting would be required to make the index tuple fit, but the heap toaster is already happy with it. The tuple is modified in place at heap_insert(). So the later index_insert() will use the Datums found there to build the index tuples, either plain or toast reference, whatever the toaster left. > > If not, will they ? Not planned for 7.1. Maybe we can workout a solution for unlimited index entries after that. > > Hmmm - thinking about that it doesn't sound bad if we allways > > create a secondary relation at CREATE TABLE time, but NOT the > > index for it. And at VACUUM time we create the index if it > > doesn't exist AND there is external stored data. > > Is there a plan to migrate to some combined index/database table for > at least toast tables later ? No. But we plan a general overwriting storage manager, so that might not be an issue at all. > For at least toast tables it seems feasible to start using the > originally planned tuple-spanning mechanisms, unless we plan > migrating LOs to toast table at some point which would make index-less > tuple chaining a bad idea as it would make seeking on really large > LOs slow. I've never seen a complete proposal for tuple-spanning. The toaster breaks up the large Datum into chunks. Thereis a chunk number, so modifying the index to be a multi-attribute one would gain direct access to a chunk. That should make seeks reasonably fast. > > Another benefit would be, that reloads should be faster > > because with this technique, the toaster doesn't need to > > insert index tuples during the load. The indices are created > > later at VACUUM after reload. > > AFAIK reloads (from pg_dump at least) create indexes after LOAD'ing data Finally the toast table will have another relkind, so it'll not be accessible by normal SQL. The toaster acts onthese tables more hardwired like on system catalogs. It expects a fixed schema and uses direct heap access. Due to the different relkind, a dump wouldn't be able to delay the index creation. > But do we need it ? > > [...] > > You would need 6400 toast tables to consume 1% of the smallest currently > available (10GB) disk. > > If that is a concern this can probably be cured by good docs that say > in detail which datatypes cause toast tables an which don't. We plan to make ALL variable size builtin types toastable. So this list would name them all :-). But this 6400 = 1% really is the point. Let's forget about the 16K and create the toast table allways (as soonas the main table has toastable attributes). Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #