Thread: Vertical Partitioning with TOAST
Hi all,

I wrote an experimental patch for a vertical partitioning function.

I decided to reuse the TOAST code so the function could be implemented easily. In short, the column that the user specifies is always forced out-of-line via TOAST.

We saw a performance gain of about 10% by forcing out c_data of the customer table in the DBT-2 benchmark in our environment.

The TOAST mechanism is rather heavyweight to use for vertical partitioning; because its processing overhead is large, performance may even drop depending on the environment.

There are quite a lot of things that should be considered if vertical partitioning is to be implemented properly; for instance, omitting the TOAST index and using ctid for the link.

Your comments are welcome. Thanks.

---
How To Use
---
Use the "ALTER TABLE" command.
http://www.postgresql.org/docs/8.1/static/sql-altertable.html

ALTER TABLE name ALTER COLUMN column SET STORAGE FORCEEXTERNAL;

I am not sure whether "FORCEEXTERNAL" is an appropriate word. Please tell me if there is a better one...

--
Junji Teramoto

diff -purN postgresql-8.1.0.org/src/backend/access/heap/heapam.c postgresql-8.1.0/src/backend/access/heap/heapam.c
--- postgresql-8.1.0.org/src/backend/access/heap/heapam.c	2005-10-15 11:49:08.000000000 +0900
+++ postgresql-8.1.0/src/backend/access/heap/heapam.c	2005-12-01 15:31:38.307713257 +0900
@@ -1070,6 +1070,36 @@ heap_get_latest_tid(Relation relation,
 	}							/* end of loop */
 }
 
+// Add by junji from here
+/*
+ * has_rel_forceexternal - does the relation have any FORCEEXTERNAL columns?
+ */
+bool
+has_rel_forceexternal(Relation relation)
+{
+	TupleDesc	tupleDesc;
+	Form_pg_attribute *att;
+	int			numAttrs;
+	int			i;
+
+	/*
+	 * Get the tuple descriptor and break down the tuple(s) into fields.
+	 */
+	tupleDesc = relation->rd_att;
+	att = tupleDesc->attrs;
+	numAttrs = tupleDesc->natts;
+
+	for (i = 0; i < numAttrs; i++)
+	{
+		if (att[i]->attstorage == 'f')
+			return true;
+	}
+
+	return false;
+}
+// Add by junji to here
+
+
 /*
  * heap_insert		- insert tuple into a heap
  *
@@ -1130,6 +1160,9 @@ heap_insert(Relation relation, HeapTuple
 	 * out-of-line attributes from some other relation, invoke the toaster.
 	 */
 	if (HeapTupleHasExternal(tup) ||
+// Add by junji from here
+		(has_rel_forceexternal(relation)) ||
+// Add by junji to here
 		(MAXALIGN(tup->t_len) > TOAST_TUPLE_THRESHOLD))
 		heap_tuple_toast_attrs(relation, tup, NULL);
 
@@ -1762,6 +1795,9 @@ l2:
 	 */
 	need_toast = (HeapTupleHasExternal(&oldtup) ||
 				  HeapTupleHasExternal(newtup) ||
+// Add by junji from here
+				  (has_rel_forceexternal(relation)) ||
+// Add by junji to here
 				  (MAXALIGN(newtup->t_len) > TOAST_TUPLE_THRESHOLD));
 
 	newtupsize = MAXALIGN(newtup->t_len);
diff -purN postgresql-8.1.0.org/src/backend/access/heap/tuptoaster.c postgresql-8.1.0/src/backend/access/heap/tuptoaster.c
--- postgresql-8.1.0.org/src/backend/access/heap/tuptoaster.c	2005-10-15 11:49:09.000000000 +0900
+++ postgresql-8.1.0/src/backend/access/heap/tuptoaster.c	2005-12-01 15:29:29.722579466 +0900
@@ -512,6 +512,46 @@ toast_insert_or_update(Relation rel, Hea
 		}
 	}
 
+// Add by junji from here
+	/*
+	 * We look for attributes of attstorage 'f'.
+	 */
+	if (rel->rd_rel->reltoastrelid != InvalidOid)
+	{
+		Datum		old_value;
+
+		/*------
+		 * Search for attributes with attstorage 'f' that are
+		 * still stored inline
+		 *------
+		 */
+		for (i = 0; i < numAttrs; i++)
+		{
+			if (toast_action[i] == 'p')
+				continue;
+			if (VARATT_IS_EXTERNAL(toast_values[i]))
+				continue;
+			if (att[i]->attstorage != 'f')
+				continue;
+
+			/*
+			 * Store this value externally
+			 */
+			old_value = toast_values[i];
+			toast_action[i] = 'p';
+			toast_values[i] = toast_save_datum(rel, toast_values[i]);
+			if (toast_free[i])
+				pfree(DatumGetPointer(old_value));
+
+			toast_free[i] = true;
+			toast_sizes[i] = VARATT_SIZE(toast_values[i]);
+
+			need_change = true;
+			need_free = true;
+		}
+	}
+// Add by junji to here
+
 	/* ----------
 	 * Compress and/or save external until data fits into target length
 	 *
diff -purN postgresql-8.1.0.org/src/backend/commands/tablecmds.c postgresql-8.1.0/src/backend/commands/tablecmds.c
--- postgresql-8.1.0.org/src/backend/commands/tablecmds.c	2005-10-15 11:49:15.000000000 +0900
+++ postgresql-8.1.0/src/backend/commands/tablecmds.c	2005-12-01 15:29:29.726577573 +0900
@@ -3439,6 +3439,10 @@ ATExecSetStorage(Relation rel, const cha
 		newstorage = 'x';
 	else if (pg_strcasecmp(storagemode, "main") == 0)
 		newstorage = 'm';
+// Add by junji from here
+	else if (pg_strcasecmp(storagemode, "forceexternal") == 0)
+		newstorage = 'f';
+// Add by junji to here
 	else
 	{
 		ereport(ERROR,
@@ -6045,8 +6049,14 @@ needs_toast_table(Relation rel)
 	tupdesc = rel->rd_att;
 	att = tupdesc->attrs;
 
+
+
 	for (i = 0; i < tupdesc->natts; i++)
 	{
+// Add by junji from here
+		if (att[i]->attstorage == 'f')
+			return true;
+// Add by junji to here
 		if (att[i]->attisdropped)
 			continue;
 		data_length = att_align(data_length, att[i]->attalign);
diff -purN postgresql-8.1.0.org/src/include/access/heapam.h postgresql-8.1.0/src/include/access/heapam.h
--- postgresql-8.1.0.org/src/include/access/heapam.h	2005-10-15 11:49:42.000000000 +0900
+++ postgresql-8.1.0/src/include/access/heapam.h	2005-12-01 15:29:29.726577573 +0900
@@ -155,6 +155,10 @@ extern void heap_get_latest_tid(Relation
 		ItemPointer tid);
 extern void setLastTid(const ItemPointer tid);
 
+// Add by junji from here
+extern bool has_rel_forceexternal(Relation relation);
+// Add by junji to here
+
 extern Oid	heap_insert(Relation relation, HeapTuple tup, CommandId cid,
 			bool use_wal, bool use_fsm);
 extern HTSU_Result heap_delete(Relation relation, ItemPointer tid,
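For illustration, with this patch applied the DBT-2 case benchmarked above would be set up like this; the table and column names assume the standard DBT-2 customer schema:

    -- force the wide, rarely-updated column out-of-line unconditionally
    ALTER TABLE customer ALTER COLUMN c_data SET STORAGE FORCEEXTERNAL;

    -- revert to the default strategy for a text column
    ALTER TABLE customer ALTER COLUMN c_data SET STORAGE EXTENDED;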
This seems like a useful feature to add, allowing for easy built-in vertical partitioning. Are there issues with the patch as-is? (Other than it probably should have gone to -patches...)

On Thu, Dec 01, 2005 at 05:59:08PM +0900, Junji TERAMOTO wrote:
> Hi all,
>
> I wrote an experimental patch for a vertical partitioning function.
>
> I decided to reuse the TOAST code so the function could be implemented
> easily. In short, the column that the user specifies is always forced
> out-of-line via TOAST.
>
> We saw a performance gain of about 10% by forcing out c_data of the
> customer table in the DBT-2 benchmark in our environment.
>
> The TOAST mechanism is rather heavyweight to use for vertical
> partitioning; because its processing overhead is large, performance may
> even drop depending on the environment.
>
> There are quite a lot of things that should be considered if vertical
> partitioning is to be implemented properly; for instance, omitting the
> TOAST index and using ctid for the link.
>
> Your comments are welcome. Thanks.
>
> ---
> How To Use
> ---
> Use the "ALTER TABLE" command.
> http://www.postgresql.org/docs/8.1/static/sql-altertable.html
>
> ALTER TABLE name ALTER COLUMN column SET STORAGE FORCEEXTERNAL;
>
> I am not sure whether "FORCEEXTERNAL" is an appropriate word. Please
> tell me if there is a better one...
>
> --
> Junji Teramoto
>
> [patch snipped]
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
"Jim C. Nasby" <jim@nasby.net> writes: > This seems like a useful feature to add, allowing for easy built-in > verticle partitioning. Are there issues with the patch as-is? Other than the ones mentioned by the poster? It seemed to me more like a not-too-successful experiment than something ready for application. If you take the viewpoint that this is just another TOAST storage strategy, I think it's pretty useless. A large field value is going to get toasted anyway with the regular strategy, and if your column happens to contain some values that are not large, forcing them out-of-line anyway is simply silly. (You could make a case for making the threshold size user-controllable, but I don't see the case for setting the threshold to zero, which is what this amounts to.) The poster was not actually suggesting applying it in the form of a force-external TOAST strategy; he was using this as a prototype to try to interest people in the idea of out-of-line storage mechanisms with lower overhead than TOAST. But that part is all speculation not code. Personally, I'd rather look into whether we couldn't speed up TOAST without changing any of its basic assumptions. The current implementation isn't awful, but it was built to allow the existing table and index mechanisms to be re-used for TOAST data. Now that we know for certain TOAST is a good idea, it would be reasonable to take a second look at whether we could improve the performance with another round of implementation effort. regards, tom lane
On Thu, Dec 08, 2005 at 12:59:47AM -0500, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > This seems like a useful feature to add, allowing for easy built-in
> > vertical partitioning. Are there issues with the patch as-is?
>
> Other than the ones mentioned by the poster?
>
> It seemed to me more like a not-too-successful experiment than something
> ready for application. If you take the viewpoint that this is just
> another TOAST storage strategy, I think it's pretty useless. A large
> field value is going to get toasted anyway with the regular strategy,
> and if your column happens to contain some values that are not large,
> forcing them out-of-line anyway is simply silly. (You could make a case
> for making the threshold size user-controllable, but I don't see the
> case for setting the threshold to zero, which is what this amounts to.)

Valid point. I do think there's a lot of benefit to being able to set the limit much lower than what it currently defaults to today. We have a client that has a queue-type table that is updated very frequently. One of the fields is a text field that is not updated as frequently. Keeping this table vacuumed well enough has proven to be problematic, because any delay to vacuuming quickly results in a very large amount of bloat. Moving that text field into a separate table would most likely be a win.

Presumably this would need to be settable on at least a per-table basis.

Would adding such a variable be a good beginner TODO, or is it too invasive?

> Personally, I'd rather look into whether we couldn't speed up TOAST
> without changing any of its basic assumptions. The current
> implementation isn't awful, but it was built to allow the existing table
> and index mechanisms to be re-used for TOAST data. Now that we know for
> certain TOAST is a good idea, it would be reasonable to take a second
> look at whether we could improve the performance with another round of
> implementation effort.

I've often wondered about all the overhead of storing toast data in what amounts to a regular table. Sounds like another TODO...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
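As an illustration of the manual workaround mentioned above (splitting a rarely-updated text column into a side table), here is a minimal sketch; the table and column names are hypothetical, not taken from the client's schema:

    -- hypothetical frequently-updated queue table with one wide text column
    CREATE TABLE queue (
        id      serial PRIMARY KEY,
        state   integer NOT NULL,
        payload text
    );

    -- manual vertical partitioning: move the wide column to a side table
    CREATE TABLE queue_payload (
        id      integer PRIMARY KEY REFERENCES queue (id),
        payload text
    );
    ALTER TABLE queue DROP COLUMN payload;

    -- frequent updates now rewrite only the narrow row
    UPDATE queue SET state = 2 WHERE id = 42;

    -- the wide column is joined in only when actually needed
    SELECT q.id, q.state, p.payload
      FROM queue q
      LEFT JOIN queue_payload p ON p.id = q.id
     WHERE q.id = 42;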
Hello all,

Thank you for your interest.

Jim C. Nasby wrote:
> Valid point. I do think there's a lot of benefit to being able to set
> the limit much lower than what it currently defaults to today. We have a
> client that has a queue-type table that is updated very frequently. One
> of the fields is a text field that is not updated as frequently. Keeping
> this table vacuumed well enough has proven to be problematic, because any
> delay to vacuuming quickly results in a very large amount of bloat.
> Moving that text field into a separate table would most likely be a win.

Yes, our team thinks this patch is useful because it lets a tuple be partially updated, in effect. For instance, DBT-2 frequently updates the customer table but rarely touches c_data. Because c_data (about 400 bytes, out of a tuple of roughly 500 bytes) is copied along on every update, we think it degrades performance. To us, that partial-update aspect is more important than the vertical partitioning function itself.

Of course, changing the table's DDL is the usual solution, but I think this approach might be useful when the DDL cannot be changed.

As pointed out by Tom, this is a patch to verify the idea. I want to know how interested the community is in such partial updates. Of course, I am also interested in whether people want a vertical partitioning feature in PostgreSQL. :-)

By the way, should I send the patch to -patches again?

--
Junji Teramoto / teramoto.junji (a) lab.ntt.co.jp
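To make the DBT-2 example concrete: a typical payment-style update touches only the balance counters, yet with c_data stored inline the whole row version is rewritten anyway. The column names below follow the standard TPC-C customer schema used by DBT-2; the values are arbitrary:

    -- c_data is not referenced here, but the new row version still carries
    -- the ~400-byte c_data value unless it is stored out-of-line
    UPDATE customer
       SET c_balance = c_balance - 10.00,
           c_ytd_payment = c_ytd_payment + 10.00,
           c_payment_cnt = c_payment_cnt + 1
     WHERE c_w_id = 1 AND c_d_id = 1 AND c_id = 42;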
Jim C. Nasby wrote:
> On Thu, Dec 08, 2005 at 12:59:47AM -0500, Tom Lane wrote:
> > It seemed to me more like a not-too-successful experiment than something
> > ready for application. [...] (You could make a case for making the
> > threshold size user-controllable, but I don't see the case for setting
> > the threshold to zero, which is what this amounts to.)
>
> Valid point. I do think there's a lot of benefit to being able to set
> the limit much lower than what it currently defaults to today. We have a
> client that has a queue-type table that is updated very frequently. One
> of the fields is a text field that is not updated as frequently. Keeping
> this table vacuumed well enough has proven to be problematic, because any
> delay to vacuuming quickly results in a very large amount of bloat.
> Moving that text field into a separate table would most likely be a win.
>
> Presumably this would need to be settable on at least a per-table basis.
>
> Would adding such a variable be a good beginner TODO, or is it too
> invasive?

Well, we have now:

	ALTER TABLE ALTER [ COLUMN ] column
	    SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }

What else is needed?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073
On Thu, Dec 08, 2005 at 10:03:43AM -0500, Bruce Momjian wrote:
> Jim C. Nasby wrote:
> > Valid point. I do think there's a lot of benefit to being able to set
> > the limit much lower than what it currently defaults to today. [...]
> >
> > Presumably this would need to be settable on at least a per-table basis.
> >
> > Would adding such a variable be a good beginner TODO, or is it too
> > invasive?
>
> Well, we have now:
>
> 	ALTER TABLE ALTER [ COLUMN ] column
> 	    SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
>
> What else is needed?

As Tom suggested, I think it would be best to be able to change the size at which a field gets stored externally. I think it also makes sense to have this reverse the normal order of compressing first and then storing externally only if it still doesn't fit. I foresee this typically being useful when you have fields that are between ~100 and 1000 bytes in size, and I'm doubtful that compression would do much good there. But I wouldn't rule out this being useful on fields that can also sometimes contain much larger amounts of data, so I don't think it makes sense to disable compression completely. So, I think this leaves two new options:

SET STORAGE EXTERNAL [THRESHOLD x]
	If a field is over x in size, it's stored externally.

SET STORAGE EXTENDED [THRESHOLD x]
	If a field is over x in size, it's stored externally. If it's over
	BLCKSZ/4 it will also be compressed (I think that's how things work
	now).

Actually, that's rather ugly. I think it would be better to just break external storage and compression out into their own attributes:

SET STORAGE [ ALLOW EXTERNAL [THRESHOLD x] ] [ ALLOW COMPRESSION [THRESHOLD x] ]

ALLOW EXTERNAL: if a field is larger than x bytes (BLCKSZ/4 by default) then it will be stored externally. May be specified along with ALLOW COMPRESSION.

ALLOW COMPRESSION: if a field is larger than x bytes (BLCKSZ/4 by default) then it will be compressed. May be specified along with ALLOW EXTERNAL.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
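To make the proposal concrete, this is what the suggested syntax might look like against the hypothetical queue example earlier in the thread; none of these commands exist in PostgreSQL, and the threshold values are arbitrary:

    -- proposed, not implemented: store payload out-of-line once it exceeds 256 bytes
    ALTER TABLE queue ALTER COLUMN payload
        SET STORAGE EXTERNAL THRESHOLD 256;

    -- proposed, not implemented: independent external-storage and compression thresholds
    ALTER TABLE queue ALTER COLUMN payload
        SET STORAGE ALLOW EXTERNAL THRESHOLD 256
                    ALLOW COMPRESSION THRESHOLD 2048;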
On 12/8/2005 1:42 PM, Jim C. Nasby wrote:
> On Thu, Dec 08, 2005 at 10:03:43AM -0500, Bruce Momjian wrote:
>> Well, we have now:
>>
>> 	ALTER TABLE ALTER [ COLUMN ] column
>> 	    SET STORAGE { PLAIN | EXTERNAL | EXTENDED | MAIN }
>>
>> What else is needed?
>
> As Tom suggested, I think it would be best to be able to change the size
> at which a field gets stored externally. I think it also makes sense to
> have this reverse the normal order of compressing first and then storing
> externally only if it still doesn't fit. I foresee this typically being
> useful when you have fields that are between ~100 and 1000 bytes in size,
> and I'm doubtful that compression would do much good there. But I
> wouldn't rule out this being useful on fields that can also sometimes
> contain much larger amounts of data, so I don't think it makes sense to
> disable compression completely. So, I think this leaves two new options:

It's not the size of a field that triggers toasting. It is the size of the entire tuple. As long as that is > BLCKSZ/4, the toaster will pick the currently largest inline value and do "something" with it. "something" is either compressing or (if not allowed or already done) moving external.

Jan

> [proposed SET STORAGE options snipped; quoted in full above]

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
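A small sketch of Jan's point, assuming the default 8K block size (so toasting is triggered once the whole tuple exceeds roughly BLCKSZ/4, about 2000 bytes); the table is hypothetical:

    -- no single column is large, but the row as a whole is over ~2000 bytes,
    -- so the toaster compresses and/or moves out-of-line the largest inline
    -- values until the row fits
    CREATE TABLE wide_row (a text, b text, c text);
    INSERT INTO wide_row
    VALUES (repeat('x', 800), repeat('y', 800), repeat('z', 800));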
Is there still interest in this idea for TODO?

---------------------------------------------------------------------------

Jim C. Nasby wrote:
> As Tom suggested, I think it would be best to be able to change the size
> at which a field gets stored externally. [...] So, I think this leaves
> two new options:
>
> SET STORAGE EXTERNAL [THRESHOLD x]
> 	If a field is over x in size, it's stored externally.
>
> SET STORAGE EXTENDED [THRESHOLD x]
> 	If a field is over x in size, it's stored externally. If it's over
> 	BLCKSZ/4 it will also be compressed (I think that's how things work
> 	now).
>
> Actually, that's rather ugly. I think it would be better to just break
> external storage and compression out into their own attributes:
>
> SET STORAGE [ ALLOW EXTERNAL [THRESHOLD x] ] [ ALLOW COMPRESSION [THRESHOLD x] ]
>
> ALLOW EXTERNAL: if a field is larger than x bytes (BLCKSZ/4 by default)
> then it will be stored externally. May be specified along with ALLOW
> COMPRESSION.
>
> ALLOW COMPRESSION: if a field is larger than x bytes (BLCKSZ/4 by default)
> then it will be compressed. May be specified along with ALLOW EXTERNAL.

--
  Bruce Momjian                        http://candle.pha.pa.us
  SRA OSS, Inc.                        http://www.sraoss.com

  + If your life is a hard drive, Christ can be your backup. +
If this were accepted I might actually be able to accomplish it. Maybe. :) But having a TODO wouldn't be a bad idea as well...

Would this require 2 new fields in pg_attribute, or is there a better way to store the thresholds? I'm thinking that each field would need two special values: 0 for 'no external/compression' and -1 for 'use default' (hrm, I guess this means we should add at least one GUC to control that default...).

I suspect there are folks on -general who would express interest if you want me to ask there...

On Thu, Mar 02, 2006 at 10:15:19PM -0500, Bruce Momjian wrote:
> Is there still interest in this idea for TODO?
>
> [...]
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461
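A minimal sketch of how the per-column thresholds described above might be represented; the catalog column names and the GUC name below are invented for illustration and do not exist in any PostgreSQL release:

    -- hypothetical pg_attribute columns:
    --   attexternthresh   int4  -- 0 = never store externally, -1 = use default
    --   attcompressthresh int4  -- 0 = never compress,         -1 = use default

    -- hypothetical GUC supplying the default threshold
    SET toast_threshold = 256;

    -- today only the per-column strategy character is stored
    SELECT attname, attstorage
      FROM pg_attribute
     WHERE attrelid = 'customer'::regclass AND attnum > 0;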
On Thu, 2006-03-02 at 22:15, Bruce Momjian wrote:
> Is there still interest in this idea for TODO?

Just to voice my support - yes, I think that being able to set lower thresholds for TOAST is very useful in several cases.

Also, getting rid of the TOAST index and using ctids directly would be a big bonus.

When using direct ctids we could use either ctid chains or some sort of skiplist for access to the N-th TOAST chunk.

------------
Hannu
On Sat, Mar 04, 2006 at 12:15:46PM +0200, Hannu Krosing wrote:
> Also, getting rid of the TOAST index and using ctids directly would be a
> big bonus.
>
> When using direct ctids we could use either ctid chains or some sort of
> skiplist for access to the N-th TOAST chunk.

I suppose this would mean that you couldn't use vacuum on the toast table anymore. Or you would have to teach vacuum that every time it moves a tuple it needs to update the original table (a sequential scan). What exactly are you trying to save here?

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.
Martijn van Oosterhout <kleptog@svana.org> writes:
> On Sat, Mar 04, 2006 at 12:15:46PM +0200, Hannu Krosing wrote:
>> Also getting rid of toast index and start using ctids directly would be
>> a big bonus.
>> When using direct ctids we could use either ctid chains or some sort of
>> skiplist for access to N-th TOAST chunk.

> I suppose this would mean that you couldn't use vacuum on the toast
> table anymore.

Another problem with it is that it'd destroy the current optimizations that allow partial fetches of uncompressed TOASTed fields to be fast. You couldn't fetch page N of a TOAST datum without reading all the pages before it.

I suppose the objection that toast tables wouldn't be regular tables anymore might not be fatal, but you'll certainly get some pushback if you try to take away the direct-access optimizations.

			regards, tom lane
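For context, the partial-fetch optimization Tom refers to is what makes substring access on uncompressed out-of-line values cheap, since only the TOAST chunks covering the requested slice are read. A small illustration (the table and column are hypothetical):

    -- EXTERNAL keeps the value out-of-line but uncompressed
    ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL;

    -- fetches only the chunks around the requested offset, not the whole datum
    SELECT substr(body, 1000000, 200) FROM docs WHERE id = 1;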
On Sat, 2006-03-04 at 10:31, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > I suppose this would mean that you couldn't use vacuum on the toast
> > table anymore.
>
> Another problem with it is that it'd destroy the current optimizations
> that allow partial fetches of uncompressed TOASTed fields to be fast.
> You couldn't fetch page N of a TOAST datum without reading all the pages
> before it.
>
> I suppose the objection that toast tables wouldn't be regular tables
> anymore might not be fatal, but you'll certainly get some pushback if
> you try to take away the direct-access optimizations.

That's why I was suggesting a skiplist instead of simple linked lists.

Another way would be to put a list of all the TOAST ctids for the whole toasted field in the first page(s) of the TOAST data. That way you would still have fast access to any part of the field, most likely even faster than with the current implementation, as you have to touch fewer pages. And you can have it for compressed fields too, if you store uncompressed offsets.

-----------------
Hannu