Thread: tableam vs. TOAST
In a nearby thread[1], Ashwin Agrawal complained that there is no way for a table AM to get rid of the TOAST table that the core system thinks should be created. To that I added a litany of complaints of my own, including...

- the core system decides whether or not a TOAST table is needed based on criteria that are very much heap-specific,
- the code for reading and writing values stored in a TOAST table is heap-specific, and
- the core system assumes that you want to use the same table AM for the main table and the toast table, but you might not (e.g. you might want to use the regular old heap for the latter).

Attached is a series of patches which try to improve things in this area. Except possibly for 0001, this is v13 material; see discussion on the other thread. These likely need some additional work, but I've done enough with them that I thought it would be worth publishing them at this stage, because it seems that I'm not the only one thinking about the problems that exist in this general area. Here is an overview:

0001 moves the needs_toast_table() calculation below the table AM layer. That allows a table AM to decide for itself whether it wants a TOAST table. The most obvious way in which a table AM might want to be different from what core expects is to decide that the answer is always "no," which it can do if it has some other method of storing large values or doesn't wish to support them. Another possibility is that it wants logic that is basically similar to the heap, but with a different size threshold because its tuple format is different. There are probably other possibilities. (A minimal sketch of such a callback appears below.)

0002 breaks tuptoaster.c into three separate files. It just does code movement; no functional changes. The three pieces are detoast.c, which handles detoasting of toast values and inspection of the sizes of toasted datums; heaptoast.c, which keeps all the functions that are intrinsically heap-specific; and toast_internals.c, which is intended to have a very limited audience. A nice fringe benefit of this stuff is that a lot of other files that currently have to include tuptoaster.h and thus htup_details.h no longer do.

0003 creates a new file toast_helper.c which is intended to help table AMs implement insertion and deletion of toast table rows. Most of the AM-independent logic from the functions remaining in heaptoast.c is moved to this file. This leaves ~600 of the original ~2400 lines from tuptoaster.c as heap-specific logic, but a new heap AM actually wouldn't need all of that stuff, because some of the logic here is in support of stuff like record types, which use HeapTuple internally and will continue to do so even if those record types are stored in some other kind of table.

0004 allows TOAST tables to be implemented using a table AM other than heap. In a certain sense this is the opposite of 0003. 0003 is intended to help people who are implementing a new kind of main table, whereas 0004 is intended to help people implementing a new kind of TOAST table. It teaches the code that inserts, deletes, and retrieves TOAST rows to use slots, and it makes some efficiency improvements in the hopes of offsetting any performance loss from so doing. See commit message and/or patch for full details.

I believe that with all of these changes it should be pretty straightforward for a table AM that wants to use itself to store TOAST data to do so, or to delegate that task back to, say, the regular heap. I haven't really validated that yet, but plan to do so.
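Here is the minimal sketch of the callback mentioned under 0001, for a hypothetical AM ("myam") that opts out of TOAST entirely. Only the callback shape comes from the patch; the myam_ names and the routine initializer are illustrative:

#include "postgres.h"
#include "access/tableam.h"
#include "utils/rel.h"

/*
 * A table AM that never wants a TOAST table, e.g. because it has its
 * own mechanism for storing large values out of line.
 */
static bool
myam_relation_needs_toast_table(Relation rel)
{
    return false;
}

static const TableAmRoutine myam_methods = {
    .type = T_TableAmRoutine,
    /* ... other callbacks elided ... */
    .relation_needs_toast_table = myam_relation_needs_toast_table,
};

An AM that wanted heap-like behavior with a different threshold would instead estimate the tuple's on-disk size from its own format and compare that against an AM-specific constant.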
In addition to what's in this patch set, I believe that we should probably rename some of these functions and macros, so that the heap-specific ones have heap-specific names and the generic ones don't, but I haven't gone through all of that yet. The existing patches try to choose good names for the new things they add, but they don't rename any of the existing stuff.

I also think we should consider removing TOAST_MAX_CHUNK_SIZE from the control file, both because I'm not sure anybody's really using the ability to vary that for anything and because that solution doesn't seem entirely sensible in a world of multiple AMs. However, that is a debatable change, so maybe others will disagree.

[1] http://postgr.es/m/CALfoeitE+P8UGii8=BsGQLpHch2EZWJhq4M+D-jfaj8YCa_FSw@mail.gmail.com

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Updated and rebased patches attached.

On Fri, May 17, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> 0001 moves the needs_toast_table() calculation below the table AM
> layer. That allows a table AM to decide for itself whether it wants a
> TOAST table. The most obvious way in which a table AM might want to
> be different from what core expects is to decide that the answer is
> always "no," which it can do if it has some other method of storing
> large values or doesn't wish to support them. Another possibility is
> that it wants logic that is basically similar to the heap, but with a
> different size threshold because its tuple format is different. There
> are probably other possibilities.

This was committed as 1171d7d58545f26a402f76a05936d572bf29d53b per
discussion on another thread.

> 0002 breaks tuptoaster.c into three separate files. It just does code
> movement; no functional changes. The three pieces are detoast.c,
> which handles detoasting of toast values and inspection of the sizes
> of toasted datums; heaptoast.c, which keeps all the functions that are
> intrinsically heap-specific; and toast_internals.c, which is intended
> to have a very limited audience. A nice fringe benefit of this stuff
> is that a lot of other files that currently have to include
> tuptoaster.h and thus htup_details.h no longer do.

Now 0001. No changes.

> 0003 creates a new file toast_helper.c which is intended to help table
> AMs implement insertion and deletion of toast table rows. Most of the
> AM-independent logic from the functions remaining in heaptoast.c is
> moved to this file. This leaves ~600 of the original ~2400 lines from
> tuptoaster.c as heap-specific logic, but a new heap AM actually
> wouldn't need all of that stuff, because some of the logic here is in
> support of stuff like record types, which use HeapTuple internally and
> will continue to do so even if those record types are stored in some
> other kind of table.

Now 0002. No changes.

> 0004 allows TOAST tables to be implemented using a table AM other than
> heap. In a certain sense this is the opposite of 0003. 0003 is
> intended to help people who are implementing a new kind of main table,
> whereas 0004 is intended to help people implementing a new kind of
> TOAST table. It teaches the code that inserts, deletes, and retrieves
> TOAST rows to use slots, and it makes some efficiency improvements in
> the hopes of offsetting any performance loss from so doing. See
> commit message and/or patch for full details.

Now 0003. Some brain fade repaired.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Tue, May 21, 2019 at 2:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Updated and rebased patches attached.

And again.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
I have tested the TOAST patches (v3) with different storage options (MAIN, EXTERNAL, EXTENDED, etc.) and
combinations of compression and out-of-line storage options.
I have used a few dummy tables with various tuple counts, say 10k, 20k, 40k, etc., with different column lengths.
I used a manual CHECKPOINT (with checkpoint_timeout = 1d, max_wal_size = 10GB) before each test to avoid performance fluctuations,
and calculated the results as the median value of a few consecutive test executions.
Please find the SQL script attached herewith, which I have used to perform the observations.
Below are the test scenarios I used to check the behavior and performance of the TOAST patches against PG master.
1. where a single column is compressed (SCC)
2. where multiple columns are compressed (MCC)
-- ALTER the table column(s) to storage "MAIN" to make sure that the column values are COMPRESSED.
3. where a single column is pushed to the TOAST table but not compressed (SCTNC)
4. where multiple columns are pushed to the TOAST table but not compressed (MCTNC)
-- ALTER the table column(s) to storage "EXTERNAL" to make sure that the column values are pushed to the TOAST table but not COMPRESSED.
5. where a single column is pushed to the TOAST table and also compressed (SCTC)
6. where multiple columns are pushed to the TOAST table and also compressed (MCTC)
-- ALTER the table column(s) to storage "EXTENDED" to make sure that the column values are pushed to the TOAST table and also COMPRESSED.
7. updating the tuples with similar data shouldn't affect the behavior of the storage options.
System Used: (VCPUs: 8, RAM: 16GB, Size: 640GB)
                |          10000 Tuples           |          20000 Tuples           |          40000 Tuples           |          80000 Tuples
Test            | Without Patch | With Patch      | Without Patch | With Patch      | Without Patch | With Patch      | Without Patch | With Patch
1. SCC INSERT   | 125921.737 ms (02:05.922) | 125992.563 ms (02:05.993) | 234263.295 ms (03:54.263) | 235952.336 ms (03:55.952) | 497290.442 ms (08:17.290) | 502820.139 ms (08:22.820) | 948470.603 ms (15:48.471) | 941778.952 ms (15:41.779)
1. SCC UPDATE   | 263017.814 ms (04:23.018) | 270893.910 ms (04:30.894) | 488393.748 ms (08:08.394) | 507937.377 ms (08:27.937) | 1078862.613 ms (17:58.863) | 1053029.428 ms (17:33.029) | 2037119.576 ms (33:57.120) | 2023633.862 ms (33:43.634)
2. MCC INSERT   | 35415.089 ms (00:35.415) | 35910.552 ms (00:35.911) | 70899.737 ms (01:10.900) | 70800.964 ms (01:10.801) | 142185.996 ms (02:22.186) | 142241.913 ms (02:22.242) | - | -
2. MCC UPDATE   | 72043.757 ms (01:12.044) | 73848.732 ms (01:13.849) | 137717.696 ms (02:17.718) | 137577.606 ms (02:17.578) | 276358.752 ms (04:36.359) | 276520.727 ms (04:36.521) | - | -
3. SCTNC INSERT | 26377.274 ms (00:26.377) | 25600.189 ms (00:25.600) | 45702.630 ms (00:45.703) | 45163.510 ms (00:45.164) | 99903.299 ms (01:39.903) | 100013.004 ms (01:40.013) | - | -
3. SCTNC UPDATE | 78385.225 ms (01:18.385) | 76680.325 ms (01:16.680) | 151823.250 ms (02:31.823) | 153503.971 ms (02:33.504) | 308197.734 ms (05:08.198) | 308474.937 ms (05:08.475) | - | -
4. MCTNC INSERT | 26214.069 ms (00:26.214) | 25383.522 ms (00:25.384) | 50826.522 ms (00:50.827) | 50221.669 ms (00:50.222) | 106034.338 ms (01:46.034) | 106122.827 ms (01:46.123) | - | -
4. MCTNC UPDATE | 78423.817 ms (01:18.424) | 75154.593 ms (01:15.155) | 158885.787 ms (02:38.886) | 156530.964 ms (02:36.531) | 319721.266 ms (05:19.721) | 322385.709 ms (05:22.386) | - | -
5. SCTC INSERT  | 38451.022 ms (00:38.451) | 38652.520 ms (00:38.653) | 71590.748 ms (01:11.591) | 71048.975 ms (01:11.049) | 143327.913 ms (02:23.328) | 142593.207 ms (02:22.593) | - | -
5. SCTC UPDATE  | 82069.311 ms (01:22.069) | 81678.131 ms (01:21.678) | 138763.508 ms (02:18.764) | 138625.473 ms (02:18.625) | 277534.080 ms (04:37.534) | 277091.611 ms (04:37.092) | - | -
6. MCTC INSERT  | 36325.730 ms (00:36.326) | 35803.368 ms (00:35.803) | 73285.204 ms (01:13.285) | 72728.371 ms (01:12.728) | 142324.859 ms (02:22.325) | 144368.335 ms (02:24.368) | - | -
6. MCTC UPDATE  | 73740.729 ms (01:13.741) | 73002.511 ms (01:13.003) | 141309.859 ms (02:21.310) | 139676.173 ms (02:19.676) | 278906.647 ms (04:38.907) | 279522.408 ms (04:39.522) | - | -
(The 80000-tuple runs were performed only for the SCC tests.)
I have also performed the below tests with TOAST table objects.
8. pg_dump/restore, pg_upgrade with these
9. Streaming Replication setup
10. Concurrent Transactions
While testing a few concurrent transactions, I have the below query:
-- Concurrent transactions acquire a lock for the TOAST option (ALTER TABLE .. SET STORAGE .. MAIN/EXTERNAL/EXTENDED, etc.)
-- Session 1:
CREATE TABLE a (a_id text PRIMARY KEY);
CREATE TABLE b (b_id text);
INSERT INTO a VALUES ('a'), ('b');
INSERT INTO b VALUES ('a'), ('b'), ('b');
BEGIN;
ALTER TABLE b ADD CONSTRAINT bfk FOREIGN KEY (b_id) REFERENCES a (a_id); -- Not Acquiring any lock
-- Session 2:
SELECT * FROM b WHERE b_id = 'a'; -- Shows result
-- Session 1:
ALTER TABLE b ALTER COLUMN b_id SET STORAGE EXTERNAL; -- Acquire a lock
-- Session 2:
SELECT * FROM b WHERE b_id = 'a'; -- Hangs, waiting for the lock taken in session 1
Is this expected behavior?
With Regards,
Attachment
On Wed, Jun 12, 2019 at 4:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 21, 2019 at 2:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Updated and rebased patches attached.
>
> And again.

Hi Robert,

Thus spake GCC:

detoast.c: In function ‘toast_fetch_datum’:
detoast.c:308:12: error: variable ‘toasttupDesc’ set but not used
[-Werror=unused-but-set-variable]
  TupleDesc toasttupDesc;
            ^

--
Thomas Munro
https://enterprisedb.com
On Tue, Jun 25, 2019 at 2:19 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
> I have tested the TOAST patches (v3) with different storage options (MAIN, EXTERNAL, EXTENDED, etc.) and
> combinations of compression and out-of-line storage options.
> I have used a few dummy tables with various tuple counts, say 10k, 20k, 40k, etc., with different column lengths.
> I used a manual CHECKPOINT (with checkpoint_timeout = 1d, max_wal_size = 10GB) before each test to avoid performance fluctuations,
> and calculated the results as the median value of a few consecutive test executions.

Thanks for testing.

> All the observations look good to me,
> except for "Test 1": for SCC UPDATE with tuple counts 10K/20K, and for SCC INSERT with tuple count 40K, there was a slight increase in time taken
> with the patch. For a better observation, I also ran the same "Test 1" for a higher tuple count (i.e. 80K), and it also looks fine.

Did you run each test just once? How stable are the results?

> While testing a few concurrent transactions, I have the below query:
> -- Concurrent transactions acquire a lock for the TOAST option (ALTER TABLE .. SET STORAGE .. MAIN/EXTERNAL/EXTENDED, etc.)
>
> -- Session 1:
> CREATE TABLE a (a_id text PRIMARY KEY);
> CREATE TABLE b (b_id text);
> INSERT INTO a VALUES ('a'), ('b');
> INSERT INTO b VALUES ('a'), ('b'), ('b');
>
> BEGIN;
> ALTER TABLE b ADD CONSTRAINT bfk FOREIGN KEY (b_id) REFERENCES a (a_id); -- Not Acquiring any lock

For me, this acquires AccessShareLock and ShareRowExclusiveLock on the
target table.

rhaas=# select locktype, database, relation, pid, mode, granted
from pg_locks where relation = 'b'::regclass;
 locktype | database | relation |  pid  |         mode          | granted
----------+----------+----------+-------+-----------------------+---------
 relation |    16384 |    16872 | 93197 | AccessShareLock       | t
 relation |    16384 |    16872 | 93197 | ShareRowExclusiveLock | t
(2 rows)

I don't see what that has to do with the topic at hand, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jul 7, 2019 at 11:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Thus spake GCC:
>
> detoast.c: In function ‘toast_fetch_datum’:
> detoast.c:308:12: error: variable ‘toasttupDesc’ set but not used
> [-Werror=unused-but-set-variable]
>   TupleDesc toasttupDesc;
>             ^

Hmm, fixed, I hope. Here's an updated patch set. In addition to the
above fix, I fixed things up for the new pgindent rules and added a
fourth patch that renames the detoasting functions to something that
doesn't include 'heap.'

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
> On Tue, Jun 25, 2019 at 2:19 AM Prabhat Sahu
> <prabhat.sahu@enterprisedb.com> wrote:
> > I have tested the TOAST patches (v3) with different storage options (MAIN, EXTERNAL, EXTENDED, etc.) and
> > combinations of compression and out-of-line storage options.
> > I have used a few dummy tables with various tuple counts, say 10k, 20k, 40k, etc., with different column lengths.
> > I used a manual CHECKPOINT (with checkpoint_timeout = 1d, max_wal_size = 10GB) before each test to avoid performance fluctuations,
> > and calculated the results as the median value of a few consecutive test executions.
>
> Thanks for testing.
>
> > All the observations look good to me,
> > except for "Test 1": for SCC UPDATE with tuple counts 10K/20K, and for SCC INSERT with tuple count 40K, there was a slight increase in time taken
> > with the patch. For a better observation, I also ran the same "Test 1" for a higher tuple count (i.e. 80K), and it also looks fine.
>
> Did you run each test just once? How stable are the results?

No, I have executed the test multiple times (7 times each) and
calculated the result as the median among those, and the result looks
stable (with v3 patches).

With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.
The Postgres Database Company
On Tue, Jul 9, 2019 at 12:40 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
>> Did you run each test just once? How stable are the results?
>
> No, I have executed the test multiple times (7 times each) and calculated the result as the median among those,
> and the result looks stable (with v3 patches).

I spent some time looking at your SCC test today. I think this isn't
really testing the code that actually got changed in the patch: a quick
CPU profile shows that your SCC test is bottlenecked on pg_lzcompress,
which spends a huge amount of time compressing the gigantic string of
'a's you've constructed, and that code is exactly the same with the
patch as it is in master. So, I think that any fluctuations between the
patched and unpatched results are just random variation. There's no
reason the patch should be slower with one row count and faster with a
different row count, anyway.

I tried to come up with a better test case that uses a more modest
amount of data, and ended up with this:

-- Setup.
CREATE OR REPLACE FUNCTION randomish_string(integer) RETURNS text AS $$
SELECT string_agg(random()::text, '') FROM generate_series(1, $1);
$$ LANGUAGE sql;

CREATE TABLE source_compressed (a int, b text);
INSERT INTO source_compressed
SELECT g, repeat('a', 2000) FROM generate_series(1, 10000) g;
CREATE TABLE sink_compressed (LIKE source_compressed);

CREATE TABLE source_external (a int, b text);
INSERT INTO source_external
SELECT g, randomish_string(400) FROM generate_series(1, 10000) g;
CREATE TABLE sink_external (LIKE source_external);

CREATE TABLE source_external_uncompressed (a int, b text);
ALTER TABLE source_external_uncompressed ALTER COLUMN b SET STORAGE EXTERNAL;
INSERT INTO source_external_uncompressed
SELECT g, randomish_string(400) FROM generate_series(1, 10000) g;
CREATE TABLE sink_external_uncompressed (LIKE source_external_uncompressed);
ALTER TABLE sink_external_uncompressed ALTER COLUMN b SET STORAGE EXTERNAL;

-- Test.
\timing
TRUNCATE sink_compressed, sink_external, sink_external_uncompressed;
CHECKPOINT;
INSERT INTO sink_compressed SELECT * FROM source_compressed;
INSERT INTO sink_external SELECT * FROM source_external;
INSERT INTO sink_external_uncompressed SELECT * FROM source_external_uncompressed;

Roughly, on both master and with the patches, the first one takes about
4.2 seconds, the second 7.5, and the third 1.2. The third one is the
fastest because it doesn't do any compression. Since it does less
irrelevant work than the other two cases, it has the best chance of
showing up any performance regression that the patch has caused -- if
any regression existed, I suppose that it would be an increased
per-toast-fetch or per-toast-chunk overhead. However, I can't reproduce
any such regression. My first attempt at testing that case showed the
patch about 1% slower, but that wasn't reliably reproducible when I did
it a bunch more times. So as far as I can figure, this patch does not
regress performance in any easily-measurable way.

Barring objections, I plan to commit the whole series of patches here
(latest rebase attached). They are not perfect and could likely be
improved in various ways, but I think they are an improvement over what
we have now, and it's not like it's set in stone once it's committed.
We can change it more if we come up with a better idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

On 2019-08-01 12:23:42 -0400, Robert Haas wrote:
> Barring objections, I plan to commit the whole series of patches here
> (latest rebase attached). They are not perfect and could likely be
> improved in various ways, but I think they are an improvement over
> what we have now, and it's not like it's set in stone once it's
> committed. We can change it more if we come up with a better idea.

Could you wait until I either had a chance to look again, or until, say,
Monday if I don't get around to it? I'll try to get to it earlier than
that.

Greetings,

Andres Freund
On Thu, Aug 1, 2019 at 1:53 PM Andres Freund <andres@anarazel.de> wrote:
> Could you wait until I either had a chance to look again, or until, say,
> Monday if I don't get around to it? I'll try to get to it earlier than
> that.

Sure, no problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-08-01 12:23:42 -0400, Robert Haas wrote:
> Roughly, on both master and with the patches, the first one takes
> about 4.2 seconds, the second 7.5, and the third 1.2. [...] So as far
> as I can figure, this patch does not regress performance in any
> easily-measurable way.

Hm, those all include writing, right? And for read-only we don't expect
any additional overhead, correct? The write overhead is probably too
large to show a bit of function call overhead - but if so, it'd
probably be on unlogged tables? And with COPY, because that utilizes
multi_insert, which means more toasting in a shorter amount of time?

.oO(why does everyone attach attachments out of order? Is that a gmail
thing?)

> From a4c858c75793f0f8aff7914c572a6615ea5babf8 Mon Sep 17 00:00:00 2001
> From: Robert Haas <rhaas@postgresql.org>
> Date: Mon, 8 Jul 2019 11:58:05 -0400
> Subject: [PATCH 1/4] Split tuptoaster.c into three separate files.
>
> detoast.c/h contain functions required to detoast a datum, partially
> or completely, plus a few other utility functions for examining the
> size of toasted datums.
>
> toast_internals.c/h contain functions that are used internally to the
> TOAST subsystem but which (mostly) do not need to be accessed from
> outside.
>
> heaptoast.c/h contains code that is intrinsically specific to the
> heap AM, either because it operates on HeapTuples or is based on the
> layout of a heap page.
>
> detoast.c and toast_internals.c are placed in
> src/backend/access/common rather than src/backend/access/heap. At
> present, both files still have dependencies on the heap, but that will
> be improved in a future commit.

I wonder if toasting.c should be moved too?

If anybody doesn't know git's --color-moved, it makes patches like this
a lot easier to review.

> index 00000000000..582af147ea1
> --- /dev/null
> +++ b/src/include/access/detoast.h
> @@ -0,0 +1,92 @@
> +/*-------------------------------------------------------------------------
> + *
> + * detoast.h
> + *    Access to compressed and external varlena values.

Hm. Does it matter that that also includes stuff like expanded datums?

> + * Copyright (c) 2000-2019, PostgreSQL Global Development Group
> + *
> + * src/include/access/detoast.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef DETOAST_H
> +#define DETOAST_H

Trailing whitespace after "#ifndef DETOAST_H ".

> commit 60d51e6510c66f79c51e43fe22730fe050d87854
> Author: Robert Haas <rhaas@postgresql.org>
> Date:   2019-07-08 12:02:16 -0400
>
>     Create an API for inserting and deleting rows in TOAST tables.
>
>     This moves much of the non-heap-specific logic from toast_delete and
>     toast_insert_or_update into helper functions accessible via a new
>     header, toast_helper.h. Using the functions in this module, a table
>     AM can implement creation and deletion of TOAST table rows with
>     much less code duplication than was possible heretofore. Some
>     table AMs won't want to use the TOAST logic at all, but for those
>     that do this will make that easier.
>
>     Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com

Hm. This leaves toast_insert_or_update() as a name. That makes it sound
like it's generic toast code, rather than heap specific?

It's definitely nice how a lot of repetitive code has been
deduplicated. Also makes it easier to see how algorithmically expensive
toast_insert_or_update() is :(.

> 	/*
> 	 * Second we look for attributes of attstorage 'x' or 'e' that are still
> 	 * inline, and make them external.  But skip this if there's no toast
> 	 * table to push them to.
> 	 */
> 	while (heap_compute_data_size(tupleDesc,
> 								  toast_values, toast_isnull) > maxDataLen &&
> 		   rel->rd_rel->reltoastrelid != InvalidOid)

Shouldn't this condition be the other way round?

> 	if (for_compression)
> 		skip_colflags |= TOASTCOL_INCOMPRESSIBLE;
>
> 	for (i = 0; i < numAttrs; i++)
> 	{
> 		Form_pg_attribute att = TupleDescAttr(tupleDesc, i);
>
> 		if ((ttc->ttc_attr[i].tai_colflags & skip_colflags) != 0)
> 			continue;
> 		if (VARATT_IS_EXTERNAL(DatumGetPointer(ttc->ttc_values[i])))
> 			continue;			/* can't happen, toast_action would be 'p' */
> 		if (for_compression &&
> 			VARATT_IS_COMPRESSED(DatumGetPointer(ttc->ttc_values[i])))
> 			continue;
> 		if (check_main && att->attstorage != 'm')
> 			continue;
> 		if (!check_main && att->attstorage != 'x' && att->attstorage != 'e')
> 			continue;
>
> 		if (ttc->ttc_attr[i].tai_size > biggest_size)
> 		{
> 			biggest_attno = i;
> 			biggest_size = ttc->ttc_attr[i].tai_size;
> 		}
> 	}

Couldn't most of these be handled via colflags, instead of having
numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED,
TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the
size check ought to boil down to a single mask test?

> extern void toast_tuple_init(ToastTupleContext *ttc);
> extern int	toast_tuple_find_biggest_attribute(ToastTupleContext *ttc,
> 											   bool for_compression,
> 											   bool check_main);
> extern void toast_tuple_try_compression(ToastTupleContext *ttc, int attribute);
> extern void toast_tuple_externalize(ToastTupleContext *ttc, int attribute,
> 									int options, int max_chunk_size);
> extern void toast_tuple_cleanup(ToastTupleContext *ttc, bool cleanup_toastrel);

I wonder if a better prefix wouldn't be toasting_...

> +/*
> + * Information about one column of a tuple being toasted.
> + *
> + * NOTE: toast_action[i] can have these values:
> + *		' '		default handling
> + *		'p'		already processed --- don't touch it
> + *		'x'		incompressible, but OK to move off
> + *
> + * NOTE: toast_attr[i].tai_size is only made valid for varlena attributes with
> + * toast_action[i] different from 'p'.
> + */
> +typedef struct
> +{
> +	struct varlena *tai_oldexternal;
> +	int32		tai_size;
> +	uint8		tai_colflags;
> +} ToastAttrInfo;

I think that comment is outdated?

> +/*
> + * Flags indicating the overall state of a TOAST operation.
> + *
> + * TOAST_NEEDS_DELETE_OLD indicates that one or more old TOAST datums need
> + * to be deleted.
> + *
> + * TOAST_NEEDS_FREE indicates that one or more TOAST values need to be freed.
> + *
> + * TOAST_HAS_NULLS indicates that nulls were found in the tuple being toasted.
> + *
> + * TOAST_NEEDS_CHANGE indicates that a new tuple needs to built; in other
> + * words, the toaster did something.
> + */
> +#define TOAST_NEEDS_DELETE_OLD	0x0001
> +#define TOAST_NEEDS_FREE		0x0002
> +#define TOAST_HAS_NULLS			0x0004
> +#define TOAST_NEEDS_CHANGE		0x0008

I'd make these enums. They're more often accessible in a debugger...

> commit 9e4bd383a00e6bb96088666e57673b343049345c
> Author: Robert Haas <rhaas@postgresql.org>
> Date:   2019-08-01 10:37:02 -0400
>
>     Allow TOAST tables to be implemented using table AMs other than heap.
>
>     toast_fetch_datum, toast_save_datum, and toast_delete_datum are
>     adjusted to use tableam rather than heap-specific functions. This
>     might have some performance impact, but this patch attempts to
>     mitigate that by restructuring things so that we don't open and close
>     the toast table and indexes multiple times per tuple.
>
>     tableam now exposes an integer value (not a callback) for the
>     maximum TOAST chunk size, and has a new callback allowing table
>     AMs to specify the AM that should be used to implement the TOAST
>     table.  Previously, the toast AM was always the same as the table AM.
>
>     Patch by me, tested by Prabhat Sahu.
>
>     Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com

I'm quite unconvinced that making the chunk size specified by the AM is
a good idea to do in isolation. We have TOAST_MAX_CHUNK_SIZE in
pg_control etc. It seems a bit dangerous to let AMs provide the size,
without being very clear that any change of the value will make data
inaccessible. It'd be different if the max were only used during
toasting.

I think the *size* checks should be weakened so we check:
1) After each chunk, whether the already assembled chunks exceed the
   expected size.
2) After all chunks have been collected, check that the size is exactly
   what we expect.

I don't think that removes a meaningful amount of error checking.
Missing tuples etc get detected by the chunk_ids not being consecutive.
The overall size is still verified.

The obvious problem with this is the slice fetching logic. For slices
with an offset of 0, it's obviously trivial to implement. For the
higher slice logic, I'd assume we'd have to fetch the first slice by
estimating where the start chunk is based on the current suggested
chunk size, and restarting the scan earlier/later if not accurate. In
most cases it'll be accurate, so we'd not lose efficiency.

Greetings,

Andres Freund
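To visualize how the toast_tuple_* helper API quoted above is meant to be driven, here is a rough sketch of an AM-side toasting loop. It is only an illustration: fits_on_page() stands in for whatever size check the AM uses, the ToastTupleContext setup is elided, and the "negative result means nothing found" convention is an assumption, not code from the patch.

#include "postgres.h"
#include "access/toast_helper.h"

static bool fits_on_page(ToastTupleContext *ttc);	/* AM-specific size check */

static void
myam_toast_tuple(ToastTupleContext *ttc, int options, int max_chunk_size)
{
	int			attr;

	toast_tuple_init(ttc);

	/* First try compressing compressible attributes, biggest first. */
	while (!fits_on_page(ttc))
	{
		attr = toast_tuple_find_biggest_attribute(ttc, true, false);
		if (attr < 0)
			break;				/* nothing left worth compressing */
		toast_tuple_try_compression(ttc, attr);
	}

	/* Then push attributes out of line until the tuple fits. */
	while (!fits_on_page(ttc))
	{
		attr = toast_tuple_find_biggest_attribute(ttc, false, false);
		if (attr < 0)
			break;				/* nothing left to externalize */
		toast_tuple_externalize(ttc, attr, options, max_chunk_size);
	}

	toast_tuple_cleanup(ttc, true);
}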
On Fri, Aug 2, 2019 at 6:42 PM Andres Freund <andres@anarazel.de> wrote:
> Hm, those all include writing, right? And for read-only we don't expect
> any additional overhead, correct? The write overhead is probably too
> large to show a bit of function call overhead - but if so, it'd probably be
> on unlogged tables? And with COPY, because that utilizes multi_insert,
> which means more toasting in a shorter amount of time?

Yes and yes. I guess we could test the unlogged case and with COPY, but
in any realistic case you're still looking for a tiny CPU overhead in a
sea of I/O costs. Even if an extremely short COPY on an unlogged table
regresses slightly, we do not normally reject patches that improve code
quality on the grounds that they add function call overhead in a few
places. Code like this is hard to optimize and maintain; as you
remarked yourself, there are multiple opportunities to do this stuff
better that are hard to see in the current structure.

> .oO(why does everyone attach attachments out of order? Is that
> a gmail thing?)

Must be.

> I wonder if toasting.c should be moved too?

I mean, we could, but I don't really see a reason. It'd just be moving
it from one fairly-generic place to another, and I'd rather minimize
churn.

> Trailing whitespace after "#ifndef DETOAST_H ".

Will fix.

> Hm. This leaves toast_insert_or_update() as a name. That makes it sound
> like it's generic toast code, rather than heap specific?

I'll rename it to heap_toast_insert_or_update(). But I think I'll put
that in 0004 with the other renames.

> It's definitely nice how a lot of repetitive code has been deduplicated.
> Also makes it easier to see how algorithmically expensive
> toast_insert_or_update() is :(.

Yep.

> Shouldn't this condition be the other way round?

I had to fight pretty hard to stop myself from tinkering with the
algorithm -- this can clearly be done better, but I wanted to make it
match the existing structure as far as possible. It also only needs to
be tested once, not on every loop iteration, so if we're going to start
changing things, we should go further than just swapping the order of
the tests. For now I prefer to draw a line in the sand and change
nothing.

> Couldn't most of these be handled via colflags, instead of having
> numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED,
> TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the
> size check ought to boil down to a single mask test?

I'm not really seeing how more flags would significantly simplify this
logic, but I might be missing something.

> I wonder if a better prefix wouldn't be toasting_...

I'm open to other votes, but I think toast_tuple_ is more specific than
toasting_ and thus better.

> I think that comment is outdated?

Oops. Will fix.

> I'd make these enums. They're more often accessible in a debugger...

Ugh, I hate that style. Abusing enums to make flag bits makes my skin
crawl. I always wondered what the appeal was -- I guess now I know.
Blech.

> I'm quite unconvinced that making the chunk size specified by the AM is
> a good idea to do in isolation. We have TOAST_MAX_CHUNK_SIZE in
> pg_control etc. It seems a bit dangerous to let AMs provide the size,
> without being very clear that any change of the value will make data
> inaccessible. It'd be different if the max were only used during
> toasting.

I was actually thinking about proposing that we rip
TOAST_MAX_CHUNK_SIZE out of pg_control. No real effort has been made
here to make this something that users could configure, and I don't
know of a good reason to configure it. It also seems pretty out of
place in a world where there are multiple AMs floating around -- why
should heap, and only heap, be checked there? Granted it does have some
pride of place, but still.

> I think the *size* checks should be weakened so we check:
> 1) After each chunk, whether the already assembled chunks exceed the
>    expected size.
> 2) After all chunks have been collected, check that the size is exactly
>    what we expect.
>
> I don't think that removes a meaningful amount of error
> checking. Missing tuples etc get detected by the chunk_ids not being
> consecutive. The overall size is still verified.
>
> The obvious problem with this is the slice fetching logic. For slices
> with an offset of 0, it's obviously trivial to implement. For the
> higher slice logic, I'd assume we'd have to fetch the first slice by
> estimating where the start chunk is based on the current suggested
> chunk size, and restarting the scan earlier/later if not accurate. In
> most cases it'll be accurate, so we'd not lose efficiency.

I don't feel entirely convinced that there's any rush to do all of this
right now, and the more I change the harder it is to make sure that I
haven't broken anything. How strongly do you feel about this stuff?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
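For what it's worth, the enum spelling being debated would look roughly like this -- same bit values, just a named type that a debugger can render symbolically. This is an illustration, not code from the patch:

typedef enum ToastFlag
{
	TOAST_NEEDS_DELETE_OLD = 0x0001,
	TOAST_NEEDS_FREE = 0x0002,
	TOAST_HAS_NULLS = 0x0004,
	TOAST_NEEDS_CHANGE = 0x0008
} ToastFlag;

The bits still OR together into an integer flags word; only the declared type changes.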
Robert Haas <robertmhaas@gmail.com> writes:
> I don't feel entirely convinced that there's any rush to do all of
> this right now, and the more I change the harder it is to make sure
> that I haven't broken anything. How strongly do you feel about this
> stuff?

FWIW, I agree with your comment further up that this patch ought to
just refactor, not change any algorithms. The latter can be done in
separate patches.

			regards, tom lane
Hi,

On 2019-08-05 15:36:59 -0400, Robert Haas wrote:
> On Fri, Aug 2, 2019 at 6:42 PM Andres Freund <andres@anarazel.de> wrote:
> > Hm. This leaves toast_insert_or_update() as a name. That makes it sound
> > like it's generic toast code, rather than heap specific?
>
> I'll rename it to heap_toast_insert_or_update(). But I think I'll put
> that in 0004 with the other renames.

Makes sense.

> > It's definitely nice how a lot of repetitive code has been deduplicated.
> > Also makes it easier to see how algorithmically expensive
> > toast_insert_or_update() is :(.
>
> Yep.
>
> > Shouldn't this condition be the other way round?
>
> I had to fight pretty hard to stop myself from tinkering with the
> algorithm -- this can clearly be done better, but I wanted to make it
> match the existing structure as far as possible. It also only needs to
> be tested once, not on every loop iteration, so if we're going to
> start changing things, we should go further than just swapping the
> order of the tests. For now I prefer to draw a line in the sand and
> change nothing.

Makes sense.

> > Couldn't most of these be handled via colflags, instead of having
> > numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED,
> > TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the
> > size check ought to boil down to a single mask test?
>
> I'm not really seeing how more flags would significantly simplify this
> logic, but I might be missing something.

Well, right now you have a number of ifs for each attribute. If you
encoded all the parameters into flags, you could change that to a
single flag test - as far as I can tell, all the parameters could be
represented as that, if you moved MAIN etc into flags. A single if for
flags (and then the size check) is cheaper than several separate
checks.

> > I'm quite unconvinced that making the chunk size specified by the AM is
> > a good idea to do in isolation. [...]
>
> I was actually thinking about proposing that we rip
> TOAST_MAX_CHUNK_SIZE out of pg_control. [...]
>
> I don't feel entirely convinced that there's any rush to do all of
> this right now, and the more I change the harder it is to make sure
> that I haven't broken anything. How strongly do you feel about this
> stuff?

I think we either should leave the hardcoded limit in place, or make it
actually not fixed. Ripping-it-out-but-not-actually just seems like a
trap for the unwary, without much point.

Greetings,

Andres Freund
On 2019-Aug-05, Robert Haas wrote:

> > Shouldn't this condition be the other way round?
>
> I had to fight pretty hard to stop myself from tinkering with the
> algorithm -- this can clearly be done better, but I wanted to make it
> match the existing structure as far as possible. It also only needs to
> be tested once, not on every loop iteration, so if we're going to
> start changing things, we should go further than just swapping the
> order of the tests. For now I prefer to draw a line in the sand and
> change nothing.

I agree, and can we move forward with this 0001? The idea here is to
change no code (as also suggested by Tom elsewhere), and it's the
largest patch in this series by a mile. I checked --color-moved=zebra
and I think the patch looks fine, and also it compiles fine. I ran
src/tools/pginclude/headerscheck on it and found no complaints.

So here's a rebased version, where the DETOAST_H whitespace has been
removed. No other changes from your original. Will you please push
soon?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, Sep 5, 2019 at 10:52 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I agree, and can we move forward with this 0001? The idea here is to
> change no code (as also suggested by Tom elsewhere), and it's the
> largest patch in this series by a mile. I checked --color-moved=zebra
> and I think the patch looks fine, and also it compiles fine. I ran
> src/tools/pginclude/headerscheck on it and found no complaints.
>
> So here's a rebased version, where the DETOAST_H whitespace has been
> removed. No other changes from your original. Will you please push
> soon?

Done, thanks. Here's the rest again with the additional rename added to
0003 (formerly 0004). I think it's probably OK to go ahead with that
stuff, too, but I'll wait a bit to see if anyone wants to raise more
objections.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

On 2019-09-05 13:42:40 -0400, Robert Haas wrote:
> Done, thanks. Here's the rest again with the additional rename added
> to 0003 (formerly 0004). I think it's probably OK to go ahead with
> that stuff, too, but I'll wait a bit to see if anyone wants to raise
> more objections.

Well, I still dislike making the toast chunk size configurable in a
halfhearted manner.

Greetings,

Andres Freund
On Thu, Sep 5, 2019 at 3:10 PM Andres Freund <andres@anarazel.de> wrote:
> On 2019-09-05 13:42:40 -0400, Robert Haas wrote:
> > Done, thanks. Here's the rest again with the additional rename added
> > to 0003 (formerly 0004). I think it's probably OK to go ahead with
> > that stuff, too, but I'll wait a bit to see if anyone wants to raise
> > more objections.
>
> Well, I still dislike making the toast chunk size configurable in a
> halfhearted manner.

So, I'd be willing to look into that some more. But how about if I
commit the next patch in the series first? I think this comment is
really about the second patch in the series, "Allow TOAST tables to be
implemented using table AMs other than heap," and it's fair to point
out that, since that patch extends table AM, we're somewhat committed
to it once we put it in. But "Create an API for inserting and deleting
rows in TOAST tables." is just refactoring, and I don't see what we
gain from waiting to commit that part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Andres Freund <andres@anarazel.de> writes:
> Well, I still dislike making the toast chunk size configurable in a
> halfhearted manner.

It's hard to make it fully configurable without breaking our on-disk
storage, because of the lack of any explicit representation of the
chunk size in TOAST data. You have to "just know" how big the chunks
are supposed to be.

However, it's reasonable to ask why we should treat it as an AM
property, especially a fixed AM property as this has it. If somebody
does reimplement toast logic in some other AM, they might well decide
it's worth the storage cost to be more flexible about the chunk size
... but too bad, this design won't let them do it.

I don't entirely understand why relation_toast_am is a callback while
toast_max_chunk_size isn't, either. Why would they not both be
callbacks? That would at least let an AM set a per-relation max chunk
size, if it wanted.

It seems like this design throws away most of the benefit of a fixed
chunk size (mostly, being able to do relevant modulo arithmetic with
shifts and masks rather than full-fledged integer division) without
getting much of anything in return.

			regards, tom lane
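To spell out the shifts-and-masks point: if the chunk size were a fixed power of two, locating a chunk would need no division at all. A sketch with a hypothetical 2048-byte chunk size (real TOAST chunk sizes are not powers of two, as the reply below notes):

#include <stdint.h>

/* Hypothetical 2048-byte chunks: division and modulo reduce to bit ops. */
#define CHUNK_SHIFT 11
#define CHUNK_MASK  ((1u << CHUNK_SHIFT) - 1)

static inline void
locate_chunk(uint32_t sliceoffset, uint32_t *chunkno, uint32_t *chunkoff)
{
	*chunkno = sliceoffset >> CHUNK_SHIFT;	/* sliceoffset / 2048 */
	*chunkoff = sliceoffset & CHUNK_MASK;	/* sliceoffset % 2048 */
}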
On Thu, Sep 5, 2019 at 3:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Well, I still dislike making the toast chunk size configurable in a
> > halfhearted manner.
>
> It's hard to make it fully configurable without breaking our on-disk
> storage, because of the lack of any explicit representation of the chunk
> size in TOAST data. You have to "just know" how big the chunks are
> supposed to be.

There was a concrete proposal about this from Andres here, down at the
bottom of the email:

http://postgr.es/m/20190802224251.lsxw4o5ebn2ng5ey@alap3.anarazel.de

Basically, detoasting would tolerate whatever chunk size it finds, and
the slice-fetching logic would get complicated.

> However, it's reasonable to ask why we should treat it as an AM property,
> especially a fixed AM property as this has it. If somebody does
> reimplement toast logic in some other AM, they might well decide it's
> worth the storage cost to be more flexible about the chunk size ... but
> too bad, this design won't let them do it.

Fair complaint. The main reason I want to treat it as an AM property is
that TOAST_TUPLE_THRESHOLD is defined in terms of heap-specific
constants, and having other AMs include heap-specific header files
seems like a thing we should try hard to avoid. Once you're indirectly
including htup_details.h in every AM in existence, it's going to be
hard to be sure that you've got no other dependencies on the current
heap AM.

But I agree that making it not a fixed value could be useful. One
benefit of it would be that you could just change the value, even for
the current heap, without breaking access to already-toasted data.

> It seems like this design throws away most of the benefit of a fixed
> chunk size (mostly, being able to do relevant modulo arithmetic with
> shifts and masks rather than full-fledged integer division) without
> getting much of anything in return.

I don't think you're really getting that particular benefit, because
TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET are not going to end up as
powers of two. But you do get the benefit of working with constants
instead of a value determined at runtime.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2019-09-05 15:27:28 -0400, Robert Haas wrote:
> On Thu, Sep 5, 2019 at 3:10 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-09-05 13:42:40 -0400, Robert Haas wrote:
> > > Done, thanks. Here's the rest again with the additional rename added
> > > to 0003 (formerly 0004). I think it's probably OK to go ahead with
> > > that stuff, too, but I'll wait a bit to see if anyone wants to raise
> > > more objections.
> >
> > Well, I still dislike making the toast chunk size configurable in a
> > halfhearted manner.
>
> So, I'd be willing to look into that some more. But how about if I
> commit the next patch in the series first? I think this comment is
> really about the second patch in the series, "Allow TOAST tables to be
> implemented using table AMs other than heap," and it's fair to point
> out that, since that patch extends table AM, we're somewhat committed
> to it once we put it in. But "Create an API for inserting and
> deleting rows in TOAST tables." is just refactoring, and I don't see
> what we gain from waiting to commit that part.

Yea, makes sense to me.
On Thu, Sep 5, 2019 at 4:07 PM Andres Freund <andres@anarazel.de> wrote:
> Yea, makes sense to me.

OK, done. Here's the remaining patches again, with a slight update to
the renaming patch (now 0002). In the last version, I renamed
toast_insert_or_update to heap_toast_insert_or_update but did not
rename toast_delete to heap_toast_delete. Actually, I'm not seeing any
particular reason not to go ahead and push the renaming patch at this
point also.

I guess there's a question as to whether I should more aggressively add
"heap" to the names of the other functions in heaptoast.h, but I'm
kinda "meh" about that. It seems likely that other AMs will need their
own versions of toast_insert_or_update() and toast_delete(), but they
shouldn't really need their own version of, say,
toast_flatten_tuple_to_datum(), and the point there is that we're
building a DatumTuple, so calling it
heap_toast_flatten_tuple_to_datum() seems almost misleading. I'm
inclined to leave all that stuff alone for now.

0001 needs more thought, as discussed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Fri, Aug 2, 2019 at 6:42 PM Andres Freund <andres@anarazel.de> wrote:
> The obvious problem with this is the slice fetching logic. For slices
> with an offset of 0, it's obviously trivial to implement. For the
> higher slice logic, I'd assume we'd have to fetch the first slice by
> estimating where the start chunk is based on the current suggested
> chunk size, and restarting the scan earlier/later if not accurate. In
> most cases it'll be accurate, so we'd not lose efficiency.

Dilip and I were discussing this proposal this morning, and after
talking to him, I don't quite see how to make this work without
significant loss of efficiency. Suppose that, based on the estimated
chunk size, you decide that there are probably 10 chunks and that the
value that you need is probably located in chunk 6. So you fetch chunk
6. Happily, chunk 6 is the size that you expect, so you extract the
bytes that you need and go on your way. But ... what if there are
actually 6 chunks, not 10, and the first five are bigger than you
expected, and the reason why the size of chunk 6 matched your
expectation was that it was the last chunk and thus smaller than the
rest? Now you've just returned the wrong answer. AFAICS, there's no way
to detect that except to always read at least two chunks, which seems
like a pretty heavy price to pay. It doesn't cost if you were going to
need to read at least 2 chunks anyway, but if you were only going to
need to read 1, having to fetch 2 stinks.

Actually, when I initially read your proposal, I thought you were
proposing to relax things such that the chunks didn't even have to all
be the same size. That seems like it would be something cool that could
potentially be leveraged not only by AMs but perhaps also by data types
that want to break up their data strategically to cater to future
access patterns. But then there's really no way to make slicing work
without always reading from the beginning.

There would be no big problem here - in either scenario - if each chunk
contained the byte-offset of that chunk relative to the start of the
datum. You could guess wildly and always know whether or not you had
got it right without reading any extra data. But that's not the case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
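For reference, the arithmetic that variable chunk sizes would break: with a fixed chunk size, the chunks covering a slice can be located directly, along these lines (a simplified sketch of the fixed-size case, not the patch's code):

/* Chunks covering bytes [sliceoffset, sliceoffset + slicelength). */
int32		startchunk = sliceoffset / TOAST_MAX_CHUNK_SIZE;
int32		endchunk = (sliceoffset + slicelength - 1) / TOAST_MAX_CHUNK_SIZE;

If chunk sizes could vary, chunk N's starting byte offset would be the sum of the sizes of chunks 0 .. N-1, which cannot be known without reading those chunks (or storing per-chunk offsets, which the TOAST format does not do) -- hence the guess-and-restart scheme, and the failure mode described above where a guessed chunk happens to have a plausible size.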
On Fri, Sep 6, 2019 at 10:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 5, 2019 at 4:07 PM Andres Freund <andres@anarazel.de> wrote:
> > Yea, makes sense to me.
>
> OK, done. Here's the remaining patches again, with a slight update to
> the renaming patch (now 0002). In the last version, I renamed
> toast_insert_or_update to heap_toast_insert_or_update but did not
> rename toast_delete to heap_toast_delete. Actually, I'm not seeing
> any particular reason not to go ahead and push the renaming patch at
> this point also.

And, hearing no objections, done.

Here's the last patch back, rebased over that renaming. Although I
think that Andres (and Tom) are probably right that there's room for
improvement here, I currently don't see a way around the issues I wrote
about in
http://postgr.es/m/CA+Tgmoa0zFcaCpOJCsSpOLLGpzTVfSyvcVB-USS8YoKzMO51Yw@mail.gmail.com
-- so not quite sure where to go next. Hopefully Andres or someone else
will give me a quick whack with the cluebat if I'm missing something
obvious.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
While testing the Toast patch (PG + v7 patch) I found the below server crash.
System configuration:
VCPUs: 4, RAM: 8GB, Storage: 320GB
This issue is not frequently reproducible; we need to repeat the same testcase multiple times.
CREATE OR REPLACE FUNCTION toast_chunks_cnt_func(p1 IN text)
RETURNS int AS $$
DECLARE
    chunks_cnt int;
    v_tbl text;
BEGIN
    SELECT reltoastrelid::regclass INTO v_tbl FROM pg_class WHERE relname = p1;
    EXECUTE 'SELECT count(*) FROM ' || v_tbl::regclass INTO chunks_cnt;
    RETURN chunks_cnt;
END; $$ LANGUAGE plpgsql;
-- Server crash after multiple run of below testcase
-- ------------------------------------------------------------------------
CHECKPOINT;
CREATE TABLE toast_tab (c1 text);
\d+ toast_tab
-- ALTER table column c1 for storage as "EXTERNAL" to make sure that the column value is pushed to the TOAST table but not COMPRESSED.
ALTER TABLE toast_tab ALTER COLUMN c1 SET STORAGE EXTERNAL;
\d+ toast_tab
\timing
INSERT INTO toast_tab
( select repeat('a', 200000)
from generate_series(1,40000) x);
\timing
SELECT reltoastrelid::regclass FROM pg_class WHERE RELNAME = 'toast_tab';
SELECT toast_chunks_cnt_func('toast_tab') "Number of chunks";
SELECT pg_column_size(t1.*) FROM toast_tab t1 limit 1;
SELECT DISTINCT SUBSTR(c1, 90000,10) FROM toast_tab;
CHECKPOINT;
\timing
UPDATE toast_tab SET c1 = UPPER(c1);
\timing
SELECT toast_chunks_cnt_func('toast_tab') "Number of chunks";
SELECT pg_column_size(t1.*) FROM toast_tab t1 limit 1;
SELECT DISTINCT SUBSTR(c1, 90000,10) FROM toast_tab;
DROP TABLE toast_tab;
-- ------------------------------------------------------------------------
-- Stacktrace as below:
[centos@host-192-168-1-249 bin]$ gdb -q -c data2/core.3151 postgres
Reading symbols from /home/centos/PG/PGsrc/postgresql/inst/bin/postgres...done.
[New LWP 3151]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: checkpointer '.
Program terminated with signal 6, Aborted.
#0 0x00007f2267d33207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.5.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libcom_err-1.42.9-13.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 openssl-libs-1.0.2k-16.el7_6.1.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0 0x00007f2267d33207 in raise () from /lib64/libc.so.6
#1 0x00007f2267d348f8 in abort () from /lib64/libc.so.6
#2 0x0000000000eb3a80 in errfinish (dummy=0) at elog.c:552
#3 0x0000000000c26530 in ProcessSyncRequests () at sync.c:393
#4 0x0000000000bbbc57 in CheckPointBuffers (flags=256) at bufmgr.c:2589
#5 0x0000000000604634 in CheckPointGuts (checkPointRedo=51448358328, flags=256) at xlog.c:8992
#6 0x0000000000603b5e in CreateCheckPoint (flags=256) at xlog.c:8781
#7 0x0000000000aed8fa in CheckpointerMain () at checkpointer.c:481
#8 0x00000000006240de in AuxiliaryProcessMain (argc=2, argv=0x7ffe887c0880) at bootstrap.c:461
#9 0x0000000000b0e834 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5414
#10 0x0000000000b09283 in reaper (postgres_signal_arg=17) at postmaster.c:2995
#11 <signal handler called>
#12 0x00007f2267df1f53 in __select_nocancel () from /lib64/libc.so.6
#13 0x0000000000b05000 in ServerLoop () at postmaster.c:1682
#14 0x0000000000b0457b in PostmasterMain (argc=5, argv=0x349bce0) at postmaster.c:1391
#15 0x0000000000971c9f in main (argc=5, argv=0x349bce0) at main.c:210
(gdb)
> On Fri, Sep 6, 2019 at 10:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Sep 5, 2019 at 4:07 PM Andres Freund <andres@anarazel.de> wrote:
> > > Yea, makes sense to me.
> >
> > OK, done. Here's the remaining patches again, with a slight update to
> > the renaming patch (now 0002). In the last version, I renamed
> > toast_insert_or_update to heap_toast_insert_or_update but did not
> > rename toast_delete to heap_toast_delete. Actually, I'm not seeing
> > any particular reason not to go ahead and push the renaming patch at
> > this point also.
>
> And, hearing no objections, done.
>
> Here's the last patch back, rebased over that renaming. Although I
> think that Andres (and Tom) are probably right that there's room for
> improvement here, I currently don't see a way around the issues I
> wrote about in http://postgr.es/m/CA+Tgmoa0zFcaCpOJCsSpOLLGpzTVfSyvcVB-USS8YoKzMO51Yw@mail.gmail.com
> -- so not quite sure where to go next. Hopefully Andres or someone
> else will give me a quick whack with the cluebat if I'm missing
> something obvious.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.
The Postgres Database Company
On Wed, Oct 30, 2019 at 3:49 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
> While testing the TOAST patch (PG + v7 patch) I found the server crash below.
>
> System configuration:
> VCPUs: 4, RAM: 8GB, Storage: 320GB
>
> This issue is not frequently reproducible; the same testcase needs to be repeated multiple times.

I wonder if this is an independent bug, because the backtrace doesn't
look like it's related to the stuff this is changing. Your report
doesn't specify whether you can also reproduce the problem without the
patch, which is something that you should always check before
reporting a bug in a particular patch.
Hi Robert,

My sincere apologies for not describing the issue in more detail.
I have run the same case against both PG HEAD and HEAD+Patch multiple
times (7, 10, 20 runs), and found that the issue does not occur on
HEAD, while the same case is reproducible on HEAD+Patch (again, I was
not sure whether the backtrace is related to the patch or not).

With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.
The Postgres Database Company
From the stack trace shared by Prabhat, I understand that the checkpointer process panicked due to one of the following two reasons:

1) The fsync() failed on the first attempt itself, and the reason for the failure was not the file being dropped or truncated, i.e. fsync failed with an error other than ENOENT. Refer to ProcessSyncRequests() for details, esp. the code inside the for (failures = 0; !entry->canceled; failures++) loop.

2) The first attempt at fsync() failed with ENOENT because, just before the fsync function was called, the file being synced was either dropped or truncated. When this happened, the checkpointer process called AbsorbSyncRequests() to update the entry for the deleted file in the hash table, but it seems AbsorbSyncRequests() failed to do so, which is why "entry->canceled" couldn't be set to true. Due to this, fsync() was performed on the same file twice, and that failed too. As the checkpointer process doesn't expect fsync on the same file to fail twice, it panicked. Again, please check ProcessSyncRequests() for details, esp. the code inside the for (failures = 0; !entry->canceled; failures++) loop.

Now, the point of discussion here is: which of the above two reasons could be the cause of the panic? To me, point #2 doesn't look like a possible reason. Just before a file is unlinked, the backend first sends a SYNC_FORGET_REQUEST to the checkpointer process, which marks the entry for this file in the hash table as cancelled, and only then removes the file. With that understanding it is hard to believe that, once the first fsync() for a file has failed with ENOENT, a call to AbsorbSyncRequests() made immediately afterwards wouldn't update the entry for this file in the hash table, because the backend only removes the file once it has successfully sent the SYNC_FORGET_REQUEST for that file to the checkpointer process. See mdunlinkfork()->register_forget_request() for details on this.

So, I think the first point mentioned above is the probable reason for the checkpointer process panicking. But, having said all that, it would be good to have some evidence for it, which can be confirmed by inspecting the server logfile.

Prabhat, is it possible for you to re-run the test case with log_min_messages set to DEBUG1 and save the logfile for the run that crashes? That would help in knowing whether fsync was performed just once or twice, i.e. whether point #1 or point #2 is the reason for the panic.
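For reference, the retry loop in ProcessSyncRequests() that both points above refer to looks roughly like this (a paraphrased sketch of sync.c, not the exact source; note that data_sync_elevel() promotes the ERROR to PANIC by default, which is the abort seen in the stack trace):

    for (failures = 0; !entry->canceled; failures++)
    {
        /* Attempt to sync the file this hash-table entry refers to. */
        if (pg_fsync(fd) == 0)
            break;              /* success */

        /*
         * Tolerate exactly one ENOENT-style failure: the file may have
         * been dropped or truncated just before we got here, in which
         * case a SYNC_FORGET_REQUEST should already be queued for us.
         * Any other error, or a second failure, is treated as fatal.
         */
        if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
            ereport(data_sync_elevel(ERROR),
                    (errcode_for_file_access(),
                     errmsg("could not fsync file \"%s\": %m", path)));

        /* May set entry->canceled, in which case the loop ends. */
        AbsorbSyncRequests();
    }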
On 2019-10-04 20:32, Robert Haas wrote:
> Here's the last patch back, rebased over that renaming. Although I
> think that Andres (and Tom) are probably right that there's room for
> improvement here, I currently don't see a way around the issues I
> wrote about in http://postgr.es/m/CA+Tgmoa0zFcaCpOJCsSpOLLGpzTVfSyvcVB-USS8YoKzMO51Yw@mail.gmail.com
> -- so not quite sure where to go next. Hopefully Andres or someone
> else will give me a quick whack with the cluebat if I'm missing
> something obvious.

This patch seems sound as far as the API restructuring goes.

If I may summarize the remaining discussion: This patch adds a field
toast_max_chunk_size to TableAmRoutine, to take the place of the
hardcoded TOAST_MAX_CHUNK_SIZE. The heapam_methods implementation then
sets this to TOAST_MAX_CHUNK_SIZE, thus preserving existing behavior.
Other table AMs can set this to some other value that they find
suitable. Currently, TOAST_MAX_CHUNK_SIZE is computed based on
heap-specific values and assumptions, so it's likely that other AMs
won't want to use that value. (Side note: Maybe rename
TOAST_MAX_CHUNK_SIZE then.)

The concern was raised that while TOAST_MAX_CHUNK_SIZE is stored in
pg_control, values chosen by other table AMs won't be, and so they
won't have any safeguards against starting a server with an
incompatible disk layout. Then, various ways to detect or check the
TOAST chunk size at run time were discussed, but none seemed
satisfactory.

I think AMs are probably going to need a general mechanism to store
pg_control-like data somewhere. There are going to be chunk sizes,
block sizes, segment sizes, and so on. This one is just a particular
case of that.

This particular patch doesn't need to be held up by that, though.
Providing that mechanism can be a separate subproject of pluggable
storage.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
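To make the summarized v7 approach concrete, the API shape being described is roughly the following (a sketch only; the field name is from the discussion above, the surrounding struct contents are abbreviated and the exact type is illustrative):

    /* src/include/access/tableam.h, abbreviated */
    typedef struct TableAmRoutine
    {
        NodeTag     type;
        /* ... existing callbacks ... */

        /* maximum TOAST chunk size for toast tables of this AM */
        int32       toast_max_chunk_size;
    } TableAmRoutine;

    /* heapam's method table preserves existing behavior */
    static const TableAmRoutine heapam_methods = {
        /* ... */
        .toast_max_chunk_size = TOAST_MAX_CHUNK_SIZE,
    };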
On Wed, Nov 6, 2019 at 4:01 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> This patch seems sound as far as the API restructuring goes.

Thanks. And thanks for weighing in.

> If I may summarize the remaining discussion: This patch adds a field
> toast_max_chunk_size to TableAmRoutine, to take the place of the
> hardcoded TOAST_MAX_CHUNK_SIZE. The heapam_methods implementation then
> sets this to TOAST_MAX_CHUNK_SIZE, thus preserving existing behavior.
> Other table AMs can set this to some other value that they find
> suitable. Currently, TOAST_MAX_CHUNK_SIZE is computed based on
> heap-specific values and assumptions, so it's likely that other AMs
> won't want to use that value. (Side note: Maybe rename
> TOAST_MAX_CHUNK_SIZE then.)

Yeah.

> The concern was raised that while TOAST_MAX_CHUNK_SIZE is stored in
> pg_control, values chosen by other table AMs won't be, and so they
> won't have any safeguards against starting a server with an
> incompatible disk layout. Then, various ways to detect or check the
> TOAST chunk size at run time were discussed, but none seemed
> satisfactory.

Yeah. I've been nervous about trying to proceed with this patch
because Andres seemed confident there was a better approach than what
I did here, but as I wrote about back on September 12th, it doesn't
seem like his idea will work. I'm not clear whether I'm being stupid
and there's a way to salvage his idea, or whether he just made a
mistake.

One possible approach would be to move more of the logic below the
tableam layer. For example, toast_fetch_datum() could do this:

    toastrel = table_open(toast_pointer.va_toastrelid, AccessShareLock);
    call_a_new_tableam_method_here(toastrel, &toast_pointer);
    table_close(toastrel, AccessShareLock);

...and then it becomes the tableam's job to handle everything that
needs to be done in the middle. That might be better than what I've
got now; it's certainly more flexible. It does mean that an AM that
just wants to reuse the existing logic with a different chunk size has
got to repeat some code, but it's probably <~150 lines, so that's
perhaps not a catastrophe.

Alternatively, we could (a) stick with the current approach, (b) use
the current approach but make the table AM member a callback rather
than a constant, or (c) something else entirely. I don't want to give
up on making the TOAST infrastructure pluggable; requiring every AM to
use the heap as its TOAST implementation seems too constraining.

> I think AMs are probably going to need a general mechanism to store
> pg_control-like data somewhere. There are going to be chunk sizes,
> block sizes, segment sizes, and so on. This one is just a particular
> case of that.

That's an interesting point. I don't know for sure to what extent we
need that; I think that the toast chunk size is actually not very
interesting to vary, and the fact that we technically allow it to be
varied seems like it isn't buying us much. I think as much as possible
we should allow settings that actually need to be varied to differ
table-by-table, not require a recompile or re-initdb. But if we are
going to have some that do require that, then what you're talking
about here would certainly make that easier to secure.

> This particular patch doesn't need to be held up by that, though.
> Providing that mechanism can be a separate subproject of pluggable
> storage.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-11-06 10:01:40 +0100, Peter Eisentraut wrote:
> On 2019-10-04 20:32, Robert Haas wrote:
> > Here's the last patch back, rebased over that renaming. Although I
> > think that Andres (and Tom) are probably right that there's room for
> > improvement here, I currently don't see a way around the issues I
> > wrote about in http://postgr.es/m/CA+Tgmoa0zFcaCpOJCsSpOLLGpzTVfSyvcVB-USS8YoKzMO51Yw@mail.gmail.com
> > -- so not quite sure where to go next. Hopefully Andres or someone
> > else will give me a quick whack with the cluebat if I'm missing
> > something obvious.
>
> This patch seems sound as far as the API restructuring goes.
>
> If I may summarize the remaining discussion: This patch adds a field
> toast_max_chunk_size to TableAmRoutine, to take the place of the
> hardcoded TOAST_MAX_CHUNK_SIZE. The heapam_methods implementation then
> sets this to TOAST_MAX_CHUNK_SIZE, thus preserving existing behavior.
> Other table AMs can set this to some other value that they find
> suitable. Currently, TOAST_MAX_CHUNK_SIZE is computed based on
> heap-specific values and assumptions, so it's likely that other AMs
> won't want to use that value. (Side note: Maybe rename
> TOAST_MAX_CHUNK_SIZE then.) The concern was raised that while
> TOAST_MAX_CHUNK_SIZE is stored in pg_control, values chosen by other
> table AMs won't be, and so they won't have any safeguards against
> starting a server with an incompatible disk layout. Then, various ways
> to detect or check the TOAST chunk size at run time were discussed,
> but none seemed satisfactory.

I think it's more than just that. It's also that I think presenting a
hardcoded value to the outside of / above an AM is architecturally
wrong. If anything this is an implementation detail of the AM, that
the AM ought to be concerned with internally, not something it should
present to the outside.

I also, and separately from that architectural concern, think that
hardcoding values like this in the control file is a bad practice, and
we shouldn't expand it. It basically makes it practically impossible
to ever change their default value.

> I think AMs are probably going to need a general mechanism to store
> pg_control-like data somewhere. There are going to be chunk sizes,
> block sizes, segment sizes, and so on. This one is just a particular
> case of that.

That's imo best done as a meta page within the table.

> This particular patch doesn't need to be held up by that, though.
> Providing that mechanism can be a separate subproject of pluggable
> storage.

Again seems like something that the AM ought to handle below it.

Greetings,

Andres Freund
On Wed, Nov 6, 2019 at 11:25 AM Andres Freund <andres@anarazel.de> wrote:
> I think it's more than just that. It's also that I think presenting a
> hardcoded value to the outside of / above an AM is architecturally
> wrong. If anything this is an implementation detail of the AM, that
> the AM ought to be concerned with internally, not something it should
> present to the outside.

I mean, it depends on your vision of how things ought to be
abstracted. If you want the TOAST stuff to be logically "below" the
table AM layer, then this is an abstraction violation. But if you
think of TOAST as being a parallel system to table AM, then it's fine.

It also depends on your goals. If you want to give the table AM
maximum freedom to do what it likes, the design I proposed is not very
good. If you want to make it easy for someone to plug in a new AM that
does toasting like the current heap but with a different chunk size,
that design lets you do so with a very minimal amount of code.

I don't really care very much about the details here, but I don't want
to just keep kicking the can down the road. If we can agree on *some*
design that lets a new table AM have a TOAST table that uses an AM
other than the heap, and that I can understand and implement with some
halfway-reasonable amount of work, I'll do it. It doesn't have to be
the thing I proposed. But I think it would be better to do that thing
than nothing. We're not engraving anything we do here on stone
tablets.

> I also, and separately from that architectural concern, think that
> hardcoding values like this in the control file is a bad practice, and
> we shouldn't expand it. It basically makes it practically impossible
> to ever change their default value.

I generally agree, although I think there might be exceptions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-11-06 10:38:58 -0500, Robert Haas wrote:
> Yeah. I've been nervous about trying to proceed with this patch
> because Andres seemed confident there was a better approach than what
> I did here, but as I wrote about back on September 12th, it doesn't
> seem like his idea will work. I'm not clear whether I'm being stupid
> and there's a way to salvage his idea, or whether he just made a
> mistake.

(still trying to get back to this)

> One possible approach would be to move more of the logic below the
> tableam layer. For example, toast_fetch_datum() could do this:
>
>     toastrel = table_open(toast_pointer.va_toastrelid, AccessShareLock);
>     call_a_new_tableam_method_here(toastrel, &toast_pointer);
>     table_close(toastrel, AccessShareLock);
>
> ...and then it becomes the tableam's job to handle everything that
> needs to be done in the middle.

I think that's a good direction to go in, for more than just the
discussion we're having here about the fixed chunking size. I'm fairly
sure that plenty of AMs e.g. wouldn't want to actually store toasted
datums in a separate relation. And this would at least go more in the
direction of making that possible.

And I think the above ought to not even increase the overhead compared
to the patch, as the number of indirect function calls ought to stay
the same or even be lower. The indirection would otherwise have to
happen within toast_fetch_datum(), whereas with the above, it ought to
be possible to avoid needing to do so while processing toast chunks.

It seems, unfortunately, that at the moment an AM not wanting to store
toast datums in a separate file would still need to create a toast
relation, just to get a distinct toast oid, to make sure that toasted
datums from the old and new relation are distinct - which seems like
it could be important e.g. for table rewrites. There's probably some
massaging of table rewrites and cluster needed.

I suspect the new callback ought to allow sliced and non-sliced
access, perhaps by just allowing to specify slice offset / length as 0
and INT32_MAX respectively (or maybe just -1, -1?). That'd also allow
an AM to make slicing possible in cases where that's not possible for
the heap. And there seems little point in having two callbacks.

> That might be better than what I've got now; it's certainly more
> flexible. It does mean that an AM that just wants to reuse the
> existing logic with a different chunk size has got to repeat some
> code, but it's probably <~150 lines, so that's perhaps not a
> catastrophe.

Also seems that the relevant code can be made reusable (opting into
the current logic), in line with where you've been going with this
code already.

> Alternatively, we could (a) stick with the current approach, (b) use
> the current approach but make the table AM member a callback rather
> than a constant, or (c) something else entirely.

A callback returning the chunk size does not seem like an improvement
to me.

> I don't want to give up on making the TOAST infrastructure pluggable;
> requiring every AM to use the heap as its TOAST implementation seems
> too constraining.

+1

> > I think AMs are probably going to need a general mechanism to store
> > pg_control-like data somewhere. There are going to be chunk sizes,
> > block sizes, segment sizes, and so on. This one is just a particular
> > case of that.
>
> That's an interesting point. I don't know for sure to what extent we
> need that; I think that the toast chunk size is actually not very
> interesting to vary, and the fact that we technically allow it to be
> varied seems like it isn't buying us much.

Whether I agree with that statement depends a bit on what you mean by
varying the chunk size. If you mean that there's not much need for a
value other than an adjusted computation of what's currently used,
then I don't agree: We currently make toast a lot more expensive by
quadrupling the number of separate heap fetches. And e.g. compressing
chunks separately, to allow for sliced access even when compressed,
would also be hard to do with the current sizes.

Additionally, I *very* strongly suspect that, while it makes sense to
use chunk sizes where multiple chunks fit a page for a heavily updated
toast table full of small-ish values, it makes no sense whatsoever to
do so when toasting a 10MB value that's going to be appended to the
toast relation, because there's no space available for reuse anyway.
And smaller chunking isn't going to help with space reuse either,
because the whole toast datum is going to be deleted together. I think
the right way for toast creation to behave really would be to check
whether there's free space available that'd benefit from using smaller
chunks, and do so if available, and otherwise use all the space in a
page for each chunk. That'd obviously make sliced access harder, so it
surely isn't a panacea.

Greetings,

Andres Freund
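A minimal sketch of the kind of single, slice-capable callback being suggested above (the name and exact signature are illustrative only; nothing like this is nailed down in the patches):

    /*
     * Hypothetical table AM callback: fetch a toasted value, or a slice
     * of it.  sliceoffset = 0 with slicelength = INT32_MAX (or -1/-1)
     * would mean "fetch the whole value", so one callback serves both
     * the sliced and non-sliced access paths.
     */
    struct varlena *(*relation_fetch_toast_slice) (Relation toastrel,
                                                   struct varatt_external toast_pointer,
                                                   int32 sliceoffset,
                                                   int32 slicelength);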
Hi,

On 2019-11-06 11:49:10 -0500, Robert Haas wrote:
> On Wed, Nov 6, 2019 at 11:25 AM Andres Freund <andres@anarazel.de> wrote:
> > I think it's more than just that. It's also that I think presenting a
> > hardcoded value to the outside of / above an AM is architecturally
> > wrong. If anything this is an implementation detail of the AM, that
> > the AM ought to be concerned with internally, not something it should
> > present to the outside.
>
> I mean, it depends on your vision of how things ought to be
> abstracted. If you want the TOAST stuff to be logically "below" the
> table AM layer, then this is an abstraction violation. But if you
> think of TOAST as being a parallel system to table AM, then it's fine.
> It also depends on your goals. If you want to give the table AM
> maximum freedom to do what it likes, the design I proposed is not very
> good. If you want to make it easy for someone to plug in a new AM that
> does toasting like the current heap but with a different chunk size,
> that design lets you do so with a very minimal amount of code.

I'd like an AM to have the *option* of implementing something better,
or at least go in the direction of making that possible.

It seems perfectly possible to have a helper function implementing the
current logic that you just can call with the fixed chunk size as an
additional parameter. Which'd basically mean there's no meaningful
difference in complexity compared to providing the chunk size as an
external AM property. In one case you have a callback that just calls
a helper function with one parameter, in the other you fill in a
member of the struct.

Greetings,

Andres Freund
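Side by side, the two shapes being compared come out like this (a sketch with hypothetical names, assuming a shared helper of the kind described above):

    /* Option 1 (the v7 patch): a constant advertised in the AM struct */
    .toast_max_chunk_size = MY_AM_CHUNK_SIZE,

    /* Option 2: a callback that opts into the generic logic */
    static struct varlena *
    my_am_fetch_toasted(Relation toastrel,
                        struct varatt_external toast_pointer,
                        int32 sliceoffset, int32 slicelength)
    {
        /* generic_toast_fetch() is a hypothetical shared helper that
         * implements the current logic for a caller-supplied chunk size */
        return generic_toast_fetch(toastrel, toast_pointer,
                                   sliceoffset, slicelength,
                                   MY_AM_CHUNK_SIZE);
    }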
On Thu, Nov 7, 2019 at 10:57 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
> I have run the same testcases with and without the patch multiple
> times with the debug option (log_min_messages = DEBUG1), but this
> time I am not able to reproduce the crash.

Okay, no problem. Thanks for re-running the test-cases.

@Robert, Prabhat and I have tried running the test-cases that caused
the checkpointer process to crash earlier multiple times, but we are
not able to reproduce it, either with or without the patch. However,
from the stack trace shared earlier by Prabhat, it is clear that the
checkpointer process panicked due to an fsync failure, though there is
no further data to know the exact reason for that failure. From the
code of the checkpointer process (basically the function that
processes fsync requests), it is understood that the checkpointer can
PANIC for one of the following two reasons:

1) The fsync call made by the checkpointer process failed with an
error other than ENOENT.

2) The fsync call made by the checkpointer process failed with ENOENT,
which caused the checkpointer to invoke AbsorbSyncRequests() to update
the entry for the deleted file in the hash table (basically to mark
the entry as cancelled). But it seems it couldn't do so, either
because a) possibly no SYNC_FORGET_REQUEST was sent by the backend to
the checkpointer process, or b) the request was sent but for some
reason the checkpointer couldn't absorb it. This caused the
checkpointer to perform fsync on the same file once again, which is
bound to fail, resulting in a panic.

Now, if the checkpointer panicked due to reason #1, then I don't think
it has anything to do with postgres, because postgres only cares when
fsync fails with ENOENT. If it panicked due to reason #2, then
possibly there is some bug in the postgres code, which I assume would
have to be a problem with the way the backend sends fsync requests to
the checkpointer for deleted files and the way the checkpointer
handles those requests. At least for me, it is hard to believe that
reason #2 could be the cause of the panic here, because before a file
is unlinked by the backend, it first sends a SYNC_FORGET_REQUEST to
the checkpointer process, and only when that is done successfully does
the backend remove the file. With that understanding it is hard to
believe that, once the first fsync() for a file has failed with
ENOENT, a call to AbsorbSyncRequests() made immediately afterwards
wouldn't update the entry for this file in the hash table.

And even if reason #2 is the cause of this failure, I don't think it
has anything to do with your changes. I haven't studied your patches
in detail, but considering the purpose of the patch, and from a quick
look, it doesn't seem to change anything in the area of the code that
might be causing this crash.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On 2019-11-06 18:00, Andres Freund wrote:
> I'd like an AM to have the *option* of implementing something better,
> or at least go in the direction of making that possible.

I don't think the presented design prevents that. An AM can just
return false from relation_needs_toast_table in all cases and
implement something internally.

> It seems perfectly possible to have a helper function implementing the
> current logic that you just can call with the fixed chunk size as an
> additional parameter. Which'd basically mean there's no meaningful
> difference in complexity compared to providing the chunk size as an
> external AM property. In one case you have a callback that just calls
> a helper function with one parameter, in the other you fill in a
> member of the struct.

I can see a "moral" concern about having TOAST be part of the table AM
API. It should be an implementation concern of the AM. How much more
work would it be to refactor TOAST into a separate API that an AM
implementation could use or not? How much more complicated would the
result be? I guess you would like to at least have it explored.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 7, 2019 at 1:15 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> @Robert, Prabhat and I have tried running the test-cases that caused
> the checkpointer process to crash earlier multiple times, but we are
> not able to reproduce it, either with or without the patch. However,
> from the stack trace shared earlier by Prabhat, it is clear that the
> checkpointer process panicked due to an fsync failure, though there
> is no further data to know the exact reason for that failure. From
> the code of the checkpointer process (basically the function that
> processes fsync requests), it is understood that the checkpointer can
> PANIC for one of the following two reasons:

Oh, I didn't realize this was a panic due to an fsync() failure when I
looked at the stack trace before. I think it's concerning that
fsync() failed on Prabhat's machine, and it would be interesting to
know why that happened, but I don't see how this patch could possibly
*cause* fsync() to fail, so I think we can say that whatever is
happening on his machine is unrelated to this patch -- and probably
also unrelated to PostgreSQL.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 7, 2019 at 7:35 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Oh, I didn't realize this was a panic due to an fsync() failure when I
> looked at the stack trace before. I think it's concerning that
> fsync() failed on Prabhat's machine, and it would be interesting to
> know why that happened, but I don't see how this patch could possibly
> *cause* fsync() to fail, so I think we can say that whatever is
> happening on his machine is unrelated to this patch -- and probably
> also unrelated to PostgreSQL.

That's right and that's exactly what I mentioned in my conclusion too.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 6, 2019 at 12:00 PM Andres Freund <andres@anarazel.de> wrote:
> I'd like an AM to have the *option* of implementing something better,
> or at least go in the direction of making that possible.

OK. Could you see what you think of the attached patches? 0001 does
some refactoring of toast_fetch_datum() and toast_fetch_datum_slice()
to make them look more like each other and clean up a bunch of stuff
that I thought was annoying, and 0002 then pulls out the common logic
into a heap-specific function. If you like this direction, we could
then push the heap-specific function below tableam, but I haven't done
that yet.

> It seems perfectly possible to have a helper function implementing the
> current logic that you just can call with the fixed chunk size as an
> additional parameter. Which'd basically mean there's no meaningful
> difference in complexity compared to providing the chunk size as an
> external AM property. In one case you have a callback that just calls
> a helper function with one parameter, in the other you fill in a
> member of the struct.

I haven't tried to do this yet. I think that to make it work, the
helper function would have to operate in terms of slots instead of
using fastgetattr() as this logic does now. I don't know whether that
would be faster (because the current code might have a little less in
terms of indirect function calls) or slower (because the current code
makes two calls to fastgetattr and if we used slots here we could just
deform once). I suspect it might be a small enough difference not to
worry too much about it either way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
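For readers unfamiliar with the two access styles being weighed above, a compressed illustration (a sketch, not code from the patches; attribute 2 of a TOAST table is chunk_seq and attribute 3 is chunk_data):

    /* current, heap-only style: decode the heap tuple twice */
    curchunk = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
    chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));

    /* slot-based, AM-agnostic style: deform once, then fetch attributes */
    slot_getsomeattrs(slot, 3);
    curchunk = DatumGetInt32(slot_getattr(slot, 2, &isnull));
    chunk = DatumGetPointer(slot_getattr(slot, 3, &isnull));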
Hi Craig,

Please find my response inline below.

On Sun, Nov 10, 2019 at 2:39 PM Craig Ringer <craig@2ndquadrant.com> wrote:
> On Thu, 7 Nov 2019 at 22:45, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> In fact, I suspect this is PostgreSQL successfully protecting itself
> from an unsafe situation.
>
> Does the host have thin-provisioned storage? lvmthin,
> thin-provisioned SAN, etc?

No, it doesn't. In fact, the machine on which the issue was reproduced
once or twice doesn't have any LVMs. The other machine, on which the
issue never got reproduced, has some LVMs, but they are
thick-provisioned, not thin-provisioned.

> Is the DB on NFS?

No.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On 2019-11-08 17:59, Robert Haas wrote:
> On Wed, Nov 6, 2019 at 12:00 PM Andres Freund <andres@anarazel.de> wrote:
> > I'd like an AM to have the *option* of implementing something better,
> > or at least go in the direction of making that possible.
>
> OK. Could you see what you think of the attached patches? 0001 does
> some refactoring of toast_fetch_datum() and toast_fetch_datum_slice()
> to make them look more like each other and clean up a bunch of stuff
> that I thought was annoying, and 0002 then pulls out the common logic
> into a heap-specific function. If you like this direction, we could
> then push the heap-specific function below tableam, but I haven't done
> that yet.

Compared to the previous patch (v7), where the API just had a "use this
AM for TOAST" field, and the other extreme of pushing TOAST entirely
inside the heap AM, this seems like the worst of both worlds, with the
maximum additional complexity.

I don't think we need to nail down this API for eternity, so I'd be
happy to err on the side of practicality here. However, it seems it's
not quite clear what, for example, the requirements and wishes from
zheap would be. What's the simplest way to move this forward?

The refactorings you proposed seem reasonable on their own, and I have
some additional comments on that if we decide to go forward in this
direction. One thing that's confusing is that the TOAST tables have
fields chunk_id and chunk_seq, but when an error message talks about
"chunk %d" or "chunk number %d", it usually means the "seq" and not
the "id".

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 11, 2019 at 8:51 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Compared to the previous patch (v7), where the API just had a "use
> this AM for TOAST" field, and the other extreme of pushing TOAST
> entirely inside the heap AM, this seems like the worst of both
> worlds, with the maximum additional complexity.

There might be a misunderstanding here. These patches would still have
a "use this AM for TOAST" callback, just as the previous set did, but
I didn't include that here, because this is talking about a different
part of the problem. The purpose of that callback is to determine
which AM will be used to create the toast table. The purpose of these
patches is to be able to detoast a value given nothing but the TOAST
pointer extracted from the heap tuple, while removing the present
assumption that the TOAST table is a heap table.

(The current coding is actually seriously inconsistent, because right
now, the code that creates TOAST tables always uses the same AM as the
main heap; but the detoasting code only works with heap tables, which
means that no non-heap AM can use the TOAST system at all. If
necessary, we could commit the patch to allow the TOAST table AM to be
changed first, and then handle allowing the detoasting logic to cope
with a non-heap AM as a separate matter.)

> I don't think we need to nail down this API for eternity, so I'd be
> happy to err on the side of practicality here. However, it seems it's
> not quite clear what, for example, the requirements and wishes from
> zheap would be. What's the simplest way to move this forward?

The only thing zheap needs - in the current design, anyway - is the
ability to change the chunk size. However, I think that's mostly
because we haven't spent a lot of time thinking about how to do TOAST
better than the heap does TOAST today. I think it is desirable to
allow for more options than that. That's why I like this approach more
than the previous one. The previous approach allowed the chunk size to
be variable, but permitted no other AM-specific variation; this one
allows the AM to detoast in any way that it likes. The downside of
that is that if you really do only want to vary the chunk size, you'll
have to repeat somewhat more code. That's sad, but we're never likely
to have enough AMs for that to be a really serious problem, and if we
do, the AM-specific callbacks for those AMs that just want a different
chunk size could call a common helper function.

> The refactorings you proposed seem reasonable on their own, and I
> have some additional comments on that if we decide to go forward in
> this direction. One thing that's confusing is that the TOAST tables
> have fields chunk_id and chunk_seq, but when an error message talks
> about "chunk %d" or "chunk number %d", it usually means the "seq" and
> not the "id".

Well, we've got errors like this right now:

    unexpected chunk number %d (expected %d) for toast value %u in %s

So at least in this case, and I think in many cases, we're referring
to the chunk_id as "toast value %u" and the chunk_seq as "chunk number
%d". I think that's pretty good terminology. It's unfortunate that the
TOAST table columns are called chunk_id and chunk_seq rather than,
say, value_id and chunk_number, and I guess we could possibly change
that without breaking too many things, but I'm not sure that changing
the error messages would help anybody. We could try to rephrase the
error message to mention the two values in the opposite order, which
to me would be more clear, but I'm not exactly sure how to do that
without writing rather awkward English.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
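For reference, the table shape this terminology discussion is about (the OID in the TOAST table's name varies per owning table; 16429 here is just an example):

    \d pg_toast.pg_toast_16429
    TOAST table "pg_toast.pg_toast_16429"
       Column   |  Type
    ------------+---------
     chunk_id   | oid
     chunk_seq  | integer
     chunk_data | bytea

All chunks of one toasted value share a chunk_id ("toast value %u" in the message above); chunk_seq orders the chunks within that value, and the table has a unique index on (chunk_id, chunk_seq).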
On 2019-11-08 17:59, Robert Haas wrote:
> OK. Could you see what you think of the attached patches? 0001 does
> some refactoring of toast_fetch_datum() and toast_fetch_datum_slice()
> to make them look more like each other and clean up a bunch of stuff
> that I thought was annoying, and 0002 then pulls out the common logic
> into a heap-specific function. If you like this direction, we could
> then push the heap-specific function below tableam, but I haven't done
> that yet.

Partial review: The 0001 patch seems very sensible. Some minor
comments on that:

Perhaps rename the residx variable (in both functions). You have
gotten rid of all the res* variables except that one. That name as it
is right now isn't very helpful at all.

You have collapsed the error messages for "chunk %d of %d" and "final
chunk %d" and replaced it with just "chunk %d". I think it might be
better to keep the "chunk %d of %d" wording, for more context, or was
there a reason why you wanted to remove the total count from the
message?

I believe this assertion

    + Assert(endchunk <= totalchunks);

should be < (strictly less). In the commit message you state that this
assertion replaces a run-time check, but I couldn't quite make out
which one you are referring to because all the existing run-time
checks are kept, with slightly refactored conditions.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 21, 2019 at 5:37 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Partial review: The 0001 patch seems very sensible. Some minor
> comments on that:

Thanks for the review. Updated patches attached. This version is more
complete than the last set of patches I posted. It looks like this:

0001 - Lets a table AM that needs a toast table choose the AM that
will be used to implement the toast table.
0002 - Refactoring and code cleanup for TOAST code.
0003 - Move heap-specific portion of logic refactored by previous
patch to a separate function.
0004 - Lets a table AM arrange to call a different function when
detoasting, instead of the one created by 0003.

> Perhaps rename the residx variable (in both functions). You have
> gotten rid of all the res* variables except that one. That name as it
> is right now isn't very helpful at all.

OK, I renamed residx to curchunk and nextidx to expectedchunk.

> You have collapsed the error messages for "chunk %d of %d" and "final
> chunk %d" and replaced it with just "chunk %d". I think it might be
> better to keep the "chunk %d of %d" wording, for more context, or was
> there a reason why you wanted to remove the total count from the
> message?

No, not really. Adjusted.

> I believe this assertion
>
>     + Assert(endchunk <= totalchunks);
>
> should be < (strictly less).

I think you're right. Fixed.

> In the commit message you state that this assertion replaces a
> run-time check, but I couldn't quite make out which one you are
> referring to because all the existing run-time checks are kept, with
> slightly refactored conditions.

Pre-patch, there is this condition:

    - if ((residx != nextidx) || (residx > endchunk) || (residx < startchunk)

This checks that the chunk number found is equal to the one we expect,
and also that it is neither greater than the last one we were
expecting nor less than the first one we were expecting. The bound
checks are redundant if you suppose that we never compute nextidx so
that it is outside the bounds of startchunk..endchunk. Since nextidx
(or expectedchunk, as these patches now call it) is initialized to
startchunk and then only incremented, it seems impossible for the
lower-bound condition to ever fail, so it is no longer tested there.
The upper-bound condition is also no longer tested there; instead, we
do this:

    + Assert(endchunk < totalchunks);

That's what the commit message is on about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
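Putting that explanation together, the refactored loop is roughly shaped like this (a schematic, not the exact patch code):

    /* endchunk was computed from the value's size, so it can be
     * checked once, up front, instead of on every iteration */
    Assert(endchunk < totalchunks);

    expectedchunk = startchunk;
    /* ... for each chunk tuple fetched ... */
    if (curchunk != expectedchunk)
        elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
             curchunk, expectedchunk,
             toast_pointer.va_valueid,
             RelationGetRelationName(toastrel));
    /* ... copy the chunk data ... */
    expectedchunk++;

Since expectedchunk stays within startchunk..endchunk by construction, the old lower/upper bound tests on the chunk number found are redundant and drop out.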
On Fri, Nov 22, 2019 at 10:41 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Thanks for the review. Updated patches attached. This version is more
> complete than the last set of patches I posted. It looks like this:
>
> 0001 - Lets a table AM that needs a toast table choose the AM that
> will be used to implement the toast table.
> 0002 - Refactoring and code cleanup for TOAST code.
> 0003 - Move heap-specific portion of logic refactored by previous
> patch to a separate function.
> 0004 - Lets a table AM arrange to call a different function when
> detoasting, instead of the one created by 0003.

Hearing no further comments, I went ahead and pushed 0002 today. That
turned out to have a bug, so I pushed a fix for that. Hopefully the
buildfarm will agree that it's fixed.

Meanwhile, here are the remaining patches again, rebased over the bug
fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Tue, Dec 17, 2019 at 4:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Hearing no further comments, I went ahead and pushed 0002 today. That
> turned out to have a bug, so I pushed a fix for that. Hopefully the
> buildfarm will agree that it's fixed.
>
> Meanwhile, here are the remaining patches again, rebased over the bug
> fix.

OK, I've now pushed the last of the refactoring patches. Here are the
two main patches back, which are actually quite small, though the
second one looks bigger than it is because it moves a function from
detoast.c into heaptoast.c. This is slightly rebased again because the
other refactoring patch I just pushed had a couple of typos which I
fixed.

If nobody has further comments or objections, I plan to commit these
in early January.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Wed, Dec 18, 2019 at 11:37 AM Robert Haas <robertmhaas@gmail.com> wrote:
> If nobody has further comments or objections, I plan to commit these
> in early January.

Done. Which, I think, wraps up the work I felt needed to be done here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company