Thread: tableam vs. TOAST
In a nearby thread[1], Ashwin Agrawal complained that there is no way for a table AM to get rid of the TOAST table that the core system thinks should be created. To that I added a litany of complaints of my own, including...

- the core system decides whether or not a TOAST table is needed based on criteria that are very much heap-specific,
- the code for reading and writing values stored in a TOAST table is heap-specific, and
- the core system assumes that you want to use the same table AM for the main table and the toast table, but you might not (e.g. you might want to use the regular old heap for the latter).

Attached is a series of patches which try to improve things in this area. Except possibly for 0001, this is v13 material; see discussion on the other thread. These likely need some additional work, but I've done enough with them that I thought it would be worth publishing them at this stage, because it seems that I'm not the only one thinking about the problems that exist in this general area. Here is an overview:

0001 moves the needs_toast_table() calculation below the table AM layer. That allows a table AM to decide for itself whether it wants a TOAST table. The most obvious way in which a table AM might want to be different from what core expects is to decide that the answer is always "no," which it can do if it has some other method of storing large values or doesn't wish to support them. Another possibility is that it wants logic that is basically similar to the heap, but with a different size threshold because its tuple format is different. There are probably other possibilities. (A minimal sketch of such a callback appears below.)

0002 breaks tuptoaster.c into three separate files. It just does code movement; no functional changes. The three pieces are detoast.c, which handles detoasting of toast values and inspection of the sizes of toasted datums; heaptoast.c, which keeps all the functions that are intrinsically heap-specific; and toast_internals.c, which is intended to have a very limited audience. A nice fringe benefit of this stuff is that a lot of other files that currently have to include tuptoaster.h and thus htup_details.h no longer do.

0003 creates a new file toast_helper.c which is intended to help table AMs implement insertion and deletion of toast table rows. Most of the AM-independent logic from the functions remaining in heaptoast.c is moved to this file. This leaves ~600 of the original ~2400 lines from tuptoaster.c as heap-specific logic, but a new heap AM actually wouldn't need all of that stuff, because some of the logic here is in support of stuff like record types, which use HeapTuple internally and will continue to do so even if those record types are stored in some other kind of table.

0004 allows TOAST tables to be implemented using a table AM other than heap. In a certain sense this is the opposite of 0003. 0003 is intended to help people who are implementing a new kind of main table, whereas 0004 is intended to help people implementing a new kind of TOAST table. It teaches the code that inserts, deletes, and retrieves TOAST rows to use slots, and it makes some efficiency improvements in the hopes of offsetting any performance loss from so doing. See commit message and/or patch for full details.

I believe that with all of these changes it should be pretty straightforward for a table AM that wants to use itself to store TOAST data to do so, or to delegate that task back to, say, the regular heap. I haven't really validated that yet, but plan to do so.
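Here is the minimal sketch of the callback mentioned under 0001, for a hypothetical AM ("myam") that opts out of TOAST entirely. Only the callback shape comes from the patch; the myam_ names and the routine initializer are illustrative:

#include "postgres.h"
#include "access/tableam.h"
#include "utils/rel.h"

/*
 * A table AM that never wants a TOAST table, e.g. because it has its
 * own mechanism for storing large values out of line.
 */
static bool
myam_relation_needs_toast_table(Relation rel)
{
    return false;
}

static const TableAmRoutine myam_methods = {
    .type = T_TableAmRoutine,
    /* ... other callbacks elided ... */
    .relation_needs_toast_table = myam_relation_needs_toast_table,
};

An AM that wanted heap-like behavior with a different threshold would instead estimate the tuple's on-disk size from its own format and compare that against an AM-specific constant.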
In addition to what's in this patch set, I believe that we should probably rename some of these functions and macros, so that the heap-specific ones have heap-specific names and the generic ones don't, but I haven't gone through all of that yet. The existing patches try to choose good names for the new things they add, but they don't rename any of the existing stuff.

I also think we should consider removing TOAST_MAX_CHUNK_SIZE from the control file, both because I'm not sure anybody's really using the ability to vary that for anything and because that solution doesn't seem entirely sensible in a world of multiple AMs. However, that is a debatable change, so maybe others will disagree.

[1] http://postgr.es/m/CALfoeitE+P8UGii8=BsGQLpHch2EZWJhq4M+D-jfaj8YCa_FSw@mail.gmail.com

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Updated and rebased patches attached.

On Fri, May 17, 2019 at 5:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> 0001 moves the needs_toast_table() calculation below the table AM
> layer. That allows a table AM to decide for itself whether it wants a
> TOAST table. The most obvious way in which a table AM might want to
> be different from what core expects is to decide that the answer is
> always "no," which it can do if it has some other method of storing
> large values or doesn't wish to support them. Another possibility is
> that it wants logic that is basically similar to the heap, but with a
> different size threshold because its tuple format is different. There
> are probably other possibilities.

This was committed as 1171d7d58545f26a402f76a05936d572bf29d53b per
discussion on another thread.

> 0002 breaks tuptoaster.c into three separate files. It just does code
> movement; no functional changes. The three pieces are detoast.c,
> which handles detoasting of toast values and inspection of the sizes
> of toasted datums; heaptoast.c, which keeps all the functions that are
> intrinsically heap-specific; and toast_internals.c, which is intended
> to have a very limited audience. A nice fringe benefit of this stuff
> is that a lot of other files that currently have to include
> tuptoaster.h and thus htup_details.h no longer do.

Now 0001. No changes.

> 0003 creates a new file toast_helper.c which is intended to help table
> AMs implement insertion and deletion of toast table rows. Most of the
> AM-independent logic from the functions remaining in heaptoast.c is
> moved to this file. This leaves ~600 of the original ~2400 lines from
> tuptoaster.c as heap-specific logic, but a new heap AM actually
> wouldn't need all of that stuff, because some of the logic here is in
> support of stuff like record types, which use HeapTuple internally and
> will continue to do so even if those record types are stored in some
> other kind of table.

Now 0002. No changes.

> 0004 allows TOAST tables to be implemented using a table AM other than
> heap. In a certain sense this is the opposite of 0003. 0003 is
> intended to help people who are implementing a new kind of main table,
> whereas 0004 is intended to help people implementing a new kind of
> TOAST table. It teaches the code that inserts, deletes, and retrieves
> TOAST rows to use slots, and it makes some efficiency improvements in
> the hopes of offsetting any performance loss from so doing. See
> commit message and/or patch for full details.

Now 0003. Some brain fade repaired.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Tue, May 21, 2019 at 2:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Updated and rebased patches attached.

And again.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
I have tested the TOAST patches (v3) with different storage options (MAIN, EXTERNAL, EXTENDED, etc.) and
combinations of compression and out-of-line storage options.
I have used a few dummy tables with various tuple counts, say 10k, 20k, 40k, etc., with different column lengths.
I used a manual CHECKPOINT (with checkpoint_timeout = 1d, max_wal_size = 10GB) before each test to avoid performance fluctuations,
and calculated the results as the median value of a few consecutive test executions.
Please find the SQL script attached herewith, which I have used to perform the observations.
Below are the test scenarios I used to check the behavior and performance of the TOAST patches against PG master.
1. where a single column is compressed (SCC)
2. where multiple columns are compressed (MCC)
-- ALTER the table column(s) to storage "MAIN" to make sure that the column values are COMPRESSED.
3. where a single column is pushed to the TOAST table but not compressed (SCTNC)
4. where multiple columns are pushed to the TOAST table but not compressed (MCTNC)
-- ALTER the table column(s) to storage "EXTERNAL" to make sure that the column values are pushed to the TOAST table but not COMPRESSED.
5. where a single column is pushed to the TOAST table and also compressed (SCTC)
6. where multiple columns are pushed to the TOAST table and also compressed (MCTC)
-- ALTER the table column(s) to storage "EXTENDED" to make sure that the column values are pushed to the TOAST table and also COMPRESSED.
7. updating the tuples with similar data shouldn't affect the behavior of the storage options.
System Used: (VCPUs: 8, RAM: 16GB, Size: 640GB)
                |          10000 Tuples           |          20000 Tuples           |          40000 Tuples           |          80000 Tuples
Test            | Without Patch | With Patch      | Without Patch | With Patch      | Without Patch | With Patch      | Without Patch | With Patch
1. SCC INSERT   | 125921.737 ms (02:05.922) | 125992.563 ms (02:05.993) | 234263.295 ms (03:54.263) | 235952.336 ms (03:55.952) | 497290.442 ms (08:17.290) | 502820.139 ms (08:22.820) | 948470.603 ms (15:48.471) | 941778.952 ms (15:41.779)
1. SCC UPDATE   | 263017.814 ms (04:23.018) | 270893.910 ms (04:30.894) | 488393.748 ms (08:08.394) | 507937.377 ms (08:27.937) | 1078862.613 ms (17:58.863) | 1053029.428 ms (17:33.029) | 2037119.576 ms (33:57.120) | 2023633.862 ms (33:43.634)
2. MCC INSERT   | 35415.089 ms (00:35.415) | 35910.552 ms (00:35.911) | 70899.737 ms (01:10.900) | 70800.964 ms (01:10.801) | 142185.996 ms (02:22.186) | 142241.913 ms (02:22.242) | - | -
2. MCC UPDATE   | 72043.757 ms (01:12.044) | 73848.732 ms (01:13.849) | 137717.696 ms (02:17.718) | 137577.606 ms (02:17.578) | 276358.752 ms (04:36.359) | 276520.727 ms (04:36.521) | - | -
3. SCTNC INSERT | 26377.274 ms (00:26.377) | 25600.189 ms (00:25.600) | 45702.630 ms (00:45.703) | 45163.510 ms (00:45.164) | 99903.299 ms (01:39.903) | 100013.004 ms (01:40.013) | - | -
3. SCTNC UPDATE | 78385.225 ms (01:18.385) | 76680.325 ms (01:16.680) | 151823.250 ms (02:31.823) | 153503.971 ms (02:33.504) | 308197.734 ms (05:08.198) | 308474.937 ms (05:08.475) | - | -
4. MCTNC INSERT | 26214.069 ms (00:26.214) | 25383.522 ms (00:25.384) | 50826.522 ms (00:50.827) | 50221.669 ms (00:50.222) | 106034.338 ms (01:46.034) | 106122.827 ms (01:46.123) | - | -
4. MCTNC UPDATE | 78423.817 ms (01:18.424) | 75154.593 ms (01:15.155) | 158885.787 ms (02:38.886) | 156530.964 ms (02:36.531) | 319721.266 ms (05:19.721) | 322385.709 ms (05:22.386) | - | -
5. SCTC INSERT  | 38451.022 ms (00:38.451) | 38652.520 ms (00:38.653) | 71590.748 ms (01:11.591) | 71048.975 ms (01:11.049) | 143327.913 ms (02:23.328) | 142593.207 ms (02:22.593) | - | -
5. SCTC UPDATE  | 82069.311 ms (01:22.069) | 81678.131 ms (01:21.678) | 138763.508 ms (02:18.764) | 138625.473 ms (02:18.625) | 277534.080 ms (04:37.534) | 277091.611 ms (04:37.092) | - | -
6. MCTC INSERT  | 36325.730 ms (00:36.326) | 35803.368 ms (00:35.803) | 73285.204 ms (01:13.285) | 72728.371 ms (01:12.728) | 142324.859 ms (02:22.325) | 144368.335 ms (02:24.368) | - | -
6. MCTC UPDATE  | 73740.729 ms (01:13.741) | 73002.511 ms (01:13.003) | 141309.859 ms (02:21.310) | 139676.173 ms (02:19.676) | 278906.647 ms (04:38.907) | 279522.408 ms (04:39.522) | - | -
(The 80000-tuple runs were performed only for the SCC tests.)
I have also performed the below tests with TOAST table objects.
8. pg_dump/restore, pg_upgrade with these
9. Streaming Replication setup
10. Concurrent Transactions
While testing a few concurrent transactions, I have the below query:
-- Concurrent transactions acquire a lock for the TOAST option (ALTER TABLE .. SET STORAGE .. MAIN/EXTERNAL/EXTENDED, etc.)
-- Session 1:
CREATE TABLE a (a_id text PRIMARY KEY);
CREATE TABLE b (b_id text);
INSERT INTO a VALUES ('a'), ('b');
INSERT INTO b VALUES ('a'), ('b'), ('b');
BEGIN;
ALTER TABLE b ADD CONSTRAINT bfk FOREIGN KEY (b_id) REFERENCES a (a_id); -- Not Acquiring any lock
-- Session 2:
SELECT * FROM b WHERE b_id = 'a'; -- Shows result
-- Session 1:
ALTER TABLE b ALTER COLUMN b_id SET STORAGE EXTERNAL; -- Acquire a lock
-- Session 2:
SELECT * FROM b WHERE b_id = 'a'; -- Hangs, waiting for the lock taken in session 1
Is this expected behavior?
With Regards,
Attachment
On Wed, Jun 12, 2019 at 4:17 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, May 21, 2019 at 2:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
> > Updated and rebased patches attached.
>
> And again.

Hi Robert,

Thus spake GCC:

detoast.c: In function ‘toast_fetch_datum’:
detoast.c:308:12: error: variable ‘toasttupDesc’ set but not used
[-Werror=unused-but-set-variable]
  TupleDesc toasttupDesc;
            ^

--
Thomas Munro
https://enterprisedb.com
On Tue, Jun 25, 2019 at 2:19 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
> I have tested the TOAST patches (v3) with different storage options (MAIN, EXTERNAL, EXTENDED, etc.) and
> combinations of compression and out-of-line storage options.
> I have used a few dummy tables with various tuple counts, say 10k, 20k, 40k, etc., with different column lengths.
> I used a manual CHECKPOINT (with checkpoint_timeout = 1d, max_wal_size = 10GB) before each test to avoid performance fluctuations,
> and calculated the results as the median value of a few consecutive test executions.

Thanks for testing.

> All the observations look good to me,
> except for "Test 1": for SCC UPDATE with tuple counts 10K/20K, and for SCC INSERT with tuple count 40K, there was a slight increase in time taken
> with the patch. For a better observation, I also ran the same "Test 1" for a higher tuple count (i.e. 80K), and it also looks fine.

Did you run each test just once? How stable are the results?

> While testing a few concurrent transactions, I have the below query:
> -- Concurrent transactions acquire a lock for the TOAST option (ALTER TABLE .. SET STORAGE .. MAIN/EXTERNAL/EXTENDED, etc.)
>
> -- Session 1:
> CREATE TABLE a (a_id text PRIMARY KEY);
> CREATE TABLE b (b_id text);
> INSERT INTO a VALUES ('a'), ('b');
> INSERT INTO b VALUES ('a'), ('b'), ('b');
>
> BEGIN;
> ALTER TABLE b ADD CONSTRAINT bfk FOREIGN KEY (b_id) REFERENCES a (a_id); -- Not Acquiring any lock

For me, this acquires AccessShareLock and ShareRowExclusiveLock on the
target table.

rhaas=# select locktype, database, relation, pid, mode, granted
from pg_locks where relation = 'b'::regclass;
 locktype | database | relation |  pid  |         mode          | granted
----------+----------+----------+-------+-----------------------+---------
 relation |    16384 |    16872 | 93197 | AccessShareLock       | t
 relation |    16384 |    16872 | 93197 | ShareRowExclusiveLock | t
(2 rows)

I don't see what that has to do with the topic at hand, though.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Sun, Jul 7, 2019 at 11:08 PM Thomas Munro <thomas.munro@gmail.com> wrote:
> Thus spake GCC:
>
> detoast.c: In function ‘toast_fetch_datum’:
> detoast.c:308:12: error: variable ‘toasttupDesc’ set but not used
> [-Werror=unused-but-set-variable]
>   TupleDesc toasttupDesc;
>             ^

Hmm, fixed, I hope. Here's an updated patch set. In addition to the
above fix, I fixed things up for the new pgindent rules and added a
fourth patch that renames the detoasting functions to something that
doesn't include 'heap.'

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
> On Tue, Jun 25, 2019 at 2:19 AM Prabhat Sahu
> <prabhat.sahu@enterprisedb.com> wrote:
> > I have tested the TOAST patches (v3) with different storage options (MAIN, EXTERNAL, EXTENDED, etc.) and
> > combinations of compression and out-of-line storage options.
> > I have used a few dummy tables with various tuple counts, say 10k, 20k, 40k, etc., with different column lengths.
> > I used a manual CHECKPOINT (with checkpoint_timeout = 1d, max_wal_size = 10GB) before each test to avoid performance fluctuations,
> > and calculated the results as the median value of a few consecutive test executions.
>
> Thanks for testing.
>
> > All the observations look good to me,
> > except for "Test 1": for SCC UPDATE with tuple counts 10K/20K, and for SCC INSERT with tuple count 40K, there was a slight increase in time taken
> > with the patch. For a better observation, I also ran the same "Test 1" for a higher tuple count (i.e. 80K), and it also looks fine.
>
> Did you run each test just once? How stable are the results?

No, I have executed the test multiple times (7 times each) and
calculated the result as the median among those, and the result looks
stable (with v3 patches).

With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.
The Postgres Database Company
On Tue, Jul 9, 2019 at 12:40 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
>> Did you run each test just once? How stable are the results?
>
> No, I have executed the test multiple times (7 times each) and calculated the result as the median among those,
> and the result looks stable (with v3 patches).

I spent some time looking at your SCC test today. I think this isn't
really testing the code that actually got changed in the patch: a quick
CPU profile shows that your SCC test is bottlenecked on pg_lzcompress,
which spends a huge amount of time compressing the gigantic string of
'a's you've constructed, and that code is exactly the same with the
patch as it is in master. So, I think that any fluctuations between the
patched and unpatched results are just random variation. There's no
reason the patch should be slower with one row count and faster with a
different row count, anyway.

I tried to come up with a better test case that uses a more modest
amount of data, and ended up with this:

-- Setup.
CREATE OR REPLACE FUNCTION randomish_string(integer) RETURNS text AS $$
SELECT string_agg(random()::text, '') FROM generate_series(1, $1);
$$ LANGUAGE sql;

CREATE TABLE source_compressed (a int, b text);
INSERT INTO source_compressed
SELECT g, repeat('a', 2000) FROM generate_series(1, 10000) g;
CREATE TABLE sink_compressed (LIKE source_compressed);

CREATE TABLE source_external (a int, b text);
INSERT INTO source_external
SELECT g, randomish_string(400) FROM generate_series(1, 10000) g;
CREATE TABLE sink_external (LIKE source_external);

CREATE TABLE source_external_uncompressed (a int, b text);
ALTER TABLE source_external_uncompressed ALTER COLUMN b SET STORAGE EXTERNAL;
INSERT INTO source_external_uncompressed
SELECT g, randomish_string(400) FROM generate_series(1, 10000) g;
CREATE TABLE sink_external_uncompressed (LIKE source_external_uncompressed);
ALTER TABLE sink_external_uncompressed ALTER COLUMN b SET STORAGE EXTERNAL;

-- Test.
\timing
TRUNCATE sink_compressed, sink_external, sink_external_uncompressed;
CHECKPOINT;
INSERT INTO sink_compressed SELECT * FROM source_compressed;
INSERT INTO sink_external SELECT * FROM source_external;
INSERT INTO sink_external_uncompressed SELECT * FROM source_external_uncompressed;

Roughly, on both master and with the patches, the first one takes about
4.2 seconds, the second 7.5, and the third 1.2. The third one is the
fastest because it doesn't do any compression. Since it does less
irrelevant work than the other two cases, it has the best chance of
showing up any performance regression that the patch has caused -- if
any regression existed, I suppose that it would be an increased
per-toast-fetch or per-toast-chunk overhead. However, I can't reproduce
any such regression. My first attempt at testing that case showed the
patch about 1% slower, but that wasn't reliably reproducible when I did
it a bunch more times. So as far as I can figure, this patch does not
regress performance in any easily-measurable way.

Barring objections, I plan to commit the whole series of patches here
(latest rebase attached). They are not perfect and could likely be
improved in various ways, but I think they are an improvement over what
we have now, and it's not like it's set in stone once it's committed.
We can change it more if we come up with a better idea.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

On 2019-08-01 12:23:42 -0400, Robert Haas wrote:
> Barring objections, I plan to commit the whole series of patches here
> (latest rebase attached). They are not perfect and could likely be
> improved in various ways, but I think they are an improvement over
> what we have now, and it's not like it's set in stone once it's
> committed. We can change it more if we come up with a better idea.

Could you wait until I either had a chance to look again, or until, say,
Monday if I don't get around to it? I'll try to get to it earlier than
that.

Greetings,

Andres Freund
On Thu, Aug 1, 2019 at 1:53 PM Andres Freund <andres@anarazel.de> wrote:
> Could you wait until I either had a chance to look again, or until, say,
> Monday if I don't get around to it? I'll try to get to it earlier than
> that.

Sure, no problem.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-08-01 12:23:42 -0400, Robert Haas wrote:
> Roughly, on both master and with the patches, the first one takes
> about 4.2 seconds, the second 7.5, and the third 1.2. [...] So as far
> as I can figure, this patch does not regress performance in any
> easily-measurable way.

Hm, those all include writing, right? And for read-only we don't expect
any additional overhead, correct? The write overhead is probably too
large to show a bit of function call overhead - but if so, it'd
probably be on unlogged tables? And with COPY, because that utilizes
multi_insert, which means more toasting in a shorter amount of time?

.oO(why does everyone attach attachments out of order? Is that a gmail
thing?)

> From a4c858c75793f0f8aff7914c572a6615ea5babf8 Mon Sep 17 00:00:00 2001
> From: Robert Haas <rhaas@postgresql.org>
> Date: Mon, 8 Jul 2019 11:58:05 -0400
> Subject: [PATCH 1/4] Split tuptoaster.c into three separate files.
>
> detoast.c/h contain functions required to detoast a datum, partially
> or completely, plus a few other utility functions for examining the
> size of toasted datums.
>
> toast_internals.c/h contain functions that are used internally to the
> TOAST subsystem but which (mostly) do not need to be accessed from
> outside.
>
> heaptoast.c/h contains code that is intrinsically specific to the
> heap AM, either because it operates on HeapTuples or is based on the
> layout of a heap page.
>
> detoast.c and toast_internals.c are placed in
> src/backend/access/common rather than src/backend/access/heap. At
> present, both files still have dependencies on the heap, but that will
> be improved in a future commit.

I wonder if toasting.c should be moved too?

If anybody doesn't know git's --color-moved, it makes patches like this
a lot easier to review.

> index 00000000000..582af147ea1
> --- /dev/null
> +++ b/src/include/access/detoast.h
> @@ -0,0 +1,92 @@
> +/*-------------------------------------------------------------------------
> + *
> + * detoast.h
> + *    Access to compressed and external varlena values.

Hm. Does it matter that that also includes stuff like expanded datums?

> + * Copyright (c) 2000-2019, PostgreSQL Global Development Group
> + *
> + * src/include/access/detoast.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef DETOAST_H
> +#define DETOAST_H

Trailing whitespace after "#ifndef DETOAST_H ".

> commit 60d51e6510c66f79c51e43fe22730fe050d87854
> Author: Robert Haas <rhaas@postgresql.org>
> Date:   2019-07-08 12:02:16 -0400
>
>     Create an API for inserting and deleting rows in TOAST tables.
>
>     This moves much of the non-heap-specific logic from toast_delete and
>     toast_insert_or_update into helper functions accessible via a new
>     header, toast_helper.h. Using the functions in this module, a table
>     AM can implement creation and deletion of TOAST table rows with
>     much less code duplication than was possible heretofore. Some
>     table AMs won't want to use the TOAST logic at all, but for those
>     that do this will make that easier.
>
>     Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com

Hm. This leaves toast_insert_or_update() as a name. That makes it sound
like it's generic toast code, rather than heap specific?

It's definitely nice how a lot of repetitive code has been
deduplicated. Also makes it easier to see how algorithmically expensive
toast_insert_or_update() is :(.

> 	/*
> 	 * Second we look for attributes of attstorage 'x' or 'e' that are still
> 	 * inline, and make them external.  But skip this if there's no toast
> 	 * table to push them to.
> 	 */
> 	while (heap_compute_data_size(tupleDesc,
> 								  toast_values, toast_isnull) > maxDataLen &&
> 		   rel->rd_rel->reltoastrelid != InvalidOid)

Shouldn't this condition be the other way round?

> 	if (for_compression)
> 		skip_colflags |= TOASTCOL_INCOMPRESSIBLE;
>
> 	for (i = 0; i < numAttrs; i++)
> 	{
> 		Form_pg_attribute att = TupleDescAttr(tupleDesc, i);
>
> 		if ((ttc->ttc_attr[i].tai_colflags & skip_colflags) != 0)
> 			continue;
> 		if (VARATT_IS_EXTERNAL(DatumGetPointer(ttc->ttc_values[i])))
> 			continue;			/* can't happen, toast_action would be 'p' */
> 		if (for_compression &&
> 			VARATT_IS_COMPRESSED(DatumGetPointer(ttc->ttc_values[i])))
> 			continue;
> 		if (check_main && att->attstorage != 'm')
> 			continue;
> 		if (!check_main && att->attstorage != 'x' && att->attstorage != 'e')
> 			continue;
>
> 		if (ttc->ttc_attr[i].tai_size > biggest_size)
> 		{
> 			biggest_attno = i;
> 			biggest_size = ttc->ttc_attr[i].tai_size;
> 		}
> 	}

Couldn't most of these be handled via colflags, instead of having
numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED,
TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the
size check ought to boil down to a single mask test?

> extern void toast_tuple_init(ToastTupleContext *ttc);
> extern int	toast_tuple_find_biggest_attribute(ToastTupleContext *ttc,
> 											   bool for_compression,
> 											   bool check_main);
> extern void toast_tuple_try_compression(ToastTupleContext *ttc, int attribute);
> extern void toast_tuple_externalize(ToastTupleContext *ttc, int attribute,
> 									int options, int max_chunk_size);
> extern void toast_tuple_cleanup(ToastTupleContext *ttc, bool cleanup_toastrel);

I wonder if a better prefix wouldn't be toasting_...

> +/*
> + * Information about one column of a tuple being toasted.
> + *
> + * NOTE: toast_action[i] can have these values:
> + *		' '		default handling
> + *		'p'		already processed --- don't touch it
> + *		'x'		incompressible, but OK to move off
> + *
> + * NOTE: toast_attr[i].tai_size is only made valid for varlena attributes with
> + * toast_action[i] different from 'p'.
> + */
> +typedef struct
> +{
> +	struct varlena *tai_oldexternal;
> +	int32		tai_size;
> +	uint8		tai_colflags;
> +} ToastAttrInfo;

I think that comment is outdated?

> +/*
> + * Flags indicating the overall state of a TOAST operation.
> + *
> + * TOAST_NEEDS_DELETE_OLD indicates that one or more old TOAST datums need
> + * to be deleted.
> + *
> + * TOAST_NEEDS_FREE indicates that one or more TOAST values need to be freed.
> + *
> + * TOAST_HAS_NULLS indicates that nulls were found in the tuple being toasted.
> + *
> + * TOAST_NEEDS_CHANGE indicates that a new tuple needs to built; in other
> + * words, the toaster did something.
> + */
> +#define TOAST_NEEDS_DELETE_OLD	0x0001
> +#define TOAST_NEEDS_FREE		0x0002
> +#define TOAST_HAS_NULLS			0x0004
> +#define TOAST_NEEDS_CHANGE		0x0008

I'd make these enums. They're more often accessible in a debugger...

> commit 9e4bd383a00e6bb96088666e57673b343049345c
> Author: Robert Haas <rhaas@postgresql.org>
> Date:   2019-08-01 10:37:02 -0400
>
>     Allow TOAST tables to be implemented using table AMs other than heap.
>
>     toast_fetch_datum, toast_save_datum, and toast_delete_datum are
>     adjusted to use tableam rather than heap-specific functions. This
>     might have some performance impact, but this patch attempts to
>     mitigate that by restructuring things so that we don't open and close
>     the toast table and indexes multiple times per tuple.
>
>     tableam now exposes an integer value (not a callback) for the
>     maximum TOAST chunk size, and has a new callback allowing table
>     AMs to specify the AM that should be used to implement the TOAST
>     table.  Previously, the toast AM was always the same as the table AM.
>
>     Patch by me, tested by Prabhat Sahu.
>
>     Discussion: http://postgr.es/m/CA+TgmoZv-=2iWM4jcw5ZhJeL18HF96+W1yJeYrnGMYdkFFnEpQ@mail.gmail.com

I'm quite unconvinced that making the chunk size specified by the AM is
a good idea to do in isolation. We have TOAST_MAX_CHUNK_SIZE in
pg_control etc. It seems a bit dangerous to let AMs provide the size,
without being very clear that any change of the value will make data
inaccessible. It'd be different if the max were only used during
toasting.

I think the *size* checks should be weakened so we check:
1) After each chunk, whether the already assembled chunks exceed the
   expected size.
2) After all chunks have been collected, check that the size is exactly
   what we expect.

I don't think that removes a meaningful amount of error checking.
Missing tuples etc get detected by the chunk_ids not being consecutive.
The overall size is still verified.

The obvious problem with this is the slice fetching logic. For slices
with an offset of 0, it's obviously trivial to implement. For the
higher slice logic, I'd assume we'd have to fetch the first slice by
estimating where the start chunk is based on the current suggested
chunk size, and restarting the scan earlier/later if not accurate. In
most cases it'll be accurate, so we'd not lose efficiency.

Greetings,

Andres Freund
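To visualize how the toast_tuple_* helper API quoted above is meant to be driven, here is a rough sketch of an AM-side toasting loop. It is only an illustration: fits_on_page() stands in for whatever size check the AM uses, the ToastTupleContext setup is elided, and the "negative result means nothing found" convention is an assumption, not code from the patch.

#include "postgres.h"
#include "access/toast_helper.h"

static bool fits_on_page(ToastTupleContext *ttc);	/* AM-specific size check */

static void
myam_toast_tuple(ToastTupleContext *ttc, int options, int max_chunk_size)
{
	int			attr;

	toast_tuple_init(ttc);

	/* First try compressing compressible attributes, biggest first. */
	while (!fits_on_page(ttc))
	{
		attr = toast_tuple_find_biggest_attribute(ttc, true, false);
		if (attr < 0)
			break;				/* nothing left worth compressing */
		toast_tuple_try_compression(ttc, attr);
	}

	/* Then push attributes out of line until the tuple fits. */
	while (!fits_on_page(ttc))
	{
		attr = toast_tuple_find_biggest_attribute(ttc, false, false);
		if (attr < 0)
			break;				/* nothing left to externalize */
		toast_tuple_externalize(ttc, attr, options, max_chunk_size);
	}

	toast_tuple_cleanup(ttc, true);
}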
On Fri, Aug 2, 2019 at 6:42 PM Andres Freund <andres@anarazel.de> wrote:
> Hm, those all include writing, right? And for read-only we don't expect
> any additional overhead, correct? The write overhead is probably too
> large to show a bit of function call overhead - but if so, it'd probably be
> on unlogged tables? And with COPY, because that utilizes multi_insert,
> which means more toasting in a shorter amount of time?

Yes and yes. I guess we could test the unlogged case and with COPY, but
in any realistic case you're still looking for a tiny CPU overhead in a
sea of I/O costs. Even if an extremely short COPY on an unlogged table
regresses slightly, we do not normally reject patches that improve code
quality on the grounds that they add function call overhead in a few
places. Code like this is hard to optimize and maintain; as you
remarked yourself, there are multiple opportunities to do this stuff
better that are hard to see in the current structure.

> .oO(why does everyone attach attachments out of order? Is that
> a gmail thing?)

Must be.

> I wonder if toasting.c should be moved too?

I mean, we could, but I don't really see a reason. It'd just be moving
it from one fairly-generic place to another, and I'd rather minimize
churn.

> Trailing whitespace after "#ifndef DETOAST_H ".

Will fix.

> Hm. This leaves toast_insert_or_update() as a name. That makes it sound
> like it's generic toast code, rather than heap specific?

I'll rename it to heap_toast_insert_or_update(). But I think I'll put
that in 0004 with the other renames.

> It's definitely nice how a lot of repetitive code has been deduplicated.
> Also makes it easier to see how algorithmically expensive
> toast_insert_or_update() is :(.

Yep.

> Shouldn't this condition be the other way round?

I had to fight pretty hard to stop myself from tinkering with the
algorithm -- this can clearly be done better, but I wanted to make it
match the existing structure as far as possible. It also only needs to
be tested once, not on every loop iteration, so if we're going to start
changing things, we should go further than just swapping the order of
the tests. For now I prefer to draw a line in the sand and change
nothing.

> Couldn't most of these be handled via colflags, instead of having
> numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED,
> TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the
> size check ought to boil down to a single mask test?

I'm not really seeing how more flags would significantly simplify this
logic, but I might be missing something.

> I wonder if a better prefix wouldn't be toasting_...

I'm open to other votes, but I think toast_tuple_ is more specific than
toasting_ and thus better.

> I think that comment is outdated?

Oops. Will fix.

> I'd make these enums. They're more often accessible in a debugger...

Ugh, I hate that style. Abusing enums to make flag bits makes my skin
crawl. I always wondered what the appeal was -- I guess now I know.
Blech.

> I'm quite unconvinced that making the chunk size specified by the AM is
> a good idea to do in isolation. We have TOAST_MAX_CHUNK_SIZE in
> pg_control etc. It seems a bit dangerous to let AMs provide the size,
> without being very clear that any change of the value will make data
> inaccessible. It'd be different if the max were only used during
> toasting.

I was actually thinking about proposing that we rip
TOAST_MAX_CHUNK_SIZE out of pg_control. No real effort has been made
here to make this something that users could configure, and I don't
know of a good reason to configure it. It also seems pretty out of
place in a world where there are multiple AMs floating around -- why
should heap, and only heap, be checked there? Granted it does have some
pride of place, but still.

> I think the *size* checks should be weakened so we check:
> 1) After each chunk, whether the already assembled chunks exceed the
>    expected size.
> 2) After all chunks have been collected, check that the size is exactly
>    what we expect.
>
> I don't think that removes a meaningful amount of error
> checking. Missing tuples etc get detected by the chunk_ids not being
> consecutive. The overall size is still verified.
>
> The obvious problem with this is the slice fetching logic. For slices
> with an offset of 0, it's obviously trivial to implement. For the
> higher slice logic, I'd assume we'd have to fetch the first slice by
> estimating where the start chunk is based on the current suggested
> chunk size, and restarting the scan earlier/later if not accurate. In
> most cases it'll be accurate, so we'd not lose efficiency.

I don't feel entirely convinced that there's any rush to do all of this
right now, and the more I change the harder it is to make sure that I
haven't broken anything. How strongly do you feel about this stuff?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
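For what it's worth, the enum spelling being debated would look roughly like this -- same bit values, just a named type that a debugger can render symbolically. This is an illustration, not code from the patch:

typedef enum ToastFlag
{
	TOAST_NEEDS_DELETE_OLD = 0x0001,
	TOAST_NEEDS_FREE = 0x0002,
	TOAST_HAS_NULLS = 0x0004,
	TOAST_NEEDS_CHANGE = 0x0008
} ToastFlag;

The bits still OR together into an integer flags word; only the declared type changes.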
Robert Haas <robertmhaas@gmail.com> writes:
> I don't feel entirely convinced that there's any rush to do all of
> this right now, and the more I change the harder it is to make sure
> that I haven't broken anything. How strongly do you feel about this
> stuff?

FWIW, I agree with your comment further up that this patch ought to
just refactor, not change any algorithms. The latter can be done in
separate patches.

			regards, tom lane
Hi,

On 2019-08-05 15:36:59 -0400, Robert Haas wrote:
> On Fri, Aug 2, 2019 at 6:42 PM Andres Freund <andres@anarazel.de> wrote:
> > Hm. This leaves toast_insert_or_update() as a name. That makes it sound
> > like it's generic toast code, rather than heap specific?
>
> I'll rename it to heap_toast_insert_or_update(). But I think I'll put
> that in 0004 with the other renames.

Makes sense.

> > It's definitely nice how a lot of repetitive code has been deduplicated.
> > Also makes it easier to see how algorithmically expensive
> > toast_insert_or_update() is :(.
>
> Yep.
>
> > Shouldn't this condition be the other way round?
>
> I had to fight pretty hard to stop myself from tinkering with the
> algorithm -- this can clearly be done better, but I wanted to make it
> match the existing structure as far as possible. It also only needs to
> be tested once, not on every loop iteration, so if we're going to
> start changing things, we should go further than just swapping the
> order of the tests. For now I prefer to draw a line in the sand and
> change nothing.

Makes sense.

> > Couldn't most of these be handled via colflags, instead of having
> > numerous individual checks? I.e. if you had TOASTCOL_COMPRESSED,
> > TOASTCOL_IGNORED, TOASTCOL_MAIN, TOASTFLAG_EXTERNAL, etc, all but the
> > size check ought to boil down to a single mask test?
>
> I'm not really seeing how more flags would significantly simplify this
> logic, but I might be missing something.

Well, right now you have a number of ifs for each attribute. If you
encoded all the parameters into flags, you could change that to a
single flag test - as far as I can tell, all the parameters could be
represented as that, if you moved MAIN etc into flags. A single if for
flags (and then the size check) is cheaper than several separate
checks.

> > I'm quite unconvinced that making the chunk size specified by the AM is
> > a good idea to do in isolation. [...]
>
> I was actually thinking about proposing that we rip
> TOAST_MAX_CHUNK_SIZE out of pg_control. [...]
>
> I don't feel entirely convinced that there's any rush to do all of
> this right now, and the more I change the harder it is to make sure
> that I haven't broken anything. How strongly do you feel about this
> stuff?

I think we either should leave the hardcoded limit in place, or make it
actually not fixed. Ripping-it-out-but-not-actually just seems like a
trap for the unwary, without much point.

Greetings,

Andres Freund
On 2019-Aug-05, Robert Haas wrote:

> > Shouldn't this condition be the other way round?
>
> I had to fight pretty hard to stop myself from tinkering with the
> algorithm -- this can clearly be done better, but I wanted to make it
> match the existing structure as far as possible. It also only needs to
> be tested once, not on every loop iteration, so if we're going to
> start changing things, we should go further than just swapping the
> order of the tests. For now I prefer to draw a line in the sand and
> change nothing.

I agree, and can we move forward with this 0001? The idea here is to
change no code (as also suggested by Tom elsewhere), and it's the
largest patch in this series by a mile. I checked --color-moved=zebra
and I think the patch looks fine, and also it compiles fine. I ran
src/tools/pginclude/headerscheck on it and found no complaints.

So here's a rebased version, where the DETOAST_H whitespace has been
removed. No other changes from your original. Will you please push
soon?

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Attachment
On Thu, Sep 5, 2019 at 10:52 AM Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
> I agree, and can we move forward with this 0001? The idea here is to
> change no code (as also suggested by Tom elsewhere), and it's the
> largest patch in this series by a mile. I checked --color-moved=zebra
> and I think the patch looks fine, and also it compiles fine. I ran
> src/tools/pginclude/headerscheck on it and found no complaints.
>
> So here's a rebased version, where the DETOAST_H whitespace has been
> removed. No other changes from your original. Will you please push
> soon?

Done, thanks. Here's the rest again with the additional rename added to
0003 (formerly 0004). I think it's probably OK to go ahead with that
stuff, too, but I'll wait a bit to see if anyone wants to raise more
objections.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
Hi,

On 2019-09-05 13:42:40 -0400, Robert Haas wrote:
> Done, thanks. Here's the rest again with the additional rename added
> to 0003 (formerly 0004). I think it's probably OK to go ahead with
> that stuff, too, but I'll wait a bit to see if anyone wants to raise
> more objections.

Well, I still dislike making the toast chunk size configurable in a
halfhearted manner.

Greetings,

Andres Freund
On Thu, Sep 5, 2019 at 3:10 PM Andres Freund <andres@anarazel.de> wrote:
> On 2019-09-05 13:42:40 -0400, Robert Haas wrote:
> > Done, thanks. Here's the rest again with the additional rename added
> > to 0003 (formerly 0004). I think it's probably OK to go ahead with
> > that stuff, too, but I'll wait a bit to see if anyone wants to raise
> > more objections.
>
> Well, I still dislike making the toast chunk size configurable in a
> halfhearted manner.

So, I'd be willing to look into that some more. But how about if I
commit the next patch in the series first? I think this comment is
really about the second patch in the series, "Allow TOAST tables to be
implemented using table AMs other than heap," and it's fair to point
out that, since that patch extends table AM, we're somewhat committed
to it once we put it in. But "Create an API for inserting and deleting
rows in TOAST tables." is just refactoring, and I don't see what we
gain from waiting to commit that part.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Andres Freund <andres@anarazel.de> writes:
> Well, I still dislike making the toast chunk size configurable in a
> halfhearted manner.

It's hard to make it fully configurable without breaking our on-disk
storage, because of the lack of any explicit representation of the
chunk size in TOAST data. You have to "just know" how big the chunks
are supposed to be.

However, it's reasonable to ask why we should treat it as an AM
property, especially a fixed AM property as this has it. If somebody
does reimplement toast logic in some other AM, they might well decide
it's worth the storage cost to be more flexible about the chunk size
... but too bad, this design won't let them do it.

I don't entirely understand why relation_toast_am is a callback while
toast_max_chunk_size isn't, either. Why would they not both be
callbacks? That would at least let an AM set a per-relation max chunk
size, if it wanted.

It seems like this design throws away most of the benefit of a fixed
chunk size (mostly, being able to do relevant modulo arithmetic with
shifts and masks rather than full-fledged integer division) without
getting much of anything in return.

			regards, tom lane
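To spell out the shifts-and-masks point: if the chunk size were a fixed power of two, locating a chunk would need no division at all. A sketch with a hypothetical 2048-byte chunk size (real TOAST chunk sizes are not powers of two, as the reply below notes):

#include <stdint.h>

/* Hypothetical 2048-byte chunks: division and modulo reduce to bit ops. */
#define CHUNK_SHIFT 11
#define CHUNK_MASK  ((1u << CHUNK_SHIFT) - 1)

static inline void
locate_chunk(uint32_t sliceoffset, uint32_t *chunkno, uint32_t *chunkoff)
{
	*chunkno = sliceoffset >> CHUNK_SHIFT;	/* sliceoffset / 2048 */
	*chunkoff = sliceoffset & CHUNK_MASK;	/* sliceoffset % 2048 */
}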
On Thu, Sep 5, 2019 at 3:36 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > Well, I still dislike making the toast chunk size configurable in a
> > halfhearted manner.
>
> It's hard to make it fully configurable without breaking our on-disk
> storage, because of the lack of any explicit representation of the chunk
> size in TOAST data. You have to "just know" how big the chunks are
> supposed to be.

There was a concrete proposal about this from Andres here, down at the
bottom of the email:

http://postgr.es/m/20190802224251.lsxw4o5ebn2ng5ey@alap3.anarazel.de

Basically, detoasting would tolerate whatever chunk size it finds, and
the slice-fetching logic would get complicated.

> However, it's reasonable to ask why we should treat it as an AM property,
> especially a fixed AM property as this has it. If somebody does
> reimplement toast logic in some other AM, they might well decide it's
> worth the storage cost to be more flexible about the chunk size ... but
> too bad, this design won't let them do it.

Fair complaint. The main reason I want to treat it as an AM property is
that TOAST_TUPLE_THRESHOLD is defined in terms of heap-specific
constants, and having other AMs include heap-specific header files
seems like a thing we should try hard to avoid. Once you're indirectly
including htup_details.h in every AM in existence, it's going to be
hard to be sure that you've got no other dependencies on the current
heap AM.

But I agree that making it not a fixed value could be useful. One
benefit of it would be that you could just change the value, even for
the current heap, without breaking access to already-toasted data.

> It seems like this design throws away most of the benefit of a fixed
> chunk size (mostly, being able to do relevant modulo arithmetic with
> shifts and masks rather than full-fledged integer division) without
> getting much of anything in return.

I don't think you're really getting that particular benefit, because
TOAST_TUPLE_THRESHOLD and TOAST_TUPLE_TARGET are not going to end up as
powers of two. But you do get the benefit of working with constants
instead of a value determined at runtime.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2019-09-05 15:27:28 -0400, Robert Haas wrote:
> On Thu, Sep 5, 2019 at 3:10 PM Andres Freund <andres@anarazel.de> wrote:
> > On 2019-09-05 13:42:40 -0400, Robert Haas wrote:
> > > Done, thanks. Here's the rest again with the additional rename added
> > > to 0003 (formerly 0004). I think it's probably OK to go ahead with
> > > that stuff, too, but I'll wait a bit to see if anyone wants to raise
> > > more objections.
> >
> > Well, I still dislike making the toast chunk size configurable in a
> > halfhearted manner.
>
> So, I'd be willing to look into that some more. But how about if I
> commit the next patch in the series first? I think this comment is
> really about the second patch in the series, "Allow TOAST tables to be
> implemented using table AMs other than heap," and it's fair to point
> out that, since that patch extends table AM, we're somewhat committed
> to it once we put it in. But "Create an API for inserting and
> deleting rows in TOAST tables." is just refactoring, and I don't see
> what we gain from waiting to commit that part.

Yea, makes sense to me.
On Thu, Sep 5, 2019 at 4:07 PM Andres Freund <andres@anarazel.de> wrote:
> Yea, makes sense to me.

OK, done. Here's the remaining patches again, with a slight update to
the renaming patch (now 0002). In the last version, I renamed
toast_insert_or_update to heap_toast_insert_or_update but did not
rename toast_delete to heap_toast_delete. Actually, I'm not seeing any
particular reason not to go ahead and push the renaming patch at this
point also.

I guess there's a question as to whether I should more aggressively add
"heap" to the names of the other functions in heaptoast.h, but I'm
kinda "meh" about that. It seems likely that other AMs will need their
own versions of toast_insert_or_update() and toast_delete(), but they
shouldn't really need their own version of, say,
toast_flatten_tuple_to_datum(), and the point there is that we're
building a DatumTuple, so calling it
heap_toast_flatten_tuple_to_datum() seems almost misleading. I'm
inclined to leave all that stuff alone for now.

0001 needs more thought, as discussed.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Fri, Aug 2, 2019 at 6:42 PM Andres Freund <andres@anarazel.de> wrote:
> The obvious problem with this is the slice fetching logic. For slices
> with an offset of 0, it's obviously trivial to implement. For the
> higher slice logic, I'd assume we'd have to fetch the first slice by
> estimating where the start chunk is based on the current suggested
> chunk size, and restarting the scan earlier/later if not accurate. In
> most cases it'll be accurate, so we'd not lose efficiency.

Dilip and I were discussing this proposal this morning, and after
talking to him, I don't quite see how to make this work without
significant loss of efficiency. Suppose that, based on the estimated
chunk size, you decide that there are probably 10 chunks and that the
value that you need is probably located in chunk 6. So you fetch chunk
6. Happily, chunk 6 is the size that you expect, so you extract the
bytes that you need and go on your way. But ... what if there are
actually 6 chunks, not 10, and the first five are bigger than you
expected, and the reason why the size of chunk 6 matched your
expectation was that it was the last chunk and thus smaller than the
rest? Now you've just returned the wrong answer. AFAICS, there's no way
to detect that except to always read at least two chunks, which seems
like a pretty heavy price to pay. It doesn't cost if you were going to
need to read at least 2 chunks anyway, but if you were only going to
need to read 1, having to fetch 2 stinks.

Actually, when I initially read your proposal, I thought you were
proposing to relax things such that the chunks didn't even have to all
be the same size. That seems like it would be something cool that could
potentially be leveraged not only by AMs but perhaps also by data types
that want to break up their data strategically to cater to future
access patterns. But then there's really no way to make slicing work
without always reading from the beginning.

There would be no big problem here - in either scenario - if each chunk
contained the byte-offset of that chunk relative to the start of the
datum. You could guess wildly and always know whether or not you had
got it right without reading any extra data. But that's not the case.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
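For reference, the arithmetic that variable chunk sizes would break: with a fixed chunk size, the chunks covering a slice can be located directly, along these lines (a simplified sketch of the fixed-size case, not the patch's code):

/* Chunks covering bytes [sliceoffset, sliceoffset + slicelength). */
int32		startchunk = sliceoffset / TOAST_MAX_CHUNK_SIZE;
int32		endchunk = (sliceoffset + slicelength - 1) / TOAST_MAX_CHUNK_SIZE;

If chunk sizes could vary, chunk N's starting byte offset would be the sum of the sizes of chunks 0 .. N-1, which cannot be known without reading those chunks (or storing per-chunk offsets, which the TOAST format does not do) -- hence the guess-and-restart scheme, and the failure mode described above where a guessed chunk happens to have a plausible size.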
On Fri, Sep 6, 2019 at 10:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Sep 5, 2019 at 4:07 PM Andres Freund <andres@anarazel.de> wrote:
> > Yea, makes sense to me.
>
> OK, done. Here's the remaining patches again, with a slight update to
> the renaming patch (now 0002). In the last version, I renamed
> toast_insert_or_update to heap_toast_insert_or_update but did not
> rename toast_delete to heap_toast_delete. Actually, I'm not seeing
> any particular reason not to go ahead and push the renaming patch at
> this point also.

And, hearing no objections, done.

Here's the last patch back, rebased over that renaming. Although I
think that Andres (and Tom) are probably right that there's room for
improvement here, I currently don't see a way around the issues I wrote
about in
http://postgr.es/m/CA+Tgmoa0zFcaCpOJCsSpOLLGpzTVfSyvcVB-USS8YoKzMO51Yw@mail.gmail.com
-- so not quite sure where to go next. Hopefully Andres or someone else
will give me a quick whack with the cluebat if I'm missing something
obvious.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
While testing the Toast patch (PG + v7 patch) I found the below server crash.
System configuration:
VCPUs: 4, RAM: 8GB, Storage: 320GB
This issue is not frequently reproducible; we need to repeat the same testcase multiple times.
CREATE OR REPLACE FUNCTION toast_chunks_cnt_func(p1 IN text)
RETURNS int AS $$
DECLARE
    chunks_cnt int;
    v_tbl text;
BEGIN
    SELECT reltoastrelid::regclass INTO v_tbl FROM pg_class WHERE relname = p1;
    EXECUTE 'SELECT count(*) FROM ' || v_tbl::regclass INTO chunks_cnt;
    RETURN chunks_cnt;
END; $$ LANGUAGE plpgsql;
-- Server crash after multiple run of below testcase
-- ------------------------------------------------------------------------
CHECKPOINT;
CREATE TABLE toast_tab (c1 text);
\d+ toast_tab
-- ALTER table column c1 for storage as "EXTERNAL" to make sure that the column value is pushed to the TOAST table but not COMPRESSED.
ALTER TABLE toast_tab ALTER COLUMN c1 SET STORAGE EXTERNAL;
\d+ toast_tab
\timing
INSERT INTO toast_tab
( select repeat('a', 200000)
from generate_series(1,40000) x);
\timing
SELECT reltoastrelid::regclass FROM pg_class WHERE RELNAME = 'toast_tab';
SELECT toast_chunks_cnt_func('toast_tab') "Number of chunks";
SELECT pg_column_size(t1.*) FROM toast_tab t1 limit 1;
SELECT DISTINCT SUBSTR(c1, 90000,10) FROM toast_tab;
CHECKPOINT;
\timing
UPDATE toast_tab SET c1 = UPPER(c1);
\timing
SELECT toast_chunks_cnt_func('toast_tab') "Number of chunks";
SELECT pg_column_size(t1.*) FROM toast_tab t1 limit 1;
SELECT DISTINCT SUBSTR(c1, 90000,10) FROM toast_tab;
DROP TABLE toast_tab;
-- ------------------------------------------------------------------------
-- Stacktrace as below:
[centos@host-192-168-1-249 bin]$ gdb -q -c data2/core.3151 postgres
Reading symbols from /home/centos/PG/PGsrc/postgresql/inst/bin/postgres...done.
[New LWP 3151]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: checkpointer '.
Program terminated with signal 6, Aborted.
#0 0x00007f2267d33207 in raise () from /lib64/libc.so.6
Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7_6.5.x86_64 keyutils-libs-1.5.8-3.el7.x86_64 krb5-libs-1.15.1-37.el7_6.x86_64 libcom_err-1.42.9-13.el7.x86_64 libselinux-2.5-14.1.el7.x86_64 openssl-libs-1.0.2k-16.el7_6.1.x86_64 pcre-8.32-17.el7.x86_64 zlib-1.2.7-18.el7.x86_64
(gdb) bt
#0 0x00007f2267d33207 in raise () from /lib64/libc.so.6
#1 0x00007f2267d348f8 in abort () from /lib64/libc.so.6
#2 0x0000000000eb3a80 in errfinish (dummy=0) at elog.c:552
#3 0x0000000000c26530 in ProcessSyncRequests () at sync.c:393
#4 0x0000000000bbbc57 in CheckPointBuffers (flags=256) at bufmgr.c:2589
#5 0x0000000000604634 in CheckPointGuts (checkPointRedo=51448358328, flags=256) at xlog.c:8992
#6 0x0000000000603b5e in CreateCheckPoint (flags=256) at xlog.c:8781
#7 0x0000000000aed8fa in CheckpointerMain () at checkpointer.c:481
#8 0x00000000006240de in AuxiliaryProcessMain (argc=2, argv=0x7ffe887c0880) at bootstrap.c:461
#9 0x0000000000b0e834 in StartChildProcess (type=CheckpointerProcess) at postmaster.c:5414
#10 0x0000000000b09283 in reaper (postgres_signal_arg=17) at postmaster.c:2995
#11 <signal handler called>
#12 0x00007f2267df1f53 in __select_nocancel () from /lib64/libc.so.6
#13 0x0000000000b05000 in ServerLoop () at postmaster.c:1682
#14 0x0000000000b0457b in PostmasterMain (argc=5, argv=0x349bce0) at postmaster.c:1391
#15 0x0000000000971c9f in main (argc=5, argv=0x349bce0) at main.c:210
(gdb)
> On Fri, Sep 6, 2019 at 10:59 AM Robert Haas <robertmhaas@gmail.com> wrote:
> > On Thu, Sep 5, 2019 at 4:07 PM Andres Freund <andres@anarazel.de> wrote:
> > > Yea, makes sense to me.
> >
> > OK, done. Here's the remaining patches again, with a slight update to
> > the renaming patch (now 0002). In the last version, I renamed
> > toast_insert_or_update to heap_toast_insert_or_update but did not
> > rename toast_delete to heap_toast_delete. Actually, I'm not seeing
> > any particular reason not to go ahead and push the renaming patch at
> > this point also.
>
> And, hearing no objections, done.
>
> Here's the last patch back, rebased over that renaming. Although I
> think that Andres (and Tom) are probably right that there's room for
> improvement here, I currently don't see a way around the issues I
> wrote about in http://postgr.es/m/CA+Tgmoa0zFcaCpOJCsSpOLLGpzTVfSyvcVB-USS8YoKzMO51Yw@mail.gmail.com
> -- so not quite sure where to go next. Hopefully Andres or someone
> else will give me a quick whack with the cluebat if I'm missing
> something obvious.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.
The Postgres Database Company
On Wed, Oct 30, 2019 at 3:49 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
> While testing the TOAST patch (PG + v7 patch) I found the server crash below.
>
> System configuration:
> VCPUs: 4, RAM: 8GB, Storage: 320GB
>
> This issue is not frequently reproducible; the same testcase needs to be repeated multiple times.

I wonder if this is an independent bug, because the backtrace doesn't
look like it's related to the stuff this is changing. Your report
doesn't specify whether you can also reproduce the problem without the
patch, which is something that you should always check before
reporting a bug in a particular patch.
Hi Robert,

My sincere apologies for not describing the issue in more detail.
I have run the same case against both PG HEAD and HEAD+Patch multiple
times (7, 10, 20 runs), and found that the issue does not occur on
HEAD, while the same case is reproducible on HEAD+Patch (again, I was
not sure whether the backtrace is related to the patch or not).

With Regards,
Prabhat Kumar Sahu
Skype ID: prabhat.sahu1984
EnterpriseDB Software India Pvt. Ltd.
The Postgres Database Company
From the stack trace shared by Prabhat, I understand that the checkpointer process panicked due to one of the following two reasons:

1) The fsync() failed on the first attempt itself, and the reason for the failure was not the file being dropped or truncated, i.e. fsync failed with an error other than ENOENT. Refer to ProcessSyncRequests() for details, esp. the code inside the for (failures = 0; !entry->canceled; failures++) loop.

2) The first attempt at fsync() failed with ENOENT because, just before the fsync function was called, the file being synced was either dropped or truncated. When this happened, the checkpointer process called AbsorbSyncRequests() to update the entry for the deleted file in the hash table, but it seems AbsorbSyncRequests() failed to do so, which is why "entry->canceled" couldn't be set to true. Due to this, fsync() was performed on the same file twice, and that failed too. As the checkpointer process doesn't expect fsync on the same file to fail twice, it panicked. Again, please check ProcessSyncRequests() for details, esp. the code inside the for (failures = 0; !entry->canceled; failures++) loop.

Now, the point of discussion here is: which of the above two reasons could be the cause of the panic? To me, point #2 doesn't look like a possible reason. Just before a file is unlinked, the backend first sends a SYNC_FORGET_REQUEST to the checkpointer process, which marks the entry for this file in the hash table as cancelled, and only then removes the file. With that understanding it is hard to believe that, once the first fsync() for a file has failed with ENOENT, a call to AbsorbSyncRequests() made immediately afterwards wouldn't update the entry for this file in the hash table, because the backend only removes the file once it has successfully sent the SYNC_FORGET_REQUEST for that file to the checkpointer process. See mdunlinkfork()->register_forget_request() for details on this.

So, I think the first point mentioned above is the probable reason for the checkpointer process panicking. But, having said all that, it would be good to have some evidence for it, which can be confirmed by inspecting the server logfile.

Prabhat, is it possible for you to re-run the test case with log_min_messages set to DEBUG1 and save the logfile for the run that crashes? That would help in knowing whether fsync was performed just once or twice, i.e. whether point #1 or point #2 is the reason for the panic.
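For reference, the retry loop in ProcessSyncRequests() that both points above refer to looks roughly like this (a paraphrased sketch of sync.c, not the exact source; note that data_sync_elevel() promotes the ERROR to PANIC by default, which is the abort seen in the stack trace):

    for (failures = 0; !entry->canceled; failures++)
    {
        /* Attempt to sync the file this hash-table entry refers to. */
        if (pg_fsync(fd) == 0)
            break;              /* success */

        /*
         * Tolerate exactly one ENOENT-style failure: the file may have
         * been dropped or truncated just before we got here, in which
         * case a SYNC_FORGET_REQUEST should already be queued for us.
         * Any other error, or a second failure, is treated as fatal.
         */
        if (!FILE_POSSIBLY_DELETED(errno) || failures > 0)
            ereport(data_sync_elevel(ERROR),
                    (errcode_for_file_access(),
                     errmsg("could not fsync file \"%s\": %m", path)));

        /* May set entry->canceled, in which case the loop ends. */
        AbsorbSyncRequests();
    }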
On 2019-10-04 20:32, Robert Haas wrote:
> Here's the last patch back, rebased over that renaming. Although I
> think that Andres (and Tom) are probably right that there's room for
> improvement here, I currently don't see a way around the issues I
> wrote about in http://postgr.es/m/CA+Tgmoa0zFcaCpOJCsSpOLLGpzTVfSyvcVB-USS8YoKzMO51Yw@mail.gmail.com
> -- so not quite sure where to go next. Hopefully Andres or someone
> else will give me a quick whack with the cluebat if I'm missing
> something obvious.

This patch seems sound as far as the API restructuring goes.

If I may summarize the remaining discussion: This patch adds a field
toast_max_chunk_size to TableAmRoutine, to take the place of the
hardcoded TOAST_MAX_CHUNK_SIZE. The heapam_methods implementation then
sets this to TOAST_MAX_CHUNK_SIZE, thus preserving existing behavior.
Other table AMs can set this to some other value that they find
suitable. Currently, TOAST_MAX_CHUNK_SIZE is computed based on
heap-specific values and assumptions, so it's likely that other AMs
won't want to use that value. (Side note: Maybe rename
TOAST_MAX_CHUNK_SIZE then.)

The concern was raised that while TOAST_MAX_CHUNK_SIZE is stored in
pg_control, values chosen by other table AMs won't be, and so they
won't have any safeguards against starting a server with an
incompatible disk layout. Then, various ways to detect or check the
TOAST chunk size at run time were discussed, but none seemed
satisfactory.

I think AMs are probably going to need a general mechanism to store
pg_control-like data somewhere. There are going to be chunk sizes,
block sizes, segment sizes, and so on. This one is just a particular
case of that.

This particular patch doesn't need to be held up by that, though.
Providing that mechanism can be a separate subproject of pluggable
storage.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
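To make the summarized v7 approach concrete, the API shape being described is roughly the following (a sketch only; the field name is from the discussion above, the surrounding struct contents are abbreviated and the exact type is illustrative):

    /* src/include/access/tableam.h, abbreviated */
    typedef struct TableAmRoutine
    {
        NodeTag     type;
        /* ... existing callbacks ... */

        /* maximum TOAST chunk size for toast tables of this AM */
        int32       toast_max_chunk_size;
    } TableAmRoutine;

    /* heapam's method table preserves existing behavior */
    static const TableAmRoutine heapam_methods = {
        /* ... */
        .toast_max_chunk_size = TOAST_MAX_CHUNK_SIZE,
    };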
On Wed, Nov 6, 2019 at 4:01 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> This patch seems sound as far as the API restructuring goes.

Thanks. And thanks for weighing in.

> If I may summarize the remaining discussion: This patch adds a field
> toast_max_chunk_size to TableAmRoutine, to take the place of the
> hardcoded TOAST_MAX_CHUNK_SIZE. The heapam_methods implementation then
> sets this to TOAST_MAX_CHUNK_SIZE, thus preserving existing behavior.
> Other table AMs can set this to some other value that they find
> suitable. Currently, TOAST_MAX_CHUNK_SIZE is computed based on
> heap-specific values and assumptions, so it's likely that other AMs
> won't want to use that value. (Side note: Maybe rename
> TOAST_MAX_CHUNK_SIZE then.)

Yeah.

> The concern was raised that while TOAST_MAX_CHUNK_SIZE is stored in
> pg_control, values chosen by other table AMs won't be, and so they
> won't have any safeguards against starting a server with an
> incompatible disk layout. Then, various ways to detect or check the
> TOAST chunk size at run time were discussed, but none seemed
> satisfactory.

Yeah. I've been nervous about trying to proceed with this patch
because Andres seemed confident there was a better approach than what
I did here, but as I wrote about back on September 12th, it doesn't
seem like his idea will work. I'm not clear whether I'm being stupid
and there's a way to salvage his idea, or whether he just made a
mistake.

One possible approach would be to move more of the logic below the
tableam layer. For example, toast_fetch_datum() could do this:

    toastrel = table_open(toast_pointer.va_toastrelid, AccessShareLock);
    call_a_new_tableam_method_here(toastrel, &toast_pointer);
    table_close(toastrel, AccessShareLock);

...and then it becomes the tableam's job to handle everything that
needs to be done in the middle. That might be better than what I've
got now; it's certainly more flexible. It does mean that an AM that
just wants to reuse the existing logic with a different chunk size has
got to repeat some code, but it's probably <~150 lines, so that's
perhaps not a catastrophe.

Alternatively, we could (a) stick with the current approach, (b) use
the current approach but make the table AM member a callback rather
than a constant, or (c) something else entirely. I don't want to give
up on making the TOAST infrastructure pluggable; requiring every AM to
use the heap as its TOAST implementation seems too constraining.

> I think AMs are probably going to need a general mechanism to store
> pg_control-like data somewhere. There are going to be chunk sizes,
> block sizes, segment sizes, and so on. This one is just a particular
> case of that.

That's an interesting point. I don't know for sure to what extent we
need that; I think that the toast chunk size is actually not very
interesting to vary, and the fact that we technically allow it to be
varied seems like it isn't buying us much. I think as much as possible
we should allow settings that actually need to be varied to differ
table-by-table, not require a recompile or re-initdb. But if we are
going to have some that do require that, then what you're talking
about here would certainly make that easier to secure.

> This particular patch doesn't need to be held up by that, though.
> Providing that mechanism can be a separate subproject of pluggable
> storage.

+1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-11-06 10:01:40 +0100, Peter Eisentraut wrote:
> On 2019-10-04 20:32, Robert Haas wrote:
> > Here's the last patch back, rebased over that renaming. Although I
> > think that Andres (and Tom) are probably right that there's room for
> > improvement here, I currently don't see a way around the issues I
> > wrote about in http://postgr.es/m/CA+Tgmoa0zFcaCpOJCsSpOLLGpzTVfSyvcVB-USS8YoKzMO51Yw@mail.gmail.com
> > -- so not quite sure where to go next. Hopefully Andres or someone
> > else will give me a quick whack with the cluebat if I'm missing
> > something obvious.
>
> This patch seems sound as far as the API restructuring goes.
>
> If I may summarize the remaining discussion: This patch adds a field
> toast_max_chunk_size to TableAmRoutine, to take the place of the
> hardcoded TOAST_MAX_CHUNK_SIZE. The heapam_methods implementation then
> sets this to TOAST_MAX_CHUNK_SIZE, thus preserving existing behavior.
> Other table AMs can set this to some other value that they find
> suitable. Currently, TOAST_MAX_CHUNK_SIZE is computed based on
> heap-specific values and assumptions, so it's likely that other AMs
> won't want to use that value. (Side note: Maybe rename
> TOAST_MAX_CHUNK_SIZE then.) The concern was raised that while
> TOAST_MAX_CHUNK_SIZE is stored in pg_control, values chosen by other
> table AMs won't be, and so they won't have any safeguards against
> starting a server with an incompatible disk layout. Then, various ways
> to detect or check the TOAST chunk size at run time were discussed,
> but none seemed satisfactory.

I think it's more than just that. It's also that I think presenting a
hardcoded value to the outside of / above an AM is architecturally
wrong. If anything this is an implementation detail of the AM, that
the AM ought to be concerned with internally, not something it should
present to the outside.

I also, and separately from that architectural concern, think that
hardcoding values like this in the control file is a bad practice, and
we shouldn't expand it. It basically makes it practically impossible
to ever change their default value.

> I think AMs are probably going to need a general mechanism to store
> pg_control-like data somewhere. There are going to be chunk sizes,
> block sizes, segment sizes, and so on. This one is just a particular
> case of that.

That's imo best done as a meta page within the table.

> This particular patch doesn't need to be held up by that, though.
> Providing that mechanism can be a separate subproject of pluggable
> storage.

Again seems like something that the AM ought to handle below it.

Greetings,

Andres Freund
On Wed, Nov 6, 2019 at 11:25 AM Andres Freund <andres@anarazel.de> wrote:
> I think it's more than just that. It's also that I think presenting a
> hardcoded value to the outside of / above an AM is architecturally
> wrong. If anything this is an implementation detail of the AM, that
> the AM ought to be concerned with internally, not something it should
> present to the outside.

I mean, it depends on your vision of how things ought to be
abstracted. If you want the TOAST stuff to be logically "below" the
table AM layer, then this is an abstraction violation. But if you
think of TOAST as being a parallel system to table AM, then it's fine.

It also depends on your goals. If you want to give the table AM
maximum freedom to do what it likes, the design I proposed is not very
good. If you want to make it easy for someone to plug in a new AM that
does toasting like the current heap but with a different chunk size,
that design lets you do so with a very minimal amount of code.

I don't really care very much about the details here, but I don't want
to just keep kicking the can down the road. If we can agree on *some*
design that lets a new table AM have a TOAST table that uses an AM
other than the heap, and that I can understand and implement with some
halfway-reasonable amount of work, I'll do it. It doesn't have to be
the thing I proposed. But I think it would be better to do that thing
than nothing. We're not engraving anything we do here on stone
tablets.

> I also, and separately from that architectural concern, think that
> hardcoding values like this in the control file is a bad practice, and
> we shouldn't expand it. It basically makes it practically impossible
> to ever change their default value.

I generally agree, although I think there might be exceptions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Hi,

On 2019-11-06 10:38:58 -0500, Robert Haas wrote:
> Yeah. I've been nervous about trying to proceed with this patch
> because Andres seemed confident there was a better approach than what
> I did here, but as I wrote about back on September 12th, it doesn't
> seem like his idea will work. I'm not clear whether I'm being stupid
> and there's a way to salvage his idea, or whether he just made a
> mistake.

(still trying to get back to this)

> One possible approach would be to move more of the logic below the
> tableam layer. For example, toast_fetch_datum() could do this:
>
>     toastrel = table_open(toast_pointer.va_toastrelid, AccessShareLock);
>     call_a_new_tableam_method_here(toastrel, &toast_pointer);
>     table_close(toastrel, AccessShareLock);
>
> ...and then it becomes the tableam's job to handle everything that
> needs to be done in the middle.

I think that's a good direction to go in, for more than just the
discussion we're having here about the fixed chunking size. I'm fairly
sure that plenty of AMs e.g. wouldn't want to actually store toasted
datums in a separate relation. And this would at least go more in the
direction of making that possible.

And I think the above ought to not even increase the overhead compared
to the patch, as the number of indirect function calls ought to stay
the same or even be lower. The indirection would otherwise have to
happen within toast_fetch_datum(), whereas with the above, it ought to
be possible to avoid needing to do so while processing toast chunks.

It seems, unfortunately, that at the moment an AM not wanting to store
toast datums in a separate file would still need to create a toast
relation, just to get a distinct toast oid, to make sure that toasted
datums from the old and new relation are distinct - which seems like
it could be important e.g. for table rewrites. There's probably some
massaging of table rewrites and cluster needed.

I suspect the new callback ought to allow sliced and non-sliced
access, perhaps by just allowing to specify slice offset / length as 0
and INT32_MAX respectively (or maybe just -1, -1?). That'd also allow
an AM to make slicing possible in cases where that's not possible for
the heap. And there seems little point in having two callbacks.

> That might be better than what I've got now; it's certainly more
> flexible. It does mean that an AM that just wants to reuse the
> existing logic with a different chunk size has got to repeat some
> code, but it's probably <~150 lines, so that's perhaps not a
> catastrophe.

Also seems that the relevant code can be made reusable (opting into
the current logic), in line with where you've been going with this
code already.

> Alternatively, we could (a) stick with the current approach, (b) use
> the current approach but make the table AM member a callback rather
> than a constant, or (c) something else entirely.

A callback returning the chunk size does not seem like an improvement
to me.

> I don't want to give up on making the TOAST infrastructure pluggable;
> requiring every AM to use the heap as its TOAST implementation seems
> too constraining.

+1

> > I think AMs are probably going to need a general mechanism to store
> > pg_control-like data somewhere. There are going to be chunk sizes,
> > block sizes, segment sizes, and so on. This one is just a particular
> > case of that.
>
> That's an interesting point. I don't know for sure to what extent we
> need that; I think that the toast chunk size is actually not very
> interesting to vary, and the fact that we technically allow it to be
> varied seems like it isn't buying us much.

Whether I agree with that statement depends a bit on what you mean by
varying the chunk size. If you mean that there's not much need for a
value other than an adjusted computation of what's currently used,
then I don't agree: We currently make toast a lot more expensive by
quadrupling the number of separate heap fetches. And e.g. compressing
chunks separately, to allow for sliced access even when compressed,
would also be hard to do with the current sizes.

Additionally, I *very* strongly suspect that, while it makes sense to
use chunk sizes where multiple chunks fit a page for a heavily updated
toast table full of small-ish values, it makes no sense whatsoever to
do so when toasting a 10MB value that's going to be appended to the
toast relation, because there's no space available for reuse anyway.
And smaller chunking isn't going to help with space reuse either,
because the whole toast datum is going to be deleted together. I think
the right way for toast creation to behave really would be to check
whether there's free space available that'd benefit from using smaller
chunks, and do so if available, and otherwise use all the space in a
page for each chunk. That'd obviously make sliced access harder, so it
surely isn't a panacea.

Greetings,

Andres Freund
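A minimal sketch of the kind of single, slice-capable callback being suggested above (the name and exact signature are illustrative only; nothing like this is nailed down in the patches):

    /*
     * Hypothetical table AM callback: fetch a toasted value, or a slice
     * of it.  sliceoffset = 0 with slicelength = INT32_MAX (or -1/-1)
     * would mean "fetch the whole value", so one callback serves both
     * the sliced and non-sliced access paths.
     */
    struct varlena *(*relation_fetch_toast_slice) (Relation toastrel,
                                                   struct varatt_external toast_pointer,
                                                   int32 sliceoffset,
                                                   int32 slicelength);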
Hi,

On 2019-11-06 11:49:10 -0500, Robert Haas wrote:
> On Wed, Nov 6, 2019 at 11:25 AM Andres Freund <andres@anarazel.de> wrote:
> > I think it's more than just that. It's also that I think presenting a
> > hardcoded value to the outside of / above an AM is architecturally
> > wrong. If anything this is an implementation detail of the AM, that
> > the AM ought to be concerned with internally, not something it should
> > present to the outside.
>
> I mean, it depends on your vision of how things ought to be
> abstracted. If you want the TOAST stuff to be logically "below" the
> table AM layer, then this is an abstraction violation. But if you
> think of TOAST as being a parallel system to table AM, then it's fine.
> It also depends on your goals. If you want to give the table AM
> maximum freedom to do what it likes, the design I proposed is not very
> good. If you want to make it easy for someone to plug in a new AM that
> does toasting like the current heap but with a different chunk size,
> that design lets you do so with a very minimal amount of code.

I'd like an AM to have the *option* of implementing something better,
or at least go in the direction of making that possible.

It seems perfectly possible to have a helper function implementing the
current logic that you just can call with the fixed chunk size as an
additional parameter. Which'd basically mean there's no meaningful
difference in complexity compared to providing the chunk size as an
external AM property. In one case you have a callback that just calls
a helper function with one parameter, in the other you fill in a
member of the struct.

Greetings,

Andres Freund
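Side by side, the two shapes being compared come out like this (a sketch with hypothetical names, assuming a shared helper of the kind described above):

    /* Option 1 (the v7 patch): a constant advertised in the AM struct */
    .toast_max_chunk_size = MY_AM_CHUNK_SIZE,

    /* Option 2: a callback that opts into the generic logic */
    static struct varlena *
    my_am_fetch_toasted(Relation toastrel,
                        struct varatt_external toast_pointer,
                        int32 sliceoffset, int32 slicelength)
    {
        /* generic_toast_fetch() is a hypothetical shared helper that
         * implements the current logic for a caller-supplied chunk size */
        return generic_toast_fetch(toastrel, toast_pointer,
                                   sliceoffset, slicelength,
                                   MY_AM_CHUNK_SIZE);
    }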
On Thu, Nov 7, 2019 at 10:57 AM Prabhat Sahu <prabhat.sahu@enterprisedb.com> wrote:
> I have run the same testcases with and without the patch multiple
> times with the debug option (log_min_messages = DEBUG1), but this
> time I am not able to reproduce the crash.

Okay, no problem. Thanks for re-running the test-cases.

@Robert, Prabhat and I have tried running the test-cases that caused
the checkpointer process to crash earlier multiple times, but we are
not able to reproduce it, either with or without the patch. However,
from the stack trace shared earlier by Prabhat, it is clear that the
checkpointer process panicked due to an fsync failure, though there is
no further data to know the exact reason for that failure. From the
code of the checkpointer process (basically the function that
processes fsync requests), it is understood that the checkpointer can
PANIC for one of the following two reasons:

1) The fsync call made by the checkpointer process failed with an
error other than ENOENT.

2) The fsync call made by the checkpointer process failed with ENOENT,
which caused the checkpointer to invoke AbsorbSyncRequests() to update
the entry for the deleted file in the hash table (basically to mark
the entry as cancelled). But it seems it couldn't do so, either
because a) possibly no SYNC_FORGET_REQUEST was sent by the backend to
the checkpointer process, or b) the request was sent but for some
reason the checkpointer couldn't absorb it. This caused the
checkpointer to perform fsync on the same file once again, which is
bound to fail, resulting in a panic.

Now, if the checkpointer panicked due to reason #1, then I don't think
it has anything to do with postgres, because postgres only cares when
fsync fails with ENOENT. If it panicked due to reason #2, then
possibly there is some bug in the postgres code, which I assume would
have to be a problem with the way the backend sends fsync requests to
the checkpointer for deleted files and the way the checkpointer
handles those requests. At least for me, it is hard to believe that
reason #2 could be the cause of the panic here, because before a file
is unlinked by the backend, it first sends a SYNC_FORGET_REQUEST to
the checkpointer process, and only when that is done successfully does
the backend remove the file. With that understanding it is hard to
believe that, once the first fsync() for a file has failed with
ENOENT, a call to AbsorbSyncRequests() made immediately afterwards
wouldn't update the entry for this file in the hash table.

And even if reason #2 is the cause of this failure, I don't think it
has anything to do with your changes. I haven't studied your patches
in detail, but considering the purpose of the patch, and from a quick
look, it doesn't seem to change anything in the area of the code that
might be causing this crash.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On 2019-11-06 18:00, Andres Freund wrote:
> I'd like an AM to have the *option* of implementing something better,
> or at least go in the direction of making that possible.

I don't think the presented design prevents that. An AM can just
return false from relation_needs_toast_table in all cases and
implement something internally.

> It seems perfectly possible to have a helper function implementing the
> current logic that you just can call with the fixed chunk size as an
> additional parameter. Which'd basically mean there's no meaningful
> difference in complexity compared to providing the chunk size as an
> external AM property. In one case you have a callback that just calls
> a helper function with one parameter, in the other you fill in a
> member of the struct.

I can see a "moral" concern about having TOAST be part of the table AM
API. It should be an implementation concern of the AM. How much more
work would it be to refactor TOAST into a separate API that an AM
implementation could use or not? How much more complicated would the
result be? I guess you would like to at least have it explored.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 7, 2019 at 1:15 AM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
> @Robert, Prabhat and I have tried running the test-cases that caused
> the checkpointer process to crash earlier multiple times, but we are
> not able to reproduce it, either with or without the patch. However,
> from the stack trace shared earlier by Prabhat, it is clear that the
> checkpointer process panicked due to an fsync failure, though there
> is no further data to know the exact reason for that failure. From
> the code of the checkpointer process (basically the function that
> processes fsync requests), it is understood that the checkpointer can
> PANIC for one of the following two reasons:

Oh, I didn't realize this was a panic due to an fsync() failure when I
looked at the stack trace before. I think it's concerning that
fsync() failed on Prabhat's machine, and it would be interesting to
know why that happened, but I don't see how this patch could possibly
*cause* fsync() to fail, so I think we can say that whatever is
happening on his machine is unrelated to this patch -- and probably
also unrelated to PostgreSQL.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On Thu, Nov 7, 2019 at 7:35 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Oh, I didn't realize this was a panic due to an fsync() failure when I
> looked at the stack trace before. I think it's concerning that
> fsync() failed on Prabhat's machine, and it would be interesting to
> know why that happened, but I don't see how this patch could possibly
> *cause* fsync() to fail, so I think we can say that whatever is
> happening on his machine is unrelated to this patch -- and probably
> also unrelated to PostgreSQL.

That's right and that's exactly what I mentioned in my conclusion too.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On Wed, Nov 6, 2019 at 12:00 PM Andres Freund <andres@anarazel.de> wrote:
> I'd like an AM to have the *option* of implementing something better,
> or at least go in the direction of making that possible.

OK. Could you see what you think of the attached patches? 0001 does
some refactoring of toast_fetch_datum() and toast_fetch_datum_slice()
to make them look more like each other and clean up a bunch of stuff
that I thought was annoying, and 0002 then pulls out the common logic
into a heap-specific function. If you like this direction, we could
then push the heap-specific function below tableam, but I haven't done
that yet.

> It seems perfectly possible to have a helper function implementing the
> current logic that you just can call with the fixed chunk size as an
> additional parameter. Which'd basically mean there's no meaningful
> difference in complexity compared to providing the chunk size as an
> external AM property. In one case you have a callback that just calls
> a helper function with one parameter, in the other you fill in a
> member of the struct.

I haven't tried to do this yet. I think that to make it work, the
helper function would have to operate in terms of slots instead of
using fastgetattr() as this logic does now. I don't know whether that
would be faster (because the current code might have a little less in
terms of indirect function calls) or slower (because the current code
makes two calls to fastgetattr and if we used slots here we could just
deform once). I suspect it might be a small enough difference not to
worry too much about it either way.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
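For readers unfamiliar with the two access styles being weighed above, a compressed illustration (a sketch, not code from the patches; attribute 2 of a TOAST table is chunk_seq and attribute 3 is chunk_data):

    /* current, heap-only style: decode the heap tuple twice */
    curchunk = DatumGetInt32(fastgetattr(ttup, 2, toasttupDesc, &isnull));
    chunk = DatumGetPointer(fastgetattr(ttup, 3, toasttupDesc, &isnull));

    /* slot-based, AM-agnostic style: deform once, then fetch attributes */
    slot_getsomeattrs(slot, 3);
    curchunk = DatumGetInt32(slot_getattr(slot, 2, &isnull));
    chunk = DatumGetPointer(slot_getattr(slot, 3, &isnull));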
Hi Craig,

Please find my response inline below.

On Sun, Nov 10, 2019 at 2:39 PM Craig Ringer <craig@2ndquadrant.com> wrote:
> On Thu, 7 Nov 2019 at 22:45, Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> In fact, I suspect this is PostgreSQL successfully protecting itself
> from an unsafe situation.
>
> Does the host have thin-provisioned storage? lvmthin,
> thin-provisioned SAN, etc?

No, it doesn't. In fact, the machine on which the issue was reproduced
once or twice doesn't have any LVMs. The other machine, on which the
issue never got reproduced, has some LVMs, but they are
thick-provisioned, not thin-provisioned.

> Is the DB on NFS?

No.

--
With Regards,
Ashutosh Sharma
EnterpriseDB: http://www.enterprisedb.com
On 2019-11-08 17:59, Robert Haas wrote:
> On Wed, Nov 6, 2019 at 12:00 PM Andres Freund <andres@anarazel.de> wrote:
> > I'd like an AM to have the *option* of implementing something better,
> > or at least go in the direction of making that possible.
>
> OK. Could you see what you think of the attached patches? 0001 does
> some refactoring of toast_fetch_datum() and toast_fetch_datum_slice()
> to make them look more like each other and clean up a bunch of stuff
> that I thought was annoying, and 0002 then pulls out the common logic
> into a heap-specific function. If you like this direction, we could
> then push the heap-specific function below tableam, but I haven't done
> that yet.

Compared to the previous patch (v7), where the API just had a "use this
AM for TOAST" field, and the other extreme of pushing TOAST entirely
inside the heap AM, this seems like the worst of both worlds, with the
maximum additional complexity.

I don't think we need to nail down this API for eternity, so I'd be
happy to err on the side of practicality here. However, it seems it's
not quite clear what, for example, the requirements and wishes from
zheap would be. What's the simplest way to move this forward?

The refactorings you proposed seem reasonable on their own, and I have
some additional comments on that if we decide to go forward in this
direction. One thing that's confusing is that the TOAST tables have
fields chunk_id and chunk_seq, but when an error message talks about
"chunk %d" or "chunk number %d", it usually means the "seq" and not
the "id".

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Mon, Nov 11, 2019 at 8:51 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Compared to the previous patch (v7), where the API just had a "use
> this AM for TOAST" field, and the other extreme of pushing TOAST
> entirely inside the heap AM, this seems like the worst of both
> worlds, with the maximum additional complexity.

There might be a misunderstanding here. These patches would still have
a "use this AM for TOAST" callback, just as the previous set did, but
I didn't include that here, because this is talking about a different
part of the problem. The purpose of that callback is to determine
which AM will be used to create the toast table. The purpose of these
patches is to be able to detoast a value given nothing but the TOAST
pointer extracted from the heap tuple, while removing the present
assumption that the TOAST table is a heap table.

(The current coding is actually seriously inconsistent, because right
now, the code that creates TOAST tables always uses the same AM as the
main heap; but the detoasting code only works with heap tables, which
means that no non-heap AM can use the TOAST system at all. If
necessary, we could commit the patch to allow the TOAST table AM to be
changed first, and then handle allowing the detoasting logic to cope
with a non-heap AM as a separate matter.)

> I don't think we need to nail down this API for eternity, so I'd be
> happy to err on the side of practicality here. However, it seems it's
> not quite clear what, for example, the requirements and wishes from
> zheap would be. What's the simplest way to move this forward?

The only thing zheap needs - in the current design, anyway - is the
ability to change the chunk size. However, I think that's mostly
because we haven't spent a lot of time thinking about how to do TOAST
better than the heap does TOAST today. I think it is desirable to
allow for more options than that. That's why I like this approach more
than the previous one. The previous approach allowed the chunk size to
be variable, but permitted no other AM-specific variation; this one
allows the AM to detoast in any way that it likes. The downside of
that is that if you really do only want to vary the chunk size, you'll
have to repeat somewhat more code. That's sad, but we're never likely
to have enough AMs for that to be a really serious problem, and if we
do, the AM-specific callbacks for those AMs that just want a different
chunk size could call a common helper function.

> The refactorings you proposed seem reasonable on their own, and I
> have some additional comments on that if we decide to go forward in
> this direction. One thing that's confusing is that the TOAST tables
> have fields chunk_id and chunk_seq, but when an error message talks
> about "chunk %d" or "chunk number %d", it usually means the "seq" and
> not the "id".

Well, we've got errors like this right now:

    unexpected chunk number %d (expected %d) for toast value %u in %s

So at least in this case, and I think in many cases, we're referring
to the chunk_id as "toast value %u" and the chunk_seq as "chunk number
%d". I think that's pretty good terminology. It's unfortunate that the
TOAST table columns are called chunk_id and chunk_seq rather than,
say, value_id and chunk_number, and I guess we could possibly change
that without breaking too many things, but I'm not sure that changing
the error messages would help anybody. We could try to rephrase the
error message to mention the two values in the opposite order, which
to me would be more clear, but I'm not exactly sure how to do that
without writing rather awkward English.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
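For reference, the table shape this terminology discussion is about (the OID in the TOAST table's name varies per owning table; 16429 here is just an example):

    \d pg_toast.pg_toast_16429
    TOAST table "pg_toast.pg_toast_16429"
       Column   |  Type
    ------------+---------
     chunk_id   | oid
     chunk_seq  | integer
     chunk_data | bytea

All chunks of one toasted value share a chunk_id ("toast value %u" in the message above); chunk_seq orders the chunks within that value, and the table has a unique index on (chunk_id, chunk_seq).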
On 2019-11-08 17:59, Robert Haas wrote:
> OK. Could you see what you think of the attached patches? 0001 does
> some refactoring of toast_fetch_datum() and toast_fetch_datum_slice()
> to make them look more like each other and clean up a bunch of stuff
> that I thought was annoying, and 0002 then pulls out the common logic
> into a heap-specific function. If you like this direction, we could
> then push the heap-specific function below tableam, but I haven't done
> that yet.

Partial review: The 0001 patch seems very sensible. Some minor
comments on that:

Perhaps rename the residx variable (in both functions). You have
gotten rid of all the res* variables except that one. That name as it
is right now isn't very helpful at all.

You have collapsed the error messages for "chunk %d of %d" and "final
chunk %d" and replaced it with just "chunk %d". I think it might be
better to keep the "chunk %d of %d" wording, for more context, or was
there a reason why you wanted to remove the total count from the
message?

I believe this assertion

    + Assert(endchunk <= totalchunks);

should be < (strictly less). In the commit message you state that this
assertion replaces a run-time check, but I couldn't quite make out
which one you are referring to because all the existing run-time
checks are kept, with slightly refactored conditions.

--
Peter Eisentraut http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On Thu, Nov 21, 2019 at 5:37 AM Peter Eisentraut
<peter.eisentraut@2ndquadrant.com> wrote:
> Partial review: The 0001 patch seems very sensible. Some minor
> comments on that:

Thanks for the review. Updated patches attached. This version is more
complete than the last set of patches I posted. It looks like this:

0001 - Lets a table AM that needs a toast table choose the AM that
will be used to implement the toast table.
0002 - Refactoring and code cleanup for TOAST code.
0003 - Move heap-specific portion of logic refactored by previous
patch to a separate function.
0004 - Lets a table AM arrange to call a different function when
detoasting, instead of the one created by 0003.

> Perhaps rename the residx variable (in both functions). You have
> gotten rid of all the res* variables except that one. That name as it
> is right now isn't very helpful at all.

OK, I renamed residx to curchunk and nextidx to expectedchunk.

> You have collapsed the error messages for "chunk %d of %d" and "final
> chunk %d" and replaced it with just "chunk %d". I think it might be
> better to keep the "chunk %d of %d" wording, for more context, or was
> there a reason why you wanted to remove the total count from the
> message?

No, not really. Adjusted.

> I believe this assertion
>
>     + Assert(endchunk <= totalchunks);
>
> should be < (strictly less).

I think you're right. Fixed.

> In the commit message you state that this assertion replaces a
> run-time check, but I couldn't quite make out which one you are
> referring to because all the existing run-time checks are kept, with
> slightly refactored conditions.

Pre-patch, there is this condition:

    - if ((residx != nextidx) || (residx > endchunk) || (residx < startchunk)

This checks that the chunk number found is equal to the one we expect,
and also that it is neither greater than the last one we were
expecting nor less than the first one we were expecting. The bound
checks are redundant if you suppose that we never compute nextidx so
that it is outside the bounds of startchunk..endchunk. Since nextidx
(or expectedchunk, as these patches now call it) is initialized to
startchunk and then only incremented, it seems impossible for the
lower-bound condition to ever fail, so it is no longer tested there.
The upper-bound condition is also no longer tested there; instead, we
do this:

    + Assert(endchunk < totalchunks);

That's what the commit message is on about.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
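Putting that explanation together, the refactored loop is roughly shaped like this (a schematic, not the exact patch code):

    /* endchunk was computed from the value's size, so it can be
     * checked once, up front, instead of on every iteration */
    Assert(endchunk < totalchunks);

    expectedchunk = startchunk;
    /* ... for each chunk tuple fetched ... */
    if (curchunk != expectedchunk)
        elog(ERROR, "unexpected chunk number %d (expected %d) for toast value %u in %s",
             curchunk, expectedchunk,
             toast_pointer.va_valueid,
             RelationGetRelationName(toastrel));
    /* ... copy the chunk data ... */
    expectedchunk++;

Since expectedchunk stays within startchunk..endchunk by construction, the old lower/upper bound tests on the chunk number found are redundant and drop out.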
On Fri, Nov 22, 2019 at 10:41 AM Robert Haas <robertmhaas@gmail.com> wrote:
> Thanks for the review. Updated patches attached. This version is more
> complete than the last set of patches I posted. It looks like this:
>
> 0001 - Lets a table AM that needs a toast table choose the AM that
> will be used to implement the toast table.
> 0002 - Refactoring and code cleanup for TOAST code.
> 0003 - Move heap-specific portion of logic refactored by previous
> patch to a separate function.
> 0004 - Lets a table AM arrange to call a different function when
> detoasting, instead of the one created by 0003.

Hearing no further comments, I went ahead and pushed 0002 today. That
turned out to have a bug, so I pushed a fix for that. Hopefully the
buildfarm will agree that it's fixed.

Meanwhile, here are the remaining patches again, rebased over the bug
fix.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Tue, Dec 17, 2019 at 4:12 PM Robert Haas <robertmhaas@gmail.com> wrote:
> Hearing no further comments, I went ahead and pushed 0002 today. That
> turned out to have a bug, so I pushed a fix for that. Hopefully the
> buildfarm will agree that it's fixed.
>
> Meanwhile, here are the remaining patches again, rebased over the bug
> fix.

OK, I've now pushed the last of the refactoring patches. Here are the
two main patches back, which are actually quite small, though the
second one looks bigger than it is because it moves a function from
detoast.c into heaptoast.c. This is slightly rebased again because the
other refactoring patch I just pushed had a couple of typos which I
fixed.

If nobody has further comments or objections, I plan to commit these
in early January.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Attachment
On Wed, Dec 18, 2019 at 11:37 AM Robert Haas <robertmhaas@gmail.com> wrote:
> If nobody has further comments or objections, I plan to commit these
> in early January.

Done. Which, I think, wraps up the work I felt needed to be done here.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company