Thread: Analyzing foreign tables & memory problems

Analyzing foreign tables & memory problems

From
"Albe Laurenz"
Date:
During ANALYZE, in analyze.c, functions compute_minimal_stats
and compute_scalar_stats, values whose length exceeds
WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
other than that they are counted as "too wide rows" and assumed
to be all different.

This works fine with regular tables; values exceeding that threshold
don't get detoasted and won't consume excessive memory.
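
For reference, the check in question looks roughly like this (condensed
and slightly paraphrased from compute_scalar_stats in analyze.c, not a
verbatim excerpt):

    /* per sampled value of a variable-length column */
    if (is_varlena)
    {
        total_width += VARSIZE_ANY(DatumGetPointer(value));

        /* Excessively wide?  Count it, but never detoast it. */
        if (toast_raw_datum_size(value) > WIDTH_THRESHOLD)
        {
            toowide_cnt++;
            continue;           /* skip to the next sample row */
        }
        value = PointerGetDatum(PG_DETOAST_DATUM(value));
    }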

With foreign tables the situation is different.  Even though
values exceeding WIDTH_THRESHOLD won't get used, the complete
rows will be fetched from the foreign table.  This can easily
exhaust maintenance_work_mem.

A foreign data wrapper has no good way to counter the problem.
It can return truncated values in its AcquireSampleRowsFunc,
but WIDTH_THRESHOLD is private to analyze.c and it's a bad idea
to hard code a cutoff limit of 1025.

I can think of two remedies:
1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
   so that the authors of foreign data wrappers are aware of the
   problem and can avoid it on their side.
   This would be quite simple; a sketch follows after this list.

2) Instead of one callback that returns all sample rows, have
   a callback that just returns the next table row (or the next
   table row from a subset of the pages of the table in the
   internal case).  This function could be called in a short-lived
   memory context.  Vitter's algorithm for selecting a sample
   and the truncation of excessively long values would then be
   handled in analyze.c.
   This would avoid the problem completely and make it easier
   to write a foreign data wrapper.
   I haven't thought this out completely, and it would require
   bigger changes to analyze.c and the API than are probably
   welcome this close to beta.
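
As a sketch of remedy 1 (illustrative only, not an actual patch; the
comment wording is mine), the change could be as small as moving the
existing definition:

    /* in commands/vacuum.h */

    /*
     * ANALYZE ignores the contents of values wider than this; they are
     * only counted as "too wide" and assumed to be all distinct.
     * Exposed here so that foreign data wrappers can truncate over-wide
     * values in their AcquireSampleRowsFunc instead of shipping them whole.
     */
    #define WIDTH_THRESHOLD 1024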

What is your opinion?

Yours,
Laurenz Albe




Re: Analyzing foreign tables & memory problems

From
Tom Lane
Date:
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
> During ANALYZE, in analyze.c, functions compute_minimal_stats
> and compute_scalar_stats, values whose length exceed
> WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
> other than that they are counted as "too wide rows" and assumed
> to be all different.

> This works fine with regular tables; values exceeding that threshold
> don't get detoasted and won't consume excessive memory.

> With foreign tables the situation is different.  Even though
> values exceeding WIDTH_THRESHOLD won't get used, the complete
> rows will be fetched from the foreign table.  This can easily
> exhaust maintenance_work_mem.

I'm fairly skeptical that this is a real problem, and would prefer not
to complicate wrappers until we see some evidence from the field that
it's worth worrying about.  The WIDTH_THRESHOLD logic was designed a
dozen years ago when common settings for work_mem were a lot smaller
than today.  Moreover, to my mind it's always been about avoiding
detoasting operations as much as saving memory, and we don't have
anything equivalent to that consideration in foreign data wrappers.
        regards, tom lane


Re: Analyzing foreign tables & memory problems

From
"Albe Laurenz"
Date:
Tom Lane wrote:
>> During ANALYZE, in analyze.c, functions compute_minimal_stats
>> and compute_scalar_stats, values whose length exceed
>> WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
>> other than that they are counted as "too wide rows" and assumed
>> to be all different.
>
>> This works fine with regular tables; values exceeding that threshold
>> don't get detoasted and won't consume excessive memory.
>
>> With foreign tables the situation is different.  Even though
>> values exceeding WIDTH_THRESHOLD won't get used, the complete
>> rows will be fetched from the foreign table.  This can easily
>> exhaust maintenance_work_mem.

> I'm fairly skeptical that this is a real problem, and would prefer not
> to complicate wrappers until we see some evidence from the field that
> it's worth worrying about.  The WIDTH_THRESHOLD logic was designed a
> dozen years ago when common settings for work_mem were a lot smaller
> than today.  Moreover, to my mind it's always been about avoiding
> detoasting operations as much as saving memory, and we don't have
> anything equivalent to that consideration in foreign data wrappers.

If I have a table with 100000 rows and default_statistics_target
at 100, then a sample of 30000 rows will be taken (ANALYZE samples
300 rows per unit of statistics target).

If each row contains binary data of 1MB (an image), then the
data structure returned will use about 30 GB of memory, which
will probably exceed maintenance_work_mem.

Or is there a flaw in my reasoning?

Yours,
Laurenz Albe


Re: Analyzing foreign tables & memory problems

From
Tom Lane
Date:
"Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
> Tom Lane wrote:
>> I'm fairly skeptical that this is a real problem, and would prefer not
>> to complicate wrappers until we see some evidence from the field that
>> it's worth worrying about.

> If I have a table with 100000 rows and default_statistics_target
> at 100, then a sample of 30000 rows will be taken.

> If each row contains binary data of 1MB (an Image), then the
> data structure returned will use about 30 GB of memory, which
> will probably exceed maintenance_work_mem.

> Or is there a flaw in my reasoning?

Only that I don't believe this is a real-world scenario for a foreign
table.  If you have a foreign table in which all, or even many, of the
rows are that wide, its performance is going to suck so badly that
you'll soon look for a different schema design anyway.

I don't want to complicate FDWs for this until it's an actual bottleneck
in real applications, which it may never be, and certainly won't be
until we've gone through a few rounds of performance refinement for
basic operations.
        regards, tom lane


Re: Analyzing foreign tables & memory problems

From
Simon Riggs
Date:
On Mon, Apr 30, 2012 at 3:24 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> "Albe Laurenz" <laurenz.albe@wien.gv.at> writes:
>> During ANALYZE, in analyze.c, functions compute_minimal_stats
>> and compute_scalar_stats, values whose length exceed
>> WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
>> other than that they are counted as "too wide rows" and assumed
>> to be all different.
>
>> This works fine with regular tables; values exceeding that threshold
>> don't get detoasted and won't consume excessive memory.
>
>> With foreign tables the situation is different.  Even though
>> values exceeding WIDTH_THRESHOLD won't get used, the complete
>> rows will be fetched from the foreign table.  This can easily
>> exhaust maintenance_work_mem.
>
> I'm fairly skeptical that this is a real problem

AFAIK it's not possible to select all columns from an Oracle database.
If you use an unqualified LONG column as part of the query then you
get an error.

So there are issues with simply requesting data for analysis.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Analyzing foreign tables & memory problems

From
Noah Misch
Date:
On Mon, Apr 30, 2012 at 12:27:45PM +0200, Albe Laurenz wrote:
> During ANALYZE, in analyze.c, functions compute_minimal_stats
> and compute_scalar_stats, values whose length exceed
> WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
> other than that they are counted as "too wide rows" and assumed
> to be all different.
> 
> This works fine with regular tables; values exceeding that threshold
> don't get detoasted and won't consume excessive memory.
> 
> With foreign tables the situation is different.  Even though
> values exceeding WIDTH_THRESHOLD won't get used, the complete
> rows will be fetched from the foreign table.  This can easily
> exhaust maintenance_work_mem.
> 
> A foreign data wrapper has no good way to counter the problem.
> It can return truncated values in ist AcquireSampleRowsFunc,
> but WIDTH_THRESHOLD is private to analyze.c and it's a bad idea
> to hard code a cutoff limit of 1025.
> 
> I can think of two remedies:
> 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
>    so that the authors of foreign data wrappers are aware of the
>    problem and can avoid it on their side.
>    This would be quite simple.

Seems reasonable.  How would the FDW return an indication that a value was
non-NULL but removed due to excess width?

Not all databases can cheaply filter out wide column values; by the time the
remote side has an exact width, the remote I/O damage may already be done.  To
dodge that problem, when a column has "SET STATISTICS 0", the FDW should be
able to completely omit reading it.  (I haven't studied the API needs, if any,
to make that possible.)

> 2) Instead of one callback that returns all sample rows, have
>    a callback that just returns the next table row (or the next
>    table row from a subset of the pages of the table in the
>    internal case).  This function could be called in a short-lived
>    memory context.  Vitter's algorithm for selecting a sample
>    and the truncation of excessively long values would then be
>    handled in analyze.c.
>    This would avoid the problem completely and make it easier
>    to write a foreign data wrapper.
>    I haven't thought this out completely, and it would require
>    bigger changes to analyze.c and the API than are probably
>    welcome this close to beta.

This solves the (in your downthread example) 30 GiB of memory consumption, but
you'll still read 30 GiB on the remote side and ship it all over the network.
To call this fixed, we'll need something like (1) that lets the FDW limit
volume at the remote side.

Thanks,
nm


Re: Analyzing foreign tables & memory problems

From
"Albe Laurenz"
Date:
Simon Riggs wrote:
>>> During ANALYZE, in analyze.c, functions compute_minimal_stats
>>> and compute_scalar_stats, values whose length exceed
>>> WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
>>> other than that they are counted as "too wide rows" and assumed
>>> to be all different.
>>
>>> This works fine with regular tables; values exceeding that threshold
>>> don't get detoasted and won't consume excessive memory.
>>
>>> With foreign tables the situation is different.  Even though
>>> values exceeding WIDTH_THRESHOLD won't get used, the complete
>>> rows will be fetched from the foreign table.  This can easily
>>> exhaust maintenance_work_mem.
>>
>> I'm fairly skeptical that this is a real problem

> AFAIK its not possible to select all columns from an Oracle database.
> If you use an unqualified LONG column as part of the query then you
> get an error.
>
> So there are issues with simply requesting data for analysis.

To go into detail on the specific case of Oracle: I have given up
on LONG, since a) it has been deprecated for a long time and
b) it is not possible to retrieve a LONG column unless you know
in advance how long it is.

But you can have several BLOB and CLOB columns in a table, each
of which can be arbitrarily large and can lead to the problem
I described.

Yours,
Laurenz Albe


Re: Analyzing foreign tables & memory problems

From
"Albe Laurenz"
Date:
Tom Lane wrote:
>>> I'm fairly skeptical that this is a real problem, and would prefer not
>>> to complicate wrappers until we see some evidence from the field that
>>> it's worth worrying about.

>> If I have a table with 100000 rows and default_statistics_target
>> at 100, then a sample of 30000 rows will be taken.

>> If each row contains binary data of 1MB (an Image), then the
>> data structure returned will use about 30 GB of memory, which
>> will probably exceed maintenance_work_mem.

>> Or is there a flaw in my reasoning?

> Only that I don't believe this is a real-world scenario for a foreign
> table.  If you have a foreign table in which all, or even many, of the
> rows are that wide, its performance is going to suck so badly that
> you'll soon look for a different schema design anyway.

Of course it wouldn't work well to SELECT * from such a foreign table,
but it would work well enough to get one or a few rows at a time,
which is probably such a table's purpose in life anyway.

> I don't want to complicate FDWs for this until it's an actual bottleneck
> in real applications, which it may never be, and certainly won't be
> until we've gone through a few rounds of performance refinement for
> basic operations.

I agree that it may not be the right thing to do something invasive
to solve an anticipated problem that may never be one.

So scrap my second idea.  But I think that exposing WIDTH_THRESHOLD
wouldn't be unreasonable, would it?

Yours,
Laurenz Albe


Re: Analyzing foreign tables & memory problems

From
"Albe Laurenz"
Date:
Noah Misch wrote:
>> During ANALYZE, in analyze.c, functions compute_minimal_stats
>> and compute_scalar_stats, values whose length exceed
>> WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
>> other than that they are counted as "too wide rows" and assumed
>> to be all different.
>>
>> This works fine with regular tables;

>> With foreign tables the situation is different.  Even though
>> values exceeding WIDTH_THRESHOLD won't get used, the complete
>> rows will be fetched from the foreign table.  This can easily
>> exhaust maintenance_work_mem.

>> I can think of two remedies:
>> 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
>>    so that the authors of foreign data wrappers are aware of the
>>    problem and can avoid it on their side.
>>    This would be quite simple.

> Seems reasonable.  How would the FDW return an indication that a value was
> non-NULL but removed due to excess width?

The FDW would return a value of length WIDTH_THRESHOLD+1 that is
long enough to be recognized as too long, but not long enough to
cause a problem.
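
As a minimal sketch of what that could look like in an FDW's
AcquireSampleRowsFunc, assuming a text column fetched into a plain
buffer (buf, len, values[] and attnum are hypothetical names from the
surrounding, FDW-specific code):

    if (len > WIDTH_THRESHOLD)
    {
        /*
         * Build a stub of length WIDTH_THRESHOLD + 1: long enough that
         * ANALYZE classifies it as "too wide" and never inspects its
         * contents, but far smaller than the original value.
         */
        text   *stub = (text *) palloc(VARHDRSZ + WIDTH_THRESHOLD + 1);

        SET_VARSIZE(stub, VARHDRSZ + WIDTH_THRESHOLD + 1);
        memcpy(VARDATA(stub), buf, WIDTH_THRESHOLD + 1);
        values[attnum] = PointerGetDatum(stub);
    }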

> Not all databases can cheaply filter out wide column values; by the time the
> remote side has an exact width, the remote I/O damage may already be done.  To
> dodge that problem, when a column has "SET STATISTICS 0", the FDW should be
> able to completely omit reading it.  (I haven't studied the API needs, if any,
> to make that possible.)

Depending on the capabilities of the remote side, an FDW can
do more or less intelligent things to avoid the problem.
But it must know WIDTH_THRESHOLD.

Disabling statistics for a column as a workaround is an
interesting idea, but would be more work for the FDW writer
and the user.

>> 2) Instead of one callback that returns all sample rows, have
>>    a callback that just returns the next table row (or the next
>>    table row from a subset of the pages of the table in the
>>    internal case).  This function could be called in a short-lived
>>    memory context.  Vitter's algorithm for selecting a sample
>>    and the truncation of excessively long values would then be
>>    handled in analyze.c.
>>    This would avoid the problem completely and make it easier
>>    to write a foreign data wrapper.

> This solves the (in your downthread example) 30 GiB of memory consumption, but
> you'll still read 30 GiB on the remote side and ship it all over the network.
> To call this fixed, we'll need something like (1) that lets the FDW limit
> volume at the remote side.

You are right.  I guess the first idea is the more promising one.

Yours,
Laurenz Albe


Re: Analyzing foreign tables & memory problems

From
"Albe Laurenz"
Date:
I wrote:
> Noah Misch wrote:
>>> During ANALYZE, in analyze.c, functions compute_minimal_stats
>>> and compute_scalar_stats, values whose length exceed
>>> WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
>>> other than that they are counted as "too wide rows" and assumed
>>> to be all different.
>>>
>>> This works fine with regular tables;

>>> With foreign tables the situation is different.  Even though
>>> values exceeding WIDTH_THRESHOLD won't get used, the complete
>>> rows will be fetched from the foreign table.  This can easily
>>> exhaust maintenance_work_mem.

>>> I can think of two remedies:
>>> 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
>>>    so that the authors of foreign data wrappers are aware of the
>>>    problem and can avoid it on their side.
>>>    This would be quite simple.

>> Seems reasonable.  How would the FDW return an indication that a
>> value was non-NULL but removed due to excess width?
>
> The FDW would return a value of length WIDTH_THRESHOLD+1 that is
> long enough to be recognized as too long, but not long enough to
> cause a problem.

Here is a simple patch for that.

Yours,
Laurenz Albe

Attachment

Re: Analyzing foreign tables & memory problems

From
Noah Misch
Date:
On Wed, May 02, 2012 at 12:20:39PM +0200, Albe Laurenz wrote:
> >>> 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
> >>>    so that the authors of foreign data wrappers are aware of the
> >>>    problem and can avoid it on their side.
> >>>    This would be quite simple.
> 
> >> Seems reasonable.  How would the FDW return an indication that a
> >> value was non-NULL but removed due to excess width?
> > 
> > The FDW would return a value of length WIDTH_THRESHOLD+1 that is
> > long enough to be recognized as too long, but not long enough to
> > cause a problem.
> 
> Here is a simple patch for that.

It feels to me like an undue hack to ask FDW authors to synthesize such values.
It's easy enough for data types such as text/bytea.  In general, though,
simple truncation may not produce a valid value of the type.  That shouldn't
matter, since the next action taken on the value should be to discard it, but
it's fragile.  Can we do better?

Just thinking out loud, we could provide an "extern Datum AnalyzeWideValue;"
and direct FDW authors to use that particular datum.  It could look like a
toasted datum of external size WIDTH_THRESHOLD+1 but bear va_toastrelid ==
InvalidOid.  Then, if future code movement leads us to actually examine one of
these values, we'll get an early, hard error.
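
A rough sketch of how such a datum could be built, assuming the
toast-pointer macros of that era (SET_VARSIZE_EXTERNAL, VARDATA_EXTERNAL
and TOAST_POINTER_SIZE from tuptoaster.h); the function name is made up
and the code is untested, just to make the idea concrete:

    static struct varlena *
    make_analyze_wide_value(void)
    {
        struct varatt_external toast_pointer;
        struct varlena *result;

        toast_pointer.va_rawsize = WIDTH_THRESHOLD + 1;    /* "too wide" */
        toast_pointer.va_extsize = WIDTH_THRESHOLD + 1;
        toast_pointer.va_valueid = InvalidOid;
        toast_pointer.va_toastrelid = InvalidOid;           /* the marker */

        result = (struct varlena *) palloc(TOAST_POINTER_SIZE);
        SET_VARSIZE_EXTERNAL(result, TOAST_POINTER_SIZE);
        memcpy(VARDATA_EXTERNAL(result), &toast_pointer, sizeof(toast_pointer));
        return result;
    }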

Thanks,
nm


Re: Analyzing foreign tables & memory problems

From
"Albe Laurenz"
Date:
Noah Misch wrote:
>>>>> 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
>>>>>    so that the authors of foreign data wrappers are aware of the
>>>>>    problem and can avoid it on their side.
>>>>>    This would be quite simple.
>>
>>>> Seems reasonable.  How would the FDW return an indication that a
>>>> value was non-NULL but removed due to excess width?
>>>
>>> The FDW would return a value of length WIDTH_THRESHOLD+1 that is
>>> long enough to be recognized as too long, but not long enough to
>>> cause a problem.
>>
>> Here is a simple patch for that.
>
> It feels to me like an undue hack to ask FDW authors to synthesize such values.
> It's easy enough for data types such as text/bytea.  In general, though,
> simple truncation may not produce a valid value of the type.  That shouldn't
> matter, since the next action taken on the value should be to discard it, but
> it's fragile.  Can we do better?
>
> Just thinking out loud, we could provide an "extern Datum AnalyzeWideValue;"
> and direct FDW authors to use that particular datum.  It could look like a
> toasted datum of external size WIDTH_THRESHOLD+1 but bear va_toastrelid ==
> InvalidOid.  Then, if future code movement leads us to actually examine one of
> these values, we'll get an early, hard error.

That would be very convenient indeed.

Even better would be a function
extern Datum createAnalyzeWideValue(integer width)
so that row width calculations could be more accurate.

Yours,
Laurenz Albe


Re: Analyzing foreign tables & memory problems

From
Noah Misch
Date:
On Mon, May 14, 2012 at 09:21:20AM +0200, Albe Laurenz wrote:
> Noah Misch wrote:
> > Just thinking out loud, we could provide an "extern Datum AnalyzeWideValue;"
> > and direct FDW authors to use that particular datum.  It could look like a
> > toasted datum of external size WIDTH_THRESHOLD+1 but bear va_toastrelid ==
> > InvalidOid.  Then, if future code movement leads us to actually examine one of
> > these values, we'll get an early, hard error.
> 
> That would be very convenient indeed.
> 
> Even better would be a function
> extern Datum createAnalyzeWideValue(integer width)
> so that row width calculations could be more accurate.

Yes; good call.