Re: Analyzing foreign tables & memory problems - Mailing list pgsql-hackers

From Albe Laurenz
Subject Re: Analyzing foreign tables & memory problems
Date
Msg-id D960CB61B694CF459DCFB4B0128514C2049FCE85@exadv11.host.magwien.gv.at
Whole thread Raw
In response to Analyzing foreign tables & memory problems  ("Albe Laurenz" <laurenz.albe@wien.gv.at>)
Responses Re: Analyzing foreign tables & memory problems  ("Albe Laurenz" <laurenz.albe@wien.gv.at>)
List pgsql-hackers
Noah Misch wrote:
>> During ANALYZE, in analyze.c, functions compute_minimal_stats
>> and compute_scalar_stats, values whose length exceed
>> WIDTH_THRESHOLD (= 1024) are not used for calculating statistics
>> other than that they are counted as "too wide rows" and assumed
>> to be all different.
>>
>> This works fine with regular tables;

>> With foreign tables the situation is different.  Even though
>> values exceeding WIDTH_THRESHOLD won't get used, the complete
>> rows will be fetched from the foreign table.  This can easily
>> exhaust maintenance_work_mem.

>> I can think of two remedies:
>> 1) Expose WIDTH_THRESHOLD in commands/vacuum.h and add documentation
>>    so that the authors of foreign data wrappers are aware of the
>>    problem and can avoid it on their side.
>>    This would be quite simple.

> Seems reasonable.  How would the FDW return an indication that a value was
> non-NULL but removed due to excess width?

The FDW would return a value of length WIDTH_THRESHOLD+1 that is
long enough to be recognized as too long, but not long enough to
cause a problem.

> Not all databases can cheaply filter out wide column values; by the time the
> remote side has an exact width, the remote I/O damage may already be done.  To
> dodge that problem, when a column has "SET STATISTICS 0", the FDW should be
> able to completely omit reading it.  (I haven't studied the API needs, if any,
> to make that possible.)

Depending on the capabilities of the remote side, a FDW can
do more or less intelligent things to avoid the problem.
But it must know WIDTH_THRESHOLD.

Disabling statistics for a column as a workaround is an
interesting idea, but would be more work for the FDW writer
and the user.

>> 2) Instead of one callback that returns all sample rows, have
>>    a callback that just returns the next table row (or the next
>>    table row from a subset of the pages of the table in the
>>    internal case).  This function could be called in a short-lived
>>    memory context.  Vitter's algorithm for selecting a sample
>>    and the truncation of excessively long values would then be
>>    handled in analyze.c.
>>    This would avoid the problem completely and make it easier
>>    to write a foreign data wrapper.

> This solves the (in your downthread example) 30 GiB of memory consumption, but
> you'll still read 30 GiB on the remote side and ship it all over the network.
> To call this fixed, we'll need something like (1) that lets the FDW limit
> volume at the remote side.

You are right.  I guess the first idea is the more promising one.

Yours,
Laurenz Albe


pgsql-hackers by date:

Previous
From: Simon Riggs
Date:
Subject: Re: Future In-Core Replication
Next
From: "David Johnston"
Date:
Subject: Re: precision and scale functions for numeric