Re: [Proposal] Global temporary tables - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: [Proposal] Global temporary tables
Date
Msg-id 20200109163037.u76k3kdr3pcmrytb@development
In response to Re: [Proposal] Global temporary tables  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
Responses Re: [Proposal] Global temporary tables  (Konstantin Knizhnik <k.knizhnik@postgrespro.ru>)
List pgsql-hackers
On Thu, Jan 09, 2020 at 06:07:46PM +0300, Konstantin Knizhnik wrote:
>
>
>On 06.01.2020 14:01, Tomas Vondra wrote:
>>On Mon, Jan 06, 2020 at 01:04:15PM +0800, 曾文旌(义从) wrote:
>>>In the previous communication
>>>
>>>1 We agreed on the general direction:
>>>1.1 GTT uses local (private) buffers
>>>1.2 no replica access in the first version
>>>
>>
>>OK, good.
>>
>>>2 We feel that GTT needs to maintain statistics, but there is no
>>>agreement on how it will be done.
>>>
>>
>>I certainly agree GTT needs to maintain statistics, otherwise it'll
>>lead to poor query plans. AFAIK the current patch stores the info in a
>>hash table in backend-private memory, and I don't see how else to do
>>that (e.g. storing it in a catalog would cause catalog bloat).
>>
>>FWIW this is one of the reasons why I think just using shared buffers
>>(instead of local ones) is not sufficient to support parallel queries
>>as proposed by Alexander. The workers would not know the stats,
>>breaking planning of queries in PARALLEL SAFE plpgsql functions etc.
>
>
>I do not think the "all or nothing" approach is as good for software
>development as it is for database transactions.

Well, sure. I'm not saying we need to have a perfect solution in v1. I'm
saying if we have two choices:

(1) Use shared buffers even if it means the parallel query plan may be
     arbitrarily bad.

(2) Use private buffers, even if it means no parallel queries with temp
     tables.

Then I'm voting for (2) because it's less likely to break down. I can
imagine allowing parallel queries with GTT when there's no risk of
having to plan in the worker, but that's not there yet.

If we can come up with a reasonable solution for the parallel case, we
can enable it later.

>Yes, if we have a function in PL/pgSQL which performs queries on
>temporary tables, then parallel workers may build inefficient plans
>for these queries due to the lack of statistics.

IMHO that's a pretty awful deficiency, because it essentially means
users may need to disable parallelism for such queries. Which means
we'll get complaints from users, and we'll have to come up with some
sort of solution. I'd rather not be in that position.
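
To make that concrete, here's a minimal sketch of the kind of query I
have in mind. All names are made up, and it assumes the proposed design
(shared buffers plus GLOBAL TEMPORARY syntax) where workers can read
the table at all:

    -- hypothetical schema, using the proposed GLOBAL TEMPORARY syntax
    CREATE GLOBAL TEMPORARY TABLE gtt_orders (id int, amount numeric);

    INSERT INTO gtt_orders
        SELECT i, random() * 100 FROM generate_series(1, 100000) i;

    -- the function body is planned in whichever backend runs it; a
    -- parallel worker has no access to the leader's backend-private
    -- statistics for gtt_orders
    CREATE FUNCTION order_total(p_id int) RETURNS numeric
        LANGUAGE plpgsql PARALLEL SAFE AS $$
    BEGIN
        RETURN (SELECT sum(amount) FROM gtt_orders WHERE id = p_id);
    END;
    $$;

    -- if big_table is scanned in parallel, workers plan and run the
    -- function's query without any stats for gtt_orders
    SELECT order_total(id) FROM big_table;

    -- the workaround users would be pushed into:
    SET max_parallel_workers_per_gather = 0;

Forcing max_parallel_workers_per_gather to 0 for every query touching
a GTT is exactly the kind of workaround I'd like to avoid.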

>From my point of view this is not a pitfall of GTT but a result of the
>lack of a global plan cache in Postgres. And it should be fixed not at
>the GTT level.
>

That doesn't give us a free pass to just ignore the issue. Even if it
really were due to the lack of a global plan cache, the fact is we don't have
that feature, so we have a problem. I mean, if you need infrastructure
that is not available, you either have to implement that infrastructure
or make it work properly without it.

>Also, I have never seen real use cases with such functions, even in
>systems which use temporary tables and stored procedures heavily.
>But there are many other real problems with temp tables (besides those
>already mentioned in this thread).

Oh, I'm sure there are pretty large plpgsql applications, and I'd be
surprised if at least some of those were not affected. And I'm sure
there are apps using UDFs to do all sorts of stuff (e.g. I wonder if
PostGIS would have this issue - IIRC it's using SPI etc.).

The question is whether we should consider existing apps affected,
because they are using regular temporary tables and not GTT. So unless
they switch to GTT there is no regression ...

But even in that case I don't think it's a good idea to treat this as
an acceptable limitation. I admit one of the reasons I think that may
be that statistics and planning are my areas of interest, so I'm not
quite willing to accept incomplete stuff as OK.

>In PgPro/EE we have fixes for some of them, for example:
>
>1. Do not reserve space in the file for temp relations. Right now,
>extending a relation causes a zero page to be written to disk by
>mdextend. This causes useless disk I/O for temp tables, which in most
>cases fit in memory and should not be written to disk at all.
>
>2. Implicitly analyze a temp table immediately after storing data in
>it. Usually tables are analyzed by autovacuum in the background, but
>that doesn't work for temp tables, which are not processed by
>autovacuum and are accessed immediately after being filled with data;
>the lack of statistics may lead to very inefficient plans. We have an
>online_analyze extension which forces an analyze of the table after a
>bulk of data is appended to it. It can be used for normal tables, but
>it is most useful for temp relations.
>
>Unlike the hypothetical example of a parallel-safe function working
>with temp tables, these are real problems observed by some of our
>customers. They apply both to local and global temp tables, which is
>why I do not want to discuss them in the context of GTT.
>

I think those are both interesting issues worth fixing, but I don't
think they make the issue discussed here any less important.
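
(For reference, the manual workaround for (2) today is an explicit
ANALYZE right after the bulk load, since autovacuum never visits temp
tables. Table and column names below are made up:)

    CREATE TEMPORARY TABLE tmp_stage AS
        SELECT * FROM orders WHERE shipped;   -- bulk load

    ANALYZE tmp_stage;  -- autovacuum will never do this for us

    -- this now plans with fresh statistics instead of defaults
    SELECT count(*) FROM tmp_stage WHERE customer_id = 42;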

>
>>
>>>3 Still no one has commented on GTT's transaction information
>>>processing, which includes:
>>>3.1 Does GTT's frozenxid need to be taken care of?
>>>3.2 GTT's clog cleanup
>>>3.3 How to deal with "too old" GTT data
>>>
>>
>>No idea what to do about this.
>>
>
>I wonder what is specific to GTT here?
>The same problem exists for normal (local) temp tables, doesn't it?
>

Not sure. TBH I'm not sure I understand what the issue actually is.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


