Re: Question about optimising (Postgres_)FDW - Mailing list pgsql-hackers

From Hannu Krosing
Subject Re: Question about optimising (Postgres_)FDW
Date
Msg-id 534E82B8.9070205@krosing.net
In response to Re: Question about optimising (Postgres_)FDW  (Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>)
Responses Re: Question about optimising (Postgres_)FDW  (Hannu Krosing <hannu@krosing.net>)
Re: Question about optimising (Postgres_)FDW  (Etsuro Fujita <fujita.etsuro@lab.ntt.co.jp>)
List pgsql-hackers
On 04/16/2014 01:35 PM, Etsuro Fujita wrote:
> (2014/04/16 6:55), Hannu Krosing wrote:
...
>
> Maybe I'm missing something, but I think that you can do what I think
> you'd like to do by the following procedure:
No, what I'd like PostgreSQL to do is the following (sketched in SQL below):

1. select the id set from the local table
2. select the rows from the remote table with WHERE id IN (<set selected in step 1>)
3. then join the original set to the selected set, with any suitable join strategy
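
A minimal SQL sketch of that plan shape, using the onemillion /
onemillion_pgsql tables from your examples (the step 2 query is the
desired, hypothetical pushed-down form, not what 9.3 actually produces):

    -- Step 1: run locally, yielding ~100 ids.
    SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100;

    -- Step 2: the single query I'd like postgres_fdw to send to the
    -- remote side, with the ids from step 1 inlined:
    --   SELECT id, inserted, data FROM public.onemillion
    --    WHERE id IN (<ids selected in step 1>);

    -- Step 3: join the ~100 returned rows back to the local rows with
    -- any suitable strategy (hash join, merge join, ...).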

The things I do not want are

A. selecting all rows from the remote table (this is what your examples below do)

or

B. selecting rows from the remote table by individual selects using "id = $1" (this is something I managed to get by some tweaking of costs; see the sketch below)

as A will always be slow if there are millions of rows in the remote table, and B is slow(ish) once the id set grows beyond a few hundred ids.
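
Roughly the kind of tweaking I mean for B: the option names below
(fdw_tuple_cost, use_remote_estimate) are real postgres_fdw options,
but the exact values are illustrative guesses, not a recommendation:

    -- Make transferring whole-table results look expensive, so the
    -- full remote scan (A) is priced out and the planner falls back to
    -- a nested loop with a parameterized Foreign Scan, i.e. one
    -- "id = $1" query per outer row.  (Use SET instead of ADD if the
    -- option is already set on the server.)
    ALTER SERVER loop
        OPTIONS (ADD fdw_tuple_cost '1.0');

    -- Remote estimates let the planner cost the per-id parameterized
    -- path realistically.
    ALTER FOREIGN TABLE onemillion_pgsql
        OPTIONS (ADD use_remote_estimate 'true');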

I hope this is a better explanation than the one I provided before.

Cheers
Hannu

P.S. I am not sure if this is a limitation of postgres_fdw or of postgres itself.

P.P.S. I tested a little with Multicorn, and PostgreSQL did not request row counts for any IN plans, so it may be that the planner does not consider this kind of plan at all. (Testing was on PostgreSQL 9.3.4.)

>
> postgres=# ALTER SERVER loop OPTIONS (ADD fdw_startup_cost '1000');
> ALTER SERVER
> postgres=# EXPLAIN VERBOSE SELECT * FROM onemillion_pgsql WHERE id in
> (SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100);
>                                           QUERY PLAN
> -----------------------------------------------------------------------------------------------
>  Hash Semi Join  (cost=1023.10..41983.21 rows=100 width=30)
>    Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>    Hash Cond: (onemillion_pgsql.id = onemillion.id)
>    ->  Foreign Scan on public.onemillion_pgsql  (cost=1000.00..39334.00 rows=1000000 width=29)
>          Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>          Remote SQL: SELECT id, inserted, data FROM public.onemillion
>    ->  Hash  (cost=21.85..21.85 rows=100 width=4)
>          Output: onemillion.id
>          ->  Limit  (cost=0.00..20.85 rows=100 width=4)
>                Output: onemillion.id
>                ->  Seq Scan on public.onemillion  (cost=0.00..20834.00 rows=99918 width=4)
>                      Output: onemillion.id
>                      Filter: (onemillion.data > '0.9'::text)
>  Planning time: 0.690 ms
> (14 rows)
>
> or, as Tom mentioned, by disabling the use_remote_estimate option:
>
> postgres=# ALTER FOREIGN TABLE onemillion_pgsql OPTIONS (SET
> use_remote_estimate 'false');
> ALTER FOREIGN TABLE
> postgres=# EXPLAIN VERBOSE SELECT * FROM onemillion_pgsql WHERE id in
> (SELECT id FROM onemillion WHERE data > '0.9' LIMIT 100);
>                                           QUERY PLAN
> ----------------------------------------------------------------------------------------------
>  Hash Semi Join  (cost=123.10..41083.21 rows=100 width=30)
>    Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>    Hash Cond: (onemillion_pgsql.id = onemillion.id)
>    ->  Foreign Scan on public.onemillion_pgsql  (cost=100.00..38434.00 rows=1000000 width=30)
>          Output: onemillion_pgsql.id, onemillion_pgsql.inserted, onemillion_pgsql.data
>          Remote SQL: SELECT id, inserted, data FROM public.onemillion
>    ->  Hash  (cost=21.85..21.85 rows=100 width=4)
>          Output: onemillion.id
>          ->  Limit  (cost=0.00..20.85 rows=100 width=4)
>                Output: onemillion.id
>                ->  Seq Scan on public.onemillion  (cost=0.00..20834.00 rows=99918 width=4)
>                      Output: onemillion.id
>                      Filter: (onemillion.data > '0.9'::text)
>  Planning time: 0.215 ms
> (14 rows)
>
> Thanks,
>
> Best regards,
> Etsuro Fujita
>
>



