Thread: Query Performance / Planner estimate off

Query Performance / Planner estimate off

From
Mats Julian Olsen
Date:
Dear Postgres community,

I'm looking for some help to manage queries against two large tables.

Context:
We run a relatively large postgresql instance (5TB, 32 vCPU, 120GB RAM) with a hybrid transactional/analytical workload. Data is written in batches every 15 seconds or so, and all queryable tables are append-only (we never update or delete). Our users can run analytical queries on top of these tables.

We recently came across a series of troublesome queries one of which I'll dive into here.

Please see the following gist for both the query we run and the \d+ output: https://gist.github.com/mewwts/9f11ae5e6a5951593b8999559f5418cf.

The tables in question are:
- `ethereum.transactions`: 833M rows, partitioned, 171M rows after WHERE
- `uniswap_v2."Pair_evt_Swap"`: 12M rows, not partitioned, 12M rows after WHERE

The crux of our issue is that the query planner chooses a nested loop join for this query, essentially making this query (and other queries) take a very long time to complete. In contrast, by toggling `enable_nestloop` and `enable_seqscan` off we can take the total runtime down from 16 minutes to 2 minutes.

1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx 
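
(For reference, plans 2 and 3 above were produced with session-level toggles, roughly like this; `<query>` stands for the full query in the gist:)

    SET enable_nestloop = off;
    EXPLAIN ANALYZE <query>;   -- plan 2
    SET enable_seqscan = off;
    EXPLAIN ANALYZE <query>;   -- plan 3
    RESET enable_nestloop;
    RESET enable_seqscan;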

How can I get Postgres not to loop over 12M rows?

Let me know if there is anything I left out here that would be useful for further debugging.

--
Mats
CTO @ Dune Analytics

Re: Query Performance / Planner estimate off

From
David Rowley
Date:
On Tue, 20 Oct 2020 at 22:38, Mats Julian Olsen <mats@duneanalytics.com> wrote:
>
> The crux of our issue is that the query planner chooses a nested loop join for this query. Essentially making this
> query (and other queries) take a very long time to complete. In contrast, by toggling `enable_nestloop` and
> `enable_seqscan` off we can take the total runtime down from 16 minutes to 2 minutes.
>
> 1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
> 2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
> 3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx
>
> How can I get Postgres not to loop over 12M rows?

You'll likely want to look at what random_page_cost is set to. If the
planner is preferring nested loops then it may be too low.  You'll
also want to see if effective_cache_size is set to something
realistic.  Higher values of that will prefer nested loops like this.

You may also want to reduce max_parallel_workers_per_gather.  It looks
like you're not getting your parallel workers as often as you'd like.
If the planner chooses a plan thinking it's going to get some workers
and gets none, then that plan may be inferior to the one that the planner
would have chosen if it had known the workers would be unavailable.
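
For example, you could experiment in a single session before changing anything globally. A minimal sketch, with purely illustrative values (not recommendations):

    SET random_page_cost = 2.0;
    SET effective_cache_size = '48GB';         -- something realistic for the machine
    SET max_parallel_workers_per_gather = 2;
    EXPLAIN <your query>;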

> Let me know if there is anything I left out here that would be useful for further debugging.

select name, setting from pg_settings
where category like 'Query Tuning%' and source <> 'default';
select version();

would be useful.

David



Re: Query Performance / Planner estimate off

From
Mats Julian Olsen
Date:
On Tue, Oct 20, 2020 at 9:50 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 20 Oct 2020 at 22:38, Mats Julian Olsen <mats@duneanalytics.com> wrote:
>
> The crux of our issue is that the query planner chooses a nested loop join for this query. Essentially making this query (and other queries) take a very long time to complete. In contrast, by toggling `enable_nestloop` and `enable_seqscan` off we can take the total runtime down from 16 minutes to 2 minutes.
>
> 1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
> 2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
> 3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx
>
> How can I get Postgres not to loop over 12M rows?

You'll likely want to look at what random_page_cost is set to. If the
planner is preferring nested loops then it may be too low.  You'll
also want to see if effective_cache_size is set to something
realistic.  Higher values of that will prefer nested loops like this.

random_page_cost is 1.1 and effective_cache_size is '60GB' (listed in the gist). random_page_cost may be too low?
 
You may also want to reduce max_parallel_workers_per_gather.  It looks
like you're not getting your parallel workers as often as you'd like.
If the planner chooses a plan thinking it's going to get some workers
and gets none, then that plan may be inferior the one that the planner
would have chosen if it had known the workers would be unavailable.

Interesting, here are the values for those:
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
 

> Let me know if there is anything I left out here that would be useful for further debugging.

select name,setting from pg_Settings where category like 'Query
Tuning%' and source <> 'default';
select version();

default_statistics_target = 500
effective_cache_size = 7864320
random_page_cost = 1.1
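
(Note: pg_settings reports effective_cache_size in 8kB pages, so 7864320 corresponds to the 60GB above. A quick sanity check, assuming the default 8kB block size:)

    select name, setting, unit, pg_size_pretty(setting::bigint * 8192) as size
    from pg_settings
    where name = 'effective_cache_size';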
 
PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg19.10+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008, 64-bit

would be useful.

David

Thanks David, see above for more information.

--
Mats
CTO @ Dune Analytics

Re: Query Performance / Planner estimate off

From
Pavel Stehule
Date:


On Tue, 20 Oct 2020 at 11:59, Mats Julian Olsen <mats@duneanalytics.com> wrote:
On Tue, Oct 20, 2020 at 9:50 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 20 Oct 2020 at 22:38, Mats Julian Olsen <mats@duneanalytics.com> wrote:
>
> The crux of our issue is that the query planner chooses a nested loop join for this query. Essentially making this query (and other queries) take a very long time to complete. In contrast, by toggling `enable_nestloop` and `enable_seqscan` off we can take the total runtime down from 16 minutes to 2 minutes.
>
> 1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
> 2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
> 3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx
>
> How can I get Postgres not to loop over 12M rows?

You'll likely want to look at what random_page_cost is set to. If the
planner is preferring nested loops then it may be too low.  You'll
also want to see if effective_cache_size is set to something
realistic.  Higher values of that will prefer nested loops like this.

random_page_cost is 1.1 and effective_cache_size is '60GB' (listed in the gist). random_page_cost may be too low?

random_page_cost 2 is safer - the value 1.5 is a little bit aggressive for me.

 
You may also want to reduce max_parallel_workers_per_gather.  It looks
like you're not getting your parallel workers as often as you'd like.
If the planner chooses a plan thinking it's going to get some workers
and gets none, then that plan may be inferior the one that the planner
would have chosen if it had known the workers would be unavailable.

Interesting, here are the values for those:
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
 

> Let me know if there is anything I left out here that would be useful for further debugging.

select name,setting from pg_Settings where category like 'Query
Tuning%' and source <> 'default';
select version();

default_statistics_target = 500
effective_cache_size = 7864320
random_page_cost = 1.1
 
PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg19.10+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008, 64-bit

would be useful.

David

Thanks David, see above for more information.

--
Mats
CTO @ Dune Analytics

Re: Query Performance / Planner estimate off

From
Mats Julian Olsen
Date:


On Tue, Oct 20, 2020 at 10:50 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:


On Tue, 20 Oct 2020 at 11:59, Mats Julian Olsen <mats@duneanalytics.com> wrote:
On Tue, Oct 20, 2020 at 9:50 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 20 Oct 2020 at 22:38, Mats Julian Olsen <mats@duneanalytics.com> wrote:
>
> The crux of our issue is that the query planner chooses a nested loop join for this query. Essentially making this query (and other queries) take a very long time to complete. In contrast, by toggling `enable_nestloop` and `enable_seqscan` off we can take the total runtime down from 16 minutes to 2 minutes.
>
> 1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
> 2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
> 3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx
>
> How can I get Postgres not to loop over 12M rows?

You'll likely want to look at what random_page_cost is set to. If the
planner is preferring nested loops then it may be too low.  You'll
also want to see if effective_cache_size is set to something
realistic.  Higher values of that will prefer nested loops like this.

random_page_cost is 1.1 and effective_cache_size is '60GB' (listed in the gist). random_page_cost may be too low?

random_page_cost 2 is safer - the value 1.5 is a little bit aggressive for me.

Thanks Pavel. I tried changing random_page_cost from 1.1 to 2, to 3... all the way up to 10. All values resulted in the same query plan, except for 10, which then executed a parallel hash join (however with sequential scans) https://explain.depesz.com/s/Srcb.

10 seems like way too high a value for random_page_cost, though?


 
You may also want to reduce max_parallel_workers_per_gather.  It looks
like you're not getting your parallel workers as often as you'd like.
If the planner chooses a plan thinking it's going to get some workers
and gets none, then that plan may be inferior the one that the planner
would have chosen if it had known the workers would be unavailable.

Interesting, here are the values for those:
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
 

> Let me know if there is anything I left out here that would be useful for further debugging.

select name,setting from pg_Settings where category like 'Query
Tuning%' and source <> 'default';
select version();

default_statistics_target = 500
effective_cache_size = 7864320
random_page_cost = 1.1
 
PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg19.10+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008, 64-bit

would be useful.

David

Thanks David, see above for more information.

--
Mats
CTO @ Dune Analytics


--
Mats
CTO @ Dune Analytics

Re: Query Performance / Planner estimate off

From
Pavel Stehule
Date:


On Tue, 20 Oct 2020 at 13:09, Mats Julian Olsen <mats@duneanalytics.com> wrote:


On Tue, Oct 20, 2020 at 10:50 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:


On Tue, 20 Oct 2020 at 11:59, Mats Julian Olsen <mats@duneanalytics.com> wrote:
On Tue, Oct 20, 2020 at 9:50 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 20 Oct 2020 at 22:38, Mats Julian Olsen <mats@duneanalytics.com> wrote:
>
> The crux of our issue is that the query planner chooses a nested loop join for this query. Essentially making this query (and other queries) take a very long time to complete. In contrast, by toggling `enable_nestloop` and `enable_seqscan` off we can take the total runtime down from 16 minutes to 2 minutes.
>
> 1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
> 2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
> 3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx
>
> How can I get Postgres not to loop over 12M rows?

You'll likely want to look at what random_page_cost is set to. If the
planner is preferring nested loops then it may be too low.  You'll
also want to see if effective_cache_size is set to something
realistic.  Higher values of that will prefer nested loops like this.

random_page_cost is 1.1 and effective_cache_size is '60GB' (listed in the gist). random_page_cost may be too low?

random_page_cost 2 is safer - the value 1.5 is a little bit aggressive for me.

Thanks Pavel. I tried changing random_page_cost from 1.1 to 2, to 3... all the way up to 10. All values resulted in the same query plan, except for 10, which then executed a parallel hash join (however with sequential scans) https://explain.depesz.com/s/Srcb.

10 seems like a way too high value for random_page_cost though?

It is not usual, but I know of analytics cases with this value. But maybe effective_cache_size is too high.


 
You may also want to reduce max_parallel_workers_per_gather.  It looks
like you're not getting your parallel workers as often as you'd like.
If the planner chooses a plan thinking it's going to get some workers
and gets none, then that plan may be inferior the one that the planner
would have chosen if it had known the workers would be unavailable.

Interesting, here are the values for those:
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
 

> Let me know if there is anything I left out here that would be useful for further debugging.

select name,setting from pg_Settings where category like 'Query
Tuning%' and source <> 'default';
select version();

default_statistics_target = 500
effective_cache_size = 7864320
random_page_cost = 1.1
 
PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg19.10+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008, 64-bit

would be useful.

David

Thanks David, see above for more information.

--
Mats
CTO @ Dune Analytics


--
Mats
CTO @ Dune Analytics

Re: Query Performance / Planner estimate off

From
Mats Julian Olsen
Date:


On Tue, Oct 20, 2020 at 11:16 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:


On Tue, 20 Oct 2020 at 13:09, Mats Julian Olsen <mats@duneanalytics.com> wrote:


On Tue, Oct 20, 2020 at 10:50 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:


On Tue, 20 Oct 2020 at 11:59, Mats Julian Olsen <mats@duneanalytics.com> wrote:
On Tue, Oct 20, 2020 at 9:50 AM David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 20 Oct 2020 at 22:38, Mats Julian Olsen <mats@duneanalytics.com> wrote:
>
> The crux of our issue is that the query planner chooses a nested loop join for this query. Essentially making this query (and other queries) take a very long time to complete. In contrast, by toggling `enable_nestloop` and `enable_seqscan` off we can take the total runtime down from 16 minutes to 2 minutes.
>
> 1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
> 2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
> 3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx
>
> How can I get Postgres not to loop over 12M rows?

You'll likely want to look at what random_page_cost is set to. If the
planner is preferring nested loops then it may be too low.  You'll
also want to see if effective_cache_size is set to something
realistic.  Higher values of that will prefer nested loops like this.

random_page_cost is 1.1 and effective_cache_size is '60GB' (listed in the gist). random_page_cost may be too low?

random_page_cost 2 is safer - the value 1.5 is a little bit aggressive for me.

Thanks Pavel. I tried changing random_page_cost from 1.1 to 2, to 3... all the way up to 10. All values resulted in the same query plan, except for 10, which then executed a parallel hash join (however with sequential scans) https://explain.depesz.com/s/Srcb.

10 seems like a way too high value for random_page_cost though?

it is not usual, but I know about analytics cases where is this value. But maybe  effective_cache_size is too high.

Changing effective_cache_size from 10GB up to 60GB does not affect the nested loop part of this query plan. It does alter the inner side of the loop from a sequential scan (low cache) to an index scan (high cache).
 


 
You may also want to reduce max_parallel_workers_per_gather.  It looks
like you're not getting your parallel workers as often as you'd like.
If the planner chooses a plan thinking it's going to get some workers
and gets none, then that plan may be inferior the one that the planner
would have chosen if it had known the workers would be unavailable.

Interesting, here are the values for those:
max_parallel_workers = 8
max_parallel_workers_per_gather = 4
 

> Let me know if there is anything I left out here that would be useful for further debugging.

select name,setting from pg_Settings where category like 'Query
Tuning%' and source <> 'default';
select version();

default_statistics_target = 500
effective_cache_size = 7864320
random_page_cost = 1.1
 
PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg19.10+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008, 64-bit

would be useful.

David

Thanks David, see above for more information.

--
Mats
CTO @ Dune Analytics


--
Mats
CTO @ Dune Analytics


--
Mats
CTO @ Dune Analytics

Re: Query Performance / Planner estimate off

From
Victor Yegorov
Date:
On Tue, 20 Oct 2020 at 11:38, Mats Julian Olsen <mats@duneanalytics.com> wrote:
I'm looking for some help to manage queries against two large tables.

Can you tell us the version you're currently running and the output of this query, please?

    select name,setting,source from pg_settings where source not in ('default','override'); 

--
Victor Yegorov

Re: Query Performance / Planner estimate off

From
Victor Yegorov
Date:
On Tue, 20 Oct 2020 at 11:38, Mats Julian Olsen <mats@duneanalytics.com> wrote:
I'm looking for some help to manage queries against two large tables.

Also, can you enable `track_io_timing` (no restart required) and provide output of `EXPLAIN (analyze, buffers, settings)` for all 3 variants, please? 
(I assume you're on 12+.)
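
Something along these lines would do (a sketch; `<query>` stands for your query, and changing track_io_timing needs superuser or ALTER SYSTEM plus a reload, though no restart):

    SET track_io_timing = on;
    EXPLAIN (ANALYZE, BUFFERS, SETTINGS) <query>;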

--
Victor Yegorov

Re: Query Performance / Planner estimate off

From
Sushant Pawar
Date:
Looping in the main group ID.

Regards
Sushant

On Tue, Oct 20, 2020 at 6:49 PM Sushant Pawar <sushantxp@gmail.com> wrote:

On Tue, Oct 20, 2020 at 3:08 PM Mats Julian Olsen <mats@duneanalytics.com> wrote:
Dear Postgres community,

I'm looking for some help to manage queries against two large tables.

Context:
We run a relatively large postgresql instance (5TB, 32 vCPU, 120GB RAM) with a hybrid transactional/analytical workload. Data is written in batches every 15 seconds or so, and all queryable tables are append-only (we never update or delete). Our users can run analytical queries on top of these tables.

We recently came across a series of troublesome queries one of which I'll dive into here.

Please see the following gist for both the query we run and the \d+ output: https://gist.github.com/mewwts/9f11ae5e6a5951593b8999559f5418cf.

The tables in question are:
- `ethereum.transactions`: 833M rows, partitioned, 171M rows after WHERE
- `uniswap_v2."Pair_evt_Swap"`: 12M rows, not partitioned, 12M rows after WHERE

The crux of our issue is that the query planner chooses a nested loop join for this query. Essentially making this query (and other queries) take a very long time to complete. In contrast, by toggling `enable_nestloop` and `enable_seqscan` off we can take the total runtime down from 16 minutes to 2 minutes.

1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx 

The estimated cost of the default (vanilla) plan is much lower than that of the 3rd plan with nested loops and seqscans set to off. As JIT is enabled, it seems the planner selects the plan with the least cost and ends up with the plan that takes more time to execute. Can you try running this query with jit=off in the session and see if it then selects the plan with the least execution time?
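
For example, in the same session (a sketch only; `<query>` is the query from the gist):

    SET jit = off;
    EXPLAIN (ANALYZE) <query>;
    RESET jit;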

How can I get Postgres not to loop over 12M rows?

Let me know if there is anything I left out here that would be useful for further debugging.

--
Regards
    Sushant 

Re: Query Performance / Planner estimate off

From
Mats Olsen
Date:


On 10/20/20 3:04 PM, Victor Yegorov wrote:
On Tue, 20 Oct 2020 at 11:38, Mats Julian Olsen <mats@duneanalytics.com> wrote:
I'm looking for some help to manage queries against two large tables.

Can you tell the version you're running currently and the output of this query, please?

    select name,setting,source from pg_settings where source not in ('default','override'); 

Running "PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg19.10+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008, 64-bit"

Updated the gist to include the results from pg_settings. Here's the direct link https://gist.githubusercontent.com/mewwts/9f11ae5e6a5951593b8999559f5418cf/raw/e5deebbbb48680e04570bec4e9a816fa009da34f/pg_settings

Re: Query Performance / Planner estimate off

From
Mats Olsen
Date:

On 10/20/20 11:37 AM, Mats Julian Olsen wrote:

Dear Postgres community,

I'm looking for some help to manage queries against two large tables.

Context:
We run a relatively large postgresql instance (5TB, 32 vCPU, 120GB RAM) with a hybrid transactional/analytical workload. Data is written in batches every 15 seconds or so, and all queryable tables are append-only (we never update or delete). Our users can run analytical queries on top of these tables.

We recently came across a series of troublesome queries one of which I'll dive into here.

Please see the following gist for both the query we run and the \d+ output: https://gist.github.com/mewwts/9f11ae5e6a5951593b8999559f5418cf.

The tables in question are:
- `ethereum.transactions`: 833M rows, partitioned, 171M rows after WHERE
- `uniswap_v2."Pair_evt_Swap"`: 12M rows, not partitioned, 12M rows after WHERE

The query plans I submitted were querying the table `uniswap_v2."Pair_evt_Mint"`, which has 560k rows before and after WHERE and is also not partitioned. Apologies for the inconsistency, but as I mentioned the same performance problem holds when using `uniswap_v2."Pair_evt_Swap"` (even worse due to its size).


The crux of our issue is that the query planner chooses a nested loop join for this query. Essentially making this query (and other queries) take a very long time to complete. In contrast, by toggling `enable_nestloop` and `enable_seqscan` off we can take the total runtime down from 16 minutes to 2 minutes.

1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx 

How can I get Postgres not to loop over 12M rows?

Let me know if there is anything I left out here that would be useful for further debugging.

--
Mats
CTO @ Dune Analytics

Re: Query Performance / Planner estimate off

From
Victor Yegorov
Date:
On Tue, 20 Oct 2020 at 16:50, Mats Olsen <mats@duneanalytics.com> wrote:

On 10/20/20 3:04 PM, Victor Yegorov wrote:

On Tue, 20 Oct 2020 at 11:38, Mats Julian Olsen <mats@duneanalytics.com> wrote:
I'm looking for some help to manage queries against two large tables.

Can you tell the version you're running currently and the output of this query, please?

    select name,setting,source from pg_settings where source not in ('default','override'); 

Running "PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg19.10+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008, 64-bit"

Updated the gist to include the results from pg_settings. Here's the direct link https://gist.githubusercontent.com/mewwts/9f11ae5e6a5951593b8999559f5418cf/raw/e5deebbbb48680e04570bec4e9a816fa009da34f/pg_settings

It looks like the indexes currently chosen by the planner don't quite fit your query.

I would create the following index (if it's possible to update the schema):
   ON uniswap_v2."Pair_evt_Mint" (evt_tx_hash, evt_block_time)

Same for the second table; it looks like
  ON ethereum.transactions (hash, block_time)
is a better fit for your query. In fact, I do not think the `transactions_block_number_time` index is used frequently, 'cos the second column of the index is the partitioning key.

Currently the planner wants to go via indexes 'cos you've made random access really cheap compared to sequential access (and your findings show this).
Perhaps on NVMe disks this could work, but in your case you need to find the real bottleneck (therefore I asked for buffers).

I would set `random_page_cost` to at least 2.5 with your numbers. Also, I would check the DB and indexes for bloat (just a guess for now, 'cos your plans miss buffers figures).
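
A minimal sketch of what I mean (index names are just examples, and note that in v12 CONCURRENTLY cannot be used on the partitioned parent itself):

    -- non-partitioned table: can be built without blocking writes
    CREATE INDEX CONCURRENTLY pair_evt_mint_tx_hash_block_time_idx
        ON uniswap_v2."Pair_evt_Mint" (evt_tx_hash, evt_block_time);

    -- partitioned table: either build plainly on the parent (takes a lock)
    -- or build per-partition concurrently and attach
    CREATE INDEX transactions_hash_block_time_idx
        ON ethereum.transactions (hash, block_time);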


--
Victor Yegorov

Re: Query Performance / Planner estimate off

From
Mats Olsen
Date:


On 10/20/20 3:40 PM, Sushant Pawar wrote:
Looping in the main group ID.

Regards
Sushant

On Tue, Oct 20, 2020 at 6:49 PM Sushant Pawar <sushantxp@gmail.com> wrote:

On Tue, Oct 20, 2020 at 3:08 PM Mats Julian Olsen <mats@duneanalytics.com> wrote:
Dear Postgres community,

I'm looking for some help to manage queries against two large tables.

Context:
We run a relatively large postgresql instance (5TB, 32 vCPU, 120GB RAM) with a hybrid transactional/analytical workload. Data is written in batches every 15 seconds or so, and all queryable tables are append-only (we never update or delete). Our users can run analytical queries on top of these tables.

We recently came across a series of troublesome queries one of which I'll dive into here.

Please see the following gist for both the query we run and the \d+ output: https://gist.github.com/mewwts/9f11ae5e6a5951593b8999559f5418cf.

The tables in question are:
- `ethereum.transactions`: 833M rows, partitioned, 171M rows after WHERE
- `uniswap_v2."Pair_evt_Swap"`: 12M rows, not partitioned, 12M rows after WHERE

The crux of our issue is that the query planner chooses a nested loop join for this query. Essentially making this query (and other queries) take a very long time to complete. In contrast, by toggling `enable_nestloop` and `enable_seqscan` off we can take the total runtime down from 16 minutes to 2 minutes.

1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx 

The cost of a query while using the default Vanila plan is very less compared to the 3rd plan with nested loop and seqscan  being set to off.  As the JIT is enabled, it seems the planner tries to select the plan with the least cost and going for the plan which is taking more time of execution. Can you try running this query with JIT=off in the session and see if it selects the plan with the least time for execution?

Thank you for your reply. Here's the result using set jit=off; https://explain.depesz.com/s/rpKc. It's essentially the same plan as the initial one.


How can I get Postgres not to loop over 12M rows?

Let me know if there is anything I left out here that would be useful for further debugging.

--
Regards
    Sushant 

Re: Query Performance / Planner estimate off

From
Mats Olsen
Date:


On 10/20/20 3:22 PM, Victor Yegorov wrote:
On Tue, 20 Oct 2020 at 11:38, Mats Julian Olsen <mats@duneanalytics.com> wrote:
I'm looking for some help to manage queries against two large tables.

Also, can you enable `track_io_timing` (no restart required) and provide output of `EXPLAIN (analyze, buffers, settings)` for all 3 variants, please? 
(I assume you're on 12+.)

Thanks! Yes on 12.2. Here's the output:

vanilla: https://explain.depesz.com/s/Ktrd

set enable_nestloop=off: https://explain.depesz.com/s/mvSD

set enable_nestloop=off; set enable_seqscan=off: https://explain.depesz.com/s/XIDo

Are these helpful?


--
Victor Yegorov

Re: Query Performance / Planner estimate off

From
Mats Olsen
Date:


On 10/20/20 6:51 PM, Victor Yegorov wrote:
On Tue, 20 Oct 2020 at 16:50, Mats Olsen <mats@duneanalytics.com> wrote:

On 10/20/20 3:04 PM, Victor Yegorov wrote:

On Tue, 20 Oct 2020 at 11:38, Mats Julian Olsen <mats@duneanalytics.com> wrote:
I'm looking for some help to manage queries against two large tables.

Can you tell the version you're running currently and the output of this query, please?

    select name,setting,source from pg_settings where source not in ('default','override'); 

Running "PostgreSQL 12.2 (Ubuntu 12.2-2.pgdg19.10+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 9.2.1-9ubuntu2) 9.2.1 20191008, 64-bit"

Updated the gist to include the results from pg_settings. Here's the direct link https://gist.githubusercontent.com/mewwts/9f11ae5e6a5951593b8999559f5418cf/raw/e5deebbbb48680e04570bec4e9a816fa009da34f/pg_settings

It looks like indexes currently chosen by the planner don't quite fit your query.

I would create the following index (if it's possible to update schema):
   ON "uniswap_v2.Pair_evt_Mint" (evt_tx_hash, evt_block_time)
I'll try to add this.

Same for the second table, looks like
  ON "ethereum.transactions" (hash, block_time)
is a better fit for your query. In fact, I do not think `transactions_block_number_time` index is used frequently, 'cos second column of the index is a partitioning key.
I'll see if I can add it. This table is huge so normally we only make changes to these when we redeploy the database.

Currently planner wants to go via indexes 'cos you've made random access really cheap compared to sequential one (and your findings shows this).
Perhaps on a NVMe disks this could work, but in your case you need to find the real bottleneck (therefore I asked for buffers).

I would set `random_page_cost` to a 2.5 at least with your numbers. Also, I would check DB and indexes for bloat (just a guess now, 'cos your plans miss buffers figures)

Yeah, 1.1 seems way too low.

Here's the output of the explain (analyze, buffers, settings) you asked for:

vanilla: https://explain.depesz.com/s/Ktrd

set enable_nestloop=off: https://explain.depesz.com/s/mvSD

set enable_nestloop=off; set enable_seqscan=off: https://explain.depesz.com/s/XIDo




--
Victor Yegorov

Re: Query Performance / Planner estimate off

From
Sebastian Dressler
Date:
Hi Mats,

On 20. Oct 2020, at 11:37, Mats Julian Olsen <mats@duneanalytics.com> wrote:

[...]

1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx 

How can I get Postgres not to loop over 12M rows?

I looked at the plans and your config and there are some thoughts I'm having:

- The row estimate is off, as you possibly noticed. This can be possibly solved by raising `default_statistics_target` to e.g. 2500 (we typically use that) and run ANALYZE

- I however think that the misestimate might be caused by the evt_tx_hash being of type bytea. I believe that PG cannot estimate this very well for JOINs and will rather pick row numbers too low. Hence the nested loop is picked and there might be no way around this. I have experienced similar things when applying JOINs on VARCHAR with e.g. more than 3 fields for comparison.

- Other things to look into:

    - work_mem seems too low to me with 56MB, consider raising this to the GB range to avoid disk-based operations
    - min_parallel_table_scan_size - try 0
    - parallel_setup_cost (default 1000, maybe try 500)
    - parallel_tuple_cost (default 1.0, maybe try 0.1)
    - random_page_cost (as mentioned consider raising this maybe much higher, factor 10 or sth like this) or (typically) seq_page_cost can be possibly much lower (0.1, 0.01) depending on your storage

I hope this helps to get to a parallel plan without setting `enable_nestloop = off`. EXPLAIN should be enough already to see the difference.
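
For example, session-level and mirroring the values above (rough starting points only, not tuned recommendations):

    SET work_mem = '1GB';
    SET min_parallel_table_scan_size = 0;
    SET parallel_setup_cost = 500;
    SET parallel_tuple_cost = 0.1;
    SET random_page_cost = 10;    -- or alternatively lower seq_page_cost, e.g. 0.1
    EXPLAIN <query>;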

Best,
Sebastian

--

Sebastian Dressler, Solution Architect 
+49 30 994 0496 72 | sebastian@swarm64.com 

Swarm64 AS
Parkveien 41 B | 0258 Oslo | Norway
Registered at Brønnøysundregistrene in Norway under Org.-Number 911 662 787
CEO/Geschäftsführer (Daglig Leder): Thomas Richter; Chairman/Vorsitzender (Styrets Leder): Dr. Sverre Munck 

Swarm64 AS Zweigstelle Hive
Ullsteinstr. 120 | 12109 Berlin | Germany
Registered at Amtsgericht Charlottenburg - HRB 154382 B 
 


Re: Query Performance / Planner estimate off

From
Mats Olsen
Date:


On 10/21/20 2:38 PM, Sebastian Dressler wrote:
Hi Mats,

On 20. Oct 2020, at 11:37, Mats Julian Olsen <mats@duneanalytics.com> wrote:

[...]

1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx 

How can I get Postgres not to loop over 12M rows?

I looked at the plans and your config and there are some thoughts I'm having:

- The row estimate is off, as you possibly noticed. This can be possibly solved by raising `default_statistics_target` to e.g. 2500 (we typically use that) and run ANALYZE
I've `set default_statistics_target=2500` and ran ANALYZE on both tables involved; unfortunately the plan is the same. The columns we use for joining here are hashes and we expect very few duplicates in the tables. Hence I think extended statistics (storing most common values and histogram bounds) aren't useful for this kind of data. Would you say the same thing?
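
(Concretely, roughly what I ran, session-level:)

    SET default_statistics_target = 2500;
    ANALYZE uniswap_v2."Pair_evt_Mint";
    ANALYZE ethereum.transactions;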

- I however think that the misestimate might be caused by the evt_tx_hash being of type bytea. I believe that PG cannot estimate this very well for JOINs and will rather pick row numbers too low. Hence the nested loop is picked and there might be no way around this. I have experienced similar things when applying JOINs on VARCHAR with e.g. more than 3 fields for comparison.

This is very interesting, and I have never heard of issues with using `bytea` for joins. Our entire database is filled with them, as we deal with hashes of different lengths. In fact I would estimate that 60% of our columns are byteas. My intuition says it's better to store the hashes as byte arrays rather than `text` fields, as you can compare the raw bytes directly without encoding first? Do you have any references for this?

Alternatively, since I know the length of the hashes in advance, I could've used `varchar(n)`, but I don't think there's any gains to be had in postgres by doing that? Something like `bytea(n)` would also have been interesting, had postgres been able to exploit that information.

- Other things to look into:

    - work_mem seems too low to me with 56MB, consider raising this to the GB range to avoid disk-based operations
    - min_parallel_table_scan_size - try 0
    - parallel_setup_cost (default 1000, maybe try 500)
    - parallel_tuple_cost (default 1.0, maybe try 0.1)
    - random_page_cost (as mentioned consider raising this maybe much higher, factor 10 or sth like this) or (typically) seq_page_cost can be possibly much lower (0.1, 0.01) depending on your storage

I've tried various settings of these parameters now, and unfortunately the only parameter that alters the query plan is the last one (random_page_cost), which also has the side effect of (almost) forcing sequential scans for most queries as far as I understand? Our storage is Google Cloud pd-ssd.

Thank you so much for your response, I'm looking forward to keeping the discussion going.


I hope this helps to get to a parallel plan without setting `nested_loop = off`. EXPLAIN should be enough already to see the difference.

Best,
Sebastian

--

Sebastian Dressler, Solution Architect 
+49 30 994 0496 72 | sebastian@swarm64.com 

Swarm64 AS
Parkveien 41 B | 0258 Oslo | Norway
Registered at Brønnøysundregistrene in Norway under Org.-Number 911 662 787
CEO/Geschäftsführer (Daglig Leder): Thomas Richter; Chairman/Vorsitzender (Styrets Leder): Dr. Sverre Munck 

Swarm64 AS Zweigstelle Hive
Ullsteinstr. 120 | 12109 Berlin | Germany
Registered at Amtsgericht Charlottenburg - HRB 154382 B 
 


Best,

Mats


Re: Query Performance / Planner estimate off

From
Michael Lewis
Date:


On Wed, Oct 21, 2020, 8:42 AM Mats Olsen <mats@duneanalytics.com> wrote:


On 10/21/20 2:38 PM, Sebastian Dressler wrote:
Hi Mats,

On 20. Oct 2020, at 11:37, Mats Julian Olsen <mats@duneanalytics.com> wrote:

[...]

1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx 

How can I get Postgres not to loop over 12M rows?

I looked at the plans and your config and there are some thoughts I'm having:

- The row estimate is off, as you possibly noticed. This can be possibly solved by raising `default_statistics_target` to e.g. 2500 (we typically use that) and run ANALYZE
I've `set default_statistics_target=2500` and ran analyze on both tables involved, unfortunately the plan is the same. The columns we use for joining here are hashes and we expect very few duplicates in the tables. Hence I think extended statistics (storing most common values and histogram bounds) aren't useful for this kind of data. Would you say the same thing?

Have you checked whether ndistinct is roughly accurate? It can be set manually on a column, or set to a negative value, which is then treated as a fraction of reltuples.
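
A sketch of what checking and overriding it could look like (the column and value below are placeholders only):

    -- inspect what the planner currently believes
    SELECT attname, n_distinct
    FROM pg_stats
    WHERE schemaname = 'uniswap_v2' AND tablename = 'Pair_evt_Mint';

    -- override if it's badly off; negative values mean "fraction of reltuples"
    ALTER TABLE uniswap_v2."Pair_evt_Mint"
        ALTER COLUMN evt_block_time SET (n_distinct = -0.5);
    ANALYZE uniswap_v2."Pair_evt_Mint";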

Re: Query Performance / Planner estimate off

From
Sebastian Dressler
Date:
Hi Mats,

Happy to help.

On 21. Oct 2020, at 16:42, Mats Olsen <mats@duneanalytics.com> wrote:
On 10/21/20 2:38 PM, Sebastian Dressler wrote:
Hi Mats,

On 20. Oct 2020, at 11:37, Mats Julian Olsen <mats@duneanalytics.com> wrote:

[...]

1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx 

How can I get Postgres not to loop over 12M rows?

I looked at the plans and your config and there are some thoughts I'm having:

- The row estimate is off, as you possibly noticed. This can be possibly solved by raising `default_statistics_target` to e.g. 2500 (we typically use that) and run ANALYZE
I've `set default_statistics_target=2500` and ran analyze on both tables involved, unfortunately the plan is the same. The columns we use for joining here are hashes and we expect very few duplicates in the tables. Hence I think extended statistics (storing most common values and histogram bounds) aren't useful for this kind of data. Would you say the same thing?

Yes, that looks like a given in this case.


- I however think that the misestimate might be caused by the evt_tx_hash being of type bytea. I believe that PG cannot estimate this very well for JOINs and will rather pick row numbers too low. Hence the nested loop is picked and there might be no way around this. I have experienced similar things when applying JOINs on VARCHAR with e.g. more than 3 fields for comparison.

This is very interesting, and I have never heard of issues with using `bytea` for joins. Our entire database is filled with them, as we deal with hashes of different lengths. In fact I would estimate that 60% of columns are bytea's. My intuition would say that it's better to store the hashes as byte arrays, rather than `text` fields as you can compare the raw bytes directly without encoding first?  Do you have any references for this?

Unfortunately, I have not yet dealt with `bytea` that much. It just rang a bell when I saw this kind of off-estimate in combination with nested loops. In the case I referenced, the tables had 3 VARCHAR columns to be joined on and the estimate was very much off. As a result, PG chose nested loops in the upper layers of processing. Due to another JOIN the estimate went down to 1 row whereas it was 1 million rows in reality. Now, yours is "only" a factor of 5 away, i.e. this might be a totally different reason.

However, I looked into the plan once more and realized that the source of the problem could also be the scan on "Pair_evt_Mint" along the date dimension, although you have a stats target of 10k there. If the timestamp is (roughly) sorted, you could try adding a BRIN index and thereby maybe get a better estimate & scan time.
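
A sketch of the BRIN idea (the index name is just an example; this assumes evt_block_time roughly follows the physical insert order):

    CREATE INDEX CONCURRENTLY pair_evt_mint_block_time_brin
        ON uniswap_v2."Pair_evt_Mint" USING brin (evt_block_time);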

Alternatively, since I know the length of the hashes in advance, I could've used `varchar(n)`, but I don't think there's any gains to be had in postgres by doing that? Something like `bytea(n)` would also have been interesting, had postgres been able to exploit that information.

I think giving VARCHAR a shot makes sense, maybe on an experimental basis to see whether the estimates get better. Maybe PG can then estimate that there are (almost) no dupes within the table, but that there are N-many across tables. Another option to explore might be using UUID as a type. As said above, it looks more like the timestamp is causing the misestimate.

Maybe try querying this table by itself with that timestamp to see what kind of estimate you get?

- Other things to look into:

    - work_mem seems too low to me with 56MB, consider raising this to the GB range to avoid disk-based operations
    - min_parallel_table_scan_size - try 0
    - parallel_setup_cost (default 1000, maybe try 500)
    - parallel_tuple_cost (default 1.0, maybe try 0.1)
    - random_page_cost (as mentioned consider raising this maybe much higher, factor 10 or sth like this) or (typically) seq_page_cost can be possibly much lower (0.1, 0.01) depending on your storage

I've tried various settings of these parameters now, and unfortunately the only parameter that alters the query plan is the last one (random_page_cost), which also has the side effect of (almost) forcing sequential scans for most queries as far as I understand? Our storage is Google Cloud pd-ssd.

I think a combination of random_page_cost with parallel_tuple_cost and min_parallel_table_scan_size might make sense. By that you possibly get at least parallel sequential scans. But I understand that this is possibly having the same effect as using `enable_nestloop = off`.

Thank you so much for you response, I'm looking forward to keep the discussion going.

You're very welcome.

Best,
Sebastian


--

Sebastian Dressler, Solution Architect 
+49 30 994 0496 72 | sebastian@swarm64.com 

Swarm64 AS
Parkveien 41 B | 0258 Oslo | Norway
Registered at Brønnøysundregistrene in Norway under Org.-Number 911 662 787
CEO/Geschäftsführer (Daglig Leder): Thomas Richter; Chairman/Vorsitzender (Styrets Leder): Dr. Sverre Munck 

Swarm64 AS Zweigstelle Hive
Ullsteinstr. 120 | 12109 Berlin | Germany
Registered at Amtsgericht Charlottenburg - HRB 154382 B 
 


Re: Query Performance / Planner estimate off

From
Mats Olsen
Date:


On 10/21/20 5:29 PM, Michael Lewis wrote:


On Wed, Oct 21, 2020, 8:42 AM Mats Olsen <mats@duneanalytics.com> wrote:


On 10/21/20 2:38 PM, Sebastian Dressler wrote:
Hi Mats,

On 20. Oct 2020, at 11:37, Mats Julian Olsen <mats@duneanalytics.com> wrote:

[...]

1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx 

How can I get Postgres not to loop over 12M rows?

I looked at the plans and your config and there are some thoughts I'm having:

- The row estimate is off, as you possibly noticed. This can be possibly solved by raising `default_statistics_target` to e.g. 2500 (we typically use that) and run ANALYZE
I've `set default_statistics_target=2500` and ran analyze on both tables involved, unfortunately the plan is the same. The columns we use for joining here are hashes and we expect very few duplicates in the tables. Hence I think extended statistics (storing most common values and histogram bounds) aren't useful for this kind of data. Would you say the same thing?

Have you checked if ndistinct is roughly accurate? It can be set manually on a column, or set to some value less than one with the calculation depending on reltuples.
Thank you for your reply!

I included ndistinct-counts in the gist: see https://gist.githubusercontent.com/mewwts/9f11ae5e6a5951593b8999559f5418cf/raw/24ca1f227940b48842a03435b731f82364f3576d/stats%2520Mint and https://gist.githubusercontent.com/mewwts/9f11ae5e6a5951593b8999559f5418cf/raw/24ca1f227940b48842a03435b731f82364f3576d/stats%2520transactions.

The join keys `transactions.hash` (unique) and `"Pair_evt_Mint".evt_tx_hash` (nearly unique) both have ndistinct=-1 which seems to make sense to me. The Mint-table has -0.8375 for evt_block_time whereas this query returns 0.56 `select count(distinct evt_block_time)::numeric/count(*) from uniswap_v2."Pair_evt_Mint";`. Should I adjust that one?

Many of the other ndistinct-values for `transactions` seem strange, as it's a giant (partitioned) table, but I don't know enough about the statistics to draw any conclusions from it. What do you think?

Re: Query Performance / Planner estimate off

From
Justin Pryzby
Date:
On Wed, Oct 21, 2020 at 04:42:02PM +0200, Mats Olsen wrote:
> On 10/21/20 2:38 PM, Sebastian Dressler wrote:
> > > On 20. Oct 2020, at 11:37, Mats Julian Olsen <mats@duneanalytics.com
> > > 
> > > [...]
> > > 
> > > 1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
> > > <https://explain.depesz.com/s/NvDR>
> > > 2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
> > > <https://explain.depesz.com/s/buKK>
> > > 3) enable_nestloop=off; enable_seqscan=off (2 min):
> > > https://explain.depesz.com/s/0WXx
> > > <https://explain.depesz.com/s/0WXx>
> > > 
> > > How can I get Postgres not to loop over 12M rows?
> > 
> > I looked at the plans and your config and there are some thoughts I'm
> > having:
> > 
> > - The row estimate is off, as you possibly noticed. This can be possibly
> > solved by raising `default_statistics_target` to e.g. 2500 (we typically
> > use that) and run ANALYZE
> I've `set default_statistics_target=2500` and ran analyze on both tables
> involved, unfortunately the plan is the same. The columns we use for joining
> here are hashes and we expect very few duplicates in the tables. Hence I
> think extended statistics (storing most common values and histogram bounds)
> aren't useful for this kind of data. Would you say the same thing?

In postgres, extended statistics means "MV stats objects", not MCV+histogram,
which are "simple statistics", like ndistinct.

Your indexes maybe aren't ideal for this query, as mentioned.
The indexes that do exist might also be inefficient, due to being unclustered,
or bloated, or due to multiple columns.

These look redundant (which doesn't matter for this query):

Partition key: RANGE (block_number)
Indexes:
    "transactions_block_number_btree" btree (block_number DESC)
    "transactions_block_number_hash_key" UNIQUE CONSTRAINT, btree (block_number, hash)
    "transactions_block_number_time" btree (hash, block_number)

Maybe one of those could instead be an index just on "hash", which might help here.

Possibly you'd want to try to use a BRIN index on timestamp (or maybe
block_number?).

Maybe you'd want to VACUUM the table to allow index-only scans on the hash
columns?

Maybe you'd want to check if reindexing reduces the index size ?  We don't know
if the table gets lots of UPDATE/DELETE or if any of the columns have high
logical vs physical "correlation".
https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram

Have you ANALYZED the partitioned parent recently ?
This isn't handled by autovacuum.
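
Roughly (table names as used earlier in the thread; VACUUM here is about setting the visibility map, ANALYZE about the parent's statistics):

    VACUUM uniswap_v2."Pair_evt_Mint";
    VACUUM ethereum.transactions;      -- processes the partitions
    ANALYZE ethereum.transactions;     -- the partitioned parent; autovacuum won't do this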

-- 
Justin



Re: Query Performance / Planner estimate off

From
Mats Olsen
Date:
On 10/22/20 8:37 AM, Justin Pryzby wrote:
> On Wed, Oct 21, 2020 at 04:42:02PM +0200, Mats Olsen wrote:
>> On 10/21/20 2:38 PM, Sebastian Dressler wrote:
>>>> On 20. Oct 2020, at 11:37, Mats Julian Olsen <mats@duneanalytics.com
>>>>
>>>> [...]
>>>>
>>>> 1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
>>>> <https://explain.depesz.com/s/NvDR>
>>>> 2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
>>>> <https://explain.depesz.com/s/buKK>
>>>> 3) enable_nestloop=off; enable_seqscan=off (2 min):
>>>> https://explain.depesz.com/s/0WXx
>>>> <https://explain.depesz.com/s/0WXx>
>>>>
>>>> How can I get Postgres not to loop over 12M rows?
>>> I looked at the plans and your config and there are some thoughts I'm
>>> having:
>>>
>>> - The row estimate is off, as you possibly noticed. This can be possibly
>>> solved by raising `default_statistics_target` to e.g. 2500 (we typically
>>> use that) and run ANALYZE
>> I've `set default_statistics_target=2500` and ran analyze on both tables
>> involved, unfortunately the plan is the same. The columns we use for joining
>> here are hashes and we expect very few duplicates in the tables. Hence I
>> think extended statistics (storing most common values and histogram bounds)
>> aren't useful for this kind of data. Would you say the same thing?
> In postgres, extended statistics means "MV stats objects", not MCV+histogram,
> which are "simple statistics", like ndistinct.
>
> Your indexes maybe aren't ideal for this query, as mentioned.
> The indexes that do exist might also be inefficient, due to being unclustered,
> or bloated, or due to multiple columns.

This table is append-only, i.e. no updates. The partitions are clustered
on a btree index on block_time:

"transactions_p500000_block_time_idx" btree (block_time) CLUSTER

>
> These look redundant (which doesn't matter for this the query):
>
> Partition key: RANGE (block_number)
> Indexes:
>      "transactions_block_number_btree" btree (block_number DESC)
>      "transactions_block_number_hash_key" UNIQUE CONSTRAINT, btree (block_number, hash)
>      "transactions_block_number_time" btree (hash, block_number)
>
> Maybe that would be an index just on "hash", which might help here.
>
> Possibly you'd want to try to use a BRIN index on timestamp (or maybe
> block_number?).

Yeah this could be a good idea, but the size of this table doesn't let 
me add any indexes while it's online. I'll revisit these the next time 
we redeploy the database.

>
> Maybe you'd want to VACUUM the table to allow index-only scan on the hash
> columns ?
>
> Maybe you'd want to check if reindexing reduces the index size ?  We don't know
> if the table gets lots of UPDATE/DELETE or if any of the columns have high
> logical vs physical "correlation".
> https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram
>
> Have you ANALYZED the partitioned parent recently ?
> This isn't handled by autovacuum.

As mentioned above there aren't any updates or deletes to this table. 
Both tables have been ANALYZEd. I ran that query and the output is here 
https://gist.github.com/mewwts/86ef43ff82120e104a654cd7fbb5ec06. I ran 
it for the two specific columns and all partitions for the transactions 
table, and for all columns on "Pair_evt_Mint". Do these values tell
you anything?




Re: Query Performance / Planner estimate off

From
Mats Olsen
Date:
Thanks for your response Justin.

On 10/22/20 3:48 PM, Justin Pryzby wrote:
> On Thu, Oct 22, 2020 at 09:36:03AM +0200, Mats Olsen wrote:
>> On 10/22/20 8:37 AM, Justin Pryzby wrote:
>>> These look redundant (which doesn't matter for this the query):
>>>
>>> Partition key: RANGE (block_number)
>>> Indexes:
>>>       "transactions_block_number_btree" btree (block_number DESC)
>>>       "transactions_block_number_hash_key" UNIQUE CONSTRAINT, btree (block_number, hash)
>>>       "transactions_block_number_time" btree (hash, block_number)
>>>
>>> Maybe that would be an index just on "hash", which might help here.
>>>
>>> Possibly you'd want to try to use a BRIN index on timestamp (or maybe
>>> block_number?).
>> Yeah this could be a good idea, but the size of this table doesn't let me
>> add any indexes while it's online. I'll revisit these the next time we
>> redeploy the database.
> Why not CREATE INDEX CONCURRENTLY ?
We could, but it would take forever on the `ethereum.transactions` table.
> It seems to me you could add BRIN on all correlated indexes.  It's nearly free.
>
>    0.102922715 | Pair_evt_Mint | evt_block_time   | f         |         0 | -0.56466025 | 10000 |  10001 |    0.964666
>     0.06872191 | Pair_evt_Mint | evt_block_time   | f         |         0 |  -0.8379525 |   500 |    501 |     0.99982
>     0.06872191 | Pair_evt_Mint | evt_block_number | f         |         0 |  -0.8379525 |   500 |    501 |     0.99982
>    0.032878816 | Pair_evt_Mint | evt_block_number | f         |         0 | -0.56466025 |  2500 |   2501 |    0.964666
Agreed, could try to add BRIN's on these.
>
>>> Maybe you'd want to VACUUM the table to allow index-only scan on the hash
>>> columns ?
> Did you try it ? I think this could be a big win.
> Since it's append-only, autovacuum won't hit it (until you upgrade to pg13).

I vacuumed uniswap_v2."Pair_evt_Mint", but I'm still getting the same
plan, unfortunately.


>



Re: Query Performance / Planner estimate off

From
Mats Olsen
Date:


On 10/21/20 5:35 PM, Sebastian Dressler wrote:
Hi Mats,

Happy to help.

On 21. Oct 2020, at 16:42, Mats Olsen <mats@duneanalytics.com> wrote:
On 10/21/20 2:38 PM, Sebastian Dressler wrote:
Hi Mats,

On 20. Oct 2020, at 11:37, Mats Julian Olsen <mats@duneanalytics.com> wrote:

[...]

1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
3) enable_nestloop=off; enable_seqscan=off (2 min): https://explain.depesz.com/s/0WXx 

How can I get Postgres not to loop over 12M rows?

I looked at the plans and your config and there are some thoughts I'm having:

- The row estimate is off, as you possibly noticed. This can be possibly solved by raising `default_statistics_target` to e.g. 2500 (we typically use that) and run ANALYZE
I've `set default_statistics_target=2500` and ran analyze on both tables involved, unfortunately the plan is the same. The columns we use for joining here are hashes and we expect very few duplicates in the tables. Hence I think extended statistics (storing most common values and histogram bounds) aren't useful for this kind of data. Would you say the same thing?

Yes, that looks like a given in this case.


- I however think that the misestimate might be caused by the evt_tx_hash being of type bytea. I believe that PG cannot estimate this very well for JOINs and will rather pick row numbers too low. Hence the nested loop is picked and there might be no way around this. I have experienced similar things when applying JOINs on VARCHAR with e.g. more than 3 fields for comparison.

This is very interesting, and I have never heard of issues with using `bytea` for joins. Our entire database is filled with them, as we deal with hashes of different lengths. In fact I would estimate that 60% of columns are bytea's. My intuition would say that it's better to store the hashes as byte arrays, rather than `text` fields as you can compare the raw bytes directly without encoding first?  Do you have any references for this?

Unfortunately, I have not dealt yet with `bytea` that much. It just rang a bell when I saw these kind of off-estimates in combination with nested loops. In the case I referenced it was, that the tables had 3 VARCHAR columns to be joined on and the estimate was very much off. As a result, PG chose nested loops in the upper layers of processing. Due to another JOIN the estimate went down to 1 row whereas it was 1 million rows in reality. Now, yours is "only" a factor 5 away, i.e. this might be a totally different reason.

However, I looked into the plan once more and realized, that the source of the problem could also be the scan on "Pair_evt_Mint" along the date dimension. Although you have a stats target of 10k there. If the timestamp is (roughly) sorted, you could try adding a BRIN index and by that maybe get a better estimate & scan-time.
Hi again: after around 48 hours the CREATE INDEX CONCURRENTLY completed successfully. The new plan still uses a nested loop, but the scan on "Pair_evt_Mint" is now a parallel index scan. See https://explain.depesz.com/s/8ZzT

Alternatively, since I know the length of the hashes in advance, I could've used `varchar(n)`, but I don't think there's any gains to be had in postgres by doing that? Something like `bytea(n)` would also have been interesting, had postgres been able to exploit that information.

I think giving VARCHAR a shot makes sense, maybe on an experimental basis to see whether the estimates get better. Maybe PG can then estimate that there are (almost) no dupes within the table but that there are N-many across tables. Another option to explore is maybe to use UUID as a type. As said above, it more looks like the timestamp causing the mis-estimate.

Maybe try querying this table by itself with that timestamp to see what kind of estimate you get?

- Other things to look into:

    - work_mem seems too low to me with 56MB, consider raising this to the GB range to avoid disk-based operations
    - min_parallel_table_scan_size - try 0
    - parallel_setup_cost (default 1000, maybe try 500)
    - parallel_tuple_cost (default 1.0, maybe try 0.1)
    - random_page_cost (as mentioned consider raising this maybe much higher, factor 10 or sth like this) or (typically) seq_page_cost can be possibly much lower (0.1, 0.01) depending on your storage

I've tried various settings of these parameters now, and unfortunately the only parameter that alters the query plan is the last one (random_page_cost), which also has the side effect of (almost) forcing sequential scans for most queries as far as I understand? Our storage is Google Cloud pd-ssd.

I think a combination of random_page_cost with parallel_tuple_cost and min_parallel_table_scan_size might make sense. By that you possibly get at least parallel sequential scans. But I understand that this is possibly having the same effect as using `enable_nestloop = off`.

I'll have a closer look at these parameters.

Again, thank you.

Mats