Re: Query Performance / Planner estimate off - Mailing list pgsql-performance

From Mats Olsen
Subject Re: Query Performance / Planner estimate off
Date
Msg-id 7f586824-0d53-0f84-15f1-f460dbc414b5@duneanalytics.com
Whole thread Raw
In response to Re: Query Performance / Planner estimate off  (Justin Pryzby <pryzby@telsasoft.com>)
List pgsql-performance
On 10/22/20 8:37 AM, Justin Pryzby wrote:
> On Wed, Oct 21, 2020 at 04:42:02PM +0200, Mats Olsen wrote:
>> On 10/21/20 2:38 PM, Sebastian Dressler wrote:
>>>> On 20. Oct 2020, at 11:37, Mats Julian Olsen <mats@duneanalytics.com
>>>>
>>>> [...]
>>>>
>>>> 1) Vanilla plan (16 min) : https://explain.depesz.com/s/NvDR
>>>> <https://explain.depesz.com/s/NvDR>
>>>> 2) enable_nestloop=off (4 min): https://explain.depesz.com/s/buKK
>>>> <https://explain.depesz.com/s/buKK>
>>>> 3) enable_nestloop=off; enable_seqscan=off (2 min):
>>>> https://explain.depesz.com/s/0WXx
>>>> <https://explain.depesz.com/s/0WXx>
>>>>
>>>> How can I get Postgres not to loop over 12M rows?
>>> I looked at the plans and your config and there are some thoughts I'm
>>> having:
>>>
>>> - The row estimate is off, as you possibly noticed. This can be possibly
>>> solved by raising `default_statistics_target` to e.g. 2500 (we typically
>>> use that) and run ANALYZE
>> I've `set default_statistics_target=2500` and ran analyze on both tables
>> involved, unfortunately the plan is the same. The columns we use for joining
>> here are hashes and we expect very few duplicates in the tables. Hence I
>> think extended statistics (storing most common values and histogram bounds)
>> aren't useful for this kind of data. Would you say the same thing?
> In postgres, extended statistics means "MV stats objects", not MCV+histogram,
> which are "simple statistics", like ndistinct.
>
> Your indexes maybe aren't ideal for this query, as mentioned.
> The indexes that do exist might also be inefficient, due to being unclustered,
> or bloated, or due to multiple columns.

This table is append-only, i.e. no updates. The partitions are clustered 
on a btree index on block_time `

"transactions_p500000_block_time_idx" btree (block_time) CLUSTER

>
> These look redundant (which doesn't matter for this the query):
>
> Partition key: RANGE (block_number)
> Indexes:
>      "transactions_block_number_btree" btree (block_number DESC)
>      "transactions_block_number_hash_key" UNIQUE CONSTRAINT, btree (block_number, hash)
>      "transactions_block_number_time" btree (hash, block_number)
>
> Maybe that would be an index just on "hash", which might help here.
>
> Possibly you'd want to try to use a BRIN index on timestamp (or maybe
> block_number?).

Yeah this could be a good idea, but the size of this table doesn't let 
me add any indexes while it's online. I'll revisit these the next time 
we redeploy the database.

>
> Maybe you'd want to VACUUM the table to allow index-only scan on the hash
> columns ?
>
> Maybe you'd want to check if reindexing reduces the index size ?  We don't know
> if the table gets lots of UPDATE/DELETE or if any of the columns have high
> logical vs physical "correlation".
> https://wiki.postgresql.org/wiki/Slow_Query_Questions#Statistics:_n_distinct.2C_MCV.2C_histogram
>
> Have you ANALYZED the partitioned parent recently ?
> This isn't handled by autovacuum.

As mentioned above there aren't any updates or deletes to this table. 
Both tables have been ANALYZEd. I ran that query and the output is here 
https://gist.github.com/mewwts/86ef43ff82120e104a654cd7fbb5ec06. I ran 
it for the two specific columns and all partitions for the transactions 
table, and for all columns on "Pair_evt_Mint". Does these values tell 
you anything?




pgsql-performance by date:

Previous
From: Justin Pryzby
Date:
Subject: Re: Query Performance / Planner estimate off
Next
From: Mats Olsen
Date:
Subject: Re: Query Performance / Planner estimate off