Re: Faster distinct query? - Mailing list pgsql-general

From Israel Brewster
Subject Re: Faster distinct query?
Date
Msg-id 2795367B-80ED-4B44-9354-5C2A6EE15D40@alaska.edu
Whole thread Raw
In response to Re: Faster distinct query?  (Michael Lewis <mlewis@entrata.com>)
List pgsql-general
On Sep 23, 2021, at 8:55 PM, Michael Lewis <mlewis@entrata.com> wrote:

It’s only when I add in the AND data.channels=channels.channel that the query time blows up to 5+ minutes. I personally don’t understand why there would be such a large difference between the two queries - something wrong with my indexes?

Did you do the manual vacuum as suggested by Tom?

I ran a VACUUM ANALYZE, yes.

Maybe at least check pg_stat_user_tables for the last vacuum or autovacuum timestamp, and how many records have changed since then.

volcano_seismology=# SELECT seq_scan,seq_tup_read,idx_scan,idx_tup_fetch,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze,n_ins_since_vacuum,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,vacuum_count,autovacuum_count,analyze_count,autoanalyze_count FROM pg_stat_user_tables WHERE relname='data';
-[ RECORD 1 ]-------+------------------------------
seq_scan            | 95
seq_tup_read        | 25899340540
idx_scan            | 728372
idx_tup_fetch       | 51600217033
n_tup_ins           | 840283699
n_tup_upd           | 66120702
n_tup_del           | 2375651
n_tup_hot_upd       | 0
n_live_tup          | 839266956
n_dead_tup          | 66585751
n_mod_since_analyze | 58896052
n_ins_since_vacuum  | 24890460
last_vacuum         | 2021-09-22 21:32:11.367855+00
last_autovacuum     | 2021-09-14 07:13:23.745862+00
last_analyze        | 2021-09-22 21:32:21.071092+00
last_autoanalyze    | 2021-09-21 11:54:36.924762+00
vacuum_count        | 1
autovacuum_count    | 1
analyze_count       | 1
autoanalyze_count   | 2

Note that the update count was due to a (hopefully) one-time process where I had to change the value of a bunch of records. Generally this *should be* an insert-once-read-many database.


The system is deciding on an index scan because that should be very fast, but if it needs to look at the actual table table to determine if the process executing that query should still be allowed to see that tuple (row version) then the index only scan slows down a lot I believe. The huge number of heap fetches that Tom pointed out mean that the system doesn't know that ALL processes should be able to see those tuples. As someone else suggested lowering the max freeze age, you might just do a manual "vacuum freeze" as well to allow future auto vacuum processes to quickly skip over those pages as well.

Ok, ran a VACUUM (ANALYZE, FREEZE), and am now showing this from pg_stat_user_tables:

volcano_seismology=# SELECT seq_scan,seq_tup_read,idx_scan,idx_tup_fetch,n_tup_ins,n_tup_upd,n_tup_del,n_tup_hot_upd,n_live_tup,n_dead_tup,n_mod_since_analyze,n_ins_since_vacuum,last_vacuum,last_autovacuum,last_analyze,last_autoanalyze,vacuum_count,autovacuum_count,analyze_count,autoanalyze_count FROM pg_stat_user_tables WHERE relname='data';
-[ RECORD 1 ]-------+------------------------------
seq_scan            | 96
seq_tup_read        | 26737263238
idx_scan            | 732396
idx_tup_fetch       | 52571927369
n_tup_ins           | 841017819
n_tup_upd           | 66120702
n_tup_del           | 2388723
n_tup_hot_upd       | 0
n_live_tup          | 840198830
n_dead_tup          | 10173
n_mod_since_analyze | 0
n_ins_since_vacuum  | 0
last_vacuum         | 2021-09-24 17:18:18.34282+00
last_autovacuum     | 2021-09-14 07:13:23.745862+00
last_analyze        | 2021-09-24 17:18:31.576238+00
last_autoanalyze    | 2021-09-21 11:54:36.924762+00
vacuum_count        | 2
autovacuum_count    | 1
analyze_count       | 2
autoanalyze_count   | 2

However, adding the AND data.channels=channels.channel to the query still makes it take around 5 minutes (https://explain.depesz.com/s/7hb1). So, again, running VACUUM didn’t appear to help any.

Also perhaps interestingly, if I again modify the query to only match on channel, not station, it is again fast (though not quite as fast): https://explain.depesz.com/s/HLb8

So, basically, I can quickly get a list of all channels for which I have data, or all stations for which I have data, but getting a list of all channels for each station is slow.

I've heard of the loose indexscan before mentioned on this thread, but I'm not seeing how to implement that for multiple columns. Anyone have an example or pseudo code perhaps?

This is my stumbling block to implementing this option as well. That said, with the lookup table in place, perhaps all the rest of this is a moot point? 

---
Israel Brewster
Software Engineer
Alaska Volcano Observatory 
Geophysical Institute - UAF 
2156 Koyukuk Drive 
Fairbanks AK 99775-7320
Work: 907-474-5172
cell:  907-328-9145


pgsql-general by date:

Previous
From: "Clive Swan"
Date:
Subject: RE: Get COUNT results from two different columns
Next
From: Garfield Lewis
Date:
Subject: Using XMLNAMESPACES with XMLEMENT