Re: Query take a long time and use no index - Mailing list pgsql-general

From Tomas Vondra
Subject Re: Query take a long time and use no index
Date
Msg-id d9708792-19fc-2dc7-4ecf-f93b69a4c2b0@enterprisedb.com
In response to Query take a long time and use no index  (basti <mailinglist@unix-solution.de>)
Responses Re: Query take a long time and use no index  (Tomas Vondra <tomas.vondra@enterprisedb.com>)
List pgsql-general

On 7/17/23 11:13, basti wrote:
> Hello,
> 
> I use volkszaehler.org (a DIY Smartmeter) on postgres.
> 
> The hardware is a Dual-core 2,4 GHz amd64-CPU and 6 GB RAM.
> The database is approx. 1.5 GB on SSD
> 
> my local config looks like:
> 
> root@vz:~# cat /etc/postgresql/15/main/conf.d/local.conf
> # DB Version: 15
> # OS Type: linux
> # Total Memory (RAM): 6 GB
> # CPUs num: 2
> # Connections num: 20
> # Data Storage: ssd
> 
> max_connections = 20
> shared_buffers = 2GB
> effective_cache_size = 4608MB
> maintenance_work_mem = 384MB
> checkpoint_completion_target = 0.9
> wal_buffers = 16MB
> default_statistics_target = 100
> random_page_cost = 0.9
> effective_io_concurrency = 200
> min_wal_size = 1GB
> max_wal_size = 4GB
> work_mem = 512MB
> temp_buffers = 512MB
> 
> # log
> log_directory = '/var/log/postgresql/'  # directory where log files are written,
>                                         # can be absolute or relative to PGDATA
> log_filename = 'postgresql-15-main.log'
> #log_statement = 'all'
> #logging_collector = on
> log_temp_files = 0
> session_preload_libraries = 'auto_explain'
> auto_explain.log_min_duration = '3s'
> 
> 
> The Values:
> 
> volkszaehler=# SELECT *  FROM data limit 10;
>    timestamp   | channel_id | value
> ---------------+------------+-------
>  1666008690865 |          5 |   232
>  1666008691867 |          5 |   236
>  1666008692865 |          5 |   237
>  1666008693867 |          5 |   232
>  1666008694865 |          5 |   227
>  1666008695867 |          5 |   229
>  1666008698866 |          5 |   228
>  1666008699868 |          5 |   229
>  1666008700864 |          5 |   228
>  1666008701868 |          5 |   230
> 
> 
> volkszaehler=# SELECT COUNT(value) FROM data;
>   count
> ----------
>  17884247
> (1 row)
> 
> 
> The Indexes:
> 
> volkszaehler=# SELECT tablename,indexname,indexdef FROM pg_indexes WHERE
> tablename LIKE 'data%' ORDER BY tablename,indexname;
> 
>  tablename |           indexname            |                   indexdef
> -----------+--------------------------------+--------------------------------------------------------------------------------------------------
>  data      | idx_16391_idx_adf3f36372f5a1aa | CREATE INDEX idx_16391_idx_adf3f36372f5a1aa ON volkszaehler.data USING btree (channel_id)
>  data      | idx_16391_primary              | CREATE UNIQUE INDEX idx_16391_primary ON volkszaehler.data USING btree (channel_id, "timestamp")
>  data      | idx_data_timestamp             | CREATE INDEX idx_data_timestamp ON volkszaehler.data USING btree ("timestamp")
> (3 rows)
> 
> 
> The Query:
> 
> volkszaehler=# explain analyze SELECT COUNT(DISTINCT DATE_TRUNC('day',
> TIMESTAMP 'epoch' + timestamp * INTERVAL '1 millisecond')) FROM data
> WHERE channel_id = 5 AND timestamp >= 0;
> 
>                                                          QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=590793.68..590793.69 rows=1 width=8) (actual time=15449.536..15449.539 rows=1 loops=1)
>    ->  Seq Scan on data  (cost=0.00..382037.82 rows=16700469 width=8) (actual time=247.092..3833.495 rows=16669429 loops=1)
>          Filter: (("timestamp" >= 0) AND (channel_id = 5))
>          Rows Removed by Filter: 1215163
>  Planning Time: 0.374 ms
>  JIT:
>    Functions: 5
>    Options: Inlining true, Optimization true, Expressions true, Deforming true
>    Timing: Generation 1.201 ms, Inlining 158.833 ms, Optimization 59.816 ms, Emission 28.472 ms, Total 248.322 ms
>  Execution Time: 15451.093 ms
> (10 rows)
> 
> Round about 16 seconds is too long; the frontend runs into timeouts or other errors.
> 
> A simple count looks like:
> 
> volkszaehler=# explain analyze SELECT count (channel_id) FROM data;
>                                                                  QUERY PLAN
> ---------------------------------------------------------------------------------------------------------------------------------------------
>  Finalize Aggregate  (cost=208089.76..208089.77 rows=1 width=8) (actual time=3514.293..3523.842 rows=1 loops=1)
>    ->  Gather  (cost=208089.55..208089.76 rows=2 width=8) (actual time=3514.247..3523.800 rows=3 loops=1)
>          Workers Planned: 2
>          Workers Launched: 2
>          ->  Partial Aggregate  (cost=207089.55..207089.56 rows=1 width=8) (actual time=3427.139..3427.141 rows=1 loops=3)
>                ->  Parallel Seq Scan on data  (cost=0.00..188457.44 rows=7452844 width=8) (actual time=0.137..2121.695 rows=5962263 loops=3)
>  Planning Time: 0.247 ms
>  JIT:
>    Functions: 11
>    Options: Inlining false, Optimization false, Expressions true, Deforming true
>    Timing: Generation 2.665 ms, Inlining 0.000 ms, Optimization 1.612 ms, Emission 48.580 ms, Total 52.857 ms
>  Execution Time: 3525.656 ms
> (12 rows)
> 
> Why are the indexes not used?

It's a common misconception that indexes necessarily improve performance
and that slow queries are slow because of missing indexes. In practice
indexes are most useful for selective queries, i.e. queries that scan
only a fraction of the table - say, a couple percent.

But this query scans ~17M rows from the table, and my guess is that's a
significant part of the table. In which case the index is useless.
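For illustration, the difference is just the selectivity of the WHERE clause
(the timestamp values below are made-up examples, not from your data):

```sql
-- Selective: a narrow time window for one channel only matches a small
-- fraction of the rows, so idx_16391_primary (channel_id, "timestamp")
-- is likely to be used.
SELECT count(*) FROM data
 WHERE channel_id = 5
   AND timestamp >= 1666008000000   -- hypothetical range start (ms)
   AND timestamp <  1666094400000;  -- hypothetical range end (ms)

-- Not selective: "timestamp >= 0" matches essentially every row,
-- so a sequential scan is the cheaper plan and the planner picks it.
SELECT count(*) FROM data
 WHERE channel_id = 5 AND timestamp >= 0;
```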

Also, notice the seqscan took ~4s. Even if that drops to ~1s with an
index, the overall query will still take 12s.

> Is it a Problem of the DATE_TRUNC ?

My guess is it's more about the DISTINCT, which forces a sort before the
aggregate function and also prevents the query from running in parallel.
You can try simplifying the query step by step - remove the DISTINCT
first, then the DATE_TRUNC.
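Roughly like this, timing each variant to see where the cost is:

```sql
-- Step 1: drop the DISTINCT (the aggregate can then run in parallel)
SELECT COUNT(DATE_TRUNC('day',
       TIMESTAMP 'epoch' + timestamp * INTERVAL '1 millisecond'))
  FROM data WHERE channel_id = 5 AND timestamp >= 0;

-- Step 2: drop the DATE_TRUNC too, leaving a plain count
SELECT COUNT(timestamp)
  FROM data WHERE channel_id = 5 AND timestamp >= 0;
```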

> How can I optimize?
> 

It's always going to be slow with the COUNT(DISTINCT), I'm afraid.

Not sure how much you can modify the query / database, and how accurate
the results need to be. If you're OK with estimates, you can try the
postgres-hll extension [1], which estimates count(distinct). For exact
results, I wrote the count_distinct extension [2], which uses a hash
table. Might be worth a try, I guess.
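With the hll extension installed, the query might look something like this
(untested sketch - dividing the epoch milliseconds by 86400000 truncates
them to a day number, avoiding the timestamp conversion entirely):

```sql
-- Approximate number of distinct days, assuming the hll extension
-- (CREATE EXTENSION hll) is available.
SELECT hll_cardinality(hll_add_agg(hll_hash_bigint(timestamp / 86400000)))
  FROM data
 WHERE channel_id = 5 AND timestamp >= 0;
```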

Another option is to precalculate stuff, and do queries on that. That
is, you're not going to add data with old timestamps, so you can
pre-aggregate that.
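A simple sketch of that, using a materialized view (names are made up,
and you'd need to schedule the refresh yourself):

```sql
-- One row per (channel, day); GROUP BY with no aggregates is
-- equivalent to SELECT DISTINCT here.
CREATE MATERIALIZED VIEW data_days AS
SELECT channel_id,
       DATE_TRUNC('day',
         TIMESTAMP 'epoch' + timestamp * INTERVAL '1 millisecond') AS day
  FROM data
 GROUP BY 1, 2;

-- The original question then only scans the (much smaller) summary:
SELECT COUNT(*) FROM data_days WHERE channel_id = 5;

-- After loading new data:
REFRESH MATERIALIZED VIEW data_days;
```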


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


