Thread: Query takes a long time and uses no index
Hello,

I use volkszaehler.org (a DIY smart meter) on Postgres.

The hardware is a dual-core 2.4 GHz amd64 CPU with 6 GB RAM.
The database is approx. 1.5 GB, on SSD.

My local config looks like this:

root@vz:~# cat /etc/postgresql/15/main/conf.d/local.conf
# DB Version: 15
# OS Type: linux
# Total Memory (RAM): 6 GB
# CPUs num: 2
# Connections num: 20
# Data Storage: ssd

max_connections = 20
shared_buffers = 2GB
effective_cache_size = 4608MB
maintenance_work_mem = 384MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 0.9
effective_io_concurrency = 200
min_wal_size = 1GB
max_wal_size = 4GB
work_mem = 512MB
temp_buffers = 512MB

# log
log_directory = '/var/log/postgresql/'  # directory where log files are written,
                                        # can be absolute or relative to PGDATA
log_filename = 'postgresql-15-main.log'
#log_statement = 'all'
#logging_collector = on
log_temp_files = 0
session_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '3s'

The Values:

volkszaehler=# SELECT * FROM data LIMIT 10;
   timestamp   | channel_id | value
---------------+------------+-------
 1666008690865 |          5 |   232
 1666008691867 |          5 |   236
 1666008692865 |          5 |   237
 1666008693867 |          5 |   232
 1666008694865 |          5 |   227
 1666008695867 |          5 |   229
 1666008698866 |          5 |   228
 1666008699868 |          5 |   229
 1666008700864 |          5 |   228
 1666008701868 |          5 |   230

volkszaehler=# SELECT COUNT(value) FROM data;
  count
----------
 17884247
(1 row)

The Indexes:

volkszaehler=# SELECT tablename, indexname, indexdef FROM pg_indexes
WHERE tablename LIKE 'data%' ORDER BY tablename, indexname;

 tablename |           indexname            | indexdef
-----------+--------------------------------+--------------------------------------------------------------------------------------------------
 data      | idx_16391_idx_adf3f36372f5a1aa | CREATE INDEX idx_16391_idx_adf3f36372f5a1aa ON volkszaehler.data USING btree (channel_id)
 data      | idx_16391_primary              | CREATE UNIQUE INDEX idx_16391_primary ON volkszaehler.data USING btree (channel_id, "timestamp")
 data      | idx_data_timestamp             | CREATE INDEX idx_data_timestamp ON volkszaehler.data USING btree ("timestamp")
(3 rows)

The Query:

volkszaehler=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT DATE_TRUNC('day',
TIMESTAMP 'epoch' + timestamp * INTERVAL '1 millisecond')) FROM data
WHERE channel_id = 5 AND timestamp >= 0;

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=590793.68..590793.69 rows=1 width=8) (actual time=15449.536..15449.539 rows=1 loops=1)
   ->  Seq Scan on data  (cost=0.00..382037.82 rows=16700469 width=8) (actual time=247.092..3833.495 rows=16669429 loops=1)
         Filter: (("timestamp" >= 0) AND (channel_id = 5))
         Rows Removed by Filter: 1215163
 Planning Time: 0.374 ms
 JIT:
   Functions: 5
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.201 ms, Inlining 158.833 ms, Optimization 59.816 ms, Emission 28.472 ms, Total 248.322 ms
 Execution Time: 15451.093 ms
(10 rows)

Roughly 16 seconds is too long; the frontend runs into timeouts or other
errors.
A simple count looks like this:

volkszaehler=# EXPLAIN ANALYZE SELECT count(channel_id) FROM data;

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=208089.76..208089.77 rows=1 width=8) (actual time=3514.293..3523.842 rows=1 loops=1)
   ->  Gather  (cost=208089.55..208089.76 rows=2 width=8) (actual time=3514.247..3523.800 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=207089.55..207089.56 rows=1 width=8) (actual time=3427.139..3427.141 rows=1 loops=3)
               ->  Parallel Seq Scan on data  (cost=0.00..188457.44 rows=7452844 width=8) (actual time=0.137..2121.695 rows=5962263 loops=3)
 Planning Time: 0.247 ms
 JIT:
   Functions: 11
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 2.665 ms, Inlining 0.000 ms, Optimization 1.612 ms, Emission 48.580 ms, Total 52.857 ms
 Execution Time: 3525.656 ms
(12 rows)

Why are the indexes not used?
Is it a problem of the DATE_TRUNC?
How can I optimize?

Best regards
On 7/17/23 11:13, basti wrote:
> [configuration, sample data, indexes, query, and EXPLAIN output snipped;
> see the original post above]
> A simple count looks like this:
>
> volkszaehler=# EXPLAIN ANALYZE SELECT count(channel_id) FROM data;
>
> [parallel seq scan plan snipped; execution time ~3.5 s]
>
> Why are the indexes not used?

It's a common misconception that indexes necessarily improve performance
and that slow queries are slow because of missing indexes. In practice,
indexes are most useful for selective queries, i.e. queries that scan
only a fraction of the table - say, a couple of percent. But this query
scans ~17M rows from the table, and my guess is that's a significant
part of it. In that case, the index is useless.

Also, notice the seqscan took ~4s. Even if that dropped to ~1s with an
index, the overall query would still take ~12s.

> Is it a problem of the DATE_TRUNC?

My guess is it's more about the DISTINCT, which forces a sort before the
aggregate function - which also means the query can't run in parallel.
You can try simplifying the query step by step: remove the DISTINCT
first, then the DATE_TRUNC.

> How can I optimize?

It's always going to be slow with the COUNT(DISTINCT), I'm afraid.

Not sure how much you can modify the query / database, or how accurate
the results need to be. If you're OK with estimates, you can try the
postgresql-hll extension [1], which estimates count(distinct). For exact
results, I wrote the count_distinct extension [2], which uses a hash
table. Might be worth a try, I guess.

Another option is to precalculate things and run the queries against
that. That is, you're not going to add data with old timestamps, so you
can pre-aggregate the historical data.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
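To illustrate the pre-aggregation idea, a minimal sketch with a
hypothetical daily summary table (the daily_data table and its layout
are made up here, not part of volkszaehler):

CREATE TABLE daily_data (
    channel_id bigint NOT NULL,
    day        date   NOT NULL,
    PRIMARY KEY (channel_id, day)
);

-- run periodically; re-inserting already-known days is a no-op
INSERT INTO daily_data (channel_id, day)
SELECT DISTINCT channel_id,
       (DATE_TRUNC('day', TIMESTAMP 'epoch'
                   + "timestamp" * INTERVAL '1 millisecond'))::date
FROM data
ON CONFLICT DO NOTHING;

-- the expensive COUNT(DISTINCT ...) then becomes a cheap index scan
SELECT COUNT(*) FROM daily_data WHERE channel_id = 5;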
On 7/17/23 13:20, Tomas Vondra wrote:
> ...
>
> It's always going to be slow with the COUNT(DISTINCT), I'm afraid.
>
> Not sure how much you can modify the query / database, or how accurate
> the results need to be. If you're OK with estimates, you can try the
> postgresql-hll extension [1], which estimates count(distinct). For exact
> results, I wrote the count_distinct extension [2], which uses a hash
> table. Might be worth a try, I guess.

Damn, I forgot to add the links:

[1] https://github.com/citusdata/postgresql-hll
[2] https://github.com/tvondra/count_distinct

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
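For reference, using [2] would look roughly like this (a sketch based on
the extension's README; the aggregate name and the set of supported
types should be double-checked there):

CREATE EXTENSION count_distinct;

SELECT count_distinct(DATE_TRUNC('day', TIMESTAMP 'epoch'
                      + "timestamp" * INTERVAL '1 millisecond'))
FROM data
WHERE channel_id = 5 AND "timestamp" >= 0;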
On 7/17/23 04:13, basti wrote:
[snip]
>
> The Query:
>
> volkszaehler=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT DATE_TRUNC('day',
> TIMESTAMP 'epoch' + timestamp * INTERVAL '1 millisecond')) FROM data
> WHERE channel_id = 5 AND timestamp >= 0;
>
> [EXPLAIN output snipped; see the original post]
>
> Roughly 16 seconds is too long; the frontend runs into timeouts or
> other errors.

What fraction of the rows in the table meets the "WHERE channel_id = 5
AND timestamp >= 0" qualification? If it's high (and "high" can be a
seemingly low value), then PostgreSQL will decide it's cheaper to
sequentially scan the table.

[remainder of the quoted message snipped]

--
Born in Arizona, moved to Babylonia.
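One quick way to check that fraction, as a sketch (the FILTER clause
requires PostgreSQL 9.4 or later):

SELECT (count(*) FILTER (WHERE channel_id = 5 AND "timestamp" >= 0))::numeric
       / count(*) AS fraction
FROM data;

Given the EXPLAIN output above (~16.7M of ~17.9M rows pass the filter),
the fraction here is about 0.93, which explains the sequential scan.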
Thanks a lot Tomas, I will try it.

I have found out that there is an 'aggregation' function in the
frontend, but it is MySQL-specific and I have no idea how to transform
it to Postgres.

It looks like:

'REPLACE INTO aggregate (channel_id, type, timestamp, value, count)
 SELECT channel_id, ? AS type,
        MAX(agg.timestamp) AS timestamp,
        COALESCE(
            SUM(agg.val_by_time) / (MAX(agg.timestamp) - MIN(agg.prev_timestamp)),
            AVG(agg.value)) AS value,
        COUNT(agg.value) AS count
 FROM (
     SELECT channel_id, timestamp, value,
            value * (timestamp - @prev_timestamp) AS val_by_time,
            COALESCE(@prev_timestamp, 0) AS prev_timestamp,
            @prev_timestamp := timestamp
     FROM data
     CROSS JOIN (SELECT @prev_timestamp := NULL) AS vars
     WHERE channel_id = ?
       AND timestamp < UNIX_TIMESTAMP(DATE_FORMAT(NOW(), "%Y-%m-%d")) * 1000
 ) AS agg
 GROUP BY channel_id,
          DATE_TRUNC('day', TIMESTAMP 'epoch' + timestamp * INTERVAL '1 millisecond')'

with params [3, 5]:

SQLSTATE[42601]: Syntax error: 7 ERROR: syntax error at or near "REPLACE"
LINE 1: REPLACE INTO aggregate (channel_id, type, timestamp, value, ...

On 17.07.23 13:21, Tomas Vondra wrote:
> Damn, I forgot to add the links:
>
> [1] https://github.com/citusdata/postgresql-hll
>
> [2] https://github.com/tvondra/count_distinct
On Tue, 18 Jul 2023 at 06:19, basti <mailinglist@unix-solution.de> wrote:
>
> Thanks a lot Tomas, I will try it.
>
> I have found out that there is an 'aggregation' function in the
> frontend, but it is MySQL-specific and I have no idea how to transform
> it to Postgres.
>
> It looks like:
> 'REPLACE INTO aggregate (channel_id, type, timestamp, value, count)

Perhaps that's something like PostgreSQL's INSERT ... ON CONFLICT [1].

David

[1] https://www.postgresql.org/docs/current/sql-insert.html
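A rough translation along those lines might look like this - a sketch,
not tested against the real volkszaehler schema. lag() stands in for the
MySQL @prev_timestamp session variable, INSERT ... ON CONFLICT for
REPLACE INTO, and NULLIF() guards the division (PostgreSQL raises an
error on division by zero where MySQL returns NULL). It assumes the
aggregate table has a unique constraint on (channel_id, type,
timestamp); the literals 3 and 5 stand in for the two ? parameters:

INSERT INTO aggregate (channel_id, type, "timestamp", value, count)
SELECT channel_id,
       3 AS type,
       MAX(agg."timestamp") AS "timestamp",
       COALESCE(SUM(agg.val_by_time)
                / NULLIF(MAX(agg."timestamp") - MIN(agg.prev_timestamp), 0),
                AVG(agg.value)) AS value,
       COUNT(agg.value) AS count
FROM (
    SELECT channel_id, "timestamp", value,
           value * ("timestamp" - lag("timestamp") OVER w) AS val_by_time,
           COALESCE(lag("timestamp") OVER w, 0) AS prev_timestamp
    FROM data
    WHERE channel_id = 5
      -- midnight today in epoch milliseconds, matching the
      -- UNIX_TIMESTAMP(DATE_FORMAT(NOW(), "%Y-%m-%d")) * 1000 expression
      AND "timestamp" < EXTRACT(EPOCH FROM date_trunc('day', now())) * 1000
    WINDOW w AS (ORDER BY "timestamp")
) AS agg
GROUP BY channel_id,
         DATE_TRUNC('day', TIMESTAMP 'epoch' + "timestamp" * INTERVAL '1 millisecond')
ON CONFLICT (channel_id, type, "timestamp")
   DO UPDATE SET value = EXCLUDED.value, count = EXCLUDED.count;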
On Mon, 17 Jul 2023 at 21:13, basti <mailinglist@unix-solution.de> wrote:
> volkszaehler=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT DATE_TRUNC('day',
> TIMESTAMP 'epoch' + timestamp * INTERVAL '1 millisecond')) FROM data
> WHERE channel_id = 5 AND timestamp >= 0;

Alternatively, you could express this as:

SELECT COUNT(*)
FROM (SELECT DISTINCT DATE_TRUNC('day', TIMESTAMP 'epoch'
                                 + timestamp * INTERVAL '1 millisecond')
      FROM data
      WHERE channel_id = 5 AND timestamp >= 0) a;

If there were an index on (channel_id, (DATE_TRUNC('day', TIMESTAMP
'epoch' + timestamp * INTERVAL '1 millisecond'))), then the DISTINCT
could be performed efficiently with a Group Aggregate. Otherwise, it can
at least be done with a hash aggregate, and the DISTINCT can run in
parallel (assuming you're using at least PostgreSQL 15).

The yet-to-be-released PostgreSQL 16 will allow more efficient execution
of DISTINCT and ORDER BY aggregates by allowing indexes to provide
pre-sorted input. In the meantime, the query above will probably help
you.

David
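For completeness, that expression index could be created along these
lines (a sketch; the index name is made up, and the indexed expression
must match the query's expression exactly for the planner to consider
it):

CREATE INDEX idx_data_channel_day ON data (
    channel_id,
    (DATE_TRUNC('day', TIMESTAMP 'epoch' + "timestamp" * INTERVAL '1 millisecond'))
);

This is allowed because the inner expression yields a timestamp without
time zone, for which DATE_TRUNC is immutable.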