Thread: Query takes a long time and uses no index
Hello,

I use volkszaehler.org (a DIY smart meter) on Postgres.

The hardware is a dual-core 2.4 GHz amd64 CPU with 6 GB RAM.
The database is approx. 1.5 GB, on SSD.

My local config looks like this:

root@vz:~# cat /etc/postgresql/15/main/conf.d/local.conf
# DB Version: 15
# OS Type: linux
# Total Memory (RAM): 6 GB
# CPUs num: 2
# Connections num: 20
# Data Storage: ssd

max_connections = 20
shared_buffers = 2GB
effective_cache_size = 4608MB
maintenance_work_mem = 384MB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 100
random_page_cost = 0.9
effective_io_concurrency = 200
min_wal_size = 1GB
max_wal_size = 4GB
work_mem = 512MB
temp_buffers = 512MB

# log
log_directory = '/var/log/postgresql/'  # directory where log files are written,
                                        # can be absolute or relative to PGDATA
log_filename = 'postgresql-15-main.log'
#log_statement = 'all'
#logging_collector = on
log_temp_files = 0
session_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '3s'

The Values:

volkszaehler=# SELECT * FROM data LIMIT 10;
   timestamp   | channel_id | value
---------------+------------+-------
 1666008690865 |          5 |   232
 1666008691867 |          5 |   236
 1666008692865 |          5 |   237
 1666008693867 |          5 |   232
 1666008694865 |          5 |   227
 1666008695867 |          5 |   229
 1666008698866 |          5 |   228
 1666008699868 |          5 |   229
 1666008700864 |          5 |   228
 1666008701868 |          5 |   230

volkszaehler=# SELECT COUNT(value) FROM data;
  count
----------
 17884247
(1 row)

The Indexes:

volkszaehler=# SELECT tablename, indexname, indexdef FROM pg_indexes
WHERE tablename LIKE 'data%' ORDER BY tablename, indexname;

 tablename |           indexname            | indexdef
-----------+--------------------------------+--------------------------------------------------------------------------------------------------
 data      | idx_16391_idx_adf3f36372f5a1aa | CREATE INDEX idx_16391_idx_adf3f36372f5a1aa ON volkszaehler.data USING btree (channel_id)
 data      | idx_16391_primary              | CREATE UNIQUE INDEX idx_16391_primary ON volkszaehler.data USING btree (channel_id, "timestamp")
 data      | idx_data_timestamp             | CREATE INDEX idx_data_timestamp ON volkszaehler.data USING btree ("timestamp")
(3 rows)

The Query:

volkszaehler=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT DATE_TRUNC('day',
TIMESTAMP 'epoch' + timestamp * INTERVAL '1 millisecond')) FROM data
WHERE channel_id = 5 AND timestamp >= 0;

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=590793.68..590793.69 rows=1 width=8) (actual time=15449.536..15449.539 rows=1 loops=1)
   ->  Seq Scan on data  (cost=0.00..382037.82 rows=16700469 width=8) (actual time=247.092..3833.495 rows=16669429 loops=1)
         Filter: (("timestamp" >= 0) AND (channel_id = 5))
         Rows Removed by Filter: 1215163
 Planning Time: 0.374 ms
 JIT:
   Functions: 5
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.201 ms, Inlining 158.833 ms, Optimization 59.816 ms, Emission 28.472 ms, Total 248.322 ms
 Execution Time: 15451.093 ms
(10 rows)

Roughly 16 seconds is too long; the frontend runs into timeouts or other
errors.
A simple count looks like this:

volkszaehler=# EXPLAIN ANALYZE SELECT count(channel_id) FROM data;

                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=208089.76..208089.77 rows=1 width=8) (actual time=3514.293..3523.842 rows=1 loops=1)
   ->  Gather  (cost=208089.55..208089.76 rows=2 width=8) (actual time=3514.247..3523.800 rows=3 loops=1)
         Workers Planned: 2
         Workers Launched: 2
         ->  Partial Aggregate  (cost=207089.55..207089.56 rows=1 width=8) (actual time=3427.139..3427.141 rows=1 loops=3)
               ->  Parallel Seq Scan on data  (cost=0.00..188457.44 rows=7452844 width=8) (actual time=0.137..2121.695 rows=5962263 loops=3)
 Planning Time: 0.247 ms
 JIT:
   Functions: 11
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 2.665 ms, Inlining 0.000 ms, Optimization 1.612 ms, Emission 48.580 ms, Total 52.857 ms
 Execution Time: 3525.656 ms
(12 rows)

Why are the indexes not used?
Is it a problem of the DATE_TRUNC?
How can I optimize?

Best regards
On 7/17/23 11:13, basti wrote:
> [configuration, sample data, indexes, query, and EXPLAIN output snipped;
> see the original post above]
> A simple count looks like this:
>
> volkszaehler=# EXPLAIN ANALYZE SELECT count(channel_id) FROM data;
>
> [parallel seq scan plan snipped; execution time ~3.5 s]
>
> Why are the indexes not used?

It's a common misconception that indexes necessarily improve performance
and that slow queries are slow because of missing indexes. In practice,
indexes are most useful for selective queries, i.e. queries that scan
only a fraction of the table - say, a couple of percent. But this query
scans ~17M rows from the table, and my guess is that's a significant
part of it. In that case, the index is useless.

Also, notice the seqscan took ~4s. Even if that dropped to ~1s with an
index, the overall query would still take ~12s.

> Is it a problem of the DATE_TRUNC?

My guess is it's more about the DISTINCT, which forces a sort before the
aggregate function - which also means the query can't run in parallel.
You can try simplifying the query step by step: remove the DISTINCT
first, then the DATE_TRUNC.

> How can I optimize?

It's always going to be slow with the COUNT(DISTINCT), I'm afraid.

Not sure how much you can modify the query / database, or how accurate
the results need to be. If you're OK with estimates, you can try the
postgresql-hll extension [1], which estimates count(distinct). For exact
results, I wrote the count_distinct extension [2], which uses a hash
table. Might be worth a try, I guess.

Another option is to precalculate things and run the queries against
that. That is, you're not going to add data with old timestamps, so you
can pre-aggregate the historical data.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
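To illustrate the pre-aggregation idea, a minimal sketch with a
hypothetical daily summary table (the daily_data table and its layout
are made up here, not part of volkszaehler):

CREATE TABLE daily_data (
    channel_id bigint NOT NULL,
    day        date   NOT NULL,
    PRIMARY KEY (channel_id, day)
);

-- run periodically; re-inserting already-known days is a no-op
INSERT INTO daily_data (channel_id, day)
SELECT DISTINCT channel_id,
       (DATE_TRUNC('day', TIMESTAMP 'epoch'
                   + "timestamp" * INTERVAL '1 millisecond'))::date
FROM data
ON CONFLICT DO NOTHING;

-- the expensive COUNT(DISTINCT ...) then becomes a cheap index scan
SELECT COUNT(*) FROM daily_data WHERE channel_id = 5;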
On 7/17/23 13:20, Tomas Vondra wrote:
> ...
>
> It's always going to be slow with the COUNT(DISTINCT), I'm afraid.
>
> Not sure how much you can modify the query / database, or how accurate
> the results need to be. If you're OK with estimates, you can try the
> postgresql-hll extension [1], which estimates count(distinct). For exact
> results, I wrote the count_distinct extension [2], which uses a hash
> table. Might be worth a try, I guess.

Damn, I forgot to add the links:

[1] https://github.com/citusdata/postgresql-hll
[2] https://github.com/tvondra/count_distinct

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
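For reference, using [2] would look roughly like this (a sketch based on
the extension's README; the aggregate name and the set of supported
types should be double-checked there):

CREATE EXTENSION count_distinct;

SELECT count_distinct(DATE_TRUNC('day', TIMESTAMP 'epoch'
                      + "timestamp" * INTERVAL '1 millisecond'))
FROM data
WHERE channel_id = 5 AND "timestamp" >= 0;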
On 7/17/23 04:13, basti wrote:
[snip]
>
> The Query:
>
> volkszaehler=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT DATE_TRUNC('day',
> TIMESTAMP 'epoch' + timestamp * INTERVAL '1 millisecond')) FROM data
> WHERE channel_id = 5 AND timestamp >= 0;
>
> [EXPLAIN output snipped; see the original post]
>
> Roughly 16 seconds is too long; the frontend runs into timeouts or
> other errors.

What fraction of the rows in the table meets the "WHERE channel_id = 5
AND timestamp >= 0" qualification? If it's high (and "high" can be a
seemingly low value), then PostgreSQL will decide it's cheaper to
sequentially scan the table.

[remainder of the quoted message snipped]

--
Born in Arizona, moved to Babylonia.
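One quick way to check that fraction, as a sketch (the FILTER clause
requires PostgreSQL 9.4 or later):

SELECT (count(*) FILTER (WHERE channel_id = 5 AND "timestamp" >= 0))::numeric
       / count(*) AS fraction
FROM data;

Given the EXPLAIN output above (~16.7M of ~17.9M rows pass the filter),
the fraction here is about 0.93, which explains the sequential scan.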
Thanks a lot Tomas, I will try it.

I have found out that there is an 'aggregation' function in the
frontend, but it is MySQL-specific and I have no idea how to transform
it to Postgres.

It looks like:

'REPLACE INTO aggregate (channel_id, type, timestamp, value, count)
 SELECT channel_id, ? AS type,
        MAX(agg.timestamp) AS timestamp,
        COALESCE(
            SUM(agg.val_by_time) / (MAX(agg.timestamp) - MIN(agg.prev_timestamp)),
            AVG(agg.value)) AS value,
        COUNT(agg.value) AS count
 FROM (
     SELECT channel_id, timestamp, value,
            value * (timestamp - @prev_timestamp) AS val_by_time,
            COALESCE(@prev_timestamp, 0) AS prev_timestamp,
            @prev_timestamp := timestamp
     FROM data
     CROSS JOIN (SELECT @prev_timestamp := NULL) AS vars
     WHERE channel_id = ?
       AND timestamp < UNIX_TIMESTAMP(DATE_FORMAT(NOW(), "%Y-%m-%d")) * 1000
 ) AS agg
 GROUP BY channel_id,
          DATE_TRUNC('day', TIMESTAMP 'epoch' + timestamp * INTERVAL '1 millisecond')'

with params [3, 5]:

SQLSTATE[42601]: Syntax error: 7 ERROR: syntax error at or near "REPLACE"
LINE 1: REPLACE INTO aggregate (channel_id, type, timestamp, value, ...

On 17.07.23 13:21, Tomas Vondra wrote:
> Damn, I forgot to add the links:
>
> [1] https://github.com/citusdata/postgresql-hll
>
> [2] https://github.com/tvondra/count_distinct
On Tue, 18 Jul 2023 at 06:19, basti <mailinglist@unix-solution.de> wrote:
>
> Thanks a lot Tomas, I will try it.
>
> I have found out that there is an 'aggregation' function in the
> frontend, but it is MySQL-specific and I have no idea how to transform
> it to Postgres.
>
> It looks like:
> 'REPLACE INTO aggregate (channel_id, type, timestamp, value, count)

Perhaps that's something like PostgreSQL's INSERT ... ON CONFLICT [1].

David

[1] https://www.postgresql.org/docs/current/sql-insert.html
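A rough translation along those lines might look like this - a sketch,
not tested against the real volkszaehler schema. lag() stands in for the
MySQL @prev_timestamp session variable, INSERT ... ON CONFLICT for
REPLACE INTO, and NULLIF() guards the division (PostgreSQL raises an
error on division by zero where MySQL returns NULL). It assumes the
aggregate table has a unique constraint on (channel_id, type,
timestamp); the literals 3 and 5 stand in for the two ? parameters:

INSERT INTO aggregate (channel_id, type, "timestamp", value, count)
SELECT channel_id,
       3 AS type,
       MAX(agg."timestamp") AS "timestamp",
       COALESCE(SUM(agg.val_by_time)
                / NULLIF(MAX(agg."timestamp") - MIN(agg.prev_timestamp), 0),
                AVG(agg.value)) AS value,
       COUNT(agg.value) AS count
FROM (
    SELECT channel_id, "timestamp", value,
           value * ("timestamp" - lag("timestamp") OVER w) AS val_by_time,
           COALESCE(lag("timestamp") OVER w, 0) AS prev_timestamp
    FROM data
    WHERE channel_id = 5
      -- midnight today in epoch milliseconds, matching the
      -- UNIX_TIMESTAMP(DATE_FORMAT(NOW(), "%Y-%m-%d")) * 1000 expression
      AND "timestamp" < EXTRACT(EPOCH FROM date_trunc('day', now())) * 1000
    WINDOW w AS (ORDER BY "timestamp")
) AS agg
GROUP BY channel_id,
         DATE_TRUNC('day', TIMESTAMP 'epoch' + "timestamp" * INTERVAL '1 millisecond')
ON CONFLICT (channel_id, type, "timestamp")
   DO UPDATE SET value = EXCLUDED.value, count = EXCLUDED.count;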
On Mon, 17 Jul 2023 at 21:13, basti <mailinglist@unix-solution.de> wrote:
> volkszaehler=# EXPLAIN ANALYZE SELECT COUNT(DISTINCT DATE_TRUNC('day',
> TIMESTAMP 'epoch' + timestamp * INTERVAL '1 millisecond')) FROM data
> WHERE channel_id = 5 AND timestamp >= 0;

Alternatively, you could express this as:

SELECT COUNT(*)
FROM (SELECT DISTINCT DATE_TRUNC('day', TIMESTAMP 'epoch'
                                 + timestamp * INTERVAL '1 millisecond')
      FROM data
      WHERE channel_id = 5 AND timestamp >= 0) a;

If there were an index on (channel_id, (DATE_TRUNC('day', TIMESTAMP
'epoch' + timestamp * INTERVAL '1 millisecond'))), then the DISTINCT
could be performed efficiently with a Group Aggregate. Otherwise, it can
at least be done with a hash aggregate, and the DISTINCT can run in
parallel (assuming you're using at least PostgreSQL 15).

The yet-to-be-released PostgreSQL 16 will allow more efficient execution
of DISTINCT and ORDER BY aggregates by allowing indexes to provide
pre-sorted input. In the meantime, the query above will probably help
you.

David
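For completeness, that expression index could be created along these
lines (a sketch; the index name is made up, and the indexed expression
must match the query's expression exactly for the planner to consider
it):

CREATE INDEX idx_data_channel_day ON data (
    channel_id,
    (DATE_TRUNC('day', TIMESTAMP 'epoch' + "timestamp" * INTERVAL '1 millisecond'))
);

This is allowed because the inner expression yields a timestamp without
time zone, for which DATE_TRUNC is immutable.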