Thread: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
From
Dimitrios Apostolou
Date:
Hello list,

I'm running a very simple SELECT on a huge table (350M rows, 10 columns of type integer or timestamp). The server is an old Mac Mini with 4GB RAM and a slow HDD. All caches are cold, either via reboot or by restarting postgres and using the `purge` command to flush the OS cache.

The question is why this simple query is taking so long to complete. Most of the time I see constant 30MB/s read I/O. The memory usage is low and the system is not swapping or otherwise unhealthy. Postgres is not serving any other requests. Is postgres reading the whole table sequentially, just to return 10 rows in no particular order? I'm trying to understand what PostgreSQL is doing and the data access patterns involved. All feedback is appreciated.

# SELECT version();
 PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit

# EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) SELECT * FROM test_runs_raw LIMIT 10;
 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=746005.251..746014.910 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared read=2334526
   I/O Timings: shared/local read=691137.029
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared read=2334526
         I/O Timings: shared/local read=691137.029
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning:
   Buffers: shared hit=69 read=18
   I/O Timings: shared/local read=265.550
 Planning Time: 271.719 ms
 Execution Time: 746057.424 ms
(14 rows)

Repeating the query returns instantaneously, and EXPLAIN shows shared hit=14 and no reads.

Changing the LIMIT to 100 takes another 3min to respond, with constant read I/O while it runs:

# EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) SELECT * FROM test_runs_raw LIMIT 100;
 Limit  (cost=0.00..2.91 rows=100 width=42) (actual time=0.030..184692.698 rows=100 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=14 read=772427
   I/O Timings: shared/local read=171484.072
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=0.029..184692.664 rows=100 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=14 read=772427
         I/O Timings: shared/local read=171484.072
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.083 ms
 Execution Time: 184692.748 ms
(11 rows)

Rerunning this is instantaneous again.

Other postgres settings:

shared_buffers = 1024MB
effective_io_concurrency = 0
wal_compression = zstd
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
checkpoint_timeout = 15min

Bonus question: some time ago the server crashed and rebooted, and after a long WAL recovery and some hours of idle time, I ran the same query. It took even longer to respond, and besides reading there was also a lot of writing involved. What kind of writes was postgres doing?
# EXPLAIN (ANALYZE, VERBOSE, BUFFERS, SETTINGS) SELECT * FROM test_runs_raw LIMIT 10;
 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=1284114.810..1284114.819 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
   I/O Timings: shared/local read=1041111.264 write=107258.150
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=1283955.130..1283955.135 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
         I/O Timings: shared/local read=1041111.264 write=107258.150
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.545 ms
 Execution Time: 1284114.972 ms
(11 rows)

Relevant settings:

synchronous_commit = off
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
track_io_timing = on
track_wal_io_timing = on
wal_compression = zstd

Thanks in advance,
Dimitris
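One quick way to put the 2334526 buffers read by the seq scan into context is to compare them with the table's total page count; a minimal sketch, assuming the table name from the plan above:

-- Pages and estimated row count known to the planner, plus the on-disk heap size.
SELECT relpages,
       reltuples::bigint                                 AS est_rows,
       pg_size_pretty(pg_relation_size('test_runs_raw')) AS heap_size
FROM pg_class
WHERE relname = 'test_runs_raw';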
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
From
Dimitrios Apostolou
Date:
I managed to double-paste different subsets of my settings. FWIW all of them are true for all the experiments. Sorry for the confusion.

On Tue, 31 Jan 2023, Dimitrios Apostolou wrote:

> Other postgres settings:
>
> shared_buffers = 1024MB
> effective_io_concurrency = 0
> wal_compression = zstd
> wal_recycle = on
> min_wal_size = 1GB
> max_wal_size = 10GB
> checkpoint_timeout = 15min

[...]

> Relevant settings:
>
> synchronous_commit = off
> wal_recycle = on
> min_wal_size = 1GB
> max_wal_size = 10GB
> track_io_timing = on
> track_wal_io_timing = on
> wal_compression = zstd
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
From
"David G. Johnston"
Date:
On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
-> Seq Scan on public.test_runs_raw (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
Buffers: shared read=2334526
I/O Timings: shared/local read=691137.029
The system has to return 10 live rows to you. If it needs to search through that many buffers to find 10 live rows, you most likely have a large bloat problem going on. Seems like it is time for a VACUUM FULL.
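A quick way to gauge how bad it is, as a sketch against the standard statistics view (table name taken from the plan above):

-- Live vs dead tuple estimates and the last (auto)vacuum times.
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'test_runs_raw';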
David J.
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
From
Tom Lane
Date:
Dimitrios Apostolou <jimis@gmx.net> writes:
> The question is why this simple query is taking so long to complete.

Do you get the same 10 rows when you repeat the command?

On the basis of the limited info you provided, I'm going to guess that
(1) there are huge subranges of the table containing no live rows, so that
a seqscan might have to pass over many blocks before it finds some to
return; (2) once you do reach an area having live rows, the next SELECT
picks up scanning in that same area due to the effects of
"synchronize_seqscans", so you get immediate answers until you reach the
next desert of dead tuples.

If turning synchronize_seqscans off changes the behavior, that'd be a good
clue that this is the right theory.

As for a real fix, it might be time for a VACUUM FULL or CLUSTER on that
table.

			regards, tom lane
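The suggested test could look roughly like this (a sketch; the setting can be changed per session):

SET synchronize_seqscans = off;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_runs_raw LIMIT 10;
RESET synchronize_seqscans;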
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
From
Dimitrios Apostolou
Date:
On Tue, 31 Jan 2023, David G. Johnston wrote:

> On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
>
>     ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
>           Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
>           Buffers: shared read=2334526
>           I/O Timings: shared/local read=691137.029
>
> The system has to return 10 live rows to you. If it needs to search through that many buffers to find 10 live rows, you most likely have a large bloat problem going on. Seems like it is time for a VACUUM FULL.

I looked up how to measure bloat, so I ran the query found at [1].

[1] https://wiki.postgresql.org/wiki/Show_database_bloat

The first two rows show huge bloat on the two indices of this table:

... ORDER BY wastedbytes DESC LIMIT 2;

 current_database | schemaname |   tablename   | tbloat | wastedbytes |            iname             | ibloat | wastedibytes
------------------+------------+---------------+--------+-------------+------------------------------+--------+--------------
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_pkey           |    1.0 |            0
 coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_idx_workitem_n |    0.3 |            0
(2 rows)

Is this bloat even affecting queries that do not use the index? It seems I have to add VACUUM FULL to nightly maintenance.

I had run some schema restructuring (several ADD COLUMN followed by UPDATE SET on all rows) some days ago, and I was not aware this degraded the table.

Thanks for the useful info!
Dimitris
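For an exact measurement rather than the wiki query's estimate, the pgstattuple extension can report live vs dead tuples and free space directly; a sketch, assuming the extension is available (note it scans the whole relation, so it is slow on a table this size):

CREATE EXTENSION IF NOT EXISTS pgstattuple;

-- Reports table_len, live/dead tuple counts and bytes, and free space.
SELECT * FROM pgstattuple('public.test_runs_raw');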
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
From
Christophe Pettus
Date:
> On Jan 31, 2023, at 07:40, Dimitrios Apostolou <jimis@gmx.net> wrote:
> Is this bloat even affecting queries that do not use the index?

No, but a bloated index often (although not always) goes along with a bloated table.

> It seems I have to add VACUUM FULL to nightly maintenance.

I wouldn't go that far; that's basically changing your oil every time you get gas. However, monitoring bloat and either rebuilding the indexes (if they're all that's bloated) or using pg_repack [1] periodically is a good practice.

[1] https://github.com/reorg/pg_repack
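If it turns out only the indexes are bloated, rebuilding them in place avoids the full-table rewrite; a sketch, assuming the index names from the bloat output above (REINDEX CONCURRENTLY requires PostgreSQL 12 or later and avoids blocking writes):

REINDEX INDEX CONCURRENTLY test_runs_raw_pkey;
REINDEX INDEX CONCURRENTLY test_runs_raw_idx_workitem_n;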
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
From
"David G. Johnston"
Date:
On Tue, Jan 31, 2023 at 8:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dimitrios Apostolou <jimis@gmx.net> writes:
> The question is why this simple query is taking so long to complete.
Do you get the same 10 rows when you repeat the command?
On the basis of the limited info you provided, I'm going to guess that
(1) there are huge subranges of the table containing no live rows,
so that a seqscan might have to pass over many blocks before it finds
some to return;
It feels like there is room for improvement here using table statistics and the visibility map to significantly reduce the number of pages retrieved that turn out to be all dead. A limit without an order is not a query to optimize for of course...
(2) once you do reach an area having live rows, the next SELECT picks
up scanning in that same area due to the effects of
"synchronize_seqscans", so you get immediate answers until you reach
the next desert of dead tuples.
Interesting, I was assuming that the behavior on repeated calls was more due to intelligent buffer pool usage. When doing a sequential scan (I suppose the key word being sequential...) the system decides which pages to fetch before checking to see if they are already in shared buffers (as opposed to returning the ones in shared buffers first then fetching more if needed)?
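One way to see what actually sits in shared buffers for this table after a run is the pg_buffercache extension; a rough sketch, assuming it is installed:

CREATE EXTENSION IF NOT EXISTS pg_buffercache;

-- Count the 8kB buffers of test_runs_raw currently cached in this database.
SELECT count(*)                      AS cached_buffers,
       pg_size_pretty(count(*) * 8192) AS cached_size
FROM pg_buffercache
WHERE relfilenode = pg_relation_filenode('test_runs_raw'::regclass)
  AND reldatabase = (SELECT oid FROM pg_database WHERE datname = current_database());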
David J.
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
From
Dimitrios Apostolou
Date:
Thanks for the insight on the internals. Regarding your questions:

On Tue, 31 Jan 2023, Tom Lane wrote:

> Do you get the same 10 rows when you repeat the command?

Yes. Just tested with both cold and hot caches. The first 10 rows are exactly the same, whether they return slowly or immediately.

> If turning synchronize_seqscans off changes the behavior, that'd
> be a good clue that this is the right theory.

Turning it off makes the query slow no matter how many times I re-run it. The system is doing lots of read I/O with both hot and cold caches. Here is the EXPLAIN output from the hot cache run (that previously had only 14 hits and no reads):

 Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=620510.813..620510.821 rows=10 loops=1)
   Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
   Buffers: shared hit=64 read=2334462
   I/O Timings: shared/local read=567846.559
   ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=620510.800..620510.804 rows=10 loops=1)
         Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
         Buffers: shared hit=64 read=2334462
         I/O Timings: shared/local read=567846.559
 Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
 Planning Time: 0.099 ms
 Execution Time: 620510.855 ms

After reading the docs, I'm surprised this setting affects my case, given that I have no parallelism in my setup.

> As for a real fix,
> it might be time for a VACUUM FULL or CLUSTER on that table.

Regarding CLUSTER, would it help with a seqscan on a bloated table? Furthermore, given that the table is expected to grow every day by a few million rows, do you suggest running CLUSTER every night? Will postgres remember that the rows up to N are clustered, even after appending more rows?

Dimitris
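For reference, the CLUSTER form that would apply here looks roughly like this (a sketch, assuming the primary key index name from the earlier bloat output; like VACUUM FULL it rewrites the table and holds an exclusive lock for the duration):

CLUSTER test_runs_raw USING test_runs_raw_pkey;
ANALYZE test_runs_raw;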
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
From
Tom Lane
Date:
Dimitrios Apostolou <jimis@gmx.net> writes:
> On Tue, 31 Jan 2023, Tom Lane wrote:
>> Do you get the same 10 rows when you repeat the command?

> Yes. Just tested with both cold and hot caches. The first 10 rows are
> exactly the same, whether they return slowly or immediately.

Hm. I don't recall exactly how synchronize_seqscans works --- it definitely
changes the point in the table where a seqscan begins scanning, but I don't
remember where/when that point gets updated (and I'm too lazy to go look).
This result suggests that the first query advances the table's sync start
point to someplace a bit before the first live tuple, not after all the
returned tuples as I'd been thinking. It might be that the start point is
always the beginning of a block, so that if you fetch only a few tuples that
are all in the same block then the next attempt will re-fetch them. If you
increase the LIMIT to say 1000, does the behavior (with synchronize_seqscans
on) change?

>> If turning synchronize_seqscans off changes the behavior, that'd
>> be a good clue that this is the right theory.

> Turning it off makes the query slow no matter how many times I re-run it.

OK, well that's enough of a smoking gun to say that it is the sync start
point that's the relevant state, and not the contents of your data caches
at all.

Anyway, the follow-on discussion makes it clear that you do indeed have a
badly bloated table, and fixing that situation should fix this performance
issue.

			regards, tom lane
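The suggested test might look like this (a sketch; synchronize_seqscans is on by default):

SET synchronize_seqscans = on;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_runs_raw LIMIT 1000;
-- Repeat the same EXPLAIN and compare the shared hit/read counts in the
-- Buffers lines between the two runs.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_runs_raw LIMIT 1000;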
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
From
Dimitrios Apostolou
Date:
On Tue, 31 Jan 2023, David G. Johnston wrote:
>
> It feels like there is room for improvement here using table statistics and the visibility map to significantly reduce the number of pages retrieved that turn out to be all dead.

Yes, I too wish postgres was more clever for a simple unordered query. I should also mention that this table has two btree indices, one primary key index and one on another column. Would it be feasible to "descend" an index in order to immediately find some live data?

> A limit without an order is not a query to optimize for of course...

A good reason is that the web interface I'm using to browse my tables (and I guess many others too) does a simple limited+unordered SELECT when you click on a table. Timing out was what made me look deeper. I couldn't accept that no matter how slow my system is, it has trouble returning a few rows. It is also the first query somebody will type in psql to interactively get a clue about the contents of a table.

Regards,
Dimitris
Re: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches
From
"David G. Johnston"
Date:
On Wed, Feb 1, 2023 at 11:15 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
On Tue, 31 Jan 2023, David G. Johnston wrote:
>
> It feels like there is room for improvement here using table statistics and the visibility map to significantly reduce the number of pages retrieved that turn out to be all dead.
Yes, I too wish postgres was more clever for a simple unordered query. I
should also mention that this table has two btree indices, one primary key
index and one on another column. Would it be feasible to "descend" an
index in order to immediately find some live data?
That isn't possible, since indexes don't have tuple visibility information within them; most of the dead bloat tuples present in the table still have index entries pointing to them, so descending an index would lead you through the same dead areas.
David J.
VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)
From
Dimitrios Apostolou
Date:
Thanks everyone for the help.

I have now run a plain VACUUM but it didn't change anything; the simple SELECT is still slow. My understanding from reading the docs is that it should reclaim all unused space, just not return it to the OS. Which is fine by me. Any idea why it failed to reclaim the space in my case?

I'm now running VACUUM FULL as everyone suggested. I just tried plain VACUUM first because I was curious whether it would work, and because it doesn't lock the table with an operation that takes hours on my setup.

Thanks,
Dimitris
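A quick before/after comparison of the table's footprint makes the difference between the two kinds of VACUUM visible; a minimal sketch:

-- Heap only, and heap plus indexes and TOAST.
SELECT pg_size_pretty(pg_relation_size('test_runs_raw'))       AS heap_only,
       pg_size_pretty(pg_total_relation_size('test_runs_raw')) AS with_indexes_and_toast;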
Re: VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)
From
"David G. Johnston"
Date:
On Wed, Feb 1, 2023 at 11:27 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
I have now run simple VACUUM but it didn't change anything, the simple
SELECT is still slow.
My understanding by reading the docs is that it should reclaim all unused
space, just not return it to the OS. Which is fine by me. Any idea why it
failed to reclaim the space in my case?
The system just isn't that intelligent about a "sequential scan"; instead it does literally what the label says: it goes through the table one page at a time and returns any live rows it finds. You already had lots of completely dead pages at the start of the table, and after vacuum those pages still exist (pages are sequentially numbered and gapless), just with a different kind of dead contents.
In short, say you have 20 pages, 1-10 dead and 11-20 alive. The only way to change which pages exist and get rid of 1-10 is to rewrite the table, putting the contents of 11-20 into newly recreated pages 1-10. VACUUM doesn't do that - it just makes it so that newly written data can be placed into the abandoned 1-10 range. If for some reason page 21 existed and VACUUM saw it was all dead, it would remove page 21 from the end of the relation, since that wouldn't affect "sequential and gapless".
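One way to see this effect directly on the bloated table, as a sketch (run with synchronize_seqscans off so the scan starts at block zero): the block number in the first live row's ctid shows how many leading heap pages hold no live tuples at all.

-- ctid is (block_number, tuple_index); a large first component means the
-- scan waded through that many pages before hitting the first live row.
SELECT ctid FROM test_runs_raw LIMIT 1;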
David J.
Re: VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)
From
Christophe Pettus
Date:
> On Feb 1, 2023, at 10:45, David G. Johnston <david.g.johnston@gmail.com> wrote:
> The system just isn't that intelligent about a "sequential scan"; instead it does literally what the label says: it goes through the table one page at a time and returns any live rows it finds.

Although this does raise a question: couldn't it skip pages that are marked appropriately in the visibility map?
Re: VACUUM vs VACUUM FULL (was: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete, with cold caches)
From
Tom Lane
Date:
Christophe Pettus <xof@thebuild.com> writes:
>> On Feb 1, 2023, at 10:45, David G. Johnston <david.g.johnston@gmail.com> wrote:
>> The system just isn't that intelligent about a "sequential scan"; instead it does literally what the label says: it goes through the table one page at a time and returns any live rows it finds.

> Although this does raise a question: couldn't it skip pages that are marked appropriately in the visibility map?

There's no bit in the visibility map for "contains no live tuples".
I think adding one would be optimizing for the wrong thing. (You might
be thinking of the free-space map, but we don't maintain that rigorously
enough to trust it as a guide to skipping pages.)

			regards, tom lane
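For completeness, what the visibility map does track (all-visible and all-frozen bits, per page) can be inspected with the pg_visibility extension; a sketch, assuming it is installed:

CREATE EXTENSION IF NOT EXISTS pg_visibility;

-- Number of pages flagged all-visible and all-frozen for the table.
SELECT * FROM pg_visibility_map_summary('test_runs_raw');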