Thread: SELECT * FROM huge_table LIMIT 10; Why does it take more than 10 min to complete with cold caches?

Hello list,

I run a very simple SELECT on a huge table (350M rows, 10 columns of type
integer or timestamp). The server is an old Mac Mini with 4GB RAM and a
slow HDD. All caches are cold, either by rebooting, or by restarting
postgres and running the `purge` command to flush the OS cache.

The question is why this simple query takes so long to complete.
Most of the time I see a constant 30MB/s of read I/O. Memory usage is low
and the system is not swapping or otherwise unhealthy.  Postgres is
not serving any other requests.

Is postgres reading the whole table sequentially, just to return 10 rows in
no particular order?  I'm trying to understand what PostgreSQL is doing
and the data access patterns involved.  All feedback is appreciated.


# SELECT version();

  PostgreSQL 15.0 on x86_64-apple-darwin20.6.0, compiled by Apple clang version 12.0.0 (clang-1200.0.32.29), 64-bit

# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM test_runs_raw LIMIT 10;

  Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=746005.251..746014.910 rows=10 loops=1)
    Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
    Buffers: shared read=2334526
    I/O Timings: shared/local read=691137.029
    ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
          Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
          Buffers: shared read=2334526
          I/O Timings: shared/local read=691137.029
  Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
  Planning:
    Buffers: shared hit=69 read=18
    I/O Timings: shared/local read=265.550
  Planning Time: 271.719 ms
  Execution Time: 746057.424 ms
(14 rows)


Repeating the query returns instantaneously, and EXPLAIN shows shared
hit=14 and no reads. Changing the LIMIT to 100 takes another 3 min to
respond, with constant read I/O while it runs:


# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM test_runs_raw LIMIT 100;

  Limit  (cost=0.00..2.91 rows=100 width=42) (actual time=0.030..184692.698 rows=100 loops=1)
    Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
    Buffers: shared hit=14 read=772427
    I/O Timings: shared/local read=171484.072
    ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=0.029..184692.664 rows=100 loops=1)
          Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
          Buffers: shared hit=14 read=772427
          I/O Timings: shared/local read=171484.072
  Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
  Planning Time: 0.083 ms
  Execution Time: 184692.748 ms
(11 rows)


Rerunning this is instantaneous again.
Other postgres settings:

shared_buffers = 1024MB
effective_io_concurrency = 0
wal_compression = zstd
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
checkpoint_timeout = 15min


Bonus question: some time ago the server crashed and rebooted, and after a
long WAL recovery and some hours of idle time, I ran the same query.  It
took even longer to respond, and besides the reads there was also a lot
of write I/O involved.  What kind of writes was postgres doing?


# EXPLAIN (ANALYZE, VERBOSE,BUFFERS,SETTINGS)  SELECT * FROM test_runs_raw LIMIT 10;

  Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=1284114.810..1284114.819 rows=10 loops=1)
    Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
    Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
    I/O Timings: shared/local read=1041111.264 write=107258.150
    ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=1283955.130..1283955.135 rows=10 loops=1)
          Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
          Buffers: shared hit=47490 read=2287036 dirtied=1809974 written=1762484
          I/O Timings: shared/local read=1041111.264 write=107258.150
  Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
  Planning Time: 0.545 ms
  Execution Time: 1284114.972 ms
(11 rows)


Relevant settings:

synchronous_commit = off
wal_recycle = on
min_wal_size = 1GB
max_wal_size = 10GB
track_io_timing = on
track_wal_io_timing = on
wal_compression = zstd


Thanks in advance,
Dimitris




I managed to double-paste a different subset of my settings. FWIW all of
them apply to all the experiments. Sorry for the confusion.


On Tue, 31 Jan 2023, Dimitrios Apostolou wrote:

> Other postgres settings:
>
> shared_buffers = 1024MB
> effective_io_concurrency = 0
> wal_compression = zstd
> wal_recycle = on
> min_wal_size = 1GB
> max_wal_size = 10GB
> checkpoint_timeout = 15min

[...]

> Relevant settings:
>
> synchronous_commit = off
> wal_recycle = on
> min_wal_size = 1GB
> max_wal_size = 10GB
> track_io_timing = on
> track_wal_io_timing = on
> wal_compression = zstd



On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
 
    ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
          Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
          Buffers: shared read=2334526
          I/O Timings: shared/local read=691137.029

The system has to return 10 live rows to you.  If it needs to search through that many buffers to find 10 live rows you most likely have a large bloating problem going on.  Seems like it is time to vacuum full.

David J.


Dimitrios Apostolou <jimis@gmx.net> writes:
> The question is why this simple query is taking so long to complete.

Do you get the same 10 rows when you repeat the command?

On the basis of the limited info you provided, I'm going to guess that

(1) there are huge subranges of the table containing no live rows,
so that a seqscan might have to pass over many blocks before it finds
some to return;

(2) once you do reach an area having live rows, the next SELECT picks
up scanning in that same area due to the effects of
"synchronize_seqscans", so you get immediate answers until you reach
the next desert of dead tuples.

If turning synchronize_seqscans off changes the behavior, that'd
be a good clue that this is the right theory.  As for a real fix,
it might be time for a VACUUM FULL or CLUSTER on that table.
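
Concretely, the test could look like this (session-level setting, so no
restart is needed; table name as posted upthread):

```sql
-- Disable synchronized seqscans for this session only, so every
-- scan starts from block 0 instead of the shared start point.
SET synchronize_seqscans = off;

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM test_runs_raw LIMIT 10;

-- If the query is now slow on every run (not just the first), the
-- fast repeats were due to the scan start point, not the caches.
```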

            regards, tom lane



On Tue, 31 Jan 2023, David G. Johnston wrote:

> On Tue, Jan 31, 2023 at 8:07 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
>  
>           ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=745910.672..745910.677 rows=10 loops=1)
>                 Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
>                 Buffers: shared read=2334526
>                 I/O Timings: shared/local read=691137.029
>
>
> The system has to return 10 live rows to you.  If it needs to search through that many buffers to find 10 live rows you most likely have a large bloating problem going on.  Seems like it is time to vacuum full.

I looked up on how to measure bloat, so I run the query found at [1].

[1] https://wiki.postgresql.org/wiki/Show_database_bloat

The first two rows show huge bloat on the two indices of this table:

... ORDER BY wastedbytes DESC LIMIT 2;
  current_database | schemaname |   tablename   | tbloat | wastedbytes |            iname             | ibloat | wastedibytes
 ------------------+------------+---------------+--------+-------------+------------------------------+--------+--------------
  coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_pkey           |    1.0 |            0
  coin             | public     | test_runs_raw |    1.8 | 21742305280 | test_runs_raw_idx_workitem_n |    0.3 |            0
(2 rows)

Is this bloat even affecting queries that do not use the index?

It seems I have to add VACUUM FULL to the nightly maintenance. I had run
some schema restructuring (several ADD COLUMN followed by UPDATE ... SET on
all rows) some days ago, and I was not aware this had degraded the table.
Thanks for the useful info!
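
For exact numbers rather than the wiki query's estimates, one option is
the pgstattuple contrib module (this is a suggestion, not something
mentioned upthread; it scans the whole relation, so it is slow on a table
this size):

```sql
-- Needs the contrib extension installed once per database.
CREATE EXTENSION IF NOT EXISTS pgstattuple;

SELECT table_len,
       tuple_percent,       -- % of the table that is live data
       dead_tuple_percent,  -- % occupied by dead tuples
       free_percent         -- % that is reusable free space
FROM pgstattuple('test_runs_raw');
```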


Dimitris


> On Jan 31, 2023, at 07:40, Dimitrios Apostolou <jimis@gmx.net> wrote:
> Is this bloat even affecting queries that do not use the index?

No, but a bloated index often (although not always) goes along with a bloated table.

> It seems I have to add VACUUM FULL to nightly maintainance.

I wouldn't go that far; that's basically changing your oil every time you get gas.  However, monitoring bloat and either rebuilding the indexes (if they're all that's bloated) or using pg_repack [1] periodically is a good practice.

[1] https://github.com/reorg/pg_repack


On Tue, Jan 31, 2023 at 8:28 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dimitrios Apostolou <jimis@gmx.net> writes:
> The question is why this simple query is taking so long to complete.

Do you get the same 10 rows when you repeat the command?

On the basis of the limited info you provided, I'm going to guess that

(1) there are huge subranges of the table containing no live rows,
so that a seqscan might have to pass over many blocks before it finds
some to return;

It feels like there is room for improvement here using table statistics and the visibility map to significantly reduce the number of pages retrieved that turn out to be all dead.  A limit without an order is not a query to optimize for of course...


(2) once you do reach an area having live rows, the next SELECT picks
up scanning in that same area due to the effects of
"synchronize_seqscans", so you get immediate answers until you reach
the next desert of dead tuples.

Interesting, I was assuming that the behavior on repeated calls was more due to intelligent buffer pool usage.  When doing a sequential scan (I suppose the key word being sequential...) the system decides which pages to fetch before checking to see if they are already in shared buffers (as opposed to returning the ones in shared buffers first then fetching more if needed)?

David J.

Thanks for the insight on the internals. Regarding your questions:

On Tue, 31 Jan 2023, Tom Lane wrote:

> Do you get the same 10 rows when you repeat the command?

Yes. Just tested with both cold and hot caches. The first 10 rows are
exactly the same, either they return slowly or immediately.

> If turning synchronize_seqscans off changes the behavior, that'd
> be a good clue that this is the right theory.

Turning it off makes the query slow no matter how many times I re-run it.
The system is doing lots of read I/O with both hot and cold caches. Here
is the EXPLAIN output from the hot cache run (that previously had only 14
hits and no reads):

  Limit  (cost=0.00..0.29 rows=10 width=42) (actual time=620510.813..620510.821 rows=10 loops=1)
    Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
    Buffers: shared hit=64 read=2334462
    I/O Timings: shared/local read=567846.559
    ->  Seq Scan on public.test_runs_raw  (cost=0.00..9250235.80 rows=317603680 width=42) (actual time=620510.800..620510.804 rows=10 loops=1)
          Output: run_n, test_name_n, workitem_n, started_on, duration_ms, test_result_n, test_executable_n, test_function_n, test_datatag_n
          Buffers: shared hit=64 read=2334462
          I/O Timings: shared/local read=567846.559
  Settings: effective_cache_size = '2GB', max_parallel_workers_per_gather = '0', work_mem = '64MB'
  Planning Time: 0.099 ms
  Execution Time: 620510.855 ms


After reading the docs, I'm surprised this setting affects my case given
that I have no parallelism in my setup.


> As for a real fix,
> it might be time for a VACUUM FULL or CLUSTER on that table.

Regarding CLUSTER, would it help with a seqscan on a bloated table?
Furthermore, given that the table is expected to grow every day by a few
million rows, do you suggest running CLUSTER every night? Will postgres
remember that the rows up to N are clustered, even after appending more
rows?


Dimitris



Dimitrios Apostolou <jimis@gmx.net> writes:
> On Tue, 31 Jan 2023, Tom Lane wrote:
>> Do you get the same 10 rows when you repeat the command?

> Yes. Just tested with both cold and hot caches. The first 10 rows are
> exactly the same, either they return slowly or immediately.

Hm.  I don't recall exactly how synchronize_seqscans works --- it
definitely changes the point in the table where a seqscan begins
scanning, but I don't remember where/when that point gets updated
(and I'm too lazy to go look).  This result suggests that the first
query advances the table's sync start point to someplace a bit before
the first live tuple, not after all the returned tuples as I'd been
thinking.  It might be that the start point is always the beginning
of a block, so that if you fetch only a few tuples that are all in
the same block then the next attempt will re-fetch them.  If you
increase the LIMIT to say 1000, does the behavior (with
synchronize_seqscans on) change?

>> If turning synchronize_seqscans off changes the behavior, that'd
>> be a good clue that this is the right theory.

> Turning it off makes the query slow no matter how many times I re-run it.

OK, well that's enough of a smoking gun to say that it is the sync
start point that's the relevant state, and not the contents of your
data caches at all.

Anyway, the follow-on discussion makes it clear that you do indeed
have a badly bloated table, and fixing that situation should fix
this performance issue.

            regards, tom lane



On Tue, 31 Jan 2023, David G. Johnston wrote:
>
> It feels like there is room for improvement here using table statistics and the visibility map to significantly reduce the number of pages retrieved that turn out to be all dead.

Yes, I too wish postgres was more clever for a simple unordered query. I
should also mention that this table has two btree indices, one primary key
index and one on another column. Would it be feasible to "descend" an
index in order to immediately find some live data?

> A limit without an order is not a query to optimize for of course...

A good reason is that the web interface I use to browse my tables (and I
guess many others do too) issues a simple limited+unordered SELECT when
you click on a table. Timing out was what made me look deeper. I couldn't
accept that, no matter how slow my system is, it has trouble returning a
few rows.

It is also the first query somebody will type into psql to interactively
get a clue about the contents of a table.


Regards,
Dimitris

On Wed, Feb 1, 2023 at 11:15 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
On Tue, 31 Jan 2023, David G. Johnston wrote:
>
> It feels like there is room for improvement here using table statistics and the visibility map to significantly reduce the number of pages retrieved that turn out to be all dead. 

Yes, I too wish postgres was more clever for a simple unordered query. I
should also mention that this table has two btree indices, one primary key
index and one on another column. Would it be feasible to "descend" an
index in order to immediately find some live data?

That isn't possible since indexes don't have tuple visibility information within them; most of the dead bloat entries present in the table have dead bloat entries pointing to them in the index.

David J.

Thanks everyone for the help.

I have now run a plain VACUUM but it didn't change anything; the simple
SELECT is still slow.

My understanding by reading the docs is that it should reclaim all unused
space, just not return it to the OS. Which is fine by me. Any idea why it
failed to reclaim the space in my case?

I'm now running VACUUM FULL as everyone suggested. I had tried plain
VACUUM first because I was curious whether it would suffice, and because
it doesn't lock the table for the hours the full rewrite takes on my setup.
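
Since VACUUM FULL on a table this size runs for hours, its progress can be
watched from a second psql session via the progress view (shared by VACUUM
FULL and CLUSTER on PostgreSQL 12 and later; not something discussed
upthread, just a convenience):

```sql
-- Run from another session while VACUUM FULL is executing.
SELECT phase,
       heap_blks_total,
       heap_blks_scanned,
       round(100.0 * heap_blks_scanned
             / nullif(heap_blks_total, 0), 1) AS pct_scanned
FROM pg_stat_progress_cluster;
```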


Thanks,
Dimitris




On Wed, Feb 1, 2023 at 11:27 AM Dimitrios Apostolou <jimis@gmx.net> wrote:
I have now run simple VACUUM but it didn't change anything, the simple
SELECT is still slow.

My understanding by reading the docs is that it should reclaim all unused
space, just not return it to the OS. Which is fine by me. Any idea why it
failed to reclaim the space in my case?

The system just isn't that intelligent for "sequential scan", instead it does literally what the label says, goes through the table one page at a time and returns any live rows it finds.  You already had lots of completely dead pages at the start of the table and after vacuum those pages still exist (pages are sequentially numbered and gapless) just with a different kind of dead contents.

In short, say you have 20 pages, 1-10 dead and 11-20 alive.  The only way to change which pages exist and get rid of 1-10 is to rewrite the table putting the contents in 11-20 into the newly recreated pages 1-10.  VACUUM doesn't do that - it just makes it so when new data is written they can be placed into the abandoned 1-10 range.  If for some reason page 21 existed and VACUUM saw it was all dead it would remove page 21 from the end of the relation since that wouldn't affect "sequential and gapless".
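
To make the page illustration concrete: each row's ctid encodes (block
number, tuple offset), so a quick probe like the following (table and
column names as posted upthread) would show how deep into the table the
first live rows sit:

```sql
-- If the first live rows report blocks around 2.3M rather than
-- block 0, the scan waded through that many dead pages first.
SELECT ctid, run_n FROM test_runs_raw LIMIT 10;
```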

David J.


> On Feb 1, 2023, at 10:45, David G. Johnston <david.g.johnston@gmail.com> wrote:
> The system just isn't that intelligent for "sequential scan", instead it does literally what the label says, goes through the table one page at a time and returns any live rows it finds.

Although this does raise a question: Couldn't it skip pages that are marked appropriately in the visibility map?


Christophe Pettus <xof@thebuild.com> writes:
>> On Feb 1, 2023, at 10:45, David G. Johnston <david.g.johnston@gmail.com> wrote:
>> The system just isn't that intelligent for "sequential scan", instead it does literally what the label says, goes through the table one page at a time and returns any live rows it finds.

> Although this does raise a question: Couldn't it skip pages that are marked appropriately in the visibility map?

There's no bit in the visibility map for "contains no live tuples".
I think adding one would be optimizing for the wrong thing.

(You might be thinking of the free-space map, but we don't maintain
that rigorously enough to trust it as a guide to skipping pages.)
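
For reference, the pg_visibility contrib module exposes the per-page bits
that do exist (all-visible and all-frozen; there is no "all dead" state a
scan could use to skip pages):

```sql
-- Requires: CREATE EXTENSION pg_visibility;
-- Counts the table's pages by visibility-map state.
SELECT all_visible, all_frozen, count(*) AS pages
FROM pg_visibility_map('test_runs_raw')
GROUP BY 1, 2;
```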

            regards, tom lane