Re: Sequential scans - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Re: Sequential scans
Date
Msg-id 4639A342.10305@enterprisedb.com
Whole thread Raw
In response to Re: Sequential scans  ("Simon Riggs" <simon@enterprisedb.com>)
List pgsql-hackers
Simon Riggs wrote:
> We need to think of the interaction with partitioning here. People will
> ask whether we would recommend that individual partitions of a large
> table should be larger/smaller than a particular size, to allow these
> optimizations to kick in.
> 
> My thinking is that database designers would attempt to set partition
> size larger than the sync scan limit, whatever it is. That means:
> - they wouldn't want the limit to vary when cache increases, so we *do*
> need a GUC to control the limit. My suggestion now would be
> large_scan_threshold, since it effects both caching and synch scans.

They wouldn't? If you add more memory to your box, so that a table that 
didn't fit in memory before does now fit, surely you want to switch your 
strategy from "don't pollute the cache because it won't fit anyway" to 
"let's keep it in cache, so the next scan won't do I/O".

The basic problem with partitions is that if you have a query like 
"SELECT * FROM partitioned_table", so that you seq scan multiple 
partitions, the size of each partition alone could be below the 
threshold, whatever that is, but since you're scanning them all the net 
result is the same as scanning one large table above the threshold. The 
same could happen in any plan that does multiple seq scans. It could 
even happen at the application level if the application submits multiple 
statements like "SELECT * FROM table1", "SELECT * FROM table2" one after 
each other.

One way to address that would be to manage the recycle buffer ring size 
dynamically. When a backend gets a cache miss, the ring would shrink, 
and when you get cache hits, it would grow. That way, if you do a seq 
scan on a table that fits in cache repeatedly, that table would get more 
buffers from the cache on each iteration until it's completely in cache. 
But if the table or tables being scanned are too big to fit in cache, 
the ring would stay small and not pollute the cache much.

I'm not going to implement that for now, I'm seeing some scary negative 
feedback behavior with that, and it'd need a lot of testing anyway. I'm 
thinking of just using shared_buffers as the limit. One could argue for 
effective_cache_size as well.

> - so there will be lots of partitions, so a hardcoded limit of 1000
> would not be sufficient. A new GUC, or a link to an existing one, is
> probably required.

No matter how many partitions you have, each backend could still be 
scanning only one of them at a time.

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: "Pavan Deolasee"
Date:
Subject: Re: Patch queue triage
Next
From: Hannes Eder
Date:
Subject: Re: Subversion repo up