Sequential scans - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Sequential scans
Msg-id 4638918F.90805@enterprisedb.com
List pgsql-hackers
Hi,

I'm starting to review the "synchronized scans" and "scan-resistant 
buffer cache" patches. The patches have complex interactions so I'm 
taking a holistic approach.

There are four outstanding issues with the sync scans in particular:

1. The simplistic hash approach. While it's nice to not have a lock, I'm 
worried about collisions. If you had a collision every now and then, it 
wouldn't be that bad, but because the hash value is computed from the 
oid, a collision would be persistent. If you create a database and 
happen to have two frequently seqscanned tables that collide, the only 
way to get rid of the collision is to drop and recreate a table. 
Granted, that'd probably be very rare in practice, but when it happens 
it would be next to impossible to figure out what's going on.

Let's use a normal hash table instead, and use a lock to protect it. If 
we only update it every 10 pages or so, the overhead should be 
negligible. To further reduce contention, we could modify ReadBuffer to 
let the caller know if the read resulted in a physical read or not, and 
only update the entry when a page is physically read in. That way all 
the synchronized scanners wouldn't be updating the same value, just the 
one performing the I/O. And while we're at it, let's use the full 
relfilenode instead of just the table oid in the hash.
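
Something along these lines is what I'm thinking of. This is only a rough 
sketch: the names (SyncScanLock, ss_report_location, the 16-page interval) 
are made up, and the hash table would of course have to be allocated in 
shared memory at startup:

typedef struct ss_scan_location
{
	RelFileNode	relfilenode;	/* hash key: physical relation */
	BlockNumber	location;	/* last block number reported */
} ss_scan_location;

static HTAB *scan_locations;	/* keyed by relfilenode */

#define SYNC_SCAN_REPORT_INTERVAL	16

/*
 * Called only by the backend that actually did the physical read,
 * assuming ReadBuffer is changed to report that back to the caller.
 */
static void
ss_report_location(RelFileNode rnode, BlockNumber location)
{
	ss_scan_location *entry;
	bool		found;

	/* Skip most calls to keep the lock traffic negligible */
	if (location % SYNC_SCAN_REPORT_INTERVAL != 0)
		return;

	LWLockAcquire(SyncScanLock, LW_EXCLUSIVE);
	entry = (ss_scan_location *) hash_search(scan_locations, &rnode,
											 HASH_ENTER, &found);
	entry->location = location;
	LWLockRelease(SyncScanLock);
}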

2. Under what circumstances does the patch help and when does it hurt? I 
think the patch is safe in that it should never be any worse than what 
we have now. But when does it help? That needs to be looked at together 
with the other patch.

I need to dig through the archives for the performance test results you 
posted earlier and try to understand them.

There are six distinct scenarios I've come up with so far that need to 
be looked at:
A. A seq scan on a small table
B. A seq scan on a table that's 110% the size of shared_buffers, but 
smaller than RAM
C. A seq scan on a table that's 110% the size of RAM
D. A seq scan on a huge table
E. Two simultaneous seq scans on a large table starting at the same time
F. Two simultaneous seq scans on a large table, 2nd one starting when 
the 1st one is halfway through

Also, does it change things if you have a bulk update instead of a 
read-only query? How about bitmap heap scans and large index scans? And 
vacuums? The above scenarios also need to be considered both alone and 
in the presence of other OLTP-style workloads.

I realize that we can't have everything, and as long as we get some 
significant benefit in some scenarios, and don't hurt others, the patch 
is worthwhile. But let's try to cover as much as we reasonably can.

One random idea I had to cover B & C without having the offset variable: 
Start scanning *backwards* from the page that's in the shared hash 
table, until you hit a page that's not in buffer cache. Then you 
continue scanning forwards from the page you started from.
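
Roughly like this, per scanning backend (only a sketch of the control flow; 
ss_get_location(), page_is_cached() and process_page() are imaginary 
helpers here, and page_is_cached() would have to probe the buffer mapping 
table without doing any I/O):

	BlockNumber	hint = ss_get_location(rnode);	/* from the shared hash */
	BlockNumber	blk = hint;

	/* Walk backwards from the hinted page while pages are still cached */
	while (blk > 0 && page_is_cached(rnode, blk - 1))
	{
		blk--;
		process_page(blk);
	}

	/* Then do the normal forward scan, starting from the hinted page */
	for (blk = hint; blk < nblocks; blk++)
		process_page(blk);

	/*
	 * Pages below the point where the backwards walk stopped would still
	 * have to be picked up somehow, e.g. by wrapping around at the end.
	 */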

This needs more thought but I think we can come up with a pretty simple 
solution that covers the most common cases.

3. By having different backends doing the reads, are we destroying OS 
readahead as Tom suggested? I remember you performed some tests on that, 
and it was a problem on some systems but not on others. This needs some 
thought; there may be some simple way to address it.
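
One thing that might be worth trying, if readahead does turn out to be the 
problem: explicitly advise the kernel that the file is going to be read 
sequentially, so that its readahead window isn't shrunk just because a 
single backend's reads look gappy when the other backends pick up the 
pages in between. posix_fadvise is standard, but how much it actually 
helps is platform-specific, which would fit with the mixed test results:

#include <fcntl.h>

static void
advise_sequential(int fd)
{
#ifdef POSIX_FADV_SEQUENTIAL
	(void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
#endif
}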

4. It fails regression tests. You get an assertion failure on the portal 
test. I believe that changing the direction of a scan isn't handled 
properly; it's probably pretty easy to fix.
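
Perhaps all that's needed is a guard like this in heapgettup(), so that a 
backwards fetch through a portal never updates the shared location (field 
and function names assumed, I haven't checked the patch's details):

	if (ScanDirectionIsForward(dir))
		ss_report_location(scan->rs_rd->rd_node, page);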

Jeff, could you please fix 1 and 4? I'll give 2 and 3 some more thought, 
and take a closer look at the scan-resistant scans patch. Any comments 
and ideas are welcome, of course.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

