Thread: seq scan cache vs. index cache smackdown

seq scan cache vs. index cache smackdown

From: Mark Aufflick
Date: Tuesday, February 15, 2005
Hi All,

I have boiled my situation down to the following simple case: (postgres
version 7.3)

* Query 1 is doing a sequential scan over a table (courtesy of field
ILIKE 'foo%') and index joins to a few others
* Query 2 is doing a functional index scan over the same table
(lower(field) LIKE 'foo%') and index joins to a few others
* neither query has an order by clause
* for the purpose of testing, both queries are designed to return the
same result set
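
Schematically, the two shapes look like this (table and column names are
simplified here; the real queries also join to a few other tables):

    -- Q1: ILIKE 'foo%' can't use a btree index, so this seq scans
    SELECT * FROM t WHERE field ILIKE 'foo%';

    -- Q2: can use the functional index
    CREATE INDEX t_lower_field_idx ON t (lower(field));
    SELECT * FROM t WHERE lower(field) LIKE 'foo%';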

Obviously Q2 is faster than Q1, but if I ever run them both at the same
time (let's say I run two of Q1 and one of Q2 at the same time) then Q2
consistently returns WORSE times than Q1 (explain analyze confirms that
it is using the index).

My assumption is that the sequential scan is blowing the index from any
cache it might live in, and simultaneously stealing all the disk IO
that is needed to access the index on disk (the table has 200,000
rows).

If I simplify the case to not do the index joins (i.e. operate on the
one table only) the situation is not as dramatic, but similar.

My thoughts are:

1) kill the sequential scan - but unfortunately I don't have direct
control over that code
2) change the way the server allocates/prioritizes different caches - I
don't know enough about how postgres caches work to do this (if it's
possible)
3) try it on postgres 7.4 - possible, but migrating the system to 7.4
in production will be hard because the above code that I am not
responsible for has a lot of (slightly wacky) implicit date casts
4) ask the fine people on the mailing list for other suggestions!
--
Mark Aufflick
   e  mark@pumptheory.com
   w  www.pumptheory.com (work)
   w  mark.aufflick.com (personal)
   p  +61 438 700 647
   f  +61 2 9436 4737




Re: seq scan cache vs. index cache smackdown

From: "Iain"
Hi,

I think there was some discussion about seq scans messing up the cache,
and talk about doing something about it, but I don't think it has been
addressed yet. Maybe worth a trawl through the archives.

It is certainly true that in many situations, a seq scan is preferable to
using an index. I have been testing a situation here on two versions of the
same database, one of the databases is much bigger than the other
(artificially bloated for testing purposes). Some of the query plans change
to use seq scans on the big database, where they used indexes on the little
database - but either way, in *single user* testing the performance is fine.
My concern is that this kind of testing has very little relevance to the
real world of multiuser processing where contention for the cache becomes an
issue.  It may be that, at least in the current situation, postgres is
giving too much weight to seq scans based on single user, straight line
performance comparisons. If your assumption is correct, then addressing that
might help, though it's bound to have its compromises too...

regards
Iain






Re: seq scan cache vs. index cache smackdown

From: Rod Taylor
> My concern is that this kind of testing has very little relevance to the
> real world of multiuser processing where contention for the cache becomes an
> issue.  It may be that, at least in the current situation, postgres is
> giving too much weight to seq scans based on single user, straight line
> performance comparisons.

To be fair, a large index scan can easily throw the buffers out of whack
as well. An index scan on 0.1% of a table with 1 billion tuples will
have a similar impact on the buffers as a sequential scan of a table with 1
million tuples.

Any solution fixing buffers should probably not take into consideration
the method being performed (do you really want to skip caching a
sequential scan of a 2-tuple table because it didn't use an index?) but
the volume of data involved as compared to the size of the cache.
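
Illustratively, the kind of test I mean -- based on volume rather than
access method (the one-quarter threshold here is made up):

    #include <stdbool.h>
    #include <stddef.h>

    /* cache the pages of a scan only if it won't swamp the cache */
    bool worth_caching(size_t scan_bytes, size_t cache_bytes)
    {
        return scan_bytes < cache_bytes / 4;
    }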

I've often wondered if a single 1GB toasted tuple could wipe out the
buffers. I would suppose that toast doesn't bypass them.
--
Rod Taylor <pg@rbt.ca>


Re: seq scan cache vs. index cache smackdown

From: Christopher Browne
The world rejoiced as mark@pumptheory.com (Mark Aufflick) wrote:
> Hi All,
>
> I have boiled my situation down to the following simple case:
> (postgres version 7.3)
>
> * Query 1 is doing a sequential scan over a table (courtesy of field
> ILIKE 'foo%') and index joins to a few others
> * Query 2 is doing a functional index scan over the same table
> (lower(field) LIKE 'foo%') and index joins to a few others
> * neither query has an order by clause
> * for the purpose of testing, both queries are designed to return the
> same result set
>
> Obviously Q2 is faster than Q1, but if I ever run them both at the
> same time (let's say I run two of Q1 and one of Q2 at the same time)
> then Q2 consistently returns WORSE times than Q1 (explain analyze
> confirms that it is using the index).
>
> My assumption is that the sequential scan is blowing the index from
> any cache it might live in, and simultaneously stealing all the disk
> IO that is needed to access the index on disk (the table has 200,000
> rows).

There's something to be said for that...

> If I simplify the case to not do the index joins (i.e. operate on the
> one table only) the situation is not as dramatic, but similar.
>
> My thoughts are:
>
> 1) kill the sequential scan - but unfortunately I don't have direct
> control over that code

This is a good choice, if plausible...

> 2) change the way the server allocates/prioritizes different caches -
> I don't know enough about how postgres caches work to do this (if it's
> possible)

That's what the 8.0 cache changes did...  Patent claim issues are
leading to some changes to the prioritization, which is liable to change
in 8.0.something and 8.1.

> 3) try it on postgres 7.4 - possible, but migrating the system to 7.4
> in production will be hard because the above code that I am not
> responsible for has a lot of (slightly wacky) implicit date casts

Moving to 7.4 wouldn't materially change the situation; you'd have to
go all the way to version 8.
--
(format nil "~S@~S" "cbbrowne" "gmail.com")
http://linuxdatabases.info/~cbbrowne/postgresql.html
Rules of the Evil Overlord #32. "I will not fly into a rage and kill a
messenger who brings me bad news  just to illustrate how evil I really
am. Good messengers are hard to come by."
<http://www.eviloverlord.com/>

Re: seq scan cache vs. index cache smackdown

From: "Iain"
Hi Rod,

> Any solution fixing buffers should probably not take into consideration
> the method being performed (do you really want to skip caching a
> sequential scan of a 2 tuple table because it didn't use an index) but
> the volume of data involved as compared to the size of the cache.

Yes, in fact indexes aren't so different to tables really in that regard.

It sounds like version 8 may help out anyway.

regards
Iain

Re: seq scan cache vs. index cache smackdown

From: Greg Stark
Mark Aufflick <mark@pumptheory.com> writes:

> Obviously Q2 is faster than Q1,

That's not really obvious at all. If there are lots of records being
returned, the index might not be faster than a sequential scan.

> My assumption is that the sequential scan is blowing the index from any cache
> it might live in, and simultaneously stealing all the disk IO that is needed to
> access the index on disk (the table has 200,000 rows).

It kind of sounds to me like you've lowered random_page_cost to reflect
the fact that your indexes are nearly always completely cached. But when
they're not, this unrealistically low random_page_cost causes indexes to
be used when they're no longer faster.

Perhaps you should post an "EXPLAIN ANALYZE" of your Q1 and Q2 (the
latter preferably with and without enable_indexscan, though since it's a
join you may not be able to get a pair of plans that differ in only that
one index scan).
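
Something along these lines (schematic, reusing the simplified shape
from the original mail):

    SET enable_indexscan = off;
    EXPLAIN ANALYZE SELECT * FROM t WHERE lower(field) LIKE 'foo%';
    SET enable_indexscan = on;
    EXPLAIN ANALYZE SELECT * FROM t WHERE lower(field) LIKE 'foo%';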

> 2) change the way the server allocates/prioritizes different caches - I don't
> know enough about how postgres caches work to do this (if it's possible)

Postgres keeps one set of shared buffers, not separate pools. Normally
you only allocate a small amount of your memory for Postgres and let the
OS handle disk caching.

What is your shared_buffers set to and how much memory do you have?

> 3) try it on postgres 7.4 - possible, but migrating the system to 7.4 in
> production will be hard because the above code that I am not responsible for
> has a lot of (slightly wacky) implicit date casts

I can't think of any 7.4 changes that would affect this directly, but
there were certainly plenty of changes that had broad effects. You never
know.

8.0, on the other hand, has a new algorithm that specifically tries to protect
against the shared buffers being blown out by a sequential scan. But that will
only help if it's the shared buffers being thrashed that's hurting you, not
the entire OS file system cache.

--
greg

Re: seq scan cache vs. index cache smackdown

From: Tom Lane
Greg Stark <gsstark@mit.edu> writes:
> 8.0, on the other hand, has a new algorithm that specifically tries to
> protect against the shared buffers being blown out by a sequential
> scan. But that will only help if it's the shared buffers being
> thrashed that's hurting you, not the entire OS file system cache.

Something we ought to think about sometime: what are the performance
implications of the real-world situation that we have another level of
caching sitting underneath us?  AFAIK all the theoretical studies we've
looked at consider only a single level of caching.  But for example,
if our buffer management algorithm recognizes an index page as being
heavily hit and therefore keeps it in cache for a long time, then when
it does fall out of cache you can be sure it's going to need to be read
from disk when it's next used, because the OS-level buffer cache has not
seen a call for that page in a long time.  Contrariwise a page that we
think is only on the fringe of usefulness is going to stay in the OS
cache because we repeatedly drop it and then have to ask for it again.

I have no idea how to model this situation, but it seems like it needs
some careful thought.

            regards, tom lane

Re: seq scan cache vs. index cache smackdown

From: Greg Stark
Tom Lane <tgl@sss.pgh.pa.us> writes:

> Greg Stark <gsstark@mit.edu> writes:
> > 8.0, on the other hand, has a new algorithm that specifically tries to
> > protect against the shared buffers being blown out by a sequential
> > scan. But that will only help if it's the shared buffers being
> > thrashed that's hurting you, not the entire OS file system cache.
>
> Something we ought to think about sometime: what are the performance
> implications of the real-world situation that we have another level of
> caching sitting underneath us?

It seems inevitable that Postgres will eventually eliminate that redundant
layer of buffering. Since mmap is not workable, that means using O_DIRECT to
read table and index data.
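
The mechanics, for the curious (a Linux-flavored sketch, not a proposal;
the alignment rules are the painful part):

    #define _GNU_SOURCE         /* for O_DIRECT on Linux */
    #include <fcntl.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        void *buf;
        ssize_t n;
        int fd = open("datafile", O_RDONLY | O_DIRECT);

        /* O_DIRECT requires block-aligned buffers, offsets and lengths */
        if (posix_memalign(&buf, 4096, 8192) != 0)
            return 1;
        n = pread(fd, buf, 8192, 0);   /* bypasses the OS page cache */

        close(fd);
        free(buf);
        return n < 0;
    }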

Every other database eventually goes this direction, and for good reason.
Having two layers of caching and buffering is inherently inefficient. It also
makes it impossible for Postgres to offer any application-specific hints to
the caching replacement algorithms.

In that world you would configure Postgres much like you configure Oracle,
with shared_buffers taking up as much of your memory as you can afford. And
the OS file system cache is kept entirely out of the loop.

> AFAIK all the theoretical studies we've looked at consider only a single
> level of caching. But for example, if our buffer management algorithm
> recognizes an index page as being heavily hit and therefore keeps it in
> cache for a long time, then when it does fall out of cache you can be sure
> it's going to need to be read from disk when it's next used, because the
> OS-level buffer cache has not seen a call for that page in a long time.
> Contrariwise a page that we think is only on the fringe of usefulness is
> going to stay in the OS cache because we repeatedly drop it and then have to
> ask for it again.

Hum. Is it clear that that's bad? By the same logic it's the ones on the
fringe that you're likely to have to read again anyway. The ones that
are being heavily used are likely not to need reading again at all.

--
greg

Re: seq scan cache vs. index cache smackdown

From: "Merlin Moncure"
> It seems inevitable that Postgres will eventually eliminate that
> redundant layer of buffering. Since mmap is not workable, that means
> using O_DIRECT to read table and index data.

What about going the other way and simply letting the o/s do all the
caching?  How bad (or good) would the performance really be?

Merlin

Re: seq scan cache vs. index cache smackdown

From: Josh Berkus
Tom, Greg, Merlin,

>  But for example,
> if our buffer management algorithm recognizes an index page as being
> heavily hit and therefore keeps it in cache for a long time, then when
> it does fall out of cache you can be sure it's going to need to be read
> from disk when it's next used, because the OS-level buffer cache has not
> seen a call for that page in a long time.  Contrariwise a page that we
> think is only on the fringe of usefulness is going to stay in the OS
> cache because we repeatedly drop it and then have to ask for it again.

Now you can see why other DBMSs don't use the OS disk cache.  There are
other issues as well; for example, as long as we use the OS disk cache,
we can't eliminate checkpoint spikes, at least on Linux.  No matter what
we do with the bgwriter, fsyncing the OS disk cache causes heavy system
activity.

> It seems inevitable that Postgres will eventually eliminate that redundant
> layer of buffering. Since mmap is not workable, that means using O_DIRECT
> to read table and index data.

Why is mmap not workable?    It would require far-reaching changes to our code
-- certainly -- but I don't think it can be eliminated from consideration.

> What about going the other way and simply letting the o/s do all the
> caching?  How bad (or good) would the performance really be?  

Pretty bad.   You can simulate this easily by turning your shared_buffers way
down ...

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: seq scan cache vs. index cache smackdown

From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> Why is mmap not workable?

We can't control write order.  There are other equally bad problems,
but that one alone eliminates it from consideration.  See past discussions.

            regards, tom lane

Re: seq scan cache vs. index cache smackdown

From: "Merlin Moncure"
Josh Berkus wrote:
> Now you can see why other DBMSs don't use the OS disk cache.  There
> are other issues as well; for example, as long as we use the OS disk
> cache, we can't eliminate checkpoint spikes, at least on Linux.  No
> matter what we do with the bgwriter, fsyncing the OS disk cache causes
> heavy system activity.

MS SQL server uses the O/S disk cache...the database is very tightly
integrated with the O/S.  Write performance is one of the few things SQL
server can do better than most other databases despite running on a
mid-grade kernel and a low-grade filesystem...what does that say?
ReadFileScatter() and ReadFileGather() were added to the win32 API
specifically for SQL server...this is somewhat analogous to transaction
based writing such as in Reiser4.  I'm not arguing ms sql server is
better in any way, IIRC they are still using table locks (!).

> > It seems inevitable that Postgres will eventually eliminate that
> > redundant layer of buffering. Since mmap is not workable, that means
> > using O_DIRECT to read table and index data.

IMO, the O_DIRECT argument makes assumptions about storage and o/s
technology that are moving targets.  Not sure about mmap().

Merlin

Re: seq scan cache vs. index cache smackdown

From
Greg Stark
Date:
Josh Berkus <josh@agliodbs.com> writes:

> Why is mmap not workable?    It would require far-reaching changes to our code
> -- certainly -- but I don't think it can be eliminated from consideration.

Fundamentally because there is no facility for being notified by the OS before
a page is written to disk. And there's no way to prevent a page from being
written to disk (mlock just prevents it from being flushed from memory, not
from being synced to disk).

So there's no way to guarantee the WAL will be written before the buffer is
synced to disk.
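
Schematically, the ordering rule at stake (names illustrative, not
actual PostgreSQL source):

    typedef unsigned long long Lsn;

    extern void xlog_flush(Lsn upto);           /* fsync WAL through 'upto' */
    extern void write_page(int fd, void *page); /* hand the page to the OS  */

    void flush_dirty_buffer(int fd, void *page, Lsn page_lsn)
    {
        /* WAL first: with write()-based shared buffers we control this
           ordering; under mmap the kernel could write the data page out
           at any time, before the WAL that describes it is synced. */
        xlog_flush(page_lsn);
        write_page(fd, page);
    }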



Maybe it could be done by writing and syncing the WAL independently
before the shared buffer is written to at all, but that would be a
completely different model. And it would mean locking the shared buffer
until the sync is done, and would require a private copy of the shared
buffer, necessitating more copies than the double buffering did in the
first place.

--
greg

Re: seq scan cache vs. index cache smackdown

From: "Magnus Hagander"
>Josh Berkus wrote:
>> Now you can see why other DBMSs don't use the OS disk cache.  There
>> are other issues as well; for example, as long as we use the OS disk
>> cache, we can't eliminate checkpoint spikes, at least on Linux.  No
>> matter what we do with the bgwriter, fsyncing the OS disk cache
>> causes heavy system activity.
>
>MS SQL server uses the O/S disk cache...

No, it doesn't. They open all files with FILE_FLAG_WRITE_THROUGH and
FILE_FLAG_NO_BUFFERING. It scales its size dynamically with the system,
but it uses its own buffer cache.

> the database is very tightly
>integrated with the O/S.

That it is.


>Write performance is one of the few things SQL
>server can do better than most other databases despite running on a
>mid-grade kernel and a low-grade filesystem...what does that say?
>ReadFileScatter() and ReadFileGather() were added to the win32 API
>specifically for SQL server...this is somewhat analogous to transaction
>based writing such as in Reiser4.

(Those are ReadFileScatter and WriteFileGather)

I don't think that's correct either. Scatter/Gather I/O is used so SQL
Server can issue reads for several blocks from disks into its own
buffer cache with a single syscall even if these buffers are not
sequential. It did make significant performance improvements when they
added it, though.

(For those not knowing - it's ReadFile/WriteFile where you pass an array
of "this many bytes to this address" as parameters)


> I'm not arguing ms sql server is
>better in any way, IIRC they are still using table locks (!).

Not at all. They use row level locks, escalated to page level, then
escalated to table level. Has been since 7.0. In <= 6.5 they had page
level and table level locks. I think possibly back in 4.2 (this is
16-bit days on OS/2) they had only table level locks, but that's a long
time ago.
They don't do MVCC, though.

(I'm not saying it's better either. At some things it is, at many it is
not)

//Magnus

Re: seq scan cache vs. index cache smackdown

From
Christopher Browne
Date:
In the last exciting episode, merlin.moncure@rcsonline.com ("Merlin Moncure") wrote:
>> It seems inevitable that Postgres will eventually eliminate that
>> redundant layer of buffering. Since mmap is not workable, that
>> means using O_DIRECT to read table and index data.
>
> What about going the other way and simply letting the o/s do all the
> caching?  How bad (or good) would the performance really be?

I'm going to see about taking this story to OLS (Ottawa Linux
Symposium) in July and will see what hearing I can get.  There are
historically some commonalities in the way this situation is regarded,
in that there was _long_ opposition to the notion of having unbuffered
disk devices.

If there's more "story" that definitely needs to be taken, let me
know...
--
output = reverse("moc.enworbbc" "@" "enworbbc")
http://linuxdatabases.info/info/slony.html
Rules of  the Evil Overlord  #90. "I will  not design my  Main Control
Room  so  that  every  workstation  is facing  away  from  the  door."
<http://www.eviloverlord.com/>

Re: seq scan cache vs. index cache smackdown

From: PFC
    In the 'wishful hand waving' department :

    read index -> determine (tuple id,page) to hit in table -> for each of
these, tell the OS 'I'm gonna need these' via a NON BLOCKING call. Non
blocking because you feed the information to the OS as you read the index,
streaming it.

    Meanwhile, the OS accumulates the requests in an internal FIFO,
reorganizes them according to the order best suited to good disk head
movements, then reads them in clusters, and calls a callback inside the
application when it has data available. Or the application polls it once
in a while to get a bucketload of pages. The 'I'm gonna need these()'
syscall would also sometimes return  'hey, I'm full, read the pages I have
here waiting for you before asking for new ones'.

    A flag would tell the OS if the application wanted the results in any
order, or with order preserved.
    Without order preservation, if the application has requested twice the
same page with different tuple id's, the OS would call the callback only
once, giving it a list of the tuple id's associated with that page.

    It involves a tradeoff between memory and performance : as the size of
the FIFO increases, likelihood of good contiguous disk reading increases.
However, the memory structure would only contain page numbers and tuple
id's, so it could be pretty small.

    Returning the results in-order would also need more memory.

    It could be made very generic if instead of 'tuple id' you read 'opaque
application data', and instead of 'page' you read '(offset, length)'.

    This structure actually exists already in the Linux Kernel, it's called
the Elevator or something, but it works for scheduling reads between
threads.

    You can also read 'internal not yet developed postgres cache manager'
instead of OS if you don't feel like talking kernel developers into
implementing this thing.
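
    The nearest existing analogue is posix_fadvise() (my sketch below,
not a full implementation of the above) : it covers the non blocking
'I'm gonna need these' part, but gives you no callback and no ordering
control.

    #include <fcntl.h>

    #define BLCKSZ 8192

    /* call once per heap page found while reading the index; the kernel
       may start reading it in the background and reorder as it likes */
    void i_am_gonna_need(int heap_fd, long page_no)
    {
        (void) posix_fadvise(heap_fd, (off_t) page_no * BLCKSZ,
                             BLCKSZ, POSIX_FADV_WILLNEED);
    }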



Re: seq scan cache vs. index cache smackdown

From: Greg Stark
PFC <lists@boutiquenumerique.com> writes:

>     You can also read 'internal not yet developed postgres cache manager'
> instead of OS if you don't feel like talking kernel developers into
> implementing this thing.

It exists already, it's called aio.

But there are a *lot* of details you skipped over.
And as always, the devil is in the details.
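
For reference, the basic shape of the POSIX flavor (schematic; file name
made up, error handling elided, link with -lrt on Linux):

    #include <aio.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <string.h>

    int main(void)
    {
        static char buf[8192];
        struct aiocb cb;
        int fd = open("datafile", O_RDONLY);

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf    = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 42 * 8192;     /* some page we know we'll want */

        aio_read(&cb);                 /* queue it; returns immediately */
        /* ... keep scanning the index meanwhile ... */
        while (aio_error(&cb) == EINPROGRESS)
            ;                          /* or block in aio_suspend() */
        return aio_return(&cb) > 0 ? 0 : 1;
    }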

--
greg

Re: seq scan cache vs. index cache smackdown

From: Ron Mayer
Josh Berkus wrote:
>
> Now you can see why other DBMSs don't use the OS disk cache.  ...
> ...as long as we use the OS disk cache, we can't
> eliminate checkpoint spikes, at least on Linux.

Wouldn't the VM settings like the ones under /proc/sys/vm
and/or the commit=XXX mount option if using ext3 be a good
place to control this?

It seems if you wanted, by setting /proc/sys/vm/dirty_background_ratio
and /proc/sys/vm/dirty_expire_centisecs very low you'd be constantly
flushing dirty pages.


Has anyone experimented with these kinds of values:
/proc/sys/vm/dirty_ratio
    /* the generator of dirty data writes back at this ratio */
/proc/sys/vm/dirty_background_ratio
   /* start background writeback */
/proc/sys/vm/dirty_writeback_centisecs
   /* the interval between [some style of] writebacks */
/proc/sys/vm/dirty_expire_centisecs
   /* the number of centiseconds that data is allowed to remain dirty */


I tried these to work around the opposite kind of problem.... on a
laptop running Linux under VMware I wanted to avoid having it do writes
quickly (to make each individual transaction go faster) at the expense
of a big spike in IO that the sales guy would trigger explicitly before
talking for a while.  Setting each of those very high and using a
commit=600 mount option made the whole demo run with very little
IO except for the explicit sync; but I never took the time
to understand which setting mattered to me or why.
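
For the record, the sort of settings I mean (values illustrative, not
the exact ones I used):

    # /etc/fstab: let ext3 delay journal commits up to ten minutes
    /dev/sda2  /  ext3  defaults,commit=600  0  1

    # push dirty-page writeback way out
    echo 80    > /proc/sys/vm/dirty_ratio
    echo 60    > /proc/sys/vm/dirty_background_ratio
    echo 60000 > /proc/sys/vm/dirty_expire_centisecs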



Re: seq scan cache vs. index cache smackdown

From: Bruce Momjian
Magnus Hagander wrote:
> I don't think that's correct either. Scatter/Gather I/O is used so SQL
> Server can issue reads for several blocks from disks into its own
> buffer cache with a single syscall even if these buffers are not
> sequential. It did make significant performance improvements when they
> added it, though.
>
> (For those not knowing - it's ReadFile/WriteFile where you pass an array
> of "this many bytes to this address" as parameters)

Isn't that like the BSD writev()/readv() that Linux supports also?  Is
that something we should be using on Unix if it is supported by the OS?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: seq scan cache vs. index cache smackdown

From: "Magnus Hagander"
>> I don't think that's correct either. Scatter/Gather I/O is used so
>> SQL Server can issue reads for several blocks from disks into its own
>> buffer cache with a single syscall even if these buffers are not
>> sequential. It did make significant performance improvements when
>> they added it, though.
>>
>> (For those not knowing - it's ReadFile/WriteFile where you pass an
>> array of "this many bytes to this address" as parameters)
>
>Isn't that like the BSD writev()/readv() that Linux supports also?  Is
>that something we should be using on Unix if it is supported by the OS?

Yes, they certainly seem very similar. The win32 functions are
explicitly designed for async I/O (they were after all created
specifically for SQL Server), so they put harder requirements on the
parameters. Specifically, they write complete system pages only, and
each pointer has to point to only one page.
In a file opened without buffering they will also write all the buffers
out and then wait for a single I/O completion from the device instead of
one for each. Not sure what the writev/readv ones do (not clear from my
linux man page).


Now whether this is something we could make use of - I'll leave that up
to those who know the buffer manager a lot better than I do.

//Magnus

Re: seq scan cache vs. index cache smackdown

From: "Merlin Moncure"
> Magnus Hagander wrote:
> > I don't think that's correct either. Scatter/Gather I/O is used so
> > SQL Server can issue reads for several blocks from disks into its
> > own buffer cache with a single syscall even if these buffers are not
> > sequential. It did make significant performance improvements when
> > they added it, though.
> >
> > (For those not knowing - it's ReadFile/WriteFile where you pass an
> > array of "this many bytes to this address" as parameters)
>
> Isn't that like the BSD writev()/readv() that Linux supports also?  Is
> that something we should be using on Unix if it is supported by the OS?

readv and writev are in the single unix spec...and yes they are
basically just like the win32 versions except that they are synchronous
(and therefore better, IMO).

On some systems they might just be implemented as a loop inside the
library, or even as a macro.

http://www.opengroup.org/onlinepubs/007908799/xsh/sysuio.h.html
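
For example, one readv() call filling two separate buffers from a
contiguous stretch of a file (schematic; error handling elided):

    #include <sys/uio.h>
    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        char header[512], body[8192];
        struct iovec iov[2];
        int fd = open("datafile", O_RDONLY);
        ssize_t n;

        iov[0].iov_base = header;  iov[0].iov_len = sizeof(header);
        iov[1].iov_base = body;    iov[1].iov_len = sizeof(body);

        /* one syscall; the data is scattered across both buffers in order */
        n = readv(fd, iov, 2);

        close(fd);
        return n < 0;
    }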

On operating systems that optimize vectored read operations, it's pretty
reasonable to assume good or even great performance gains, in addition
to (or instead of) the recent changes to xlog.c to group writes together
for a file...it just takes things one step further.

Is there a reason why readv/writev have not been considered in the past?

Merlin

Re: seq scan cache vs. index cache smackdown

From: Tom Lane
"Merlin Moncure" <merlin.moncure@rcsonline.com> writes:
> Is there a reason why readv/writev have not been considered in the past?

Lack of portability, and lack of obvious usefulness that would justify
dealing with the lack of portability.

I don't think there's any value in trying to write ordinary buffers this
way; making the buffer manager able to write multiple buffers at once
sounds like a great deal of complexity and deadlock risk in return for
not much.  It might be an alternative to the existing proposed patch for
writing multiple WAL buffers at once, but frankly I consider that patch
a waste of effort.  In real scenarios you seldom get to write more than
one WAL page without a forced sync occurring because someone committed.
Even if *your* transaction is long, any other backend committing a small
transaction still fsyncs.  On top of that, the bgwriter will be flushing
WAL in order to maintain the write-ahead rule any time it dumps a dirty
buffer.  I have a personal to-do item to make the bgwriter explicitly
responsible for writing completed WAL pages as part of its duties, but
I haven't done anything about it because I think that it will write lots
of such pages without any explicit code, thanks to the bufmgr's LSN
interlock.  Even if it doesn't get it done that way, the simplest answer
is to add a little bit of code to make sure bgwriter generally does the
writes, and then we don't care.

If you want to experiment with writev, feel free, but I'll want to see
demonstrable performance benefits before any such code actually goes in.

            regards, tom lane

Re: seq scan cache vs. index cache smackdown

From: Ron Mayer
Merlin Moncure wrote:
>
> readv and writev are in the single unix spec...and yes ...
>
> On some systems they might just be implemented as a loop inside the
> library, or even as a macro.

You sure?

Requirements like this:
  http://www.opengroup.org/onlinepubs/007908799/xsh/write.html
  "Write requests of {PIPE_BUF} bytes or less will not be
   interleaved with data from other processes doing writes
   on the same pipe."
make me think that it couldn't be just a macro; and if it
were a loop in the library it seems it'd still have to
make sure it's done with a single write system call.

(yeah, I know that requirement is just for pipes; and I
suppose they could write a loop for normal files and a
different special case for pipes; but I'd be surprised).

Re: seq scan cache vs. index cache smackdown

From: PFC
>> (For those not knowing - it's ReadFile/WriteFile where you pass an array
>> of "this many bytes to this address" as parameters)
>
> Isn't that like the BSD writev()/readv() that Linux supports also?  Is
> that something we should be using on Unix if it is supported by the OS?

    Nope, readv()/writev() read/write a contiguous region of the file
to/from a list of buffers in memory. The Windows calls do the same from
an explicit file offset (given in the OVERLAPPED structure), but they
require page-sized, page-aligned buffers and complete asynchronously.