From:
Eliot Gable
Date:

Just curious if this would apply to PostgreSQL: http://queue.acm.org/detail.cfm?id=1814327

Now that I've read it, it seems like a no-brainer. So, how does PostgreSQL deal with the different latencies involved in accessing data on disk for searches / sorts vs. accessing data in memory? Is memory allocated in a way similar to the one described in the article, so that disk access is reduced to a minimum?


From:
Heikki Linnakangas
Date:

On 15/06/10 06:21, Eliot Gable wrote:
> Just curious if this would apply to PostgreSQL:
> http://queue.acm.org/detail.cfm?id=1814327
>
> Now that I've read it, it seems
> like a no-brainer. So, how does PostgreSQL deal with the different latencies
> involved in accessing data on disk for searches / sorts vs. accessing data
> in memory? Is it allocated in a similar way as described in the article such
> that disk access is reduced to a minimum?

I don't think we have any binary heap structures that are large enough
for this to matter. We use a binary heap when merging tapes in the
tuplesort code, for example, but that's tiny.
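(For context, the tape-merge heap mentioned above can be sketched in a few lines of Python; this is a toy illustration of the technique, not PostgreSQL's actual tuplesort code. Each sorted run, or "tape", contributes only its current head element to the heap, which is why the heap stays tiny no matter how large the runs are.)

```python
import heapq

def merge_tapes(tapes):
    """Merge pre-sorted runs ("tapes") with a binary heap, the way an
    external merge sort combines its runs. Toy sketch only."""
    # Seed the heap with the first element of each non-empty tape;
    # remember which tape each element came from so it can be refilled.
    heap = [(tape[0], i, 0) for i, tape in enumerate(tapes) if tape]
    heapq.heapify(heap)
    out = []
    while heap:
        value, tape_no, pos = heapq.heappop(heap)
        out.append(value)
        if pos + 1 < len(tapes[tape_no]):
            heapq.heappush(heap, (tapes[tape_no][pos + 1], tape_no, pos + 1))
    return out
```

The heap holds at most one entry per tape, so merging N tapes never needs more than N heap slots.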

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

From:
Greg Smith
Date:

Eliot Gable wrote:
> Just curious if this would apply to
> PostgreSQL: http://queue.acm.org/detail.cfm?id=1814327

It's hard to take this seriously at all when it's so ignorant of actual
research in this area.  Take a look at
http://www.cc.gatech.edu/~bader/COURSES/UNM/ece637-Fall2003/papers/BFJ01.pdf
for a second, specifically page 9.  See the "van Emde Boas" layout?
That's basically the same as what this article is calling a B-heap, and
the idea goes back to at least 1977.  As you can see from that paper,
the idea of using it to optimize for multi-level caches goes back to at
least 2001.  Based on the performance numbers, it seems a particularly
helpful optimization for the type of in-memory caching that his Varnish
tool is good at, so kudos for reinventing the right wheel.  But that's
an environment with one level of cache:  you're delivering something
from memory, or not.  You can't extrapolate from what works for that
very far.
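(For reference, the van Emde Boas layout on page 9 of that paper can be sketched briefly: split a complete binary tree at half its height, lay out the top half first, then each bottom subtree contiguously, recursing into each piece. A toy Python sketch follows, with nodes identified by their level-order index, root = 0.)

```python
def veb_order(height):
    """Return the memory order of the nodes of a complete binary tree
    with `height` levels under the van Emde Boas recursive layout.
    Nodes are named by level-order (BFS) index: children of n are
    2n+1 and 2n+2."""
    def layout(root, h):
        if h == 1:
            return [root]
        top_h = h // 2          # levels in the top recursive subtree
        bot_h = h - top_h       # levels in each bottom subtree
        out = layout(root, top_h)
        # Find the leaves of the top subtree (relative depth top_h - 1);
        # their children are the roots of the bottom subtrees.
        frontier = [root]
        for _ in range(top_h - 1):
            frontier = [c for n in frontier for c in (2 * n + 1, 2 * n + 2)]
        for leaf in frontier:
            for child in (2 * leaf + 1, 2 * leaf + 2):
                out += layout(child, bot_h)
        return out
    return layout(0, height)
```

The point of the layout is that any root-to-leaf path touches O(log_B n) blocks for every block size B simultaneously, which is what makes it cache-oblivious.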

> So, how does PostgreSQL deal with the different latencies involved in
> accessing data on disk for searches / sorts vs. accessing data in
> memory? Is it allocated in a similar way as described in the article
> such that disk access is reduced to a minimum?

PostgreSQL is modeling a much more complicated situation where there are
many levels of caches, from CPU to disk.  When executing a query, the
database tries to manage that by estimating the relative costs for CPU
operations, row operations, sequential disk reads, and random disk
reads.  Those fundamental operations are then added up to build more
complicated machinery like sorting.  To minimize query execution cost,
various query plans are considered, the cost computed for each one, and
the cheapest one gets executed.  This has to take into account a wide
variety of subtle tradeoffs related to whether memory should be used for
things that would otherwise happen on disk.  There are three primary
ways to search for a row, three main ways to do a join, two for how to
sort, and they all need to have cost estimates made for them that
balance CPU time, memory, and disk access.
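(As a rough illustration of that cost arithmetic: a toy model below uses the default values of PostgreSQL's seq_page_cost, random_page_cost, and cpu_tuple_cost settings. This is not the actual planner code, and the index-scan model in particular is deliberately crude.)

```python
# Defaults matching PostgreSQL's cost-estimation GUCs.
SEQ_PAGE_COST = 1.0
RANDOM_PAGE_COST = 4.0
CPU_TUPLE_COST = 0.01

def seq_scan_cost(pages, tuples):
    # Read every page sequentially and examine every row.
    return pages * SEQ_PAGE_COST + tuples * CPU_TUPLE_COST

def index_scan_cost(pages, tuples, selectivity):
    # Crude model: touch only the matching fraction of the table,
    # but pay random-I/O prices for the pages.
    return (pages * selectivity * RANDOM_PAGE_COST
            + tuples * selectivity * CPU_TUPLE_COST)

def choose_plan(pages, tuples, selectivity):
    """Pick the cheaper of two candidate plans, planner-style."""
    seq = seq_scan_cost(pages, tuples)
    idx = index_scan_cost(pages, tuples, selectivity)
    return "index scan" if idx < seq else "seq scan"
```

For a 1000-page, 100,000-row table, a 0.1% selective query favors the index scan, while a 50% selective one favors the sequential scan; the crossover point moves with random_page_cost, which is exactly the balancing act described above.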

The problem Varnish is solving is most like how PostgreSQL decides what
disk pages to keep in memory, specifically the shared_buffers
structure.  Even there, the problem the database is trying to solve is
quite a bit more complicated than what an HTTP cache has to deal with.
For details about what the database does there, see "Inside the
PostgreSQL Buffer Cache" at http://projects.2ndquadrant.com/talks

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
"A. Kretschmer"
Date:

In response to Greg Smith:
> For details about what the database does there, see "Inside the
> PostgreSQL Buffer Cache" at http://projects.2ndquadrant.com/talks

Nice paper, btw., thanks for that!


Regards, Andreas
--
Andreas Kretschmer
Contact:  Heynitz: 035242/47150,   D1: 0160/7141639 (more: -> header)
GnuPG: 0x31720C99, 1006 CCB4 A326 1D42 6431  2EB0 389D 1DC2 3172 0C99

From:
Matthew Wakeling
Date:

On Mon, 14 Jun 2010, Eliot Gable wrote:
> Just curious if this would apply to PostgreSQL:
> http://queue.acm.org/detail.cfm?id=1814327

Absolutely, and I said as much in
http://archives.postgresql.org/pgsql-performance/2010-03/msg00272.php
but applied to the Postgres B-tree indexes instead of heaps. It's a pretty
obvious performance improvement really - the principle is that when you do
have to fetch a page from a slower medium, you may as well make it count
for a lot.

Lots of research has already been done on this - the paper linked above is
rather behind the times.

However, AFAIK, Postgres has not implemented this in any of its indexing
systems.

Matthew

--
 An ant doesn't have a lot of processing power available to it. I'm not trying
 to be speciesist - I wouldn't want to detract you from such a wonderful
 creature, but, well, there isn't a lot there, is there?
                                        -- Computer Science Lecturer

From:
Yeb Havinga
Date:

Greg Smith wrote:
> Eliot Gable wrote:
>> Just curious if this would apply to PostgreSQL:
>> http://queue.acm.org/detail.cfm?id=1814327
>
> It's hard to take this seriously at all when it's so ignorant of
> actual research in this area.  Take a look at
> http://www.cc.gatech.edu/~bader/COURSES/UNM/ece637-Fall2003/papers/BFJ01.pdf
> for a second
Interesting paper, thanks for the reference!
> PostgreSQL is modeling a much more complicated situation where there
> are many levels of caches, from CPU to disk.  When executing a query,
> the database tries to manage that by estimating the relative costs for
> CPU operations, row operations, sequential disk reads, and random disk
> reads.  Those fundamental operations are then added up to build more
> complicated machinery like sorting.  To minimize query execution cost,
> various query plans are considered, the cost computed for each one,
> and the cheapest one gets executed.  This has to take into account a
> wide variety of subtle tradeoffs related to whether memory should be
> used for things that would otherwise happen on disk.  There are three
> primary ways to search for a row, three main ways to do a join, two
> for how to sort, and they all need to have cost estimates made for
> them that balance CPU time, memory, and disk access.
Do you think the cache-oblivious algorithm described in the paper could
speed up index scans that hit the disk, in the Postgres (and OS/hardware)
multi-level memory case? (So that e.g. random_page_cost could go down?)

regards,
Yeb Havinga

From:
Robert Haas
Date:

On Tue, Jun 15, 2010 at 8:23 AM, Matthew Wakeling <> wrote:
> Absolutely, and I said in
> http://archives.postgresql.org/pgsql-performance/2010-03/msg00272.php
> but applied to the Postgres B-tree indexes instead of heaps.

This is an interesting idea.  I would guess that you could simulate
this to some degree by compiling PG with a larger block size.  Have
you tried this to see whether/how much/for what kind of workloads it
helps?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:
Matthew Wakeling
Date:

On Fri, 18 Jun 2010, Robert Haas wrote:
> On Tue, Jun 15, 2010 at 8:23 AM, Matthew Wakeling <> wrote:
>> Absolutely, and I said in
>> http://archives.postgresql.org/pgsql-performance/2010-03/msg00272.php
>> but applied to the Postgres B-tree indexes instead of heaps.
>
> This is an interesting idea.  I would guess that you could simulate
> this to some degree by compiling PG with a larger block size.  Have
> you tried this to see whether/how much/for what kind of workloads it
> helps?

To a degree, that is the case. However, if you follow the thread a bit
further back, you will find evidence that when the index is in memory,
increasing the page size actually decreases the performance, because it
uses more CPU.

To make it clear - 8kB is not an optimal page size for either fully cached
data or sparsely cached data. For disc access, large pages are
appropriate, on the order of 256kB. Below that size, the time taken to
fetch a page doesn't actually decrease much as the page shrinks, because
seek time dominates, so the aim is to get the maximum amount of work done
per fetch without slowing fetches down significantly.
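(The arithmetic behind that claim is easy to check. The numbers below are assumptions for illustration - 8 ms average seek plus rotational latency, 100 MB/s sequential transfer - and real discs vary.)

```python
SEEK_MS = 8.0            # assumed average seek + rotational latency
TRANSFER_MB_S = 100.0    # assumed sequential transfer rate

def fetch_ms(page_kb):
    """Time to fetch one page: one seek plus the transfer time."""
    return SEEK_MS + (page_kb / 1024.0) / TRANSFER_MB_S * 1000.0
```

With these assumptions, an 8kB page takes about 8.1 ms to fetch and a 256kB page about 10.5 ms: 32 times the data for roughly 30% more time per fetch, which is why large pages win for disc-bound access.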

Given such a large page size, it would then be appropriate to have a
better data structure inside the page. Currently, our indexes (at least
the GiST ones - I haven't looked at the Btree ones) use a simple linear
array in the index page. Using a proper tree inside the index page would
improve the CPU usage of the index lookups.
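(The CPU saving comes down to comparisons per page. A toy sketch: the "tree within a page" can be as simple as binary search over a sorted entry array, an implicit balanced tree. This assumes sorted in-page entries, which GiST entries are not, so it only illustrates the asymptotics.)

```python
def linear_scan(entries, key):
    """Find key by scanning the page's entry array.
    Returns (index, number of comparisons)."""
    for n, k in enumerate(entries, start=1):
        if k == key:
            return n - 1, n
    return None, len(entries)

def in_page_tree_search(entries, key):
    """Binary search: an implicit balanced tree over the same entries."""
    lo, hi, comparisons = 0, len(entries) - 1, 0
    while lo <= hi:
        mid = (lo + hi) // 2
        comparisons += 1
        if entries[mid] == key:
            return mid, comparisons
        elif entries[mid] < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return None, comparisons
```

On a page with a thousand entries, the linear scan averages hundreds of comparisons per lookup while the in-page tree needs about ten, and the gap widens further with the 256kB pages discussed above.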

One detail that would need to be sorted out is the cache eviction policy.
I don't know if it is best to evict whole 256kB pages, or to evict 8kB
pages. Probably the former, which would involve quite a bit of change to
the shared memory cache. I can see the cache efficiency decreasing as a
result of this, which is the only disadvantage I can see.

This sort of thing has been fairly well researched at an academic level,
but has not been implemented in that many real world situations. I would
encourage its use in Postgres.

Matthew

--
 Failure is not an option. It comes bundled with your Microsoft product.
                                                 -- Ferenc Mantfeld

From:
Greg Smith
Date:

Matthew Wakeling wrote:
> This sort of thing has been fairly well researched at an academic
> level, but has not been implemented in that many real world
> situations. I would encourage its use in Postgres.

I guess, but don't forget that work on PostgreSQL is driven by what
problems people are actually running into.  There's a long list of
performance improvements sitting in the TODO list waiting for people to
find time to work on them, ones that we're quite certain are useful.
It's unlikely that anyone will chase after one of these speculative
ideas from academic research instead of one of those.  Your
characterization of the potential speed-up here is "Using a proper tree
inside the index page would improve the CPU usage of the index lookups",
which seems quite reasonable.  Regardless, when I consider "is that
something I have any reason to suspect is a bottleneck on common
workloads?", I don't think of any, and return to working on one of the
things I already know is a bottleneck instead.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us


From:
Yeb Havinga
Date:

Greg Smith wrote:
> Matthew Wakeling wrote:
>> This sort of thing has been fairly well researched at an academic
>> level, but has not been implemented in that many real world
>> situations. I would encourage its use in Postgres.
>
> I guess, but don't forget that work on PostgreSQL is driven by what
> problems people are actually running into.  There's a long list of
> performance improvements sitting in the TODO list waiting for people
> to find time to work on them, ones that we're quite certain are
> useful.  That anyone is going to chase after any of these speculative
> ideas from academic research instead of one of those is unlikely.
> Your characterization of the potential speed up here is "Using a
> proper tree inside the index page would improve the CPU usage of the
> index lookups", which seems quite reasonable.  Regardless, when I
> consider "is that something I have any reason to suspect is a
> bottleneck on common workloads?", I don't think of any, and return to
> working on one of the things I already know is a bottleneck instead.
>
There are two different things concerning GiST indexes:

1) Larger block sizes, and hence more entries per GiST page, result in
more generic keys on those pages. This in turn results in a greater
number of hits when the index is queried, so a larger part of the index
is scanned. NB this has nothing to do with caching / cache sizes; it
holds for every IO model. Tests performed by me showed performance
improvements of over 200%. Since then implementing a speedup has been
on my 'want to do' list.

2) There are several approaches to getting the number of entries per
page down. Two have been suggested in the thread Matthew referred to:
virtual pages (but how would these be ordered?) and a tree within a
page. It would be interesting to see whether ideas from Prokop's
cache-oblivious algorithms match this problem, to find a suitable
virtual page format.

regards,
Yeb Havinga


From:
"Kevin Grittner"
Date:

Yeb Havinga <> wrote:

> concerning gist indexes:
>
> 1) with larger block sizes and hence, larger # entries per gist
> page, results in more generic keys of those pages. This in turn
> results in a greater number of hits, when the index is queried, so
> a larger part of the index is scanned. NB this has nothing to do
> with caching / cache sizes; it holds for every IO model. Tests
> performed by me showed performance improvements of over 200%.
> Since then implementing a speedup has been on my 'want to do
> list'.

As I recall, the better performance in your tests was with *smaller*
GiST pages, right?  (The above didn't seem entirely clear on that.)

-Kevin

From:
Tom Lane
Date:

Greg Smith <> writes:
> Your characterization of the potential speed up here is "Using a proper tree
> inside the index page would improve the CPU usage of the index lookups",
> which seems quite reasonable.  Regardless, when I consider "is that
> something I have any reason to suspect is a bottleneck on common
> workloads?", I don't think of any, and return to working on one of
> the things I already know is a bottleneck instead.

Note also that this doesn't do a thing for b-tree indexes, which already
have an intelligent within-page structure.  So that instantly makes it
not a mainstream issue.  Perhaps somebody will be motivated to work on
it, but most of us are chasing other things.

            regards, tom lane

From:
Yeb Havinga
Date:

Kevin Grittner wrote:
> Yeb Havinga <> wrote:
>
>
>> concerning gist indexes:
>>
>> 1) with larger block sizes and hence, larger # entries per gist
>> page, results in more generic keys of those pages. This in turn
>> results in a greater number of hits, when the index is queried, so
>> a larger part of the index is scanned. NB this has nothing to do
>> with caching / cache sizes; it holds for every IO model. Tests
>> performed by me showed performance improvements of over 200%.
>> Since then implementing a speedup has been on my 'want to do
>> list'.
>>
>
> As I recall, the better performance in your tests was with *smaller*
> GiST pages, right?  (The above didn't seem entirely clear on that.)
>
Yes, making pages smaller made index scanning faster.

-- Yeb


From:
Robert Haas
Date:

On Fri, Jun 18, 2010 at 1:53 PM, Greg Smith <> wrote:
> Matthew Wakeling wrote:
>>
>> This sort of thing has been fairly well researched at an academic level,
>> but has not been implemented in that many real world situations. I would
>> encourage its use in Postgres.
>
> I guess, but don't forget that work on PostgreSQL is driven by what problems
> people are actually running into.  There's a long list of performance
> improvements sitting in the TODO list waiting for people to find time to
> work on them, ones that we're quite certain are useful.  That anyone is
> going to chase after any of these speculative ideas from academic research
> instead of one of those is unlikely.  Your characterization of the potential
> speed up here is "Using a proper tree inside the index page would improve
> the CPU usage of the index lookups", which seems quite reasonable.
>  Regardless, when I consider "is that something I have any reason to suspect
> is a bottleneck on common workloads?", I don't think of any, and return to
> working on one of the things I already know is a bottleneck instead.

This is drifting a bit off-topic for this thread, but it's not so easy
to figure out from looking at the TODO which things are actually
important.  Performance-related improvements are mixed in with
non-performance related improvements, which are mixed in with things
that are probably not improvements at all.  And even to the extent
that you can identify the stuff that's performance-related, it's far
from obvious which things are most important.  Any thoughts on that?

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise Postgres Company

From:
Greg Smith
Date:

Robert Haas wrote:
> This is drifting a bit off-topic for this thread, but it's not so easy
> to figure out from looking at the TODO which things are actually
> important.  Performance-related improvements are mixed in with
> non-performance related improvements, which are mixed in with things
> that are probably not improvements at all.  And even to the extent
> that you can identify the stuff that's performance-related, it's far
> from obvious which things are most important.  Any thoughts on that?

I don't think it's off topic at all actually, and as usual I'll be
happy to argue why.  Reorganizing the TODO so that it's easier for
newcomers to consume is certainly a worthwhile effort, but a hard one to
"fund" (find time to do relative to more important things).  My point
was more that, statistically, *anything* on that list is likely a better
candidate to work on usefully than one of the random theoretical
performance improvements from research that pop up on the lists from
time to time.  People get excited about these papers and blog posts
sometimes, but the odds that one of them is actually in the critical
path, representing a solution to a current PostgreSQL bottleneck, are
dramatically lower than the odds that you'll find one by reading the
list of *known* issues.  Want to improve PostgreSQL performance?  Spend
more time reading the TODO, and less looking around elsewhere for
problems the database may or may not have.

I have a major time sink I'm due to free myself from this week, and the
idea of providing some guidance for a "low hanging performance fruit"
section of the TODO is a good one I should take a look at.  I have a
personal list of that sort already I should probably just make public,
since the ideas for improving things are not the valuable part I should
worry about keeping private anyway.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
   www.2ndQuadrant.us