Thread: [PERFORM] A Better External Sort?

[PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

26 September 2005, 17:46:56

>From: Ron Peacetree <rjpeace@earthlink.net>
>Sent: Sep 24, 2005 6:30 AM
>Subject: Re: [HACKERS] [PERFORM] Releasing memory during External sorting?
>
>... the amount of IO done is the most
>important of the things that you should be optimizing for in
>choosing an external sorting algorithm.
>
> <snip>
>
>Since sorting is a fundamental operation in many parts of a DBMS,
>this is a Big Deal.
>
>This discussion has gotten my creative juices flowing.  I'll post
>some Straw Man algorithm sketches after I've done some more
>thought.
>
As a thought exeriment, I've been considering the best way to sort 1TB
(2^40B) of 2-4KB (2^11-2^12B) records.  That's 2^28-2^29 records.

Part I: A Model of the System
The performance of such external sorts is limited by HD IO, then
memory IO, and finally CPU throughput.

On commodity HW, single HD IO is ~1/2048 (single HD realistic worst
case) to ~1/128 (single HD best case. No more than one seek every
~14.7ms for a ~50MB/s 7200rpm SATA II HD) the throughtput of RAM.

RAID HD IO will be in the range from as low as a single HD (RAID 1) to
~1/8 (a RAID system saturating the external IO bus) the throughput of
RAM.

RAM is ~1/8-1/16 the throughput and ~128x the latency of the data
pathways internal to the CPU.

This model suggests that HD IO will greatly dominate every other
factor, particuarly if we are talking about a single HD rather than a
peripheral bus saturating RAID subsystem. If at all possible, we want
to access the HD subsystem only once for each data item, and we want
to avoid seeking more than the critical number of seeks implied above
when doing it.  It also suggests that at a minimum, it's worth it to
spend ~8 memory operations or ~64 CPU operations to avoid a HD access.
Far more than that if we are talking about a single random access.

It's worth spending ~128 CPU operations to avoid a single random RAM
access, and literally 10's or even 100's of thousands of CPU operations to
avoid a random HD access.  In addition, there are many indications in
current ECE and IT literature that the performance gaps between these
pieces of computer systems are increasing and expected to continue to do
so for the forseeable future.  In short, _internal_ sorts have some, and are
going to increasingly have more, of the same IO problems usually
associated with external sorts.


Part II: a Suggested Algorithm
The simplest case is one where we have to order the data using a key that
only has two values.

Given 2^40B of data using 2KB or 4KB per record, the most compact
representation we can make of such a data set is to assign a 32b= 4B RID
or Rptr for location + a 1b key for each record.  Just the RID's would take up
1.25GB (250M records) or 2.5GB (500M records).  Enough space that even
an implied ordering of records may not fit into RAM.

Still, sorting 1.25GB or 2.5GB of RIDs is considerably less expensive in terms
of IO operations than sorting the actual 1TB of data.

That IO cost can be lowered even further if instead of actually physically
sorting the RIDs, we assign a RID to the appropriate catagory inside the CPU
as we scan the data set and append the entries in a catagory from CPU cache
to a RAM file in one IO burst whenever said catagory gets full inside the CPU.
We can do the same with either RAM file to HD whenever they get full.  The
sorted order of the data is found by concatenating the appropriate files at the
end of the process.

As simple as this example is, it has many of the characteristics we are looking for:
A= We access each piece of data once on HD and in RAM.
B= We do the minimum amount of RAM and HD IO, and almost no random IO in
either case.
C= We do as much work as possible within the CPU.
D= This process is stable.  Equal keys stay in the original order they are encountered.

To generalize this method, we first need our 1b Key to become a sufficiently large
enough Key or KeyPrefix to be useful, yet not so big as to be CPU cache unfriendly.

Cache lines (also sometimes called "blocks") are usually 64B= 512b in size.
Therefore our RID+Key or KeyPrefix should never be larger than this.  For a 2^40B
data set, a 5B RID leaves us with potentially as much as 59B of Key or KeyPrefix.
Since the data can't take on more than 40b worth different values (actually 500M= 29b
for our example), we have more than adequate space for Key or KeyPrefix.  We just
have to figure out how to use it effectively.
A typical CPU L2 cache can hold 10's or 100's of thousands of such cache lines.
That's enough that we should be able to do a significant amount of useful work within
the CPU w/o having to go off-die.

The data structure we are using to represent the sorted data also needs to be
generalized.  We want a space efficient DS that allows us to find any given element in
as few accesses as possible and that allows us to insert new elements or reorganize
the DS as efficiently as possible.  This being a DB discussion list, a B+ tree seems like
a fairly obvious suggestion ;-)

A B+ tree where each element is no larger than a cache line and no node is larger than
what fits into L2 cache can be created dynamically as we scan the data set via any of
the fast, low IO methods well known for doing so.  Since the L2 cache can hold 10's of
thousands of cache lines, it should be easy to make sure that the B+ tree has something
like 1000 elements per node (making the base of the logarithm for access being at least
1000).  The log base 1000 of 500M is ~2.9, so that means that even in the absolute
worst case where every one of the 500M records is unique we can find any given
element in less than 3 accesses of the B+ tree.  Increasing the order of the B+ tree is
an option to reduce average accesses even further.

Since the DS representing the sorted order of the data is a B+ tree, it's very "IO friendly"
if we need to store part or all of it on HD.

In an multiprocessor environment, we can assign chunks of the data set to different
CPUs, let them build their independant B+ trees to represent the data in sorted order from
their POV, and then merge the B+ trees very efficiently into one overall DS to represent
the sorted order of the entire data set.

Finally, since these are B+ trees, we can keep them around and easily update them at will
for frequent used sorting conditions.

What do people think?

Ron

Re: [PERFORM] A Better External Sort?

From

"Dann Corbit"

Date:

26 September 2005, 18:15:42


> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Ron Peacetree
> Sent: Monday, September 26, 2005 10:47 AM
> To: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
> Subject: [HACKERS] [PERFORM] A Better External Sort?
>
> >From: Ron Peacetree <rjpeace@earthlink.net>
> >Sent: Sep 24, 2005 6:30 AM
> >Subject: Re: [HACKERS] [PERFORM] Releasing memory during External
> sorting?
> >
> >... the amount of IO done is the most
> >important of the things that you should be optimizing for in
> >choosing an external sorting algorithm.
> >
> > <snip>
> >
> >Since sorting is a fundamental operation in many parts of a DBMS,
> >this is a Big Deal.
> >
> >This discussion has gotten my creative juices flowing.  I'll post
> >some Straw Man algorithm sketches after I've done some more
> >thought.
> >
> As a thought exeriment, I've been considering the best way to sort 1TB
> (2^40B) of 2-4KB (2^11-2^12B) records.  That's 2^28-2^29 records.
>
> Part I: A Model of the System
> The performance of such external sorts is limited by HD IO, then
> memory IO, and finally CPU throughput.
>
> On commodity HW, single HD IO is ~1/2048 (single HD realistic worst
> case) to ~1/128 (single HD best case. No more than one seek every
> ~14.7ms for a ~50MB/s 7200rpm SATA II HD) the throughtput of RAM.
>
> RAID HD IO will be in the range from as low as a single HD (RAID 1) to
> ~1/8 (a RAID system saturating the external IO bus) the throughput of
> RAM.
>
> RAM is ~1/8-1/16 the throughput and ~128x the latency of the data
> pathways internal to the CPU.
>
> This model suggests that HD IO will greatly dominate every other
> factor, particuarly if we are talking about a single HD rather than a
> peripheral bus saturating RAID subsystem. If at all possible, we want
> to access the HD subsystem only once for each data item,

If you can achieve that, I think you should be given a Nobel Prize, and
I mean that sincerely.  I also think that your analysis is interesting.

> and we want
> to avoid seeking more than the critical number of seeks implied above
> when doing it.  It also suggests that at a minimum, it's worth it to
> spend ~8 memory operations or ~64 CPU operations to avoid a HD access.
> Far more than that if we are talking about a single random access.
>
> It's worth spending ~128 CPU operations to avoid a single random RAM
> access, and literally 10's or even 100's of thousands of CPU
operations to
> avoid a random HD access.  In addition, there are many indications in
> current ECE and IT literature that the performance gaps between these
> pieces of computer systems are increasing and expected to continue to
do
> so for the forseeable future.  In short, _internal_ sorts have some,
and
> are
> going to increasingly have more, of the same IO problems usually
> associated with external sorts.

Knuth has made the observation (confirmed by others) that 40% of
mainframe CPU cycles are spent on sorting.  Hence, any sort of
optimization in this area is a potential for enormous savings.

> Part II: a Suggested Algorithm
> The simplest case is one where we have to order the data using a key
that
> only has two values.

I suggest testing against a very large class of distributions.  All of
the common statistical models are a start (Gaussian, Poisson, etc.) and
also single value, two distinct values, to some limit.

> Given 2^40B of data using 2KB or 4KB per record, the most compact
> representation we can make of such a data set is to assign a 32b= 4B
RID
> or Rptr for location + a 1b key for each record.  Just the RID's would
> take up
> 1.25GB (250M records) or 2.5GB (500M records).  Enough space that even
> an implied ordering of records may not fit into RAM.
>
> Still, sorting 1.25GB or 2.5GB of RIDs is considerably less expensive
in
> terms
> of IO operations than sorting the actual 1TB of data.
>
> That IO cost can be lowered even further if instead of actually
physically
> sorting the RIDs, we assign a RID to the appropriate catagory inside
the
> CPU
> as we scan the data set and append the entries in a catagory from CPU
> cache
> to a RAM file in one IO burst whenever said catagory gets full inside
the
> CPU.
> We can do the same with either RAM file to HD whenever they get full.
The
> sorted order of the data is found by concatenating the appropriate
files
> at the
> end of the process.
>
> As simple as this example is, it has many of the characteristics we
are
> looking for:
> A= We access each piece of data once on HD and in RAM.
> B= We do the minimum amount of RAM and HD IO, and almost no random IO
in
> either case.
> C= We do as much work as possible within the CPU.
> D= This process is stable.  Equal keys stay in the original order they
are
> encountered.
>
> To generalize this method, we first need our 1b Key to become a
> sufficiently large
> enough Key or KeyPrefix to be useful, yet not so big as to be CPU
cache
> unfriendly.
>
> Cache lines (also sometimes called "blocks") are usually 64B= 512b in
> size.
> Therefore our RID+Key or KeyPrefix should never be larger than this.
For
> a 2^40B
> data set, a 5B RID leaves us with potentially as much as 59B of Key or
> KeyPrefix.
> Since the data can't take on more than 40b worth different values
> (actually 500M= 29b
> for our example), we have more than adequate space for Key or
KeyPrefix.
> We just
> have to figure out how to use it effectively.
> A typical CPU L2 cache can hold 10's or 100's of thousands of such
cache
> lines.
> That's enough that we should be able to do a significant amount of
useful
> work within
> the CPU w/o having to go off-die.
>
> The data structure we are using to represent the sorted data also
needs to
> be
> generalized.  We want a space efficient DS that allows us to find any
> given element in
> as few accesses as possible and that allows us to insert new elements
or
> reorganize
> the DS as efficiently as possible.  This being a DB discussion list, a
B+
> tree seems like
> a fairly obvious suggestion ;-)
>
> A B+ tree where each element is no larger than a cache line and no
node is
> larger than
> what fits into L2 cache can be created dynamically as we scan the data
set
> via any of
> the fast, low IO methods well known for doing so.  Since the L2 cache
can
> hold 10's of
> thousands of cache lines, it should be easy to make sure that the B+
tree
> has something
> like 1000 elements per node (making the base of the logarithm for
access
> being at least
> 1000).  The log base 1000 of 500M is ~2.9, so that means that even in
the
> absolute
> worst case where every one of the 500M records is unique we can find
any
> given
> element in less than 3 accesses of the B+ tree.  Increasing the order
of
> the B+ tree is
> an option to reduce average accesses even further.
>
> Since the DS representing the sorted order of the data is a B+ tree,
it's
> very "IO friendly"
> if we need to store part or all of it on HD.
>
> In an multiprocessor environment, we can assign chunks of the data set
to
> different
> CPUs, let them build their independant B+ trees to represent the data
in
> sorted order from
> their POV, and then merge the B+ trees very efficiently into one
overall
> DS to represent
> the sorted order of the entire data set.
>
> Finally, since these are B+ trees, we can keep them around and easily
> update them at will
> for frequent used sorting conditions.
>
> What do people think?

I think that your analysis is very interesting.  I would like to see the
result of the experiment.

I think that the btrees are going to be O(n*log(n)) in construction of
the indexes in disk access unless you memory map them [which means you
would need stupendous memory volume] and so I cannot say that I really
understand your idea yet.  Can you draw a picture of it for me?  (I am
dyslexic and understand things far better when I can visualize it).

Re: [PERFORM] A Better External Sort?

From

"Jonah H. Harris"

Date:

26 September 2005, 18:27:39

Ron,

Having rested my brain for the last few days, your theory made for interesting reading... Rather than argue the technical specs, I'd love to see an implementation :)

-Jonah

On 9/26/05, Dann Corbit <DCorbit@connx.com> wrote:

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Ron Peacetree
> Sent: Monday, September 26, 2005 10:47 AM
> To: pgsql-hackers@postgresql.org; pgsql-performance@postgresql.org
> Subject: [HACKERS] [PERFORM] A Better External Sort?
>
> >From: Ron Peacetree <rjpeace@earthlink.net>
> >Sent: Sep 24, 2005 6:30 AM
> >Subject: Re: [HACKERS] [PERFORM] Releasing memory during External
> sorting?
> >
> >... the amount of IO done is the most
> >important of the things that you should be optimizing for in
> >choosing an external sorting algorithm.
> >
> > <snip>
> >
> >Since sorting is a fundamental operation in many parts of a DBMS,
> >this is a Big Deal.
> >
> >This discussion has gotten my creative juices flowing.  I'll post
> >some Straw Man algorithm sketches after I've done some more
> >thought.
> >
> As a thought exeriment, I've been considering the best way to sort 1TB
> (2^40B) of 2-4KB (2^11-2^12B) records.  That's 2^28-2^29 records.
>
> Part I: A Model of the System
> The performance of such external sorts is limited by HD IO, then
> memory IO, and finally CPU throughput.
>
> On commodity HW, single HD IO is ~1/2048 (single HD realistic worst
> case) to ~1/128 (single HD best case. No more than one seek every
> ~14.7ms for a ~50MB/s 7200rpm SATA II HD) the throughtput of RAM.
>
> RAID HD IO will be in the range from as low as a single HD (RAID 1) to
> ~1/8 (a RAID system saturating the external IO bus) the throughput of
> RAM.
>
> RAM is ~1/8-1/16 the throughput and ~128x the latency of the data
> pathways internal to the CPU.
>
> This model suggests that HD IO will greatly dominate every other
> factor, particuarly if we are talking about a single HD rather than a
> peripheral bus saturating RAID subsystem. If at all possible, we want
> to access the HD subsystem only once for each data item,

If you can achieve that, I think you should be given a Nobel Prize, and
I mean that sincerely.  I also think that your analysis is interesting.

> and we want
> to avoid seeking more than the critical number of seeks implied above
> when doing it.  It also suggests that at a minimum, it's worth it to
> spend ~8 memory operations or ~64 CPU operations to avoid a HD access.
> Far more than that if we are talking about a single random access.
>
> It's worth spending ~128 CPU operations to avoid a single random RAM
> access, and literally 10's or even 100's of thousands of CPU
operations to
> avoid a random HD access.  In addition, there are many indications in
> current ECE and IT literature that the performance gaps between these
> pieces of computer systems are increasing and expected to continue to
do
> so for the forseeable future.  In short, _internal_ sorts have some,
and
> are
> going to increasingly have more, of the same IO problems usually
> associated with external sorts.

Knuth has made the observation (confirmed by others) that 40% of
mainframe CPU cycles are spent on sorting.  Hence, any sort of
optimization in this area is a potential for enormous savings.

> Part II: a Suggested Algorithm
> The simplest case is one where we have to order the data using a key
that
> only has two values.

I suggest testing against a very large class of distributions.  All of
the common statistical models are a start (Gaussian, Poisson, etc.) and
also single value, two distinct values, to some limit.

> Given 2^40B of data using 2KB or 4KB per record, the most compact
> representation we can make of such a data set is to assign a 32b= 4B
RID
> or Rptr for location + a 1b key for each record.  Just the RID's would
> take up
> 1.25GB (250M records) or 2.5GB (500M records).  Enough space that even
> an implied ordering of records may not fit into RAM.
>
> Still, sorting 1.25GB or 2.5GB of RIDs is considerably less expensive
in
> terms
> of IO operations than sorting the actual 1TB of data.
>
> That IO cost can be lowered even further if instead of actually
physically
> sorting the RIDs, we assign a RID to the appropriate catagory inside
the
> CPU
> as we scan the data set and append the entries in a catagory from CPU
> cache
> to a RAM file in one IO burst whenever said catagory gets full inside
the
> CPU.
> We can do the same with either RAM file to HD whenever they get full.
The
> sorted order of the data is found by concatenating the appropriate
files
> at the
> end of the process.
>
> As simple as this example is, it has many of the characteristics we
are
> looking for:
> A= We access each piece of data once on HD and in RAM.
> B= We do the minimum amount of RAM and HD IO, and almost no random IO
in
> either case.
> C= We do as much work as possible within the CPU.
> D= This process is stable.  Equal keys stay in the original order they
are
> encountered.
>
> To generalize this method, we first need our 1b Key to become a
> sufficiently large
> enough Key or KeyPrefix to be useful, yet not so big as to be CPU
cache
> unfriendly.
>
> Cache lines (also sometimes called "blocks") are usually 64B= 512b in
> size.
> Therefore our RID+Key or KeyPrefix should never be larger than this.
For
> a 2^40B
> data set, a 5B RID leaves us with potentially as much as 59B of Key or
> KeyPrefix.
> Since the data can't take on more than 40b worth different values
> (actually 500M= 29b
> for our example), we have more than adequate space for Key or
KeyPrefix.
> We just
> have to figure out how to use it effectively.
> A typical CPU L2 cache can hold 10's or 100's of thousands of such
cache
> lines.
> That's enough that we should be able to do a significant amount of
useful
> work within
> the CPU w/o having to go off-die.
>
> The data structure we are using to represent the sorted data also
needs to
> be
> generalized.  We want a space efficient DS that allows us to find any
> given element in
> as few accesses as possible and that allows us to insert new elements
or
> reorganize
> the DS as efficiently as possible.  This being a DB discussion list, a
B+
> tree seems like
> a fairly obvious suggestion ;-)
>
> A B+ tree where each element is no larger than a cache line and no
node is
> larger than
> what fits into L2 cache can be created dynamically as we scan the data
set
> via any of
> the fast, low IO methods well known for doing so.  Since the L2 cache
can
> hold 10's of
> thousands of cache lines, it should be easy to make sure that the B+
tree
> has something
> like 1000 elements per node (making the base of the logarithm for
access
> being at least
> 1000).  The log base 1000 of 500M is ~2.9, so that means that even in
the
> absolute
> worst case where every one of the 500M records is unique we can find
any
> given
> element in less than 3 accesses of the B+ tree.  Increasing the order
of
> the B+ tree is
> an option to reduce average accesses even further.
>
> Since the DS representing the sorted order of the data is a B+ tree,
it's
> very "IO friendly"
> if we need to store part or all of it on HD.
>
> In an multiprocessor environment, we can assign chunks of the data set
to
> different
> CPUs, let them build their independant B+ trees to represent the data
in
> sorted order from
> their POV, and then merge the B+ trees very efficiently into one
overall
> DS to represent
> the sorted order of the entire data set.
>
> Finally, since these are B+ trees, we can keep them around and easily
> update them at will
> for frequent used sorting conditions.
>
> What do people think?

I think that your analysis is very interesting.  I would like to see the
result of the experiment.

I think that the btrees are going to be O(n*log(n)) in construction of
the indexes in disk access unless you memory map them [which means you
would need stupendous memory volume] and so I cannot say that I really
understand your idea yet.  Can you draw a picture of it for me?  (I am
dyslexic and understand things far better when I can visualize it).

---------------------------(end of broadcast)---------------------------
TIP 4: Have you searched our list archives?

               http://archives.postgresql.org

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

Re: [PERFORM] A Better External Sort?

From

Tom Lane

Date:

26 September 2005, 22:42:24

Ron Peacetree <rjpeace@earthlink.net> writes:
> Let's start by assuming that an element is <= in size to a cache line and a
> node fits into L1 DCache.  [ much else snipped ]

So far, you've blithely assumed that you know the size of a cache line,
the sizes of L1 and L2 cache, and that you are working with sort keys
that you can efficiently pack into cache lines.  And that you know the
relative access speeds of the caches and memory so that you can schedule
transfers, and that the hardware lets you get at that transfer timing.
And that the number of distinct key values isn't very large.

I don't see much prospect that anything we can actually use in a
portable fashion is going to emerge from this line of thought.

            regards, tom lane

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

27 September 2005, 09:30:20

>From: Dann Corbit <DCorbit@connx.com>
>Sent: Sep 26, 2005 5:13 PM
>To: Ron Peacetree <rjpeace@earthlink.net>, pgsql-hackers@postgresql.org,
>    pgsql-performance@postgresql.org
>Subject: RE: [HACKERS] [PERFORM] A Better External Sort?
>
>I think that the btrees are going to be O(n*log(n)) in construction of
>the indexes in disk access unless you memory map them [which means you
>would need stupendous memory volume] and so I cannot say that I really
>understand your idea yet.
>
Traditional algorithms for the construction of Btree variants (B, B+, B*, ...)
don't require O(nlgn) HD accesses.  These shouldn't either.

Let's start by assuming that an element is <= in size to a cache line and a
node fits into L1 DCache.  To make the discussion more concrete, I'll use a
64KB L1 cache + a 1MB L2 cache only as an example.

Simplest case: the Key has few enough distinct values that all Keys or
KeyPrefixes fit into L1 DCache (for a 64KB cache with 64B lines, that's
 <= 1000 different values.  More if we can fit more than 1 element into
each cache line.).

As we scan the data set coming in from HD, we compare the Key or KeyPrefix
to the sorted list of Key values in the node.  This can be done in O(lgn) using
Binary Search or O(lglgn) using a variation of Interpolation Search.
If the Key value exists, we append this RID to the list of RIDs having the
same Key:
  If the RAM buffer of this list of RIDs is full we append it and the current
  RID to the HD list of these RIDs.
Else we insert this new key value into its proper place in the sorted list of Key
values in the node and start a new list for this value of RID.

We allocate room for a CPU write buffer so we can schedule RAM writes to
the RAM lists of RIDs so as to minimize the randomness of them.

When we are finished scanning the data set from HD, the sorted node with
RID lists for each Key value contains the sort order for the whole data set.

Notice that almost all of the random data access is occuring within the CPU
rather than in RAM or HD, and that we are accessing RAM or HD only when
absolutely needed.

Next simplest case: Multiple nodes, but they all fit in the CPU cache(s).
In the given example CPU, we will be able to fit at least 1000 elements per
node and 2^20/2^16= up to 16 such nodes in this CPU.  We use a node's
worth of space as a RAM write buffer, so we end up with room for 15 such
nodes in this CPU.  This is enough for a 2 level index to at least 15,000
distinct Key value lists.

All of the traditional tricks for splitting a Btree node and redistributing
elements within them during insertion or splitting for maximum node
utilization can be used here.

The most general case: There are too many nodes to fit within the CPU
cache(s).  The root node now points to a maximum of at least 1000 nodes
since each element in the root node points to another node.  A full 2 level
index is now enough to point to at least 10^6 distinct Key value lists, and
3 levels will index more distinct Key values than is possible in our 1TB,
500M record example.

We can use some sort of node use prediction algorithm like LFU to decide
which node should be moved out of CPU when we have to replace one of
the nodes in the CPU.  The nodes in RAM or on HD can be arranged to
maximize streaming IO behavior and minimize random access IO
behavior.

As you can see, both the RAM and HD IO are as minimized as possible,
and what such IO there is has been optimized for streaming behavior.


>Can you draw a picture of it for me?  (I am dyslexic and understand things
>far better when I can visualize it).
>
Not much for pictures.  Hopefully the explanation helps?

Ron

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

27 September 2005, 09:34:08

SECOND ATTEMPT AT POST.  Web mailer appears to have
eaten first one.  I apologize in advance if anyone gets two
versions of this post.
=r

>From: Tom Lane <tgl@sss.pgh.pa.us>
>Sent: Sep 26, 2005 9:42 PM
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>So far, you've blithely assumed that you know the size of a cache line,
>the sizes of L1 and L2 cache,
>
NO.  I used exact values only as examples.  Realistic examples drawn
from an extensive survey of past, present, and what I could find out
about future systems; but only examples nonetheless.  For instance,
Hennessy and Patterson 3ed points out that 64B cache lines are
optimally performing for caches between 16KB and 256KB.  The same
source as well as sources specifically on CPU memory hierarchy
design points out that we are not likely to see L1 caches larger than
256KB in the forseeable future.

The important point was the idea of an efficient Key, rather than
Record, sort using a CPU cache friendly data structure with provably
good space and IO characteristics based on a reasonable model of
current and likely future single box computer architecture (although
it would be fairly easy to extend it to include the effects of
networking.)

No apriori exact or known values are required for the method to work.


>and that you are working with sort keys that you can efficiently pack
>into cache lines.
>
Not "pack".  "map".  n items can not take on more than n values.  n
values can be represented in lgn bits.  Less efficient mappings can
also work.  Either way I demonstrated that we have plenty of space in
a likely and common cache line size.  Creating a mapping function
to represent m values in lgm bits is a well known hack, and if we keep
track of minimum and maximum values for fields during insert and
delete operations, we can even create mapping functions fairly easily.
(IIRC, Oracle does keep track of minimum and maximum field
values.)


>And that you know the relative access speeds of the caches and
>memory so that you can schedule transfers,
>
Again, no.  I created a reasonable model of a computer system that
holds remarkably well over a _very_ wide range of examples.  I
don't need the numbers to be exactly right to justify my approach
to this problem or understand why other approaches may have
downsides.  I just have to get the relative performance of the
system components and the relative performance gap between them
reasonably correct.  The stated model does that very well.

Please don't take my word for it.  Go grab some random box:
laptop, desktop, unix server, etc and try it for yourself.  Part of the
reason I published the model was so that others could examine it.


>and that the hardware lets you get at that transfer timing.
>
Never said anything about this, and in fact I do not need any such.


>And that the number of distinct key values isn't very large.
>
Quite the opposite in fact.  I went out of my way to show that the
method still works well even if every Key is distinct.  It is _more
efficient_ when the number of distinct keys is small compared to
the number of data items, but it works as well as any other Btree
would when all n of the Keys are distinct.  This is just a CPU cache
and more IO friendly Btree, not some magical and unheard of
technique.  It's just as general purpose as Btrees usually are.

I'm simply looking at the current and likely future state of computer
systems architecture and coming up with a slight twist on how to use
already well known and characterized techniques. not trying to start
a revolution.


I'm trying very hard NOT to waste anyone's time around here.
Including my own
Ron

Re: [PERFORM] A Better External Sort?

From

"Jonah H. Harris"

Date:

27 September 2005, 09:49:29

Ron,

Again, if you feel strongly enough about the theory to argue it, I recommend that you spend your time constructively; create an implemenation of it. Citing academics is cool and all, but code speaks louder than theory in this case. As Tom mentioned, this has to be portable. Making assumptions about computing architectures (especially those in the future), is fine for theory, but not practical for something that needs to be maintained in the real-world. Go forth and write thy code.

-Jonah

On 9/27/05, Ron Peacetree <rjpeace@earthlink.net> wrote:

SECOND ATTEMPT AT POST.  Web mailer appears to have
eaten first one.  I apologize in advance if anyone gets two
versions of this post.
=r

>From: Tom Lane <tgl@sss.pgh.pa.us >
>Sent: Sep 26, 2005 9:42 PM
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>So far, you've blithely assumed that you know the size of a cache line,
>the sizes of L1 and L2 cache,
>
NO.  I used exact values only as examples.  Realistic examples drawn
from an extensive survey of past, present, and what I could find out
about future systems; but only examples nonetheless.  For instance,
Hennessy and Patterson 3ed points out that 64B cache lines are
optimally performing for caches between 16KB and 256KB.  The same
source as well as sources specifically on CPU memory hierarchy
design points out that we are not likely to see L1 caches larger than
256KB in the forseeable future.

The important point was the idea of an efficient Key, rather than
Record, sort using a CPU cache friendly data structure with provably
good space and IO characteristics based on a reasonable model of
current and likely future single box computer architecture (although
it would be fairly easy to extend it to include the effects of
networking.)

No apriori exact or known values are required for the method to work.

>and that you are working with sort keys that you can efficiently pack
>into cache lines.
>
Not "pack".  "map".  n items can not take on more than n values.  n
values can be represented in lgn bits.  Less efficient mappings can
also work.  Either way I demonstrated that we have plenty of space in
a likely and common cache line size.  Creating a mapping function
to represent m values in lgm bits is a well known hack, and if we keep
track of minimum and maximum values for fields during insert and
delete operations, we can even create mapping functions fairly easily.
(IIRC, Oracle does keep track of minimum and maximum field
values.)

>And that you know the relative access speeds of the caches and
>memory so that you can schedule transfers,
>
Again, no.  I created a reasonable model of a computer system that
holds remarkably well over a _very_ wide range of examples.  I
don't need the numbers to be exactly right to justify my approach
to this problem or understand why other approaches may have
downsides.  I just have to get the relative performance of the
system components and the relative performance gap between them
reasonably correct.  The stated model does that very well.

Please don't take my word for it.  Go grab some random box:
laptop, desktop, unix server, etc and try it for yourself.  Part of the
reason I published the model was so that others could examine it.

>and that the hardware lets you get at that transfer timing.
>
Never said anything about this, and in fact I do not need any such.

>And that the number of distinct key values isn't very large.
>
Quite the opposite in fact.  I went out of my way to show that the
method still works well even if every Key is distinct.  It is _more
efficient_ when the number of distinct keys is small compared to
the number of data items, but it works as well as any other Btree
would when all n of the Keys are distinct.  This is just a CPU cache
and more IO friendly Btree, not some magical and unheard of
technique.  It's just as general purpose as Btrees usually are.

I'm simply looking at the current and likely future state of computer
systems architecture and coming up with a slight twist on how to use
already well known and characterized techniques. not trying to start
a revolution.

I'm trying very hard NOT to waste anyone's time around here.
Including my own
Ron

---------------------------(end of broadcast)---------------------------
TIP 5: don't forget to increase your free space map settings

--
Respectfully,

Jonah H. Harris, Database Internals Architect
EnterpriseDB Corporation
http://www.enterprisedb.com/

Re: [PERFORM] A Better External Sort?

From

Josh Berkus

Date:

27 September 2005, 13:16:02

Ron,

I've somehow missed part of this thread, which is a shame since this is
an area of primary concern for me.

Your suggested algorithm seems to be designed to relieve I/O load by
making more use of the CPU.   (if I followed it correctly).  However,
that's not PostgreSQL's problem; currently for us external sort is a
*CPU-bound* operation, half of which is value comparisons.  (oprofiles
available if anyone cares)

So we need to look, instead, at algorithms which make better use of
work_mem to lower CPU activity, possibly even at the expense of I/O.

--Josh Berkus

Re: [PERFORM] A Better External Sort?

From

"Jeffrey W. Baker"

Date:

27 September 2005, 14:27:01

On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:

> That Btree can be used to generate a physical reordering of the data
> in one pass, but that's the weakest use for it.  The more powerful
> uses involve allowing the Btree to persist and using it for more
> efficient re-searches or combining it with other such Btrees (either as
> a step in task distribution across multiple CPUs or as a more efficient
> way to do things like joins by manipulating these Btrees rather than
> the actual records.)

Maybe you could describe some concrete use cases.  I can see what you
are getting at, and I can imagine some advantageous uses, but I'd like
to know what you are thinking.

Specifically I'd like to see some cases where this would beat sequential
scan.  I'm thinking that in your example of a terabyte table with a
column having only two values, all the queries I can think of would be
better served with a sequential scan.

Perhaps I believe this because you can now buy as much sequential I/O as
you want.  Random I/O is the only real savings.

-jwb

Re: [PERFORM] A Better External Sort?

From

"Kevin Grittner"

Date:

28 September 2005, 10:17:56

I can't help wondering how a couple thousand context switches per
second would affect the attempt to load disk info into the L1 and
L2 caches.  That's pretty much the low end of what I see when the
server is under any significant load.

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

28 September 2005, 10:22:49

>From: Josh Berkus <josh@agliodbs.com>
>ent: Sep 27, 2005 12:15 PM
>To: Ron Peacetree <rjpeace@earthlink.net>
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>I've somehow missed part of this thread, which is a shame since this is
>an area of primary concern for me.
>
>Your suggested algorithm seems to be designed to relieve I/O load by
>making more use of the CPU.   (if I followed it correctly).
>
The goal is to minimize all IO load.  Not just HD IO load, but also RAM
IO load.  Particularly random access IO load of any type (for instance:
"the pointer chasing problem").

In addition, the design replaces explicit data or explicit key manipulation
with the creation of a smaller, far more CPU and IO efficient data
structure (essentially a CPU cache friendly Btree index) of the sorted
order of the data.

That Btree can be used to generate a physical reordering of the data
in one pass, but that's the weakest use for it.  The more powerful
uses involve allowing the Btree to persist and using it for more
efficient re-searches or combining it with other such Btrees (either as
a step in task distribution across multiple CPUs or as a more efficient
way to do things like joins by manipulating these Btrees rather than
the actual records.)


>However, that's not PostgreSQL's problem; currently for us external
>sort is a *CPU-bound* operation, half of which is value comparisons.
>(oprofiles available if anyone cares)
>
>So we need to look, instead, at algorithms which make better use of
>work_mem to lower CPU activity, possibly even at the expense of I/O.
>
I suspect that even the highly efficient sorting code we have is
suffering more pessimal CPU IO behavior than what I'm presenting.
Jim Gray's external sorting contest web site points out that memory IO
has become a serious problem for most of the contest entries.

Also, I'll bet the current code manipulates more data.

Finally, there's the possibilty of reusing the product of this work to a
degree and in ways that we can't with our current sorting code.


Now all we need is resources and time to create a prototype.
Since I'm not likely to have either any time soon, I'm hoping that
I'll be able to explain this well enough that others can test it.

*sigh* I _never_ have enough time or resources any more...
Ron

Re: [PERFORM] A Better External Sort?

From

"Jeffrey W. Baker"

Date:

29 September 2005, 01:27:28

On Wed, 2005-09-28 at 12:03 -0400, Ron Peacetree wrote:
> >From: "Jeffrey W. Baker" <jwbaker@acm.org>
> >Sent: Sep 27, 2005 1:26 PM
> >To: Ron Peacetree <rjpeace@earthlink.net>
> >Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
> >
> >On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:
> >
> >>That Btree can be used to generate a physical reordering of the data
> >>in one pass, but that's the weakest use for it.  The more powerful
> >>uses involve allowing the Btree to persist and using it for more
> >>efficient re-searches or combining it with other such Btrees (either as
> >>a step in task distribution across multiple CPUs or as a more efficient
> >>way to do things like joins by manipulating these Btrees rather than
> >>the actual records.)
> >
> >Maybe you could describe some concrete use cases.  I can see what
> >you are getting at, and I can imagine some advantageous uses, but
> >I'd like to know what you are thinking.
> >
> >Specifically I'd like to see some cases where this would beat sequential
> >scan.  I'm thinking that in your example of a terabyte table with a
> >column having only two values, all the queries I can think of would be
> >better served with a sequential scan.
> >
> In my original example, a sequential scan of the 1TB of 2KB or 4KB
> records, => 250M or 500M records of data, being sorted on a binary
> value key will take ~1000x more time than reading in the ~1GB Btree
> I described that used a Key+RID (plus node pointers) representation
> of the data.

You are engaging in a length and verbose exercise in mental
masturbation, because you have not yet given a concrete example of a
query where this stuff would come in handy.  A common, general-purpose
case would be the best.

We can all see that the method you describe might be a good way to sort
a very large dataset with some known properties, which would be fine if
you are trying to break the terasort benchmark.  But that's not what
we're doing here.  We are designing and operating relational databases.
So please explain the application.

Your main example seems to focus on a large table where a key column has
constrained values.  This case is interesting in proportion to the
number of possible values.  If I have billions of rows, each having one
of only two values, I can think of a trivial and very fast method of
returning the table "sorted" by that key: make two sequential passes,
returning the first value on the first pass and the second value on the
second pass.  This will be faster than the method you propose.

I think an important aspect you have failed to address is how much of
the heap you must visit after the sort is complete.  If you are
returning every tuple in the heap then the optimal plan will be very
different from the case when you needn't.

-jwb

PS: Whatever mailer you use doesn't understand or respect threading nor
attribution.  Out of respect for the list's readers, please try a mailer
that supports these 30-year-old fundamentals of electronic mail.

Re: [PERFORM] A Better External Sort?

From

Pailloncy Jean-Gerard

Date:

29 September 2005, 08:12:01

>> Your main example seems to focus on a large table where a key
>> column has
>> constrained values.  This case is interesting in proportion to the
>> number of possible values.  If I have billions of rows, each
>> having one
>> of only two values, I can think of a trivial and very fast method of
>> returning the table "sorted" by that key: make two sequential passes,
>> returning the first value on the first pass and the second value
>> on the
>> second pass.  This will be faster than the method you propose.
>>
>>
> 1= No that was not my main example.  It was the simplest example
> used to
> frame the later more complicated examples.  Please don't get hung
> up on it.
>
> 2= You are incorrect.  Since IO is the most expensive operation we
> can do,
> any method that makes two passes through the data at top scanning
> speed
> will take at least 2x as long as any method that only takes one
> such pass.
You do not get the point.
As the time you get the sorted references to the tuples, you need to
fetch the tuples themself, check their visbility, etc. and returns
them to the client.

So,
if there is only 2 values in the column of big table that is larger
than available RAM,
two seq scans of the table without any sorting
is the fastest solution.

Cordialement,
Jean-Gérard Pailloncy

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

29 September 2005, 09:58:13

>From: "Jeffrey W. Baker" <jwbaker@acm.org>
>Sent: Sep 29, 2005 12:27 AM
>To: Ron Peacetree <rjpeace@earthlink.net>
>Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>You are engaging in a length and verbose exercise in mental
>masturbation, because you have not yet given a concrete example of a
>query where this stuff would come in handy.  A common, general-purpose
>case would be the best.
>
??? I posted =3= specific classes of common, general-purpose query
operations where OES and the OES Btrees look like they should be
superior to current methods:
1= when splitting sorting or other operations across multiple CPUs
2= when doing joins of different tables by doing the join on these Btrees
rather than the original tables.
3= when the opportunity arises to reuse OES Btree results of previous
sorts for different keys in the same table.  Now we can combine the
existing Btrees to obtain the new order based on the composite key
without ever manipulating the original, much larger, table.

In what way are these examples not "concrete"?


>We can all see that the method you describe might be a good way to sort
>a very large dataset with some known properties, which would be fine if
>you are trying to break the terasort benchmark.  But that's not what
>we're doing here.  We are designing and operating relational databases.
>So please explain the application.
>
This is a GENERAL method.  It's based on CPU cache efficient Btrees that
use variable length prefix keys and RIDs.
It assumes NOTHING about the data or the system in order to work.
I gave some concrete examples for the sake of easing explanation, NOT
as an indication of assumptions or limitations of the method.  I've even
gone out of my way to prove that no such assumptions or limitations exist.
Where in the world are you getting such impressions?


>Your main example seems to focus on a large table where a key column has
>constrained values.  This case is interesting in proportion to the
>number of possible values.  If I have billions of rows, each having one
>of only two values, I can think of a trivial and very fast method of
>returning the table "sorted" by that key: make two sequential passes,
>returning the first value on the first pass and the second value on the
>second pass.  This will be faster than the method you propose.
>
1= No that was not my main example.  It was the simplest example used to
frame the later more complicated examples.  Please don't get hung up on it.

2= You are incorrect.  Since IO is the most expensive operation we can do,
any method that makes two passes through the data at top scanning speed
will take at least 2x as long as any method that only takes one such pass.


>I think an important aspect you have failed to address is how much of
>the heap you must visit after the sort is complete.  If you are
>returning every tuple in the heap then the optimal plan will be very
>different from the case when you needn't.
>
Hmmm.  Not sure which "heap" you are referring to, but the OES Btree
index is provably the lowest (in terms of tree height) and smallest
possible CPU cache efficient data structure that one can make and still
have all of the traditional benefits associated with a Btree representation
of a data set.

Nonetheless, returning a RID, or all RIDs with(out) the same Key, or all
RIDs (not) within a range of Keys, or simply all RIDs in sorted order is
efficient.  Just as should be for a Btree (actually it's a B+ tree variant to
use Knuth's nomenclature).  I'm sure someone posting from acm.org
recognizes how each of these Btree operations maps to various SQL
features...

I haven't been talking about query plans because they are orthogonal to
the issue under discussion?  If we use a layered model for PostgreSQL's
architecture, this functionality is more primal than that of a query
planner.  ALL query plans that currently involve sorts will benefit from a
more efficient way to do, or avoid, sorts.


>PS: Whatever mailer you use doesn't understand or respect threading nor
>attribution.  Out of respect for the list's readers, please try a mailer
>that supports these 30-year-old fundamentals of electronic mail.
>
That is an issue of infrastructure on the recieving side, not on the sending
(my) side since even my web mailer seems appropriately RFC conformant.
Everything seems to be going in the correct places and being properly
organized on archival.postgres.org ...

Ron

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

29 September 2005, 09:58:22

If I've done this correctly, there should not be anywhere near
the number of context switches we currently see while sorting.

Each unscheduled context switch represents something unexpected
occuring or things not being where they are needed when they are
needed.  Reducing such circumstances to the absolute minimum
was one of the design goals.

Reducing the total amount of IO to the absolute minimum should
help as well.

Ron


-----Original Message-----
From: Kevin Grittner <Kevin.Grittner@wicourts.gov>
Sent: Sep 27, 2005 11:21 AM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

I can't help wondering how a couple thousand context switches per
second would affect the attempt to load disk info into the L1 and
L2 caches.  That's pretty much the low end of what I see when the
server is under any significant load.

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

29 September 2005, 10:03:05

>From: "Jeffrey W. Baker" <jwbaker@acm.org>
>Sent: Sep 27, 2005 1:26 PM
>To: Ron Peacetree <rjpeace@earthlink.net>
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>On Tue, 2005-09-27 at 13:15 -0400, Ron Peacetree wrote:
>
>>That Btree can be used to generate a physical reordering of the data
>>in one pass, but that's the weakest use for it.  The more powerful
>>uses involve allowing the Btree to persist and using it for more
>>efficient re-searches or combining it with other such Btrees (either as
>>a step in task distribution across multiple CPUs or as a more efficient
>>way to do things like joins by manipulating these Btrees rather than
>>the actual records.)
>
>Maybe you could describe some concrete use cases.  I can see what
>you are getting at, and I can imagine some advantageous uses, but
>I'd like to know what you are thinking.
>
1= In a 4P box, we split the data in RAM into 4 regions and create
a CPU cache friendly Btree using the method I described for each CPU.
The 4 Btrees can be merged in a more time and space efficient manner
than the original records to form a Btree that represents the sorted order
of the entire data set.  Any of these Btrees can be allowed to persist to
lower the cost of doing similar operations in the future (Updating the
Btrees during inserts and deletes is cheaper than updating the original
data files and then redoing the same sort from scratch in the future.)
Both the original sort and future such sorts are made more efficient
than current methods.

2= We use my method to sort two different tables.  We now have these
very efficient representations of a specific ordering on these tables.  A
join operation can now be done using these Btrees rather than the
original data tables that involves less overhead than many current
methods.

3=  We have multiple such Btrees for the same data set representing
sorts done using different fields (and therefore different Keys).
Calculating a sorted order for the data based on a composition of
those Keys is now cheaper than doing the sort based on the composite
Key from scratch.  When some of the Btrees exist and some of them
do not, there is a tradeoff calculation to be made.  Sometimes it will be
cheaper to do the sort from scratch using the composite Key.

>Specifically I'd like to see some cases where this would beat sequential
>scan.  I'm thinking that in your example of a terabyte table with a
>column having only two values, all the queries I can think of would be
>better served with a sequential scan.
>
In my original example, a sequential scan of the 1TB of 2KB or 4KB
records, => 250M or 500M records of data, being sorted on a binary
value key will take ~1000x more time than reading in the ~1GB Btree
I described that used a Key+RID (plus node pointers) representation
of the data.

Just to clarify the point further,
1TB of 1B records => 2^40 records of at most 256 distinct values.
1TB of 2B records => 2^39 records of at most 2^16 distinct values.
1TB of 4B records => 2^38 records of at most 2^32 distinct values.
1TB of 5B records => 200B records of at most 200B distinct values.
From here on, the number of possible distinct values is limited by the
number of records.
100B records are used in the "Indy" version of Jim Gray's sorting
contests, so 1TB => 10B records.
2KB-4KB is the most common record size I've seen in enterprise
class DBMS (so I used this value to make my initial example more
realistic).

Therefore the vast majority of the time representing a data set by Key
will use less space that the original record.  Less space used means
less IO to scan the data set, which means faster scan times.

This is why index files work in the first place, right?

>Perhaps I believe this because you can now buy as much sequential I/O
>as you want.  Random I/O is the only real savings.
>
1= No, you can not "buy as much sequential IO as you want".  Even if
with an infinite budget, there are physical and engineering limits.  Long
before you reach those limits, you will pay exponentially increasing costs
for linearly increasing performance gains.  So even if you _can_ buy a
certain level of sequential IO, it may not be the most efficient way to
spend money.

2= Most RW IT professionals have far from an infinite budget.  Just traffic
on these lists shows how severe the typical cost constraints usually are.
OTOH, if you have an inifinite IT budget, care to help a few less fortunate
than yourself?  After all, a even a large constant substracted from infinity
is still infinity... ;-)

3= No matter how fast you can do IO, IO remains the most expensive
part of the performance equation.  The fastest and cheapest IO you can
do is _no_ IO.  As long as we trade cheaper RAM and even cheaoer CPU
operations for IO correctly, more space efficient data representations will
always be a Win because of this.

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

29 September 2005, 10:03:19

In the interest of efficiency and "not reinventing the wheel", does anyone know
where I can find C or C++ source code for a Btree variant with the following
properties:

A= Data elements (RIDs) are only stored in the leaves, Keys (actually
KeyPrefixes; see "D" below) and Node pointers are only stored in the internal
nodes of the Btree.

B= Element redistribution is done as an alternative to node splitting in overflow
conditions during Inserts whenever possible.

C= Variable length Keys are supported.

D= Node buffering with a reasonable replacement policy is supported.

E= Since we will know beforehand exactly how many RID's will be stored, we
will know apriori how much space will be needed for leaves, and will know the
worst case for how much space will be required for the Btree internal nodes
as well.  This implies that we may be able to use an array, rather than linked
list, implementation of the Btree.  Less pointer chasing at the expense of more
CPU calculations, but that's a trade-off in the correct direction.

Such source would be a big help in getting a prototype together.

Thanks in advance for any pointers or source,
Ron

Re: [PERFORM] A Better External Sort?

From

"Zeugswetter Andreas DAZ SD"

Date:

29 September 2005, 10:35:48

> In my original example, a sequential scan of the 1TB of 2KB
> or 4KB records, => 250M or 500M records of data, being sorted
> on a binary value key will take ~1000x more time than reading
> in the ~1GB Btree I described that used a Key+RID (plus node
> pointers) representation of the data.

Imho you seem to ignore the final step your algorithm needs of
collecting the
data rows. After you sorted the keys the collect step will effectively
access the
tuples in random order (given a sufficiently large key range).

This random access is bad. It effectively allows a competing algorithm
to read the
whole data at least 40 times sequentially, or write the set 20 times
sequentially.
(Those are the random/sequential ratios of modern discs)

Andreas

Re: [PERFORM] A Better External Sort?

From

Josh Berkus

Date:

29 September 2005, 13:54:11

Jeff, Ron,

First off, Jeff, please take it easy.  We're discussing 8.2 features at
this point and there's no reason to get stressed out at Ron.  You can
get plenty stressed out when 8.2 is near feature freeze.  ;-)


Regarding use cases for better sorts:

The biggest single area where I see PostgreSQL external sort sucking is
  on index creation on large tables.   For example, for free version of
TPCH, it takes only 1.5 hours to load a 60GB Lineitem table on OSDL's
hardware, but over 3 hours to create each index on that table.  This
means that over all our load into TPCH takes 4 times as long to create
the indexes as it did to bulk load the data.

Anyone restoring a large database from pg_dump is in the same situation.
  Even worse, if you have to create a new index on a large table on a
production database in use, because the I/O from the index creation
swamps everything.

Following an index creation, we see that 95% of the time required is the
external sort, which averages 2mb/s.  This is with seperate drives for
the WAL, the pg_tmp, the table and the index.  I've confirmed that
increasing work_mem beyond a small minimum (around 128mb) had no benefit
on the overall index creation speed.


--Josh Berkus

Re: [PERFORM] A Better External Sort?

From

"Luke Lonergan"

Date:

29 September 2005, 14:07:14

Josh,

On 9/29/05 9:54 AM, "Josh Berkus" <josh@agliodbs.com> wrote:

> Following an index creation, we see that 95% of the time required is the
> external sort, which averages 2mb/s.  This is with seperate drives for
> the WAL, the pg_tmp, the table and the index.  I've confirmed that
> increasing work_mem beyond a small minimum (around 128mb) had no benefit
> on the overall index creation speed.

Yuuuup!  That about sums it up - regardless of taking 1 or 2 passes through
the heap being sorted, 1.5 - 2 MB/s is the wrong number.  This is not
necessarily an algorithmic problem, but is a optimization problem with
Postgres that must be fixed before it can be competitive.

We read/write to/from disk at 240MB/s and so 2 passes would run at a net
rate of 120MB/s through the sort set if it were that efficient.

Anyone interested in tackling the real performance issue? (flame bait, but
for a worthy cause :-)

- Luke

Re: [PERFORM] A Better External Sort?

From

David Fetter

Date:

29 September 2005, 14:18:30

On Thu, Sep 29, 2005 at 10:06:52AM -0700, Luke Lonergan wrote:
> Josh,
>
> On 9/29/05 9:54 AM, "Josh Berkus" <josh@agliodbs.com> wrote:
>
> > Following an index creation, we see that 95% of the time required
> > is the external sort, which averages 2mb/s.  This is with seperate
> > drives for the WAL, the pg_tmp, the table and the index.  I've
> > confirmed that increasing work_mem beyond a small minimum (around
> > 128mb) had no benefit on the overall index creation speed.
>
> Yuuuup!  That about sums it up - regardless of taking 1 or 2 passes
> through the heap being sorted, 1.5 - 2 MB/s is the wrong number.
> This is not necessarily an algorithmic problem, but is a
> optimization problem with Postgres that must be fixed before it can
> be competitive.
>
> We read/write to/from disk at 240MB/s and so 2 passes would run at a
> net rate of 120MB/s through the sort set if it were that efficient.
>
> Anyone interested in tackling the real performance issue? (flame
> bait, but for a worthy cause :-)

I'm not sure that it's flamebait, but what do I know?  Apart from the
nasty number (1.5-2 MB/s), what other observations do you have to
hand?  Any ideas about what things are not performing here?  Parts of
the code that could bear extra scrutiny?  Ideas on how to fix same in
a cross-platform way?

Cheers,
D
--
David Fetter david@fetter.org http://fetter.org/
phone: +1 510 893 6100   mobile: +1 415 235 3778

Remember to vote!

Re: [PERFORM] A Better External Sort?

From

"Jeffrey W. Baker"

Date:

29 September 2005, 14:44:46

On Thu, 2005-09-29 at 10:06 -0700, Luke Lonergan wrote:
> Josh,
>
> On 9/29/05 9:54 AM, "Josh Berkus" <josh@agliodbs.com> wrote:
>
> > Following an index creation, we see that 95% of the time required is the
> > external sort, which averages 2mb/s.  This is with seperate drives for
> > the WAL, the pg_tmp, the table and the index.  I've confirmed that
> > increasing work_mem beyond a small minimum (around 128mb) had no benefit
> > on the overall index creation speed.
>
> Yuuuup!  That about sums it up - regardless of taking 1 or 2 passes through
> the heap being sorted, 1.5 - 2 MB/s is the wrong number.

Yeah this is really bad ... approximately the speed of GNU sort.

Josh, do you happen to know how many passes are needed in the multiphase
merge on your 60GB table?

Looking through tuplesort.c, I have a couple of initial ideas.  Are we
allowed to fork here?  That would open up the possibility of using the
CPU and the I/O in parallel.  I see that tuplesort.c also suffers from
the kind of postgresql-wide disease of calling all the way up and down a
big stack of software for each tuple individually.  Perhaps it could be
changed to work on vectors.

I think the largest speedup will be to dump the multiphase merge and
merge all tapes in one pass, no matter how large M.  Currently M is
capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
the tape.  It could be done in a single pass heap merge with N*log(M)
comparisons, and, more importantly, far less input and output.

I would also recommend using an external processes to asynchronously
feed the tuples into the heap during the merge.

What's the timeframe for 8.2?

-jwb

Re: [PERFORM] A Better External Sort?

From

Josh Berkus

Date:

29 September 2005, 14:59:43

Jeff,

> Josh, do you happen to know how many passes are needed in the multiphase
> merge on your 60GB table?

No, any idea how to test that?

> I think the largest speedup will be to dump the multiphase merge and
> merge all tapes in one pass, no matter how large M.  Currently M is
> capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> the tape.  It could be done in a single pass heap merge with N*log(M)
> comparisons, and, more importantly, far less input and output.

Yes, but the evidence suggests that we're actually not using the whole 1GB
of RAM ... maybe using only 32MB of it which would mean over 200 passes
(I'm not sure of the exact match).  Just fixing our algorithm so that it
used all of the work_mem permitted might improve things tremendously.

> I would also recommend using an external processes to asynchronously
> feed the tuples into the heap during the merge.
>
> What's the timeframe for 8.2?

Too far out to tell yet.  Probably 9mo to 1 year, that's been our history.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: [PERFORM] A Better External Sort?

From

"Jeffrey W. Baker"

Date:

29 September 2005, 15:16:04

On Thu, 2005-09-29 at 11:03 -0700, Josh Berkus wrote:
> Jeff,
> 
> > Josh, do you happen to know how many passes are needed in the multiphase
> > merge on your 60GB table?
> 
> No, any idea how to test that?

I would just run it under the profiler and see how many times
beginmerge() is called.

-jwb

Re: [PERFORM] A Better External Sort?

From

Josh Berkus

Date:

29 September 2005, 15:24:12

Jeff,

> I would just run it under the profiler and see how many times
> beginmerge() is called.

Hmm, I'm not seeing it at all in the oprofile results on a 100million-row 
sort.

-- 
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: [PERFORM] A Better External Sort?

From

"Dann Corbit"

Date:

29 September 2005, 15:32:19

If I were to be nosy and poke around in this, what patches of code would
I be interested in?

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Josh Berkus
> Sent: Thursday, September 29, 2005 11:28 AM
> To: pgsql-hackers@postgresql.org
> Cc: Jeffrey W. Baker
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
> Jeff,
>
> > I would just run it under the profiler and see how many times
> > beginmerge() is called.
>
> Hmm, I'm not seeing it at all in the oprofile results on a
100million-row
> sort.
>
> --
> --Josh
>
> Josh Berkus
> Aglio Database Solutions
> San Francisco
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 4: Have you searched our list archives?
>
>                http://archives.postgresql.org

Re: [PERFORM] A Better External Sort?

From

"Luke Lonergan"

Date:

30 September 2005, 01:22:53

Jeff,

On 9/29/05 10:44 AM, "Jeffrey W. Baker" <jwbaker@acm.org> wrote:

> On Thu, 2005-09-29 at 10:06 -0700, Luke Lonergan wrote:
> Looking through tuplesort.c, I have a couple of initial ideas.  Are we
> allowed to fork here?  That would open up the possibility of using the
> CPU and the I/O in parallel.  I see that tuplesort.c also suffers from
> the kind of postgresql-wide disease of calling all the way up and down a
> big stack of software for each tuple individually.  Perhaps it could be
> changed to work on vectors.

Yes!

> I think the largest speedup will be to dump the multiphase merge and
> merge all tapes in one pass, no matter how large M.  Currently M is
> capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> the tape.  It could be done in a single pass heap merge with N*log(M)
> comparisons, and, more importantly, far less input and output.

Yes again, see above.

> I would also recommend using an external processes to asynchronously
> feed the tuples into the heap during the merge.

Simon Riggs is working this idea a bit - it's slightly less interesting to
us because we already have a multiprocessing executor.  Our problem is that
4 x slow is still far too slow.

> What's the timeframe for 8.2?

Let's test it out in Bizgres!

- Luke

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

30 September 2005, 10:03:32

>From: Pailloncy Jean-Gerard <jg@rilk.com>
>Sent: Sep 29, 2005 7:11 AM
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>>>Jeff Baker:
>>>Your main example seems to focus on a large table where a key
>>>column has constrained values.  This case is interesting in
>>>proportion to the number of possible values.  If I have billions
>>>of rows, each having one of only two values, I can think of a
>>>trivial and very fast method of returning the table "sorted" by
>>>that key: make two sequential passes, returning the first value
>>>on the first pass and the second value on the second pass.
>>> This will be faster than the method you propose.
>>
>>Ron Peacetree:
>>1= No that was not my main example.  It was the simplest example
>>used to frame the later more complicated examples.  Please don't
>>get hung up on it.
>>
>>2= You are incorrect.  Since IO is the most expensive operation we
>>can do, any method that makes two passes through the data at top
>>scanning speed will take at least 2x as long as any method that only
>>takes one such pass.
>
>You do not get the point.
>As the time you get the sorted references to the tuples, you need to
>fetch the tuples themself, check their visbility, etc. and returns
>them to the client.
>
As PFC correctly points out elsewhere in this thread, =maybe= you
have to do all that.  The vast majority of the time people are not
going to want to look at a detailed record by record output of that
much data.

The most common usage is to calculate or summarize some quality
or quantity of the data and display that instead or to use the tuples
or some quality of the tuples found as an intermediate step in a
longer query process such as a join.

Sometimes there's a need to see _some_ of the detailed records; a
random sample or a region in a random part of the table or etc.
It's rare that there is a RW need to actually list every record in a
table of significant size.

On the rare occasions where one does have to return or display all
records in such large table, network IO and/or display IO speeds
are the primary performance bottleneck.  Not HD IO.

Nonetheless, if there _is_ such a need, there's nothing stopping us
from rearranging the records in RAM into sorted order in one pass
through RAM (using at most space for one extra record) after
constructing the cache conscious Btree index.  Then the sorted
records can be written to HD in RAM buffer sized chunks very
efficiently.
Repeating this process until we have stepped through the entire
data set will take no more HD IO than one HD scan of the data
and leave us with a permanent result that can be reused for
multiple purposes.  If the sorted records are written in large
enough chunks, rereading them at any later time can be done
at maximum HD throughput

In a total of two HD scans (one to read the original data, one
to write out the sorted data) we can make a permanent
rearrangement of the data.  We've essentially created a
cluster index version of the data.


>So, if there is only 2 values in the column of big table that is larger
>than available RAM, two seq scans of the table without any sorting
>is the fastest solution.
>
If you only need to do this once, yes this wins.  OTOH, if you have
to do this sort even twice, my method is better.

regards,
Ron

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

30 September 2005, 10:08:05

>From: Zeugswetter Andreas DAZ SD <ZeugswetterA@spardat.at>
>Sent: Sep 29, 2005 9:28 AM
>Subject: RE: [HACKERS] [PERFORM] A Better External Sort?
>
>>In my original example, a sequential scan of the 1TB of 2KB
>>or 4KB records, => 250M or 500M records of data, being sorted
>>on a binary value key will take ~1000x more time than reading
>>in the ~1GB Btree I described that used a Key+RID (plus node
>>pointers) representation of the data.
>
>Imho you seem to ignore the final step your algorithm needs of
>collecting the data rows. After you sorted the keys the collect
>step will effectively access the tuples in random order (given a
>sufficiently large key range).
>
"Collecting" the data rows can be done for each RAM buffer full of
of data in one pass through RAM after we've built the Btree.  Then
if desired those data rows can be read out to HD in sorted order
in essentially one streaming burst.  This combination of index build
+ RAM buffer rearrangement + write results to HD can be repeat
as often as needed until we end up with an overall Btree index and
a set of sorted sublists on HD.  Overall HD IO for the process is only
two effectively sequential passes through the data.

Subsequent retrieval of the sorted information from HD can be
done at full HD streaming speed and whatever we've decided to
save to HD can be reused later if we desire.

Hope this helps,
Ron

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

30 September 2005, 10:09:51

>From: Josh Berkus <josh@agliodbs.com>
>Sent: Sep 29, 2005 12:54 PM
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>The biggest single area where I see PostgreSQL external
>sort sucking is on index creation on large tables.   For
>example, for free version of TPCH, it takes only 1.5 hours to
>load a 60GB Lineitem table on OSDL's hardware, but over 3
>hours to create each index on that table.  This means that
>over all our load into TPCH takes 4 times as long to create
>the indexes as it did to bulk load the data.
>
Hmmm.
60GB/5400secs= 11MBps.  That's ssllooww.  So the first
problem is evidently our physical layout and/or HD IO layer
sucks.

Creating the table and then creating the indexes on the table
is going to require more physical IO than if we created the
table and the indexes concurrently in chunks and then
combined the indexes on the chunks into the overall indexes
for the whole table, so there's a potential speed-up.

The method I've been talking about is basically a recipe for
creating indexes as fast as possible with as few IO operations,
HD or RAM, as possible and nearly no random ones, so it
could help as well.

OTOH, HD IO rate is the fundamental performance metric.
As long as our HD IO rate is pessimal, so will the performance
of everything else be.   Why can't we load a table at closer to
the peak IO rate of the HDs?


>Anyone restoring a large database from pg_dump is in the
>same situation.  Even worse, if you have to create a new
>index on a large table on a production database in use,
>because the I/O from the index creation swamps everything.
>
Fix for this in the works ;-)


>Following an index creation, we see that 95% of the time
>required is the external sort, which averages 2mb/s.
>
Assuming decent HD HW, this is HORRIBLE.

What's kind of instrumenting and profiling has been done of
the code involved?


>This is with seperate drives for the WAL, the pg_tmp, the table
>and the index.  I've confirmed that increasing work_mem
>beyond a small minimum (around 128mb) had no benefit on
>the overall index creation speed.
>
No surprise.  The process is severely limited by the abyssmally
slow HD IO.

Ron

Re: [PERFORM] A Better External Sort?

From

PFC

Date:

30 September 2005, 10:13:29

    Just to add a little anarchy in your nice debate...

    Who really needs all the results of a sort on your terabyte table ?

    I guess not many people do a SELECT from such a table and want all the
results. So, this leaves :
    - Really wanting all the results, to fetch using a cursor,
    - CLUSTER type things, where you really want everything in order,
    - Aggregates (Sort->GroupAggregate), which might really need to sort the
whole table.
    - Complex queries where the whole dataset needs to be examined, in order
to return a few values
    - Joins (again, the whole table is probably not going to be selected)
    - And the ones I forgot.

    However,

    Most likely you only want to SELECT N rows, in some ordering :
    - the first N (ORDER BY x LIMIT N)
    - last N (ORDER BY x DESC LIMIT N)
    - WHERE x>value ORDER BY x LIMIT N
    - WHERE x<value ORDER BY x DESC LIMIT N
    - and other variants

    Or, you are doing a Merge JOIN against some other table ; in that case,
yes, you might need the whole sorted terabyte table, but most likely there
are WHERE clauses in the query that restrict the set, and thus, maybe we
can get some conditions or limit values on the column to sort.

    Also the new, optimized hash join, which is more memory efficient, might
cover this case.

    Point is, sometimes, you only need part of the results of your sort. And
the bigger the sort, the most likely it becomes that you only want part of
the results. So, while we're in the fun hand-waving, new algorithm trying
mode, why not consider this right from the start ? (I know I'm totally in
hand-waving mode right now, so slap me if needed).

    I'd say your new, fancy sort algorithm needs a few more input values :

    - Range of values that must appear in the final result of the sort :
        none, minimum, maximum, both, or even a set of values from the other
side of the join, hashed, or sorted.
    - LIMIT information (first N, last N, none)
    - Enhanced Limit information (first/last N values of the second column to
sort, for each value of the first column) (the infamous "top10 by
category" query)
    - etc.

    With this, the amount of data that needs to be kept in memory is
dramatically reduced, from the whole table (even using your compressed
keys, that's big) to something more manageable which will be closer to the
size of the final result set which will be returned to the client, and
avoid a lot of effort.

    So, this would not be useful in all cases, but when it applies, it would
be really useful.

    Regards !

Re: [PERFORM] A Better External Sort?

From

Josh Berkus

Date:

30 September 2005, 14:23:35

Ron,

> Hmmm.
> 60GB/5400secs= 11MBps.  That's ssllooww.  So the first
> problem is evidently our physical layout and/or HD IO layer
> sucks.

Actually, it's much worse than that, because the sort is only dealing
with one column.  As I said, monitoring the iostat our top speed was
2.2mb/s.

--Josh

Re: [PERFORM] A Better External Sort?

From

"Dann Corbit"

Date:

30 September 2005, 14:44:41

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of PFC
> Sent: Thursday, September 29, 2005 9:10 AM
> To: rjpeace@earthlink.net
> Cc: Pg Hackers; pgsql-performance@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>
>     Just to add a little anarchy in your nice debate...
>
>     Who really needs all the results of a sort on your terabyte
table ?

Reports with ORDER BY/GROUP BY, and many other possibilities.  40% of
mainframe CPU cycles are spent sorting.  That is because huge volumes of
data require lots of energy to be meaningfully categorized.  Let's
suppose that instead of a terabyte of data (or a petabyte or whatever)
we have 10% of it.  That's still a lot of data.

>     I guess not many people do a SELECT from such a table and want
all
> the
> results.

What happens when they do?  The cases where it is already fast are not
very important.  The cases where things go into the crapper are the ones
that need attention.

>So, this leaves :
>     - Really wanting all the results, to fetch using a cursor,
>     - CLUSTER type things, where you really want everything in
order,
>     - Aggregates (Sort->GroupAggregate), which might really need to
sort
> the
> whole table.
>     - Complex queries where the whole dataset needs to be examined,
in
> order
> to return a few values
>     - Joins (again, the whole table is probably not going to be
> selected)
>     - And the ones I forgot.
>
>     However,
>
>     Most likely you only want to SELECT N rows, in some ordering :
>     - the first N (ORDER BY x LIMIT N)
>     - last N (ORDER BY x DESC LIMIT N)

For these, the QuickSelect algorithm is what is wanted.  For example:
#include <stdlib.h>
typedef double  Etype;

extern Etype    RandomSelect(Etype * A, size_t p, size_t r, size_t i);
extern size_t   RandRange(size_t a, size_t b);
extern size_t   RandomPartition(Etype * A, size_t p, size_t r);
extern size_t   Partition(Etype * A, size_t p, size_t r);

/*
**
** In the following code, every reference to CLR means:
**
**    "Introduction to Algorithms"
**    By Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest
**    ISBN 0-07-013143-0
*/


/*
** CLR, page 187
*/
Etype           RandomSelect(Etype A[], size_t p, size_t r, size_t i)
{
    size_t          q,
                    k;
    if (p == r)
        return A[p];
    q = RandomPartition(A, p, r);
    k = q - p + 1;

    if (i <= k)
        return RandomSelect(A, p, q, i);
    else
        return RandomSelect(A, q + 1, r, i - k);
}

size_t          RandRange(size_t a, size_t b)
{
    size_t          c = (size_t) ((double) rand() / ((double) RAND_MAX +
1) * (b - a));
    return c + a;
}

/*
** CLR, page 162
*/
size_t          RandomPartition(Etype A[], size_t p, size_t r)
{
    size_t          i = RandRange(p, r);
    Etype           Temp;
    Temp = A[p];
    A[p] = A[i];
    A[i] = Temp;
    return Partition(A, p, r);
}

/*
** CLR, page 154
*/
size_t          Partition(Etype A[], size_t p, size_t r)
{
    Etype           x,
                    temp;
    size_t          i,
                    j;

    x = A[p];
    i = p - 1;
    j = r + 1;

    for (;;) {
        do {
            j--;
        } while (!(A[j] <= x));
        do {
            i++;
        } while (!(A[i] >= x));
        if (i < j) {
            temp = A[i];
            A[i] = A[j];
            A[j] = temp;
        } else
            return j;
    }
}

>     - WHERE x>value ORDER BY x LIMIT N
>     - WHERE x<value ORDER BY x DESC LIMIT N
>     - and other variants
>
>     Or, you are doing a Merge JOIN against some other table ; in
that
> case,
> yes, you might need the whole sorted terabyte table, but most likely
there
> are WHERE clauses in the query that restrict the set, and thus, maybe
we
> can get some conditions or limit values on the column to sort.

Where clause filters are to be applied AFTER the join operations,
according to the SQL standard.

>     Also the new, optimized hash join, which is more memory
efficient,
> might
> cover this case.

For == joins.  Not every order by is applied to joins.  And not every
join is an equal join.

>     Point is, sometimes, you only need part of the results of your
sort.
> And
> the bigger the sort, the most likely it becomes that you only want
part of
> the results.

That is an assumption that will sometimes be true, and sometimes not.
It is not possible to predict usage patterns for a general purpose
database system.

> So, while we're in the fun hand-waving, new algorithm trying
> mode, why not consider this right from the start ? (I know I'm totally
in
> hand-waving mode right now, so slap me if needed).
>
>     I'd say your new, fancy sort algorithm needs a few more input
values
> :
>
>     - Range of values that must appear in the final result of the
sort :
>         none, minimum, maximum, both, or even a set of values
from the
> other
> side of the join, hashed, or sorted.

That will already happen (or it certainly ought to)

>     - LIMIT information (first N, last N, none)

That will already happen (or it certainly ought to -- I would be pretty
surprised if it does not happen)

>     - Enhanced Limit information (first/last N values of the second
> column to
> sort, for each value of the first column) (the infamous "top10 by
> category" query)
>     - etc.

All the filters will (at some point) be applied to the data unless they
cannot be applied to the data by formal rule.

>     With this, the amount of data that needs to be kept in memory is
> dramatically reduced, from the whole table (even using your compressed
> keys, that's big) to something more manageable which will be closer to
the
> size of the final result set which will be returned to the client, and
> avoid a lot of effort.

Sorting the minimal set is a good idea.  Sometimes there is a big
savings there. I would be pretty surprised if a large fraction of data
that does not have to be included is actually processed during the
sorts.

>     So, this would not be useful in all cases, but when it applies,
it
> would
> be really useful.

No argument there.  And if an algorithm is being reworked, it is a good
idea to look at things like filtering to see if all filtering that is
allowed by the language standard before the sort takes place is applied.

Re: [PERFORM] A Better External Sort?

From

Josh Berkus

Date:

30 September 2005, 17:37:39

Ron,

> That 11MBps was your =bulk load= speed.  If just loading a table
> is this slow, then there are issues with basic physical IO, not just
> IO during sort operations.

Oh, yeah.  Well, that's separate from sort.  See multiple posts on this
list from the GreenPlum team, the COPY patch for 8.1, etc.  We've been
concerned about I/O for a while.

Realistically, you can't do better than about 25MB/s on a single-threaded
I/O on current Linux machines, because your bottleneck isn't the actual
disk I/O.   It's CPU.   Databases which "go faster" than this are all, to
my knowledge, using multi-threaded disk I/O.

(and I'd be thrilled to get a consistent 25mb/s on PostgreSQL, but that's
another thread ... )

> As I said, the obvious candidates are inefficient physical layout
> and/or flawed IO code.

Yeah, that's what I thought too.   But try sorting an 10GB table, and
you'll see: disk I/O is practically idle, while CPU averages 90%+.   We're
CPU-bound, because sort is being really inefficient about something. I
just don't know what yet.

If we move that CPU-binding to a higher level of performance, then we can
start looking at things like async I/O, O_Direct, pre-allocation etc. that
will give us incremental improvements.   But what we need now is a 5-10x
improvement and that's somewhere in the algorithms or the code.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: [PERFORM] A Better External Sort?

From

"Luke Lonergan"

Date:

30 September 2005, 17:39:21

Ron,

On 9/30/05 1:20 PM, "Ron Peacetree" <rjpeace@earthlink.net> wrote:

> That 11MBps was your =bulk load= speed.  If just loading a table
> is this slow, then there are issues with basic physical IO, not just
> IO during sort operations.

Bulk loading speed is irrelevant here - that is dominated by parsing, which
we have covered copiously (har har) previously and have sped up by 500%,
which still makes Postgres < 1/2 the loading speed of MySQL.

> As I said, the obvious candidates are inefficient physical layout
> and/or flawed IO code.

Yes.

> Until the basic IO issues are addressed, we could replace the
> present sorting code with infinitely fast sorting code and we'd
> still be scrod performance wise.

Postgres' I/O path has many problems that must be micro-optimized away.  Too
small of an operand size compared to disk caches, memory, etc etc are the
common problem.  Another is lack of micro-parallelism (loops) with long
enough runs to let modern processors pipeline and superscale.

The net problem here is that a simple "select blah from blee order
by(blah.a);" runs at 1/100 of the sequential scan rate.

- Luke

Re: [PERFORM] A Better External Sort?

From

"Dann Corbit"

Date:

30 September 2005, 18:31:41

I see the following routines that seem to be related to sorting.

If I were to examine these routines to consider ways to improve it, what
routines should I key in on?  I am guessing that tuplesort.c is the hub
of activity for database sorting.

Directory of U:\postgresql-snapshot\src\backend\access\nbtree

08/11/2005  06:22 AM            24,968 nbtsort.c
               1 File(s)         24,968 bytes

 Directory of U:\postgresql-snapshot\src\backend\executor

03/16/2005  01:38 PM             7,418 nodeSort.c
               1 File(s)          7,418 bytes

 Directory of U:\postgresql-snapshot\src\backend\utils\sort

09/23/2005  08:36 AM            67,585 tuplesort.c
               1 File(s)         67,585 bytes

 Directory of U:\postgresql-snapshot\src\bin\pg_dump

06/29/2005  08:03 PM            31,620 pg_dump_sort.c
               1 File(s)         31,620 bytes

 Directory of U:\postgresql-snapshot\src\port

07/27/2005  09:03 PM             5,077 qsort.c
               1 File(s)          5,077 bytes

Re: [PERFORM] A Better External Sort?

From

PFC

Date:

30 September 2005, 20:02:39


> Bulk loading speed is irrelevant here - that is dominated by parsing,
> which
> we have covered copiously (har har) previously and have sped up by 500%,
> which still makes Postgres < 1/2 the loading speed of MySQL.

    Let's ask MySQL 4.0

> LOAD DATA INFILE blah
0 errors, 666 warnings
> SHOW WARNINGS;
not implemented. upgrade to 4.1

duhhhhhhhhhhhhhhhhhhhhh

Re: [PERFORM] A Better External Sort?

From

"Jignesh K. Shah"

Date:

30 September 2005, 20:10:25

I have seen similar performance as Josh and my reasoning is as follows:

* WAL is the biggest bottleneck with its default size of 16MB. Many
people hate to recompile the code   to change its default, and
increasing checkpoint segments help but still there is lot of overhead
in the rotation of WAL files (Even putting WAL on tmpfs shows that it is
still slow). Having an option for bigger size is helpful to a small
extent percentagewise (and frees up CPU a bit in doing file rotation)

* Growing files: Even though this is OS dependent but it does spend lot
of time doing small 8K block increases to grow files. If we can signal
bigger chunks to grow or "pre-grow"  to expected size of  data files
that will help a lot in such cases.

* COPY command had restriction but that has been fixed to a large
extent.(Great job)

But ofcourse I have lost touch with programming and can't begin to
understand PostgreSQL code to change it myself.

Regards,
Jignesh

Ron Peacetree wrote:

>That 11MBps was your =bulk load= speed.  If just loading a table
>is this slow, then there are issues with basic physical IO, not just
>IO during sort operations.
>
>As I said, the obvious candidates are inefficient physical layout
>and/or flawed IO code.
>
>Until the basic IO issues are addressed, we could replace the
>present sorting code with infinitely fast sorting code and we'd
>still be scrod performance wise.
>
>So why does basic IO suck so badly?
>
>Ron
>
>
>-----Original Message-----
>From: Josh Berkus <josh@agliodbs.com>
>Sent: Sep 30, 2005 1:23 PM
>To: Ron Peacetree <rjpeace@earthlink.net>
>Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
>Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
>Ron,
>
>
>
>>Hmmm.
>>60GB/5400secs= 11MBps.  That's ssllooww.  So the first
>>problem is evidently our physical layout and/or HD IO layer
>>sucks.
>>
>>
>
>Actually, it's much worse than that, because the sort is only dealing
>with one column.  As I said, monitoring the iostat our top speed was
>2.2mb/s.
>
>--Josh
>
>
>---------------------------(end of broadcast)---------------------------
>TIP 1: if posting/reading through Usenet, please send an appropriate
>       subscribe-nomail command to majordomo@postgresql.org so that your
>       message can get through to the mailing list cleanly
>
>

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

30 September 2005, 20:10:29

That 11MBps was your =bulk load= speed.  If just loading a table
is this slow, then there are issues with basic physical IO, not just
IO during sort operations.

As I said, the obvious candidates are inefficient physical layout
and/or flawed IO code.

Until the basic IO issues are addressed, we could replace the
present sorting code with infinitely fast sorting code and we'd
still be scrod performance wise.

So why does basic IO suck so badly?

Ron


-----Original Message-----
From: Josh Berkus <josh@agliodbs.com>
Sent: Sep 30, 2005 1:23 PM
To: Ron Peacetree <rjpeace@earthlink.net>
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Ron,

> Hmmm.
> 60GB/5400secs= 11MBps.  That's ssllooww.  So the first
> problem is evidently our physical layout and/or HD IO layer
> sucks.

Actually, it's much worse than that, because the sort is only dealing
with one column.  As I said, monitoring the iostat our top speed was
2.2mb/s.

--Josh

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

30 September 2005, 20:40:34

25MBps should not be a CPU bound limit for IO, nor should it be
an OS limit.  It should be something ~100x (Single channel RAM)
to ~200x (dual channel RAM) that.

For an IO rate of 25MBps to be pegging the CPU at 100%, the CPU
is suffering some combination of
A= lot's of cache misses ("cache thrash"),
B= lot's of random rather than sequential IO (like pointer chasing)
C= lot's of wasteful copying
D= lot's of wasteful calculations

In fact, this is crappy enough performance that the whole IO layer
should be rethought and perhaps reimplemented from scratch.
Optimization of the present code is unlikely to yield a 100-200x
improvement.

On the HD side, the first thing that comes to mind is that DBs are
-NOT- like ordinary filesystems in a few ways:
1= the minimum HD IO is a record that is likely to be larger than
a HD sector.  Therefore, the FS we use should be laid out with
physical segments of max(HD sector size, record size)

2= DB files (tables) are usually considerably larger than any other
kind of files stored.  Therefore the FS we should use should be laid
out using LARGE physical pages.  64KB-256KB at a _minimum_.

3= The whole "2GB striping" of files idea needs to be rethought.
Our tables are significantly different in internal structure from the
usual FS entity.

4= I'm sure we are paying all sorts of nasty overhead for essentially
emulating the pg "filesystem" inside another filesystem.  That means
~2x as much overhead to access a particular piece of data.

The simplest solution is for us to implement a new VFS compatible
filesystem tuned to exactly our needs: pgfs.

We may be able to avoid that by some amount of hacking or
modifying of the current FSs we use, but I suspect it would be more
work for less ROI.

Ron


-----Original Message-----
From: Josh Berkus <josh@agliodbs.com>
Sent: Sep 30, 2005 4:41 PM
To: Ron Peacetree <rjpeace@earthlink.net>
Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

Ron,

> That 11MBps was your =bulk load= speed.  If just loading a table
> is this slow, then there are issues with basic physical IO, not just
> IO during sort operations.

Oh, yeah.  Well, that's separate from sort.  See multiple posts on this
list from the GreenPlum team, the COPY patch for 8.1, etc.  We've been
concerned about I/O for a while.

Realistically, you can't do better than about 25MB/s on a single-threaded
I/O on current Linux machines, because your bottleneck isn't the actual
disk I/O.   It's CPU.   Databases which "go faster" than this are all, to
my knowledge, using multi-threaded disk I/O.

(and I'd be thrilled to get a consistent 25mb/s on PostgreSQL, but that's
another thread ... )

> As I said, the obvious candidates are inefficient physical layout
> and/or flawed IO code.

Yeah, that's what I thought too.   But try sorting an 10GB table, and
you'll see: disk I/O is practically idle, while CPU averages 90%+.   We're
CPU-bound, because sort is being really inefficient about something. I
just don't know what yet.

If we move that CPU-binding to a higher level of performance, then we can
start looking at things like async I/O, O_Direct, pre-allocation etc. that
will give us incremental improvements.   But what we need now is a 5-10x
improvement and that's somewhere in the algorithms or the code.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: [PERFORM] A Better External Sort?

From

"Dann Corbit"

Date:

30 September 2005, 20:52:46

I have perused the tuple sort stuff.

The good:
The documentation of the sort algorithm from Knuth's TAOCP was
beautifully done.  Everyone who writes an algorithm should credit the
original source like this, and also where it deviates.  That was done
very nicely.

The bad:
With random access, tape style merging is not necessary.  A priority
queue based merge will be a lot faster.

The UGLY:
Please, someone, tell me I was hallucinating.  Is that code really
READING AND WRITING THE WHOLE TUPLE with every sort exchange?!  Maybe
there is a layer of abstraction that I am somehow missing.  I just can't
imagine that is what it is really doing.  If (somehow) it really is
doing that, a pointer based sort which forms a permutation based upon
the keys, would be a lot better.

The fundamental algorithm itself could also be improved somewhat.

Here is a {public domain} outline for an introspective quick sort that
Pete Filander and I wrote some time ago and contributed to FastDB.  It
is written as a C++ template, but it will take no effort to make it a
simple C routine.  It assumes that e_type has comparison operators, so
in C you would use a compare function instead.

/*
** Sorting stuff by Dann Corbit and Pete Filandr.
** (dcorbit@connx.com and pfilandr@mindspring.com)
** Use it however you like.
*/

//
//  The insertion sort template is used for small partitions.
//

template < class e_type >
void insertion_sort(e_type * array, size_t nmemb)
{
    e_type temp,
          *last,
          *first,
          *middle;
    if (nmemb > 1) {
        first = middle = 1 + array;
        last = nmemb - 1 + array;
        while (first != last) {
            ++first;
            if ((*(middle) > *(first))) {
                middle = first;
            }
        }
        if ((*(array) > *(middle))) {
            ((void) ((temp) = *(array), *(array) = *(middle), *(middle)
= (temp)));
        }
        ++array;
        while (array != last) {
            first = array++;
            if ((*(first) > *(array))) {
                middle = array;
                temp = *middle;
                do {
                    *middle-- = *first--;
                } while ((*(first) > *(&temp)));
                *middle = temp;
            }
        }
    }
}

//
// The median estimate is used to choose pivots for the quicksort
algorithm
//

template < class e_type >
void median_estimate(e_type * array, size_t n)
{
    e_type          temp;
    long unsigned   lu_seed = 123456789LU;
    const size_t    k = ((lu_seed) = 69069 * (lu_seed) + 362437) % --n;
    ((void) ((temp) = *(array), *(array) = *(array + k), *(array + k) =
(temp)));
    if ((*((array + 1)) > *((array)))) {
        (temp) = *(array + 1);
        if ((*((array + n)) > *((array)))) {
            *(array + 1) = *(array);
            if ((*(&(temp)) > *((array + n)))) {
                *(array) = *(array + n);
                *(array + n) = (temp);
            } else {
                *(array) = (temp);
            }
        } else {
            *(array + 1) = *(array + n);
            *(array + n) = (temp);
        }
    } else {
        if ((*((array)) > *((array + n)))) {
            if ((*((array + 1)) > *((array + n)))) {
                (temp) = *(array + 1);
                *(array + 1) = *(array + n);
                *(array + n) = *(array);
                *(array) = (temp);
            } else {
                ((void) (((temp)) = *((array)), *((array)) = *((array +
n)), *((array + n)) = ((temp))));
            }
        }
    }
}


//
// This is the heart of the quick sort algorithm used here.
// If the sort is going quadratic, we switch to heap sort.
// If the partition is small, we switch to insertion sort.
//

template < class e_type >
void qloop(e_type * array, size_t nmemb, size_t d)
{
    e_type temp,
          *first,
          *last;
    while (nmemb > 50) {
        if (sorted(array, nmemb)) {
            return;
        }
        if (!d--) {
            heapsort(array, nmemb);
            return;
        }
        median_estimate(array, nmemb);
        first = 1 + array;
        last = nmemb - 1 + array;
        do {
            ++first;
        } while ((*(array) > *(first)));
        do {
            --last;
        } while ((*(last) > *(array)));
        while (last > first) {
            ((void) ((temp) = *(last), *(last) = *(first), *(first) =
(temp)));
            do {
                ++first;
            } while ((*(array) > *(first)));
            do {
                --last;
            } while ((*(last) > *(array)));
        }
        ((void) ((temp) = *(array), *(array) = *(last), *(last) =
(temp)));
        qloop(last + 1, nmemb - 1 + array - last, d);
        nmemb = last - array;
    }
    insertion_sort(array, nmemb);
}

//
// This heap sort is better than average because it uses Lamont's heap.
//

template < class e_type >
void heapsort(e_type * array, size_t nmemb)
{
    size_t i,
           child,
           parent;
    e_type temp;
    if (nmemb > 1) {
        i = --nmemb / 2;
        do {
            {
                (parent) = (i);
                (temp) = (array)[(parent)];
                (child) = (parent) * 2;
                while ((nmemb) > (child)) {
                    if ((*((array) + (child) + 1) > *((array) +
(child)))) {
                        ++(child);
                    }
                    if ((*((array) + (child)) > *(&(temp)))) {
                        (array)[(parent)] = (array)[(child)];
                        (parent) = (child);
                        (child) *= 2;
                    } else {
                        --(child);
                        break;
                    }
                }
                if ((nmemb) == (child) && (*((array) + (child)) >
*(&(temp)))) {
                    (array)[(parent)] = (array)[(child)];
                    (parent) = (child);
                }
                (array)[(parent)] = (temp);
            }
        } while (i--);
        ((void) ((temp) = *(array), *(array) = *(array + nmemb), *(array
+ nmemb) = (temp)));
        for (--nmemb; nmemb; --nmemb) {
            {
                (parent) = (0);
                (temp) = (array)[(parent)];
                (child) = (parent) * 2;
                while ((nmemb) > (child)) {
                    if ((*((array) + (child) + 1) > *((array) +
(child)))) {
                        ++(child);
                    }
                    if ((*((array) + (child)) > *(&(temp)))) {
                        (array)[(parent)] = (array)[(child)];
                        (parent) = (child);
                        (child) *= 2;
                    } else {
                        --(child);
                        break;
                    }
                }
                if ((nmemb) == (child) && (*((array) + (child)) >
*(&(temp)))) {
                    (array)[(parent)] = (array)[(child)];
                    (parent) = (child);
                }
                (array)[(parent)] = (temp);
            }
            ((void) ((temp) = *(array), *(array) = *(array + nmemb),
*(array + nmemb) = (temp)));
        }
    }
}

//
// We use this to check to see if a partition is already sorted.
//

template < class e_type >
int sorted(e_type * array, size_t nmemb)
{
    for (--nmemb; nmemb; --nmemb) {
        if ((*(array) > *(array + 1))) {
            return 0;
        }
        ++array;
    }
    return 1;
}

//
// We use this to check to see if a partition is already reverse-sorted.
//

template < class e_type >
int             rev_sorted(e_type * array, size_t nmemb)
{
    for (--nmemb; nmemb; --nmemb) {
        if ((*(array + 1) > *(array))) {
            return 0;
        }
        ++array;
    }
    return 1;
}

//
// We use this to reverse a reverse-sorted partition.
//

template < class e_type >
void rev_array(e_type * array, size_t nmemb)
{
    e_type          temp,
        *end;
    for (end = array + nmemb - 1; end > array; ++array) {
        ((void) ((temp) = *(array), *(array) = *(end), *(end) =
(temp)));
        --end;
    }
}

//
// Introspective quick sort algorithm user entry point.
// You do not need to directly call any other sorting template.
// This sort will perform very well under all circumstances.
//

template < class e_type >
void iqsort(e_type * array, size_t nmemb)
{
    size_t d,
           n;
    if (nmemb > 1 && !sorted(array, nmemb)) {
        if (!rev_sorted(array, nmemb)) {
            n = nmemb / 4;
            d = 2;
            while (n) {
                ++d;
                n /= 2;
            }
            qloop(array, nmemb, 2 * d);
        } else {
            rev_array(array, nmemb);
        }
    }
}

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Jignesh K. Shah
> Sent: Friday, September 30, 2005 1:38 PM
> To: Ron Peacetree
> Cc: Josh Berkus; pgsql-hackers@postgresql.org; pgsql-
> performance@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
> I have seen similar performance as Josh and my reasoning is as
follows:
>
> * WAL is the biggest bottleneck with its default size of 16MB. Many
> people hate to recompile the code   to change its default, and
> increasing checkpoint segments help but still there is lot of overhead
> in the rotation of WAL files (Even putting WAL on tmpfs shows that it
is
> still slow). Having an option for bigger size is helpful to a small
> extent percentagewise (and frees up CPU a bit in doing file rotation)
>
> * Growing files: Even though this is OS dependent but it does spend
lot
> of time doing small 8K block increases to grow files. If we can signal
> bigger chunks to grow or "pre-grow"  to expected size of  data files
> that will help a lot in such cases.
>
> * COPY command had restriction but that has been fixed to a large
> extent.(Great job)
>
> But ofcourse I have lost touch with programming and can't begin to
> understand PostgreSQL code to change it myself.
>
> Regards,
> Jignesh
>
>
>
>
> Ron Peacetree wrote:
>
> >That 11MBps was your =bulk load= speed.  If just loading a table
> >is this slow, then there are issues with basic physical IO, not just
> >IO during sort operations.
> >
> >As I said, the obvious candidates are inefficient physical layout
> >and/or flawed IO code.
> >
> >Until the basic IO issues are addressed, we could replace the
> >present sorting code with infinitely fast sorting code and we'd
> >still be scrod performance wise.
> >
> >So why does basic IO suck so badly?
> >
> >Ron
> >
> >
> >-----Original Message-----
> >From: Josh Berkus <josh@agliodbs.com>
> >Sent: Sep 30, 2005 1:23 PM
> >To: Ron Peacetree <rjpeace@earthlink.net>
> >Cc: pgsql-hackers@postgresql.org, pgsql-performance@postgresql.org
> >Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
> >
> >Ron,
> >
> >
> >
> >>Hmmm.
> >>60GB/5400secs= 11MBps.  That's ssllooww.  So the first
> >>problem is evidently our physical layout and/or HD IO layer
> >>sucks.
> >>
> >>
> >
> >Actually, it's much worse than that, because the sort is only dealing
> >with one column.  As I said, monitoring the iostat our top speed was
> >2.2mb/s.
> >
> >--Josh
> >
> >
> >---------------------------(end of
broadcast)---------------------------
> >TIP 1: if posting/reading through Usenet, please send an appropriate
> >       subscribe-nomail command to majordomo@postgresql.org so that
your
> >       message can get through to the mailing list cleanly
> >
> >
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

Re: [PERFORM] A Better External Sort?

From

Gregory Maxwell

Date:

30 September 2005, 22:44:29

On 9/30/05, Ron Peacetree <rjpeace@earthlink.net> wrote:
> 4= I'm sure we are paying all sorts of nasty overhead for essentially
> emulating the pg "filesystem" inside another filesystem.  That means
> ~2x as much overhead to access a particular piece of data.
>
> The simplest solution is for us to implement a new VFS compatible
> filesystem tuned to exactly our needs: pgfs.
>
> We may be able to avoid that by some amount of hacking or
> modifying of the current FSs we use, but I suspect it would be more
> work for less ROI.

On this point, Reiser4 fs already implements a number of things which
would be desirable for PostgreSQL. For example: write()s to reiser4
filesystems are atomic, so there is no risk of torn pages (this is
enabled because reiser4 uses WAFL like logging where data is not
overwritten but rather relocated). The filesystem is modular and
extensible so it should be easy to add whatever additional semantics
are needed.  I would imagine that all that would be needed is some
more atomicity operations (single writes are already atomic, but I'm
sure it would be useful to batch many writes into a transaction),some
layout and packing controls, and some flush controls.  A step further
would perhaps integrate multiversioning directly into the FS (the
wandering logging system provides the write side of multiversioning, a
little read side work would be required.). More importantly: the file
system was intended to be extensible for this sort of application.

It might make a good 'summer of code' project for someone next year,
... presumably by then reiser4 will have made it into the mainline
kernel by then. :)

Re: [PERFORM] A Better External Sort?

From

Gregory Maxwell

Date:

30 September 2005, 23:07:20

On 9/28/05, Ron Peacetree <rjpeace@earthlink.net> wrote:
> 2= We use my method to sort two different tables.  We now have these
> very efficient representations of a specific ordering on these tables.  A
> join operation can now be done using these Btrees rather than the
> original data tables that involves less overhead than many current
> methods.

If we want to make joins very fast we should implement them using RD
trees. For the example cases where a join against a very large table
will produce a much smaller output, a RD tree will provide pretty much
the optimal behavior at a very low memory cost.

On the subject of high speed tree code for in-core applications, you
should check out http://judy.sourceforge.net/ . The performance
(insert, remove, lookup, AND storage) is really quite impressive.
Producing cache friendly code is harder than one might expect, and it
appears the judy library has already done a lot of the hard work.
Though it is *L*GPLed, so perhaps that might scare some here away from
it. :) and good luck directly doing joins with a LC-TRIE. ;)

Re: [PERFORM] A Better External Sort?

From

"Dann Corbit"

Date:

01 October 2005, 02:32:22

Judy definitely rates a WOW!!

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Gregory Maxwell
> Sent: Friday, September 30, 2005 7:07 PM
> To: Ron Peacetree
> Cc: Jeffrey W. Baker; pgsql-hackers@postgresql.org; pgsql-
> performance@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
> On 9/28/05, Ron Peacetree <rjpeace@earthlink.net> wrote:
> > 2= We use my method to sort two different tables.  We now have these
> > very efficient representations of a specific ordering on these
tables.
> A
> > join operation can now be done using these Btrees rather than the
> > original data tables that involves less overhead than many current
> > methods.
>
> If we want to make joins very fast we should implement them using RD
> trees. For the example cases where a join against a very large table
> will produce a much smaller output, a RD tree will provide pretty much
> the optimal behavior at a very low memory cost.
>
> On the subject of high speed tree code for in-core applications, you
> should check out http://judy.sourceforge.net/ . The performance
> (insert, remove, lookup, AND storage) is really quite impressive.
> Producing cache friendly code is harder than one might expect, and it
> appears the judy library has already done a lot of the hard work.
> Though it is *L*GPLed, so perhaps that might scare some here away from
> it. :) and good luck directly doing joins with a LC-TRIE. ;)
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 6: explain analyze is your friend

Re: [PERFORM] A Better External Sort?

From

Tom Lane

Date:

01 October 2005, 03:02:16

"Jeffrey W. Baker" <jwbaker@acm.org> writes:
> I think the largest speedup will be to dump the multiphase merge and
> merge all tapes in one pass, no matter how large M.  Currently M is
> capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> the tape.  It could be done in a single pass heap merge with N*log(M)
> comparisons, and, more importantly, far less input and output.

I had more or less despaired of this thread yielding any usable ideas
:-( but I think you have one here.  The reason the current code uses a
six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
edition) shows that there's not much incremental gain from using more
tapes ... if you are in the regime where number of runs is much greater
than number of tape drives.  But if you can stay in the regime where
only one merge pass is needed, that is obviously a win.

I don't believe we can simply legislate that there be only one merge
pass.  That would mean that, if we end up with N runs after the initial
run-forming phase, we need to fit N tuples in memory --- no matter how
large N is, or how small work_mem is.  But it seems like a good idea to
try to use an N-way merge where N is as large as work_mem will allow.
We'd not have to decide on the value of N until after we've completed
the run-forming phase, at which time we've already seen every tuple
once, and so we can compute a safe value for N as work_mem divided by
largest_tuple_size.  (Tape I/O buffers would have to be counted too
of course.)

It's been a good while since I looked at the sort code, and so I don't
recall if there are any fundamental reasons for having a compile-time-
constant value of the merge order rather than choosing it at runtime.
My guess is that any inefficiencies added by making it variable would
be well repaid by the potential savings in I/O.

            regards, tom lane

Re: [PERFORM] A Better External Sort?

From

"Dann Corbit"

Date:

01 October 2005, 03:32:44

> -----Original Message-----
> From: pgsql-hackers-owner@postgresql.org [mailto:pgsql-hackers-
> owner@postgresql.org] On Behalf Of Tom Lane
> Sent: Friday, September 30, 2005 11:02 PM
> To: Jeffrey W. Baker
> Cc: Luke Lonergan; Josh Berkus; Ron Peacetree; pgsql-
> hackers@postgresql.org; pgsql-performance@postgresql.org
> Subject: Re: [HACKERS] [PERFORM] A Better External Sort?
>
> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> > I think the largest speedup will be to dump the multiphase merge and
> > merge all tapes in one pass, no matter how large M.  Currently M is
> > capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes
over
> > the tape.  It could be done in a single pass heap merge with
N*log(M)
> > comparisons, and, more importantly, far less input and output.
>
> I had more or less despaired of this thread yielding any usable ideas
> :-( but I think you have one here.

I believe I made the exact same suggestion several days ago.

>The reason the current code uses a
> six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
> edition) shows that there's not much incremental gain from using more
> tapes ... if you are in the regime where number of runs is much
greater
> than number of tape drives.  But if you can stay in the regime where
> only one merge pass is needed, that is obviously a win.
>
> I don't believe we can simply legislate that there be only one merge
> pass.  That would mean that, if we end up with N runs after the
initial
> run-forming phase, we need to fit N tuples in memory --- no matter how
> large N is, or how small work_mem is.  But it seems like a good idea
to
> try to use an N-way merge where N is as large as work_mem will allow.
> We'd not have to decide on the value of N until after we've completed
> the run-forming phase, at which time we've already seen every tuple
> once, and so we can compute a safe value for N as work_mem divided by
> largest_tuple_size.  (Tape I/O buffers would have to be counted too
> of course.)

You only need to hold the sort column(s) in memory, except for the queue
you are exhausting at the time.  [And of those columns, only the values
for the smallest one in a sub-list.]  Of course, the more data from each
list that you can hold at once, the fewer the disk reads and seeks.

Another idea (not sure if it is pertinent):
Instead of having a fixed size for the sort buffers, size it to the
query.  Given a total pool of size M, give a percentage according to the
difficulty of the work to perform.  So a query with 3 small columns and
a cardinality of 1000 gets a small percentage and a query with 10 GB of
data gets a big percentage of available sort mem.

> It's been a good while since I looked at the sort code, and so I don't
> recall if there are any fundamental reasons for having a compile-time-
> constant value of the merge order rather than choosing it at runtime.
> My guess is that any inefficiencies added by making it variable would
> be well repaid by the potential savings in I/O.
>
>             regards, tom lane
>
> ---------------------------(end of
broadcast)---------------------------
> TIP 6: explain analyze is your friend

Re: [PERFORM] A Better External Sort?

From

Simon Riggs

Date:

01 October 2005, 05:21:16

On Fri, 2005-09-30 at 13:41 -0700, Josh Berkus wrote:
> Yeah, that's what I thought too.   But try sorting an 10GB table, and
> you'll see: disk I/O is practically idle, while CPU averages 90%+.   We're
> CPU-bound, because sort is being really inefficient about something. I
> just don't know what yet.
>
> If we move that CPU-binding to a higher level of performance, then we can
> start looking at things like async I/O, O_Direct, pre-allocation etc. that
> will give us incremental improvements.   But what we need now is a 5-10x
> improvement and that's somewhere in the algorithms or the code.

I'm trying to keep an open mind about what the causes are, and I think
we need to get a much better characterisation of what happens during a
sort before we start trying to write code. It is always too easy to jump
in and tune the wrong thing, which is not a good use of time.

The actual sort algorithms looks damn fine to me and the code as it
stands is well optimised. That indicates to me that we've come to the
end of the current line of thinking and we need a new approach, possibly
in a number of areas.

For myself, I don't wish to be drawn further on solutions at this stage
but I am collecting performance data, so any test results are most
welcome.

Best Regards, Simon Riggs

Re: [PERFORM] A Better External Sort?

From

Simon Riggs

Date:

01 October 2005, 06:29:35

On Sat, 2005-10-01 at 02:01 -0400, Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> > I think the largest speedup will be to dump the multiphase merge and
> > merge all tapes in one pass, no matter how large M.  Currently M is
> > capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> > the tape.  It could be done in a single pass heap merge with N*log(M)
> > comparisons, and, more importantly, far less input and output.
>
> I had more or less despaired of this thread yielding any usable ideas
> :-( but I think you have one here.  The reason the current code uses a
> six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
> edition) shows that there's not much incremental gain from using more
> tapes ... if you are in the regime where number of runs is much greater
> than number of tape drives.  But if you can stay in the regime where
> only one merge pass is needed, that is obviously a win.
>
> I don't believe we can simply legislate that there be only one merge
> pass.  That would mean that, if we end up with N runs after the initial
> run-forming phase, we need to fit N tuples in memory --- no matter how
> large N is, or how small work_mem is.  But it seems like a good idea to
> try to use an N-way merge where N is as large as work_mem will allow.
> We'd not have to decide on the value of N until after we've completed
> the run-forming phase, at which time we've already seen every tuple
> once, and so we can compute a safe value for N as work_mem divided by
> largest_tuple_size.  (Tape I/O buffers would have to be counted too
> of course.)
>
> It's been a good while since I looked at the sort code, and so I don't
> recall if there are any fundamental reasons for having a compile-time-
> constant value of the merge order rather than choosing it at runtime.
> My guess is that any inefficiencies added by making it variable would
> be well repaid by the potential savings in I/O.

Well, perhaps Knuth is not untouchable!

So we merge R runs with N variable rather than N=6.

Pick N so that N >= 6 and N <= R, with N limited by memory, sufficient
to allow long sequential reads from the temp file.

Looking at the code, in selectnewtape() we decide on the connection
between run number and tape number. This gets executed during the
writing of initial runs, which was OK when the run->tape mapping was
known ahead of time because of fixed N.

To do this it sounds like we'd be better to write each run out to its
own personal runtape, taking the assumption that N is very large. Then
when all runs are built, re-assign the run numbers to tapes for the
merge. That is likely to be a trivial mapping unless N isn't large
enough to fit in memory. That idea should be easily possible because the
tape numbers were just abstract anyway.

Right now, I can't see any inefficiencies from doing this. It uses
memory better and Knuth shows that using more tapes is better anyhow.
Keeping track of more tapes isn't too bad, even for hundreds or even
thousands of runs/tapes.

Tom, its your idea, so you have first dibs. I'm happy to code this up if
you choose not to, once I've done my other immediate chores.

That just leaves these issues for a later time:
- CPU and I/O interleaving
- CPU cost of abstract data type comparison operator invocation

Best Regards, Simon Riggs

Re: [PERFORM] A Better External Sort?

From

Michael Stone

Date:

01 October 2005, 08:15:37

On Fri, Sep 30, 2005 at 01:41:22PM -0700, Josh Berkus wrote:
>Realistically, you can't do better than about 25MB/s on a single-threaded
>I/O on current Linux machines,

What on earth gives you that idea? Did you drop a zero?

Mike Stone

Re: [PERFORM] A Better External Sort?

From

Hannu Krosing

Date:

01 October 2005, 11:04:01

On R, 2005-09-30 at 13:38 -0700, Luke Lonergan wrote:

>
> Bulk loading speed is irrelevant here - that is dominated by parsing, which
> we have covered copiously (har har) previously and have sped up by 500%,
> which still makes Postgres < 1/2 the loading speed of MySQL.

Is this < 1/2 of MySQL with WAL on different spindle and/or WAL
disabled ?

--
Hannu Krosing <hannu@skype.net>

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

01 October 2005, 11:22:41

*blink* Tapes?!  I thought that was a typo...
If our sort is code based on sorting tapes, we've made a mistake.  HDs
are not tapes, and Polyphase Merge Sort and it's brethren are not the
best choices for HD based sorts.

Useful references to this point:
Knuth, Vol 3 section 5.4.9, (starts p356 of 2ed)
Tharp, ISBN 0-471-60521-2, starting p352
Folk, Zoellick, and Riccardi, ISBN 0-201-87401-6, chapter 8 (starts p289)

The winners of the "Daytona" version of Jim Gray's sorting contest, for
general purpose external sorting algorithms that are of high enough quality
to be offered commercially, also demonstrate a number of better ways to
attack external sorting using HDs.

The big take aways from all this are:
1= As in Polyphase Merge Sort, optimum External HD Merge Sort
performance is obtained by using Replacement Selection and creating
buffers of different lengths for later merging.  The values are different.

2= Using multiple HDs split into different functions, IOW _not_ simply
as RAIDs, is a big win.
A big enough win that we should probably consider having a config option
to pg that allows the use of HD(s) or RAID set(s) dedicated as temporary
work area(s).

3= If the Key is small compared record size, Radix or Distribution
Counting based algorithms are worth considering.

The good news is all this means it's easy to demonstrate that we can
improve the performance of our sorting functionality.

Assuming we get the abyssmal physical IO performance fixed...
(because until we do, _nothing_ is going to help us as much)

Ron

-----Original Message-----
From: Tom Lane <tgl@sss.pgh.pa.us>
Sent: Oct 1, 2005 2:01 AM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

"Jeffrey W. Baker" <jwbaker@acm.org> writes:
> I think the largest speedup will be to dump the multiphase merge and
> merge all tapes in one pass, no matter how large M.  Currently M is
> capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> the tape.  It could be done in a single pass heap merge with N*log(M)
> comparisons, and, more importantly, far less input and output.

I had more or less despaired of this thread yielding any usable ideas
:-( but I think you have one here.  The reason the current code uses a
six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
edition) shows that there's not much incremental gain from using more
tapes ... if you are in the regime where number of runs is much greater
than number of tape drives.  But if you can stay in the regime where
only one merge pass is needed, that is obviously a win.

I don't believe we can simply legislate that there be only one merge
pass.  That would mean that, if we end up with N runs after the initial
run-forming phase, we need to fit N tuples in memory --- no matter how
large N is, or how small work_mem is.  But it seems like a good idea to
try to use an N-way merge where N is as large as work_mem will allow.
We'd not have to decide on the value of N until after we've completed
the run-forming phase, at which time we've already seen every tuple
once, and so we can compute a safe value for N as work_mem divided by
largest_tuple_size.  (Tape I/O buffers would have to be counted too
of course.)

It's been a good while since I looked at the sort code, and so I don't
recall if there are any fundamental reasons for having a compile-time-
constant value of the merge order rather than choosing it at runtime.
My guess is that any inefficiencies added by making it variable would
be well repaid by the potential savings in I/O.

Re: [PERFORM] A Better External Sort?

From

Andrew Dunstan

Date:

01 October 2005, 12:19:34

Ron Peacetree wrote:

>The good news is all this means it's easy to demonstrate that we can
>improve the performance of our sorting functionality.
>
>Assuming we get the abyssmal physical IO performance fixed...
>(because until we do, _nothing_ is going to help us as much)
>
>
>

I for one would be paying more attention if such a demonstration were
forthcoming, in the form of a viable patch and some benchmark results.

cheers

andrew

Re: [PERFORM] A Better External Sort?

From

Tom Lane

Date:

01 October 2005, 12:48:11

Josh Berkus <josh@agliodbs.com> writes:
> The biggest single area where I see PostgreSQL external sort sucking is
>   on index creation on large tables.   For example, for free version of
> TPCH, it takes only 1.5 hours to load a 60GB Lineitem table on OSDL's
> hardware, but over 3 hours to create each index on that table.  This
> means that over all our load into TPCH takes 4 times as long to create
> the indexes as it did to bulk load the data.
> ...
> Following an index creation, we see that 95% of the time required is the
> external sort, which averages 2mb/s.  This is with seperate drives for
> the WAL, the pg_tmp, the table and the index.  I've confirmed that
> increasing work_mem beyond a small minimum (around 128mb) had no benefit
> on the overall index creation speed.

These numbers don't seem to add up.  You have not provided any details
about the index key datatypes or sizes, but I'll take a guess that the
raw data for each index is somewhere around 10GB.  The theory says that
the runs created during the first pass should on average be about twice
work_mem, so at 128mb work_mem there should be around 40 runs to be
merged, which would take probably three passes with six-way merging.
Raising work_mem to a gig should result in about five runs, needing only
one pass, which is really going to be as good as it gets.  If you could
not see any difference then I see little hope for the idea that reducing
the number of merge passes will help.

Umm ... you were raising maintenance_work_mem, I trust, not work_mem?

We really need to get some hard data about what's going on here.  The
sort code doesn't report any internal statistics at the moment, but it
would not be hard to whack together a patch that reports useful info
in the form of NOTICE messages or some such.

            regards, tom lane

Re: [PERFORM] A Better External Sort?

From

Greg Stark

Date:

01 October 2005, 13:17:52

Tom Lane <tgl@sss.pgh.pa.us> writes:

> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> > I think the largest speedup will be to dump the multiphase merge and
> > merge all tapes in one pass, no matter how large M.  Currently M is
> > capped at 6, so a sort of 60GB with 1GB sort memory needs 13 passes over
> > the tape.  It could be done in a single pass heap merge with N*log(M)
> > comparisons, and, more importantly, far less input and output.
>
> I had more or less despaired of this thread yielding any usable ideas
> :-( but I think you have one here.  The reason the current code uses a
> six-way merge is that Knuth's figure 70 (p. 273 of volume 3 first
> edition) shows that there's not much incremental gain from using more
> tapes ... if you are in the regime where number of runs is much greater
> than number of tape drives.  But if you can stay in the regime where
> only one merge pass is needed, that is obviously a win.

Is that still true when the multiple tapes are being multiplexed onto a single
actual file on disk?

That brings up one of my pet features though. The ability to declare multiple
temporary areas on different spindles and then have them be used on a rotating
basis. So a sort could store each tape on a separate spindle and merge them
together at full sequential i/o speed.

This would make the tradeoff between multiway merges and many passes even
harder to find though. The broader the multiway merges the more sort areas
would be used which would increase the likelihood of another sort using the
same sort area and hurting i/o performance.

--
greg

Re: [PERFORM] A Better External Sort?

From

Martijn van Oosterhout

Date:

01 October 2005, 13:19:55

On Sat, Oct 01, 2005 at 10:22:40AM -0400, Ron Peacetree wrote:
> Assuming we get the abyssmal physical IO performance fixed...
> (because until we do, _nothing_ is going to help us as much)

I'm still not convinced this is the major problem. For example, in my
totally unscientific tests on an oldish machine I have here:

Direct filesystem copy to /dev/null
21MB/s    10% user 50% system  (dual cpu, so the system is using a whole CPU)

COPY TO /dev/null WITH binary
13MB/s    55% user 45% system  (ergo, CPU bound)

COPY TO /dev/null
4.4MB/s   60% user 40% system

\copy to /dev/null in psql
6.5MB/s   60% user 40% system

This machine is a bit strange setup, not sure why fs copy is so slow.
As to why \copy is faster than COPY, I have no idea, but it is
repeatable. And actually turning the tuples into a printable format is
the most expensive. But it does point out that the whole process is
probably CPU bound more than anything else.

So, I don't think physical I/O is the problem. It's something further
up the call tree. I wouldn't be surprised at all it it had to do with
the creation and destruction of tuples. The cost of comparing tuples
should not be underestimated.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

01 October 2005, 13:39:04

As I posted earlier, I'm looking for code to base a prototype on now.
I'll test it outside pg to make sure it is bug free and performs as
promised before I hand it off to the core pg developers.

Someone else is going to have to merge it into the pg code base
since I don't know the code intimately enough to make changes this
deep in the core functionality, nor is there enough time for me to
do so if we are going to be timely enough get this into 8.2
(and no, I can't devote 24x7 to doing pg development unless
someone is going to replace my current ways of paying my bills so
that I can.)

Ron


-----Original Message-----
From: Andrew Dunstan <andrew@dunslane.net>
Sent: Oct 1, 2005 11:19 AM
To: Ron Peacetree <rjpeace@earthlink.net>
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?



Ron Peacetree wrote:

>The good news is all this means it's easy to demonstrate that we can
>improve the performance of our sorting functionality.
>
>Assuming we get the abyssmal physical IO performance fixed...
>(because until we do, _nothing_ is going to help us as much)
>
>
>

I for one would be paying more attention if such a demonstration were
forthcoming, in the form of a viable patch and some benchmark results.

cheers

andrew

Re: [PERFORM] A Better External Sort?

From

Ron Peacetree

Date:

01 October 2005, 14:42:53

You have not said anything about what HW, OS version, and pg version
used here, but even at that can't you see that something Smells Wrong?

The most common CPUs currently shipping have clock rates of ~2-3GHz
and have 8B-16B internal pathways.  SPARCs and other like CPUs are
clocked slower but have 16B-32B internal pathways.  In short, these
CPU's have an internal bandwidth of 16+ GBps.

The most common currently shipping mainboards have 6.4GBps RAM
subsystems.  ITRW, their peak is ~80% of that, or ~5.1GBps.

In contrast, the absolute peak bandwidth of a 133MHx 8B PCI-X bus is
1GBps, and ITRW it peaks at ~800-850MBps.  Should anyone ever build
a RAID system that can saturate a PCI-Ex16 bus, that system will be
maxing ITRW at ~3.2GBps.

CPUs should NEVER be 100% utilized during copy IO.  They should be
idling impatiently waiting for the next piece of data to finish being
processed even when the RAM IO subsystem is pegged; and they
definitely should be IO starved rather than CPU bound when doing
HD IO.

Those IO rates are also alarming in all but possibly the first case.  A
single ~50MBps HD doing 21MBps isn't bad, but for even a single
~80MBps HD it starts to be of concern.  If any these IO rates came
from any reasonable 300+MBps RAID array, then they are BAD.

What your simple experiment really does is prove We Have A
Problem (tm) with our IO code at either or both of the OS or the pg
level(s).

Ron

-----Original Message-----
From: Martijn van Oosterhout <kleptog@svana.org>
Sent: Oct 1, 2005 12:19 PM
Subject: Re: [HACKERS] [PERFORM] A Better External Sort?

On Sat, Oct 01, 2005 at 10:22:40AM -0400, Ron Peacetree wrote:
> Assuming we get the abyssmal physical IO performance fixed...
> (because until we do, _nothing_ is going to help us as much)

I'm still not convinced this is the major problem. For example, in my
totally unscientific tests on an oldish machine I have here:

Direct filesystem copy to /dev/null
21MB/s    10% user 50% system  (dual cpu, so the system is using a whole CPU)

COPY TO /dev/null WITH binary
13MB/s    55% user 45% system  (ergo, CPU bound)

COPY TO /dev/null
4.4MB/s   60% user 40% system

\copy to /dev/null in psql
6.5MB/s   60% user 40% system

This machine is a bit strange setup, not sure why fs copy is so slow.
As to why \copy is faster than COPY, I have no idea, but it is
repeatable. And actually turning the tuples into a printable format is
the most expensive. But it does point out that the whole process is
probably CPU bound more than anything else.

So, I don't think physical I/O is the problem. It's something further
up the call tree. I wouldn't be surprised at all it it had to do with
the creation and destruction of tuples. The cost of comparing tuples
should not be underestimated.

Re: [PERFORM] A Better External Sort?

From

Martijn van Oosterhout

Date:

01 October 2005, 18:56:20

[removed -performance, not subscribed]

On Sat, Oct 01, 2005 at 01:42:32PM -0400, Ron Peacetree wrote:
> You have not said anything about what HW, OS version, and pg version
> used here, but even at that can't you see that something Smells Wrong?

Somewhat old machine running 7.3 on Linux 2.4. Not exactly speed
daemons but it's still true that the whole process would be CPU bound
*even* if the O/S could idle while it's waiting. PostgreSQL used a
*whole CPU* which is its limit. My point is that trying to reduce I/O
by increasing CPU usage is not going to be benficial, we need CPU usage
down also.

Anyway, to bring some real info I just profiled PostgreSQL 8.1beta
doing an index create on a 2960296 row table (3 columns, table size
317MB).

The number 1 bottleneck with 41% of user time is comparetup_index. It
was called 95,369,361 times (about 2*ln(N)*N). It used 3 tapes. Another
15% of time went to tuplesort_heap_siftup.

The thing is, I can't see anything in comparetup_index() that could
take much time. The actual comparisons are accounted elsewhere
(inlineApplySortFunction) which amounted to <10% of total time. Since
nocache_index_getattr doesn't feature I can't imagine index_getattr
being a big bottleneck. Any ideas what's going on here?

Other interesting features:
- ~4 memory allocations per tuple, nearly all of which were explicitly
freed
- Things I though would be expensive, like: heapgettup and
myFunctionCall2 didn't really count for much.

Have a nice weekend,
 %   cumulative   self              self     total           time   seconds   seconds    calls   s/call   s/call  name
 43.63    277.81   277.81 95370055     0.00     0.00  comparetup_index16.24    381.24   103.43  5920592     0.00
0.00 tuplesort_heap_siftup 3.76    405.17    23.93 95370055     0.00     0.00  inlineApplySortFunction 3.18    425.42
20.26 95370056     0.00     0.00  btint4cmp 2.82    443.37    17.95 11856219     0.00     0.00  AllocSetAlloc 2.52
459.44   16.07 95370055     0.00     0.00  myFunctionCall2 1.71    470.35    10.91  2960305     0.00     0.00
heapgettup1.26    478.38     8.03 11841204     0.00     0.00  GetMemoryChunkSpace 1.14    485.67     7.29  5920592
0.00    0.00  tuplesort_heap_insert 1.11    492.71     7.04  2960310     0.00     0.00  index_form_tuple 1.09    499.67
   6.96 11855105     0.00     0.00  AllocSetFree 0.97    505.83     6.17 23711355     0.00     0.00  AllocSetFreeIndex
0.84   511.19     5.36  5920596     0.00     0.00  LogicalTapeWrite 0.84    516.51     5.33  2960314     0.00     0.00
slot_deform_tuple
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: [PERFORM] A Better External Sort?

From

Tom Lane

Date:

02 October 2005, 00:26:13

Martijn van Oosterhout <kleptog@svana.org> writes:
> Anyway, to bring some real info I just profiled PostgreSQL 8.1beta
> doing an index create on a 2960296 row table (3 columns, table size
> 317MB).

3 columns in the index you mean?  What were the column datatypes?
Any null values?

> The number 1 bottleneck with 41% of user time is comparetup_index.
> ...
> The thing is, I can't see anything in comparetup_index() that could
> take much time.

The index_getattr and heap_getattr macros can be annoyingly expensive.
        regards, tom lane

Re: [PERFORM] A Better External Sort?

From

Martijn van Oosterhout

Date:

02 October 2005, 09:47:48

On Sat, Oct 01, 2005 at 11:26:07PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Anyway, to bring some real info I just profiled PostgreSQL 8.1beta
> > doing an index create on a 2960296 row table (3 columns, table size
> > 317MB).
>
> 3 columns in the index you mean?  What were the column datatypes?
> Any null values?

Nope, three columns in the table, one column in the index, no nulls.
The indexed column was integer. I did it once with around 6500 values
repeated over and over, lots of duplicate kays. And once on a serial
column but it made no descernable difference either way. Although the
comparison function was called less (only 76 million times), presumably
because it was mostly sorted already.

> > The number 1 bottleneck with 41% of user time is comparetup_index.
> > ...
> > The thing is, I can't see anything in comparetup_index() that could
> > take much time.
>
> The index_getattr and heap_getattr macros can be annoyingly expensive.

And yet they are optimised for the common case. nocache_index_getattr
was only called 7 times, which is about what you expect. I'm getting
annotated output now, to determine which line takes the time...
Actually, my previous profile overstated stuff a bit. Profiling turned
off optimisation so I put it back and you get better results but the
order doesn't change much. By line results are below.

The top two are the index_getattr calls in comparetup_index. Third and
fourth are the HEAPCOMPARES in tuplesort_heap_siftup. Then comes the
inlineApplySortFunction call (which isn't being inlined, despite
suggesting it should be, -Winline warns about this).

Looks to me that there are no real gains to be made in this function.
What is needed is an algorithmic change to call this function less
often...

Have a nice weekend,
 %   cumulative   self              self     total           time   seconds   seconds    calls  ms/call  ms/call  name
  9.40     22.56    22.56                             comparetup_index (tuplesort.c:2042 @ 8251060) 5.07     34.73
12.17                            comparetup_index (tuplesort.c:2043 @ 82510c0) 4.73     46.09    11.36
          tuplesort_heap_siftup (tuplesort.c:1648 @ 825074d) 3.48     54.45     8.36
tuplesort_heap_siftup(tuplesort.c:1661 @ 82507a9) 2.80     61.18     6.73                             comparetup_index
(tuplesort.c:2102@ 8251201) 2.68     67.62     6.44                             comparetup_index (tuplesort.c:2048 @
8251120)2.16     72.82     5.20                             tuplesort_heap_siftup (tuplesort.c:1652 @ 825076d) 1.88
77.34    4.52 76025782     0.00     0.00  comparetup_index (tuplesort.c:2016 @ 8251010) 1.82     81.70     4.36
76025782    0.00     0.00  inlineApplySortFunction (tuplesort.c:1833 @ 8251800) 1.73     85.85     4.15
           readtup_heap (tuplesort.c:2000 @ 8250fd8) 1.67     89.86     4.01                             AllocSetAlloc
(aset.c:568@ 824bec0) 1.61     93.72     3.86                             comparetup_index (tuplesort.c:2025 @ 825102f)
1.47    97.25     3.53 76025785     0.00     0.00  btint4cmp (nbtcompare.c:74 @ 80924a0) 1.11     99.92     2.67
                    readtup_datum (tuplesort.c:2224 @ 82517c4) 1.10    102.55     2.64
comparetup_index(tuplesort.c:2103 @ 82511e7) 
 %   cumulative   self              self     total           time   seconds   seconds    calls   s/call   s/call  name
 28.34     68.01    68.01 76025782     0.00     0.00  comparetup_index13.56    100.54    32.53  7148934     0.00
0.00 tuplesort_heap_siftup 8.66    121.33    20.79 76025782     0.00     0.00  inlineApplySortFunction 4.43    131.96
10.63 13084567     0.00     0.00  AllocSetAlloc 3.73    140.90     8.94 76025785     0.00     0.00  btint4cmp 2.15
146.07    5.17  6095625     0.00     0.00  LWLockAcquire 2.02    150.92     4.85  2960305     0.00     0.00  heapgettup
1.98   155.66     4.74  7148934     0.00     0.00  tuplesort_heap_insert 1.78    159.94     4.28  2960312     0.00
0.00 slot_deform_tuple 1.73    164.09     4.15                             readtup_heap 1.67    168.09     4.00
6095642    0.00     0.00  LWLockRelease 1.53    171.76     3.68  2960308     0.00     0.00  index_form_tuple 1.44
175.21    3.45 13083442     0.00     0.00  AllocSetFree 1.28    178.28     3.07  8377285     0.00     0.00
LogicalTapeWrite1.25    181.29     3.01  8377285     0.00     0.00  LogicalTapeRead 1.11    183.96     2.67
               readtup_datum 1.06    186.51     2.55        1     2.55   123.54  IndexBuildHeapScan 

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: [PERFORM] A Better External Sort?

From

Martijn van Oosterhout

Date:

02 October 2005, 17:41:15

Ok, I tried two optimisations:

1. By creating a special version of comparetup_index for single key
integer indexes. Create an index_get_attr with byval and len args. By
using fetch_att and specifying the values at compile time, gcc
optimises the whole call to about 12 instructions of assembly rather
than the usual mess.

2. By specifying: -Winline -finline-limit-1500 (only on tuplesort.c).
This causes inlineApplySortFunction() to be inlined, like the code
obviously expects it to be.

default build (baseline)                    235 seconds
-finline only                               217 seconds (7% better)
comparetup_index_fastbyval4 only            221 seconds (6% better)
comparetup_index_fastbyval4 and -finline    203 seconds (13.5% better)

This is indexing the integer sequence column on a 2.7 million row
table. The times are as given by gprof and so exclude system call time.

Basically, I recommend adding "-Winline -finline-limit-1500" to the
default build while we discuss other options.

comparetup_index_fastbyval4 patch attached per example.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Andreas,

On 10/6/05 3:56 AM, "Zeugswetter Andreas DAZ SD" <ZeugswetterA@spardat.at>
wrote:

> pg relys on the OS readahead (== larger block IO) to do efficient IO.
> Basically the pg scan performance should match a dd if=file of=/dev/null
> bs=8k,
> unless CPU bound.

Which it is. Postgres will currently do a maximum of 120MB/s scan rate on
modern CPUs, no matter how fast the underlying I/O subsystem is. Several
people on this list have confirmed that by posting 100MB/s numbers on faster
subsystems. Others have posted that Oracle/DB2/etc run at near disk rates.
IMO, most of the problem is that there are too many small (tuple, single
block) operations in the I/O path and they need to be optimized out.

The first step is recognition that there is a problem worth solving, and it
seems that there are now many who have recognized the issue of poor I/O path
optimization in Postgres. I think a prioritized short list might look like
this:

- Sort performance - The implementation of a possibly fine algorithm yields very poor use of
the underlying hardware. It needs to be optimized to remove redundant I/O
operations and possibly (with different algorithm?) to make better use of
available RAM. This would be practice for the next item on the list:

- Scan performance - Someone needs to trace the life of a tuple through the executor and
implement better buffering at each stage of the process. There was a start
at this, but it only buffered in one node. Yes, I know we have a shared
memory buffer, but it acts as a final destination, this is about the path
through the executor. - I suspect that after profiling (with system routines included), we will
find much more tuple-scale work that is interfering with the optimal flow of
tuples through the executor. It's unlikely that the problem is isolated to
the COUNT() aggregate, but is present in many other nodes. Redundant
allocation/free of memory on a per tuple basis should be replaced with more
persistent buffers on a page or larger basis. - Ultimately, after working through the above two issues, we will reach
a
point of diminishing returns where async I/O is needed to be able to remove
producer/consumer problems in a single threaded executor. This can be
implemented using sliced processes or threading, or by using aio. I believe
that even without async I/O, we should be seeing scan rates of 200-400 MB/s
on modern CPUs for the simple COUNT() aggregation pattern though.

FWIW,

- Luke

Re: [PERFORM] A Better External Sort?

From

Tom Lane

Date:

06 October 2005, 17:09:53

Martijn van Oosterhout <kleptog@svana.org> writes:
> Indeed, one of the things on my list is to remove all the lseeks in
> favour of pread. Halving the number of kernel calls has got to be worth
> something right? Portability is an issue ofcourse...

Being sure that it's not a pessimization is another issue.  I note that
glibc will emulate these functions if the kernel doesn't have them;
which means you could be replacing one kernel call with three.

And I don't think autoconf has any way to determine whether a libc
function represents a native kernel call or not ...

            regards, tom lane

Re: [PERFORM] A Better External Sort?

From

Tom Lane

Date:

06 October 2005, 18:15:22

Martijn van Oosterhout <kleptog@svana.org> writes:
> Are we awfully worried about people still using 2.0 kernels? And it
> would replace two calls with three in the worst case, we currently
> lseek before every read.

That's utterly false.

            regards, tom lane

Re: [PERFORM] A Better External Sort?

From

Alvaro Herrera

Date:

06 October 2005, 18:24:31

On Thu, Oct 06, 2005 at 03:57:38PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Indeed, one of the things on my list is to remove all the lseeks in
> > favour of pread. Halving the number of kernel calls has got to be worth
> > something right? Portability is an issue ofcourse...
>
> Being sure that it's not a pessimization is another issue.  I note that
> glibc will emulate these functions if the kernel doesn't have them;
> which means you could be replacing one kernel call with three.
>
> And I don't think autoconf has any way to determine whether a libc
> function represents a native kernel call or not ...

The problem kernels would be Linux 2.0, which I very much doubt is going
to be present in to-be-deployed database servers.

Unless someone runs glibc on top of some other kernel, I guess.  Is this
a common scenario?  I've never seen it.

--
Alvaro Herrera                  http://www.amazon.com/gp/registry/DXLWNGRJD34
Oh, oh, las chicas galacianas, lo harán por las perlas,
¡Y las de Arrakis por el agua! Pero si buscas damas
Que se consuman como llamas, ¡Prueba una hija de Caladan! (Gurney Halleck)

Re: [PERFORM] A Better External Sort?

From

Martijn van Oosterhout

Date:

06 October 2005, 18:26:50

On Thu, Oct 06, 2005 at 03:57:38PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Indeed, one of the things on my list is to remove all the lseeks in
> > favour of pread. Halving the number of kernel calls has got to be worth
> > something right? Portability is an issue ofcourse...
>
> Being sure that it's not a pessimization is another issue.  I note that
> glibc will emulate these functions if the kernel doesn't have them;
> which means you could be replacing one kernel call with three.

From the linux pread manpage:

HISTORY      The pread and pwrite system calls were added to Linux in version      2.1.60; the entries in the i386
systemcall table were added in      2.1.69.  The libc support (including emulation on older kernels      without the
systemcalls) was added in glibc 2.1. 

Are we awfully worried about people still using 2.0 kernels? And it
would replace two calls with three in the worst case, we currently
lseek before every read.

I don't know about other OSes.
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: [PERFORM] A Better External Sort?

From

Martijn van Oosterhout

Date:

06 October 2005, 18:40:44

On Thu, Oct 06, 2005 at 04:25:11PM -0400, Tom Lane wrote:
> Martijn van Oosterhout <kleptog@svana.org> writes:
> > Are we awfully worried about people still using 2.0 kernels? And it
> > would replace two calls with three in the worst case, we currently
> > lseek before every read.
>
> That's utterly false.

Oops, you're right. I usually strace during a vacuum or a large query
and my screen fills up with:

lseek()
read()
lseek()
read()
...

So didn't wonder if the straight sequential read was optimised. Still,
I think pread() would be a worthwhile improvement, at least for Linux.

--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Patent. n. Genius is 5% inspiration and 95% perspiration. A patent is a
> tool for doing 5% of the work and then sitting around waiting for someone
> else to do the other 95% so you can sue them.

Re: [PERFORM] A Better External Sort?

From

"Jim C. Nasby"

Date:

08 October 2005, 19:55:03

On Thu, Sep 29, 2005 at 03:28:27PM +0200, Zeugswetter Andreas DAZ SD wrote:
>
> > In my original example, a sequential scan of the 1TB of 2KB
> > or 4KB records, => 250M or 500M records of data, being sorted
> > on a binary value key will take ~1000x more time than reading
> > in the ~1GB Btree I described that used a Key+RID (plus node
> > pointers) representation of the data.
>
> Imho you seem to ignore the final step your algorithm needs of
> collecting the
> data rows. After you sorted the keys the collect step will effectively
> access the
> tuples in random order (given a sufficiently large key range).
>
> This random access is bad. It effectively allows a competing algorithm
> to read the
> whole data at least 40 times sequentially, or write the set 20 times
> sequentially.
> (Those are the random/sequential ratios of modern discs)

True, but there is a compromise... not shuffling full tuples around when
sorting in memory. Do your sorting with pointers, then write the full
tuples out to 'tape' if needed.

Of course the other issue here is that as correlation improves it
becomes better and better to do full pointer-based sorting.
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461