
PageRepairFragmentation performance

From
Heikki Linnakangas
Date:
When profiling replay of the WAL generated by pgbench, I noticed that
PageRepairFragmentation consumes a large fraction of the CPU time:

Per "perf report":

+   33.44%     6.79%  postmaster  postgres            [.] PageRepairFragmentation

The 33.44% figure includes all the functions called by
PageRepairFragmentation. Looking at the profile more closely, most of
that time seems to be spent sorting the item ids into physical order so
that they can be memmoved into place:

+   83.86%     0.00%  postmaster  libc-2.19.so        [.] __libc_start_main
+   83.86%     0.00%  postmaster  postgres            [.] main
+   83.86%     0.00%  postmaster  postgres            [.] PostmasterMain
+   83.86%     0.00%  postmaster  postgres            [.] 0x000000000023208d
+   83.86%     0.00%  postmaster  postgres            [.] AuxiliaryProcessMain
+   83.86%     0.00%  postmaster  postgres            [.] StartupProcessMain
+   83.63%     1.86%  postmaster  postgres            [.] StartupXLOG
+   45.85%     0.10%  postmaster  postgres            [.] heap2_redo
+   33.44%     6.79%  postmaster  postgres            [.] PageRepairFragmentation
+   24.60%    16.63%  postmaster  postgres            [.] pg_qsort
+   18.04%     0.23%  postmaster  postgres            [.] heap_redo
+   17.07%     1.53%  postmaster  postgres            [.] XLogReadBufferExtended
+   16.20%     0.30%  postmaster  postgres            [.] XLogReadBufferForRedoEx
+   14.38%     0.31%  postmaster  postgres            [.] ReadRecord
+   13.90%     1.29%  postmaster  postgres            [.] XLogReadRecord
+   12.40%     1.54%  postmaster  postgres            [.] heap_xlog_update
+   12.08%    12.06%  postmaster  postgres            [.] ValidXLogRecord
+   11.73%     0.10%  postmaster  postgres            [.] XLogReadBufferForRedo
+   10.89%     0.27%  postmaster  postgres            [.] ReadBufferWithoutRelcac
+    8.49%     1.07%  postmaster  libc-2.19.so        [.] __GI___libc_read
+    7.61%     0.71%  postmaster  postgres            [.] ReadBuffer_common
+    5.64%     0.48%  postmaster  postgres            [.] smgropen
+    5.48%     5.47%  postmaster  postgres            [.] itemoffcompare
+    5.40%     5.38%  postmaster  postgres            [.] hash_search_with_hash_v
+    4.70%     4.69%  postmaster  libc-2.19.so        [.] __memmove_ssse3_back
+    4.30%     0.77%  postmaster  libc-2.19.so        [.] __GI___libc_lseek64
+    4.29%     0.20%  postmaster  postgres            [.] heap_xlog_insert
+    3.88%     3.87%  postmaster  postgres            [.] swapfunc
+    2.81%     0.09%  postmaster  postgres            [.] XLogRecordPageWithFreeS
+    2.76%     0.00%          cp  libc-2.19.so        [.] __GI___libc_write
+    2.68%     0.07%  postmaster  postgres            [.] BufTableLookup
+    2.58%     2.58%  postmaster  postgres            [.] LWLockAcquire
+    2.17%     0.14%  postmaster  postgres            [.] tag_hash

So there's clearly some room for improvement here. A couple of ideas:

1. Replace the qsort with something cheaper. The itemid arrays being
sorted are small, a few hundred items at most, usually even smaller. In
the pgbench test case I used, the typical size is about 60. With a
small array, a plain insertion sort is cheaper than the generic qsort(),
because it avoids the function call overhead etc. involved with generic
qsort. Or we could use something smarter, like a radix sort, knowing
that we're sorting small integers. Or we could implement an inlineable
version of qsort and use that.
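To make the insertion-sort idea concrete, here's a minimal sketch in
plain C. The struct and function names are hypothetical, not taken from
the patch; it sorts a small array of item entries in descending offset
order (assuming that's the order the compaction loop wants), with no
indirect comparator calls:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the sorted entries: each one records a
 * tuple's physical offset and length within the page. */
typedef struct
{
    uint16_t offset;            /* physical offset of the tuple */
    uint16_t len;               /* tuple length in bytes */
} toy_itemid;

/* Plain insertion sort, descending by offset.  For arrays of ~60
 * elements, this avoids the comparator function calls and swap
 * machinery of a generic qsort(). */
static void
itemid_insertion_sort(toy_itemid *a, size_t n)
{
    for (size_t i = 1; i < n; i++)
    {
        toy_itemid  tmp = a[i];
        size_t      j = i;

        while (j > 0 && a[j - 1].offset < tmp.offset)
        {
            a[j] = a[j - 1];
            j--;
        }
        a[j] = tmp;
    }
}
```

The inner loop is just compares and moves on a tiny array, so it stays
entirely within the I-cache.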

2. Instead of sorting the array and using memmove in-place, we could
copy all the tuples to a temporary buffer in arbitrary order, and
finally copy the temporary copy back to the buffer. That requires two
memory copies per tuple, instead of one memmove, but memcpy() is pretty
darn fast. It would be a loss when there are only a few large tuples on
the page, so that avoiding the sort doesn't help, or when the tuples are
mostly already in the correct places, so that most of the memmove()s are
no-ops. But with a lot of small tuples, it would be a win, and it would
be simple.
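A sketch of that second idea, using a toy page model (the names,
constants, and layout here are illustrative only, not the real page
format): each tuple is memcpy'd into a scratch buffer packed from the
end, in whatever order the line pointers are visited, and the packed
region is then copied back in one go:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 8192

typedef struct
{
    uint16_t offset;            /* current physical offset of the tuple */
    uint16_t len;               /* tuple length in bytes */
} toy_itemid;

/* Compact 'nitems' tuples toward the end of 'page' without sorting:
 * copy each tuple into a scratch buffer packed from the page end, then
 * copy the packed region back.  Two copies per tuple, but no qsort(). */
static size_t
compact_via_scratch(char *page, toy_itemid *items, size_t nitems)
{
    char        scratch[TOY_PAGE_SIZE];
    size_t      upper = TOY_PAGE_SIZE;

    for (size_t i = 0; i < nitems; i++)
    {
        upper -= items[i].len;
        memcpy(scratch + upper, page + items[i].offset, items[i].len);
        items[i].offset = (uint16_t) upper;     /* new location */
    }

    /* Second copy: move the packed tuple area back into the page. */
    memcpy(page + upper, scratch + upper, TOY_PAGE_SIZE - upper);
    return upper;               /* new start of the tuple area */
}
```

Note how the tuples end up in line-pointer order rather than their old
physical order, which is the behavioural change mentioned below.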

The second option would change behaviour slightly, as the tuples would
be placed on the page in a different physical order than before. It
wouldn't be visible to users, though.

I spent some time hacking approach 1, and replaced the qsort() call with
a bucket sort. I'm not sure if a bucket sort is optimal, or better than
a specialized quicksort implementation, but it seemed simple.
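For illustration, a bucket sort over item offsets might look roughly
like this. This is a toy reconstruction, not the attached patch: the
bucket count matches the patch's N_SORT_BUCKETS, but the names, the
descending order, and everything else are assumptions:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_PAGE_SIZE 8192
#define N_BUCKETS     32        /* same count as the patch's N_SORT_BUCKETS */
#define MAX_ITEMS     512

/* Bucket sort of tuple offsets into descending order: distribute by
 * offset range, then insertion-sort each (small) bucket.  Offsets on a
 * page are roughly uniformly spread, so the buckets stay balanced and
 * the per-bucket sorts are nearly free. */
static void
bucket_sort_offsets(uint16_t *offsets, size_t n)
{
    size_t      counts[N_BUCKETS] = {0};
    size_t      starts[N_BUCKETS];
    size_t      fill[N_BUCKETS];
    uint16_t    out[MAX_ITEMS];
    size_t      pos = 0;

    assert(n <= MAX_ITEMS);

    /* High offsets map to low bucket numbers, so concatenating the
     * buckets yields descending order. */
    for (size_t i = 0; i < n; i++)
        counts[(TOY_PAGE_SIZE - 1 - offsets[i]) * N_BUCKETS / TOY_PAGE_SIZE]++;

    for (int b = 0; b < N_BUCKETS; b++)
    {
        starts[b] = pos;
        pos += counts[b];
    }
    memcpy(fill, starts, sizeof(fill));

    for (size_t i = 0; i < n; i++)
    {
        size_t b = (TOY_PAGE_SIZE - 1 - offsets[i]) * N_BUCKETS / TOY_PAGE_SIZE;

        out[fill[b]++] = offsets[i];
    }

    /* Insertion sort within each bucket, descending. */
    for (int b = 0; b < N_BUCKETS; b++)
    {
        for (size_t i = starts[b] + 1; i < starts[b] + counts[b]; i++)
        {
            uint16_t    v = out[i];
            size_t      j = i;

            while (j > starts[b] && out[j - 1] < v)
            {
                out[j] = out[j - 1];
                j--;
            }
            out[j] = v;
        }
    }
    memcpy(offsets, out, n * sizeof(uint16_t));
}
```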

With the testcase I've been using - replaying about 2GB of WAL generated
by pgbench - this reduces the replay time from about 55 s to 45 s.

Thoughts? Attached is the patch I put together. It's actually two
patches: the first is just refactoring, moving the code common to
PageRepairFragmentation, PageIndexMultiDelete, and
PageIndexDeleteNoCompact into a shared function. The second replaces the
qsort().

- Heikki


Re: PageRepairFragmentation performance

From
José Luis Tallón
Date:
On 11/18/2014 07:03 PM, Heikki Linnakangas wrote:
> When profiling replay of the WAL generated by pgbench, I noticed that
> PageRepairFragmentation consumes a large fraction of the CPU time:
> [snip]
>
> 1. Replace the qsort with something cheaper. The itemid arrays being 
> sorted are small, a few hundred items at most, usually even smaller. In 
> this pgbench test case I used, the typical size is about 60. With a 
> small array a plain insertion sort is cheaper than the generic 
> qsort(), because it can avoid the function overhead etc. involved with 
> generic qsort. Or we could use something smarter, like a radix sort, 
> knowing that we're sorting small integers. Or we could implement an 
> inlineable version of qsort and use that.
IIRC, the theoretical complexity of quicksort and radix sort should be
approximately the same for 256-1024 items: O(n*log(n)) vs O(d*n), where
d is ~log2(n), or just 16 in this case. However, lexicographical
("bitstring-wise") ordering might not be what we are aiming for here.

AFAIK, an inlined quicksort should be about the best-performing sort
available (most of the enhancement coming from staying within the
I-cache).

> 2. Instead of sorting the array and using memmove in-place, we could 
> copy all the tuples to a temporary buffer in arbitrary order, and 
> finally copy the temporary copy back to the buffer. That requires two 
> memory copies per tuple, instead of one memmove, but memcpy() is 
> pretty darn fast. It would be a loss when there are only a few large 
> tuples on the page, so that avoiding the sort doesn't help, or when 
> the tuples are mostly already in the correct places, so that most of 
> the memmove()s are no-ops. But with a lot of small tuples, it would be 
> a win, and it would be simple.

Memmove *should* be no slower than memcpy... if both are actually
translated by the compiler to use intrinsics, as opposed to calling the
functions --- as seems to be the case here (cf. __memmove_ssse3_back).
A simple "if" to eliminate the no-op memmoves might do the trick as
well.
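That "if" is tiny in code terms; a sketch, with hypothetical names
rather than the actual compaction loop:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Skip the move entirely when a tuple is already at its final offset,
 * so already-in-place tuples cost one branch instead of a memmove(). */
static void
move_if_needed(char *page, size_t dst, size_t src, size_t len)
{
    if (dst != src)
        memmove(page + dst, page + src, len);
}
```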

Just my two (euro) cents, though

> The second option would change behaviour slightly, as the tuples would 
> be placed on the page in different physical order than before. It 
> wouldn't be visible to users, though.
>
> I spent some time hacking approach 1, and replaced the qsort() call 
> with a bucket sort. I'm not sure if a bucket sort is optimal, or 
> better than a specialized quicksort implementation, but it seemed simple.
>
> With the testcase I've been using - replaying about 2GB of WAL 
> generated by pgbench - this reduces the replay time from about 55 s to 
> 45 s.

Not bad at all... though I suspect most of it might come from staying 
within the I-cache as opposed to regular qsort.
The smaller itemIdSortData structure surely helps a bit, too :)

> Thoughts? Attached is the patch I put together. It's actually two 
> patches: the first is just refactoring, putting the common code 
> between PageRepairFragmentation, PageIndexMultiDelete, and 
> PageIndexDeleteNoCompact to function. The second replaces the qsort(). 

Definitely worthwhile, even if just for the refactoring. The speed-up
sounds very good, too.



Thanks,
    / J.L.




Re: PageRepairFragmentation performance

From
Peter Geoghegan
Date:
On Tue, Nov 18, 2014 at 10:03 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> So there's clearly some room for improvement here. A couple of ideas:
>
> 1. Replace the qsort with something cheaper. The itemid arrays being sorted
> are small, a few hundred items at most, usually even smaller. In this pgbench
> test case I used, the typical size is about 60. With a small array a plain
> insertion sort is cheaper than the generic qsort(), because it can avoid the
> function overhead etc. involved with generic qsort. Or we could use
> something smarter, like a radix sort, knowing that we're sorting small
> integers. Or we could implement an inlineable version of qsort and use that.

> I spent some time hacking approach 1, and replaced the qsort() call with a
> bucket sort. I'm not sure if a bucket sort is optimal, or better than a
> specialized quicksort implementation, but it seemed simple.

Nice result.

Most quicksort implementations, including our own, switch to insertion
sort if the number of elements to be sorted is low enough (in our case,
when n is less than 7). Also, specializing qsort() with integers in
isolation showed about a 3x improvement last I checked, which was a few
years ago now, so cache misses (reductions in which we can gain through
compile-time specialization) are likely by far the dominant
consideration here.

pgbench is probably a very conservative way of estimating how big the
array can get, given the structure of those tables. Even 60 seems high
if we're looking for some kind of rough idea of an average.

How did you arrive at this value? It seems rather high:

+#define N_SORT_BUCKETS 32

I think you're not sufficiently clear on why you're using bucket sort
rather than some other alternative. Maybe it's so obvious to you that
you forgot. I *think* it's because it's reasonable to suppose that for
this particular use-case, you know that in general there'll be a more
or less even distribution between the buckets. Right? That seems to
make sense to me. All the work that qsort() does to make sure it has a
good pivot at each level of recursion may be wasted here.

The refactoring patch certainly looks very reasonable.
-- 
Peter Geoghegan



Re: PageRepairFragmentation performance

From
Heikki Linnakangas
Date:
On 01/31/2015 01:49 AM, Peter Geoghegan wrote:
> The refactoring patch certainly looks very reasonable.

Ok, committed the refactoring part for now. Thanks for the review.

- Heikki




Re: PageRepairFragmentation performance

From
Peter Geoghegan
Date:
On Tue, Feb 3, 2015 at 4:11 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 01/31/2015 01:49 AM, Peter Geoghegan wrote:
>> The refactoring patch certainly looks very reasonable.
>
> Ok, committed the refactoring part for now. Thanks for the review.

Where are we on the rest of this, Heikki?

-- 
Peter Geoghegan