Thread: posix advises ...

posix advises ...

From

Hans-Juergen Schoenig

Date:

11 May 2008, 06:53:16

hello everybody,

recently we had a bit of a nightmare with some kernels and concurrent
seq scans.
the thing we encountered was the following: a single "SELECT COUNT(*)
FROM table" on a big table (50 GB) gave us a constant 350 MB/sec of I/O.
as soon as a second scan started, throughput dropped to 2 MB/sec. at first
i thought that some random I/O had crept in, but synchronous scans worked
fine. we found out that there is some madness in some linux kernel /
controller combinations causing this issue.
i did some tests on my local boxes, which were clearly not affected by
this problem, and i still saw a single SATA disk drop from 65 MB/sec
to around 45. this is not good.
i found a patch by greg stark implementing posix_fadvise for bitmap
scans. i quickly hacked in calls that issue the same kind of advice when
a seq scan is done.
the impact was surprisingly high. single scans went up from 65 MB/sec
to something around 70. concurrent scans basically run at a steady, high
speed - no more drop in I/O speed until something like 16 scans or so.
even the broken controller went up from "350 MB -> 2 MB" to "350 MB -> 50 MB".
by replacing the kernel and the driver we see steady behavior here as
well now.
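
For readers who want to try the idea outside the backend, the hint itself is a single call. What follows is only a minimal sketch, assuming Linux or another platform that provides posix_fadvise(); scan_file_sequentially() is a hypothetical helper for illustration and is not taken from the attached patch.

```c
#define _XOPEN_SOURCE 600       /* expose posix_fadvise() and POSIX_FADV_* */
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): read a file front to back in
 * 8 kB blocks after telling the kernel the access pattern is sequential.
 * Returns the number of bytes read, or -1 on error.
 */
static long long
scan_file_sequentially(const char *path)
{
    char        buf[8192];
    long long   total = 0;
    ssize_t     n;
    int         fd;
    int         rc;

    assert(path != NULL);

    fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    /*
     * offset 0 with len 0 means "the whole file". posix_fadvise() returns
     * an error number directly instead of setting errno. The call is only
     * advisory, so a failure is reported but does not stop the scan.
     */
    rc = posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise failed with error %d, continuing\n", rc);

    while ((n = read(fd, buf, sizeof(buf))) > 0)
        total += n;

    close(fd);
    return (n < 0) ? -1 : total;
}
```

Calling scan_file_sequentially() on a large file, with and without the posix_fadvise() line, is one rough way to see whether a given kernel reacts to the hint at all.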

maybe it is worth discussing posix_fadvise.
we hacked up a simple patch based on greg's work which nicely fixed the
problem for us (brute force).
we also made a simple autoconf hack to check for a broken posix_fadvise.
maybe people want to test whether they see similar performance differences.
if a patch like that is likely to be accepted, we would hack up a
cleaner implementation.

    many thanks,

       hans

--
Cybertec Schönig & Schönig GmbH
PostgreSQL Solutions and Support
Gröhrmühlgasse 26, A-2700 Wiener Neustadt
Tel: +43/1/205 10 35 / 340
www.postgresql-support.de, www.postgresql-support.com

Attachment

preread-seq-tunable.diff.gz

Re: posix advises ...

From

Greg Smith

Date:

13 May 2008, 23:04:27

On Sun, 11 May 2008, Hans-Juergen Schoenig wrote:

> we also made some simple autoconf hack to check for broken posix_fadvise.

Because of how you did that, your patch is extremely difficult to even
test.  You really should at least scan the output from diff you're about
to send before submitting a patch to make sure it makes sense--yours is
over 30,000 lines long just to implement a small improvement.  The reason
for that is that you regenerated "configure" using a later version of
Autoconf than the official distribution, and the diff for the result is
gigantic.  You even turned off the check in configure.in that specifically
prevents you from making that mistake so it's not like you weren't warned.

You should just show the necessary modifications to configure.in in your
patch, certainly shouldn't submit a patch that subverts the checks there,
and leave out the resulting configure file if you didn't use the same
version of Autoconf.

I find the concept behind this patch very useful and I'd like to see a
useful one re-submitted.  I'm in the middle of setting up some new
hardware this month and was planning to test the existing fadvise patches
Greg Stark sent out as part of that.  Having another one to test for
accelerating multiple sequential scans would be extremely helpful to add
to that, because then I think I can reuse some of the test cases Jeff
Davis put together for the 8.3 improvements in that area to see how well
it works.  It wasn't as clear to me how to test the existing bitmap scan
patch; yours seems a more straightforward patch to use as a testing ground
for fadvise effectiveness.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: posix advises ...

From

Zoltan Boszormenyi

Date:

19 June 2008, 08:12:45

Greg Smith wrote:
> On Sun, 11 May 2008, Hans-Juergen Schoenig wrote:
>
>> we also made some simple autoconf hack to check for broken
>> posix_fadvise.
>
> Because of how you did that, your patch is extremely difficult to even
> test.  You really should at least scan the output from diff you're
> about to send before submitting a patch to make sure it makes
> sense--yours is over 30,000 lines long just to implement a small
> improvement.  The reason for that is that you regenerated "configure"
> using a later version of Autoconf than the official distribution, and
> the diff for the result is gigantic.  You even turned off the check in
> configure.in that specifically prevents you from making that mistake
> so it's not like you weren't warned.

Sorry, that old autoconf version was nowhere to be found for Fedora 8.
Fortunately, PostgreSQL 8.4 has already switched to autoconf 2.61... :-)

> You should just show the necessary modifications to configure.in in
> your patch, certainly shouldn't submit a patch that subverts the
> checks there, and leave out the resulting configure file if you didn't
> use the same version of Autoconf.
>
> I find the concept behind this patch very useful and I'd like to see a
> useful one re-submitted.  I'm in the middle of setting up some new
> hardware this month and was planning to test the existing fadvise
> patches Greg Stark sent out as part of that.

This patch (revisited and ported to current CVS HEAD) is indeed based on
Greg's original patch and also includes another patch, written by Mark Wong,
that helps evict closed XLOGs from memory faster. Our additions are:
- advise POSIX_FADV_SEQUENTIAL for seqscans
- configure check
- small documentation for Greg's patch and its GUC
- adapt ginget.c that started using tbm_iterate() recently
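
The XLOG eviction idea can be illustrated in isolation. The sketch below assumes the technique is a POSIX_FADV_DONTNEED hint issued when a finished segment is closed; close_and_evict() is a hypothetical helper and is not copied from Mark Wong's patch.

```c
#define _XOPEN_SOURCE 600       /* expose posix_fadvise() and POSIX_FADV_* */
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper (not from the patch): flush a finished log segment,
 * tell the kernel its cached pages are no longer needed, then close it.
 * Returns 0 on success, -1 if fsync() or close() failed.
 */
static int
close_and_evict(int fd)
{
    int         result = 0;

    assert(fd >= 0);

    /* Dirty pages cannot simply be dropped, so write them out first. */
    if (fsync(fd) != 0)
        result = -1;

    /* Advisory only: the kernel is free to ignore it, so is the result. */
    (void) posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    if (close(fd) != 0)
        result = -1;

    return result;
}
```

The fsync() before the hint matters in this sketch: pages that are still dirty generally stay in the cache until they have been written back.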

The configure check implicitly assumes that segfaults (which are
reported as exit code 139 here) can be handled. IIRC Tom Lane
mentioned that mismatched glibc and Linux kernel versions may
segfault when posix_fadvise() is called (the kernel lacked the feature,
but glibc erroneously assumed it could use it).
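
To illustrate the kind of probe described here: exit code 139 is how a shell reports death by SIGSEGV (128 + signal 11 on Linux). The sketch below is not the check from the attached patch; it is an assumed, standalone version in which posix_fadvise_usable() is a hypothetical name and the call is made in a child process so that a crash cannot take the caller down.

```c
#define _XOPEN_SOURCE 600       /* expose posix_fadvise() and POSIX_FADV_* */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical probe (not from the patch): returns 1 if posix_fadvise()
 * can be called on probe_path and reports success, 0 if the call fails,
 * the file cannot be opened, or the child process crashes.
 */
static int
posix_fadvise_usable(const char *probe_path)
{
    int         status;
    pid_t       pid;

    assert(probe_path != NULL);

    pid = fork();
    if (pid < 0)
        return 0;

    if (pid == 0)
    {
        /* Child: a segfault here kills only this process. */
        int         fd = open(probe_path, O_RDONLY);

        if (fd < 0)
            _exit(2);
        _exit(posix_fadvise(fd, 0, 0, POSIX_FADV_NORMAL) == 0 ? 0 : 1);
    }

    if (waitpid(pid, &status, 0) != pid)
        return 0;

    /* A child killed by a signal fails WIFEXITED and counts as broken. */
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

A configure-time test program run by Autoconf gets the same effect more simply, since any non-zero exit status, including a crash, makes the run test fail.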

>   Having another one to test for accelerating multiple sequential
> scans would be extremely helpful to add to that, because then I think
> I can reuse some of the test cases Jeff Davis put together for the 8.3
> improvements in that area to see how well it works.  It wasn't as
> clear to me how to test the existing bitmap scan patch, yours seems a
> more straightforward patch to use as a testing ground for fadvise
> effectiveness.
>
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
>

Best regards,
Zoltán Böszörményi

--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/

Attachment

preread-seq-tunable-8.4-v4.diff.gz

Re: posix advises ...

From

Greg Smith

Date:

19 June 2008, 20:19:59

On Thu, 19 Jun 2008, Zoltan Boszormenyi wrote:

> This patch (revisited and ported to current CVS HEAD) is indeed using
> Greg's original patch and also added another patch written by Mark Wong
> that helps evicting closed XLOGs from memory faster.

Great, that will save me some trouble.  I've got a stack of Linux
performance testing queued up (got stuck behind a kernel bug impacting
pgbench) for the next couple of weeks and I'll include this in that
testing.  I think I've got a similar class of hardware as you tested on
for initial evaluation--I'm getting around 200MB/s sequential I/O right
now out of my small RAID setup.

I added your patch to the queue for next month's CommitFest and listed
myself as the initial reviewer, but a commit that soon is unlikely.
Performance tests like this usually take a while to converge, and since
this is using a less popular API I expect a round of portability concerns,
too.

Where did Mark's patch come from?  I'd like to be able to separate out
that change from the rest if necessary.

Also, if you have any specific test cases you ran that I could start by
trying to replicate a speedup on, those would be handy as well.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: posix advises ...

From

Hans-Juergen Schoenig

Date:

20 June 2008, 02:49:59

good morning,

this is wonderful news.

this is pretty much what we observed as well. the kernel has been a showstopper for many setups recently. for us, this patch fixed most of the cases related to kernel read-ahead and so on.

in fact, posix_fadvise was the only way to prevent a big german company from replacing postgres with oracle.

the problem was that synchronized scans led to a significant decrease in I/O throughput, as the kernel was simply confused by processes concurrently reading the same file.

I hope zoltan's autoconf magic fixes the portability issues.

hans

Cybertec Schönig & Schönig GmbH
PostgreSQL Solutions and Support
Gröhrmühlgasse 26, 2700 Wiener Neustadt
Tel: +43/1/205 10 35 / 340
www.postgresql-support.de, www.postgresql-support.com

Re: posix advises ...

From

Zoltan Boszormenyi

Date:

20 June 2008, 07:24:25

Greg Smith wrote:
> On Thu, 19 Jun 2008, Zoltan Boszormenyi wrote:
>
>> This patch (revisited and ported to current CVS HEAD) is indeed using
>> Greg's original patch and also added another patch written by Mark Wong
>> that helps evicting closed XLOGs from memory faster.
>
> Great, that will save me some trouble.  I've got a stack of Linux
> performance testing queued up (got stuck behind a kernel bug impacting
> pgbench) for the next couple of weeks and I'll include this in that
> testing.  I think I've got a similar class of hardware as you tested
> on for initial evaluation--I'm getting around 200MB/s sequential I/O
> right now out of my small RAID setup,.
>
> I added your patch to the queue for next month's CommitFest and listed
> myself as the initial reviewer, but a commit that soon is unlikely.
> Performance tests like this usually take a while to converge, and
> since this is using a less popular API I expect a round of portability
> concerns, too.
>
> Where did Marc's patch come from?  I'd like to be able to separate out
> that change from the rest if necessary.

That patch was posted here:
http://archives.postgresql.org/pgsql-patches/2008-03/msg00000.php

> Also, if you have any specific test cases you ran that I could start
> by trying to replicate a speedup on, those would be handy as well.

We saw synchronized seqscans slow down beyond a certain number of
clients (10+), which seems strange, as this should be a strong selling
point of 8.3.
With the posix_fadvise() patches, the dropoff was pushed further out.
The query involved multiple tables; it was not a trivial single-table
seqscan case.
Without the patch, both a simple SATA system (each disk at ~63MB/sec)
and a hw RAID with 400+ MB/sec showed the slowdown.
The initial 60-63MB/sec with 1-3 clients on the single SATA disk system
quickly dropped to 11-17MB/sec with more clients.
With the patch, it only dropped to 40-47MB/sec.
I cannot recall the RAID figures.
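
This is not the multi-table query described above, but a much cruder stand-in: a rough way to look for the same kind of dropoff at the filesystem level. The sketch assumes a platform with posix_fadvise(); reader_child() and time_concurrent_scans() are hypothetical names. The file should be considerably larger than RAM, otherwise the page cache hides the effect.

```c
#define _XOPEN_SOURCE 600       /* expose posix_fadvise() and clock_gettime() */
#include <assert.h>
#include <fcntl.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Child body: read the whole file, optionally with the sequential hint. */
static void
reader_child(const char *path, int use_fadvise)
{
    static char buf[65536];
    ssize_t     n;
    int         fd = open(path, O_RDONLY);

    if (fd < 0)
        _exit(1);
    if (use_fadvise)
        (void) posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    while ((n = read(fd, buf, sizeof(buf))) > 0)
        ;                       /* data is discarded, only I/O matters */
    _exit(n < 0 ? 1 : 0);
}

/*
 * Start nreaders processes that each scan the file once and return the
 * elapsed wall-clock seconds, or -1.0 if any reader could not be started
 * or did not finish cleanly.
 */
static double
time_concurrent_scans(const char *path, int nreaders, int use_fadvise)
{
    struct timespec t0, t1;
    int         i, started = 0, failed = 0;

    assert(path != NULL && nreaders > 0);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < nreaders; i++)
    {
        pid_t       pid = fork();

        if (pid == 0)
            reader_child(path, use_fadvise);    /* never returns */
        if (pid > 0)
            started++;
        else
            failed = 1;
    }
    for (i = 0; i < started; i++)
    {
        int         status;

        if (wait(&status) < 0 || !WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    }
    clock_gettime(CLOCK_MONOTONIC, &t1);

    if (failed)
        return -1.0;
    return (double) (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Dividing file size times nreaders by the returned time gives an aggregate MB/sec figure that can be compared for 1, 2, 4, 8, 16 readers with use_fadvise set to 0 and to 1.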

> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
>


--
----------------------------------
Zoltán Böszörményi
Cybertec Schönig & Schönig GmbH
http://www.postgresql.at/