Thread: How filesystems matter with PostgreSQL
Some interesting data about different filesystems I tried with PostgreSQL and how it came out.

I have an application backed by PostgreSQL, using Java JDBC to access it. The tests were all done on an opensuse 11.2 64-bit machine, on the same hard drive (just ran mkfs between each test) on the same input with the same code base. All filesystems were created with the default options.

XFS (logbufs=8): ~4 hours to finish
ext4: ~1 hour 50 minutes to finish
ext3: 15 minutes to finish
ext3 on LVM: 15 minutes to finish

--
Jon Schewe | http://mtu.net/~jpschewe
If you see an attachment named signature.asc, this is my digital signature. See http://www.gnupg.org for more information.
On Friday 04 June 2010 14:17:35 Jon Schewe wrote:
> XFS (logbufs=8): ~4 hours to finish
> ext4: ~1 hour 50 minutes to finish
> ext3: 15 minutes to finish
> ext3 on LVM: 15 minutes to finish

Hi Jon,

Any chance you can do the same test with reiserfs?

Thanks,
Joost
On Friday 04 June 2010 14:17:35 Jon Schewe wrote:
> XFS (logbufs=8): ~4 hours to finish
> ext4: ~1 hour 50 minutes to finish
> ext3: 15 minutes to finish
> ext3 on LVM: 15 minutes to finish

My guess is that some of the difference comes from barrier differences. ext4 uses barriers by default, ext3 does not.

Andres
Andres Freund <andres@anarazel.de> writes:
> On Friday 04 June 2010 14:17:35 Jon Schewe wrote:
>> XFS (logbufs=8): ~4 hours to finish
>> ext4: ~1 hour 50 minutes to finish
>> ext3: 15 minutes to finish
>> ext3 on LVM: 15 minutes to finish

> My guess is that some of the difference comes from barrier differences.
> ext4 uses barriers by default, ext3 does not.

Or, to put it more clearly: the reason ext3 is fast is that it's unsafe.

            regards, tom lane
On Friday 04 June 2010 16:25:30 Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
>> On Friday 04 June 2010 14:17:35 Jon Schewe wrote:
>>> XFS (logbufs=8): ~4 hours to finish
>>> ext4: ~1 hour 50 minutes to finish
>>> ext3: 15 minutes to finish
>>> ext3 on LVM: 15 minutes to finish
>>
>> My guess is that some of the difference comes from barrier differences.
>> ext4 uses barriers by default, ext3 does not.
>
> Or, to put it more clearly: the reason ext3 is fast is that it's unsafe.

Jon: To verify you can enable it via the barrier=1 option during mounting..

Andres
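For reference, barriers are controlled per mount; a minimal sketch of the options under discussion, with the device and mount point as placeholders rather than anything taken from this thread:

    mount -o barrier=1 /dev/sdX1 /var/lib/pgsql    # ext3 with write barriers enabled (safer, slower)
    mount -o nobarrier /dev/sdX1 /var/lib/pgsql    # ext4 with barriers disabled (faster, unsafe on power loss)

The same options can also go in the options field of the matching /etc/fstab entry.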
UFS2 w/ soft updates on FreeBSD might be an interesting addition to the list of test cases
On Fri, Jun 4, 2010 at 9:33 AM, Andres Freund <andres@anarazel.de> wrote:
> Jon: To verify you can enable it via the barrier=1 option during mounting..
I'm running on Linux, so that's not really an option here.
On 6/4/10 1:20 PM, Bryan Hinton wrote:
> UFS2 w/ soft updates on FreeBSD might be an interesting addition to the
> list of test cases
On 6/4/10 9:33 AM, Andres Freund wrote:
> On Friday 04 June 2010 16:25:30 Tom Lane wrote:
>> Or, to put it more clearly: the reason ext3 is fast is that it's unsafe.
> Jon: To verify you can enable it via the barrier=1 option during mounting..

First some details:
Linux kernel 2.6.31
postgres version: 8.4.2

More test results:
reiserfs: ~1 hour 50 minutes
ext3 barrier=1: ~15 minutes
ext4 nobarrier: ~15 minutes
jfs: ~15 minutes

--
Jon Schewe | http://mtu.net/~jpschewe
What types of journaling on each fs?
On Fri, Jun 4, 2010 at 1:26 PM, Jon Schewe <jpschewe@mtu.net> wrote:
> More test results:
> reiserfs: ~1 hour 50 minutes
> ext3 barrier=1: ~15 minutes
> ext4 nobarrier: ~15 minutes
> jfs: ~15 minutes
I just used standard mkfs for each filesystem and mounted them without options, unless otherwise specified.
On 6/4/10 1:37 PM, Bryan Hinton wrote:
> What types of journaling on each fs?
On Friday 04 June 2010 20:26:27 Jon Schewe wrote:
> ext3 barrier=1: ~15 minutes
> ext4 nobarrier: ~15 minutes

Any message in the kernel log about barriers or similar?

Andres
On 6/4/10 1:46 PM, Andres Freund wrote:
> On Friday 04 June 2010 20:26:27 Jon Schewe wrote:
>> ext3 barrier=1: ~15 minutes
>> ext4 nobarrier: ~15 minutes
> Any message in the kernel log about barriers or similar?

No.

--
Jon Schewe | http://mtu.net/~jpschewe
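If one wanted to double-check, the kernel log can be grepped for barrier-related messages; a sketch, assuming a stock syslog layout:

    dmesg | grep -i barrier
    grep -i barrier /var/log/messages

Kernels of that era log a message along the lines of "barrier-based sync failed ... disabling barriers" when a device or filesystem cannot honor them.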
Jon Schewe wrote:
> The tests were all done on an opensuse 11.2 64-bit machine,
> on the same hard drive (just ran mkfs between each test) on the same
> input with the same code base.

So no controller card, just the motherboard and a single hard drive?

If that's the case, what you've measured is which filesystems are safe because they default to flushing drive cache (the ones that take around 15 minutes) and which do not (the ones that take >= around 2 hours). You can't make ext3 flush the cache correctly no matter what you do with barriers, they just don't work on ext3 the way PostgreSQL needs them to.

--
Greg Smith  2ndQuadrant US  Baltimore, MD
PostgreSQL Training, Services and Support
greg@2ndQuadrant.com   www.2ndQuadrant.us
On 06/05/2010 05:36 PM, Greg Smith wrote:
> So no controller card, just the motherboard and a single hard drive?

Correct.

> If that's the case, what you've measured is which filesystems are
> safe because they default to flushing drive cache (the ones that take
> around 15 minutes) and which do not (the ones that take >= around 2
> hours). You can't make ext3 flush the cache correctly no matter what
> you do with barriers, they just don't work on ext3 the way PostgreSQL
> needs them to.

So the 15 minute runs are doing it correctly and safely, but the slow ones are doing the wrong thing? That would imply that ext3 is the safe one. But your last statement suggests that ext3 is doing the wrong thing.
Jon Schewe wrote:
> So the 15 minute runs are doing it correctly and safely, but the slow
> ones are doing the wrong thing? That would imply that ext3 is the safe
> one. But your last statement suggests that ext3 is doing the wrong thing.

I goofed and reversed the two times when writing that. As is always the case with this sort of thing, the unsafe runs are the fast ones. ext3 does not ever do the right thing no matter how you configure it; you have to compensate for its limitations with correct hardware setup to make database writes reliable.

--
Greg Smith  2ndQuadrant US
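One common compensation on a bare SATA drive, not something raised in this thread, is to turn off the drive's volatile write cache so that an fsync'd write actually reaches the platters; a sketch, with the device name as a placeholder:

    hdparm -W0 /dev/sdX    # disable the drive's write cache: slower, but fsync means what it says

The usual alternative, which comes up below, is a battery-backed caching RAID controller.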
On 06/05/2010 05:52 PM, Greg Smith wrote:
> I goofed and reversed the two times when writing that. As is always
> the case with this sort of thing, the unsafe runs are the fast ones.
> ext3 does not ever do the right thing no matter how you configure it,
> you have to compensate for its limitations with correct hardware setup
> to make database writes reliable.

OK, so if I want the 15 minute speed, I need to give up safety (OK in this case as this is just research testing), or see if I can tune postgres better.
On Sat, Jun 5, 2010 at 5:03 PM, Jon Schewe <jpschewe@mtu.net> wrote:
> OK, so if I want the 15 minute speed, I need to give up safety (OK in
> this case as this is just research testing), or see if I can tune
> postgres better.

Or use a trustworthy hardware caching battery backed RAID controller, either in RAID mode or JBOD mode.
On 06/05/2010 06:54 PM, Scott Marlowe wrote:
> Or use a trustworthy hardware caching battery backed RAID controller,
> either in RAID mode or JBOD mode.

Right, because the real danger here is if the power goes out you can end up with a scrambled database, correct?
On Sat, Jun 5, 2010 at 5:58 PM, Jon Schewe <jpschewe@mtu.net> wrote:
> Right, because the real danger here is if the power goes out you can end
> up with a scrambled database, correct?

Correct. Assuming you can get power applied again before the battery in the RAID controller dies, it will then flush out its cache and your data will still be coherent.
On 06/05/2010 07:02 PM, Scott Marlowe wrote:
> Correct. Assuming you can get power applied again before the battery
> in the RAID controller dies, it will then flush out its cache and your
> data will still be coherent.

Or if you really don't care if your database is scrambled after a power outage you can go without the battery backed RAID controller.
On Sat, Jun 5, 2010 at 6:07 PM, Jon Schewe <jpschewe@mtu.net> wrote:
> Or if you really don't care if your database is scrambled after a power
> outage you can go without the battery backed RAID controller.

I do that all the time, on Slony replication slaves. You can use a considerably less powerful machine, IO-wise, with fsync disabled and a handful of cheap SATA drives.
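For reference, the trade-off Scott describes maps onto postgresql.conf roughly as follows (a sketch; synchronous_commit is my addition, not something mentioned in this thread):

    fsync = off                 # never force WAL to disk: fast, but a crash can corrupt the database
    #synchronous_commit = off   # milder: keeps the database consistent, but a crash can lose recent commits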
Jon Schewe wrote:
> OK, so if I want the 15 minute speed, I need to give up safety (OK in
> this case as this is just research testing), or see if I can tune
> postgres better.

Depending on your app, one more possibility would be to see if you can re-factor the application so it can do multiple writes in parallel rather than waiting for each one to complete. If I understand right, then many transactions could potentially be handled by a single fsync.
On 06/06/10 14:51, Ron Mayer wrote:
> Jon Schewe wrote:
>> OK, so if I want the 15 minute speed, I need to give up safety (OK in
>> this case as this is just research testing), or see if I can tune
>> postgres better.
>
> Depending on your app, one more possibility would be to see if you
> can re-factor the application so it can do multiple writes in parallel
> rather than waiting for each one to complete. If I understand right,
> then many transactions could potentially be handled by a single fsync.

By using a commit delay, yes (see postgresql.conf). You do open up the risk of losing transactions committed within the commit delay period, but you don't risk corruption the way you do with fsync turned off.

Sometimes you can also batch work into bigger transactions. The classic example here is the usual long stream of individual auto-committed INSERTs, which when wrapped in an explicit transaction can be vastly quicker.

-- Craig Ringer
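Since the original poster drives the database through Java JDBC, a minimal sketch of Craig's last suggestion, wrapping a long stream of INSERTs in one explicit transaction, might look like the following; the connection details, table, and column names are invented for illustration:

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;

    public class BatchedInserts {
        public static void main(String[] args) throws Exception {
            // Connection details are placeholders, not taken from the thread.
            try (Connection conn = DriverManager.getConnection(
                    "jdbc:postgresql://localhost/testdb", "user", "secret")) {
                // One auto-committed INSERT per row means one WAL flush per row.
                // Grouping many rows into a single explicit transaction pays
                // that cost once per batch instead.
                conn.setAutoCommit(false);
                try (PreparedStatement ps = conn.prepareStatement(
                        "INSERT INTO results (run_id, score) VALUES (?, ?)")) {
                    for (int i = 0; i < 10000; i++) {
                        ps.setInt(1, i);
                        ps.setDouble(2, Math.random());
                        ps.addBatch();             // queue the row client-side
                        if (i % 1000 == 0) {
                            ps.executeBatch();     // send a chunk to the server
                        }
                    }
                    ps.executeBatch();
                }
                conn.commit();  // one commit, and one WAL flush, for the whole batch
            }
        }
    }

The commit delay Craig mentions is a server-side knob (commit_delay in postgresql.conf) and needs no application changes.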