Thread: Block at a time ...
I agree with Tom; any reordering attempt is at best second-guessing the filesystem and underlying storage.
However, having the ability to control the extent size would be a worthwhile improvement for systems that walk and chew gum (write to lots of tables) concurrently.
I'm thinking of Oracle's AUTOEXTEND settings for tablespace datafiles .... I think the ideal way to do it for PG would be to make the equivalent configurable system-wide in postgresql.conf, and allow specific per-table settings in the SQL metadata, similar to autovacuum.
An awesomely simple alternative is to just specify the extension as e.g. 5% of the existing table size .... it starts by adding one block at a time for tiny tables, and once your table is over 20GB, it ends up adding a whole 1GB file and pre-allocating it. Very little wastage.
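For illustration, here's a rough sketch of that sizing rule in C (the function name and the cap at the 1GB segment boundary are just assumptions for the example, not existing PostgreSQL code):

#include <stdint.h>

#define BLCKSZ      8192
#define MAX_EXTEND  (((uint64_t) 1 << 30) / BLCKSZ)   /* cap at one 1 GB segment */

/*
 * Hypothetical helper: how many blocks to add when a relation fills up,
 * growing by roughly 5% of its current size.  A tiny table still grows a
 * block at a time; at ~20 GB (2,621,440 blocks) the increment reaches a
 * full 1 GB segment.
 */
static uint64_t
blocks_to_extend(uint64_t current_blocks)
{
    uint64_t    extend = current_blocks / 20;   /* ~5% of current size */

    if (extend < 1)
        extend = 1;
    if (extend > MAX_EXTEND)
        extend = MAX_EXTEND;
    return extend;
}

At 8kB blocks that means a 160kB table is still extended one block at a time, while anything past 20GB gets 131,072 blocks (a whole 1GB segment) per extension.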
Cheers
Dave
On Tue, Mar 16, 2010 at 4:49 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Tom Lane escribió:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Maybe it would make more sense to try to reorder the fsync calls
> > instead.
>
> Reorder to what, though? You still have the problem that we don't know
> much about the physical layout on-disk.

Well, to block numbers as a first step.
However, this reminds me that sometimes we take the block-at-a-time
extension policy too seriously. We had a customer who had a
performance problem because they were inserting lots of data into TOAST
tables, causing very frequent extensions. I kept wondering whether an
allocation policy that allocated several new blocks at a time could be
useful (but I didn't try it). This would also alleviate fragmentation,
thus helping the physical layout be more similar to logical block
numbers.
Dave Crooke escribió:
> An awesomely simple alternative is to just specify the extension as e.g. 5%
> of the existing table size .... it starts by adding one block at a time for
> tiny tables, and once your table is over 20GB, it ends up adding a whole 1GB
> file and pre-allocating it. Very little wastage.

I was thinking in something like that, except that the factor I'd use would be something like 50% or 100% of current size, capped at (say) 1 GB.

--
Alvaro Herrera                          http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
> I was thinking in something like that, except that the factor I'd use
> would be something like 50% or 100% of current size, capped at (say) 1
> GB.

Using fallocate() ?
On Wed, Mar 17, 2010 at 7:32 AM, Pierre C <lists@peufeu.com> wrote:
>> I was thinking in something like that, except that the factor I'd use
>> would be something like 50% or 100% of current size, capped at (say) 1 GB.

This turns out to be a bad idea. One of the first things Oracle DBAs are told to do is change this default setting to allocate some reasonably large fixed size rather than scaling upwards.

This might be mostly due to Oracle's extent-based space management, but I'm not so sure. Recall that the filesystem is probably doing some rounding itself. If you allocate 120kB it's probably allocating 128kB itself anyway. Having two layers rounding up will result in odd behaviour.

In any case I was planning on doing this a while back. Then I ran some experiments and couldn't actually demonstrate any problem. ext2 seems to do a perfectly reasonable job of avoiding this problem. All the files were mostly large contiguous blocks after running some tests -- IIRC running pgbench.

> Using fallocate() ?

I think we need posix_fallocate().

--
greg
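For concreteness, a minimal sketch of pre-extending a segment file with posix_fallocate() -- the file name and extension size here are invented, and this is not actual PostgreSQL code:

#define _XOPEN_SOURCE 600

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ        8192
#define EXTEND_BLOCKS 131072                   /* reserve 1 GB in one call */

int
main(void)
{
    int     fd = open("16384.1", O_RDWR);      /* hypothetical segment file */
    off_t   end;
    int     rc;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    end = lseek(fd, 0, SEEK_END);              /* current end of the segment */

    /* Ask the filesystem to reserve the space without us writing it. */
    rc = posix_fallocate(fd, end, (off_t) EXTEND_BLOCKS * BLCKSZ);
    if (rc != 0)                               /* returns an errno value, not -1 */
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));

    close(fd);
    return rc ? 1 : 0;
}

The appeal is that the filesystem gets to pick one contiguous chunk up front instead of growing the file 8kB at a time.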
Greg Stark <gsstark@mit.edu> writes:
> I think we need posix_fallocate().

The problem with posix_fallocate (other than questionable portability) is that it doesn't appear to guarantee anything at all about what is in the space it allocates. Worst case, we might find valid-looking Postgres data there (eg, because a block was recycled from some recently dropped table). If we have to write something anyway to zero the space, what's the point?

			regards, tom lane
Greg - with Oracle, I always do fixed 2GB DBFs for portability, and preallocate the whole file in advance. However, the situation is a bit different in that Oracle will put blocks from multiple tables and indexes in a DBF if you don't tell it otherwise.
Tom - I'm not sure what Oracle does, but it literally writes the whole extent before using it .... I think they are just doing the literal equivalent of dd if=/dev/zero ... it takes several seconds to prep a 2GB file on decent storage.
Cheers
Dave
On Wed, Mar 17, 2010 at 9:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The problem with posix_fallocate (other than questionable portability)
> is that it doesn't appear to guarantee anything at all about what is in
> the space it allocates. Worst case, we might find valid-looking
> Postgres data there (eg, because a block was recycled from some recently
> dropped table). If we have to write something anyway to zero the space,
> what's the point?
>
> 			regards, tom lane
On 3/17/10 2:52 AM, Greg Stark wrote:
> On Wed, Mar 17, 2010 at 7:32 AM, Pierre C <lists@peufeu.com> wrote:
>>> I was thinking in something like that, except that the factor I'd use
>>> would be something like 50% or 100% of current size, capped at (say) 1 GB.
>
> This turns out to be a bad idea. One of the first things Oracle DBAs
> are told to do is change this default setting to allocate some
> reasonably large fixed size rather than scaling upwards.
>
> This might be mostly due to Oracle's extent-based space management but
> I'm not so sure. Recall that the filesystem is probably doing some
> rounding itself. If you allocate 120kB it's probably allocating 128kB
> itself anyways. Having two layers rounding up will result in odd
> behaviour.
>
> In any case I was planning on doing this a while back. Then I ran some
> experiments and couldn't actually demonstrate any problem. ext2 seems
> to do a perfectly reasonable job of avoiding this problem. All the
> files were mostly large contiguous blocks after running some tests --
> IIRC running pgbench.

This is one of the more-or-less solved problems in Unix/Linux. Ext* file systems have a "reserve", usually of 10% of the disk space, that nobody except root can use. It's not for root, it's because with 10% of the disk free, you can almost always do a decent job of allocating contiguous blocks and get good performance. Unless Postgres has some weird problem that Linux has never seen before (and that wouldn't be unprecedented...), there's probably no need to fool with file-allocation strategies.

Craig
Greg is correct, as usual. Geometric growth of files is A Bad Thing in an Oracle DBA's world, since you can unexpectedly (automatically?) run out of file system space when the database determines it needs x% more extents than last time.

The concept of contiguous extents, however, has some merit, particularly when restoring databases. Prior to parallel restore, a table's files were created and extended in roughly contiguous allocations, presuming there was no other activity on your database disks. (You do dedicate disks, don't you?) When using 8-way parallel restore against a six-disk RAID 10 group I found that table and index scan performance dropped by about 10x. I/O performance was restored by either clustering the tables one at a time, or by dropping and restoring them one at a time. The only reason I can come up with for this behavior is file fragmentation and increased seek times.

If PostgreSQL had a mechanism to pre-allocate files prior to restoring the database, that might mitigate the problem. Then if we could only get parallel index operations ...

Bob Lunney

--- On Wed, 3/17/10, Greg Stark <gsstark@mit.edu> wrote:
> [...]
> This turns out to be a bad idea. One of the first things Oracle DBAs
> are told to do is change this default setting to allocate some
> reasonably large fixed size rather than scaling upwards.
> [...]
On Mar 17, 2010, at 9:41 AM, Craig James wrote:
> On 3/17/10 2:52 AM, Greg Stark wrote:
>> [...]
>> In any case I was planning on doing this a while back. Then I ran some
>> experiments and couldn't actually demonstrate any problem. ext2 seems
>> to do a perfectly reasonable job of avoiding this problem. All the
>> files were mostly large contiguous blocks after running some tests --
>> IIRC running pgbench.
>
> This is one of the more-or-less solved problems in Unix/Linux. Ext* file systems have a "reserve", usually of 10% of the disk space, that nobody except root can use. It's not for root, it's because with 10% of the disk free, you can almost always do a decent job of allocating contiguous blocks and get good performance. Unless Postgres has some weird problem that Linux has never seen before (and that wouldn't be unprecedented...), there's probably no need to fool with file-allocation strategies.
>
> Craig

It's fairly easy to break. Just do a parallel import with, say, 16 concurrent tables being written to at once. Result? Fragmented tables.
>> This is one of the more-or-less solved problems in Unix/Linux. Ext*
>> file systems have a "reserve", usually of 10% of the disk space, that
>> nobody except root can use. It's not for root, it's because with 10%
>> of the disk free, you can almost always do a decent job of allocating
>> contiguous blocks and get good performance. Unless Postgres has some
>> weird problem that Linux has never seen before (and that wouldn't be
>> unprecedented...), there's probably no need to fool with
>> file-allocation strategies.
>>
>> Craig
>
> Its fairly easy to break. Just do a parallel import with say, 16
> concurrent tables being written to at once. Result? Fragmented tables.

Delayed allocation (ext4, XFS) helps a lot for concurrent writing at a medium-high rate (a few megabytes per second and up) when lots of data can sit in the cache and be flushed/allocated as big contiguous chunks. I'm pretty sure ext4/XFS would pass your parallel import test.

However, if you have files like tables (and indexes) or logs that grow slowly over time (something like a few megabytes per hour or less), after a few days/weeks/months horrible fragmentation is an almost guaranteed result on many filesystems (NTFS being perhaps the absolute worst).
This is why pre-allocation is a good idea if you have the space ....
Tom, what about a really simple command in a forthcoming release of PG that would just preformat a 1GB file at a time? This is what I've always done scripted with Oracle (ALTER TABLESPACE foo ADD DATAFILE ....) rather than relying on its autoextender when performance has been a concern.
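To illustrate, such a preformat step would amount to roughly the following (a sketch only -- the segment file name is made up, and this is just the dd if=/dev/zero equivalent mentioned earlier, not a proposed implementation):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define SEGMENT_BYTES ((off_t) 1 << 30)        /* preformat 1 GB at a time */
#define CHUNK_BYTES   (1024 * 1024)

int
main(void)
{
    static char zeros[CHUNK_BYTES];            /* zero-initialized buffer */
    int         fd = open("16384.2", O_WRONLY | O_CREAT | O_APPEND, 0600);
    off_t       written = 0;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Write zeros until the new segment is fully allocated on disk. */
    while (written < SEGMENT_BYTES)
    {
        ssize_t n = write(fd, zeros, sizeof(zeros));

        if (n < 0)
        {
            perror("write");
            return 1;
        }
        written += n;
    }

    if (fsync(fd) != 0)                        /* force the allocation out to disk */
        perror("fsync");
    close(fd);
    return 0;
}

As noted above, writing zeros like this costs a few seconds per gigabyte on decent storage, which is why it makes more sense as an explicit, scripted step than as something done on every extension.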
Cheers
Dave
On Mon, Mar 22, 2010 at 3:55 PM, Pierre C <lists@peufeu.com> wrote:
>>> This is one of the more-or-less solved problems in Unix/Linux. Ext* file
>>> systems have a "reserve" usually of 10% of the disk space that nobody
>>> except root can use. [...]
>>>
>>> Craig
>>
>> Its fairly easy to break. Just do a parallel import with say, 16 concurrent
>> tables being written to at once. Result? Fragmented tables.
>
> Delayed allocation (ext4, XFS) helps a lot for concurrent writing at a
> medium-high rate (a few megabytes per second and up) when lots of data can
> sit in the cache and be flushed/allocated as big contiguous chunks. I'm
> pretty sure ext4/XFS would pass your parallel import test.
>
> However if you have files like tables (and indexes) or logs that grow
> slowly over time (something like a few megabytes per hour or less), after a
> few days/weeks/months, horrible fragmentation is an almost guaranteed
> result on many filesystems (NTFS being perhaps the absolute worst).
On Mon, Mar 22, 2010 at 6:47 PM, Scott Carey <scott@richrelevance.com> wrote:
> Its fairly easy to break. Just do a parallel import with say, 16 concurrent tables being written to at once. Result? Fragmented tables.

FWIW I did do some investigation about this at one point and could not demonstrate any significant fragmentation. But that was on Linux -- different filesystem implementations would have different success rates. And there could be other factors as well, such as how full the filesystem is or how old it is.

--
greg
On 3/22/10 11:47 AM, Scott Carey wrote:
> On Mar 17, 2010, at 9:41 AM, Craig James wrote:
>> [...]
>
> Its fairly easy to break. Just do a parallel import with say, 16 concurrent tables being written to at once. Result? Fragmented tables.

Is this from real-life experience? With fragmentation, there's a point of diminishing return. A couple head-seeks now and then hardly matter. My recollection is that even when there are lots of concurrent processes running that are all making files larger and larger, the Linux file system still can do a pretty good job of allocating mostly-contiguous space. It doesn't just dumbly allocate from some list, but rather tries to allocate in a way that results in pretty good "contiguousness" (if that's a word).

On the other hand, this is just from reading discussion groups like this one over the last few decades, I haven't tried it...

Craig
On Mar 22, 2010, at 4:46 PM, Craig James wrote:
> On 3/22/10 11:47 AM, Scott Carey wrote:
>> [...]
>> Its fairly easy to break. Just do a parallel import with say, 16 concurrent tables being written to at once. Result? Fragmented tables.
>
> Is this from real-life experience? With fragmentation, there's a point of diminishing return. A couple head-seeks now and then hardly matter. My recollection is that even when there are lots of concurrent processes running that are all making files larger and larger, the Linux file system still can do a pretty good job of allocating mostly-contiguous space. It doesn't just dumbly allocate from some list, but rather tries to allocate in a way that results in pretty good "contiguousness" (if that's a word).
>
> On the other hand, this is just from reading discussion groups like this one over the last few decades, I haven't tried it...
>
> Craig

Well, how fragmented is too fragmented depends on the use case and the hardware capability. In real-world use, which for me means about 20 phases of large bulk inserts a day and not a lot of updates or index maintenance, the system gets somewhat fragmented but it's not too bad.

I did a dump/restore in 8.4 with parallel restore and it was much slower than usual. I did a single-threaded restore and it was much faster. The dev environments are on ext3 and we see this pretty clearly -- but poor OS tuning can mask it (readahead parameter not set high enough). This is CentOS 5.4/5.3; perhaps later kernels are better at scheduling file writes to avoid this. We also use the deadline scheduler, which helps a lot on concurrent reads but might be messing up concurrent writes.

On production with xfs this was also bad at first -- in fact worse, because xfs's default 'allocsize' setting is 64k, so files were regularly fragmented in small multiples of 64k. Changing the 'allocsize' parameter to 80MB made the restore process produce files with fragment sizes of 80MB. 80MB is big for most systems, but this array does over 1000MB/sec sequential read at peak, and only 200MB/sec with moderate fragmentation.

It won't fail to allocate disk space due to any 'reservations' of the delayed allocation; it just means that it won't choose to create a new file or extent within 80MB of another file that is open unless it has to. This can cause performance problems if you have lots of small files, which is why the default is 64k.