Thread: Block at a time ...
I agree with Tom; any reordering attempt is at best second-guessing the filesystem and underlying storage.
However, having the ability to control the extent size would be a worthwhile improvement for systems that walk and chew gum (write to lots of tables) concurrently.
I'm thinking of Oracle's AUTOEXTEND settings for tablespace datafiles .... I think the ideal way to do it for PG would be to make the equivalent configurable system-wide in postgresql.conf, and allow specific per-table settings in the SQL metadata, similar to autovacuum.
An awesomely simple alternative is to just specify the extension as e.g. 5% of the existing table size .... it starts by adding one block at a time for tiny tables, and once your table is over 20GB, it ends up adding a whole 1GB file and pre-allocating it. Very little wastage.
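For illustration, here's a rough sketch of that sizing rule in C (the function name and the cap at the 1GB segment boundary are just assumptions for the example, not existing PostgreSQL code):

#include <stdint.h>

#define BLCKSZ      8192
#define MAX_EXTEND  (((uint64_t) 1 << 30) / BLCKSZ)   /* cap at one 1 GB segment */

/*
 * Hypothetical helper: how many blocks to add when a relation fills up,
 * growing by roughly 5% of its current size.  A tiny table still grows a
 * block at a time; at ~20 GB (2,621,440 blocks) the increment reaches a
 * full 1 GB segment.
 */
static uint64_t
blocks_to_extend(uint64_t current_blocks)
{
    uint64_t    extend = current_blocks / 20;   /* ~5% of current size */

    if (extend < 1)
        extend = 1;
    if (extend > MAX_EXTEND)
        extend = MAX_EXTEND;
    return extend;
}

At 8kB blocks that means a 160kB table is still extended one block at a time, while anything past 20GB gets 131,072 blocks (a whole 1GB segment) per extension.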
Cheers
Dave
On Tue, Mar 16, 2010 at 4:49 PM, Alvaro Herrera <alvherre@commandprompt.com> wrote:
Tom Lane escribió:
> Alvaro Herrera <alvherre@commandprompt.com> writes:
> > Maybe it would make more sense to try to reorder the fsync calls
> > instead.
>
> Reorder to what, though? You still have the problem that we don't know
> much about the physical layout on-disk.

Well, to block numbers as a first step.
However, this reminds me that sometimes we take the block-at-a-time
extension policy too seriously. We had a customer who had a
performance problem because they were inserting lots of data into TOAST
tables, causing very frequent extensions. I kept wondering whether an
allocation policy that allocated several new blocks at a time could be
useful (but I didn't try it). This would also alleviate fragmentation,
thus helping the physical layout be more similar to logical block
numbers.
Dave Crooke escribió:
> An awesomely simple alternative is to just specify the extension as e.g. 5%
> of the existing table size .... it starts by adding one block at a time for
> tiny tables, and once your table is over 20GB, it ends up adding a whole 1GB
> file and pre-allocating it. Very little wastage.

I was thinking in something like that, except that the factor I'd use would be something like 50% or 100% of current size, capped at (say) 1 GB.

--
Alvaro Herrera                          http://www.CommandPrompt.com/
PostgreSQL Replication, Consulting, Custom Development, 24x7 support
> I was thinking in something like that, except that the factor I'd use
> would be something like 50% or 100% of current size, capped at (say) 1
> GB.

Using fallocate() ?
On Wed, Mar 17, 2010 at 7:32 AM, Pierre C <lists@peufeu.com> wrote:
>> I was thinking in something like that, except that the factor I'd use
>> would be something like 50% or 100% of current size, capped at (say) 1 GB.

This turns out to be a bad idea. One of the first things Oracle DBAs are told to do is change this default setting to allocate some reasonably large fixed size rather than scaling upwards.

This might be mostly due to Oracle's extent-based space management, but I'm not so sure. Recall that the filesystem is probably doing some rounding itself. If you allocate 120kB it's probably allocating 128kB itself anyway. Having two layers rounding up will result in odd behaviour.

In any case I was planning on doing this a while back. Then I ran some experiments and couldn't actually demonstrate any problem. ext2 seems to do a perfectly reasonable job of avoiding this problem. All the files were mostly large contiguous blocks after running some tests -- IIRC running pgbench.

> Using fallocate() ?

I think we need posix_fallocate().

--
greg
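For concreteness, a minimal sketch of pre-extending a segment file with posix_fallocate() -- the file name and extension size here are invented, and this is not actual PostgreSQL code:

#define _XOPEN_SOURCE 600

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ        8192
#define EXTEND_BLOCKS 131072                   /* reserve 1 GB in one call */

int
main(void)
{
    int     fd = open("16384.1", O_RDWR);      /* hypothetical segment file */
    off_t   end;
    int     rc;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    end = lseek(fd, 0, SEEK_END);              /* current end of the segment */

    /* Ask the filesystem to reserve the space without us writing it. */
    rc = posix_fallocate(fd, end, (off_t) EXTEND_BLOCKS * BLCKSZ);
    if (rc != 0)                               /* returns an errno value, not -1 */
        fprintf(stderr, "posix_fallocate: %s\n", strerror(rc));

    close(fd);
    return rc ? 1 : 0;
}

The appeal is that the filesystem gets to pick one contiguous chunk up front instead of growing the file 8kB at a time.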
Greg Stark <gsstark@mit.edu> writes:
> I think we need posix_fallocate().

The problem with posix_fallocate (other than questionable portability) is that it doesn't appear to guarantee anything at all about what is in the space it allocates. Worst case, we might find valid-looking Postgres data there (eg, because a block was recycled from some recently dropped table). If we have to write something anyway to zero the space, what's the point?

			regards, tom lane
Greg - with Oracle, I always do fixed 2GB DBFs for portability, and preallocate the whole file in advance. However, the situation is a bit different in that Oracle will put blocks from multiple tables and indexes in a DBF if you don't tell it otherwise.
Tom - I'm not sure what Oracle does, but it literally writes the whole extent before using it .... I think they are just doing the literal equivalent of dd if=/dev/zero ... it takes several seconds to prep a 2GB file on decent storage.
Cheers
Dave
On Wed, Mar 17, 2010 at 9:27 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The problem with posix_fallocate (other than questionable portability)
> is that it doesn't appear to guarantee anything at all about what is in
> the space it allocates. Worst case, we might find valid-looking
> Postgres data there (eg, because a block was recycled from some recently
> dropped table). If we have to write something anyway to zero the space,
> what's the point?
>
> 			regards, tom lane
On 3/17/10 2:52 AM, Greg Stark wrote:
> On Wed, Mar 17, 2010 at 7:32 AM, Pierre C <lists@peufeu.com> wrote:
>>> I was thinking in something like that, except that the factor I'd use
>>> would be something like 50% or 100% of current size, capped at (say) 1 GB.
>
> This turns out to be a bad idea. One of the first things Oracle DBAs
> are told to do is change this default setting to allocate some
> reasonably large fixed size rather than scaling upwards.
>
> This might be mostly due to Oracle's extent-based space management but
> I'm not so sure. Recall that the filesystem is probably doing some
> rounding itself. If you allocate 120kB it's probably allocating 128kB
> itself anyways. Having two layers rounding up will result in odd
> behaviour.
>
> In any case I was planning on doing this a while back. Then I ran some
> experiments and couldn't actually demonstrate any problem. ext2 seems
> to do a perfectly reasonable job of avoiding this problem. All the
> files were mostly large contiguous blocks after running some tests --
> IIRC running pgbench.

This is one of the more-or-less solved problems in Unix/Linux. Ext* file systems have a "reserve", usually of 10% of the disk space, that nobody except root can use. It's not for root, it's because with 10% of the disk free, you can almost always do a decent job of allocating contiguous blocks and get good performance. Unless Postgres has some weird problem that Linux has never seen before (and that wouldn't be unprecedented...), there's probably no need to fool with file-allocation strategies.

Craig
Greg is correct, as usual. Geometric growth of files is A Bad Thing in an Oracle DBA's world, since you can unexpectedly (automatically?) run out of file system space when the database determines it needs x% more extents than last time.

The concept of contiguous extents, however, has some merit, particularly when restoring databases. Prior to parallel restore, a table's files were created and extended in roughly contiguous allocations, presuming there was no other activity on your database disks. (You do dedicate disks, don't you?) When using 8-way parallel restore against a six-disk RAID 10 group I found that table and index scan performance dropped by about 10x. I/O performance was restored by either clustering the tables one at a time, or by dropping and restoring them one at a time. The only reason I can come up with for this behavior is file fragmentation and increased seek times.

If PostgreSQL had a mechanism to pre-allocate files prior to restoring the database, that might mitigate the problem. Then if we could only get parallel index operations ...

Bob Lunney

--- On Wed, 3/17/10, Greg Stark <gsstark@mit.edu> wrote:
> [...]
> This turns out to be a bad idea. One of the first things Oracle DBAs
> are told to do is change this default setting to allocate some
> reasonably large fixed size rather than scaling upwards.
> [...]
On Mar 17, 2010, at 9:41 AM, Craig James wrote:
> On 3/17/10 2:52 AM, Greg Stark wrote:
>> [...]
>> In any case I was planning on doing this a while back. Then I ran some
>> experiments and couldn't actually demonstrate any problem. ext2 seems
>> to do a perfectly reasonable job of avoiding this problem. All the
>> files were mostly large contiguous blocks after running some tests --
>> IIRC running pgbench.
>
> This is one of the more-or-less solved problems in Unix/Linux. Ext* file systems have a "reserve", usually of 10% of the disk space, that nobody except root can use. It's not for root, it's because with 10% of the disk free, you can almost always do a decent job of allocating contiguous blocks and get good performance. Unless Postgres has some weird problem that Linux has never seen before (and that wouldn't be unprecedented...), there's probably no need to fool with file-allocation strategies.
>
> Craig

It's fairly easy to break. Just do a parallel import with, say, 16 concurrent tables being written to at once. Result? Fragmented tables.
>> This is one of the more-or-less solved problems in Unix/Linux. Ext*
>> file systems have a "reserve", usually of 10% of the disk space, that
>> nobody except root can use. It's not for root, it's because with 10%
>> of the disk free, you can almost always do a decent job of allocating
>> contiguous blocks and get good performance. Unless Postgres has some
>> weird problem that Linux has never seen before (and that wouldn't be
>> unprecedented...), there's probably no need to fool with
>> file-allocation strategies.
>>
>> Craig
>
> Its fairly easy to break. Just do a parallel import with say, 16
> concurrent tables being written to at once. Result? Fragmented tables.

Delayed allocation (ext4, XFS) helps a lot for concurrent writing at a medium-high rate (a few megabytes per second and up) when lots of data can sit in the cache and be flushed/allocated as big contiguous chunks. I'm pretty sure ext4/XFS would pass your parallel import test.

However, if you have files like tables (and indexes) or logs that grow slowly over time (something like a few megabytes per hour or less), after a few days/weeks/months horrible fragmentation is an almost guaranteed result on many filesystems (NTFS being perhaps the absolute worst).
This is why pre-allocation is a good idea if you have the space ....
Tom, what about a really simple command in a forthcoming release of PG that would just preformat a 1GB file at a time? This is what I've always done scripted with Oracle (ALTER TABLESPACE foo ADD DATAFILE ....) rather than relying on its autoextender when performance has been a concern.
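To illustrate, such a preformat step would amount to roughly the following (a sketch only -- the segment file name is made up, and this is just the dd if=/dev/zero equivalent mentioned earlier, not a proposed implementation):

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define SEGMENT_BYTES ((off_t) 1 << 30)        /* preformat 1 GB at a time */
#define CHUNK_BYTES   (1024 * 1024)

int
main(void)
{
    static char zeros[CHUNK_BYTES];            /* zero-initialized buffer */
    int         fd = open("16384.2", O_WRONLY | O_CREAT | O_APPEND, 0600);
    off_t       written = 0;

    if (fd < 0)
    {
        perror("open");
        return 1;
    }

    /* Write zeros until the new segment is fully allocated on disk. */
    while (written < SEGMENT_BYTES)
    {
        ssize_t n = write(fd, zeros, sizeof(zeros));

        if (n < 0)
        {
            perror("write");
            return 1;
        }
        written += n;
    }

    if (fsync(fd) != 0)                        /* force the allocation out to disk */
        perror("fsync");
    close(fd);
    return 0;
}

As noted above, writing zeros like this costs a few seconds per gigabyte on decent storage, which is why it makes more sense as an explicit, scripted step than as something done on every extension.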
Cheers
Dave
On Mon, Mar 22, 2010 at 3:55 PM, Pierre C <lists@peufeu.com> wrote:
>>> This is one of the more-or-less solved problems in Unix/Linux. Ext* file
>>> systems have a "reserve" usually of 10% of the disk space that nobody
>>> except root can use. [...]
>>>
>>> Craig
>>
>> Its fairly easy to break. Just do a parallel import with say, 16 concurrent
>> tables being written to at once. Result? Fragmented tables.
>
> Delayed allocation (ext4, XFS) helps a lot for concurrent writing at a
> medium-high rate (a few megabytes per second and up) when lots of data can
> sit in the cache and be flushed/allocated as big contiguous chunks. I'm
> pretty sure ext4/XFS would pass your parallel import test.
>
> However if you have files like tables (and indexes) or logs that grow
> slowly over time (something like a few megabytes per hour or less), after a
> few days/weeks/months, horrible fragmentation is an almost guaranteed
> result on many filesystems (NTFS being perhaps the absolute worst).
On Mon, Mar 22, 2010 at 6:47 PM, Scott Carey <scott@richrelevance.com> wrote:
> Its fairly easy to break. Just do a parallel import with say, 16 concurrent tables being written to at once. Result? Fragmented tables.

FWIW I did do some investigation about this at one point and could not demonstrate any significant fragmentation. But that was on Linux -- different filesystem implementations would have different success rates. And there could be other factors as well, such as how full the filesystem is or how old it is.

--
greg
On 3/22/10 11:47 AM, Scott Carey wrote:
> On Mar 17, 2010, at 9:41 AM, Craig James wrote:
>> [...]
>
> Its fairly easy to break. Just do a parallel import with say, 16 concurrent tables being written to at once. Result? Fragmented tables.

Is this from real-life experience? With fragmentation, there's a point of diminishing return. A couple head-seeks now and then hardly matter. My recollection is that even when there are lots of concurrent processes running that are all making files larger and larger, the Linux file system still can do a pretty good job of allocating mostly-contiguous space. It doesn't just dumbly allocate from some list, but rather tries to allocate in a way that results in pretty good "contiguousness" (if that's a word).

On the other hand, this is just from reading discussion groups like this one over the last few decades, I haven't tried it...

Craig
On Mar 22, 2010, at 4:46 PM, Craig James wrote:
> On 3/22/10 11:47 AM, Scott Carey wrote:
>> [...]
>> Its fairly easy to break. Just do a parallel import with say, 16 concurrent tables being written to at once. Result? Fragmented tables.
>
> Is this from real-life experience? With fragmentation, there's a point of diminishing return. A couple head-seeks now and then hardly matter. My recollection is that even when there are lots of concurrent processes running that are all making files larger and larger, the Linux file system still can do a pretty good job of allocating mostly-contiguous space. It doesn't just dumbly allocate from some list, but rather tries to allocate in a way that results in pretty good "contiguousness" (if that's a word).
>
> On the other hand, this is just from reading discussion groups like this one over the last few decades, I haven't tried it...
>
> Craig

Well, how fragmented is too fragmented depends on the use case and the hardware capability. In real-world use, which for me means about 20 phases of large bulk inserts a day and not a lot of updates or index maintenance, the system gets somewhat fragmented but it's not too bad.

I did a dump/restore in 8.4 with parallel restore and it was much slower than usual. I did a single-threaded restore and it was much faster. The dev environments are on ext3 and we see this pretty clearly -- but poor OS tuning can mask it (readahead parameter not set high enough). This is CentOS 5.4/5.3; perhaps later kernels are better at scheduling file writes to avoid this. We also use the deadline scheduler, which helps a lot on concurrent reads but might be messing up concurrent writes.

On production with xfs this was also bad at first -- in fact worse, because xfs's default 'allocsize' setting is 64k, so files were regularly fragmented in small multiples of 64k. Changing the 'allocsize' parameter to 80MB made the restore process produce files with fragment sizes of 80MB. 80MB is big for most systems, but this array does over 1000MB/sec sequential read at peak, and only 200MB/sec with moderate fragmentation.

It won't fail to allocate disk space due to any 'reservations' of the delayed allocation; it just means that it won't choose to create a new file or extent within 80MB of another file that is open unless it has to. This can cause performance problems if you have lots of small files, which is why the default is 64k.