>> >If we extended relations by more than 8k at a time, we would know a lot
>> >more about disk layout, at least on filesystems with a decent amount of
>> >free space.
>>
>> I doubt it makes that much difference. If there was a significant amount
>> of fragmentation, we'd hear more complaints about seq scan performance.
>>
>> The issue here is that we don't know which relations are on which drives
>> and controllers, how they're striped, mirrored etc.
>
> Actually, isn't pre-allocation one of the tricks that Greenplum uses to
> get it's seqscan performance?
My tests here show that, at least on reiserfs, after a few hours of
benchmark torture (this represents several million write queries), table
files become significantly fragmented. I believe the table and index files
get extended more or less simultaneously and end up somehow a bit mixed up
on disk. Seq scan perf suffers. reiserfs doesn't have an excellent
fragmentation behaviour... NTFS is worse than hell in this respect. So,
pre-alloc could be a good idea. Brutal Defrag (cp /var/lib/postgresql to
somewhere and back) gets seq scan perf back to disk throughput.
Also, by the way, InnoDB uses a BTree organized table. The advantage is
that data is always clustered on the primary key (which means you have to
use something as your primary key that isn't necessary "natural", you have
to choose it to get good clustering, and you can't always do it right, so
it somehow, in the end, sucks rather badly). Anyway, seq-scan on InnoDB is
very slow because, as the btree grows (just like postgres indexes) pages
are split and scanning the pages in btree order becomes a mess of seeks.
So, seq scan in InnoDB is very very slow unless periodic OPTIMIZE TABLE is
applied. (caveat to the postgres TODO item "implement automatic table
clustering"...)