Greg Stark <gsstark@MIT.EDU> writes:
> Well my theory was sort of half right. It has nothing to do with fooling Linux
> into thinking it's a sequential read. Apparently this filesystem was created
> with 32k blocks. I don't remember if that was intentional or if ext2/3 did it
> automatically based on the size of the filesystem.
>
> So it doesn't have wide-ranging implications for Postgres's default 8k block
> size. But it is a good lesson about the importance of not using a larger
> filesystem block than Postgres's block size. The net effect is that if the
> filesystem block is N*8k then your random_page_cost goes up by a factor of N.
> That could be devastating for OLTP performance.

Hm, apparently I spoke too soon. tune2fs says the block size is in fact 4k.
Yet the block reading test program, run with 4k or 8k blocks, behaves as if
Linux is reading 32k blocks. And in fact when I run it with 32k blocks I get
the kind of behaviour we were expecting, where the breakeven point is around
20%.

So it's not the 8k block reading that's fooling Linux into reading ahead 32k.
It seems 32k readahead is the default for Linux, or perhaps it's the
sequential access pattern that's triggering it.

I'm trying to test it with posix_fadvise() set to random access, but I'm having
trouble compiling. Do I need a special #define to get posix_fadvise() from
glibc?
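
For reference, glibc declares posix_fadvise() in <fcntl.h> only when a
suitable feature test macro is in effect (_XOPEN_SOURCE >= 600 or
_POSIX_C_SOURCE >= 200112L). Depending on the glibc version and compiler
flags, that may have to be defined by hand before the first #include. Below is
a minimal sketch, assuming Linux with glibc. The helper names advise_random()
and read_block() are hypothetical and are not taken from the test program
discussed above.

```c
/* Must appear before any #include so that glibc exposes posix_fadvise(). */
#define _XOPEN_SOURCE 600

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hint to the kernel that fd will be read in random order.
 * Returns 0 on success or an error number on failure
 * (posix_fadvise() returns the error directly rather than setting errno).
 */
int advise_random(int fd)
{
    /* offset 0 with len 0 means "from the start to the end of the file" */
    return posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);
}

/*
 * Read one block of block_size bytes at block index blkno.
 * Returns the number of bytes read, 0 at end of file, or -1 on error.
 */
ssize_t read_block(int fd, void *buf, size_t block_size, off_t blkno)
{
    return pread(fd, buf, block_size, blkno * (off_t) block_size);
}
```

The advice is only a hint, so whether it actually turns off the 32k readahead
still has to be measured by rerunning the block reading test with and without
the advise_random() call.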
--
greg