Thread: out-of-line (TOAST) storage ineffective when loading from dump?
AFAICS, TOAST was invented so that big attributes wouldn't be in the way (of readahead, the buffer cache and so on) when working with the other attributes. This is based on the assumption that the other attributes are accessed more often than the whole contents of the big attributes.

Now I wonder how disk blocks are used when loading a dump with big text data, so that out-of-line storage is used. Suppose disk block usage follows this pattern:

heap page
toast heap page 1
toast heap page ..
toast heap page n
heap page
toast heap page 1
toast heap page ..
toast heap page n
heap page
toast heap page 1
toast heap page ..
toast heap page n
heap page

If, further, the assumption is correct that the granularity of the lower-level caches is bigger than the PostgreSQL page size, then loading from a dump would destroy the advantage of out-of-line storage.

I haven't got any numbers to back this theory up. What do you think?

Markus Bertheau
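For concreteness, here is a minimal sketch of the kind of table under discussion (all names are invented for illustration). With the default EXTENDED storage, large text values are compressed and, if still too big for the row, moved out of line into a separate TOAST relation, which can be located through pg_class:

    -- Hypothetical table: a few short attributes plus one big text attribute.
    CREATE TABLE docs (
        id    serial PRIMARY KEY,
        title text,
        body  text    -- default storage EXTENDED: compress, then move out of line
    );

    -- Find the TOAST relation PostgreSQL created behind the scenes:
    SELECT reltoastrelid::regclass AS toast_table
    FROM pg_class
    WHERE relname = 'docs';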
Markus Bertheau wrote:
> Afaics, TOAST was invented so that big attributes wouldn't be in the
> way (of readahead, the buffer cache and so on) when working with the
> other attributes. This is based on the assumption that the other
> attributes are accessed more often than the whole contents of the big
> attributes.

Actually, I think it was designed as a way of working around PG's 8KB block size. That imposed a maximum row size of the same, which meant you couldn't have, for example, 2 x 5KB text fields.

> Now I wonder how disk blocks are used when loading a dump with big
> text data, so that out-of-line storage is used. Suppose disk block
> usage follows this pattern:
>
> heap page
> toast heap page 1
> toast heap page ..
> toast heap page n

The "toast" pages are stored in a separate table - see the manual for details. There's a whole chapter (53.2) on this.

> If, further, the assumption is correct that the granularity of the
> lower-level caches is bigger than the PostgreSQL page size, then
> loading from a dump would destroy the advantage of out-of-line
> storage.
>
> I haven't got any numbers to back this theory up.

I'm not sure what you mean by this. The page cache will cache individual pages regardless of type, AFAIK. A large data load will probably mess up your cache for other processes. On the other hand, assuming you'll be building indexes etc. too, then it's going to be in cache one way or another.

--
Richard Huxton
Archonet Ltd
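A quick way to see Richard's point, continuing the hypothetical docs table above: a value far larger than one 8KB page is stored without complaint, because it leaves the main heap. (SET STORAGE EXTERNAL disables compression here only so that a highly compressible test value doesn't simply stay inline.)

    -- Force the test value out of line; repeat('x', ...) compresses so
    -- well that it would otherwise remain in the main row.
    ALTER TABLE docs ALTER COLUMN body SET STORAGE EXTERNAL;

    INSERT INTO docs (title, body) VALUES ('big', repeat('x', 100000));

    -- The main heap stays small; the 100KB value lives in the TOAST heap.
    SELECT pg_relation_size('docs')                   AS main_heap_bytes,
           pg_relation_size(reltoastrelid::regclass)  AS toast_heap_bytes
    FROM pg_class
    WHERE relname = 'docs';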
2008/2/19, Richard Huxton <dev@archonet.com>:
> Markus Bertheau wrote:
> > Afaics, TOAST was invented so that big attributes wouldn't be in the
> > way when working with the other attributes.
>
> Actually, I think it was designed as a way of working around PG's 8KB
> block size. That imposed a maximum row size of the same, which meant
> you couldn't have, for example, 2 x 5KB text fields.

Ah, ok. So what I'm talking about is a possible positive side effect of out-of-line storage.

> The "toast" pages are stored in a separate table - see the manual for
> details. There's a whole chapter (53.2) on this.

Yes, but I assume that on disk the pages will be laid out sequentially - not intentionally so, of course. See below.

> I'm not sure what you mean by this. The page cache will cache
> individual pages regardless of type, AFAIK. A large data load will
> probably mess up your cache for other processes. On the other hand,
> assuming you'll be building indexes etc. too, then it's going to be in
> cache one way or another.

I'm loading a table with some short attributes and a large toastable attribute. That means that for every main table heap page, several toast table heap pages are written. This happens through the buffer cache and the background writer, so maybe the pages aren't written in the order in which they were created in the buffer cache, but if they are, they end up on disk (assuming the file system is not fragmented) roughly like this:

main table heap page 1
toast table heap page 1
toast table heap page .
toast table heap page n
main table heap page 2
toast table heap page n+1
toast table heap page .
toast table heap page 2n

Significantly later, a sequential scan of the table has to be made, and the toastable attribute is not needed for the operation. The caches are cold or otherwise occupied. If the granularity of the caches that sit nearer to the disk than the PG buffer cache is larger than the PG page size (i.e. a cache unit is bigger than a PG page), then every read of a main table heap page will inescapably pull some toast table heap pages into the cache (whichever cache that may be). If all the main table heap pages were laid out adjacently on disk, they could be read faster and the caches would be polluted less.

Markus
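One part of this is easy to verify, again using the hypothetical docs table from above: at the PostgreSQL level, a scan that never references the toasted column never reads the TOAST heap at all, which the I/O statistics views show directly. The concern in the thread is specifically about the layers below PostgreSQL's own buffer cache.

    -- A scan that only touches the short attributes never detoasts "body":
    SELECT count(*) FROM docs WHERE title IS NOT NULL;

    -- pg_statio_user_tables separates main-heap from TOAST-heap block I/O;
    -- toast_blks_read/toast_blks_hit should not grow from the query above.
    SELECT heap_blks_read, heap_blks_hit,
           toast_blks_read, toast_blks_hit
    FROM pg_statio_user_tables
    WHERE relname = 'docs';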
On Tue, Feb 19, 2008 at 03:55:27PM +0600, Markus Bertheau wrote:
> > The "toast" pages are stored in a separate table - see the manual for
> > details. There's a whole chapter (53.2) on this.
>
> Yes, but I assume that on disk the pages will be laid out sequentially
> - not intentionally so, of course. See below.

I should hope not. I don't think any modern filesystem works that way. There's a reason why there's no defragmentation tool for many filesystems: it's not an issue these days.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent
> revolution inevitable.
> -- John F Kennedy
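Checking the actual on-disk layout requires OS tools, but finding the files to inspect is straightforward in PostgreSQL versions newer than this thread (pg_relation_filepath appeared later; the function names below assume a modern server):

    -- Locate the files backing the main heap and the TOAST heap; an OS
    -- tool such as ext4's filefrag can then show their extent layout.
    SELECT pg_relation_filepath('docs')                   AS main_file,
           pg_relation_filepath(reltoastrelid::regclass)  AS toast_file
    FROM pg_class
    WHERE relname = 'docs';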
Markus Bertheau wrote:
> 2008/2/19, Richard Huxton <dev@archonet.com>:
> I'm loading a table with some short attributes and a large toastable
> attribute. That means that for every main table heap page, several
> toast table heap pages are written. This happens through the buffer
> cache and the background writer, so maybe the pages aren't written in
> the order in which they were created in the buffer cache, but if they
> are, they end up on disk (assuming the file system is not fragmented)
> roughly like this:
>
> main table heap page 1
> toast table heap page 1
> toast table heap page .
> toast table heap page n
> main table heap page 2
> toast table heap page n+1
> toast table heap page .
> toast table heap page 2n

Well, that's assuming:
1. You're not re-using space from previously deleted/updated rows.
2. You've not got a RAID array striping writes over multiple disks.
3. The underlying filesystem + buffering isn't doing anything too clever.

> Significantly later, a sequential scan of the table has to be made,
> and the toastable attribute is not needed for the operation. The
> caches are cold or otherwise occupied.

OK

> If the granularity of the caches that sit nearer to the disk than the
> PG buffer cache is larger than the PG page size (i.e. a cache unit is
> bigger than a PG page), then every read of a main table heap page will
> inescapably pull some toast table heap pages into the cache (whichever
> cache that may be).

I think disk blocks on your more common filesystems are 4KB by default (ext2/3 and NTFS, for example). I'm not aware of any default disk-block sizes larger than the 8KB page size of PG. Of course, the OS may read ahead if it sees you scanning, but it will do that on a file basis.

> If all the main table heap pages were laid out adjacently on disk,
> they could be read faster and the caches would be polluted less.

True enough. The key word there, though, is "if" - it means that PG is trying to out-think the filesystem, OS and hardware.

It should be easy enough to test on a particular system though:
1. pg_restore a sample table with TOASTed data.
2. pg_restore the same data but no TOASTed data.
3. Cluster the table with TOASTed data (which should force a rewrite of the whole table but not its TOASTed data).

If the timing of various selects differs hugely, then there's something worth investigating.

--
Richard Huxton
Archonet Ltd
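A rough psql sketch of the test Richard outlines (table and index names are invented; \timing only gives wall-clock numbers, and the OS cache would need dropping between runs for a fair cold-cache comparison):

    \timing on

    -- Steps 1+2: scan only the short column, first on the restored table
    -- with TOASTed data, then on the same rows restored without it.
    SELECT count(title) FROM t_toasted;
    SELECT count(title) FROM t_plain;

    -- Step 3: CLUSTER rewrites the main heap into a fresh, contiguous
    -- file; if interleaving was the problem, this scan should be faster.
    CLUSTER t_toasted USING t_toasted_pkey;
    SELECT count(title) FROM t_toasted;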
2008/2/19, Richard Huxton <dev@archonet.com>:
> Markus Bertheau wrote:
> > 2008/2/19, Richard Huxton <dev@archonet.com>:
> > I'm loading a table with some short attributes and a large toastable
> > attribute. That means that for every main table heap page, several
> > toast table heap pages are written. This happens through the buffer
> > cache and the background writer, so maybe the pages aren't written in
> > the order in which they were created in the buffer cache, but if they
> > are, they end up on disk (assuming the file system is not fragmented)
> > roughly like this:
> >
> > main table heap page 1
> > toast table heap page 1
> > toast table heap page .
> > toast table heap page n
> > main table heap page 2
> > toast table heap page n+1
> > toast table heap page .
> > toast table heap page 2n
>
> Well, that's assuming:
> 1. You're not re-using space from previously deleted/updated rows.
> 2. You've not got a RAID array striping writes over multiple disks.
> 3. The underlying filesystem + buffering isn't doing anything too clever.
>
> I think disk blocks on your more common filesystems are 4KB by default
> (ext2/3 and NTFS, for example). I'm not aware of any default
> disk-block sizes larger than the 8KB page size of PG. Of course, the
> OS may read ahead if it sees you scanning, but it will do that on a
> file basis.

Ok, it turns out that I was largely unaware of how smart file systems are nowadays. The whole story looks like a good example of how PostgreSQL relies on the file system and its caches for performant operation.

Thanks

Markus Bertheau