Thread: out-of-line (TOAST) storage ineffective when loading from dump?

out-of-line (TOAST) storage ineffective when loading from dump?

From
"Markus Bertheau"
Date:
Afaics, TOAST was invented so that big attributes wouldn't be in the
way (of readahead, the buffer cache and so on) when working with the
other attributes. This is based on the assumption that the other
attributes are accessed more often than the whole contents of the big
attributes.

Now I wonder how disk blocks are used when loading a dump with big
text data, so that out-of-line storage comes into play. Suppose disk
block usage follows this pattern:

heap page
toast heap page 1
toast heap page ..
toast heap page n
heap page
toast heap page 1
toast heap page ..
toast heap page n
heap page
toast heap page 1
toast heap page ..
toast heap page n
heap page

If, further, the assumption is correct that the granularity of the
lower-level caches is bigger than the PostgreSQL page size, then
loading from a dump would destroy the advantage of out-of-line
storage.

I haven't got any numbers to back this theory up.

What do you think?

Markus Bertheau

Re: out-of-line (TOAST) storage ineffective when loading from dump?

From
Richard Huxton
Date:
Markus Bertheau wrote:
> Afaics, TOAST was invented so that big attributes wouldn't be in the
> way (of readahead, the buffer cache and so on) when working with the
> other attributes. This is based on the assumption that the other
> attributes are accessed more often than the whole contents of the big
> attributes.

Actually, I think it was designed as a way of working around PG's 8KB
block size. That imposed a maximum row size of the same, which meant
you couldn't have 2 x 5KB text fields, for example.
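A quick arithmetic sketch of that limit (assuming the default 8KB page
size; tuple and page header overhead are ignored for simplicity, so
the real usable space per page is a bit smaller):

```python
# Back-of-the-envelope sketch of the pre-TOAST row-size limit.
# PAGE_SIZE is the default PostgreSQL block size; per-tuple and
# per-page header overhead is deliberately ignored here.
PAGE_SIZE = 8192  # bytes

def fits_in_page(field_sizes):
    """Would a row with these field sizes fit in one page (ignoring overhead)?"""
    return sum(field_sizes) <= PAGE_SIZE

# Two 5KB text fields total 10240 bytes > 8192, so without
# out-of-line storage such a row simply could not be stored.
print(fits_in_page([5 * 1024, 5 * 1024]))  # False
print(fits_in_page([5 * 1024, 2 * 1024]))  # True
```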

> Now I wonder how disk blocks are used when loading a dump with big
> text data so that out-of-line storage is used. If disk block usage was
> following this pattern:
>
> heap page
> toast heap page 1
> toast heap page ..
> toast heap page n

The "toast" pages are stored in a separate table - see manual for
details. There's a whole chapter (53.2) on this.

> If, further, the assumption is correct that the granularity of the
> lower-level caches is bigger than the PostgreSQL page size, then
> loading from a dump would destroy the advantage of out-of-line
> storage.
>
> I haven't got any numbers to back this theory up.

I'm not sure what you mean by this. The page-cache will cache individual
pages regardless of type afaik. A large data-load will probably mess up
your cache for other processes. On the other hand, assuming you'll be
building indexes etc. too then it's going to be in cache one way or another.

--
   Richard Huxton
   Archonet Ltd

Re: out-of-line (TOAST) storage ineffective when loading from dump?

From
"Markus Bertheau"
Date:
2008/2/19, Richard Huxton <dev@archonet.com>:
> Markus Bertheau wrote:
> > Afaics, TOAST was invented so that big attributes wouldn't be in the
> > way when working with the other attributes.
>
> Actually, I think it was designed as a way of working around PG's 8KB
> block size. That imposed a maximum row size of the same, which meant
> you couldn't have 2 x 5KB text fields, for example.

Ah, ok. So what I'm talking about is a possible positive side effect of
out-of-line storage.

> The "toast" pages are stored in a separate table - see manual for
> details. There's a whole chapter (53.2) on this.

Yes, but I assume that on disk the pages will be laid out sequentially
- not intentionally so, of course. See below.

> I'm not sure what you mean by this. The page-cache will cache individual
> pages regardless of type afaik. A large data-load will probably mess up
> your cache for other processes. On the other hand, assuming you'll be
> building indexes etc. too then it's going to be in cache one way or another.

I'm loading a table with some short attributes and a large toastable
attribute. That means that for every main table heap page, several
toast table heap pages are written. This happens through the buffer
cache and the background writer, so the pages may not be written in
the order in which they were created in the buffer cache; but if they
are, they end up on disk (assuming that the file system is not
fragmented) roughly like this:

main table heap page 1
toast table heap page 1
toast table heap page .
toast table heap page n
main table heap page 2
toast table heap page n+1
toast table heap page .
toast table heap page 2n

Significantly later, a sequential scan of the table has to be made;
the toastable attribute is not needed for the operation. The caches
are cold or otherwise occupied. If the granularity of the caches that
sit nearer to the disk than the PG buffer cache is coarser than the PG
page size (i.e. a cache unit is bigger than a PG page), then every
read of a main table heap page will inescapably pull some toast table
heap pages into the cache (whichever cache that may be). If all the
main table heap pages were laid out adjacently on disk, they could be
read faster and the caches polluted less.
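As an illustration, here is a toy model of that layout, not a
measurement: the numbers for toast pages per main page and for the
cache/readahead unit size are made-up assumptions. It counts how many
8KB pages a main-table-only scan would drag through a cache that
fetches several consecutive pages at a time:

```python
# Toy model: each main-table heap page is followed on disk by
# TOAST_PER_MAIN toast pages, and the lower-level cache fetches
# CACHE_UNIT consecutive 8KB pages at once. Both constants are
# illustrative assumptions, not measured values.
MAIN_PAGES = 1000
TOAST_PER_MAIN = 7   # assumption: ~56KB of toasted data per main page
CACHE_UNIT = 4       # assumption: cache/readahead unit of 4 PG pages (32KB)

stride = 1 + TOAST_PER_MAIN  # disk distance between consecutive main pages
main_positions = [i * stride for i in range(MAIN_PAGES)]

# Interleaved layout: which cache units must be fetched to read
# every main page, and how many pages does that transfer in total?
units_fetched = {pos // CACHE_UNIT for pos in main_positions}
pages_read = len(units_fetched) * CACHE_UNIT

# Adjacent layout: main pages packed together, no toast pages between.
adjacent_units = {i // CACHE_UNIT for i in range(MAIN_PAGES)}
adjacent_pages = len(adjacent_units) * CACHE_UNIT

print(pages_read)      # 4000 -- 4x the data the scan actually needs
print(adjacent_pages)  # 1000 -- adjacent layout reads only what it needs
```

Under these assumed numbers, every 32KB fetch carries one main page
and three toast pages, so the interleaved layout transfers four times
as much data as an adjacent layout would.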

Markus

Re: out-of-line (TOAST) storage ineffective when loading from dump?

From
Martijn van Oosterhout
Date:
On Tue, Feb 19, 2008 at 03:55:27PM +0600, Markus Bertheau wrote:
> > The "toast" pages are stored in a separate table - see manual for
> > details. There's a whole chapter (53.2) on this.
>
> Yes, but I assume that on disk the pages will be laid out sequentially
> - not intentionally so, of course. See below.

I should hope not. I don't think any modern filesystem works that way.
There's a reason there's no defragmentation tool for many filesystems:
it's not an issue these days.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Those who make peaceful revolution impossible will make violent revolution inevitable.
>  -- John F Kennedy


Re: out-of-line (TOAST) storage ineffective when loading from dump?

From
Richard Huxton
Date:
Markus Bertheau wrote:
> 2008/2/19, Richard Huxton <dev@archonet.com>:
> I'm loading a table with some short attributes and a large toastable attribute.
> That means that for every main table heap page several toast table heap pages
> are written. This happens through the buffer cache and the background writer,
> so maybe the pages aren't written in the order in which they were created in
> the buffer cache, but if they are, they end up on disk (assuming that the file
> system is not fragmented) roughly like that:
>
> main table heap page 1
> toast table heap page 1
> toast table heap page .
> toast table heap page n
> main table heap page 2
> toast table heap page n+1
> toast table heap page .
> toast table heap page 2n

Well, that's assuming:
1. You're not re-using space from previously deleted/updated rows.
2. You've not got a RAID array striping writes over multiple disks.
3. The underlying filesystem + buffering isn't doing anything too clever.

> Significantly later a sequential scan of the table has to be made, the
> toastable attribute is not needed for the operation. The caches are cold or
> otherwise occupied.

OK

> If the granularity of caches that are nearer to the disk in
> the cache hierarchy than the PG buffer cache is higher than the PG page size
> (i.e. a cache unit is bigger than the PG page size), then every read of a main
> table heap page will inescapably read some toast table heap pages into the
> cache (whichever cache that may be).

I think disk blocks on your more common filesystems are 4KB by default
(ext2/3 and NTFS, for example). I'm not aware of any default
disk-block sizes larger than the 8KB page size of PG. Of course, the
OS may read ahead if it sees you scanning, but it will do that on a
per-file basis.

> If all the main table heap pages were laid out adjacently on disk,
> they could be read faster and the caches polluted less.

True enough. The key word there though is "if" - it means that PG is
trying to out-think the filesystem, OS and hardware.

It should be easy enough to test on a particular system though.
1. pg_restore a sample table with TOASTed data.
2. pg_restore the same data but no TOASTed data.
3. cluster the table with TOASTed data (which should force a rewrite of
the whole table but not its TOASTed data).

If the timing of various selects differ hugely then there's something
worth investigating.

--
   Richard Huxton
   Archonet Ltd

Re: out-of-line (TOAST) storage ineffective when loading from dump?

From
"Markus Bertheau"
Date:
2008/2/19, Richard Huxton <dev@archonet.com>:
> Markus Bertheau wrote:
> > 2008/2/19, Richard Huxton <dev@archonet.com>:
> > I'm loading a table with some short attributes and a large toastable attribute.
> > That means that for every main table heap page several toast table heap pages
> > are written. This happens through the buffer cache and the background writer,
> > so maybe the pages aren't written in the order in which they were created in
> > the buffer cache, but if they are, they end up on disk (assuming that the file
> > system is not fragmented) roughly like that:
> >
> > main table heap page 1
> > toast table heap page 1
> > toast table heap page .
> > toast table heap page n
> > main table heap page 2
> > toast table heap page n+1
> > toast table heap page .
> > toast table heap page 2n
>
> Well, that's assuming:
> 1. You're not re-using space from previously deleted/updated rows.
> 2. You've not got a RAID array striping writes over multiple disks
> 3. The underlying filesystem + buffering isn't doing anything too clever.

> I think disk blocks on your more common file-systems are 4KB by default
> (ext2/3 and ntfs for example). I'm not aware of any default disk-block
> sizes more than the 8KB page-size of PG. Of course, the OS may read
> ahead if it sees you scanning, but it will do that on a file basis.

OK, it turns out that I was largely unaware of how smart file systems
are nowadays. The whole story looks like a good example of how
PostgreSQL relies on the file system and its caches for performant
operation.

Thanks

Markus Bertheau