Thread: Reducing the size of BufferTag & remodeling forks

Reducing the size of BufferTag & remodeling forks

From

Andres Freund

Date:

02 July 2015, 13:36:30

Hi,

I've complained a number of times that our BufferTag is ridiculously
large:
typedef struct buftag
{ RelFileNode rnode; /* physical relation identifier */ ForkNumber forkNum; BlockNumber blockNum;
/*blknum relative to begin of reln */

} BufferTag;

typedef struct RelFileNode
{ Oid spcNode; /* tablespace */ Oid dbNode; /* database */ Oid relNode;
/* relation */

} RelFileNode;

that amounts to 20 bytes. That's problematic because we frequently have
to compare or hash the entire buffer tag. Comparing 20bytes is rather
branch intensive, and shows up noticably on profiles. It's also a
stumbling block on the way to a smarter buffer mapping data structure,
because it makes e.g. trees rather deep.

The buffer tag is currently used in two situations:

1) Dealing with the buffer mapping, we need to identify the underlying file uniquely and we need the block number (8
bytes).

2) When writing out the a block we need, in addition to 1), have information about where to store the file. That
requiresthe tablespace and database.

You may know that a filenode (RelFileNode->relNode) is currently *not*
unique across databases and tablespaces.

Additionally you might have noticed that the above description also
disregards relation forks.

I think we should work towards 1) being sufficient for its purpose. My
suggestion to get there is twofold:

1) Introduce a shared pg_relfilenode table. Every table, even shared/nailed ones, get an entry therein. It's there to
makeit possibly to uniquely allocate relfilenodes across databases & tablespaces.

2) Replace relation forks, with the exception of the init fork which is special anyway, with separate relfilenodes.
Storedin seperate columns in pg_class.

This scheme has a number of advantages: We don't need to look at the
filesystem anymore to find out whether a relfilenode exists. The buffer
tags are 8 bytes. The number of stats doesn't scale O(#forks *
#relations) anymore, allowing us to add additional forks more easily.

I think something akin to init forks is going to survive because they've
to be copied without access to the catalogs - but that's fine, they just
aren't allowed to go through shared buffers. Afaics that's not a
problem.

Obviously this is a rather high-level description, but right now this
sounds doable to me.

Thoughts?

- Andres

Re: Reducing the size of BufferTag & remodeling forks

From

Tom Lane

Date:

02 July 2015, 13:52:06

Andres Freund <andres@anarazel.de> writes:
> 1) Introduce a shared pg_relfilenode table. Every table, even
>    shared/nailed ones, get an entry therein. It's there to make it
>    possibly to uniquely allocate relfilenodes across databases &
>    tablespaces.
> 2) Replace relation forks, with the exception of the init fork which is
>    special anyway, with separate relfilenodes. Stored in seperate
>    columns in pg_class.

> Thoughts?

I'm concerned about the traffic and contention involved with #1.
I'm also concerned about the assumption that relfilenode should,
or even can be, unique across an entire installation.  (I suppose
widening it to 8 bytes would fix some of the hazards there, but
that bloats your buffer tag again.)

But here's the big problem: you're talking about a huge amount of
work for what seems likely to be a microscopic improvement in some
operations.  Worse, we'll be taking penalties for other operations.
How will you do DropDatabaseBuffers() for instance?

CREATE DATABASE is going to be a problem, too.
        regards, tom lane

Re: Reducing the size of BufferTag & remodeling forks

From

Andres Freund

Date:

02 July 2015, 14:07:55

On 2015-07-02 09:51:59 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > 1) Introduce a shared pg_relfilenode table. Every table, even
> >    shared/nailed ones, get an entry therein. It's there to make it
> >    possibly to uniquely allocate relfilenodes across databases &
> >    tablespaces.
> > 2) Replace relation forks, with the exception of the init fork which is
> >    special anyway, with separate relfilenodes. Stored in seperate
> >    columns in pg_class.
> 
> > Thoughts?
> 
> I'm concerned about the traffic and contention involved with #1.

I don't think that'll be that significant in comparison to all the other
work done when creating a relation. Unless we do something wrong it'll
be highly unlikely to get row level contention, as the oids of the
individual relations will be from the oid counter or something similar.

> I'm also concerned about the assumption that relfilenode should,
> or even can be, unique across an entire installation.  (I suppose
> widening it to 8 bytes would fix some of the hazards there, but
> that bloats your buffer tag again.)

Why? Because it limits the number of relations & forks we can have to
2**32? That seems like an extraordinary large limit? The catalog sizes
(pg_attribute most prominently) are a problem at a much lower number of
relations than that. Also rel/catcache management.

> But here's the big problem: you're talking about a huge amount of
> work for what seems likely to be a microscopic improvement in some
> operations.

I don't think it's microscopic at all. Just hacking away database &
tablespace from hashing & comparisons in the buffer tag (obviously not a
correct thing, but works enough for pgbench) results in quite measurable
performance benefits. But the main point isn't the performance
improvements themselves, but that it opens the door to smarter buffer
mapping algorithms, which e.g. will allow ordered access.  Also not
having the current problem with increasing the number of forks would be
good.

> Worse, we'll be taking penalties for other operations.
> How will you do DropDatabaseBuffers() for instance?

> CREATE DATABASE is going to be a problem, too.

More promently than that, without access to the database/tablespace we
couldn't even write out dirty buffers in a reasonable manner.  That's
why I think we're going to have to continue storing those two in the
buffer descriptors, just not include them in the buffer mapping.

Greetings,

Andres Freund

Re: Reducing the size of BufferTag & remodeling forks

From

Alvaro Herrera

Date:

03 July 2015, 16:59:26

Andres Freund wrote:

> 2) Replace relation forks, with the exception of the init fork which is
>    special anyway, with separate relfilenodes. Stored in seperate
>    columns in pg_class.

Different AMs have different fork needs; for heaps you want one main
fork, one VM, one fsm.  But for indexes, the VM fork is not necessary,
and some AMs might want different ones.  For instance, GIN would benefit
from having separate forks to store the internal indexes, and BRIN would
benefit from a separate fork for the revmap.

What I'm saying is that I'm not sure it's okay to store the forks in
pg_class columns; instead perhaps we should have a separate catalog on
which each relation can have many forks, or perhaps have the pg_class
entry store an array (ick).  Or perhaps rather than "relvmfork" (the
pg_class column for the relfilenode for the VM fork) we could store
relfilenode1, relfilenode2 where the value for each N fork is
AM-specific. (so for heaps 1 is main, 2 is FSM, 3 is VM; for BRIN 1 is
main, 2 is revmap; and so forth).

FWIW the whole idea seems reasonable to me.  I worry about concurrent
traffic into the pg_relfilenode shared table -- if temp table creation
is common across many databases, is it going to become a contention
point?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Reducing the size of BufferTag & remodeling forks

From

Andres Freund

Date:

03 July 2015, 17:16:15

On 2015-07-03 13:59:07 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> > 2) Replace relation forks, with the exception of the init fork which is
> >    special anyway, with separate relfilenodes. Stored in seperate
> >    columns in pg_class.
> 
> Different AMs have different fork needs; for heaps you want one main
> fork, one VM, one fsm.  But for indexes, the VM fork is not necessary,
> and some AMs might want different ones.  For instance, GIN would benefit
> from having separate forks to store the internal indexes, and BRIN would
> benefit from a separate fork for the revmap.
> 
> What I'm saying is that I'm not sure it's okay to store the forks in
> pg_class columns

Right. Part of the point of this design is that you could easily add
further forks without system wide knowledge and that it is *not*
required anymore that all relfilenodes are in one column. I think it'd
probably make sense to have at least _vm and _fsm in pg_class, but we
could easily add further ones elsewhere.

> instead perhaps we should have a separate catalog on
> which each relation can have many forks, or perhaps have the pg_class
> entry store an array (ick).  Or perhaps rather than "relvmfork" (the
> pg_class column for the relfilenode for the VM fork) we could store
> relfilenode1, relfilenode2 where the value for each N fork is
> AM-specific. (so for heaps 1 is main, 2 is FSM, 3 is VM; for BRIN 1 is
> main, 2 is revmap; and so forth).

None of these sound particularly pretty to me. An array of relfilenodes
would probably be the least ugly one.

> FWIW the whole idea seems reasonable to me.  I worry about concurrent
> traffic into the pg_relfilenode shared table -- if temp table creation
> is common across many databases, is it going to become a contention
> point?

I don't think it'll be. It's essentially just inserts into a tiny table
with a single index, right? We can do a bootload of inserts into one of
these.

In an *assert enabled* build:
CREATE TABLE pg_relfilenode() WITH OIDS;
ALTER TABLE pg_relfilenode ADD CONSTRAINT pg_relfilenode_pkey PRIMARY KEY(oid);

which is pretty much how pg_relfilenode would look like. Although we'd
not go through the whole executor for the inserts.

pgbench -h localhost -p 5440 postgres -n -f /tmp/pg_relfilenode.sql -j 16 -c 16 -T 20 -P 1
cat /tmp/pg_relfilenode.sql:
INSERT INTO pg_relfilenode DEFAULT VALUES;
progress: 5.0 s, 32168.4 tps, lat 0.495 ms stddev 1.728
progress: 6.0 s, 33719.6 tps, lat 0.473 ms stddev 0.773

andres@awork2:~/src/postgresql$ pgbench -h localhost -p 5440 postgres -n -f /tmp/temptable.sql -j 16 -c 16 -T 20 -P 1
CREATE TEMPORARY TABLE blarg() ON COMMIT DROP;
progress: 6.0 s, 5018.2 tps, lat 3.185 ms stddev 3.671
progress: 7.0 s, 4890.9 tps, lat 3.272 ms stddev 4.346

and that's with zero actual columns. If you instead add some:
CREATE TEMPORARY TABLE blarg(id serial primary key, data text, blarg int) ON COMMIT DROP;
progress: 7.0 s, 974.1 tps, lat 16.462 ms stddev 9.058
progress: 8.0 s, 999.9 tps, lat 16.045 ms stddev 7.011

So no, I don't think this'll be a relevant problem. We do so many
inserts for a single temp table that a single insert into another one
completely vanishes in comparison.

Re: Reducing the size of BufferTag & remodeling forks

From

Simon Riggs

Date:

12 September 2015, 12:12:35

On 2 July 2015 at 14:36, Andres Freund <andres@anarazel.de> wrote:

Hi,

I've complained a number of times that our BufferTag is ridiculously
large:
typedef struct buftag
{
RelFileNode rnode; /* physical relation identifier */
ForkNumber forkNum;
BlockNumber blockNum; /* blknum relative to begin of reln */
} BufferTag;

typedef struct RelFileNode
{
Oid spcNode; /* tablespace */
Oid dbNode; /* database */
Oid relNode; /* relation */
} RelFileNode;

that amounts to 20 bytes. That's problematic because we frequently have
to compare or hash the entire buffer tag. Comparing 20bytes is rather
branch intensive, and shows up noticably on profiles. It's also a
stumbling block on the way to a smarter buffer mapping data structure,
because it makes e.g. trees rather deep.

The buffer tag is currently used in two situations:

1) Dealing with the buffer mapping, we need to identify the underlying
file uniquely and we need the block number (8 bytes).

2) When writing out the a block we need, in addition to 1), have
information about where to store the file. That requires the
tablespace and database.

You may know that a filenode (RelFileNode->relNode) is currently *not*
unique across databases and tablespaces.

Why do we have to do buffer lookups using the full buffer tag?

Why not just use (relNode, blockNum) and resolve hash collisions, if any?

Your suggestion to avoid hashing the whole buffer tag was a good one. Having a permanent table to produce a smaller tag is a fairly pessimistic solution; why not just have an optimistic solution in memory instead?

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Reducing the size of BufferTag & remodeling forks

From

Andres Freund

Date:

12 September 2015, 20:19:53

On 2015-09-12 13:12:26 +0100, Simon Riggs wrote:
> Why do we have to do buffer lookups using the full buffer tag?

We don't necessarily.

> Why not just use (relNode, blockNum) and resolve hash collisions, if any?

I tried that and unfortunately it came out as a negative - the number of
collision gets higher and collisions in a hashtable will often imply a
cache miss. You can even see the comparison function rising in the
profile.

I've actually changed course a bit and I'm trying something different: A
two level structure. One hashtable that maps (RelFileNode, ForkNumber)
to a 'open relation' data structure, and from there a radix tree over
just the block number. To avoid having to look up in the hashtable
frequently there's a pointer in RelationData to a fork's radix tree.

This seems to have several advantages:
* It doesn't require changes to the physical representation
* A radix tree allows ordered traversal, allowing for efficient prefetching et al.
* The ability to efficiently get all pages that belong to a relation makes dropping/truncating tables radically more
efficientthan now.

* A single lookup in the radix tree is on average significantly faster than in the hash table. I've measured ~350 vs 60
cyclesin a nonconcurrent workload and ~820 vs 72~ cycles in a concurrent workload, using a recorded buffer access
traces.
* Having a 'open relation' representation in shared memory also makes it much easier to implement more efficient
relationextension and truncation - we can have pointers in there that signal how far the relation has been extended and
howfar it has been initialized.

The biggest disadvantage is that it's a good bit more complex than what
we have right now. That we need dynamically many radix trees makes
memory management nontrivial.

Sounds halfway sane?

Andres

Re: Reducing the size of BufferTag & remodeling forks

From

Simon Riggs

Date:

13 September 2015, 13:16:38

On 12 September 2015 at 21:19, Andres Freund <andres@anarazel.de> wrote:

On 2015-09-12 13:12:26 +0100, Simon Riggs wrote:
> Why do we have to do buffer lookups using the full buffer tag?

We don't necessarily.

> Why not just use (relNode, blockNum) and resolve hash collisions, if any?

I tried that and unfortunately it came out as a negative - the number of
collision gets higher and collisions in a hashtable will often imply a
cache miss. You can even see the comparison function rising in the
profile.

I've actually changed course a bit and I'm trying something different: A
two level structure. One hashtable that maps (RelFileNode, ForkNumber)
to a 'open relation' data structure, and from there a radix tree over
just the block number. To avoid having to look up in the hashtable
frequently there's a pointer in RelationData to a fork's radix tree.

This seems to have several advantages:
* It doesn't require changes to the physical representation

Seems necessary

* A radix tree allows ordered traversal, allowing for efficient
prefetching et al.

Could be very important... some benchmarks of whether that was worthwhile would really help

* The ability to efficiently get all pages that belong to a relation
makes dropping/truncating tables radically more efficient than now.

Not very important, since there are other approaches

* A single lookup in the radix tree is on average significantly faster
than in the hash table. I've measured ~350 vs 60 cycles in a
nonconcurrent workload and ~820 vs 72~ cycles in a concurrent
workload, using a recorded buffer access traces.

Good

* Having a 'open relation' representation in shared memory also makes it
much easier to implement more efficient relation extension and
truncation - we can have pointers in there that signal how far the
relation has been extended and how far it has been initialized.

Probably important, for extension

The biggest disadvantage is that it's a good bit more complex than what
we have right now. That we need dynamically many radix trees makes
memory management nontrivial.

Which might mean we lose benefit by requiring many additional memory accesses.

Sounds halfway sane?

I think it is a good research direction, as long as we get the focus right.

Another similar approach is direct allocation of chunks of memory for in-memory access. That would be much less flexible, but access would be in <5 cycles. I mention this for comparison only.

So if the focus is on more efficient and reasonably flexible access to data in memory, then it sounds good.

Simon Riggs http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Reducing the size of BufferTag & remodeling forks

From

Alvaro Herrera

Date:

19 April 2016, 18:44:45

Andres Freund wrote:

> I've actually changed course a bit and I'm trying something different: A
> two level structure. One hashtable that maps (RelFileNode, ForkNumber)
> to a 'open relation' data structure, and from there a radix tree over
> just the block number. To avoid having to look up in the hashtable
> frequently there's a pointer in RelationData to a fork's radix tree.

Is this going anywhere, or did you drop the subject altogether?

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: Reducing the size of BufferTag & remodeling forks

From

Andres Freund

Date:

21 April 2016, 18:25:12

On 2016-04-19 15:44:36 -0300, Alvaro Herrera wrote:
> Andres Freund wrote:
> 
> > I've actually changed course a bit and I'm trying something different: A
> > two level structure. One hashtable that maps (RelFileNode, ForkNumber)
> > to a 'open relation' data structure, and from there a radix tree over
> > just the block number. To avoid having to look up in the hashtable
> > frequently there's a pointer in RelationData to a fork's radix tree.
> 
> Is this going anywhere, or did you drop the subject altogether?

I've postponed working on it a bit, as it was becoming clearer that 9.6
wasn't a realistic target. I do plan to pick this up again once we're in
beta.

Any specific reason for asking?

Andres