Re: Reducing the size of BufferTag & remodeling forks - Mailing list pgsql-hackers

From Simon Riggs
Subject Re: Reducing the size of BufferTag & remodeling forks
Date
Msg-id CANP8+jLT+zQdj=6T2fsBNFj_wvqDdPvJ7fkTg++_RcOx5QoCEw@mail.gmail.com
Whole thread Raw
In response to Re: Reducing the size of BufferTag & remodeling forks  (Andres Freund <andres@anarazel.de>)
List pgsql-hackers
On 12 September 2015 at 21:19, Andres Freund <andres@anarazel.de> wrote:
On 2015-09-12 13:12:26 +0100, Simon Riggs wrote:
> Why do we have to do buffer lookups using the full buffer tag?

We don't necessarily.

> Why not just use (relNode, blockNum) and resolve hash collisions, if any?

I tried that and unfortunately it came out as a negative - the number of
collision gets higher and collisions in a hashtable will often imply a
cache miss. You can even see the comparison function rising in the
profile.

I've actually changed course a bit and I'm trying something different: A
two level structure. One hashtable that maps (RelFileNode, ForkNumber)
to a 'open relation' data structure, and from there a radix tree over
just the block number. To avoid having to look up in the hashtable
frequently there's a pointer in RelationData to a fork's radix tree.

This seems to have several advantages:
* It doesn't require changes to the physical representation

Seems necessary
 
* A radix tree allows ordered traversal, allowing for efficient
  prefetching et al.

Could be very important... some benchmarks of whether that was worthwhile would really help
 
* The ability to efficiently get all pages that belong to a relation
  makes dropping/truncating tables radically more efficient than now.

Not very important, since there are other approaches
 
* A single lookup in the radix tree is on average significantly faster
  than in the hash table. I've measured ~350 vs 60 cycles in a
  nonconcurrent workload and ~820 vs 72~ cycles in a concurrent
  workload, using a recorded buffer access traces.

Good
 
* Having a 'open relation' representation in shared memory also makes it
  much easier to implement more efficient relation extension and
  truncation - we can have pointers in there that signal how far the
  relation has been extended and how far it has been initialized.

Probably important, for extension
 
The biggest disadvantage is that it's a good bit more complex than what
we have right now. That we need dynamically many radix trees makes
memory management nontrivial.

Which might mean we lose benefit by requiring many additional memory accesses.
 
Sounds halfway sane?

I think it is a good research direction, as long as we get the focus right.

Another similar approach is direct allocation of chunks of memory for in-memory access. That would be much less flexible, but access would be in <5 cycles. I mention this for comparison only.

So if the focus is on more efficient and reasonably flexible access to data in memory, then it sounds good. 

--
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: [DOCS] Missing COMMENT ON POLICY
Next
From: Michael Paquier
Date:
Subject: Re: New gist vacuum.