RE: Use simplehash.h instead of dynahash in SMgr - Mailing list pgsql-hackers

From Jakub Wartak
Subject RE: Use simplehash.h instead of dynahash in SMgr
Date
Msg-id VI1PR0701MB696001353B6E41166E43BEC8F6599@VI1PR0701MB6960.eurprd07.prod.outlook.com
Whole thread Raw
In response to Re: Use simplehash.h instead of dynahash in SMgr  (David Rowley <dgrowleyml@gmail.com>)
Responses Re: Use simplehash.h instead of dynahash in SMgr
List pgsql-hackers
Hey David,

> I think you'd have to batch by filenode and transaction in that case. Each batch might be pretty small on a typical
OLTPworkload, so it might not help much there, or it might hinder.
 

True, it is very workload dependent (I was chasing mainly INSERTs multiValues, INSERT-SELECT) that often hit the same
$block,certainly not OLTP. I would even say that INSERT-as-SELECT would be more suited for DWH-like processing.
 

> But having said that, I don't think any of those possibilities should stop us speeding up smgropen().

Of course! I've tried a couple of much more smaller ideas, but without any big gains. I was able to squeeze like
300-400kfunction calls per second (WAL records/s), that was the point I think where I think smgropen() got abused. 
 

> > Another potential option that we've discussed is that the redo generation
> itself is likely a brake of efficient recovery performance today (e.g. INSERT-
> SELECT on table with indexes, generates interleaved WAL records that touch
> often limited set of blocks that usually put Smgr into spotlight).
> 
> I'm not quite sure if I understand what you mean here.  Is this queuing up
> WAL records up during transactions and flush them out to WAL every so
> often after rearranging them into an order that's more optimal for replay?

Why not both? 😉 We were very concentrated on standby side, but on primary side one could also change how WAL records
aregenerated:
 

1) Minimalization of records towards same repeated $block eg. Heap2 table_multi_insert() API already does this and it
mattersto generate more optimal stream for replay:
 

postgres@test=# create table t (id bigint primary key);
postgres@test=# insert into t select generate_series(1, 10);

results in many calls due to interleave heap with btree records for the same block from Smgr perspective (this is
especiallyvisible on highly indexed tables)  =>    
 
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243284, lsn: 4/E7000108, prev 4/E70000A0, desc: INSERT_LEAF off
1,blkref #0: rel 1663/16384/32794 blk 1
 
rmgr: Heap        len (rec/tot):     63/    63, tx:   17243284, lsn: 4/E7000148, prev 4/E7000108, desc: INSERT off 2
flags0x00, blkref #0: rel 1663/16384/32791 blk 0
 
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243284, lsn: 4/E7000188, prev 4/E7000148, desc: INSERT_LEAF off
2,blkref #0: rel 1663/16384/32794 blk 1
 
rmgr: Heap        len (rec/tot):     63/    63, tx:   17243284, lsn: 4/E70001C8, prev 4/E7000188, desc: INSERT off 3
flags0x00, blkref #0: rel 1663/16384/32791 blk 0
 
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243284, lsn: 4/E7000208, prev 4/E70001C8, desc: INSERT_LEAF off
3,blkref #0: rel 1663/16384/32794 blk 1
 
rmgr: Heap        len (rec/tot):     63/    63, tx:   17243284, lsn: 4/E7000248, prev 4/E7000208, desc: INSERT off 4
flags0x00, blkref #0: rel 1663/16384/32791 blk 0
 
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243284, lsn: 4/E7000288, prev 4/E7000248, desc: INSERT_LEAF off
4,blkref #0: rel 1663/16384/32794 blk 1
 
rmgr: Heap        len (rec/tot):     63/    63, tx:   17243284, lsn: 4/E70002C8, prev 4/E7000288, desc: INSERT off 5
flags0x00, blkref #0: rel 1663/16384/32791 blk 0
 
[..]
Similar stuff happens for UPDATE. It basically prevents recent-buffer optimization that avoid repeated calls to
smgropen().

And here's already existing table_multi_inserts v2 API (Heap2) sample with obvious elimination of unnecessary
individualcalls to smgopen() via one big MULTI_INSERT instead (for CTAS/COPY/REFRESH MV) :
 
postgres@test=# create table t (id bigint primary key);
postgres@test=# copy (select generate_series (1, 10)) to '/tmp/t';
postgres@test=# copy t from '/tmp/t';
=>
rmgr: Heap2       len (rec/tot):    210/   210, tx:   17243290, lsn: 4/E9000028, prev 4/E8004410, desc:
MULTI_INSERT+INIT10 tuples flags 0x02, blkref #0: rel 1663/16384/32801 blk 0
 
rmgr: Btree       len (rec/tot):    102/   102, tx:   17243290, lsn: 4/E9000100, prev 4/E9000028, desc: NEWROOT lev 0,
blkref#0: rel 1663/16384/32804 blk 1, blkref #2: rel 1663/16384/32804 blk 0
 
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off
1,blkref #0: rel 1663/16384/32804 blk 1
 
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off
2,blkref #0: rel 1663/16384/32804 blk 1
 
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off
3,blkref #0: rel 1663/16384/32804 blk 1
 
[..]
Here Btree it is very localized (at least when concurrent sessions are not generating WAL) and it enables Thomas's
recent-bufferto kick in
 

DELETE is much more simple (thanks to not chewing out those Btree records) and also thanks to Thomas's recent-buffer
shouldtheoretically put much less stress on smgropen() already:
 
rmgr: Heap        len (rec/tot):     54/    54, tx:   17243296, lsn: 4/ED000028, prev 4/EC002800, desc: DELETE off 1
flags0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
 
rmgr: Heap        len (rec/tot):     54/    54, tx:   17243296, lsn: 4/ED000060, prev 4/ED000028, desc: DELETE off 2
flags0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
 
rmgr: Heap        len (rec/tot):     54/    54, tx:   17243296, lsn: 4/ED000098, prev 4/ED000060, desc: DELETE off 3
flags0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
 
rmgr: Heap        len (rec/tot):     54/    54, tx:   17243296, lsn: 4/ED0000D0, prev 4/ED000098, desc: DELETE off 4
flags0x00 KEYS_UPDATED , blkref #0: rel 1663/16384/32808 blk 0
 
[..]

2) So what's missing - I may be wrong on this one - something like "index_multi_inserts" Btree2 API to avoid repeatedly
overwhelmingsmgropen() on recovery side for same index's $buffer. Not sure it is worth the effort, though especially
recent-bufferfixes that:
 
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 4/E9000168, prev 4/E9000100, desc: INSERT_LEAF off
1,blkref #0: rel 1663/16384/32804 blk 1
 
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 4/E90001A8, prev 4/E9000168, desc: INSERT_LEAF off
2,blkref #0: rel 1663/16384/32804 blk 1
 
rmgr: Btree       len (rec/tot):     64/    64, tx:   17243290, lsn: 4/E90001E8, prev 4/E90001A8, desc: INSERT_LEAF off
3,blkref #0: rel 1663/16384/32804 blk 1
 
right?

3) Concurrent DML sessions mixing WAL records: the buffering on backend's side of things (on private "thread" of WAL -
inprivate memory - that would be simply "copied" into logwriter's main WAL buffer when committing/buffer full) -  it
wouldseem like an very interesting idea to limit interleaving concurrent sessions WAL records between each other and
exploitthe recent-buffer enhancement to avoid repeating the same calls to Smgr, wouldn't it? (I'm just mentioning it as
Isaw you were benchmarking it here and called out this idea).
 

I could be wrong though with many of those simplifications, in any case please consult with Thomas as he knows much
betterand is much more trusted source than me 😉
 

-J.


pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: Small issues with CREATE TABLE COMPRESSION
Next
From: Robert Haas
Date:
Subject: Re: [bug?] Missed parallel safety checks, and wrong parallel safety