From: pantilimonov michael
Subject: Re: [GSoC] artbufmgr
Msg-id: 7009531564058048@myt4-42d4a9f8d9f5.qloud-c.yandex.net
In response to: [GSoC] artbufmgr (pantilimonov misha <pantlimon@yandex.ru>)
List: pgsql-hackers

Greetings all,

> So, this is a plan, that I would like to stick with subsequently:
>
> 1) Work on synchronization.
> 2) Examine bulk operations on buffers, replace them with tree (prefix) iterator.
> 3) A reasonable way to size memory for tree maintenance.
>

Previously I described the plan that I wanted to follow (quoted above); the achieved results are summarized below.
I should note beforehand that the third item wasn't touched.

>
> Dynamic data structure like tree brings problems that are not relevant for the current hashtable implementation.
> The number of LWLocks in the hashtable is fixed and equal to the number of partitions(128),
> the memory footprint for buckets and hash-entries can be easily precomputed.
>
> Currently, I am thinking about the idea of tree decomposition, so that it is really a
> tree of trees, where RelFileNode[12bytes] (maybe + ForkNum) can be used for the top tree and
> the rest for the trees of blocks. Hence, each bottom tree can contain a single private LWLock, that will be
> used to protect its content. Won't be that an overkill? if we take a hypothetical example with
> 2 tablespaces(hdd + sdd), each with 500+ objects, so the number of LWLocks should scale accordingly, or
> we will be forced to reuse them, by unloading block trees, that should be done in some intelligent way.
>

Initially I started with a single-lock ART tree, using the same locking strategy as the existing hashtable.
Both structs were surrounded by "defines", so I could run them independently or simultaneously.
The latter option was mainly used as a validation check that the tree works in the same way
as the hashtable. I tested the single-lock tree version using pgbench and 'installcheck',
in order to exercise it under activity where multiple parallel processes are involved and
fresh relation tags arrive/leave due to the creation/dropping of tables.
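
For illustration, a minimal sketch of how the simultaneous mode can cross-check the two
structures (BufTreeLookup() and the BUFMGR_USE_* defines are hypothetical names here;
BufTableLookup() is the existing hashtable routine):

/* look up the same tag in both structures and verify that they agree */
static int
lookup_buffer(BufferTag *tag, uint32 hashcode)
{
    int     hash_id = -1;
    int     tree_id = -1;

#ifdef BUFMGR_USE_HASH
    hash_id = BufTableLookup(tag, hashcode);    /* existing hashtable path */
#endif
#ifdef BUFMGR_USE_TREE
    tree_id = BufTreeLookup(tag);               /* tree path */
#endif
#if defined(BUFMGR_USE_HASH) && defined(BUFMGR_USE_TREE)
    Assert(hash_id == tree_id);                 /* tree must mirror the hashtable */
#endif
    return (hash_id != -1) ? hash_id : tree_id;
}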

It was obvious that a single-lock tree, by definition, can't keep up with the
128-lock hashtable (where each lock protects a specific region (partition)) in all concurrent cases, so
the idea was to split the current 'MainTree' into a tree of trees. Such separation has additional benefits,
besides the throughput improvement:

a) Most of the time (relatively) 'MainTree' is not modified: as time goes by, it gets saturated with the
    'most used' relation keys. After that point, it is mostly used as a read-only structure.
b) The 'path' to a specific subtree of 'MainTree' can be cached in SMgrRelation by saving pointers to the
     block subtrees for the corresponding ForkNums (just an array of pointers[max_forknum + 1]).
     In the current implementation, this optimization can reduce the key length from 20 to 4 bytes
     (the full BufferTag is shown below, followed by a sketch of such a cache).

typedef struct buftag
{
    Oid         spcNode;        /* tablespace */
    Oid         dbNode;         /* database */
    Oid         relNode;        /* relation */
    ForkNumber  forkNum;        /* fork of the relation */
    BlockNumber blockNum;       /* blknum relative to begin of reln */
} BufferTag;
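
For point b), a minimal sketch of what such a cache could look like (cached_subtrees is a
hypothetical field name; MAX_FORKNUM is the existing constant, giving one slot per fork):

typedef struct SMgrRelationData
{
    /* ... existing fields ... */

    /*
     * Hypothetical cache: a pointer to the block subtree of 'MainTree'
     * for each fork of this relation.  With the subtree pointer at hand,
     * a lookup needs only the 4-byte BlockNumber as a key instead of the
     * full 20-byte BufferTag.
     */
    struct art_tree *cached_subtrees[MAX_FORKNUM + 1];
} SMgrRelationData;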

To split 'MainTree' I injected an LWLock into the art_tree structure and created
an additional, separate freelist of these subtrees' nodes, which is then used for their dynamic
allocation and deallocation.
The key length in 'MainTree' is 16 bytes (spcNode, dbNode, relNode, forkNum) and can likely
be reduced to 13 bytes, since the forkNum value ranges only from -1 to 3 but occupies
4 bytes.
The key length of each subtree is only 4 bytes - the BlockNumber.
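
Concretely, the lock injection and the two key layouts look roughly like this (a sketch
assuming a libart-style art_tree struct; make_main_key/make_block_key are hypothetical
helper names, and byte-order concerns are glossed over):

/* each block subtree now carries its own lock */
typedef struct art_tree
{
    art_node   *root;
    uint64_t    size;
    LWLock      lock;           /* added: protects this subtree's contents */
} art_tree;

/* build the 16-byte 'MainTree' key from a buffer tag */
static inline void
make_main_key(const BufferTag *tag, unsigned char key[16])
{
    memcpy(key + 0,  &tag->spcNode, sizeof(Oid));
    memcpy(key + 4,  &tag->dbNode,  sizeof(Oid));
    memcpy(key + 8,  &tag->relNode, sizeof(Oid));
    memcpy(key + 12, &tag->forkNum, sizeof(ForkNumber));    /* -1..3, one byte would do */
}

/* the per-subtree key is just the 4-byte block number */
static inline void
make_block_key(const BufferTag *tag, unsigned char key[4])
{
    memcpy(key, &tag->blockNum, sizeof(BlockNumber));
}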

Below are the results of tests performed on databases initialized with
pgbench -i -s 50 with shared_buffers=128MB
and pgbench -i -s 100 with shared_buffers=1GB.
It should be noted that such a workload does not really represent 'real life' results,
as all contention goes to certain partitions (in the case of the hashtable) and subtrees (in the case of the tree).
The next goal is to compare the data structures on some kind of realistic benchmark, or simply to create
multiple databases inside the cluster and run a corresponding number of pgbench instances.

Tested on a PC with an i7 CPU and an SSD.
Each test was performed 5 times; the readonly runs were executed back to back, the full runs
started from a fresh cluster (otherwise some problems in the tree remain to be fixed), and the best result is taken.

readonly test: pgbench --random-seed=2 -t 100000 -S -c 6 -j 6 -n
full test: pgbench --random-seed=2 -t 10000 -c 6 -j 6
drop test:
       create table test_drop2 as select a, md5(a::text) from generate_series(1, 20000) as a; -- ~167 blocks
       drop table test_drop2;

128MB:
    readonly HASH: latency average = 0.102 ms, tps = 58934.229170 (excluding connections establishing)
    full     HASH: latency average = 1.233 ms, tps = 4868.332127 (excluding connections establishing)

    readonly TREE: latency average = 0.106 ms, tps = 56818.893538 (excluding connections establishing)
    full     TREE: latency average = 1.227 ms, tps = 4890.546989 (excluding connections establishing)

1GB:
    readonly HASH: latency average = 0.100 ms, tps = 60120.841307 (excluding connections establishing)
    full     HASH: latency average = 1.325 ms, tps = 4529.205977 (excluding connections establishing)
    drop     HASH: min ~4.9 ms, max ~54 ms

    readonly TREE: latency average = 0.100 ms, tps = 60247.345044 (excluding connections establishing)
    full     TREE: latency average = 1.286 ms, tps = 4665.538565 (excluding connections establishing)
    drop     TREE: min ~4.3 ms, max ~52 ms

These tests do not show significant superiority of either data structure, but it should be noted
that the maintenance complexity of the tree (due to its dynamic nature) is much higher than that of the
existing hashtable implementation for buffer management.

Before the start of this project, we assumed that the tree might not be faster than the hash table,
but would at least be on a par with it.
 
The tree's additional properties may allow other operations to be performed more efficiently,
making it more attractive as a future structure for buffer management.

So I feel it is better for now _not_ to concentrate on those 'additional properties' and instead to pursue
the third item stated at the beginning -- a reasonable way to size memory for tree maintenance. Then, at the end
of the project, the tree can be fully used as an alternative structure to the existing hashtable and serve as
the base for the future implementation of those 'additional properties', like drop/truncate/checkpoint/relation extension/etc.
My attempt to implement a tree version of DropRelFileNodesAllBuffers showed that
there are many places in the system that would have to cope with these changes.
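
For illustration, the heart of such a tree-based drop could be a walk over just the relation's
block subtree (a sketch using libart's art_iter() callback API; 'subtree' stands for the
hypothetical result of looking the fork up in 'MainTree', and all locking is omitted):

/* called for every (blockNum -> buf_id) entry of the fork being dropped */
static int
drop_block_callback(void *data, const unsigned char *key,
                    uint32_t key_len, void *value)
{
    int     buf_id = (int) (intptr_t) value;

    InvalidateBuffer(GetBufferDescriptor(buf_id));  /* evict from the pool */
    return 0;                                       /* 0 = continue iteration */
}

/* per fork: visit only the relation's own blocks, instead of scanning
 * every buffer header the way DropRelFileNodesAllBuffers does today */
art_iter(&subtree->tree, drop_block_callback, NULL);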



