
From: Thomas Munro <thomas.munro@enterprisedb.com>
Subject: Re: LWLocks in DSM memory
Msg-id: CAEepm=2Cx3Wi5u0V+U16WHPG8eHBTQMk8pMsoDGrDZ-rX9GMgw@mail.gmail.com

On Sun, Jul 24, 2016 at 1:10 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> One solution could be to provide a non-circular variant of the dlist
> interface that uses NULL list termination.  I've attached a quick
> sketch of something like that which seems to work correctly.  It is
> only lightly tested so far and probably buggy, but shows the general
> idea.

On reflection, it wouldn't make much sense to put "noncircular" in the
names of interfaces, as that is an internal detail.  Maybe
"relocatable" or "position independent" or something like that, since
that's a user-visible property of the dlist_head.  Here is a version
of the patch that uses _relocatable.
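
To make that concrete, here is a rough sketch of what NULL termination
buys us (illustrative only; the attached patch has the real
definitions): with plain first/last pointers that are NULL-terminated,
no list node ever stores the address of the dlist_head itself, so the
head can sit at a different address in every backend that maps the DSM
segment.

    /* Sketch only, not the attached patch: reuse the dlist_head fields
     * as first/last pointers with NULL termination, so that no node
     * ever points back at the head. */
    static inline void
    dlist_init_relocatable(dlist_head *head)
    {
        head->head.next = NULL;        /* first element, NULL when empty */
        head->head.prev = NULL;        /* last element, NULL when empty */
    }

    static inline void
    dlist_push_head_relocatable(dlist_head *head, dlist_node *node)
    {
        node->prev = NULL;
        node->next = head->head.next;
        if (node->next != NULL)
            node->next->prev = node;   /* list was non-empty */
        else
            head->head.prev = node;    /* list was empty: node is also last */
        head->head.next = node;
    }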

> Any thoughts on this approach, or better ways to solve this problem?
> How valuable is the branch-free property of the circular push and
> delete operations?

I investigated this a bit.  If I run this loop on my laptop (clang, no
asserts, -O2), it takes 3895 milliseconds, or about 4.8ns per
push/delete operation (the loop performs 800 million of them in total):

        dlist_init(&head);
        for (i = 0; i < 100000000; ++i)
        {
            dlist_push_head(&head, &nodes[0]);
            dlist_push_tail(&head, &nodes[1]);
            dlist_push_head(&head, &nodes[2]);
            dlist_push_tail(&head, &nodes[3]);
            dlist_delete(&nodes[2]);
            dlist_delete(&nodes[3]);
            dlist_delete(&nodes[0]);
            dlist_delete(&nodes[1]);
        }
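
(Here and in the relocatable version below, head and nodes are ordinary
local variables, e.g. "dlist_head head; dlist_node nodes[4]; int i;".)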

The relocatable version takes 5907 milliseconds, or 7.4ns per
push/delete operation:

        dlist_init_relocatable(&head);
        for (i = 0; i < 100000000; ++i)
        {
            dlist_push_head_relocatable(&head, &nodes[0]);
            dlist_push_tail_relocatable(&head, &nodes[1]);
            dlist_push_head_relocatable(&head, &nodes[2]);
            dlist_push_tail_relocatable(&head, &nodes[3]);
            dlist_delete_relocatable(&head, &nodes[2]);
            dlist_delete_relocatable(&head, &nodes[3]);
            dlist_delete_relocatable(&head, &nodes[0]);
            dlist_delete_relocatable(&head, &nodes[1]);
        }

Those operations are ~50% slower.  So getting rid of dlist's clever
branch-free code generally seems like a bad idea.
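
The difference comes down to the delete path.  With the circular
representation every node always has valid prev and next pointers (the
sentinel node embedded in dlist_head stands in at the ends), so delete
is just two unconditional stores, essentially what ilist.h does today.
With NULL termination, deleting the first or last element has to update
the head, so branches appear.  Roughly (again a sketch, not the
attached patch):

    /* Circular dlist: branch-free, since prev and next are never NULL. */
    static inline void
    dlist_delete(dlist_node *node)
    {
        node->prev->next = node->next;
        node->next->prev = node->prev;
    }

    /* NULL-terminated sketch: the head must be passed in and updated
     * whenever the first or last element is removed. */
    static inline void
    dlist_delete_relocatable(dlist_head *head, dlist_node *node)
    {
        if (node->prev != NULL)
            node->prev->next = node->next;
        else
            head->head.next = node->next;  /* node was first */
        if (node->next != NULL)
            node->next->prev = node->prev;
        else
            head->head.prev = node->prev;  /* node was last */
    }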

Next I wondered whether it would really be a bad idea to use the slower
relocatable dlist heads for all LWLocks.  Obviously LWLocks are used all
over the place and it would be bad to slow them down, but perhaps dlist
manipulation is not a very significant part of their work.  To test that,
I put an LWLock and a counter in shared memory, and had N background
workers run a tight loop that locks, increments and unlocks concurrently
until the counter reaches 1 billion.  This creates different degrees of
contention and wait list sizes.  The worker loop looks like this:

    while (!finished)
    {
        LWLockAcquire(&control->lock, LW_EXCLUSIVE);
        ++control->counter;
        if (control->counter >= control->goal)
            finished = true;
        LWLockRelease(&control->lock);
    }
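
The shared state behind that loop is nothing more than an LWLock and a
pair of counters, something along these lines (the struct name is made
up here; the lock, counter and goal fields are the ones referenced in
the loop):

    /* Sketch of the shared control struct the loop above assumes. */
    typedef struct
    {
        LWLock      lock;       /* the contended lock */
        uint64      counter;    /* incremented while holding the lock */
        uint64      goal;       /* stop once this is reached (1 billion) */
    } LockBenchControl;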

I measured the following times for unpatched master, on my 4-core laptop:

16 workers = 73.067s, 74.869s, 75.338s
8 workers  = 65.846s, 67.622s, 68.039s
4 workers  = 68.763s, 68.980s, 69.035s <-- curiously slower here
3 workers  = 59.701s, 59.991s, 60.133s
2 workers  = 53.620s, 55.300s, 55.790s
1 worker   = 21.578s, 21.535s, 21.598s

With the attached patch I got:

16 workers = 75.341s, 77.445s, 77.635s <- +3.4%
8 workers  = 67.462s, 68.622s, 68.851s <- +1.4%
4 workers  = 64.983s, 65.021s, 65.496s <- -5.7%
3 workers  = 60.247s, 60.425s, 60.492s <- +0.7%
2 workers  = 57.544s, 57.626s, 58.133s <- +2.3%
1 worker   = 21.403s, 21.486s, 21.661s <- -0.2%

Somewhat noisy data and different effects at different numbers of workers.

I can post the source for those tests if anyone is interested.  If you
have any other ideas for access patterns to test, or clever ways to
keep push and delete branch-free while also avoiding internal pointers
back to dlist_head, I'm all ears.  Otherwise, if a change affecting
all LWLocks turns out to be unacceptable, maybe we would need to have
a different LWLock interface for relocatable LWLocks to make them
suitable for use in DSM segments.  Any thoughts?

--
Thomas Munro
http://www.enterprisedb.com
