Re: [HACKERS] [POC] Faster processing at Gather node - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] [POC] Faster processing at Gather node
Date
Msg-id 20180110000934.lft7ebeqek4grrlx@alap3.anarazel.de
In response to Re: [HACKERS] [POC] Faster processing at Gather node  (Robert Haas <robertmhaas@gmail.com>)
Responses Re: [HACKERS] [POC] Faster processing at Gather node
List pgsql-hackers
Hi,

On 2017-12-04 10:50:53 -0500, Robert Haas wrote:
> Subject: [PATCH 1/2] shm-mq-less-spinlocks-v2


> + * mq_sender and mq_bytes_written can only be changed by the sender.
> + * mq_receiver and mq_sender are protected by mq_mutex, although, importantly,
> + * they cannot change once set, and thus may be read without a lock once this
> + * is known to be the case.

I don't recall our conversation around this anymore, and haven't read
down far enough to see the relevant code. Lest I forget: such constructs
often need careful use of memory barriers.


> - * mq_detached can be set by either the sender or the receiver, so the mutex
> - * must be held to read or write it.  Memory barriers could be used here as
> - * well, if needed.
> + * mq_bytes_read and mq_bytes_written are not protected by the mutex.  Instead,
> + * they are written atomically using 8 byte loads and stores.  Memory barriers
> + * must be carefully used to synchronize reads and writes of these values with
> + * reads and writes of the actual data in mq_ring.
> + *
> + * mq_detached needs no locking.  It can be set by either the sender or the
> + * receiver, but only ever from false to true, so redundant writes don't
> + * matter.  It is important that if we set mq_detached and then set the
> + * counterparty's latch, the counterparty must be certain to see the change
> + * after waking up.  Since SetLatch begins with a memory barrier and ResetLatch
> + * ends with one, this should be OK.

s/should/is/ or similar?


Perhaps a short benchmark of shm_mq on 32-bit systems wouldn't hurt?
I suspect there won't be much of a performance impact, but it's probably
worth checking.


>   * mq_ring_size and mq_ring_offset never change after initialization, and
>   * can therefore be read without the lock.
>   *
> - * Importantly, mq_ring can be safely read and written without a lock.  Were
> - * this not the case, we'd have to hold the spinlock for much longer
> - * intervals, and performance might suffer.  Fortunately, that's not
> - * necessary.  At any given time, the difference between mq_bytes_read and
> + * At any given time, the difference between mq_bytes_read and

Hm, why did you remove the first part about mq_ring itself?


> @@ -848,18 +868,19 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
>  
>      while (sent < nbytes)
>      {
> -        bool        detached;
>          uint64        rb;
> +        uint64        wb;
>  
>          /* Compute number of ring buffer bytes used and available. */
> -        rb = shm_mq_get_bytes_read(mq, &detached);
> -        Assert(mq->mq_bytes_written >= rb);
> -        used = mq->mq_bytes_written - rb;
> +        rb = pg_atomic_read_u64(&mq->mq_bytes_read);
> +        wb = pg_atomic_read_u64(&mq->mq_bytes_written);
> +        Assert(wb >= rb);
> +        used = wb - rb;

Just to make sure my understanding is correct: No barriers needed here
because "bytes_written" is only written by the sending backend, and
"bytes_read" cannot lap it. Correct?


>          Assert(used <= ringsize);
>          available = Min(ringsize - used, nbytes - sent);
>  
>          /* Bail out if the queue has been detached. */
> -        if (detached)
> +        if (mq->mq_detached)

Hm, do all paths here guarantee that mq->mq_detached won't be read into a
register once on the first iteration and then never reread on later
iterations? I think it's fine because every branch of the if below ends up
in a syscall / barrier, but it might be worth noting that here.


> +            /*
> +             * Since mq->mqh_counterparty_attached is known to be true at this
> +             * point, mq_receiver has been set, and it can't change once set.
> +             * Therefore, we can read it without acquiring the spinlock.
> +             */
> +            Assert(mqh->mqh_counterparty_attached);
> +            SetLatch(&mq->mq_receiver->procLatch);

Perhaps mention that this could lead to spuriously signalling the wrong
backend in case of detach, but that that is fine?

>              /* Skip manipulation of our latch if nowait = true. */
>              if (nowait)
> @@ -934,10 +953,18 @@ shm_mq_send_bytes(shm_mq_handle *mqh, Size nbytes, const void *data,
>          }
>          else
>          {
> -            Size        offset = mq->mq_bytes_written % (uint64) ringsize;
> -            Size        sendnow = Min(available, ringsize - offset);
> +            Size        offset;
> +            Size        sendnow;
> +
> +            offset = wb % (uint64) ringsize;
> +            sendnow = Min(available, ringsize - offset);

I know the scheme isn't new, but I do find it not immediately obvious
that 'wb' is short for 'bytes_written'.


> -            /* Write as much data as we can via a single memcpy(). */
> +            /*
> +             * Write as much data as we can via a single memcpy(). Make sure
> +             * these writes happen after the read of mq_bytes_read, above.
> +             * This barrier pairs with the one in shm_mq_inc_bytes_read.
> +             */

s/above/above. Otherwise a newer mq_bytes_read could become visible
before the corresponding reads have fully finished./?

Could you also add a comment as to why you think a read barrier isn't
sufficient? IIUC that's the case because we need to prevent reordering
in both directions: we can neither start reading based on a "too new"
bytes_read, nor can we afford the writes to mq_ring being reordered to
before the barrier. Correct?


> +            pg_memory_barrier();
>              memcpy(&mq->mq_ring[mq->mq_ring_offset + offset],
>                     (char *) data + sent, sendnow);
>              sent += sendnow;

Btw, this mq_ring_offset stuff seems a bit silly, why don't we use
proper padding/union in the struct to make it unnecessary to do that bit
of offset calculation every time? I think it currently prevents
efficient address calculation instructions from being used.


> From 666d33a363036a647dde83cb28b9d7ad0b31f76c Mon Sep 17 00:00:00 2001
> From: Robert Haas <rhaas@postgresql.org>
> Date: Sat, 4 Nov 2017 19:03:03 +0100
> Subject: [PATCH 2/2] shm-mq-reduce-receiver-latch-set-v1

> -    /* Consume any zero-copy data from previous receive operation. */
> -    if (mqh->mqh_consume_pending > 0)
> +    /*
> +     * If we've consumed an amount of data greater than 1/4th of the ring
> +     * size, mark it consumed in shared memory.  We try to avoid doing this
> +     * unnecessarily when only a small amount of data has been consumed,
> +     * because SetLatch() is fairly expensive and we don't want to do it
> +     * too often.
> +     */
> +    if (mqh->mqh_consume_pending > mq->mq_ring_size / 4)
>      {

Hm. Why are we doing this at the level of updating the variables, rather
than SetLatch calls?

Greetings,

Andres Freund

