
From: Andres Freund
Subject: Re: Cache relation sizes?
Date:
Msg-id: 20201230043536.fsqngrit2kl2mlue@alap3.anarazel.de
In response to: Re: Cache relation sizes?  (Thomas Munro <thomas.munro@gmail.com>)
List: pgsql-hackers

Hi,

On 2020-11-19 18:01:14 +1300, Thomas Munro wrote:
> From ac3c61926bf947a3288724bd02cf8439ff5c14bc Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Fri, 13 Nov 2020 14:38:41 +1300
> Subject: [PATCH v2 1/2] WIP: Track relation sizes in shared memory.
>
> Introduce a fixed size pool of SMgrSharedRelation objects.  A new GUC
> smgr_shared_relation controls how many can exist at once, and they are
> evicted as required.

As noted over IM, I think it'd be good if we could avoid the need for a
new GUC here and instead derive it from other settings. E.g.
  max_files_per_process * MaxBackends
or maybe
  max_files_per_process * log(MaxBackends)
(on the basis that it's unlikely that each backend uses completely
separate files).
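
Something along these lines, perhaps (just a sketch - the helper name and
the floor of 64 are made up, max_files_per_process and MaxBackends are the
existing settings):

#include "postgres.h"

#include <math.h>

#include "miscadmin.h"		/* MaxBackends */
#include "storage/fd.h"		/* max_files_per_process */

/*
 * Sketch: size the SMgrSharedRelation pool from existing settings instead
 * of adding a new GUC.  Scale by log2(MaxBackends) on the assumption that
 * backends mostly touch overlapping sets of files.
 */
static int
smgr_shared_relation_pool_size(void)
{
	return Max(64, max_files_per_process *
			   (int) ceil(log2(Max(MaxBackends, 2))));
}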

Given the size of the structs, it seems that the memory overhead of this
should be acceptable?



> XXX smgrimmedsync() is maybe too blunt an instrument?
>
> XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via
> some new interface, but it doesn't actually know if it's cleaning a fork
> that extended a relation...
>
> XXX perhaps bgwriter should try to clean them?
>

As I suggested on the same call, I think it might make sense to
integrate this with the checkpointer sync queue. Have checkpointer try
to keep a certain fraction of the entries clean (or even make them
unused?) by processing all the sync requests for that relfilenode.
That'd allow us to a) only sync the segments we actually need to sync,
and b) still sync those segments again at the end of the checkpoint.


Another use-case I have is that I'd like to make file extension better
than it currently is. There are two main challenges:

- Having bulk extension add pages to the FSM isn't actually a good
  solution, as that has a high likelihood of multiple backends ending up
  trying to insert into the same page. If we had, in SMgrSharedRelation,
  a 'physical' relation size and a 'logical' relation size, we could do
  better, and hand out pre-allocated pages to exactly one backend.

- Doing bulk extension when there's no other space left leads to large
  pile-ups of processes needing the extension lock -> a large latency
  increase. Instead we should opportunistically start to do extension
  when we've used up most of the bulk-allocated space.
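
For the first point, roughly something like this is what I'm thinking of
(just a sketch, every name in it is made up):

#include "postgres.h"

#include "common/relpath.h"	/* ForkNumber, MAX_FORKNUM */
#include "storage/block.h"	/* BlockNumber, InvalidBlockNumber */

/*
 * Sketch: per-fork "physical" (allocated on disk) vs. "logical" (handed
 * out to backends) block counts, as additional state in
 * SMgrSharedRelation.
 */
typedef struct SMgrSharedRelationSizes
{
	BlockNumber	physical_nblocks[MAX_FORKNUM + 1];	/* extended on disk */
	BlockNumber	logical_nblocks[MAX_FORKNUM + 1];	/* handed to backends */
} SMgrSharedRelationSizes;

/*
 * Hand a pre-allocated page to exactly one backend, or return
 * InvalidBlockNumber if the bulk-extended space is used up and the caller
 * has to extend physically.  Caller holds the SR lock.
 */
static BlockNumber
smgr_sr_claim_block(SMgrSharedRelationSizes *sizes, ForkNumber forknum)
{
	if (sizes->logical_nblocks[forknum] < sizes->physical_nblocks[forknum])
		return sizes->logical_nblocks[forknum]++;
	return InvalidBlockNumber;
}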



> +/*
> + * An entry in the hash table that allows us to look up objects in the
> + * SMgrSharedRelation pool by rnode (+ backend).
> + */
> +typedef struct SMgrSharedRelationMapping
> +{
> +    RelFileNodeBackend rnode;
> +    int                index;
> +} SMgrSharedRelationMapping;

Maybe add a note that index points into
SMgrSharedRelationPool->objects[]? And perhaps explain why you chose not
to store SMgrSharedRelation directly in the hashtable (makes sense to me,
but still worth commenting on imo).
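
E.g. something like this - the second half is just my guess at the
rationale, adjust as appropriate:

/*
 * An entry in the hash table that allows us to look up objects in the
 * SMgrSharedRelation pool by rnode (+ backend).
 *
 * "index" is an index into SMgrSharedRelationPool->objects[].  We store
 * the index rather than the SMgrSharedRelation itself so the pool stays a
 * plain fixed-size array walked by the allocation clock hand, while the
 * dynahash entries remain small.
 */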



> +/* Flags. */
> +#define SR_LOCKED                    0x01
> +#define SR_VALID                    0x02
> +
> +/* Each forknum gets its own dirty, syncing and just dirtied bits. */
> +#define SR_DIRTY(forknum)            (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0))
> +#define SR_SYNCING(forknum)            (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1))
> +#define SR_JUST_DIRTIED(forknum)    (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2))

Hm - do we really want all that stored in a single flag value?

> +/*
> + * Lock a SMgrSharedRelation.  The lock is a spinlock that should be held for
> + * only a few instructions.  The return value is the current set of flags,
> + * which may be modified and then passed to smgr_unlock_sr() to be atomically
> + * when the lock is released.
> + */
> +static uint32
> +smgr_lock_sr(SMgrSharedRelation *sr)
> +{
> +
> +    for (;;)
> +    {
> +        uint32    old_flags = pg_atomic_read_u32(&sr->flags);
> +        uint32    flags;
> +
> +        if (!(old_flags & SR_LOCKED))
> +        {
> +            flags = old_flags | SR_LOCKED;
> +            if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
> +                return flags;
> +        }
> +    }
> +    return 0; /* unreachable */
> +}

Since this is a spinlock, you should use the spin delay infrastructure
(cf. LWLockWaitListLock()). Otherwise it'll react terribly if there ever
ends up being contention.
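
I.e. something like this (sketch only - it reuses the patch's SR_LOCKED
and SMgrSharedRelation, the spin delay bits are from storage/s_lock.h):

#include "storage/s_lock.h"	/* SpinDelayStatus and friends */

static uint32
smgr_lock_sr(SMgrSharedRelation *sr)
{
	SpinDelayStatus delayStatus;

	init_local_spin_delay(&delayStatus);
	for (;;)
	{
		uint32		old_flags = pg_atomic_read_u32(&sr->flags);

		if (!(old_flags & SR_LOCKED))
		{
			uint32		flags = old_flags | SR_LOCKED;

			if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
			{
				finish_spin_delay(&delayStatus);
				return flags;
			}
		}
		else
			perform_spin_delay(&delayStatus);
	}
}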

What's the reason not to use a per-entry lwlock instead?

Naming: I find it weird to have smgr_ as a prefix and _sr as a
postfix. I'd go for smgr_sr_$op.


> +/*
> + * Unlock a SMgrSharedRelation, atomically updating its flags at the same
> + * time.
> + */
> +static void
> +smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags)
> +{
> +    pg_write_barrier();
> +    pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED);
> +}

This assumes there are never atomic modifications of flags without
holding the spinlock (or a cmpxchg loop testing whether somebody is
holding the lock). Is that right / good?


> +/*
> + * Allocate a new invalid SMgrSharedRelation, and return it locked.
> + *
> + * The replacement algorithm is a simple FIFO design with no second chance for
> + * now.
> + */
> +static SMgrSharedRelation *
> +smgr_alloc_sr(void)
> +{
> +    SMgrSharedRelationMapping *mapping;
> +    SMgrSharedRelation *sr;
> +    uint32 index;
> +    LWLock *mapping_lock;
> +    uint32 flags;
> +    RelFileNodeBackend rnode;
> +    uint32 hash;
> +
> + retry:
> +    /* Lock the next one in clock-hand order. */
> +    index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
> +    sr = &sr_pool->objects[index];
> +    flags = smgr_lock_sr(sr);

Modulo is pretty expensive - I'd force smgr_shared_relations to be a
power of two and then mask with a smgr_shared_relations_mask initialized
to (smgr_shared_relations - 1).
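
I.e. something like (sketch):

/* at initialization, after rounding smgr_shared_relations up to a power of two */
smgr_shared_relations_mask = smgr_shared_relations - 1;

/* in smgr_alloc_sr(), replacing the modulo */
index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) & smgr_shared_relations_mask;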



> +    /* If it's unused, can return it, still locked, immediately. */
> +    if (!(flags & SR_VALID))
> +        return sr;
> +
> +    /*
> +     * Copy the rnode and unlock.  We'll briefly acquire both mapping and SR
> +     * locks, but we need to do it in that order, so we'll unlock the SR
> +     * first.
> +     */
> +    rnode = sr->rnode;
> +    smgr_unlock_sr(sr, flags);
> +
> +    hash = get_hash_value(sr_mapping_table, &rnode);
> +    mapping_lock = SR_PARTITION_LOCK(hash);
> +
> +    LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
> +    mapping = hash_search_with_hash_value(sr_mapping_table,
> +                                          &rnode,
> +                                          hash,
> +                                          HASH_FIND,
> +                                          NULL);
> +    if (!mapping || mapping->index != index)
> +    {
> +        /* Too slow, it's gone or now points somewhere else.  Go around. */
> +        LWLockRelease(mapping_lock);
> +        goto retry;
> +    }
> +
> +    /* We will lock the SR for just a few instructions. */
> +    flags = smgr_lock_sr(sr);
> +    Assert(flags & SR_VALID);

So the replacement here is purely based on when the relation was
accessed first, not when it was accessed last. Probably good enough, but
seems worth a comment.


> 3.  The nblocks value must be free enough for scans, extension,
> truncatation and dropping buffers, because the those operations all
> executed memory barriers when it acquired a snapshot to scan (which
> doesn't need to see blocks added after that) or an exclusive heavyweight
> lock to extend, truncate or drop.  XXX right?

"free enough"?

I am not sure this holds for non-mvcc scans?


> @@ -699,7 +748,7 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
>  bool
>  smgrexists(SMgrRelation reln, ForkNumber forknum)
>  {
> -    if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber)
> +    if (smgrnblocks_fast(reln, forknum) != InvalidBlockNumber)
>          return true;

ISTM that all these places referencing _fast are leaking an
implementation detail (though not very far) - seems like that should be
an implementation detail of smgrnblocks(), which can then call out to
_fast/_shared or such.
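
Roughly this shape (sketch - whether the final fallback looks like this
is up to the patch):

BlockNumber
smgrnblocks(SMgrRelation reln, ForkNumber forknum)
{
	BlockNumber result;

	/* backend-local cache */
	result = smgrnblocks_fast(reln, forknum);
	if (result != InvalidBlockNumber)
		return result;

	/* shared pool */
	result = smgrnblocks_shared(reln, forknum);
	if (result != InvalidBlockNumber)
		return result;

	/* fall back to asking the storage manager directly */
	return smgrsw[reln->smgr_which].smgr_nblocks(reln, forknum);
}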

Greetings,

Andres Freund


