Re: Cache relation sizes? - Mailing list pgsql-hackers

From:           Andres Freund
Subject:        Re: Cache relation sizes?
Msg-id:         20201230043536.fsqngrit2kl2mlue@alap3.anarazel.de
In response to: Re: Cache relation sizes? (Thomas Munro <thomas.munro@gmail.com>)
List:           pgsql-hackers
Hi,

On 2020-11-19 18:01:14 +1300, Thomas Munro wrote:
> From ac3c61926bf947a3288724bd02cf8439ff5c14bc Mon Sep 17 00:00:00 2001
> From: Thomas Munro <thomas.munro@gmail.com>
> Date: Fri, 13 Nov 2020 14:38:41 +1300
> Subject: [PATCH v2 1/2] WIP: Track relation sizes in shared memory.
>
> Introduce a fixed size pool of SMgrSharedRelation objects. A new GUC
> smgr_shared_relation controls how many can exist at once, and they are
> evicted as required.

As noted over IM, I think it'd be good if we could avoid the need for a
new GUC here and instead derive it from other settings. E.g.
max_files_per_process * MaxBackends, or maybe
max_files_per_process * log(MaxBackends) (on the basis that it's
unlikely that each backend uses completely separate files). Given the
size of the structs, it seems that the memory overhead of this should
be acceptable?

> XXX smgrimmedsync() is maybe too blunt an instrument?
>
> XXX perhaps mdsyncfiletag should tell smgr.c when it cleans forks via
> some new interface, but it doesn't actually know if it's cleaning a fork
> that extended a relation...
>
> XXX perhaps bgwriter should try to clean them?

As I suggested on the same call, I think it might make sense to
integrate this with the checkpointer sync queue. Have checkpointer try
to keep a certain fraction of the entries clean (or even make them
unused?) by processing all the sync requests for that relfilenode.
That'd allow us to a) only sync segments we actually need to sync,
b) sync those segments again at the end of the checkpoint.

Another use-case I have is that I'd like to make file extension better
than it currently is. There are two main challenges:

- Having bulk extension add pages to the FSM isn't actually a good
  solution, as that has a high likelihood of multiple backends ending
  up trying to insert into the same page.
  If we had, in SMgrSharedRelation, a 'physical' relation size and a
  'logical' relation size, we could do better, and hand out
  pre-allocated pages to exactly one backend.

- Doing bulk extension when there's no other space left leads to large
  pile-ups of processes needing the extension lock -> a large latency
  increase. Instead we should opportunistically start to do extension
  when we've used up most of the bulk-allocated space.

> +/*
> + * An entry in the hash table that allows us to look up objects in the
> + * SMgrSharedRelation pool by rnode (+ backend).
> + */
> +typedef struct SMgrSharedRelationMapping
> +{
> +    RelFileNodeBackend rnode;
> +    int         index;
> +} SMgrSharedRelationMapping;

Maybe add a note that index points into SMgrSharedRelationPool->objects[]?
And why you chose not to store SMgrSharedRelation directly in the
hashtable (makes sense to me, but still worth commenting on imo).

> +/* Flags. */
> +#define SR_LOCKED                   0x01
> +#define SR_VALID                    0x02
> +
> +/* Each forknum gets its own dirty, syncing and just dirtied bits. */
> +#define SR_DIRTY(forknum)           (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 0))
> +#define SR_SYNCING(forknum)         (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 1))
> +#define SR_JUST_DIRTIED(forknum)    (0x04 << ((forknum) + (MAX_FORKNUM + 1) * 2))

Hm - do we really want all that stored in a single flag value?

> +/*
> + * Lock a SMgrSharedRelation. The lock is a spinlock that should be held for
> + * only a few instructions. The return value is the current set of flags,
> + * which may be modified and then passed to smgr_unlock_sr() to be applied
> + * atomically when the lock is released.
> + */
> +static uint32
> +smgr_lock_sr(SMgrSharedRelation *sr)
> +{
> +    for (;;)
> +    {
> +        uint32  old_flags = pg_atomic_read_u32(&sr->flags);
> +        uint32  flags;
> +
> +        if (!(old_flags & SR_LOCKED))
> +        {
> +            flags = old_flags | SR_LOCKED;
> +            if (pg_atomic_compare_exchange_u32(&sr->flags, &old_flags, flags))
> +                return flags;
> +        }
> +    }
> +    return 0; /* unreachable */
> +}

Since this is a spinlock, you should use the spin delay infrastructure
(c.f. LWLockWaitListLock()). Otherwise it'll react terribly if there
ever ends up being contention.

What's the reason not to use a per-entry lwlock instead?

Naming: I find it weird to have smgr_ as a prefix and _sr as a postfix.
I'd go for smgr_sr_$op.

> +/*
> + * Unlock a SMgrSharedRelation, atomically updating its flags at the same
> + * time.
> + */
> +static void
> +smgr_unlock_sr(SMgrSharedRelation *sr, uint32 flags)
> +{
> +    pg_write_barrier();
> +    pg_atomic_write_u32(&sr->flags, flags & ~SR_LOCKED);
> +}

This assumes there never are atomic modifications of flags without
holding the spinlock (or a cmpxchg loop testing whether somebody is
holding the lock). Is that right / good?

> +/*
> + * Allocate a new invalid SMgrSharedRelation, and return it locked.
> + *
> + * The replacement algorithm is a simple FIFO design with no second chance for
> + * now.
> + */
> +static SMgrSharedRelation *
> +smgr_alloc_sr(void)
> +{
> +    SMgrSharedRelationMapping *mapping;
> +    SMgrSharedRelation *sr;
> +    uint32      index;
> +    LWLock     *mapping_lock;
> +    uint32      flags;
> +    RelFileNodeBackend rnode;
> +    uint32      hash;
> +
> +retry:
> +    /* Lock the next one in clock-hand order. */
> +    index = pg_atomic_fetch_add_u32(&sr_pool->next, 1) % smgr_shared_relations;
> +    sr = &sr_pool->objects[index];
> +    flags = smgr_lock_sr(sr);

Modulo is pretty expensive - I'd force smgr_shared_relations to be a
power of two and then do a & with an smgr_shared_relations_mask
initialized to (smgr_shared_relations - 1).
> +    /* If it's unused, can return it, still locked, immediately. */
> +    if (!(flags & SR_VALID))
> +        return sr;
> +
> +    /*
> +     * Copy the rnode and unlock. We'll briefly acquire both mapping and SR
> +     * locks, but we need to do it in that order, so we'll unlock the SR
> +     * first.
> +     */
> +    rnode = sr->rnode;
> +    smgr_unlock_sr(sr, flags);
> +
> +    hash = get_hash_value(sr_mapping_table, &rnode);
> +    mapping_lock = SR_PARTITION_LOCK(hash);
> +
> +    LWLockAcquire(mapping_lock, LW_EXCLUSIVE);
> +    mapping = hash_search_with_hash_value(sr_mapping_table,
> +                                          &rnode,
> +                                          hash,
> +                                          HASH_FIND,
> +                                          NULL);
> +    if (!mapping || mapping->index != index)
> +    {
> +        /* Too slow, it's gone or now points somewhere else. Go around. */
> +        LWLockRelease(mapping_lock);
> +        goto retry;
> +    }
> +
> +    /* We will lock the SR for just a few instructions. */
> +    flags = smgr_lock_sr(sr);
> +    Assert(flags & SR_VALID);

So the replacement here is purely based on when the relation was
accessed first, not when it was accessed last. Probably good enough,
but seems worth a comment.

> 3. The nblocks value must be free enough for scans, extension,
> truncatation and dropping buffers, because the those operations all
> executed memory barriers when it acquired a snapshot to scan (which
> doesn't need to see blocks added after that) or an exclusive heavyweight
> lock to extend, truncate or drop. XXX right?

"free enough"? I am not sure this holds for non-mvcc scans?

> @@ -699,7 +748,7 @@ smgrclearowner(SMgrRelation *owner, SMgrRelation reln)
>  bool
>  smgrexists(SMgrRelation reln, ForkNumber forknum)
>  {
> -    if (smgrnblocks_shared(reln, forknum) != InvalidBlockNumber)
> +    if (smgrnblocks_fast(reln, forknum) != InvalidBlockNumber)
>          return true;

ISTM that all these places referencing _fast are leaking an
implementation detail (but not very far) - it seems like that should be
an implementation detail of smgrnblocks(), which can then call out to
_fast/_shared or such.

Greetings,

Andres Freund