Re: [HACKERS] POC: Sharing record typmods between backends - Mailing list pgsql-hackers
From | Andres Freund
---|---
Subject | Re: [HACKERS] POC: Sharing record typmods between backends
Date |
Msg-id | 20170725100906.rroaddzbhxge7ubt@alap3.anarazel.de
In response to | Re: [HACKERS] POC: Sharing record typmods between backends (Thomas Munro <thomas.munro@enterprisedb.com>)
List | pgsql-hackers
On 2017-07-10 21:39:09 +1200, Thomas Munro wrote:
> Here's a new version that introduces a per-session DSM segment to hold
> the shared record typmod registry (and maybe more things later).

You like to switch it up. *.patchset.tgz??? ;)

It does concern me that we're growing yet another somewhat different
hashtable implementation. Yet I don't quite see how we could avoid it.
dynahash relies on proper pointers, simplehash doesn't do locking (and
shouldn't) and also relies on pointers, although to a much lesser
degree. All the open coded tables aren't a good match either. So I
don't quite see an alternative, but I'd love one.

Regards,

Andres

diff --git a/src/backend/lib/dht.c b/src/backend/lib/dht.c
new file mode 100644
index 00000000000..2fec70f7742
--- /dev/null
+++ b/src/backend/lib/dht.c

FWIW, not a big fan of dht as a filename (nor of dsa.c). For one, DHT
usually refers to distributed hash tables, which this is not; for
another, the abbreviation is so short it's not immediately
understandable, and likely to conflict further. I think it'd possibly be
ok to have dht as a symbol prefix, but rename the file to something
longer.

+ * To deal with currency, it has a fixed size set of partitions, each of which
+ * is independently locked.

s/currency/concurrency/ I presume.

+ * Each bucket maps to a partition; so insert, find
+ * and iterate operations normally only acquire one lock.  Therefore, good
+ * concurrency is achieved whenever they don't collide at the lock partition

s/they/operations/?

+ * level.  However, when a resize operation begins, all partition locks must
+ * be acquired simultaneously for a brief period.  This is only expected to
+ * happen a small number of times until a stable size is found, since growth is
+ * geometric.

I'm a bit doubtful that we need partitioning at this point, and that it
doesn't actually *degrade* performance for your typmod case.

+ * Resizing is done incrementally so that no individual insert operation pays
+ * for the potentially large cost of splitting all buckets.

I'm not sure this is a reasonable tradeoff for the use-case suggested so
far; it doesn't exactly make things simpler. We're not going to grow
much.

+/* The opaque type used for tracking iterator state. */
+struct dht_iterator;
+typedef struct dht_iterator dht_iterator;

Isn't it actually the iterator state, rather than something tracking it?
Also, why is it opaque given you're actually defining it below? Guess
you'd moved it at some point.

+/*
+ * The set of parameters needed to create or attach to a hash table.  The
+ * members tranche_id and tranche_name do not need to be initialized when
+ * attaching to an existing hash table.
+ */
+typedef struct
+{
+	Size		key_size;		/* Size of the key (initial bytes of entry) */
+	Size		entry_size;		/* Total size of entry */

Let's use size_t, like we kind of concluded in the thread you started:
http://archives.postgresql.org/message-id/25076.1489699457%40sss.pgh.pa.us :)

+	dht_compare_function compare_function;	/* Compare function */
+	dht_hash_function hash_function;	/* Hash function */

Might be worth explaining that these need to be provided when attaching,
because they're possibly process local. Did you test this with
EXEC_BACKEND?

+	int			tranche_id;		/* The tranche ID to use for locks. */
+} dht_parameters;
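To check my understanding of the intended lifecycle, I'd expect caller
code roughly like the following (untested sketch; MyEntry, my_hash,
my_compare and dht_attach's exact signature are inventions of mine, not
quoted from the patch):

	typedef struct
	{
		uint32		key;		/* initial bytes of the entry are the key */
		uint32		value;
	} MyEntry;

	static dht_hash my_hash(const void *key, Size size);
	static int	my_compare(const void *a, const void *b, Size size);

	dht_parameters params;
	dht_hash_table *table;

	memset(&params, 0, sizeof(params));
	params.key_size = sizeof(uint32);
	params.entry_size = sizeof(MyEntry);
	params.compare_function = my_compare;
	params.hash_function = my_hash;
	params.tranche_id = LWLockNewTrancheId();

	/* creating backend */
	table = dht_create(area, &params);

	/*
	 * Attaching backend: compare/hash have to be supplied again, because
	 * function pointers are process local (and will certainly differ
	 * under EXEC_BACKEND).  How the handle travels between backends is
	 * elided here.
	 */
	table = dht_attach(area, &params, handle);

That double supply of the function pointers is exactly why I'm asking
about EXEC_BACKEND above.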
+struct dht_iterator
+{
+	dht_hash_table *hash_table;	/* The hash table we are iterating over. */
+	bool		exclusive;		/* Whether to lock buckets exclusively. */
+	Size		partition;		/* The index of the next partition to visit. */
+	Size		bucket;			/* The index of the next bucket to visit. */
+	dht_hash_table_item *item;	/* The most recently returned item. */
+	dsa_pointer last_item_pointer;	/* The last item visited. */
+	Size		table_size_log2;	/* The table size when we started iterating. */
+	bool		locked;			/* Whether the current partition is locked. */

Haven't gotten to the actual code yet, but this kinda suggests we leave
a partition locked while iterating? Hm, that seems likely to result in a
fair bit of pain...

+/* Iterating over the whole hash table. */
+extern void dht_iterate_begin(dht_hash_table *hash_table,
+				  dht_iterator *iterator, bool exclusive);
+extern void *dht_iterate_next(dht_iterator *iterator);
+extern void dht_iterate_delete(dht_iterator *iterator);

s/delete/delete_current/? Otherwise it looks like it's part of
manipulating just the iterator.

+extern void dht_iterate_release(dht_iterator *iterator);

I'd add lock to the name.

+/*
+ * An item in the hash table.  This wraps the user's entry object in an
+ * envelop that holds a pointer back to the bucket and a pointer to the next
+ * item in the bucket.
+ */
+struct dht_hash_table_item
+{
+	/* The hashed key, to avoid having to recompute it. */
+	dht_hash	hash;
+	/* The next item in the same bucket. */
+	dsa_pointer next;
+	/* The user's entry object follows here. */
+	char		entry[FLEXIBLE_ARRAY_MEMBER];

What's the point of using FLEXIBLE_ARRAY_MEMBER here? And isn't using a
char going to lead to alignment problems?

+/* The number of partitions for locking purposes. */
+#define DHT_NUM_PARTITIONS_LOG2 7

Could use some justification.

+/*
+ * The head object for a hash table.  This will be stored in dynamic shared
+ * memory.
+ */
+typedef struct
+{

Why anonymous? Not that it hurts much, but it seems weird to deviate
just here.

+/*
+ * Create a new hash table backed by the given dynamic shared area, with the
+ * given parameters.
+ */
+dht_hash_table *
+dht_create(dsa_area *area, const dht_parameters *params)
+{
+	dht_hash_table *hash_table;
+	dsa_pointer control;
+
+	/* Allocate the backend-local object representing the hash table. */
+	hash_table = palloc(sizeof(dht_hash_table));

Should be documented that this uses the caller's MemoryContext.

+	/* Set up the array of lock partitions. */
+	{
+		int			i;
+
+		for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
+		{
+			LWLockInitialize(PARTITION_LOCK(hash_table, i),
+							 hash_table->control->lwlock_tranche_id);
+			hash_table->control->partitions[i].count = 0;
+		}

I'd store hash_table->control->lwlock_tranche_id and partitions[i] in
local vars. Possibly hash_table->control too.

+/*
+ * Detach from a hash table.  This frees backend-local resources associated
+ * with the hash table, but the hash table will continue to exist until it is
+ * either explicitly destroyed (by a backend that is still attached to it), or
+ * the area that backs it is returned to the operating system.
+ */
+void
+dht_detach(dht_hash_table *hash_table)
+{
+	/* The hash table may have been destroyed.  Just free local memory. */
+	pfree(hash_table);
+}

Somewhat inclined to add a debugging refcount - seems like bugs around
that might be annoying to find. Maybe also add an assert ensuring that
no locks are held?
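Concretely, something like this is what I have in mind (sketch only;
the "refcount" field doesn't exist in the patch, I'm inventing it here
and presuming create/attach would increment it):

	void
	dht_detach(dht_hash_table *hash_table)
	{
	#ifdef USE_ASSERT_CHECKING
		int			i;

		/* no partition lock may still be held by the detaching backend */
		for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
			Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));

		/* catch attach/detach imbalances (invented field, see above) */
		Assert(pg_atomic_fetch_sub_u32(&hash_table->control->refcount, 1) > 0);
	#endif

		/* The hash table may have been destroyed.  Just free local memory. */
		pfree(hash_table);
	}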
+/*
+ * Look up an entry, given a key.  Returns a pointer to an entry if one can be
+ * found with the given key.  Returns NULL if the key is not found.  If a
+ * non-NULL value is returned, the entry is locked and must be released by
+ * calling dht_release.  If an error is raised before dht_release is called,
+ * the lock will be released automatically, but the caller must take care to
+ * ensure that the entry is not left corrupted.  The lock mode is either
+ * shared or exclusive depending on 'exclusive'.

This API seems a bit fragile.

+/*
+ * Returns a pointer to an exclusively locked item which must be released with
+ * dht_release.  If the key is found in the hash table, 'found' is set to true
+ * and a pointer to the existing entry is returned.  If the key is not found,
+ * 'found' is set to false, and a pointer to a newly created entry is related.

"is related"?

+ */
+void *
+dht_find_or_insert(dht_hash_table *hash_table,
+				   const void *key,
+				   bool *found)
+{
+	size_t		hash;
+	size_t		partition_index;
+	dht_partition *partition;
+	dht_hash_table_item *item;
+
+	hash = hash_table->params.hash_function(key, hash_table->params.key_size);
+	partition_index = PARTITION_FOR_HASH(hash);
+	partition = &hash_table->control->partitions[partition_index];
+
+	Assert(hash_table->control->magic == DHT_MAGIC);
+	Assert(!hash_table->exclusively_locked);

Why just exclusively locked? Why would shared be ok?

+/*
+ * Unlock an entry which was locked by dht_find or dht_find_or_insert.
+ */
+void
+dht_release(dht_hash_table *hash_table, void *entry)
+{
+	dht_hash_table_item *item = ITEM_FROM_ENTRY(entry);
+	size_t		partition_index = PARTITION_FOR_HASH(item->hash);
+	bool		deferred_resize_work = false;
+
+	Assert(hash_table->control->magic == DHT_MAGIC);

Assert that the lock is held (LWLockHeldByMe()).
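To spell out the fragility: I'd expect the canonical lookup-or-create
pattern to look about like this (sketch; MyEntry and the key value are
placeholders of mine):

	uint32		key = 42;
	bool		found;
	MyEntry    *entry;

	entry = (MyEntry *) dht_find_or_insert(table, &key, &found);
	if (!found)
	{
		/*
		 * If we ERROR out before this initialization, the lock is
		 * released automatically, but a half-initialized entry stays
		 * visible to everyone else - that's the fragile part.
		 */
		entry->value = 0;
	}
	dht_release(table, entry);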
+/*
+ * Begin iterating through the whole hash table.  The caller must supply a
+ * dht_iterator object, which can then be used to call dht_iterate_next to get
+ * values until the end is reached.
+ */
+void
+dht_iterate_begin(dht_hash_table *hash_table,
+				  dht_iterator *iterator,
+				  bool exclusive)
+{
+	Assert(hash_table->control->magic == DHT_MAGIC);
+
+	iterator->hash_table = hash_table;
+	iterator->exclusive = exclusive;
+	iterator->partition = 0;
+	iterator->bucket = 0;
+	iterator->item = NULL;
+	iterator->last_item_pointer = InvalidDsaPointer;
+	iterator->locked = false;
+
+	/* Snapshot the size (arbitrary lock to prevent size changing). */
+	LWLockAcquire(PARTITION_LOCK(hash_table, 0), LW_SHARED);
+	ensure_valid_bucket_pointers(hash_table);
+	iterator->table_size_log2 = hash_table->size_log2;
+	LWLockRelease(PARTITION_LOCK(hash_table, 0));

Hm. So we're introducing some additional contention on partition 0 -
probably ok.

+/*
+ * Move to the next item in the hash table.  Returns a pointer to an entry, or
+ * NULL if the end of the hash table has been reached.  The item is locked in
+ * exclusive or shared mode depending on the argument given to
+ * dht_iterate_begin.  The caller can optionally release the lock by calling
+ * dht_iterate_release, and then call dht_iterate_next again to move to the
+ * next entry.  If the iteration is in exclusive mode, client code can also
+ * call dht_iterate_delete.  When the end of the hash table is reached, or at
+ * any time, the client may call dht_iterate_end to abandon iteration.
+ */

I'd just shorten the end to "at any time the client may call
dht_iterate_end to ...".

+/*
+ * Release the most recently obtained lock.  This can optionally be called in
+ * between calls to dht_iterator_next to allow other processes to access the
+ * same partition of the hash table.
+ */
+void
+dht_iterate_release(dht_iterator *iterator)
+{
+	Assert(iterator->locked);
+	LWLockRelease(PARTITION_LOCK(iterator->hash_table, iterator->partition));
+	iterator->locked = false;
+}
+
+/*
+ * Terminate iteration.  This must be called after iteration completes,
+ * whether or not the end was reached.  The iterator object may then be reused
+ * for another iteration.
+ */
+void
+dht_iterate_end(dht_iterator *iterator)
+{
+	Assert(iterator->hash_table->control->magic == DHT_MAGIC);
+	if (iterator->locked)
+		LWLockRelease(PARTITION_LOCK(iterator->hash_table,
+									 iterator->partition));
+}
+
+/*
+ * Print out debugging information about the internal state of the hash table.
+ */
+void
+dht_dump(dht_hash_table *hash_table)
+{
+	size_t		i;
+	size_t		j;
+
+	Assert(hash_table->control->magic == DHT_MAGIC);
+
+	for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
+		LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);

Should probably assert & document that no locks are held - otherwise
there are going to be ugly deadlocks. And that's an unlikely thing to
try.

+	ensure_valid_bucket_pointers(hash_table);
+
+	fprintf(stderr,
+			"hash table size = %zu\n", (size_t) 1 << hash_table->size_log2);
+	for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
+	{
+		dht_partition *partition = &hash_table->control->partitions[i];
+		size_t		begin = BUCKET_INDEX_FOR_PARTITION(i, hash_table->size_log2);
+		size_t		end = BUCKET_INDEX_FOR_PARTITION(i + 1, hash_table->size_log2);
+
+		fprintf(stderr, "  partition %zu\n", i);
+		fprintf(stderr,
+				"    active buckets (key count = %zu)\n", partition->count);
+
+		for (j = begin; j < end; ++j)
+		{
+			size_t		count = 0;
+			dsa_pointer bucket = hash_table->buckets[j];
+
+			while (DsaPointerIsValid(bucket))
+			{
+				dht_hash_table_item *item;
+
+				item = dsa_get_address(hash_table->area, bucket);
+
+				bucket = item->next;
+				++count;
+			}
+			fprintf(stderr, "      bucket %zu (key count = %zu)\n", j, count);
+		}
+		if (RESIZE_IN_PROGRESS(hash_table))
+		{
+			size_t		begin;
+			size_t		end;
+
+			begin = BUCKET_INDEX_FOR_PARTITION(i, hash_table->size_log2 - 1);
+			end = BUCKET_INDEX_FOR_PARTITION(i + 1,
+											 hash_table->size_log2 - 1);
+
+			fprintf(stderr, "    old buckets (key count = %zu)\n",
+					partition->old_count);
+
+			for (j = begin; j < end; ++j)
+			{
+				size_t		count = 0;
+				dsa_pointer bucket = hash_table->old_buckets[j];
+
+				while (DsaPointerIsValid(bucket))
+				{
+					dht_hash_table_item *item;
+
+					item = dsa_get_address(hash_table->area, bucket);
+
+					bucket = item->next;
+					++count;
+				}
+				fprintf(stderr,
+						"      bucket %zu (key count = %zu)\n", j, count);
+			}
+		}
+	}
+
+	for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
+		LWLockRelease(PARTITION_LOCK(hash_table, i));
+}

I'd put this below the actual "production" code.

- Andres