From: Andres Freund
Subject: Re: [HACKERS] POC: Sharing record typmods between backends
Msg-id: 20170725100906.rroaddzbhxge7ubt@alap3.anarazel.de
In response to: Re: [HACKERS] POC: Sharing record typmods between backends  (Thomas Munro <thomas.munro@enterprisedb.com>)
List: pgsql-hackers
On 2017-07-10 21:39:09 +1200, Thomas Munro wrote:
> Here's a new version that introduces a per-session DSM segment to hold
> the shared record typmod registry (and maybe more things later).

You like to switch it up. *.patchset.tgz??? ;)


It does concern me that we're growing yet another somewhat different
hashtable implementation. Yet I don't quite see how we could avoid
it. dynahash relies on proper pointers, simplehash doesn't do locking
(and shouldn't) and also relies on pointers, although to a much lesser
degree.  None of the open-coded tables is a good match either.  So I
don't quite see an alternative, but I'd love one.


Regards,

Andres

diff --git a/src/backend/lib/dht.c b/src/backend/lib/dht.c
new file mode 100644
index 00000000000..2fec70f7742
--- /dev/null
+++ b/src/backend/lib/dht.c

FWIW, not a big fan of dht as a filename (nor of dsa.c). For one, DHT
usually refers to distributed hash tables, which this is not, and for
another, the abbreviation is so short it's not immediately
understandable, and likely to conflict further.  I think it'd possibly be
ok to keep dht as a symbol prefix, but the file should be renamed to
something longer.

+ * To deal with currency, it has a fixed size set of partitions, each of which
+ * is independently locked.

s/currency/concurrency/ I presume.


+ * Each bucket maps to a partition; so insert, find
+ * and iterate operations normally only acquire one lock.  Therefore, good
+ * concurrency is achieved whenever they don't collide at the lock partition

s/they/operations/?


+ * level.  However, when a resize operation begins, all partition locks must
+ * be acquired simultaneously for a brief period.  This is only expected to
+ * happen a small number of times until a stable size is found, since growth is
+ * geometric.

I'm a bit doubtful that we need partitioning at this point, and I
suspect it actually *degrades* performance for your typmod case.
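
To spell out my understanding of the scheme for anyone following along
(the macro bodies here are my guesses, not lifted from the patch):

    #define DHT_NUM_PARTITIONS      (1 << DHT_NUM_PARTITIONS_LOG2)
    #define PARTITION_FOR_HASH(hash)    ((hash) % DHT_NUM_PARTITIONS)

    /* a find then only needs the one partition lock: */
    size_t      partition_index = PARTITION_FOR_HASH(hash);

    LWLockAcquire(PARTITION_LOCK(hash_table, partition_index), LW_SHARED);
    /* ... walk the bucket chain for 'hash' ... */
    LWLockRelease(PARTITION_LOCK(hash_table, partition_index));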


+ * Resizing is done incrementally so that no individual insert operation pays
+ * for the potentially large cost of splitting all buckets.

I'm not sure this is a reasonable tradeoff for the use-case suggested so
far; it doesn't exactly make things simpler, and we're not going to grow
much.


+/* The opaque type used for tracking iterator state. */
+struct dht_iterator;
+typedef struct dht_iterator dht_iterator;

Isn't it actually the iterator state itself, rather than something for
tracking it? Also, why is it declared opaque given you're actually
defining it below? Guess you'd moved it at some point.


+/*
+ * The set of parameters needed to create or attach to a hash table.  The
+ * members tranche_id and tranche_name do not need to be initialized when
+ * attaching to an existing hash table.
+ */
+typedef struct
+{
+    Size key_size;                /* Size of the key (initial bytes of entry) */
+    Size entry_size;            /* Total size of entry */

Let's use size_t, like we kind of concluded in the thread you started:
http://archives.postgresql.org/message-id/25076.1489699457%40sss.pgh.pa.us
:)

+    dht_compare_function compare_function;    /* Compare function */
+    dht_hash_function hash_function;    /* Hash function */

Might be worth explaining that these need to be provided when attaching
because they're process-local (sketch below). Did you test this with
EXEC_BACKEND?

+    int tranche_id;                /* The tranche ID to use for locks. */
+} dht_parameters;
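
Roughly, each attaching backend would have to do something like the
following (dht_attach's signature and the My*/my_* names are made up for
illustration):

    dht_parameters params;
    dht_hash_table *hash_table;

    params.key_size = sizeof(MyKey);
    params.entry_size = sizeof(MyEntry);
    params.compare_function = my_compare;   /* addresses only valid ... */
    params.hash_function = my_hash;         /* ... in this process */

    hash_table = dht_attach(area, &params, handle);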


+struct dht_iterator
+{
+    dht_hash_table *hash_table;    /* The hash table we are iterating over. */
+    bool exclusive;                /* Whether to lock buckets exclusively. */
+    Size partition;                /* The index of the next partition to visit. */
+    Size bucket;                /* The index of the next bucket to visit. */
+    dht_hash_table_item *item;  /* The most recently returned item. */
+    dsa_pointer last_item_pointer; /* The last item visited. */
+    Size table_size_log2;     /* The table size when we started iterating. */
+    bool locked;            /* Whether the current partition is locked. */

Haven't gotten to the actual code yet, but this kinda suggests we leave a
partition locked when iterating? Hm, that seems likely to result in a
fair bit of pain...


+/* Iterating over the whole hash table. */
+extern void dht_iterate_begin(dht_hash_table *hash_table,
+                              dht_iterator *iterator, bool exclusive);
+extern void *dht_iterate_next(dht_iterator *iterator);
+extern void dht_iterate_delete(dht_iterator *iterator);

s/delete/delete_current/? Otherwise it looks like it's part of
manipulating just the iterator.

+extern void dht_iterate_release(dht_iterator *iterator);

I'd add lock to the name.
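
For context, the usage pattern this API implies looks roughly like this
(MyEntry and process_entry are placeholders):

    dht_iterator iterator;
    MyEntry    *entry;

    dht_iterate_begin(hash_table, &iterator, false);
    while ((entry = dht_iterate_next(&iterator)) != NULL)
    {
        /* the partition containing 'entry' is locked here */
        process_entry(entry);

        /* optionally unlock between entries to reduce contention */
        dht_iterate_release(&iterator);
    }
    dht_iterate_end(&iterator);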


+/*
+ * An item in the hash table.  This wraps the user's entry object in an
+ * envelop that holds a pointer back to the bucket and a pointer to the next
+ * item in the bucket.
+ */
+struct dht_hash_table_item
+{
+    /* The hashed key, to avoid having to recompute it. */
+    dht_hash    hash;
+    /* The next item in the same bucket. */
+    dsa_pointer next;
+    /* The user's entry object follows here. */
+    char        entry[FLEXIBLE_ARRAY_MEMBER];

What's the point of using FLEXIBLE_ARRAY_MEMBER here? And isn't using a
char going to lead to alignment problems?
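
One way to sidestep both questions would be to drop the flexible array
member entirely and compute a maxaligned pointer to the entry, e.g.:

    struct dht_hash_table_item
    {
        /* The hashed key, to avoid having to recompute it. */
        dht_hash    hash;
        /* The next item in the same bucket. */
        dsa_pointer next;
        /* The user's entry object follows at the next maxaligned offset. */
    };

    #define ENTRY_FROM_ITEM(item) \
        ((char *) (item) + MAXALIGN(sizeof(dht_hash_table_item)))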


+/* The number of partitions for locking purposes. */
+#define DHT_NUM_PARTITIONS_LOG2 7

Could use some justification.


+/*
+ * The head object for a hash table.  This will be stored in dynamic shared
+ * memory.
+ */
+typedef struct
+{

Why anonymous? Not that it hurts much, but seems weird to deviate just
here.



+/*
+ * Create a new hash table backed by the given dynamic shared area, with the
+ * given parameters.
+ */
+dht_hash_table *
+dht_create(dsa_area *area, const dht_parameters *params)
+{
+    dht_hash_table *hash_table;
+    dsa_pointer control;
+
+    /* Allocate the backend-local object representing the hash table. */
+    hash_table = palloc(sizeof(dht_hash_table));

Should be documented that this uses caller's MemoryContext.


+    /* Set up the array of lock partitions. */
+    {
+        int            i;
+
+        for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
+        {
+            LWLockInitialize(PARTITION_LOCK(hash_table, i),
+                             hash_table->control->lwlock_tranche_id);
+            hash_table->control->partitions[i].count = 0;
+        }

I'd store hash_table->control->lwlock_tranche_id and partitions[i] in
local vars. Possibly hash_table->control too.
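
I.e. something like this (assuming the control struct gets a name, say
dht_control, and that the partition's lock member is called 'lock'):

    dht_control *control = hash_table->control;
    int         tranche_id = control->lwlock_tranche_id;
    int         i;

    for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
    {
        dht_partition *partition = &control->partitions[i];

        LWLockInitialize(&partition->lock, tranche_id);
        partition->count = 0;
    }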


+/*
+ * Detach from a hash table.  This frees backend-local resources associated
+ * with the hash table, but the hash table will continue to exist until it is
+ * either explicitly destroyed (by a backend that is still attached to it), or
+ * the area that backs it is returned to the operating system.
+ */
+void
+dht_detach(dht_hash_table *hash_table)
+{
+    /* The hash table may have been destroyed.  Just free local memory. */
+    pfree(hash_table);
+}

Somewhat inclined to add a debugging refcount - seems like bugs around
that might be annoying to find. Maybe also add an assert ensuring that
no locks are held?
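
The no-locks-held assert could be as simple as (sketch):

    void
    dht_detach(dht_hash_table *hash_table)
    {
    #ifdef USE_ASSERT_CHECKING
        int         i;

        /* catch callers leaking a partition lock across detach */
        for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
            Assert(!LWLockHeldByMe(PARTITION_LOCK(hash_table, i)));
    #endif

        /* The hash table may have been destroyed.  Just free local memory. */
        pfree(hash_table);
    }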


+/*
+ * Look up an entry, given a key.  Returns a pointer to an entry if one can be
+ * found with the given key.  Returns NULL if the key is not found.  If a
+ * non-NULL value is returned, the entry is locked and must be released by
+ * calling dht_release.  If an error is raised before dht_release is called,
+ * the lock will be released automatically, but the caller must take care to
+ * ensure that the entry is not left corrupted.  The lock mode is either
+ * shared or exclusive depending on 'exclusive'.

This API seems a bit fragile.


+/*
+ * Returns a pointer to an exclusively locked item which must be released with
+ * dht_release.  If the key is found in the hash table, 'found' is set to true
+ * and a pointer to the existing entry is returned.  If the key is not found,
+ * 'found' is set to false, and a pointer to a newly created entry is related.

"is related"?


+ */
+void *
+dht_find_or_insert(dht_hash_table *hash_table,
+                   const void *key,
+                   bool *found)
+{
+    size_t        hash;
+    size_t        partition_index;
+    dht_partition *partition;
+    dht_hash_table_item *item;
+
+    hash = hash_table->params.hash_function(key, hash_table->params.key_size);
+    partition_index = PARTITION_FOR_HASH(hash);
+    partition = &hash_table->control->partitions[partition_index];
+
+    Assert(hash_table->control->magic == DHT_MAGIC);
+    Assert(!hash_table->exclusively_locked);

Why just exclusively locked? Why would shared be ok?


+/*
+ * Unlock an entry which was locked by dht_find or dht_find_or_insert.
+ */
+void
+dht_release(dht_hash_table *hash_table, void *entry)
+{
+    dht_hash_table_item *item = ITEM_FROM_ENTRY(entry);
+    size_t        partition_index = PARTITION_FOR_HASH(item->hash);
+    bool        deferred_resize_work = false;
+
+    Assert(hash_table->control->magic == DHT_MAGIC);

Assert lock held (LWLockHeldByMe())
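
I.e., next to the magic assert, something like:

    Assert(LWLockHeldByMe(PARTITION_LOCK(hash_table, partition_index)));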

+/*
+ * Begin iterating through the whole hash table.  The caller must supply a
+ * dht_iterator object, which can then be used to call dht_iterate_next to get
+ * values until the end is reached.
+ */
+void
+dht_iterate_begin(dht_hash_table *hash_table,
+                  dht_iterator *iterator,
+                  bool exclusive)
+{
+    Assert(hash_table->control->magic == DHT_MAGIC);
+
+    iterator->hash_table = hash_table;
+    iterator->exclusive = exclusive;
+    iterator->partition = 0;
+    iterator->bucket = 0;
+    iterator->item = NULL;
+    iterator->last_item_pointer = InvalidDsaPointer;
+    iterator->locked = false;
+
+    /* Snapshot the size (arbitrary lock to prevent size changing). */
+    LWLockAcquire(PARTITION_LOCK(hash_table, 0), LW_SHARED);
+    ensure_valid_bucket_pointers(hash_table);
+    iterator->table_size_log2 = hash_table->size_log2;
+    LWLockRelease(PARTITION_LOCK(hash_table, 0));

Hm. So we're introducing some additional contention on partition 0 -
probably ok.

+/*
+ * Move to the next item in the hash table.  Returns a pointer to an entry, or
+ * NULL if the end of the hash table has been reached.  The item is locked in
+ * exclusive or shared mode depending on the argument given to
+ * dht_iterate_begin.  The caller can optionally release the lock by calling
+ * dht_iterate_release, and then call dht_iterate_next again to move to the
+ * next entry.  If the iteration is in exclusive mode, client code can also
+ * call dht_iterate_delete.  When the end of the hash table is reached, or at
+ * any time, the client may call dht_iterate_end to abandon iteration.
+ */

I'd just shorten the end to "at any time the client may call
dht_iterate_end to ..."


+/*
+ * Release the most recently obtained lock.  This can optionally be called in
+ * between calls to dht_iterator_next to allow other processes to access the
+ * same partition of the hash table.
+ */
+void
+dht_iterate_release(dht_iterator *iterator)
+{
+    Assert(iterator->locked);
+    LWLockRelease(PARTITION_LOCK(iterator->hash_table, iterator->partition));
+    iterator->locked = false;
+}
+
+/*
+ * Terminate iteration.  This must be called after iteration completes,
+ * whether or not the end was reached.  The iterator object may then be reused
+ * for another iteration.
+ */
+void
+dht_iterate_end(dht_iterator *iterator)
+{
+    Assert(iterator->hash_table->control->magic == DHT_MAGIC);
+    if (iterator->locked)
+        LWLockRelease(PARTITION_LOCK(iterator->hash_table,
+                                     iterator->partition));
+}
+
+/*
+ * Print out debugging information about the internal state of the hash table.
+ */
+void
+dht_dump(dht_hash_table *hash_table)
+{
+    size_t        i;
+    size_t        j;
+
+    Assert(hash_table->control->magic == DHT_MAGIC);
+
+    for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
+        LWLockAcquire(PARTITION_LOCK(hash_table, i), LW_SHARED);

Should probably assert & document that no locks are held - otherwise
there are going to be ugly deadlocks. And that's an unlikely thing to try.


+    ensure_valid_bucket_pointers(hash_table);
+
+    fprintf(stderr,
+            "hash table size = %zu\n", (size_t) 1 << hash_table->size_log2);
+    for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
+    {
+        dht_partition *partition = &hash_table->control->partitions[i];
+        size_t        begin = BUCKET_INDEX_FOR_PARTITION(i, hash_table->size_log2);
+        size_t        end = BUCKET_INDEX_FOR_PARTITION(i + 1, hash_table->size_log2);
+
+        fprintf(stderr, "  partition %zu\n", i);
+        fprintf(stderr,
+                "    active buckets (key count = %zu)\n", partition->count);
+
+        for (j = begin; j < end; ++j)
+        {
+            size_t        count = 0;
+            dsa_pointer bucket = hash_table->buckets[j];
+
+            while (DsaPointerIsValid(bucket))
+            {
+                dht_hash_table_item *item;
+
+                item = dsa_get_address(hash_table->area, bucket);
+
+                bucket = item->next;
+                ++count;
+            }
+            fprintf(stderr, "      bucket %zu (key count = %zu)\n", j, count);
+        }
+        if (RESIZE_IN_PROGRESS(hash_table))
+        {
+            size_t        begin;
+            size_t        end;
+
+            begin = BUCKET_INDEX_FOR_PARTITION(i, hash_table->size_log2 - 1);
+            end = BUCKET_INDEX_FOR_PARTITION(i + 1,
+                                             hash_table->size_log2 - 1);
+
+            fprintf(stderr, "    old buckets (key count = %zu)\n",
+                    partition->old_count);
+
+            for (j = begin; j < end; ++j)
+            {
+                size_t        count = 0;
+                dsa_pointer bucket = hash_table->old_buckets[j];
+
+                while (DsaPointerIsValid(bucket))
+                {
+                    dht_hash_table_item *item;
+
+                    item = dsa_get_address(hash_table->area, bucket);
+
+                    bucket = item->next;
+                    ++count;
+                }
+                fprintf(stderr,
+                        "      bucket %zu (key count = %zu)\n", j, count);
+            }
+        }
+    }
+
+    for (i = 0; i < DHT_NUM_PARTITIONS; ++i)
+        LWLockRelease(PARTITION_LOCK(hash_table, i));
+}

I'd put this below actual "production" code.



- Andres


