Re: WIP: [[Parallel] Shared] Hash - Mailing list pgsql-hackers

From Andres Freund
Subject Re: WIP: [[Parallel] Shared] Hash
Date
Msg-id 20170326214726.sqk4zt2bbq5ofthd@alap3.anarazel.de
Whole thread Raw
In response to Re: [HACKERS] WIP: [[Parallel] Shared] Hash  (Thomas Munro <thomas.munro@enterprisedb.com>)
List pgsql-hackers
On 2017-03-23 20:35:09 +1300, Thomas Munro wrote:
> Here is a new patch series responding to feedback from Peter and Andres:

+
+/* Per-participant shared state. */
+typedef struct SharedTuplestoreParticipant
+{
+    LWLock lock;

Hm. No padding (ala LWLockMinimallyPadded / LWLockPadded) - but that's
probably ok, for now.




+    bool error;                    /* Error occurred flag. */
+    bool eof;                    /* End of file reached. */
+    int read_fileno;            /* BufFile segment file number. */
+    off_t read_offset;            /* Offset within segment file. */

Hm. I wonder if it'd not be better to work with 64bit offsets, and just
separate that out upon segment access.

+/* The main data structure in shared memory. */

"main data structure" isn't particularly meaningful.

+struct SharedTuplestore
+{
+    int reading_partition;
+    int nparticipants;
+    int flags;

Maybe add a comment saying /* flag bits from SHARED_TUPLESTORE_* */?

+    Size meta_data_size;

What's this?

+    SharedTuplestoreParticipant participants[FLEXIBLE_ARRAY_MEMBER];

I'd add a comment here, that there's further data after participants.

+};

+
+/* Per-participant backend-private state. */
+struct SharedTuplestoreAccessor
+{

Hm. The name and it being backend-local are a bit conflicting.

+    int participant;            /* My partitipant number. */
+    SharedTuplestore *sts;        /* The shared state. */
+    int nfiles;                    /* Size of local files array. */
+    BufFile **files;            /* Files we have open locally for writing. */

Shouldn't this mention that it's indexed by partition?

+    BufFile *read_file;            /* The current file to read from. */
+    int read_partition;            /* The current partition to read from. */
+    int read_participant;        /* The current participant to read from. */
+    int read_fileno;            /* BufFile segment file number. */
+    off_t read_offset;            /* Offset within segment file. */
+};


+/*
+ * Initialize a SharedTuplestore in existing shared memory.  There must be
+ * space for sts_size(participants) bytes.  If flags is set to the value
+ * SHARED_TUPLESTORE_SINGLE_PASS then each partition may only be read once,
+ * because underlying files will be deleted.

Any reason not to use flags that are compatible with tuplestore.c?

+ * Tuples that are stored may optionally carry a piece of fixed sized
+ * meta-data which will be retrieved along with the tuple.  This is useful for
+ * the hash codes used for multi-batch hash joins, but could have other
+ * applications.
+ */
+SharedTuplestoreAccessor *
+sts_initialize(SharedTuplestore *sts, int participants,
+               int my_participant_number,
+               Size meta_data_size,
+               int flags,
+               dsm_segment *segment)
+{

Not sure I like that the naming here has little in common with
tuplestore.h's api.


+
+MinimalTuple
+sts_gettuple(SharedTuplestoreAccessor *accessor, void *meta_data)
+{

This needs docs.

+    SharedBufFileSet *fileset = GetSharedBufFileSet(accessor->sts);
+    MinimalTuple tuple = NULL;
+
+    for (;;)
+    {

...
+        /* Check if this participant's file has already been entirely read. */
+        if (participant->eof)
+        {
+            BufFileClose(accessor->read_file);
+            accessor->read_file = NULL;
+            LWLockRelease(&participant->lock);
+            continue;

Why are we closing the file while holding the lock?

+
+        /* Read the optional meta-data. */
+        eof = false;
+        if (accessor->sts->meta_data_size > 0)
+        {
+            nread = BufFileRead(accessor->read_file, meta_data,
+                                accessor->sts->meta_data_size);
+            if (nread == 0)
+                eof = true;
+            else if (nread != accessor->sts->meta_data_size)
+                ereport(ERROR,
+                        (errcode_for_file_access(),
+                         errmsg("could not read from temporary file: %m")));
+        }
+
+        /* Read the size. */
+        if (!eof)
+        {
+            nread = BufFileRead(accessor->read_file, &tuple_size, sizeof(tuple_size));
+            if (nread == 0)
+                eof = true;

Why is it legal to have EOF here, if metadata previously didn't have an
EOF? Perhaps add an error if accessor->sts->meta_data_size != 0?


+        if (eof)
+        {
+            participant->eof = true;
+            if ((accessor->sts->flags & SHARED_TUPLESTORE_SINGLE_PASS) != 0)
+                SharedBufFileDestroy(fileset, accessor->read_partition,
+                                     accessor->read_participant);
+
+            participant->error = false;
+            LWLockRelease(&participant->lock);
+
+            /* Move to next participant's file. */
+            BufFileClose(accessor->read_file);
+            accessor->read_file = NULL;
+            continue;
+        }
+
+        /* Read the tuple. */
+        tuple = (MinimalTuple) palloc(tuple_size);
+        tuple->t_len = tuple_size;

Hm. Constantly re-allocing this doesn't strike me as a good idea (not to
mention that the API doesn't mention this is newly allocated).  Seems
like it'd be a better idea to have a per-accessor buffer where this can
be stored in - increased in size when necessary.


- Andres



pgsql-hackers by date:

Previous
From: Corey Huinker
Date:
Subject: Re: \if, \elseif, \else, \endif (was Re: PSQL commands:\quit_if, \quit_unless)
Next
From: Dmitry Dolgov
Date:
Subject: Re: [PATCH] few fts functions for jsonb