Re: Get memory contexts of an arbitrary backend process - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: Get memory contexts of an arbitrary backend process
Date
Msg-id 20201023.134611.1954518198057657956.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: Get memory contexts of an arbitrary backend process  (torikoshia <torikoshia@oss.nttdata.com>)
Responses Re: Get memory contexts of an arbitrary backend process
List pgsql-hackers
Wait...

> Attachments: 0003-Enabled-pg_get_backend_memory_contexts-to-collect.patch

For a moment I thought that the number is patch number but the
predecessors are 0002-Enabled..collect.patch and 0001-(same
name). It's not mandatory but we usually do as the follows and it's
the way of git.

v1-0001-Enabled...collect.patch
v2-0001-Enabled...collect.patch

The vn is added by -v option for git-format-patch.

At Thu, 22 Oct 2020 21:32:00 +0900, torikoshia <torikoshia@oss.nttdata.com> wrote in
> > > I added a shared hash table consisted of minimal members
> > > mainly for managing whether the file is dumped or not.
> > > Some members like 'loc' seem useful in the future, but I
> > > haven't added them since it's not essential at this point.
> > Yes, that would be good.
> > +        /*
> > +         * Since we allow only one session can request to dump
> > memory context at
> > +         * the same time, check whether the dump files already exist.
> > +         */
> > +        while (stat(dumpfile, &stat_tmp) == 0 || stat(tmpfile,
> > &stat_tmp) == 0)
> > +        {
> > +            pg_usleep(1000000L);
> > +        }
> > If pg_get_backend_memory_contexts() is executed by two or more
> > sessions at the same time, it cannot be run exclusively in this way.
> > Currently it seems to cause a crash when do it so.
> > This is easy to reproduce and can be done as follows.
> > [session-1]
> > BEGIN;
> > LOCK TABKE t1;
> >   [Session-2]
> >   BEGIN;
> >   LOCK TABLE t1; <- waiting
> >     [Session-3]
> >     select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
> >       [Session-4]
> >       select * FROM pg_get_backend_memory_contexts(<pid of
> > session-2>);
> > If you issue commit or abort at session-1, you will get SEGV.
> > Instead of checking for the existence of the file, it might be better
> > to use a hash (mcxtdumpHash) entry with LWLock.
>
> Thanks!
> Added a LWLock and changed the way from checking the file existence
> to finding the hash entry.

> > +        if (proc == NULL)
> > +        {
> > +            ereport(WARNING,
> > +                    (errmsg("PID %d is not a PostgreSQL server
> > process", dst_pid)));
> > +            return (Datum) 1;
> > +        }
> > Shouldn't it clear the hash entry before return?
>
> Yeah. added codes for removing the entry.

+        entry = AddEntryToMcxtdumpHash(dst_pid);
+
+        /* Check whether the target process is PostgreSQL backend process. */
+        /* TODO: Check also whether backend or not. */
+        proc = BackendPidGetProc(dst_pid);
+
+        if (proc == NULL)
+        {
+            ereport(WARNING,
+                    (errmsg("PID %d is not a PostgreSQL server process", dst_pid)));
+
+            LWLockAcquire(McxtDumpHashLock, LW_EXCLUSIVE);
+
+            if (hash_search(mcxtdumpHash, &dst_pid, HASH_REMOVE, NULL) == NULL)
+                elog(WARNING, "hash table corrupted");
+
+            LWLockRelease(McxtDumpHashLock);
+
+            return (Datum) 1;
+        }

Why do you enter a useles entry then remove it immedately?

+        PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) Int32GetDatum(dst_pid));
+        {
+            SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);

"PROCSIG_DUMP_MEMORY" is somewhat misleading. Hwo about
"PROCSIG_DUMP_MEMCXT" or "PROCSIG_DUMP_MEMORY_CONTEXT"?

I thought that the hash table would prevent multiple reqestors from
making a request at once, but the patch doesn't seem to do that.

+            /* Wait until target process finished dumping file. */
+            while (entry->dump_status == MCXTDUMPSTATUS_NOTYET)

This needs LWLock. And this could read the entry after reused by
another backend if the dumper process is gone. That isn't likely to
happen, but theoretically the other backend may set it to
MCXTDUMPSTATUS_NOTYET inbetween two successive check on the member.

+    /*
+     * Make dump file ends with 'D'.
+     * This is checked by the caller when reading the file.
+     */
+    fputc('E', fpout);

Which is right?

+    fputc('E', fpout);
+
+    CHECK_FOR_INTERRUPTS();

This means that the process accepts another request and rewrite the
file even while the first requester is reading it. And, the file can
be removed by the first requestor before the second requestor can read
it.

+    mcxtdumpHash = ShmemInitHash("mcxtdump hash",
+                               SHMEM_MEMCONTEXT_SIZE,
+                               SHMEM_MEMCONTEXT_SIZE,

The space needed for this hash doesn't seem to be secured. The hash is
sized to 64 entries, so pg_get_backend_memory_contexts() may fail with
ERROR "out of shared memory".

The hash is used only to check if the dump file is completed and if
ended with error. If we need only those, an shared byte array with the
length of max_backend is sufficient.

+        PG_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) Int32GetDatum(dst_pid));
+        {
+            SendProcSignal(dst_pid, PROCSIG_DUMP_MEMORY, InvalidBackendId);
+
+            /* Wait until target process finished dumping file. */
+            while (entry->dump_status == MCXTDUMPSTATUS_NOTYET)
+            {
+                CHECK_FOR_INTERRUPTS();
+                pg_usleep(10000L);
+            }
+        }
+        PG_END_ENSURE_ERROR_CLEANUP(McxtReqKill, (Datum) Int32GetDatum(dst_pid));
+
+        if (entry->dump_status == MCXTDUMPSTATUS_ERROR)
+        {
..
+            if (stat(tmpfile, &stat_tmp) == 0)
+                unlink(tmpfile);
+            if (stat(dumpfile, &stat_tmp) == 0)
+                unlink(dumpfile);
...
+            return (Datum) 0;
+        }
+
+        /* Read values from the dumped file and put them on tuplestore. */
+        PutDumpedValuesOnTuplestore(dumpfile, tupstore, tupdesc, dst_pid);

This means that if the function gets sigint before the dumper creates
the file, the dumper can leave a dump file?

> > +        /* Wait until target process finished dumping file. */
> > +        while (!entry->is_dumped)
> > +        {
> > +            CHECK_FOR_INTERRUPTS();
> > +            pg_usleep(10000L);
> > +        }
> > If the target is killed and exit before dumping the memory
> > information, you're in an infinite loop here.
> > So how about making sure that any process that has to stop before
> > doing a memory dump changes the status of the hash (entry->is_dumped)
> > before stopping and the caller detects it?
> > I'm not sure it's best or not, but you might want to use something
> > like the on_shmem_exit callback.
>
> Thanks for your idea!
> Added a callback to change the status of the hash table entry.
>
> Although I think it's necessary to remove this callback when it
> finished
> processing memory dumping, on_shmem_exit() does not seem to have such
> a function.
> I used before_shmem_exit() since it has a corresponding function to
> remove registered callback.
> If it's inappropriate, I'm going to add a function removing the
> registered callback of on_shmem_exit().

This seems to leave a file for the pid.

> > In the current design, if the caller stops processing before reading
> > the dumped file, you will have an orphaned file.
> > It looks like the following.
> > [session-1]
> > BEGIN;
> > LOCK TABKE t1;
> >   [Session-2]
> >   BEGIN;
> >   LOCK TABLE t1; <- waiting
> >     [Session-3]
> >     select * FROM pg_get_backend_memory_contexts(<pid of session-2>);
> > If you cancel or terminate the session-3, then issue commit or abort
> > at session-1, you will get orphaned files in pg_memusage.
> > So if you allow only one session can request to dump file, it could
> > call pg_memusage_reset() before send the signal in this function.
>
> Although I'm going to allow only one session per one target process,
> I'd like to allow running multiple pg_get_backend_memory_contexts()
> which target process is different.
>
> Instead of calling pg_memusage_reset(), I added a callback for
> cleaning up orphaned files and the elements of the hash table
> using before_shmem_exit() through PG_ENSURE_ERROR_CLEANUP() and
> PG_END_ENSURE_ERROR_CLEANUP().
>
> I chose PG_ENSURE_ERROR_CLEANUP() and PG_END_ENSURE_ERROR_CLEANUP()
> here since it can handle not only termination but also cancellation.
>
> Any thoughts?

+/*
+ * pg_memusage_reset
+ *         Remove the memory context dump files.
+ */
+void
+pg_memusage_reset(int pid)

The function name seem to represents somewhat different from what it
does.


I think we might need to step-back to basic design of this feature
since this patch seems to have unhandled corner cases that are
difficult to find.

- Dump file lifecycle or state-transition of the dumper

  Currently the lifecycle of a dump file, or the state-transition of
  the dumper process doesn't seem to be well defined.

  The file is create only by the dumper.
  If the requestor reads the completed file, the reader removes it.

  If the dumper receives a cancel request, the dumper removes it if
  any.

  Of course dumper removes it if it fails to complete the file.

- Better way of signalling between the requestor and the dumper

  I think there's no doubt about request signal.

  About the complete signal, currently the requestor polls on a flag
  in a hash entry.  I'm wondering if we can get rid of polling. The
  problem on doing that is the lack of a means for a dumper to know
  the requestor.  We need to store requestor pid or backend id in the
  shared hash entry.

  By the way, about shared hash entry, it uses about 70kB for only 64
  entries so it seems inefficient than a shared array that has
  MaxBackends entries.  If we used a following array on shared memory,

  struct hoge
  {
    BackendId requestor[MAX_BACKENDS];
    int          status[MAX_BACKENDS];
    LWLock lock;
  };

  This array has the size of 24 * MaxBackends + 16. 24kB for 1000
  backends.  It could be on dsm since this feature is not used
  commonly.


- The way to cancel a request already made. (or management of the dump
  state transition.)

  Cancellation seems to contain some race conditions. But basically
  that could be done by sending a request signal after setting the
  hoge.requestor above to some special value, not needing the third
  type of signal.  The special value should be different from the
  initial state, which signals that the process is accepting a new
  request.

  As the whole, that would looks like the folloing?

------------------------------------------------------------
Successful request.

  Requestor         dumper       state
                    [idle]       initial
  [request] -------------------> requestor pid/backendid
           -signal->
                    [dumping]
           <-signal-[done]
  [read]
  [done]   --------------------> initial

------------------------------------------------------------
On failure, the dumper signals with setting state to initial.
  [request] -------------------> requestor pid/backendid
           -signal->
                    [dumping]
                    [failed]     initial
           <-signal-
     (some other requestor might come meantime.)
  <sees that the requestor is not me>
  [failed]

------------------------------------------------------------
If the requestor wants to cancel the request, it sets the state to
'cancel' then signal.

Requestor         dumper       state
                    [idle]       initial
  [request] -------------------> cancel
                    <if canceled. clean up>
                    [dumping]
                    <if canceled. clean up>
           <-signal-[done]

           -signal-><try to clean up>


Other aspects to cnosider?

regards.

--
Kyotaro Horiguchi
NTT Open Source Software Center







pgsql-hackers by date:

Previous
From: Tom Lane
Date:
Subject: Re: "unix_socket_directories" should be GUC_LIST_INPUT?
Next
From: Tom Lane
Date:
Subject: Re: "unix_socket_directories" should be GUC_LIST_INPUT?