Thread: Add sub-transaction overflow status in pg_stat_activity

Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
If the subtransaction cache is overflowed in some of the transactions
then it will affect all the concurrent queries as they need to access
the SLRU for checking the visibility of each tuple.  But currently
there is no way to identify whether in any backend subtransaction is
overflowed or what is the current active subtransaction count.
Attached patch adds subtransaction count and subtransaction overflow
status in pg_stat_activity.  I have implemented this because of the
recent complain about the same[1]

[1] https://www.postgresql.org/message-id/CAFiTN-t5BkwdHm1bV8ez64guWZJB_Jjhb7arsQsftxEwpYwObg%40mail.gmail.com


-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Add sub-transaction overflow status in pg_stat_activity

From
Zhihong Yu
Date:


On Mon, Dec 6, 2021 at 8:17 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
If the subtransaction cache is overflowed in some of the transactions
then it will affect all the concurrent queries as they need to access
the SLRU for checking the visibility of each tuple.  But currently
there is no way to identify whether in any backend subtransaction is
overflowed or what is the current active subtransaction count.
Attached patch adds subtransaction count and subtransaction overflow
status in pg_stat_activity.  I have implemented this because of the
recent complain about the same[1]

[1] https://www.postgresql.org/message-id/CAFiTN-t5BkwdHm1bV8ez64guWZJB_Jjhb7arsQsftxEwpYwObg%40mail.gmail.com


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Hi,

bq. there is a no way to 

Extra 'a' before no.

+    * Number of active subtransaction in the current session.

subtransaction -> subtransactions

+    * Whether subxid count overflowed in the current session.

It seems 'count' can be dropped from the sentence.

Cheers

Re: Add sub-transaction overflow status in pg_stat_activity

From
Nikolay Samokhvalov
Date:
On Mon, Dec 6, 2021 at 8:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
If the subtransaction cache is overflowed in some of the transactions
then it will affect all the concurrent queries as they need to access
the SLRU for checking the visibility of each tuple.  But currently
there is no way to identify whether in any backend subtransaction is
overflowed or what is the current active subtransaction count.

I think it's a good idea – had the same need when recently researching various issues with subtransactions [1], needed to patch Postgres in benchmarking environments. To be fair, there is a way to understand that the overflowed state is reached for PG 13+ – on standbys, observe reads in Subtrans in pg_stat_slru. But of course, it's an indirect way.

I see that the patch adds two new columns to pg_stat_activity: subxact_count and subxact_overflowed. This should be helpful to have. Additionally, exposing the lastOverflowedXid value would be also good for troubleshooting of subtransaction edge and corner cases – a bug recently fixed in all current versions [2] was really tricky to troubleshoot in production because this value is not visible to DBAs.

Re: Add sub-transaction overflow status in pg_stat_activity

From
Justin Pryzby
Date:
> +      <entry role="catalog_table_entry"><para role="column_definition">
> +       <structfield>subxact_count</structfield> <type>xid</type>
> +      </para>
> +      <para>
> +       The current backend's active subtransactions count.

subtransaction (no s)

> +       Set to true if current backend's subtransaction cache is overflowed.

Say "has overflowed"

> +        if (local_beentry->subxact_count > 0)
> +        {
> +            values[30] = local_beentry->subxact_count;
> +            values[31] = local_beentry->subxact_overflowed;
> +        }
> +        else
> +        {
> +            nulls[30] = true;
> +            nulls[31] = true;
> +        }

Why is the subxact count set to NULL instead of zero ?

You added this to pg_stat_activity, which already has a lot of fields.
We talked a few months ago about not adding more fields that weren't commonly
used.

https://www.postgresql.org/message-id/flat/20210426191811.sp3o77doinphyjhu%40alap3.anarazel.de#d96d0a116f0344301eead2676ea65b2e

Since I think this field is usually not interesting to most users of
pg_stat_activity, maybe this should instead be implemented as a function like
pg_backend_get_subxact_status(pid).

People who want to could use it like:
SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub;

-- 
Justin



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

Thanks for the review I will work on these comments.

> > +      <entry role="catalog_table_entry"><para role="column_definition">
> > +       <structfield>subxact_count</structfield> <type>xid</type>
> > +      </para>
> > +      <para>
> > +       The current backend's active subtransactions count.
>
> subtransaction (no s)
>
> > +       Set to true if current backend's subtransaction cache is overflowed.
>
> Say "has overflowed"
>
> > +             if (local_beentry->subxact_count > 0)
> > +             {
> > +                     values[30] = local_beentry->subxact_count;
> > +                     values[31] = local_beentry->subxact_overflowed;
> > +             }
> > +             else
> > +             {
> > +                     nulls[30] = true;
> > +                     nulls[31] = true;
> > +             }
>
> Why is the subxact count set to NULL instead of zero ?

> You added this to pg_stat_activity, which already has a lot of fields.
> We talked a few months ago about not adding more fields that weren't commonly
> used.
>
https://www.postgresql.org/message-id/flat/20210426191811.sp3o77doinphyjhu%40alap3.anarazel.de#d96d0a116f0344301eead2676ea65b2e
>
> Since I think this field is usually not interesting to most users of
> pg_stat_activity, maybe this should instead be implemented as a function like
> pg_backend_get_subxact_status(pid).
>
> People who want to could use it like:
> SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub;

Yeah, this is a valid point, I will change this accordingly.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Tue, Dec 7, 2021 at 10:29 AM Nikolay Samokhvalov
<samokhvalov@gmail.com> wrote:
>
> On Mon, Dec 6, 2021 at 8:16 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>>
>> If the subtransaction cache is overflowed in some of the transactions
>> then it will affect all the concurrent queries as they need to access
>> the SLRU for checking the visibility of each tuple.  But currently
>> there is no way to identify whether in any backend subtransaction is
>> overflowed or what is the current active subtransaction count.
>
>
> I think it's a good idea – had the same need when recently researching various issues with subtransactions [1],
neededto patch Postgres in benchmarking environments. To be fair, there is a way to understand that the overflowed
stateis reached for PG 13+ – on standbys, observe reads in Subtrans in pg_stat_slru. But of course, it's an indirect
way.

Yeah right.

> I see that the patch adds two new columns to pg_stat_activity: subxact_count and subxact_overflowed. This should be
helpfulto have. Additionally, exposing the lastOverflowedXid value would be also good for troubleshooting of
subtransactionedge and corner cases – a bug recently fixed in all current versions [2] was really tricky to
troubleshootin production because this value is not visible to DBAs. 

Yeah, we can show this too, although we need to take ProcArrayLock in
the shared mode for reading this, but anyway that will be done on
users request so should not be an issue IMHO.

I will post the updated patch soon along with comments given by
Zhihong Yu and Justin.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
"Bossart, Nathan"
Date:
On 12/6/21, 8:19 PM, "Dilip Kumar" <dilipbalaut@gmail.com> wrote:
> If the subtransaction cache is overflowed in some of the transactions
> then it will affect all the concurrent queries as they need to access
> the SLRU for checking the visibility of each tuple.  But currently
> there is no way to identify whether in any backend subtransaction is
> overflowed or what is the current active subtransaction count.
> Attached patch adds subtransaction count and subtransaction overflow
> status in pg_stat_activity.  I have implemented this because of the
> recent complain about the same[1]

I'd like to give a general +1 to this effort.  Thanks for doing this!
I've actually created a function to provide this information in the
past, so I will help review.

Nathan


Re: Add sub-transaction overflow status in pg_stat_activity

From
"Imseih (AWS), Sami"
Date:
I also want to +1 this this effort. Exposing subtransaction usage is very useful.

It would also be extremely beneficial to add both subtransaction usage and overflow counters to pg_stat_database. 

Monitoring tools that capture deltas on pg_stat_database will be able to generate historical analysis and usage trends
ofsubtransactions.
 

On 12/7/21, 5:34 PM, "Bossart, Nathan" <bossartn@amazon.com> wrote:

    On 12/6/21, 8:19 PM, "Dilip Kumar" <dilipbalaut@gmail.com> wrote:
    > If the subtransaction cache is overflowed in some of the transactions
    > then it will affect all the concurrent queries as they need to access
    > the SLRU for checking the visibility of each tuple.  But currently
    > there is no way to identify whether in any backend subtransaction is
    > overflowed or what is the current active subtransaction count.
    > Attached patch adds subtransaction count and subtransaction overflow
    > status in pg_stat_activity.  I have implemented this because of the
    > recent complain about the same[1]

    I'd like to give a general +1 to this effort.  Thanks for doing this!
    I've actually created a function to provide this information in the
    past, so I will help review.

    Nathan



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

>
> You added this to pg_stat_activity, which already has a lot of fields.
> We talked a few months ago about not adding more fields that weren't commonly
> used.
>
https://www.postgresql.org/message-id/flat/20210426191811.sp3o77doinphyjhu%40alap3.anarazel.de#d96d0a116f0344301eead2676ea65b2e
>
> Since I think this field is usually not interesting to most users of
> pg_stat_activity, maybe this should instead be implemented as a function like
> pg_backend_get_subxact_status(pid).
>
> People who want to could use it like:
> SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub;

I have provided two function, one for subtransaction counts and other
whether subtransaction cache is overflowed or not, we can use like
this,  if we think this is better way to do it then we can also add
another function for the lastOverflowedXid

postgres[43994]=# select id, pg_stat_get_backend_pid(id) as pid,
pg_stat_get_backend_subxact_count(id) as nsubxact,
pg_stat_get_backend_subxact_overflow(id) as overflowed from
pg_stat_get_backend_idset() as id;
 id |  pid  | nsubxact | overflowed
----+-------+----------+------------
  1 | 43806 |        0 | f
  2 | 43983 |       64 | t
  3 | 43994 |        0 | f
  4 | 44323 |       22 | f
  5 | 43802 |        0 | f
  6 | 43801 |        0 | f
  7 | 43804 |        0 | f
(7 rows)

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Attachment

Re: Add sub-transaction overflow status in pg_stat_activity

From
"Bossart, Nathan"
Date:
On 12/13/21, 6:30 AM, "Dilip Kumar" <dilipbalaut@gmail.com> wrote:
> On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
>> Since I think this field is usually not interesting to most users of
>> pg_stat_activity, maybe this should instead be implemented as a function like
>> pg_backend_get_subxact_status(pid).
>>
>> People who want to could use it like:
>> SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub;
>
> I have provided two function, one for subtransaction counts and other
> whether subtransaction cache is overflowed or not, we can use like
> this,  if we think this is better way to do it then we can also add
> another function for the lastOverflowedXid

The general approach looks good to me.  I think we could have just one
function for all three values, though.

Nathan


Re: Add sub-transaction overflow status in pg_stat_activity

From
Ashutosh Sharma
Date:
Hi,

I have looked into the v2 patch and here are my comments:

+   PG_RETURN_INT32(local_beentry->subxact_overflowed);
+}

Should this be PG_RETURN_BOOL instead of PG_RETURN_INT32??

--

+{ oid => '6107', descr => 'statistics: cached subtransaction count of backend',
+  proname => 'pg_stat_get_backend_subxact_count', provolatile => 's', proparallel => 'r',
+  prorettype => 'int4', proargtypes => 'int4',
+  prosrc => 'pg_stat_get_backend_subxact_count' },
+{ oid => '6108', descr => 'statistics: subtransaction cache of backend overflowed',
+  proname => 'pg_stat_get_backend_subxact_overflow', provolatile => 's', proparallel => 'r',
+  prorettype => 'bool', proargtypes => 'int4',
+  prosrc => 'pg_stat_get_backend_subxact_overflow' },

The description says that the two new functions show the statistics for "cached subtransaction count of backend" and "subtransaction cache of backend overflowed". But, when these functions are called it shows these stats for the non-backend processes like checkpointer, walwriter etc as well. Should that happen?

--

typedef struct LocalPgBackendStatus
     * not.
     */
    TransactionId backend_xmin;
+
+   /*
+    * Number of cached subtransactions in the current session.
+    */
+   int subxact_count;
+
+   /*
+    * The number of subtransactions in the current session exceeded the cached
+    * subtransaction limit.
+    */
+   bool subxact_overflowed;

All the variables inside this LocalPgBackendStatus structure are prefixed with "backend" word. Can we do the same for the newly added variables as well?

--

+ *     Get the xid and xmin, nsubxid and overflow status of the backend. The

Should this be put as - "xid, xmin, nsubxid and overflow" instead of "xid and xmin, nsubxid and overflow"?

--
With Regards,
Ashutosh Sharma.


On Mon, Dec 13, 2021 at 7:58 PM Dilip Kumar <dilipbalaut@gmail.com> wrote:
On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:

>
> You added this to pg_stat_activity, which already has a lot of fields.
> We talked a few months ago about not adding more fields that weren't commonly
> used.
> https://www.postgresql.org/message-id/flat/20210426191811.sp3o77doinphyjhu%40alap3.anarazel.de#d96d0a116f0344301eead2676ea65b2e
>
> Since I think this field is usually not interesting to most users of
> pg_stat_activity, maybe this should instead be implemented as a function like
> pg_backend_get_subxact_status(pid).
>
> People who want to could use it like:
> SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub;

I have provided two function, one for subtransaction counts and other
whether subtransaction cache is overflowed or not, we can use like
this,  if we think this is better way to do it then we can also add
another function for the lastOverflowedXid

postgres[43994]=# select id, pg_stat_get_backend_pid(id) as pid,
pg_stat_get_backend_subxact_count(id) as nsubxact,
pg_stat_get_backend_subxact_overflow(id) as overflowed from
pg_stat_get_backend_idset() as id;
 id |  pid  | nsubxact | overflowed
----+-------+----------+------------
  1 | 43806 |        0 | f
  2 | 43983 |       64 | t
  3 | 43994 |        0 | f
  4 | 44323 |       22 | f
  5 | 43802 |        0 | f
  6 | 43801 |        0 | f
  7 | 43804 |        0 | f
(7 rows)

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Tue, Dec 14, 2021 at 3:57 AM Bossart, Nathan <bossartn@amazon.com> wrote:
>
> On 12/13/21, 6:30 AM, "Dilip Kumar" <dilipbalaut@gmail.com> wrote:
> > On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> >> Since I think this field is usually not interesting to most users of
> >> pg_stat_activity, maybe this should instead be implemented as a function like
> >> pg_backend_get_subxact_status(pid).
> >>
> >> People who want to could use it like:
> >> SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub;
> >
> > I have provided two function, one for subtransaction counts and other
> > whether subtransaction cache is overflowed or not, we can use like
> > this,  if we think this is better way to do it then we can also add
> > another function for the lastOverflowedXid
>
> The general approach looks good to me.  I think we could have just one
> function for all three values, though.

If we create just one function then the output type will be a tuple
then we might have to add another view on top of that.  Is there any
better way to do that?

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Tue, Dec 14, 2021 at 6:23 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:
>
> Hi,
>
> I have looked into the v2 patch and here are my comments:
>
> +   PG_RETURN_INT32(local_beentry->subxact_overflowed);
> +}
>
> Should this be PG_RETURN_BOOL instead of PG_RETURN_INT32??
>
> --
>
> +{ oid => '6107', descr => 'statistics: cached subtransaction count of backend',
> +  proname => 'pg_stat_get_backend_subxact_count', provolatile => 's', proparallel => 'r',
> +  prorettype => 'int4', proargtypes => 'int4',
> +  prosrc => 'pg_stat_get_backend_subxact_count' },
> +{ oid => '6108', descr => 'statistics: subtransaction cache of backend overflowed',
> +  proname => 'pg_stat_get_backend_subxact_overflow', provolatile => 's', proparallel => 'r',
> +  prorettype => 'bool', proargtypes => 'int4',
> +  prosrc => 'pg_stat_get_backend_subxact_overflow' },
>
> The description says that the two new functions show the statistics for "cached subtransaction count of backend" and
"subtransactioncache of backend overflowed". But, when these functions are called it shows these stats for the
non-backendprocesses like checkpointer, walwriter etc as well. Should that happen? 
>
> --
>
> typedef struct LocalPgBackendStatus
>      * not.
>      */
>     TransactionId backend_xmin;
> +
> +   /*
> +    * Number of cached subtransactions in the current session.
> +    */
> +   int subxact_count;
> +
> +   /*
> +    * The number of subtransactions in the current session exceeded the cached
> +    * subtransaction limit.
> +    */
> +   bool subxact_overflowed;
>
> All the variables inside this LocalPgBackendStatus structure are prefixed with "backend" word. Can we do the same for
thenewly added variables as well? 
>
> --
>
> + *     Get the xid and xmin, nsubxid and overflow status of the backend. The
>
> Should this be put as - "xid, xmin, nsubxid and overflow" instead of "xid and xmin, nsubxid and overflow"?

Thanks, Ashutosh, I will work on your comments and post an updated
version by next week.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Fri, Dec 17, 2021 at 9:32 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
On Fri, Dec 17, 2021 at 09:00:04AM +0530, Dilip Kumar wrote:
> On Tue, Dec 14, 2021 at 3:57 AM Bossart, Nathan <bossartn@amazon.com> wrote:
> >
> > On 12/13/21, 6:30 AM, "Dilip Kumar" <dilipbalaut@gmail.com> wrote:
> > > On Tue, Dec 7, 2021 at 11:11 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > >> Since I think this field is usually not interesting to most users of
> > >> pg_stat_activity, maybe this should instead be implemented as a function like
> > >> pg_backend_get_subxact_status(pid).
> > >>
> > >> People who want to could use it like:
> > >> SELECT * FROM pg_stat_activity psa, pg_backend_get_subxact_status(pid) sub;
> > >
> > > I have provided two function, one for subtransaction counts and other
> > > whether subtransaction cache is overflowed or not, we can use like
> > > this,  if we think this is better way to do it then we can also add
> > > another function for the lastOverflowedXid
> >
> > The general approach looks good to me.  I think we could have just one
> > function for all three values, though.
>
> If we create just one function then the output type will be a tuple
> then we might have to add another view on top of that.  Is there any
> better way to do that?

I don't think you'd need to add a view on top of it.

Compare:

postgres=# SELECT 1, pg_config() LIMIT 1;
 ?column? |         pg_config         
----------+----------------------------
        1 | (BINDIR,/usr/pgsql-14/bin)

postgres=# SELECT 1, c FROM pg_config() c LIMIT 1;
 ?column? |             c             
----------+----------------------------
        1 | (BINDIR,/usr/pgsql-14/bin)

postgres=# SELECT 1, c.* FROM pg_config() c LIMIT 1;
 ?column? |  name  |      setting     
----------+--------+-------------------
        1 | BINDIR | /usr/pgsql-14/bin

Okay, that makes sense, I have modified it to make a single function.



--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Tue, Dec 14, 2021 at 6:23 PM Ashutosh Sharma <ashu.coek88@gmail.com> wrote:

In the latest patch I have fixed comments given here except a few.

I have looked into the v2 patch and here are my comments:

+   PG_RETURN_INT32(local_beentry->subxact_overflowed);
+}

Should this be PG_RETURN_BOOL instead of PG_RETURN_INT32??

With the new patch this is not relevant because we are returning a tuple.

+{ oid => '6107', descr => 'statistics: cached subtransaction count of backend',
+  proname => 'pg_stat_get_backend_subxact_count', provolatile => 's', proparallel => 'r',
+  prorettype => 'int4', proargtypes => 'int4',
+  prosrc => 'pg_stat_get_backend_subxact_count' },
+{ oid => '6108', descr => 'statistics: subtransaction cache of backend overflowed',
+  proname => 'pg_stat_get_backend_subxact_overflow', provolatile => 's', proparallel => 'r',
+  prorettype => 'bool', proargtypes => 'int4',
+  prosrc => 'pg_stat_get_backend_subxact_overflow' },

The description says that the two new functions show the statistics for "cached subtransaction count of backend" and "subtransaction cache of backend overflowed". But, when these functions are called it shows these stats for the non-backend processes like checkpointer, walwriter etc as well. Should that happen?

I am following similar description as pg_stat_get_backend_pid, pg_stat_get_backend_idset and other relevant functioins.

typedef struct LocalPgBackendStatus
     * not.
     */
    TransactionId backend_xmin;
+
+   /*
+    * Number of cached subtransactions in the current session.
+    */
+   int subxact_count;
+
+   /*
+    * The number of subtransactions in the current session exceeded the cached
+    * subtransaction limit.
+    */
+   bool subxact_overflowed;

All the variables inside this LocalPgBackendStatus structure are prefixed with "backend" word. Can we do the same for the newly added variables as well?

Done

+ *     Get the xid and xmin, nsubxid and overflow status of the backend. The

Should this be put as - "xid, xmin, nsubxid and overflow" instead of "xid and xmin, nsubxid and overflow"?

I missed to fix this one in the last patch so updated again.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com
Attachment

Re: Add sub-transaction overflow status in pg_stat_activity

From
"Bossart, Nathan"
Date:
Thanks for the new patch!

+       <para>
+        Returns a record of information about the backend's subtransactions.
+        The fields returned are <parameter>subxact_count</parameter> identifies
+        number of active subtransaction and <parameter>subxact_overflow
+        </parameter> shows whether the backend's subtransaction cache is
+        overflowed or not.
+       </para></entry>
+       </para></entry>

nitpick: There is an extra "</para></entry>" here.

Would it be more accurate to say that subxact_count is the number of
subxids that are cached?  It can only ever go up to
PGPROC_MAX_CACHED_SUBXIDS.

Nathan


Re: Add sub-transaction overflow status in pg_stat_activity

From
Julien Rouhaud
Date:
On Thu, Jan 13, 2022 at 10:27:31PM +0000, Bossart, Nathan wrote:
> Thanks for the new patch!
> 
> +       <para>
> +        Returns a record of information about the backend's subtransactions.
> +        The fields returned are <parameter>subxact_count</parameter> identifies
> +        number of active subtransaction and <parameter>subxact_overflow
> +        </parameter> shows whether the backend's subtransaction cache is
> +        overflowed or not.
> +       </para></entry>
> +       </para></entry>
> 
> nitpick: There is an extra "</para></entry>" here.

Also the sentence looks a bit weird.  I think something like that would be
better:

> +        Returns a record of information about the backend's subtransactions.
> +        The fields returned are <parameter>subxact_count</parameter>, which
> +        identifies the number of active subtransaction and <parameter>subxact_overflow
> +        </parameter>, which shows whether the backend's subtransaction cache is
> +        overflowed or not.

While on the sub-transaction overflow topic, I'm wondering if we should also
raise a warning (maybe optionally) immediately when a backend overflows (so in
GetNewTransactionId()).

Like many I previously had to investigate a slowdown due to sub-transaction
overflow, and even with the information available in a monitoring view (I had
to rely on a quick hackish extension as I couldn't patch postgres) it's not
terribly fun to do this way.  On top of that log analyzers like pgBadger could
help to highlight such a problem.

I don't want to derail this thread so let me know if I should start a distinct
discussion for that.



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Fri, Jan 14, 2022 at 1:17 PM Julien Rouhaud <rjuju123@gmail.com> wrote:
On Thu, Jan 13, 2022 at 10:27:31PM +0000, Bossart, Nathan wrote:
> Thanks for the new patch!
>
> +       <para>
> +        Returns a record of information about the backend's subtransactions.
> +        The fields returned are <parameter>subxact_count</parameter> identifies
> +        number of active subtransaction and <parameter>subxact_overflow
> +        </parameter> shows whether the backend's subtransaction cache is
> +        overflowed or not.
> +       </para></entry>
> +       </para></entry>
>
> nitpick: There is an extra "</para></entry>" here.

Also the sentence looks a bit weird.  I think something like that would be
better:

> +        Returns a record of information about the backend's subtransactions.
> +        The fields returned are <parameter>subxact_count</parameter>, which
> +        identifies the number of active subtransaction and <parameter>subxact_overflow
> +        </parameter>, which shows whether the backend's subtransaction cache is
> +        overflowed or not.
 
Thanks for looking into this, I will work on this next week.
 
While on the sub-transaction overflow topic, I'm wondering if we should also
raise a warning (maybe optionally) immediately when a backend overflows (so in
GetNewTransactionId()).

Like many I previously had to investigate a slowdown due to sub-transaction
overflow, and even with the information available in a monitoring view (I had
to rely on a quick hackish extension as I couldn't patch postgres) it's not
terribly fun to do this way.  On top of that log analyzers like pgBadger could
help to highlight such a problem.

I don't want to derail this thread so let me know if I should start a distinct
discussion for that.

Yeah that seems like a good idea.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Add sub-transaction overflow status in pg_stat_activity

From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> Like many I previously had to investigate a slowdown due to sub-transaction
> overflow, and even with the information available in a monitoring view (I had
> to rely on a quick hackish extension as I couldn't patch postgres) it's not
> terribly fun to do this way.  On top of that log analyzers like pgBadger could
> help to highlight such a problem.

It feels to me like far too much effort is being invested in fundamentally
the wrong direction here.  If the subxact overflow business is causing
real-world performance problems, let's find a way to fix that, not put
effort into monitoring tools that do little to actually alleviate anyone's
pain.

            regards, tom lane



Re: Add sub-transaction overflow status in pg_stat_activity

From
"Bossart, Nathan"
Date:
On 1/14/22, 8:26 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
>> Like many I previously had to investigate a slowdown due to sub-transaction
>> overflow, and even with the information available in a monitoring view (I had
>> to rely on a quick hackish extension as I couldn't patch postgres) it's not
>> terribly fun to do this way.  On top of that log analyzers like pgBadger could
>> help to highlight such a problem.
>
> It feels to me like far too much effort is being invested in fundamentally
> the wrong direction here.  If the subxact overflow business is causing
> real-world performance problems, let's find a way to fix that, not put
> effort into monitoring tools that do little to actually alleviate anyone's
> pain.

+1

An easy first step might be to increase PGPROC_MAX_CACHED_SUBXIDS and
NUM_SUBTRANS_BUFFERS.  This wouldn't be a long-term solution to all
such performance problems, and we'd still probably want the proposed
monitoring tools, but maybe it'd kick the can down the road a bit.
Perhaps another improvement could be to store the topmost transaction
along with the parent transaction in the subtransaction log to avoid
the loop in SubTransGetTopmostTransaction().

Nathan


Re: Add sub-transaction overflow status in pg_stat_activity

From
Julien Rouhaud
Date:
On Fri, Jan 14, 2022 at 07:42:29PM +0000, Bossart, Nathan wrote:
> On 1/14/22, 8:26 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
> >
> > It feels to me like far too much effort is being invested in fundamentally
> > the wrong direction here.  If the subxact overflow business is causing
> > real-world performance problems, let's find a way to fix that, not put
> > effort into monitoring tools that do little to actually alleviate anyone's
> > pain.
> 
> +1

Agreed, it would be better but if that leads to significant work that doesn't
land in pg15, it would be nice to at least get more monitoring possibilities
in pg15 to help locate problems in application.

> An easy first step might be to increase PGPROC_MAX_CACHED_SUBXIDS and
> NUM_SUBTRANS_BUFFERS.

There's already something proposed for slru sizing:
https://commitfest.postgresql.org/36/2627/.  Unfortunately it hasn't been
committed yet despite some popularity.  I also don't know how much it improves
workloads that hit the overflow issue.

> This wouldn't be a long-term solution to all
> such performance problems, and we'd still probably want the proposed
> monitoring tools, but maybe it'd kick the can down the road a bit.

Yeah simply increasing PGPROC_MAX_CACHED_SUBXIDS won't really solve the
problem.  Also the xid cache is already ~30% of the PGPROC size, increasing it
any further is likely to end up being a loss for everyone that doesn't heavily
rely on needing more than 64 subtransactions.

There's also something proposed at
https://www.postgresql.org/message-id/003201d79d7b$189141f0$49b3c5d0$@tju.edu.cn,
which seems to reach some nice improvement without a major redesign of the
subtransaction system, but I now realize that apparently no one added it to the
commitfest :(



Re: Add sub-transaction overflow status in pg_stat_activity

From
Tom Lane
Date:
Julien Rouhaud <rjuju123@gmail.com> writes:
> On Fri, Jan 14, 2022 at 07:42:29PM +0000, Bossart, Nathan wrote:
>> On 1/14/22, 8:26 AM, "Tom Lane" <tgl@sss.pgh.pa.us> wrote:
>>> It feels to me like far too much effort is being invested in fundamentally
>>> the wrong direction here.

> Agreed, it would be better but if that leads to significant work that doesn't
> land in pg15, it would be nice to at least get more monitoring possibilities
> in pg15 to help locate problems in application.

The discussion just upthread was envisioning not only that we'd add
infrastructure for this, but then that other projects would build
more infrastructure on top of that.  That's an awful lot of work
that will become useless --- indeed maybe counterproductive --- once
we find an actual fix.  I say "counterproductive" because I wonder
what compatibility problems we'd have if the eventual fix results in
fundamental changes in the way things work in this area.

Since it's worked the same way for a lot of years, I'm not impressed
by arguments that we need to push something into v15.

>> An easy first step might be to increase PGPROC_MAX_CACHED_SUBXIDS and
>> NUM_SUBTRANS_BUFFERS.

I don't think that's an avenue to a fix.  We need some more-fundamental
rethinking about how this should work.  (No, I don't have any ideas
at the moment.)

> There's also something proposed at
> https://www.postgresql.org/message-id/003201d79d7b$189141f0$49b3c5d0$@tju.edu.cn,
> which seems to reach some nice improvement without a major redesign of the
> subtransaction system, but I now realize that apparently no one added it to the
> commitfest :(

Hmm ... that could win if we're looking up the same subtransaction's
parent over and over, but I wonder if it wouldn't degrade a lot of
workloads too.

            regards, tom lane



Re: Add sub-transaction overflow status in pg_stat_activity

From
Julien Rouhaud
Date:
On Sat, Jan 15, 2022 at 12:13:39AM -0500, Tom Lane wrote:
> 
> The discussion just upthread was envisioning not only that we'd add
> infrastructure for this, but then that other projects would build
> more infrastructure on top of that.  That's an awful lot of work
> that will become useless --- indeed maybe counterproductive --- once
> we find an actual fix.  I say "counterproductive" because I wonder
> what compatibility problems we'd have if the eventual fix results in
> fundamental changes in the way things work in this area.

I'm not sure what you're referring to.  If that's the hackish extension I
mentioned, its goal was to provide exactly what this thread is about so I
wasn't advocating for additional tooling.  If that's about pgBagder, no extra
work would be needed: there's already a report about any WARNING/ERROR and such
found in the logs so the information would be immediately visible.

> Since it's worked the same way for a lot of years, I'm not impressed
> by arguments that we need to push something into v15.

Well, people have also been struggling with it for a lot of years, even if they
don't always come here to complain about it.  And apparently at least 2 people
already had to code something similar to be able to find the problematic
transactions, so I still think that at least some monitoring improvement would
be welcome in v15 if none of the other approach get committed.



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Fri, Jan 14, 2022 at 9:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Julien Rouhaud <rjuju123@gmail.com> writes:
> Like many I previously had to investigate a slowdown due to sub-transaction
> overflow, and even with the information available in a monitoring view (I had
> to rely on a quick hackish extension as I couldn't patch postgres) it's not
> terribly fun to do this way.  On top of that log analyzers like pgBadger could
> help to highlight such a problem.

It feels to me like far too much effort is being invested in fundamentally
the wrong direction here.  If the subxact overflow business is causing
real-world performance problems, let's find a way to fix that, not put
effort into monitoring tools that do little to actually alleviate anyone's
pain.

I don't think it is really a big effort or big change.  But I completely agree with you that if we can completely resolve this issue then there is no point in providing any such status or LOG.

--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com

Re: Add sub-transaction overflow status in pg_stat_activity

From
Andres Freund
Date:
On 2022-01-14 11:25:45 -0500, Tom Lane wrote:
> Julien Rouhaud <rjuju123@gmail.com> writes:
> > Like many I previously had to investigate a slowdown due to sub-transaction
> > overflow, and even with the information available in a monitoring view (I had
> > to rely on a quick hackish extension as I couldn't patch postgres) it's not
> > terribly fun to do this way.  On top of that log analyzers like pgBadger could
> > help to highlight such a problem.
> 
> It feels to me like far too much effort is being invested in fundamentally
> the wrong direction here.  If the subxact overflow business is causing
> real-world performance problems, let's find a way to fix that, not put
> effort into monitoring tools that do little to actually alleviate anyone's
> pain.

There seems to be some agreement on this (I certainly do agree). Thus it seems
we should mark the CF entry as rejected?

It's been failing on cfbot for weeks... https://cirrus-ci.com/task/5289336424890368?logs=docs_build#L347



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Tue, Mar 22, 2022 at 5:15 AM Andres Freund <andres@anarazel.de> wrote:

> There seems to be some agreement on this (I certainly do agree). Thus it seems
> we should mark the CF entry as rejected?
>
> It's been failing on cfbot for weeks... https://cirrus-ci.com/task/5289336424890368?logs=docs_build#L347

I have marked it rejected.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Mon, Mar 21, 2022 at 7:45 PM Andres Freund <andres@anarazel.de> wrote:
> > It feels to me like far too much effort is being invested in fundamentally
> > the wrong direction here.  If the subxact overflow business is causing
> > real-world performance problems, let's find a way to fix that, not put
> > effort into monitoring tools that do little to actually alleviate anyone's
> > pain.
>
> There seems to be some agreement on this (I certainly do agree). Thus it seems
> we should mark the CF entry as rejected?

I don't think I agree with this outcome, for two reasons.

First, we're just talking about an extra couple of columns in
pg_stat_activity here, which does not seem like a heavy price to pay.
I'm not even sure we need two columns; I think we could get down to
one pretty easily. Rough idea: number of cached subtransaction XIDs if
not overflowed, else NULL. Or if that's likely to create 0/NULL
confusion, then maybe just a Boolean, overflowed or not.

Second, the problem seems pretty fundamental to me. Shared memory is
fixed size, so we cannot use it to store an unbounded number of
subtransaction IDs. We could perhaps rejigger things to be more
memory-efficient in some way, but no matter how many subtransaction
XIDs you can keep in shared memory, the user can always consume that
number plus one --  unless you allow for ~2^31 in shared memory, which
seems unrealistic. To me, that means that overflowed snapshots are not
going away. We could make them less painful by rewriting the SLRU
stuff to be more efficient, and I bet that's possible, but I think
it's probably hard, or someone would have gotten it done by now. This
has been sucking for a long time and I see no evidence that progress
is imminent. Even if it happens, it is unlikely that it will be a full
solution. If it were possible to make SLRU lookups fast enough not to
matter, we wouldn't need to have hint bits, but in reality we do have
them and attempts to get rid of them have not gone well up until now,
and in my opinion probably never will.

The way that I view this problem is that it is relatively rare but
hard for some users to troubleshoot. I think I've seen it come up
multiple times, and judging from the earlier responses on this thread,
several other people here have, too. In my experience, the problem is
inevitably that someone has a DML statement inside a plpgsql EXCEPTION
block inside a plpgsql loop. Concurrently with that, they are running
a lot of queries that look at recently modified data, so that the
overflowed snapshot trigger SLRU lookups often enough to matter. How
is a user supposed to identify which backend is causing the problem,
as things stand today? I have generally given people the advice to go
find the DML inside of a plpgsql EXCEPTION block inside of a loop, but
some users have trouble doing that. The DBA who is observing the
performance problem is not necessarily the developer who wrote all of
the PL code, and the PL code may be large and badly formatted and
there could be a bunch of EXCEPTION blocks and it might not be clear
which one is the problem. The exception block could be calling another
function or procedure that does the actual DML rather than doing it
directly, and the loop surrounding it might not be in the same
function or procedure but in some other one that calls it, or it could
be called repeatedly from the SQL level.

I think I fundamentally disagree with the idea that we should refuse
to expose instrumentation data because some day the internals might
change. If we accepted that argument categorically, we wouldn't have
things like backend_xmin or backend_xid in pg_stat_activity, or wait
events either, but we do have those things and users find them useful.
They suck in the sense that you need to know quite a bit about how the
internals work in order to use them to find problems, but people who
want to support production PostgreSQL instances have to learn about
how those internals work one way or the other because they
demonstrably matter. It is absolutely stellar when we can say "hey, we
don't need to have a way for users to see what's going on here
internally because they don't ever need to care," but once it is
established that they do need to care, we should let them see directly
the data they need to care about rather than forcing them to
troubleshoot the problem in some more roundabout way like auditing all
of the code and guessing which part is the problem, or writing custom
dtrace scripts to run on their production instances.

If and when it happens that a field like backend_xmin or the new ones
proposed here are no longer relevant, we can just remove them from the
monitoring views. Yeah, that's a backward compatibility break, and
there's some pain associated with that. But we have demonstrated that
we are perfectly willing to incur the pain associated with adding new
columns when there is new and valuable information to display, and
that is equally a compatibility break, in the sense that it has about
the same chance of making pg_upgrade fail.

In short, I think this is a good idea, and if somebody thinks that we
should solve the underlying problem instead, I'd like to hear what
people think a realistic solution might be. Because to me, it looks
like we're refusing to commit a patch that probably took an hour to
write because with 10 years of engineering effort we could *maybe* fix
the root cause.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> In short, I think this is a good idea, and if somebody thinks that we
> should solve the underlying problem instead, I'd like to hear what
> people think a realistic solution might be. Because to me, it looks
> like we're refusing to commit a patch that probably took an hour to
> write because with 10 years of engineering effort we could *maybe* fix
> the root cause.

Maybe the original patch took an hour to write, but it's sure been
bikeshedded to death :-(.  I was complaining about the total amount
of attention spent more than the patch itself.

The patch of record seems to be v4 from 2022-01-13, which was failing
in cfbot at last report but presumably could be fixed easily.  The
proposed documentation's grammar is pretty shaky, but I don't see
much else wrong in a quick eyeball scan.

            regards, tom lane



Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Mon, Nov 14, 2022 at 10:41 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Maybe the original patch took an hour to write, but it's sure been
> bikeshedded to death :-(.  I was complaining about the total amount
> of attention spent more than the patch itself.

Unfortunately, that problem is not unique to this patch, and even more
unfortunately, despite all the bikeshedding, we still often get it
wrong. Catching up from my week off I see that you've fixed not one
but two bugs in a patch I thought I'd reviewed half to death. :-(

> The patch of record seems to be v4 from 2022-01-13, which was failing
> in cfbot at last report but presumably could be fixed easily.  The
> proposed documentation's grammar is pretty shaky, but I don't see
> much else wrong in a quick eyeball scan.

I can take a crack at improving the documentation. Do you have a view
on the best way to cut this down to a single new column, or the
desirability of doing so?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Justin Pryzby
Date:
On Mon, Nov 14, 2022 at 10:09:57AM -0500, Robert Haas wrote:
> On Mon, Mar 21, 2022 at 7:45 PM Andres Freund <andres@anarazel.de> wrote:
> > > It feels to me like far too much effort is being invested in fundamentally
> > > the wrong direction here.  If the subxact overflow business is causing
> > > real-world performance problems, let's find a way to fix that, not put
> > > effort into monitoring tools that do little to actually alleviate anyone's
> > > pain.
> >
> > There seems to be some agreement on this (I certainly do agree). Thus it seems
> > we should mark the CF entry as rejected?
> 
> I don't think I agree with this outcome, for two reasons.
> 
> First, we're just talking about an extra couple of columns in
> pg_stat_activity here, which does not seem like a heavy price to pay.

The most recent patch adds a separate function rather than adding more
columns to pg_stat_activity.  I think the complaint about making that
view wider for infrequently-used columns is entirely valid.

> If and when it happens that a field like backend_xmin or the new ones
> proposed here are no longer relevant, we can just remove them from the
> monitoring views. Yeah, that's a backward compatibility break, and
> there's some pain associated with that. But we have demonstrated that
> we are perfectly willing to incur the pain associated with adding new
> columns when there is new and valuable information to display, and
> that is equally a compatibility break, in the sense that it has about
> the same chance of making pg_upgrade fail.

Why would pg_upgrade fail due to new/removed columns in
pg_stat_activity?  Do you mean if a user creates a view on top of it?

-- 
Justin



Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Mon, Nov 14, 2022 at 10:57 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > First, we're just talking about an extra couple of columns in
> > pg_stat_activity here, which does not seem like a heavy price to pay.
>
> The most recent patch adds a separate function rather than adding more
> columns to pg_stat_activity.  I think the complaint about making that
> view wider for infrequently-used columns is entirely valid.

I guess that's OK. I don't particularly favor that approach here but I
can live with it. I agree that too-wide views are annoying, but as far
as pg_stat_activity goes, that ship has pretty much sailed already,
and the same is true for a lot of other views. Inventing a one-off
solution for this particular case doesn't seem particularly warranted
to me but, again, I can live with it.

> Why would pg_upgrade fail due to new/removed columns in
> pg_stat_activity?  Do you mean if a user creates a view on top of it?

Yes, that is a thing that some people do, and I think it is the most
likely way for any changes to the view definition to cause
compatibility problems. I could be wrong, though.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
"David G. Johnston"
Date:
On Mon, Nov 14, 2022 at 9:04 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 14, 2022 at 10:57 AM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > First, we're just talking about an extra couple of columns in
> > pg_stat_activity here, which does not seem like a heavy price to pay.
>
> The most recent patch adds a separate function rather than adding more
> columns to pg_stat_activity.  I think the complaint about making that
> view wider for infrequently-used columns is entirely valid.

I guess that's OK. I don't particularly favor that approach here but I
can live with it. I agree that too-wide views are annoying, but as far
as pg_stat_activity goes, that ship has pretty much sailed already,
and the same is true for a lot of other views. Inventing a one-off
solution for this particular case doesn't seem particularly warranted
to me but, again, I can live with it.


I can see putting counts that people would want to use for statistics elsewhere but IIUC the whole purpose of "overflowed" is to inform someone that their session presently has degraded performance because it has created too many subtransactions.  Just because the "degraded" condition itself is rare doesn't mean the field "is my session degraded" is going to be seldom consulted.  In fact, I would rather think it is always briefly consulted to confirm it has the expected value of "false" (blank, IMO, don't show anything in that column unless it is exceptional) and the presence of a value there would draw attention to the desired fact that something is wrong and warrants further investigation.  The pg_stat_activity view seems like the perfect place to at least display that exception flag.

David J.

Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Mon, Nov 14, 2022 at 11:18 AM David G. Johnston
<david.g.johnston@gmail.com> wrote:
>> I guess that's OK. I don't particularly favor that approach here but I
>> can live with it. I agree that too-wide views are annoying, but as far
>> as pg_stat_activity goes, that ship has pretty much sailed already,
>> and the same is true for a lot of other views. Inventing a one-off
>> solution for this particular case doesn't seem particularly warranted
>> to me but, again, I can live with it.
>
> I can see putting counts that people would want to use for statistics elsewhere but IIUC the whole purpose of
"overflowed"is to inform someone that their session presently has degraded performance because it has created too many
subtransactions. Just because the "degraded" condition itself is rare doesn't mean the field "is my session degraded"
isgoing to be seldom consulted.  In fact, I would rather think it is always briefly consulted to confirm it has the
expectedvalue of "false" (blank, IMO, don't show anything in that column unless it is exceptional) and the presence of
avalue there would draw attention to the desired fact that something is wrong and warrants further investigation.  The
pg_stat_activityview seems like the perfect place to at least display that exception flag. 

OK, thanks for voting. I take that as +1 for putting it in
pg_stat_activity proper, which is also my preferred approach.

However, a slight correction: it doesn't inform you that your session
has degraded performance. It informs you that your session may be
degrading everyone else's performance.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Amit Singh
Date:
Making the information available in pg_stat_activity makes it a lot easier to identify the pid which has caused the subtran overflow. Debugging through the app code can be an endless exercise and logging every statement in postgresql logs is not practical either. If the overhead of fetching the information isn't too big, I think we should consider the subtransaction_count and is_overflowed field as potential candidates for the enhancement of pg_stat_activity.


Regards
Amit Singh

On Mon, Nov 14, 2022 at 11:10 PM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Mar 21, 2022 at 7:45 PM Andres Freund <andres@anarazel.de> wrote:
> > It feels to me like far too much effort is being invested in fundamentally
> > the wrong direction here.  If the subxact overflow business is causing
> > real-world performance problems, let's find a way to fix that, not put
> > effort into monitoring tools that do little to actually alleviate anyone's
> > pain.
>
> There seems to be some agreement on this (I certainly do agree). Thus it seems
> we should mark the CF entry as rejected?

I don't think I agree with this outcome, for two reasons.

First, we're just talking about an extra couple of columns in
pg_stat_activity here, which does not seem like a heavy price to pay.
I'm not even sure we need two columns; I think we could get down to
one pretty easily. Rough idea: number of cached subtransaction XIDs if
not overflowed, else NULL. Or if that's likely to create 0/NULL
confusion, then maybe just a Boolean, overflowed or not.

Second, the problem seems pretty fundamental to me. Shared memory is
fixed size, so we cannot use it to store an unbounded number of
subtransaction IDs. We could perhaps rejigger things to be more
memory-efficient in some way, but no matter how many subtransaction
XIDs you can keep in shared memory, the user can always consume that
number plus one --  unless you allow for ~2^31 in shared memory, which
seems unrealistic. To me, that means that overflowed snapshots are not
going away. We could make them less painful by rewriting the SLRU
stuff to be more efficient, and I bet that's possible, but I think
it's probably hard, or someone would have gotten it done by now. This
has been sucking for a long time and I see no evidence that progress
is imminent. Even if it happens, it is unlikely that it will be a full
solution. If it were possible to make SLRU lookups fast enough not to
matter, we wouldn't need to have hint bits, but in reality we do have
them and attempts to get rid of them have not gone well up until now,
and in my opinion probably never will.

The way that I view this problem is that it is relatively rare but
hard for some users to troubleshoot. I think I've seen it come up
multiple times, and judging from the earlier responses on this thread,
several other people here have, too. In my experience, the problem is
inevitably that someone has a DML statement inside a plpgsql EXCEPTION
block inside a plpgsql loop. Concurrently with that, they are running
a lot of queries that look at recently modified data, so that the
overflowed snapshot trigger SLRU lookups often enough to matter. How
is a user supposed to identify which backend is causing the problem,
as things stand today? I have generally given people the advice to go
find the DML inside of a plpgsql EXCEPTION block inside of a loop, but
some users have trouble doing that. The DBA who is observing the
performance problem is not necessarily the developer who wrote all of
the PL code, and the PL code may be large and badly formatted and
there could be a bunch of EXCEPTION blocks and it might not be clear
which one is the problem. The exception block could be calling another
function or procedure that does the actual DML rather than doing it
directly, and the loop surrounding it might not be in the same
function or procedure but in some other one that calls it, or it could
be called repeatedly from the SQL level.

I think I fundamentally disagree with the idea that we should refuse
to expose instrumentation data because some day the internals might
change. If we accepted that argument categorically, we wouldn't have
things like backend_xmin or backend_xid in pg_stat_activity, or wait
events either, but we do have those things and users find them useful.
They suck in the sense that you need to know quite a bit about how the
internals work in order to use them to find problems, but people who
want to support production PostgreSQL instances have to learn about
how those internals work one way or the other because they
demonstrably matter. It is absolutely stellar when we can say "hey, we
don't need to have a way for users to see what's going on here
internally because they don't ever need to care," but once it is
established that they do need to care, we should let them see directly
the data they need to care about rather than forcing them to
troubleshoot the problem in some more roundabout way like auditing all
of the code and guessing which part is the problem, or writing custom
dtrace scripts to run on their production instances.

If and when it happens that a field like backend_xmin or the new ones
proposed here are no longer relevant, we can just remove them from the
monitoring views. Yeah, that's a backward compatibility break, and
there's some pain associated with that. But we have demonstrated that
we are perfectly willing to incur the pain associated with adding new
columns when there is new and valuable information to display, and
that is equally a compatibility break, in the sense that it has about
the same chance of making pg_upgrade fail.

In short, I think this is a good idea, and if somebody thinks that we
should solve the underlying problem instead, I'd like to hear what
people think a realistic solution might be. Because to me, it looks
like we're refusing to commit a patch that probably took an hour to
write because with 10 years of engineering effort we could *maybe* fix
the root cause.
 
--
Robert Haas
EDB: http://www.enterprisedb.com


Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Mon, Nov 14, 2022 at 11:35 AM Amit Singh <amitksingh.mumbai@gmail.com> wrote:
> Making the information available in pg_stat_activity makes it a lot easier to identify the pid which has caused the
subtranoverflow. Debugging through the app code can be an endless exercise and logging every statement in postgresql
logsis not practical either. If the overhead of fetching the information isn't too big, I think we should consider the
subtransaction_countand is_overflowed field as potential candidates for the enhancement of pg_stat_activity. 

The overhead of fetching the information is not large, but Justin is
concerned about the effect on the display width. I feel that's kind of
a lost cause because it's so wide already anyway, but I don't see a
reason why we need *two* new columns. Can't we get by with just one?
It could be overflowed true/false, or it could be the number of
subtransaction XIDs but with NULL instead if overflowed.

Do you have a view on this point?

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
"David G. Johnston"
Date:
On Mon, Nov 14, 2022 at 9:41 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 14, 2022 at 11:35 AM Amit Singh <amitksingh.mumbai@gmail.com> wrote:
> Making the information available in pg_stat_activity makes it a lot easier to identify the pid which has caused the subtran overflow. Debugging through the app code can be an endless exercise and logging every statement in postgresql logs is not practical either. If the overhead of fetching the information isn't too big, I think we should consider the subtransaction_count and is_overflowed field as potential candidates for the enhancement of pg_stat_activity.

The overhead of fetching the information is not large, but Justin is
concerned about the effect on the display width. I feel that's kind of
a lost cause because it's so wide already anyway, but I don't see a
reason why we need *two* new columns. Can't we get by with just one?
It could be overflowed true/false, or it could be the number of
subtransaction XIDs but with NULL instead if overflowed.

Do you have a view on this point?


NULL when overflowed seems like the opposite of the desired effect, calling attention to the exceptional status.  Make it a text column and write "overflow" or "###" as appropriate.  Anyone using the column is going to end up wanting to special-case overflow anyway and number-to-text conversion aside from overflow is simple enough if a number, and not just a display label, is needed.

David J.

Re: Add sub-transaction overflow status in pg_stat_activity

From
Tom Lane
Date:
"David G. Johnston" <david.g.johnston@gmail.com> writes:
> On Mon, Nov 14, 2022 at 9:41 AM Robert Haas <robertmhaas@gmail.com> wrote:
>> The overhead of fetching the information is not large, but Justin is
>> concerned about the effect on the display width. I feel that's kind of
>> a lost cause because it's so wide already anyway, but I don't see a
>> reason why we need *two* new columns. Can't we get by with just one?
>> It could be overflowed true/false, or it could be the number of
>> subtransaction XIDs but with NULL instead if overflowed.

> NULL when overflowed seems like the opposite of the desired effect, calling
> attention to the exceptional status.  Make it a text column and write
> "overflow" or "###" as appropriate.  Anyone using the column is going to
> end up wanting to special-case overflow anyway and number-to-text
> conversion aside from overflow is simple enough if a number, and not just a
> display label, is needed.

I'd vote for just overflowed true/false.  Why do people need to know
the exact number of subtransactions?  (If there is a use-case, that
would definitely be material for an auxiliary function instead of a
view column.)

            regards, tom lane



Re: Add sub-transaction overflow status in pg_stat_activity

From
Andres Freund
Date:
Hi,

On 2022-11-14 12:29:58 -0500, Tom Lane wrote:
> I'd vote for just overflowed true/false.  Why do people need to know
> the exact number of subtransactions?  (If there is a use-case, that
> would definitely be material for an auxiliary function instead of a
> view column.)

I'd go the other way. It's pretty unimportant whether it overflowed, it's
important how many subtxns there are. The cases where overflowing causes real
problems are when there's many thousand subtxns - which one can't judge just
from suboverflowed alone. Nor can monitoring a boolean tell you whether you're
creeping closer to the danger zone.

Monitoring the number also has the advantage that we'd not embed an
implementation detail ("suboverflowed") in a view. The number of
subtransactions is far less prone to changing than the way we implement
subtransactions in the procarray.

But TBH, to me this still is something that'd be better addressed with a
tracepoint.

I don't buy the argument that the ship of pg_stat_activity width has entirely
sailed. A session still fits onto a reasonably sized terminal in \x output -
but not much longer.

Greetings,

Andres Freund



Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Mon, Nov 14, 2022 at 12:47 PM Andres Freund <andres@anarazel.de> wrote:
> I'd go the other way. It's pretty unimportant whether it overflowed, it's
> important how many subtxns there are. The cases where overflowing causes real
> problems are when there's many thousand subtxns - which one can't judge just
> from suboverflowed alone. Nor can monitoring a boolean tell you whether you're
> creeping closer to the danger zone.

This is the opposite of what I believe to be true. I thought the
problem is that once a single backend overflows the subxid array, all
snapshots have to be created suboverflowed, and this makes visibility
checking more expensive. It's my impression that for some users this
creates and extremely steep performance cliff: the difference between
no backends overflowing and 1 backend overflowing is large, but
whether you are close to the limit makes no difference as long as you
don't reach it, and once you've passed it it makes little difference
how far past it you go.

> But TBH, to me this still is something that'd be better addressed with a
> tracepoint.

I think that makes it far, far less accessible to the typical user.

> I don't buy the argument that the ship of pg_stat_activity width has entirely
> sailed. A session still fits onto a reasonably sized terminal in \x output -
> but not much longer.

I guess it depends on what you mean by reasonable. For me, without \x,
it wraps across five times on an idle system with the 24x80 window
that I normally use, and even if I full screen my terminal window, it
still wraps around. With \x, sure, it fits, both only if the query is
shorter than the width of my window minus ~25 characters, which isn't
that likely to be the case IME because users write long queries. I
don't even try to use \x most of the time because the queries are
likely to be long enough to destroy any benefit, but it all depends on
how big your terminal is and how long your queries are.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
"David G. Johnston"
Date:
On Mon, Nov 14, 2022 at 11:43 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Nov 14, 2022 at 12:47 PM Andres Freund <andres@anarazel.de> wrote:
> I'd go the other way. It's pretty unimportant whether it overflowed, it's
> important how many subtxns there are. The cases where overflowing causes real
> problems are when there's many thousand subtxns - which one can't judge just
> from suboverflowed alone. Nor can monitoring a boolean tell you whether you're
> creeping closer to the danger zone.

This is the opposite of what I believe to be true. I thought the
problem is that once a single backend overflows the subxid array, all
snapshots have to be created suboverflowed, and this makes visibility
checking more expensive. It's my impression that for some users this
creates and extremely steep performance cliff: the difference between
no backends overflowing and 1 backend overflowing is large, but
whether you are close to the limit makes no difference as long as you
don't reach it, and once you've passed it it makes little difference
how far past it you go.


Assuming getting an actual count value to print is fairly cheap, or even a sunk cost if you are going to report overflow, I don't see why we wouldn't want to provide the more detailed data.

My concern, through ignorance, with reporting a number is that it would have no context in the query result itself.  If I have two rows with numbers, one with 10 and one with 1,000, is the two orders of magnitude of the second number important or does overflow happen at, say, 65,000 and so both numbers are exceedingly small and thus not worth worrying about?  That can be handled by documentation just fine, so long as the reference number in question isn't a per-session variable.  Otherwise, showing some kind of "percent of max" computation seems warranted.  In which case maybe the two presentation outputs would be:

1,000 (13%)
Overflowed

David J.

Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Mon, Nov 14, 2022 at 2:17 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:
> Assuming getting an actual count value to print is fairly cheap, or even a sunk cost if you are going to report
overflow,I don't see why we wouldn't want to provide the more detailed data. 
>
> My concern, through ignorance, with reporting a number is that it would have no context in the query result itself.
IfI have two rows with numbers, one with 10 and one with 1,000, is the two orders of magnitude of the second number
importantor does overflow happen at, say, 65,000 and so both numbers are exceedingly small and thus not worth worrying
about? That can be handled by documentation just fine, so long as the reference number in question isn't a per-session
variable. Otherwise, showing some kind of "percent of max" computation seems warranted.  In which case maybe the two
presentationoutputs would be: 
>
> 1,000 (13%)
> Overflowed

I think the idea of cramming a bunch of stuff into a text field is
dead on arrival. Data types are a wonderful invention because they let
people write queries, say looking for backends where overflowed =
true, or backends where subxids > 64. that gets much harder if the
query has to try to make sense of some random text representation.

If both values are separately important, then we need to report them
both, and the only question is whether to do that in pg_stat_activity
or via a side mechanism. What I don't yet understand is why that's
true. I think the important question is whether there are overflowed
backends, and Andres thinks it's how many subtransaction XIDs there
are, so there is a reasonable chance that both things actually matter
in separate scenarios. But I only know the scenario in which
overflowed matters, not the one in which subtransaction XID count
matters.

--
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Nov 14, 2022 at 12:47 PM Andres Freund <andres@anarazel.de> wrote:
>> I'd go the other way. It's pretty unimportant whether it overflowed, it's
>> important how many subtxns there are. The cases where overflowing causes real
>> problems are when there's many thousand subtxns - which one can't judge just
>> from suboverflowed alone. Nor can monitoring a boolean tell you whether you're
>> creeping closer to the danger zone.

> This is the opposite of what I believe to be true. I thought the
> problem is that once a single backend overflows the subxid array, all
> snapshots have to be created suboverflowed, and this makes visibility
> checking more expensive.

Yeah, that's what I thought too.  Andres, please enlarge ...

            regards, tom lane



Re: Add sub-transaction overflow status in pg_stat_activity

From
Andres Freund
Date:
Hi,

On 2022-11-14 13:43:41 -0500, Robert Haas wrote:
> On Mon, Nov 14, 2022 at 12:47 PM Andres Freund <andres@anarazel.de> wrote:
> > I'd go the other way. It's pretty unimportant whether it overflowed, it's
> > important how many subtxns there are. The cases where overflowing causes real
> > problems are when there's many thousand subtxns - which one can't judge just
> > from suboverflowed alone. Nor can monitoring a boolean tell you whether you're
> > creeping closer to the danger zone.
> 
> This is the opposite of what I believe to be true. I thought the
> problem is that once a single backend overflows the subxid array, all
> snapshots have to be created suboverflowed, and this makes visibility
> checking more expensive. It's my impression that for some users this
> creates and extremely steep performance cliff: the difference between
> no backends overflowing and 1 backend overflowing is large, but
> whether you are close to the limit makes no difference as long as you
> don't reach it, and once you've passed it it makes little difference
> how far past it you go.

First, it's not good to have a cliff that you can't see coming - presumbly
you'd want to warn *before* you regularly reach PGPROC_MAX_CACHED_SUBXIDS
subxids, rather when the shit has hit the fan already.

IMO the number matters a lot when analyzing why this is happening / how to
react. A session occasionally reaching 65 subxids might be tolerable and not
necessarily indicative of a bug. But 100k subxids is something that one just
can't accept.


Perhaps this would better be tackled by a new "visibility" view. It could show
- number of sessions with a snapshot
- max age of backend xmin
- pid with max backend xmin
- number of sessions that suboverflowed
- pid of the session with the most subxids
- age of the oldest prepared xact
- age of the oldest slot
- age of the oldest walsender
- ...

Perhaps implemented in SQL, with new functions for accessing the properties we
don't expose today.  That'd address the pg_stat_activity width, while still
allowing very granular access when necessary. And provide insight into
something that's way to hard to query right now.


> > I don't buy the argument that the ship of pg_stat_activity width has entirely
> > sailed. A session still fits onto a reasonably sized terminal in \x output -
> > but not much longer.
> 
> I guess it depends on what you mean by reasonable. For me, without \x,
> it wraps across five times on an idle system with the 24x80 window
> that I normally use, and even if I full screen my terminal window, it
> still wraps around. With \x, sure, it fits, both only if the query is
> shorter than the width of my window minus ~25 characters, which isn't
> that likely to be the case IME because users write long queries.
>
> I don't even try to use \x most of the time because the queries are likely
> to be long enough to destroy any benefit, but it all depends on how big your
> terminal is and how long your queries are.

I pretty much always use less with -S/--chop-long-lines (via $LESS), otherwise
I find psql to be pretty hard to use.

Greetings,

Andres Freund



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Tue, Nov 15, 2022 at 2:47 AM Andres Freund <andres@anarazel.de> wrote:
>
> First, it's not good to have a cliff that you can't see coming - presumbly
> you'd want to warn *before* you regularly reach PGPROC_MAX_CACHED_SUBXIDS
> subxids, rather when the shit has hit the fan already.

I agree with the point that it is good to have a way to know that the
problem is about to happen.  So for that reason, we should show the
subtransaction count.  With showing count user can exactly know if
there are some sessions that could create problems in near future and
may take some action before the problem actually happens.

> IMO the number matters a lot when analyzing why this is happening / how to
> react. A session occasionally reaching 65 subxids might be tolerable and not
> necessarily indicative of a bug. But 100k subxids is something that one just
> can't accept.

Actually, we will see the problem as soon as it has crossed 64 because
after that for any visibility checking we need to check the SLRU.  So
I feel both count and overflow are important.  Count to know that we
are heading towards overflow and overflow to know that it has already
happened.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Mon, Nov 14, 2022 at 10:18 PM David G. Johnston
<david.g.johnston@gmail.com> wrote:

>> Do you have a view on this point?
>>
>
> NULL when overflowed seems like the opposite of the desired effect, calling attention to the exceptional status.
Makeit a text column and write "overflow" or "###" as appropriate.  Anyone using the column is going to end up wanting
tospecial-case overflow anyway and number-to-text conversion aside from overflow is simple enough if a number, and not
justa display label, is needed. 

+1, if we are interested to add only one column then this could be the
best way to show.


--
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Mon, Nov 14, 2022 at 4:17 PM Andres Freund <andres@anarazel.de> wrote:
> Perhaps this would better be tackled by a new "visibility" view. It could show
> - number of sessions with a snapshot
> - max age of backend xmin
> - pid with max backend xmin
> - number of sessions that suboverflowed
> - pid of the session with the most subxids
> - age of the oldest prepared xact
> - age of the oldest slot
> - age of the oldest walsender
> - ...
>
> Perhaps implemented in SQL, with new functions for accessing the properties we
> don't expose today.  That'd address the pg_stat_activity width, while still
> allowing very granular access when necessary. And provide insight into
> something that's way to hard to query right now.

I wouldn't be against a pg_stat_visibility view, but I don't think I'd
want it to just output a single summary row. I think we really need to
give people an easy way to track down which session is the problem;
the existence of the problem is already obvious from the SLRU-related
wait events.

If we moved backend_xid and backend_xmin out to this new view, added
these subtransaction-related things, and allowed for a join on pid, I
could get behind that, but it's probably a bit more painful for users
than just accepting that the view is going to further outgrow the
terminal window. It might be better in the long term because perhaps
we're going to find more things that would fit into this new view, but
I don't know.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Andres Freund
Date:
Hi,

On 2022-11-15 09:04:25 -0500, Robert Haas wrote:
> On Mon, Nov 14, 2022 at 4:17 PM Andres Freund <andres@anarazel.de> wrote:
> > Perhaps this would better be tackled by a new "visibility" view. It could show
> > - number of sessions with a snapshot
> > - max age of backend xmin
> > - pid with max backend xmin
> > - number of sessions that suboverflowed
> > - pid of the session with the most subxids
> > - age of the oldest prepared xact
> > - age of the oldest slot
> > - age of the oldest walsender
> > - ...
> >
> > Perhaps implemented in SQL, with new functions for accessing the properties we
> > don't expose today.  That'd address the pg_stat_activity width, while still
> > allowing very granular access when necessary. And provide insight into
> > something that's way to hard to query right now.
> 
> I wouldn't be against a pg_stat_visibility view, but I don't think I'd
> want it to just output a single summary row.

I think it'd be more helpful to just have a single row (or maybe a fixed
number of rows) - from what I've observed the main problem people have is
condensing the available information, rather than not having information
available at all.


> I think we really need to
> give people an easy way to track down which session is the problem;
> the existence of the problem is already obvious from the SLRU-related
> wait events.

Hence the suggestion to show the pid of the session with the most subxacts. We
probably also should add a bunch of accessor functions for people that want
more detail... But just seeing in one place what's problematic would be the
big get, the rest will be a small percentage of users IME.

Greetings,

Andres Freund



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Tue, Nov 15, 2022 at 7:34 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Nov 14, 2022 at 4:17 PM Andres Freund <andres@anarazel.de> wrote:
> > Perhaps this would better be tackled by a new "visibility" view. It could show
> > - number of sessions with a snapshot
> > - max age of backend xmin
> > - pid with max backend xmin
> > - number of sessions that suboverflowed
> > - pid of the session with the most subxids
> > - age of the oldest prepared xact
> > - age of the oldest slot
> > - age of the oldest walsender
> > - ...
> >
> > Perhaps implemented in SQL, with new functions for accessing the properties we
> > don't expose today.  That'd address the pg_stat_activity width, while still
> > allowing very granular access when necessary. And provide insight into
> > something that's way to hard to query right now.
>
> I wouldn't be against a pg_stat_visibility view, but I don't think I'd
> want it to just output a single summary row. I think we really need to
> give people an easy way to track down which session is the problem;
> the existence of the problem is already obvious from the SLRU-related
> wait events.
>

Even I feel per backend-wise information would be more useful and easy
to use instead of a single summary row.  I think It's fine to create a
new view if we do not want to add more members to the existing view.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Tue, Nov 15, 2022 at 2:29 PM Andres Freund <andres@anarazel.de> wrote:
> Hence the suggestion to show the pid of the session with the most subxacts. We
> probably also should add a bunch of accessor functions for people that want
> more detail... But just seeing in one place what's problematic would be the
> big get, the rest will be a small percentage of users IME.

I guess all I can say here is that my experience differs.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Bruce Momjian
Date:
On Mon, Nov 14, 2022 at 10:09:57AM -0500, Robert Haas wrote:
> I think I fundamentally disagree with the idea that we should refuse
> to expose instrumentation data because some day the internals might
> change. If we accepted that argument categorically, we wouldn't have
> things like backend_xmin or backend_xid in pg_stat_activity, or wait
> events either, but we do have those things and users find them useful.
> They suck in the sense that you need to know quite a bit about how the
> internals work in order to use them to find problems, but people who
> want to support production PostgreSQL instances have to learn about
> how those internals work one way or the other because they
> demonstrably matter. It is absolutely stellar when we can say "hey, we

I originally thought having this value in pg_stat_activity was overkill,
but seeing the other internal/warning columns in that view, I think it
makes sense.  Oddly, is our 64 snapshot performance limit even
documented anywhere?  I know it is in Simon's patch I am working on.

-- 
  Bruce Momjian  <bruce@momjian.us>        https://momjian.us
  EDB                                      https://enterprisedb.com

  Indecision is a decision.  Inaction is an action.  Mark Batterson




Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Wed, Nov 23, 2022 at 2:01 PM Bruce Momjian <bruce@momjian.us> wrote:
> I originally thought having this value in pg_stat_activity was overkill,
> but seeing the other internal/warning columns in that view, I think it
> makes sense.  Oddly, is our 64 snapshot performance limit even
> documented anywhere?  I know it is in Simon's patch I am working on.

If it is, I'm not aware of it. We often don't document things that are
as internal as that.

One thing that I'd really like to see better documented is exactly
what it is that causes a problem. But first we'd have to understand it
ourselves. It's not as simple as "if you have more than 64 subxacts in
any top-level xact, kiss performance good-bye!" because for there to
be a problem, at least one backend (and probably many) have to take
snapshots that include that see that overflowed subxact cache and thus
get marked suboverflowed. Then after that, those snapshots have to be
used often enough that the additional visibility-checking cost becomes
a problem. But it's also not good enough to just use those snapshots
against any old tuples, because tuples that are older than the
snapshot's xmin aren't going to cause additional lookups, nor are
tuples newer than the snapshot's xmax.

So it feels a bit complicated to me to think through the workload
where this really hurts. What I'm imagining is that you need a
relatively long-running transaction that overflows its subxact
limitation but then doesn't commit, so that lots of other backends get
overflowed snapshots, and also so that the xmin and xmax of the
snapshots being taken get further apart. Or maybe you can have a
series short-running transactions that each overflow their subxact
cache briefly, but they overlap, so that there's usually at least 1
around in that state, but in that case I think you need a separate
long-running transaction to push xmin and xmax further apart. Either
way, the backends that get the overflowed snapshots then need to go
look at some table data that's been recently modified, so that there
are xmin and xmax values newer than the snapshot's xmin.

Intuitively, I feel like this should be pretty rare, and largely
avoidable if you just don't use long-running transactions, which is a
good thing to avoid for other reasons anyway. But there may be more to
it than I'm realizing, because I've seen customers hit this issue
multiple times. I wonder whether there's some subtlety to the
triggering conditions that I'm not fully understanding.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Andres Freund
Date:
Hi,

On 2022-11-23 15:25:39 -0500, Robert Haas wrote:
> One thing that I'd really like to see better documented is exactly
> what it is that causes a problem. But first we'd have to understand it
> ourselves. It's not as simple as "if you have more than 64 subxacts in
> any top-level xact, kiss performance good-bye!" because for there to
> be a problem, at least one backend (and probably many) have to take
> snapshots that include that see that overflowed subxact cache and thus
> get marked suboverflowed. Then after that, those snapshots have to be
> used often enough that the additional visibility-checking cost becomes
> a problem. But it's also not good enough to just use those snapshots
> against any old tuples, because tuples that are older than the
> snapshot's xmin aren't going to cause additional lookups, nor are
> tuples newer than the snapshot's xmax.

Indeed. This is why I was thinking that just alerting for overflowed xact
isn't particularly helpful. You really want to see how much they overflow and
how often.

But even that might not be that helpful. Perhaps what we actually need is an
aggregate measure showing the time spent doing subxact lookups due to
overflowed snapshots? Seeing a substantial amount of time spent doing subxact
lookups would be much more accurate call to action than seeing a that some
sessions have a lot of subxacts.


> Intuitively, I feel like this should be pretty rare, and largely
> avoidable if you just don't use long-running transactions, which is a
> good thing to avoid for other reasons anyway.

I think they're just not always avoidable, even in a very well operated
system.


I wonder if we could lower the impact of suboverflowed snapshots by improving
the representation in PGPROC and SnapshotData. What if we

a) Recorded the min and max assigned subxid in PGPROC

b) Instead of giving up in GetSnapshotData() once we see a suboverflowed
   PGPROC, store the min/max subxid of the proc in SnapshotData. We could
   reliably "steal" space for that from ->subxip, as we won't need to store
   subxids for that proc.

c) When determining visibility with a suboverflowed snapshot we use the
   ranges from b) to check whether we need to do a subtrans lookup. I think
   that'll often prevent subtrans lookups.

d) If we encounter a subxid whose parent is in progress and not in ->subxid,
   and subxcnt isn't the max, add that subxid to subxip. That's not free
   because we'd basically need to do an insertion sort, but likely still a lot
   cheaper than doing repeated subtrans lookups.

I think we'd just need a one or two additional fields in SnapshotData.

Greetings,

Andres Freund



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Thu, Nov 24, 2022 at 2:26 AM Andres Freund <andres@anarazel.de> wrote:
>
> Indeed. This is why I was thinking that just alerting for overflowed xact
> isn't particularly helpful. You really want to see how much they overflow and
> how often.

I think the way of monitoring the subtransaction count and overflow
status is helpful at least for troubleshooting purposes.  By regularly
monitoring user will know which backend(pid) is particularly using
more subtransactions and prone to overflow and which backends are
actually frequently causing sub-overflow.

> I think they're just not always avoidable, even in a very well operated
> system.
>
>
> I wonder if we could lower the impact of suboverflowed snapshots by improving
> the representation in PGPROC and SnapshotData. What if we
>
> a) Recorded the min and max assigned subxid in PGPROC
>
> b) Instead of giving up in GetSnapshotData() once we see a suboverflowed
>    PGPROC, store the min/max subxid of the proc in SnapshotData. We could
>    reliably "steal" space for that from ->subxip, as we won't need to store
>    subxids for that proc.
>
> c) When determining visibility with a suboverflowed snapshot we use the
>    ranges from b) to check whether we need to do a subtrans lookup. I think
>    that'll often prevent subtrans lookups.
>
> d) If we encounter a subxid whose parent is in progress and not in ->subxid,
>    and subxcnt isn't the max, add that subxid to subxip. That's not free
>    because we'd basically need to do an insertion sort, but likely still a lot
>    cheaper than doing repeated subtrans lookups.
>
> I think we'd just need a one or two additional fields in SnapshotData.

+1

I think this approach will be helpful in many cases, especially when
only some of the backend is creating sub-overflow and impacting
overall system performance.  Now, most of the xids especially the top
xid will not fall in that range (unless that sub-overflowing backend
is constantly generating subxids and increasing its range) and the
lookups for that xids can be done directly in the snapshot's xip
array.

On another thought, in XidInMVCCSnapshot() in case of sub-overflow why
don't we look into the snapshot's xip array first and see if the xid
exists there? if not then we can look into the pg_subtrans SLRU and
fetch the top xid and relook again into the xip array.  It will be
more costly in cases where we do not find xid in the xip array because
then we will have to search this array twice but I think looking into
this array is much cheaper than directly accessing SLRU.

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Wed, Nov 23, 2022 at 3:56 PM Andres Freund <andres@anarazel.de> wrote:
> Indeed. This is why I was thinking that just alerting for overflowed xact
> isn't particularly helpful. You really want to see how much they overflow and
> how often.

I think if we just expose the is-overflowed feld and the count, people
can poll. It works fine for wait events and I think it's fine here,
too.

> But even that might not be that helpful. Perhaps what we actually need is an
> aggregate measure showing the time spent doing subxact lookups due to
> overflowed snapshots? Seeing a substantial amount of time spent doing subxact
> lookups would be much more accurate call to action than seeing a that some
> sessions have a lot of subxacts.

That's not responsive to the need that I have. I need users to be able
to figure out which backend(s) are overflowing their snapshots -- and
perhaps how badly and how often --- not which backends are incurring
an expense as a result. There may well be a use case for the latter
thing but it's a different problem.

> I wonder if we could lower the impact of suboverflowed snapshots by improving
> the representation in PGPROC and SnapshotData. What if we
>
> a) Recorded the min and max assigned subxid in PGPROC
>
> b) Instead of giving up in GetSnapshotData() once we see a suboverflowed
>    PGPROC, store the min/max subxid of the proc in SnapshotData. We could
>    reliably "steal" space for that from ->subxip, as we won't need to store
>    subxids for that proc.
>
> c) When determining visibility with a suboverflowed snapshot we use the
>    ranges from b) to check whether we need to do a subtrans lookup. I think
>    that'll often prevent subtrans lookups.

Wouldn't you basically need to take the union of all the ranges,
probably by keeping the lowest min and the highest max? I'm not sure
how much that would really help at that point.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Wed, Nov 30, 2022 at 11:01 AM Robert Haas <robertmhaas@gmail.com> wrote:
> That's not responsive to the need that I have. I need users to be able
> to figure out which backend(s) are overflowing their snapshots -- and
> perhaps how badly and how often --- not which backends are incurring
> an expense as a result. There may well be a use case for the latter
> thing but it's a different problem.

So ... I want to go ahead and commit Dilip's v4 patch, or something
very like it. Most people were initially supportive. Tom expressed
some opposition, but it sounds like that was mostly to the discussion
going on and on rather than the idea per se. Andres also expressed
some concerns, but I really think the problem he's worried about is
something slightly different and need not block this work. I note also
that the v4 patch is designed in such a way that it does not change
any view definitions, so the compatibility impact of committing it is
basically nil.

Any strenuous objections?

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Nathan Bossart
Date:
On Mon, Dec 12, 2022 at 11:15:43AM -0500, Robert Haas wrote:
> Any strenuous objections?

Nope.  In fact, +1.  Until more work is done to alleviate the performance
issues, this information will likely prove useful.

-- 
Nathan Bossart
Amazon Web Services: https://aws.amazon.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Justin Pryzby
Date:
On Mon, Dec 12, 2022 at 09:33:51AM -0800, Nathan Bossart wrote:
> On Mon, Dec 12, 2022 at 11:15:43AM -0500, Robert Haas wrote:
> > Any strenuous objections?
> 
> Nope.  In fact, +1.  Until more work is done to alleviate the performance
> issues, this information will likely prove useful.

The docs could use a bit of attention.  Otherwise +1.

diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
index 4efa1d5fca0..ac15e2ce789 100644
--- a/doc/src/sgml/monitoring.sgml
+++ b/doc/src/sgml/monitoring.sgml
@@ -5680,12 +5680,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
         <returnvalue>record</returnvalue>
        </para>
        <para>
-        Returns a record of information about the backend's subtransactions.
-        The fields returned are <parameter>subxact_count</parameter> identifies
-        number of active subtransaction and <parameter>subxact_overflow
-        </parameter> shows whether the backend's subtransaction cache is
-        overflowed or not.
-       </para></entry>
+        Returns a record of information about the subtransactions of the backend
+        with the specified ID.
+        The fields returned are <parameter>subxact_count</parameter>, which
+        identifies the number of active subtransaction and
+        <parameter>subxact_overflow</parameter>, which shows whether the
+        backend's subtransaction cache is overflowed or not.
        </para></entry>
       </row>
 



Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Mon, Dec 12, 2022 at 12:42 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> index 4efa1d5fca0..ac15e2ce789 100644
> --- a/doc/src/sgml/monitoring.sgml
> +++ b/doc/src/sgml/monitoring.sgml
> @@ -5680,12 +5680,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
>          <returnvalue>record</returnvalue>
>         </para>
>         <para>
> -        Returns a record of information about the backend's subtransactions.
> -        The fields returned are <parameter>subxact_count</parameter> identifies
> -        number of active subtransaction and <parameter>subxact_overflow
> -        </parameter> shows whether the backend's subtransaction cache is
> -        overflowed or not.
> -       </para></entry>
> +        Returns a record of information about the subtransactions of the backend
> +        with the specified ID.
> +        The fields returned are <parameter>subxact_count</parameter>, which
> +        identifies the number of active subtransaction and
> +        <parameter>subxact_overflow</parameter>, which shows whether the
> +        backend's subtransaction cache is overflowed or not.
>         </para></entry>
>        </row>

Makes sense.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Mon, Dec 12, 2022 at 11:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 12, 2022 at 12:42 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > index 4efa1d5fca0..ac15e2ce789 100644
> > --- a/doc/src/sgml/monitoring.sgml
> > +++ b/doc/src/sgml/monitoring.sgml
> > @@ -5680,12 +5680,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
> >          <returnvalue>record</returnvalue>
> >         </para>
> >         <para>
> > -        Returns a record of information about the backend's subtransactions.
> > -        The fields returned are <parameter>subxact_count</parameter> identifies
> > -        number of active subtransaction and <parameter>subxact_overflow
> > -        </parameter> shows whether the backend's subtransaction cache is
> > -        overflowed or not.
> > -       </para></entry>
> > +        Returns a record of information about the subtransactions of the backend
> > +        with the specified ID.
> > +        The fields returned are <parameter>subxact_count</parameter>, which
> > +        identifies the number of active subtransaction and
> > +        <parameter>subxact_overflow</parameter>, which shows whether the
> > +        backend's subtransaction cache is overflowed or not.
> >         </para></entry>
> >        </row>
>
> Makes sense.

+1

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Julien Rouhaud
Date:
On Tue, Dec 13, 2022 at 5:09 AM Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Mon, Dec 12, 2022 at 11:21 PM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Mon, Dec 12, 2022 at 12:42 PM Justin Pryzby <pryzby@telsasoft.com> wrote:
> > > diff --git a/doc/src/sgml/monitoring.sgml b/doc/src/sgml/monitoring.sgml
> > > index 4efa1d5fca0..ac15e2ce789 100644
> > > --- a/doc/src/sgml/monitoring.sgml
> > > +++ b/doc/src/sgml/monitoring.sgml
> > > @@ -5680,12 +5680,12 @@ FROM pg_stat_get_backend_idset() AS backendid;
> > >          <returnvalue>record</returnvalue>
> > >         </para>
> > >         <para>
> > > -        Returns a record of information about the backend's subtransactions.
> > > -        The fields returned are <parameter>subxact_count</parameter> identifies
> > > -        number of active subtransaction and <parameter>subxact_overflow
> > > -        </parameter> shows whether the backend's subtransaction cache is
> > > -        overflowed or not.
> > > -       </para></entry>
> > > +        Returns a record of information about the subtransactions of the backend
> > > +        with the specified ID.
> > > +        The fields returned are <parameter>subxact_count</parameter>, which
> > > +        identifies the number of active subtransaction and
> > > +        <parameter>subxact_overflow</parameter>, which shows whether the
> > > +        backend's subtransaction cache is overflowed or not.
> > >         </para></entry>
> > >        </row>
> >
> > Makes sense.
>
> +1

+1



Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Tue, Dec 13, 2022 at 2:29 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > > Makes sense.
> >
> > +1
>
> +1

Committed with a bit more word-smithing on the documentation.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Ted Yu
Date:


On Mon, Dec 19, 2022 at 11:57 AM Robert Haas <robertmhaas@gmail.com> wrote:
On Tue, Dec 13, 2022 at 2:29 AM Julien Rouhaud <rjuju123@gmail.com> wrote:
> > > Makes sense.
> >
> > +1
>
> +1

Committed with a bit more word-smithing on the documentation.

--
Robert Haas
EDB: http://www.enterprisedb.com

Hi,
It seems the comment for `backend_subxact_overflowed` missed a word.

Please see the patch.

Thanks 
Attachment

Re: Add sub-transaction overflow status in pg_stat_activity

From
Robert Haas
Date:
On Mon, Dec 19, 2022 at 3:48 PM Ted Yu <yuzhihong@gmail.com> wrote:
> It seems the comment for `backend_subxact_overflowed` missed a word.
>
> Please see the patch.

Committed this fix, thanks.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Dilip Kumar
Date:
On Tue, Dec 20, 2022 at 2:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Mon, Dec 19, 2022 at 3:48 PM Ted Yu <yuzhihong@gmail.com> wrote:
> > It seems the comment for `backend_subxact_overflowed` missed a word.
> >
> > Please see the patch.
>
> Committed this fix, thanks.

Thanks, Robert!

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com



Re: Add sub-transaction overflow status in pg_stat_activity

From
Kirill Reshke
Date:
On Tue, 20 Dec 2022 at 09:23, Dilip Kumar <dilipbalaut@gmail.com> wrote:
>
> On Tue, Dec 20, 2022 at 2:32 AM Robert Haas <robertmhaas@gmail.com> wrote:
> >
> > On Mon, Dec 19, 2022 at 3:48 PM Ted Yu <yuzhihong@gmail.com> wrote:
> > > It seems the comment for `backend_subxact_overflowed` missed a word.
> > >
> > > Please see the patch.
> >
> > Committed this fix, thanks.
>
> Thanks, Robert!
>
> --
> Regards,
> Dilip Kumar
> EnterpriseDB: http://www.enterprisedb.com
>
>


Hi hackers!

Nice patch, seems it may be useful in cases like alerting that subxid
overflow happened is database or whatever.
But I'm curious, what is the following work on this? I think it may be
way more helpful to, for example, log queries, causing sub-tx
overflow,
or even kill the backend, causing sub-tx overflow with GUC variables,
setting server behaviour.
For example, in Greenplum there is gp_subtransaction_overflow
extension and GUC for simply logging problematic queries[1]. Can we
have something
similar in PostgreSQL on the server-side?

[1]
https://github.com/greenplum-db/gpdb/blob/6X_STABLE/gpcontrib/gp_subtransaction_overflow/gp_subtransaction_overflow.c#L42