Re: [GENERAL] Potential bug with pg_notify - Mailing list pgsql-general

From François Beaulieu
Subject Re: [GENERAL] Potential bug with pg_notify
Date
Msg-id A778EE74-B317-4927-80D5-9E2C381EF064@sbktelecom.com
Whole thread Raw
In response to Re: [GENERAL] Potential bug with pg_notify  (Adrian Klaver <adrian.klaver@aklaver.com>)
Responses Re: [GENERAL] Potential bug with pg_notify  (Adrian Klaver <adrian.klaver@aklaver.com>)
List pgsql-general
> On Feb 13, 2017, at 1:56 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
> 
> On 02/13/2017 09:04 AM, François Beaulieu wrote:
>> 
>>> On Feb 13, 2017, at 11:45 AM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:
>>> 
>                                                                     |
>>>> 
>>>>> 3) Are the first row and the second row in the same partition?
>>>> 
>>>> Doubtful, the problem occurs several times a day and we only have one partition a day. Let me check with the above
example.What would be the best way to determine which child a row is in, and the relative position in the child table?
 
>>> 
>>> As to position, maybe ctid though it has caveats:
>> 
>> The three rows in my example return a ctid of (742,17), (742,18) and (742,19) respectively, in their child table.
So,probably not at a partition boundary.
 
>> 
>>>> Also; my worker in written in perl and uses DBD::Pg. I haven’t been able to 100% eliminate the module itself as
thecause of the bug. Any suggestions on how I might go about doing that efficiently?
 
>>> 
>>> What does the worker do?
>> 
>> Sorry, that's my employer’s classified IP. :-)
>> Does it matter?
> 
> Only that it makes it harder to give any suggestions on eliminating it as a source of error if it is a black box.  I
don'tthink, at this point, it is necessary to see the actual source. If it is possible a high level synopsis of what it
doesmight be sufficient.
 

Suffice it to say the worker uses a read-only connection to the database to receive these notices and to query the
tablefor the matching rows in certain circumstances. It never modifies the database in any way; it only uses this
informationto act upon a completely different subsystem. I loosely based it on a snippet of code from this very mailing
list:

https://www.postgresql.org/message-id/20050104031937.GA80695@winnie.fuhr.org

>>> Could it be the module is not dealing with time zones correctly? Though thinking about this that would seem to
manifesta problem only around the 7th day boundary. So put this down to thinking aloud.
 
>> 
>> No, the partitioning scheme seems to be respecting the timezone properly, and my issue is happening every few hours
inthe middle of the day and we’re in UTC+5, so not near the end of the day in UTC. Besides, I believe timestamp without
timezoneassumes the local timezone of the server, which is set to UTC anyway.
 
>> 
>> Has the schema eliminated your original theory regarding the delaying of the generation of the _id? I don’t think
thatwould normally be an issue that occurs sporadically and the _id seems to be part of the INSERT, which would
indicatethat, as it should, it’s done generating before my trigger is called.
 
> 
> I don't see anything that would explain a delay. Still the fact remains that in most cases the notify captures the
_id,but in some cases it does not. Going back to your OP I realized I missed that the NEW.userfield was also not coming
through.So that seems to confirm that pg_notify() is firing before it gets access to NEW.*. Having said that I have no
ideawhy?
 
> 
> The only thing I can think to do is(untested):
> 
> CREATE OR REPLACE FUNCTION notify_trigger() RETURNS trigger AS $$
> DECLARE
>     _newid integer;
> BEGIN
>  SELECT NEW._id INTO _newid;
>  IF _newid IS NULL OR NOT FOUND THEN
>     RAISE NOTICE 'NEW._id is NULL/NOT FOUND';
>     pg_sleep(0.1); --Or whatever interval you want.
>  END IF;
>  PERFORM pg_notify('watchers', TG_TABLE_NAME || ',' || NEW._id|| ',' || NEW.userfield);
>  RETURN new;
> END;
> $$ LANGUAGE plpgsql;
> 
> Not really a solution but it might help determine whether it is a timing issue. Also this is probably something that
shouldbe done on a test server to be safe.
 

Thanks for the tip, I’ll try that in my lab. I want to try to replicate the issue more consistently first, so that my
testsafter the change will be more conclusive. 
 

In the meantime, if anyone has any other suggestions, please don’t hesitate.


Thanks,
-=François Beaulieu
SBK Telecom


pgsql-general by date:

Previous
From: Jeff Janes
Date:
Subject: Re: [GENERAL] Bad planning data resulting in OOM killing of postgres
Next
From: David Hinkle
Date:
Subject: Re: [GENERAL] Bad planning data resulting in OOM killing of postgres