Thread: bug on log generation?

bug on log generation?

From
Marcos Pegoraro
Date:
I have to fix my log files by hand because their content is not properly formatted. I'm using version 14.4, but it happened when I was using version 11 too. It happens only when a statement is huge, either because it is a long sequence of updates inside a WITH or DO statement, or because I'm updating a bytea field. This kind of statement appears in the log dozens of times a day, but only 2 or 3 of them are wrongly formatted.

One example is a WITH statement with dozens of inserts and updates in it:
2022-08-06 10:05:10.822 -03,"user_0566","db",1956432,"10.158.0.17:56098",62ee66ee.1dda50,1,"SELECT",2022-08-06 10:04:46 -03,266/383801,229907216,LOG,00000,"duration: 1735.107 ms  execute PRSTMTST155939968154/PORTALST155939968154: /*Cobranca.ImportaCobranca*/ WITH Receber1 AS (UPDATE fin_Re>
NNLiquidado1 AS (UPDATE fin_ReceberNossoNumero SET Status = 7 WHERE NossoNumero = any($${90062164}$$)),
--statement continues, with some more dozens of update/inserts
NNDesconsiderado48 AS (UPDATE fin_recebernossonumero SET status = 9 WHERE receber_id = 104201 AND status = 1 AND nossonumero <> 90086321),
NNExcluir48 AS (UPDATE fin_recebernossonumero SET status = 4 WHERE receber_id = 104201 AND status = any($IN${2,3}$IN$) AND nossonumero <> 90086321 RETURNING recebernossonumero_id),
Baixa48 AS (INSERT INTO fin_ReceberBaixa(Historico, Desconto, Valor, Lancamento) VALUES ($vs$Paga2022-08-06 10:07:07.505 -03,"user_0566","db",1956432,"10.158.0.17:56098",62ee66ee.1dda50,2,"idle",2022-08-06 10:04:46 -03,266/0,0,FATAL,57P01,"terminating connection due to administrator command",,,,,,,,,"2022062701adriana.aguiar","client backend",,0
2022-08-06 10:07:07.507 -03,"user_0328","db",1957035,"10.158.0.17:57194",62ee6730.1ddcab,1,"idle",2022-08-06 10:05:52 -03,410/0,0,FATAL,57P01,"terminating connection due to administrator command",,,,,,,,,"2022062701tmk06.madureira","client backend",,0

If you search for "$vs$Paga2022-08-06 10:07:07.505" you'll see that "$vs$Paga" is still part of the first statement, while "2022-08-06 10:07:07.505" is the start of the next log entry; some characters of the previous statement are missing.

Another example is a single update with a large bytea field in it:
2022-08-06 15:57:46.955 -03,"user_0591","db",2103042,"10.158.0.17:43662",62eeb9aa.201702,1,"INSERT",2022-08-06 15:57:46 -03,49/639939,230013197,LOG,00000,"duration: 11.012 ms  execute PRSTMTST1612483842/PORTALST1612483842: WITH upsert AS (
UPDATE sys_var SET varBlob =  $1  WHERE name =  $2  RETURNING *) INSERT INTO sys_var (Name, varBlob) SELECT  $3 ,  $4  WHERE NOT EXISTS (SELECT * FROM upsert)","parameters: $1 =
'\x3c3f786d6 --field content continues
$2 = '/users/39818/XMLConfig', $3 = '/users/39818/XMLConfig', $4 = '\x3c3f786d6 --field content continues
e4445583e2d313c2f47524f555045445f494e4445583e32022-08-06 15:58:42.436 -03,"user_0591","db",2103042,"10.158.0.17:43662",62eeb9aa.201702,2,"idle",2022-08-06 15:57:46 -03,49/0,0,FATAL,57P01,"terminating connection due to administrator command",,,,,,,,,"","client backend",,-4199849316459484872
2022-08-06 15:58:42.436 -03,"user_0143","db",2103112,"10.158.0.17:43794",62eeb9bf.201748,1,"idle",2022-08-06 15:58:07 -03,44/0,0,FATAL,57P01,"terminating connection due to administrator command",,,,,,,,,"2022062701joyceb.l@hotmail.com","client backend",,0

Here "4445583e32022-08-06 15:58:42.436", where bytea content "4445583e" was being displayed and the next statement started with "32022-08-06 15:58:42.436".

Because the previous line is not terminated correctly, I cannot import the log files properly, so I have to edit them by hand before importing.

thanks
Marcos

Re: bug on log generation?

From
Tom Lane
Date:
Marcos Pegoraro <marcos@f10.com.br> writes:
> I have to fix my log files by hand because their content is not properly formatted.

What mechanism are you using to store the log?  If syslog is involved,
it's reputed to drop data under load.

            regards, tom lane



Re: bug on log generation?

From
Marcos Pegoraro
Date:
It's csvlog only.

Best regards,




On Sun, Aug 7, 2022 at 11:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
Marcos Pegoraro <marcos@f10.com.br> writes:
> I have to fix my log files by hand because their content is not properly formatted.

What mechanism are you using to store the log?  If syslog is involved,
it's reputed to drop data under load.

                        regards, tom lane

Re: bug on log generation?

From
Andres Freund
Date:
Hi,

On 2022-08-07 11:56:44 -0300, Marcos Pegoraro wrote:
> It's csvlog only.

How are you running postgres? If the logger process runs into trouble it might
write to stderr.

Is there a chance your huge statements would make you run out of space?

Greetings,

Andres Freund



Re: bug on log generation?

From
Marcos Pegoraro
Date:

> How are you running postgres? If the logger process runs into trouble it might
> write to stderr.
>
> Is there a chance your huge statements would make you run out of space?

Well, I don't think it is an out-of-space problem, because logging doesn't stop; it just splits that message, and as you can see, the next message is logged properly. And the statement is not that huge: these statements are no more than 10 or 20 kB. As I said, these statements occur dozens of times a day, but only once or twice are they logged incorrectly.
One additional piece of information: the split message has an out-of-order log time. At that point the log file was getting 2 or 3 entries per second, and that message was written 1 or 2 minutes late. It seems as if it happens now but is only stored a minute or two later.

thanks
Marcos

Re: bug on log generation?

From
Andrew Dunstan
Date:
On 2022-08-08 Mo 07:34, Marcos Pegoraro wrote:
>
>
>     How are you running postgres? If the logger process runs into
>     trouble it might
>     write to stderr.
>
>     Is there a chance your huge statements would make you run out of
>     space?
>
> Well, I don't think it is an out-of-space problem, because logging
> doesn't stop; it just splits that message, and as you can see, the
> next message is logged properly. And the statement is not that huge:
> these statements are no more than 10 or 20 kB. As I said, these
> statements occur dozens of times a day, but only once or twice are
> they logged incorrectly.
> One additional piece of information: the split message has an
> out-of-order log time. At that point the log file was getting 2 or 3
> entries per second, and that message was written 1 or 2 minutes late.
> It seems as if it happens now but is only stored a minute or two later.
>
>

It looks like a failure of the log chunking protocol, with long messages
being improperly interleaved. I don't think we've had reports of such a
failure since commit c17e863bc7 back in 2012, but maybe my memory is
failing.

What platform is this on? Is it possible that on some platform the chunk
size we're using is not doing an atomic write?


syslogger.h says:


    #ifdef PIPE_BUF
    /* Are there any systems with PIPE_BUF > 64K?  Unlikely, but ... */
    #if PIPE_BUF > 65536
    #define PIPE_CHUNK_SIZE  65536
    #else
    #define PIPE_CHUNK_SIZE  ((int) PIPE_BUF)
    #endif
    #else                           /* not defined */
    /* POSIX says the value of PIPE_BUF must be at least 512, so use that */
    #define PIPE_CHUNK_SIZE  512
    #endif
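
The same file defines the chunk header the protocol uses; it looks
roughly like this in v14 (comments paraphrased, so treat this as
approximate rather than a verbatim quote):


    typedef struct
    {
        char        nuls[2];        /* always \0\0, marks a protocol chunk */
        uint16      len;            /* size of this chunk (data only) */
        int32       pid;            /* writer's pid, used by the syslogger
                                     * to reassemble interleaved chunks */
        char        is_last;        /* last chunk of message? 't'/'f'
                                     * ('T'/'F' for the CSV case) */
        char        data[FLEXIBLE_ARRAY_MEMBER];    /* payload starts here */
    } PipeProtoHeader;


Each chunk is supposed to go down the pipe in a single write() of at
most PIPE_CHUNK_SIZE bytes, which POSIX requires to be atomic up to
PIPE_BUF, and the syslogger reassembles chunks per pid. If a write gets
split or dropped somewhere, you'd see exactly this kind of interleaving
in the output file.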


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: bug on log generation?

From
Tom Lane
Date:
Andrew Dunstan <andrew@dunslane.net> writes:
> What platform is this on? Is it possible that on some platform the chunk
> size we're using is not doing an atomic write?

Another idea is that some of the write() calls are failing --- elog.c
doesn't check for that.  Eyeing the POSIX spec for write(), I wonder
if somehow the pipe has gotten set into O_NONBLOCK mode and we're
not retrying EAGAIN failures.
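
Something along these lines, say (just a sketch of the idea, with a
hypothetical helper name, not the actual elog.c code, and untested):


    #include <errno.h>
    #include <unistd.h>

    /*
     * Hypothetical helper: write one protocol chunk, retrying on
     * EAGAIN/EINTR.  On EAGAIN nothing has been written yet, so
     * re-issuing the whole write() keeps the chunk atomic as long as
     * len <= PIPE_BUF.
     */
    static ssize_t
    write_chunk_retry(int fd, const void *buf, size_t len)
    {
        ssize_t     rc;

        do
            rc = write(fd, buf, len);
        while (rc < 0 && (errno == EAGAIN || errno == EINTR));

        return rc;
    }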

            regards, tom lane



Re: bug on log generation?

From
Marcos Pegoraro
Date:
> What platform is this on? Is it possible that on some platform the chunk
> size we're using is not doing an atomic write?
 
Until last year we were on Ubuntu 16.04 and Postgres 11 with the latest minor update.
This January we changed to Ubuntu 20.04 and Postgres 14, now updated to 14.4.

But the problem occurred on both the old and the new OS and Postgres versions.
Right now I opened the current log file and there are 20 or 30 of these statements, and all of them are fine; maybe the problem comes back tomorrow, maybe this afternoon.

thanks
Marcos

Re: bug on log generation?

From
Andrew Dunstan
Date:
On 2022-08-08 Mo 11:07, Marcos Pegoraro wrote:
>
>     What platform is this on? Is it possible that on some platform the
>     chunk
>     size we're using is not doing an atomic write?
>
>  
> Until last year we were on Ubuntu 16.04 and Postgres 11 with the
> latest minor update.
> This January we changed to Ubuntu 20.04 and Postgres 14, now updated
> to 14.4.
>
> But the problem occurred on both the old and the new OS and Postgres
> versions.
> Right now I opened the current log file and there are 20 or 30 of
> these statements, and all of them are fine; maybe the problem comes
> back tomorrow, maybe this afternoon.
>
>

OK, we really need a repeatable test if possible. Perhaps a pgbench run
with lots of concurrent runs of some very long query would do the trick.
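
If pgbench is awkward for this, even a trivial libpq driver might do.
Here's a rough sketch (illustrative only, untested; it assumes
connection info in the environment and log_min_duration_statement = 0
so every statement gets logged):


    /* build: cc repro.c -lpq -I$(pg_config --includedir) -L$(pg_config --libdir) */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>
    #include <libpq-fe.h>

    #define NCHILDREN   20          /* concurrent sessions */
    #define NITERS      1000        /* statements per session */
    #define QUERY_LEN   20000       /* ~20kB, like the statements above */

    static void
    run_child(void)
    {
        PGconn     *conn = PQconnectdb("");     /* environment defaults */
        char       *sql = malloc(QUERY_LEN + 64);
        int         n,
                    i;

        if (PQstatus(conn) != CONNECTION_OK)
        {
            fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
            exit(1);
        }

        /* pad a trivial statement with a long comment to reach ~20kB */
        n = snprintf(sql, QUERY_LEN + 64, "SELECT 1 /* ");
        memset(sql + n, 'x', QUERY_LEN);
        strcpy(sql + n + QUERY_LEN, " */");

        for (i = 0; i < NITERS; i++)
            PQclear(PQexec(conn, sql));

        PQfinish(conn);
        free(sql);
        exit(0);
    }

    int
    main(void)
    {
        int         i;

        for (i = 0; i < NCHILDREN; i++)
            if (fork() == 0)
                run_child();
        for (i = 0; i < NCHILDREN; i++)
            wait(NULL);
        return 0;
    }


If the chunking protocol is at fault, you'd hope to eventually catch two
of those long statements interleaved in the csv log, the way Marcos'
examples show.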


cheers


andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Re: bug on log generation?

From
Andres Freund
Date:
Hi,

On 2022-08-08 10:32:22 -0400, Tom Lane wrote:
> Andrew Dunstan <andrew@dunslane.net> writes:
> > What platform is this on? Is it possible that on some platform the chunk
> > size we're using is not doing an atomic write?
> 
> Another idea is that some of the write() calls are failing --- elog.c
> doesn't check for that.

I was suspicious of those as well. It might be a good idea to at least write
such failures to stderr, otherwise it's just about impossible to debug. Not
that stderr will always point anywhere useful...

I can imagine that a system under heavy memory pressure might fail writing, if
there are a lot of writes in a row or some such.

Greetings,

Andres Freund



Re: bug on log generation?

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2022-08-08 10:32:22 -0400, Tom Lane wrote:
>> Another idea is that some of the write() calls are failing --- elog.c
>> doesn't check for that.

> I was suspicious of those as well. It might be a good idea to at least write
> such failures to stderr, otherwise it's just about impossible to debug. Not
> that stderr will always point anywhere useful...

Uh ... what we are talking about is a failure to write to stderr.
It's not likely that adding more output will help.

Having said that, looping on EAGAIN seems like a reasonably harmless
change.  Whether it will help here is really hard to say, though.

            regards, tom lane



Re: bug on log generation?

From
Andres Freund
Date:
Hi,

On 2022-08-08 12:19:22 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2022-08-08 10:32:22 -0400, Tom Lane wrote:
> >> Another idea is that some of the write() calls are failing --- elog.c
> >> doesn't check for that.
> 
> > I was suspicious of those as well. It might be a good idea to at least write
> > such failures to stderr, otherwise it's just about impossible to debug. Not
> > that stderr will always point anywhere useful...
> 
> Uh ... what we are talking about is a failure to write to stderr.
> It's not likely that adding more output will help.

I forgot that we don't preserve the original stderr in some other fd, likely
because the logger itself still has it open and can use write_stderr().

Greetings,

Andres Freund



Re: bug on log generation?

From
Marcos Pegoraro
Date:
> OK, we really need a repeatable test if possible. Perhaps a pgbench run
> with lots of concurrent runs of some very long query would do the trick.

OK, I can do that, but... strangely, the error usually occurs at random times, sometimes at 08:00, sometimes at 19:00, and the server is busier between 10:00 and 16:00. Is it enough if I cron some of those queries to run every second? What exactly do you expect to see in the log files?