Thread: Intermittent missing chunk ERRORS

Intermittent missing chunk ERRORS

From
Sbob
Date:

All;


I am working with a client, they have an app that is running into error's like this:

SQL Error [XX001]: ERROR: missing chunk number 0 for toast value 44915477 in pg_toast_5321896


I thought it was a corruption / REINDEX issue, however I have run REINDEXon the index, REINDEX on the table and even a reindexdb on the database. Today they passed me another error (the one above) and once I was able to login I ran the same query the app team referenced and it ran without any errors. I asked them to re-run the query and they informed me that it now works for them as well.


I assume that if this were a corrupt index issue then the query would continue to fail every time.

Has anyone seen this type of intermittent index error before? and thoughts on where to look /how to debug this?


Thanks in advance


Re: Intermittent missing chunk ERRORS

From
Vijaykumar Jain
Date:
a lot in this gist is something you can debug to start with.


if not the index than the toast table might be corrupt.
the bigger problem might be to figure out how it got corrupt. a one off power incident, or bug or storage is having issues etc.
once that is clear,
a simple start would be to run a pg_dump of the database to /dev/null and check if the dump is clean without errors.

Thanks,
Vijay

Open to work
Resume - Vijaykumar Jain

On Sat, Apr 26, 2025, 10:39 PM Sbob <sbob@quadratum-braccas.com> wrote:

All;


I am working with a client, they have an app that is running into error's like this:

SQL Error [XX001]: ERROR: missing chunk number 0 for toast value 44915477 in pg_toast_5321896


I thought it was a corruption / REINDEX issue, however I have run REINDEXon the index, REINDEX on the table and even a reindexdb on the database. Today they passed me another error (the one above) and once I was able to login I ran the same query the app team referenced and it ran without any errors. I asked them to re-run the query and they informed me that it now works for them as well.


I assume that if this were a corrupt index issue then the query would continue to fail every time.

Has anyone seen this type of intermittent index error before? and thoughts on where to look /how to debug this?


Thanks in advance


Re: Intermittent missing chunk ERRORS

From
Vijaykumar Jain
Date:
forgot about the core tool to check for the problem.


this can help find corruption issues much faster.


Thanks,
Vijay

Open to work
Resume - Vijaykumar Jain

On Sun, Apr 27, 2025, 1:34 AM Vijaykumar Jain <vijaykumarjain.github@gmail.com> wrote:
a lot in this gist is something you can debug to start with.


if not the index than the toast table might be corrupt.
the bigger problem might be to figure out how it got corrupt. a one off power incident, or bug or storage is having issues etc.
once that is clear,
a simple start would be to run a pg_dump of the database to /dev/null and check if the dump is clean without errors.

Thanks,
Vijay

Open to work
Resume - Vijaykumar Jain

On Sat, Apr 26, 2025, 10:39 PM Sbob <sbob@quadratum-braccas.com> wrote:

All;


I am working with a client, they have an app that is running into error's like this:

SQL Error [XX001]: ERROR: missing chunk number 0 for toast value 44915477 in pg_toast_5321896


I thought it was a corruption / REINDEX issue, however I have run REINDEXon the index, REINDEX on the table and even a reindexdb on the database. Today they passed me another error (the one above) and once I was able to login I ran the same query the app team referenced and it ran without any errors. I asked them to re-run the query and they informed me that it now works for them as well.


I assume that if this were a corrupt index issue then the query would continue to fail every time.

Has anyone seen this type of intermittent index error before? and thoughts on where to look /how to debug this?


Thanks in advance


Re: Intermittent missing chunk ERRORS

From
Laurenz Albe
Date:
On Sat, 2025-04-26 at 11:09 -0600, Sbob wrote:
> I am working with a client, they have an app that is running into error's like this:
>
> SQL Error [XX001]: ERROR: missing chunk number 0 for toast value 44915477 in pg_toast_5321896
>
> I thought it was a corruption / REINDEX issue, however I have run REINDEXon the index,
> REINDEX on the table and even a reindexdb on the database. Today they passed me another
> error (the one above) and once I was able to login I ran the same query the app team
> referenced and it ran without any errors. I asked them to re-run the query and they
> informed me that it now works for them as well.
>
> I assume that if this were a corrupt index issue then the query would continue to fail
> every time.
>
> Has anyone seen this type of intermittent index error before? and thoughts on where to
> look /how to debug this?

The error is data corruption.

If it is transient, that could mean a couple of things:

- a TOAST index was corrupted, and REINDEX fixed it

- faulty RAM corrupted data, but as soon as the corrupted buffer was evicted,
  the problem was gone

- somebody updated or deleted the row, and the corruption is hidden

Yours,
Laurenz Albe



Re: Intermittent missing chunk ERRORS

From
Tom Lane
Date:
Laurenz Albe <laurenz.albe@cybertec.at> writes:
> On Sat, 2025-04-26 at 11:09 -0600, Sbob wrote:
>> SQL Error [XX001]: ERROR: missing chunk number 0 for toast value 44915477 in pg_toast_5321896
>> Has anyone seen this type of intermittent index error before? and thoughts on where to
>> look /how to debug this?

> The error is data corruption.

We've seen transient errors of this sort arise from what are basically
timing problems, and fixed at least a few cases.  I wonder exactly
which PG version the problematic installation is running.

            regards, tom lane



Re: Intermittent missing chunk ERRORS

From
Sbob
Date:
On 4/26/25 2:28 PM, Tom Lane wrote:
> Laurenz Albe <laurenz.albe@cybertec.at> writes:
>> On Sat, 2025-04-26 at 11:09 -0600, Sbob wrote:
>>> SQL Error [XX001]: ERROR: missing chunk number 0 for toast value 44915477 in pg_toast_5321896
>>> Has anyone seen this type of intermittent index error before? and thoughts on where to
>>> look /how to debug this?
>> The error is data corruption.
> We've seen transient errors of this sort arise from what are basically
> timing problems, and fixed at least a few cases.  I wonder exactly
> which PG version the problematic installation is running.
>
>             regards, tom lane


We are on V14





Re: Intermittent missing chunk ERRORS

From
Sbob
Date:
On 4/26/25 2:24 PM, Laurenz Albe wrote:
> On Sat, 2025-04-26 at 11:09 -0600, Sbob wrote:
>> I am working with a client, they have an app that is running into error's like this:
>>
>> SQL Error [XX001]: ERROR: missing chunk number 0 for toast value 44915477 in pg_toast_5321896
>>
>> I thought it was a corruption / REINDEX issue, however I have run REINDEXon the index,
>> REINDEX on the table and even a reindexdb on the database. Today they passed me another
>> error (the one above) and once I was able to login I ran the same query the app team
>> referenced and it ran without any errors. I asked them to re-run the query and they
>> informed me that it now works for them as well.
>>
>> I assume that if this were a corrupt index issue then the query would continue to fail
>> every time.
>>
>> Has anyone seen this type of intermittent index error before? and thoughts on where to
>> look /how to debug this?
> The error is data corruption.
>
> If it is transient, that could mean a couple of things:
>
> - a TOAST index was corrupted, and REINDEX fixed it
>
> - faulty RAM corrupted data, but as soon as the corrupted buffer was evicted,
>    the problem was gone
>
> - somebody updated or deleted the row, and the corruption is hidden
>
> Yours,
> Laurenz Albe


I did REINDEX and the error came back, however this last time when they 
informed me about the error they sent the select statement throwing the 
error and once I logged in I ran their select statement and it worked, 
no REINDEX had been run, and then I asked them to run it again and it 
worked for them as well

So, the first option above is not the answer and I dont think anyone 
updated the row, their process is mostly reads but I will verify, so 
maybe the RAM issue? the servers are cloud VM's in a private cloud






Re: Intermittent missing chunk ERRORS

From
Tom Lane
Date:
Sbob <sbob@quadratum-braccas.com> writes:
> On 4/26/25 2:28 PM, Tom Lane wrote:
>> We've seen transient errors of this sort arise from what are basically
>> timing problems, and fixed at least a few cases.  I wonder exactly
>> which PG version the problematic installation is running.

> We are on V14

14-what?  We are talking about bug fixes, so it matters.

            regards, tom lane



Re: Intermittent missing chunk ERRORS

From
Sbob
Date:
On 4/26/25 6:03 PM, Tom Lane wrote:
> Sbob <sbob@quadratum-braccas.com> writes:
>> On 4/26/25 2:28 PM, Tom Lane wrote:
>>> We've seen transient errors of this sort arise from what are basically
>>> timing problems, and fixed at least a few cases.  I wonder exactly
>>> which PG version the problematic installation is running.
>> We are on V14
> 14-what?  We are talking about bug fixes, so it matters.
>
>             regards, tom lane


postgres=# select version();
version
----------------------------------------------------------------------------------------------------------
  PostgreSQL 14.17 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 
20210514 (Red Hat 8.5.0-23), 64-bit
(1 row)





Re: Intermittent missing chunk ERRORS

From
Tom Lane
Date:
Sbob <sbob@quadratum-braccas.com> writes:
> On 4/26/25 6:03 PM, Tom Lane wrote:
>> 14-what?  We are talking about bug fixes, so it matters.

>   PostgreSQL 14.17 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 
> 20210514 (Red Hat 8.5.0-23), 64-bit

OK, well that's current at least...

I don't suppose you can build a test case that reproduces this
failure?  If it's timing-dependent as I suspect, it might only fail
once in awhile, but that would be good enough for investigation.

            regards, tom lane



Re: Intermittent missing chunk ERRORS

From
Laurenz Albe
Date:
On Sat, 2025-04-26 at 16:54 -0600, Sbob wrote:
> > If it is transient, that could mean a couple of things:
> >
> > - a TOAST index was corrupted, and REINDEX fixed it
> >
> > - faulty RAM corrupted data, but as soon as the corrupted buffer was evicted,
> >     the problem was gone
> >
> > - somebody updated or deleted the row, and the corruption is hidden
>
> I did REINDEX and the error came back, however this last time when they
> informed me about the error they sent the select statement throwing the
> error and once I logged in I ran their select statement and it worked,
> no REINDEX had been run, and then I asked them to run it again and it
> worked for them as well
>
> So, the first option above is not the answer and I dont think anyone
> updated the row, their process is mostly reads but I will verify, so
> maybe the RAM issue? the servers are cloud VM's in a private cloud

Maybe.  Check the hardware.
Or maybe it is as Tom says.

Yours,
Laurenz Albe