Thread: running ANALYZE results in => duplicate key value violates unique constraint "pg_statistic_relid_att_inh_index"

Hi,

I am running that one (official docker image)

PostgreSQL 13.11 (Debian 13.11-1.pgdg110+1) on x86_64-pc-linux-gnu,
compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit

and one of my nightly jobs reported that error yesterday when running
an "ANALYZE":

FEHLER: doppelter Schlüsselwert verletzt Unique-Constraint »pg_statistic_relid_att_inh_index«
  Detail: Schlüssel »(starelid, staattnum, stainherit)=(2609, 4, f)« existiert bereits.

which should translate to something like:

ERROR:  duplicate key value violates unique constraint "pg_statistic_relid_att_inh_index"
> DETAIL:  Key (starelid, staattnum, stainherit)=(2609, 4, f) already exists.


Anyone an idea what's wrong?

Maybe (not?) related but sometimes the analyze does fail with:

ERROR: attempted to delete invisible tuple


Both errors are only happening here and there - so I don't have a
reproducer, but still I am curious what is wrong here with me running
an "ANALYZE" after my data import.

thanks for insights :)

kind regards

Torsten




On Wed, 2023-09-06 at 09:46 +0200, Torsten Krah wrote:
> I am running that one (official docker image)
>
> PostgreSQL 13.11 (Debian 13.11-1.pgdg110+1) on x86_64-pc-linux-gnu,
> compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
>
> and one of my nightly jobs reported that error yesterday when running
> an "ANALYZE":
>
> FEHLER: doppelter Schlüsselwert verletzt Unique-Constraint »pg_statistic_relid_att_inh_index«
>   Detail: Schlüssel »(starelid, staattnum, stainherit)=(2609, 4, f)« existiert bereits.
>
> which should translate to something like:
>
> ERROR:  duplicate key value violates unique constraint "pg_statistic_relid_att_inh_index"
> > DETAIL:  Key (starelid, staattnum, stainherit)=(2609, 4, f) already exists.
>
>
> Anyone an idea what's wrong?

Yes: the metadata table pg_statistic has data corruption.

> Maybe (not?) related but sometimes the analyze does fail with:
>
> ERROR: attempted to delete invisible tuple

That also looks like data corrupton, albeit different one.

> Both errors are only happening here and there - so I don't have a
> reproducer, but still I am curious what is wrong here with me running
> an "ANALYZE" after my data import.

To fix the "pg_statistic" error:

- take down time

- set "allow_system_mods = on"

- TRUNCATE pg_statistic;

- ANALYZE;


You are lucky that the corrupted table is one that holds data that can be rebuilt.

Yours,
Laurenz Albe



Am Mittwoch, dem 06.09.2023 um 10:21 +0200 schrieb Laurenz Albe:
> You are lucky that the corrupted table is one that holds data that
> can be rebuilt.

It is a test instance / container anyway which is deleted afterwards
and can be setup again as often as I want.

But how is that corruption happening - I mean it is a docker image,
freshly fetched from the registry.

After that I am starting a container from that image, (re)importing
data (different tests => different data so the cycle of delete data /
import data / analyze the data happens quite often) and running my
tests.
The OS does not report anything which would relate nor does any other
tool / system fail nor does postgresl itself fail on any other table
here - it always fails only on that analyze part.

That happens all in about 8-10 minutes for the whole process - what is
causing that corruption in that short timeframe here?

regards

Torsten




On Wed, 2023-09-06 at 10:33 +0200, Torsten Krah wrote:
> Am Mittwoch, dem 06.09.2023 um 10:21 +0200 schrieb Laurenz Albe:
> > You are lucky that the corrupted table is one that holds data that
> > can be rebuilt.
>
> It is a test instance / container anyway which is deleted afterwards
> and can be setup again as often as I want.
>
> But how is that corruption happening - I mean it is a docker image,
> freshly fetched from the registry.
>
> After that I am starting a container from that image, (re)importing
> data (different tests => different data so the cycle of delete data /
> import data / analyze the data happens quite often) and running my
> tests.
> The OS does not report anything which would relate nor does any other
> tool / system fail nor does postgresl itself fail on any other table
> here - it always fails only on that analyze part.
>
> That happens all in about 8-10 minutes for the whole process - what is
> causing that corruption in that short timeframe here?

If you have a reproducible way to create the data corruption, that would
be very interesting.  It micht be a software bug.

Yours,
Laurenz Albe



On 06/09/2023 09:46 CEST Torsten Krah <krah.tm@gmail.com> wrote:

> I am running that one (official docker image)
>
> PostgreSQL 13.11 (Debian 13.11-1.pgdg110+1) on x86_64-pc-linux-gnu,
> compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit

Have you also tried with 13.12?

> and one of my nightly jobs reported that error yesterday when running
> an "ANALYZE":
>
> FEHLER: doppelter Schlüsselwert verletzt Unique-Constraint »pg_statistic_relid_att_inh_index«
>   Detail: Schlüssel »(starelid, staattnum, stainherit)=(2609, 4, f)« existiert bereits.
>
> which should translate to something like:
>
> ERROR:  duplicate key value violates unique constraint "pg_statistic_relid_att_inh_index"
> > DETAIL:  Key (starelid, staattnum, stainherit)=(2609, 4, f) already exists.
>
> Anyone an idea what's wrong?
>
> Maybe (not?) related but sometimes the analyze does fail with:
>
> ERROR: attempted to delete invisible tuple
>
>
> Both errors are only happening here and there - so I don't have a
> reproducer, but still I am curious what is wrong here with me running
> an "ANALYZE" after my data import.

Does the unique constraint violation always occur for the same row?  OID 2609
is pg_description.

--
Erik



Am Mittwoch, dem 06.09.2023 um 12:04 +0200 schrieb Erik Wienhold:
> > I am running that one (official docker image)
> >
> > PostgreSQL 13.11 (Debian 13.11-1.pgdg110+1) on x86_64-pc-linux-gnu,
> > compiled by gcc (Debian 10.2.1-6) 10.2.1 20210110, 64-bit
>
> Have you also tried with 13.12?

Yes, but it did also happen on previous versions before 13.11 / 13.12
sometimes (I just ignored it until now because it happens so rarely).

>
>
> Does the unique constraint violation always occur for the same row? 
> OID 2609
> is pg_description.

As I don't have a reproducer yet (I did not track stats but lets say it
runs fine for 100 / 200 times and 1 or 2 of those are failing with
those mentioned analyze errors - it may even be less than that) I can't
tell you if it fails always over that OID - I need to wait for it to
happen again, I will report here if it is the same - may take some time
;).


Torsten




> But how is that corruption happening - I mean it is a docker image,
> freshly fetched from the registry.

Hi Torsten,

Maybe you have to increase the "--stop-timeout" value ;   ( or "stop_grace_period" in docker-compose )
https://github.com/docker-library/postgres/issues/544#issuecomment-455738848

docker run: " --stop-timeout  Timeout (in seconds) to stop a container "
or

And recommended in the Dockerfile:
STOPSIGNAL SIGINT
#
# An additional setting that is recommended for all users regardless of this
# value is the runtime "--stop-timeout" (or your orchestrator/runtime's
# equivalent) for controlling how long to wait between sending the defined
# STOPSIGNAL and sending SIGKILL (which is likely to cause data corruption).
#
# The default in most runtimes (such as Docker) is 10 seconds, and the
# documentation at https://www.postgresql.org/docs/12/server-start.html notes
# that even 90 seconds may not be long enough in many instances.


regards,
 Imre


Torsten Krah <krah.tm@gmail.com> ezt írta (időpont: 2023. szept. 6., Sze, 14:45):
Am Mittwoch, dem 06.09.2023 um 10:21 +0200 schrieb Laurenz Albe:
> You are lucky that the corrupted table is one that holds data that
> can be rebuilt.

It is a test instance / container anyway which is deleted afterwards
and can be setup again as often as I want.

But how is that corruption happening - I mean it is a docker image,
freshly fetched from the registry.

After that I am starting a container from that image, (re)importing
data (different tests => different data so the cycle of delete data /
import data / analyze the data happens quite often) and running my
tests.
The OS does not report anything which would relate nor does any other
tool / system fail nor does postgresl itself fail on any other table
here - it always fails only on that analyze part.

That happens all in about 8-10 minutes for the whole process - what is
causing that corruption in that short timeframe here?

regards

Torsten



Am Mittwoch, dem 06.09.2023 um 20:42 +0200 schrieb Imre Samu:
> Maybe you have to increase the "*--stop-timeout" value

That is totally unrelated in my case, it is an anonymous volume anyway
which gets created on start and deleted afterwards.

Torsten