Thread: ERROR: invalid memory alloc request size 1073741824
Hello,

We've tripped over an error when doing a "COPY.. TO STDOUT WITH BINARY" query:

"ERROR: invalid memory alloc request size 1073741824"
(exactly 1GB)

So a palloc() call is failing on the AllocSizeIsValid() check.

Does anyone know if this is a safety catch we are tripping with a bad
query - or whether this check is something that should never be hit
(i.e. a bug)?

There has been some discussion before about making a change before the
9.5beta1:

https://www.postgresql.org/message-id/flat/9A28C8860F777E439AA12E8AEA7694F8010F6F3F%40BPXM15GP.gisp.nec.co.jp
https://www.postgresql.org/message-id/flat/CAKJS1f_sxPyBHF6%3DnEYwPJdKKDNjVDP91b_EJApfyGTei%3DViDw%40mail.gmail.com

postgresql 9.5.4 (2PGDG.rhel6)
work_mem = 256MB

Thanks,
Stefan
On 01/30/2018 06:35 PM, Stefan Blanke wrote:
> Hello,
>
> We've tripped over an error when doing a "COPY.. TO STDOUT WITH BINARY"
> query.
>
> "ERROR: invalid memory alloc request size 1073741824"
> (exactly 1GB)
>
> So a palloc() call is failing on the AllocSizeIsValid() check.
>
> Does anyone know if this a safety catch we are tripping with a bad
> query - or whether this check is something that should never be hit
> (i.e. a bug).

That very much depends on where exactly the failure happens - it might
be either (or code relying on an assumption that changed over time).
It's not clear if it fails during query execution or while formatting
the output (I seem to remember there were/are issues with rows
containing multiple large bytea values, for example).

Can you attach gdb to the backend, set a breakpoint on that elog and
share the backtrace? Then we can investigate further. In case you don't
have experience with doing that, here's a howto on the community wiki:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

> This has been some discussion before about making a change before the
> 9.5beta1:
>
> https://www.postgresql.org/message-id/flat/9A28C8860F777E439AA12E8AEA7694F8010F6F3F%40BPXM15GP.gisp.nec.co.jp
> https://www.postgresql.org/message-id/flat/CAKJS1f_sxPyBHF6%3DnEYwPJdKKDNjVDP91b_EJApfyGTei%3DViDw%40mail.gmail.com

Both of those threads are about a hashjoin issue, which should have been
fixed in 9.5. Moreover, you haven't shared the query and its query plan,
so it's unclear whether it's doing hash joins at all. That would be
useful for investigating this issue, BTW.

> postgresql 9.5.4 (2PGDG.rhel6)
> work_mem = 256MB

FWIW you're missing more than a year of bugfixes.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
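The wiki howto boils down to a short interactive session. Sketched below as a command fragment (the PID is a placeholder; `errfinish` is the function every elog/ereport call ultimately goes through, so a breakpoint there catches the error as it is raised):

```
-- 1. In the failing session, find the backend's PID:
--      SELECT pg_backend_pid();        -- say it returns 12345
-- 2. As a user allowed to ptrace that process (e.g. postgres or root):
--      gdb -p 12345
-- 3. At the (gdb) prompt:
--      break errfinish
--      continue
-- 4. Re-run the failing COPY in that session; when the error fires,
--    gdb stops at errfinish:
--      bt                              -- capture and share this backtrace
```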
On Tue, Jan 30, 2018 at 12:35 PM, Stefan Blanke <stefan.blanke@framestore.com> wrote:
> Hello,
>
> We've tripped over an error when doing a "COPY.. TO STDOUT WITH BINARY" query.
>
> "ERROR: invalid memory alloc request size 1073741824"
> (exactly 1GB)
I have my money on a corrupted TOAST entry. Is this happening on trustworthy hardware or a beige box with no ECC or RAID?
Regards, Jan
> So a palloc() call is failing on the AllocSizeIsValid() check.
>
> Does anyone know if this a safety catch we are tripping with a bad query - or whether this check is something that should never be hit (i.e. a bug).
>
> This has been some discussion before about making a change before the 9.5beta1:
>
> https://www.postgresql.org/message-id/flat/9A28C8860F777E439AA12E8AEA7694F8010F6F3F%40BPXM15GP.gisp.nec.co.jp
> https://www.postgresql.org/message-id/flat/CAKJS1f_sxPyBHF6%3DnEYwPJdKKDNjVDP91b_EJApfyGTei%3DViDw%40mail.gmail.com
>
> postgresql 9.5.4 (2PGDG.rhel6)
> work_mem = 256MB
>
> Thanks,
> Stefan
Jan Wieck
Senior Postgres Architect
On 01/30/2018 10:43 PM, Jan Wieck wrote:
> On Tue, Jan 30, 2018 at 12:35 PM, Stefan Blanke
> <stefan.blanke@framestore.com> wrote:
>
>> Hello,
>>
>> We've tripped over an error when doing a "COPY.. TO STDOUT WITH
>> BINARY" query.
>>
>> "ERROR: invalid memory alloc request size 1073741824"
>> (exactly 1GB)
>
> I have my money on a corrupted TOAST entry. Is this happening on
> trustworthy hardware or beige box with no ECC or RAID?

I'll bet you it's not that. It's quite unlikely that would fail with
exactly a 1GB request size. It seems much more like a buffer that we
keep to be a power of 2. The question is which one.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
> That very much depends on where exactly the failure happens - it might
> be both (or relying on an assumption that changed over). It's not clear
> if it fails during query execution or formatting the output (I seem to
> remember there were/are issues with rows containing multiple large bytea
> values, for example).
>
> Can you attach gdb to the backend, set breakpoint on that elog and share
> the backtrace? Then we can investigate further.

I can't easily get a meaningful query plan (i.e. from the live system)
as the query that failed is based on a temporary table built by some
pl/pgsql (the database is more of a conveyor belt than a long term
store).

Unfortunately it's difficult to reproduce the ERROR. This has only
occurred a couple of times on a live system in the last year, and these
queries run frequently. I think we will move to 9.5.10 and see if it
ever happens again; then, if it does, get a backtrace on that newer
version. Incidentally, we never saw this error on 8.4, where this code
was running from 2009 until 2017.

> I'll bet you it's not that. It's quite unlikely that would fail with
> exactly 1GB request size. It seems much more like a buffer that we keep
> to be power of 2. The question is which one.

I had dismissed corruption before writing in. It's exactly 1GB every
time this has happened - and we can dump the full dataset periodically
without issue.

>> I have my money on a corrupted TOAST entry. Is this happening on
>> trustworthy hardware or beige box with no ECC or RAID?

It's good quality commercial hardware in our colo - not exactly sure
what.

Thanks for taking the time to look at this!

Stefan
On Wed, Jan 31, 2018 at 12:32 PM, Stefan Blanke <stefan.blanke@framestore.com> wrote:
>> I'll bet you it's not that. It's quite unlikely that would fail with
>> exactly 1GB request size. It seems much more like a buffer that we keep
>> to be power of 2. The question is which one.
>
> I had dismissed corruption before writing in. It's exactly 1GB every
> time this has happened - and we can dump the full dataset periodically
> without issue.
>
>>> I have my money on a corrupted TOAST entry. Is this happening on
>>> trustworthy hardware or beige box with no ECC or RAID?
>
> It's good quality commercial hardware in our colo - not exactly sure
> what.
>
> Thanks for taking the time to look at this!
> Stefan

If it is a sporadic issue and you can dump the full dataset, then I just
lost my money (Tomas, you coming to PGConf in Jersey City?).

But then, if this is a plain COPY to stdout ... I am wondering what
could possibly be in that path that wants to allocate 1GB. Or is this
not so plain but rather a COPY ... SELECT ...?

Regards, Jan
Jan Wieck
Senior Postgres Architect
On 01/31/2018 09:51 PM, Jan Wieck wrote:
> On Wed, Jan 31, 2018 at 12:32 PM, Stefan Blanke
> <stefan.blanke@framestore.com> wrote:
>
>>> I'll bet you it's not that. It's quite unlikely that would fail with
>>> exactly 1GB request size. It seems much more like a buffer that we
>>> keep to be power of 2. The question is which one.
>>
>> I had dismissed corruption before writing in. It's exactly 1GB every
>> time this has happened - and we can dump the full dataset periodically
>> without issue.
>>
>>>> I have my money on a corrupted TOAST entry. Is this happening on
>>>> trustworthy hardware or beige box with no ECC or RAID?
>>
>> It's good quality commercial hardware in our colo - not exactly sure
>> what.
>
> If it is a sporadic issue and you can dump the full dataset, then I just
> lost my money (Tomas, you coming to PGConf in Jersey City?).

Unfortunately no, but I'm sure there will be other opportunities to buy
me a beer ;-) Like pgcon, for example.

> But then, if this is a plain COPY to stdout ... I am wondering what
> could possibly be in that path that wants to allocate 1GB. Or is this
> not so plain but rather a COPY ... SELECT ...?

That's what I've been guessing, and why I was asking for a query plan.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Hi,

We've upgraded to PostgreSQL 11.5 (postgresql.org rhel 6 rpm) and I have
had another occurrence of this invalid alloc of 1GB. Apologies for never
providing a query plan when discussing this two years ago; we decided to
move to a newer PostgreSQL to see if the issue went away, but it took a
while to complete the move.

The invalid alloc still only occurs occasionally, every few months, on a
query that we run every minute; so unfortunately we still don't have a
contained reproducible test case.

This is the SQL we are running with a query plan - the query plan is
from a new empty database, so the planner has no stats.

CREATE OR REPLACE FUNCTION create_table()
RETURNS TABLE (a INTEGER, b INTEGER) AS $$
    UPDATE y
       SET c = true, d = false
     WHERE e IS NOT true
       AND f IS NOT true
       AND g = 1
       AND h = 0
       AND i = 0
       AND (j IS NULL OR j > 0)
    RETURNING y.a, y.b;
$$ LANGUAGE SQL;

-- Prepared statement (PQprepare)
CREATE TEMPORARY TABLE temp_table AS SELECT * FROM create_table()

-- Prepared statement (PQprepare)
SELECT y.a, y.b, x.k, x.l, y.m, y.n, y.o
  FROM temp_table
 INNER JOIN y ON temp_table.b = y.b AND temp_table.a = y.a
 INNER JOIN x ON x.a = y.a

-- The two prepared statements are executed one after another
-- in the order shown many times an hour.

The query plan for the second prepared statement is:

Nested Loop  (cost=17.14..64.38 rows=16 width=112)
  ->  Hash Join  (cost=17.00..61.47 rows=16 width=80)
        Hash Cond: ((temp_table.b = y.b) AND (temp_table.a = y.a))
        ->  Seq Scan on temp_table  (cost=0.00..32.60 rows=2260 width=8)
        ->  Hash  (cost=12.80..12.80 rows=280 width=76)
              ->  Seq Scan on y  (cost=0.00..12.80 rows=280 width=76)
  ->  Index Scan using x_pkey on x  (cost=0.14..0.18 rows=1 width=40)
        Index Cond: (a = temp_table.a)

Thanks,
Stefan
Stefan Blanke <stefan.blanke@framestore.com> writes:
> We've upgraded to PostgreSQL 11.5 (postgresql.org rhel 6 rpm) and I have
> had another occurrence of this invalid alloc of 1GB. Apologies for never
> providing a query plan when discussing this two years ago; we decided to
> move to a newer PostgreSQL to see if the issue went away but took a
> while to complete the move.
>
> The invalid alloc still only occurs occasionally every few months on a
> query that we run every minute; so unfortunately we still don't have a
> contained reproducible test case.

Given the lack of stats, I wonder whether the issue could be related
to the plan sometimes being horribly bad, eg due to the temp table
being much larger than expected. (A possible mechanism would be
hash table bloat, perhaps, but that's getting way ahead of the
evidence.)

Could you adjust your process to log the actual temp table size
each time, ie "select count(*) from temp_table" in between the
two steps, and then note whether the failures are correlated
with unusual temp table sizes?

            regards, tom lane
On 11/03/2020 15:37, Tom Lane wrote:
> Could you adjust your process to log the actual temp table size
> each time, ie "select count(*) from temp_table" in between the
> two steps, and then note whether the failures are correlated
> with unusual temp table sizes?

Thanks, I'll add a count there and come back with a number when we next
hit the "invalid memory alloc".

Stefan
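Tom's suggestion amounts to instrumentation along these lines, run between the two prepared statements. The log table and its name are hypothetical; the table and column names otherwise follow the placeholders used earlier in the thread.

```sql
-- Hypothetical log table for correlating failures with temp table size
CREATE TABLE IF NOT EXISTS temp_table_sizes (
    logged_at  timestamptz NOT NULL DEFAULT now(),
    row_count  bigint NOT NULL
);

-- Run after CREATE TEMPORARY TABLE temp_table ... and before the
-- joining SELECT, so each failure can be matched to a recorded size:
INSERT INTO temp_table_sizes (row_count)
SELECT count(*) FROM temp_table;
```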