Re: remaining sql/json patches - Mailing list pgsql-hackers

From jian he
Subject Re: remaining sql/json patches
Date
Msg-id CACJufxH7Dc6Uo73_Ne+=+C_9a8-DZiExRo72gGe7u2CQV6BnFQ@mail.gmail.com
Whole thread Raw
In response to Re: remaining sql/json patches  (Amit Langote <amitlangote09@gmail.com>)
Responses Re: remaining sql/json patches
List pgsql-hackers
On Wed, Mar 6, 2024 at 12:07 PM Amit Langote <amitlangote09@gmail.com> wrote:
>
> Hi Tomas,
>
> On Wed, Mar 6, 2024 at 6:30 AM Tomas Vondra
> <tomas.vondra@enterprisedb.com> wrote:
> >
> > Hi,
> >
> > I know very little about sql/json and all the json internals, but I
> > decided to do some black box testing. I built a large JSONB table
> > (single column, ~7GB of data after loading). And then I did a query
> > transforming the data into tabular form using JSON_TABLE.
> >
> > The JSON_TABLE query looks like this:
> >
> > SELECT jt.* FROM
> >   title_jsonb t,
> >   json_table(t.info, '$'
> >     COLUMNS (
> >       "id" text path '$."id"',
> >       "type" text path '$."type"',
> >       "title" text path '$."title"',
> >       "original_title" text path '$."original_title"',
> >       "is_adult" text path '$."is_adult"',
> >       "start_year" text path '$."start_year"',
> >       "end_year" text path '$."end_year"',
> >       "minutes" text path '$."minutes"',
> >       "genres" text path '$."genres"',
> >       "aliases" text path '$."aliases"',
> >       "directors" text path '$."directors"',
> >       "writers" text path '$."writers"',
> >       "ratings" text path '$."ratings"',
> >       NESTED PATH '$."aliases"[*]'
> >         COLUMNS (
> >           "alias_title" text path '$."title"',
> >           "alias_region" text path '$."region"'
> >         ),
> >       NESTED PATH '$."directors"[*]'
> >         COLUMNS (
> >           "director_name" text path '$."name"',
> >           "director_birth_year" text path '$."birth_year"',
> >           "director_death_year" text path '$."death_year"'
> >         ),
> >       NESTED PATH '$."writers"[*]'
> >         COLUMNS (
> >           "writer_name" text path '$."name"',
> >           "writer_birth_year" text path '$."birth_year"',
> >           "writer_death_year" text path '$."death_year"'
> >         ),
> >       NESTED PATH '$."ratings"[*]'
> >         COLUMNS (
> >           "rating_average" text path '$."average"',
> >           "rating_votes" text path '$."votes"'
> >         )
> >     )
> >   ) as jt;
> >
> > again, not particularly complex. But if I run this, it consumes multiple
> > gigabytes of memory, before it gets killed by OOM killer. This happens
> > even when ran using
> >
> >   COPY (...) TO '/dev/null'
> >
> > so there's nothing sent to the client. I did catch memory context info,
> > where it looks like this (complete stats attached):
> >
> > ------
> > TopMemoryContext: 97696 total in 5 blocks; 13056 free (11 chunks);
> >                   84640 used
> >   ...
> >   TopPortalContext: 8192 total in 1 blocks; 7680 free (0 chunks); ...
> >     PortalContext: 1024 total in 1 blocks; 560 free (0 chunks); ...
> >       ExecutorState: 2541764672 total in 314 blocks; 6528176 free
> >                      (1208 chunks); 2535236496 used
> >         printtup: 8192 total in 1 blocks; 7952 free (0 chunks); ...
> >         ...
> > ...
> > Grand total: 2544132336 bytes in 528 blocks; 7484504 free
> >              (1340 chunks); 2536647832 used
> > ------
> >
> > I'd say 2.5GB in ExecutorState seems a bit excessive ... Seems there's
> > some memory management issue? My guess is we're not releasing memory
> > allocated while parsing the JSON or building JSON output.
> >
> > I'm not attaching the data, but I can provide that if needed - it's
> > about 600MB compressed. The structure is not particularly complex, it's
> > movie info from [1] combined into a JSON document (one per movie).
>
> Thanks for the report.
>
> Yeah, I'd like to see the data to try to drill down into what's piling
> up in ExecutorState.  I want to be sure of if the 1st, query functions
> patch, is not implicated in this, because I'd like to get that one out
> of the way sooner than later.
>

I did some tests. it generally looks like:

create or replace function random_text() returns text
as $$select string_agg(md5(random()::text),'') from
generate_Series(1,8) s $$ LANGUAGE SQL;
DROP TABLE if exists s;
create table s(a jsonb);
INSERT INTO s SELECT (
'{"id": "' || random_text() || '",'
'"type": "' || random_text() || '",'
'"title": "' || random_text() || '",'
'"original_title": "' || random_text() || '",'
'"is_adult": "' || random_text() || '",'
'"start_year": "' || random_text() || '",'
'"end_year": "' || random_text() || '",'
'"minutes": "' || random_text() || '",'
'"genres": "' || random_text() || '",'
'"aliases": "' || random_text() || '",'
'"genres": "' || random_text() || '",'
'"directors": "' || random_text() || '",'
'"writers": "' || random_text() || '",'
'"ratings": "' || random_text() || '",'
'"director_name": "' || random_text() || '",'
'"alias_title": "' || random_text() || '",'
'"alias_region": "' || random_text() || '",'
'"director_birth_year": "' || random_text() || '",'
'"director_death_year": "' || random_text() || '",'
'"rating_average": "' || random_text() || '",'
'"rating_votes": "' || random_text() || '"'
||'}' )::jsonb
FROM generate_series(1, 1e6);
SELECT pg_size_pretty(pg_table_size('s')); -- 5975 MB

It's less complex than Tomas's version.

attached, 3 test files:
1e5 rows, each key's value is small.  Total table size is 598 MB.
1e6 rows, each key's value is small. Total table size is 5975 MB.
27 rows, total table size is 5066 MB.
The test file's comment is the output I extracted using
pg_log_backend_memory_contexts,
mainly ExecutorState and surrounding big number memory context.

Conclusion, I come from the test:
if each json is big (5066 MB/27) , then it will take a lot of memory.
if each json is small(here is 256 byte), then it won't take a lot of
memory to process.

Another case, I did test yet: more keys in a single json, but the
value is small.

Attachment

pgsql-hackers by date:

Previous
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Synchronizing slots from primary to standby
Next
From: "Zhijie Hou (Fujitsu)"
Date:
Subject: RE: Synchronizing slots from primary to standby