Re: Memory Issue with array_agg? - Mailing list pgsql-general

From Robert Sosinski
Subject Re: Memory Issue with array_agg?
Date
Msg-id CAH1hji1ESS-B3iwXn2XQ5WpTtUHbYYXgV+bvMnsntzzqK9tcNg@mail.gmail.com
Whole thread Raw
In response to Re: Memory Issue with array_agg?  (Pavel Stehule <pavel.stehule@gmail.com>)
List pgsql-general
Hi Pavel,

Here are the explains you asked for:

explain analyze select string_agg(id::text,',') from things group by guid;
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=400357.78..433784.93 rows=1337086 width=37) (actual time=41434.485..53195.185 rows=2378626 loops=1)
   ->  Sort  (cost=400357.78..403700.49 rows=1337086 width=37) (actual time=41434.433..44992.736 rows=2378626 loops=1)
         Sort Key: guid
         Sort Method: quicksort  Memory: 284135kB
         ->  Seq Scan on things  (cost=0.00..264304.86 rows=1337086 width=37) (actual time=0.027..21429.179 rows=2378626 loops=1)
 Total runtime: 56295.362 ms
(6 rows)


explain analyze select array_agg(id::text) from things group by guid;
                                                           QUERY PLAN                                                            
---------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=400357.78..433784.93 rows=1337086 width=37) (actual time=23953.922..38157.059 rows=2378626 loops=1)
   ->  Sort  (cost=400357.78..403700.49 rows=1337086 width=37) (actual time=23953.847..27527.316 rows=2378626 loops=1)
         Sort Key: guid
         Sort Method: quicksort  Memory: 284135kB
         ->  Seq Scan on things  (cost=0.00..264304.86 rows=1337086 width=37) (actual time=0.007..4941.752 rows=2378626 loops=1)
 Total runtime: 41280.897 ms
(6 rows)



These seem to be running on the machine now, and the memory is not inflating, I just run this one, and it blew up.

explain with t as (select id, guid, md5(concat_ws(':', fields -> 'a', fields -> 'b', fields -> 'c', fields -> 'd', fields -> 'e', foo_id::text)) from things) select md5, count(id), array_agg(id) from t group by 1 having count(id) > 1;

-Robert


On Tue, Aug 20, 2013 at 1:53 PM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Can you send a EXPLAIN result in both use cases?

Pavel


2013/8/19 Robert Sosinski <rsosinski@ticketevolution.com>
At the moment, all guids are distinct, however before I zapped the duplicates, there were 280 duplicates.

Currently, there are over 2 million distinct guids.

-Robert


On Mon, Aug 19, 2013 at 11:12 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:



2013/8/19 Robert Sosinski <rsosinski@ticketevolution.com>
Hi Pavel,

What kind of example do you need?  I cant give you the actual data I have in the table, but I can give you an example query and the schema attached below.  From there, I would just put in 2 million rows worth 1.2 Gigs of data.  Average size of the the extended columns (using the pg_column_size function) in bytes are:

guid: 33
name: 2.41
currency: 4
fields: 120.32

example query:

-- find duplicate records using a guid
select guid, array_agg(id) from orders group by guid;

how much distinct guid is there, and how much duplicates

??

regards

Pavel

 

example schema:
                                     Table "public.things"                                                        
   Column   |            Type             |                      Modifiers                      | Storage  | Stats target | Description 
------------+-----------------------------+-----------------------------------------------------+----------+--------------+-------------
 id         | integer                     | not null default nextval('things_id_seq'::regclass) | plain    |              | 
 version    | integer                     | not null                                            | plain    |              | 
 created_at | timestamp without time zone | not null                                            | plain    |              | 
 updated_at | timestamp without time zone | not null                                            | plain    |              | 
 foo_id     | integer                     | not null                                            | plain    |              | 
 bar_id     | integer                     | not null                                            | plain    |              | 
 baz_id     | integer                     | not null                                            | plain    |              | 
 guid       | character varying           | not null                                            | extended |              | 
 name       | character varying           | not null                                            | extended |              | 
 price      | numeric(12,2)               | not null                                            | main     |              | 
 currency   | character varying           | not null                                            | extended |              | 
 amount     | integer                     | not null                                            | plain    |              | 
 the_date   | date                        | not null                                            | plain    |              | 
 fields     | hstore                      |                                                     | extended |              | 
Indexes:
    "things_pkey" PRIMARY KEY, btree (id)
    "things_foo_id_idx" btree (foo_id)
    "things_bar_id_idx" btree (bar_id)
    "things_baz_id_idx" btree (baz_id)
    "things_guid_uidx" UNIQUE, btree (guid)
    "things_lpad_lower_name_eidx" btree (lpad(lower(name::text), 10, '0'::text))
    "things_price_idx" btree (price)
    
Foreign-key constraints:
    "things_foo_id_fkey" FOREIGN KEY (foo_id) REFERENCES foos(id)
    "things_bar_id_fkey" FOREIGN KEY (bar_id) REFERENCES bars(id)
    "things_baz_id_fkey" FOREIGN KEY (baz_id) REFERENCES bazs(id)
Triggers:
    timestamps_trig BEFORE INSERT OR UPDATE ON things FOR EACH ROW EXECUTE PROCEDURE timestamps_tfun()

Let me know if you need anything else.

Thanks,


On Mon, Aug 19, 2013 at 3:29 AM, Pavel Stehule <pavel.stehule@gmail.com> wrote:
Hello

please, can you send some example or test?

Regards

Pavel Stehule


2013/8/19 Robert Sosinski <rsosinski@ticketevolution.com>
When using array_agg on a large table, memory usage seems to spike up until Postgres crashes with the following error:

2013-08-17 18:41:02 UTC [2716]: [2] WARNING: terminating connection because of crash of another server process
2013-08-17 18:41:02 UTC [2716]: [3] DETAIL: The postmaster has commanded this server process to roll back the current transaction and exit, because another server process exited abnormally and possibly corrupted shared memory.
2013-08-17 18:41:02 UTC [2716]: [4] HINT: In a moment you should be able to reconnect to the database and repeat your command.


I've definitely isolated it down to using array_agg, as when I changed the query to use string_agg, it worked fine.  I also tried using array_agg on a few different queries, all yielding the same issue.  Swapping in string_agg fixed the issue once more.

This particular table has over 2 million rows and is 1.2 Gigs, and when I ran the query while viewing htop, the virtual size of the Postgres process ballooned to 13.9G until crashing.

The version of Postgres I am using is: PostgreSQL 9.2.4 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.7.2-2ubuntu1) 4.7.2, 64-bit

Any help would be much appreciated, thanks!

-Robert






pgsql-general by date:

Previous
From: Don Parris
Date:
Subject: Re: Locale Issue
Next
From: Ivan Radovanovic
Date:
Subject: Unique constraint and unique index