Partitioning into thousands of tables? - Mailing list pgsql-general

From: Data Growth Pty Ltd
Subject: Partitioning into thousands of tables?
Msg-id: AANLkTimZaiO+7mTuR=sX0jCQwU7Uk+_XZnudL4qRMUsh@mail.gmail.com
List: pgsql-general
I have a table of around 200 million rows, occupying around 50G of disk.  It is slow to write, so I would like to partition it better.

The table is roughly:
  id: integer  # unique from sequence
  external_id : varchar(255) # unique, used to interface with external systems, not updated (only select or insert)
  sid : integer # secondary partial id, not unique, not updated (only select or insert)
  columns: many, including text, date etc, frequently updated

Currently I have the table partitioned by ranges of id, with the aim of keeping each partition table to around 1G (so needing around 50 such tables).  Almost every select query looks rows up by id, so access to an individual record is fine and scanning is avoided.  "external_id" has been moved to its own table, tableB (id, external_id), because it carries a unique index, which would otherwise mean probing all 50 sub-tables.
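
For reference, the current layout looks roughly like this (names, range boundaries and the "payload" column are just illustrative stand-ins):

CREATE TABLE master (
    id      integer NOT NULL,   -- from a sequence
    sid     integer NOT NULL,
    payload text                -- stands in for the many frequently updated columns
);

-- ~50 children, each covering a range of id (about 1G of data each)
CREATE TABLE master_01 ( CHECK (id >= 1       AND id <= 4000000) ) INHERITS (master);
CREATE TABLE master_02 ( CHECK (id >  4000000 AND id <= 8000000) ) INHERITS (master);
-- ... and so on up to master_50 ...

-- external_id kept in its own table so its unique index lives in one place
CREATE TABLE tableB (
    id          integer NOT NULL,
    external_id varchar(255) NOT NULL UNIQUE
);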

The problem is that when I need to update say 1 million records (which happens a lot), it is very slow because "id" is essentially uncorrelated with the updates I need to perform.  Thus all partitions are hammered.

It turns out that writes (inserts and updates) are highly localised by the column "sid" (all update transactions, and most batches of transactions, share a single value of "sid" or a pair of values).  So I would like to partition on "sid" instead.  But "sid" currently has around 2500 unique values, with some values being very common and some rare.  In hindsight I can measure the size of these sub-ranges, but I can't really predict in advance which will prove to be the heavily updated or numerically large ones.  New values are added almost daily (the total is increasing by around 200 per year).

So my questions:

Is there any significant performance problem associated with partitioning a table into 2500 sub-tables?  I realise a table scan would be horrendous, but what if every access specified the partitioning criterion "sid"?  Such a scheme would be the simplest to maintain (I think) and would give the best localisation of writes.
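
By "every access specified sid" I mean query shapes like these (with "master" and "some_column" as stand-in names), so the planner can exclude all but one child table:

SELECT * FROM master WHERE sid = 1234 AND id = 56789012;
UPDATE master SET some_column = 'x' WHERE sid = 1234 AND id = 56789012;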

Is there a particular size for a sub-table that I should aim for or avoid?

If, say, 50 tables is much better than 2500, is there a better way to perform the partitioning than writing a giant trigger function such as:

CREATE OR REPLACE FUNCTION my_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    IF ( NEW.sid IN (1, 7, 14) ) THEN
        INSERT INTO subtable_1 VALUES (NEW.*);
    ELSIF ( NEW.sid IN (2, 3, 31, 32, 1027, 1028, 1029, 1965) ) THEN
        INSERT INTO subtable_2 VALUES (NEW.*);
    ...
    ELSE
        RAISE EXCEPTION 'SID out of range. Fix the my_insert_trigger() function!';
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;
which would need to be updated fairly regularly as new values of "sid" are added, and the frequency of existing "sid" groupings changes.  Can partitioning triggers perform a table lookup, or is this a big performance no-no?
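
To make the table-lookup question concrete, what I have in mind is roughly this (untested sketch; the mapping table "sid_to_subtable" is hypothetical):

CREATE TABLE sid_to_subtable (
    sid      integer PRIMARY KEY,
    subtable text NOT NULL          -- e.g. 'subtable_1'
);

CREATE OR REPLACE FUNCTION my_insert_trigger()
RETURNS TRIGGER AS $$
DECLARE
    target text;
BEGIN
    SELECT subtable INTO target FROM sid_to_subtable WHERE sid = NEW.sid;
    IF target IS NULL THEN
        RAISE EXCEPTION 'SID % not mapped in sid_to_subtable', NEW.sid;
    END IF;
    -- dynamic SQL, so routing changes only need an update of the mapping table
    EXECUTE 'INSERT INTO ' || quote_ident(target) || ' SELECT ($1).*' USING NEW;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

The extra SELECT and the EXECUTE on every insert are what I'm worried about performance-wise.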

Or should I adopt a simpler, more stable "level 1" partitioning that would result in very uneven partition sizes (say a simple hash function on "sid"), and then further partition any of those sub-tables that grow too big (a partition within a partition)?
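
By "simple hash function" I mean something as crude as modulo, e.g. (untested sketch, 50 fixed buckets):

CREATE OR REPLACE FUNCTION my_insert_trigger()
RETURNS TRIGGER AS $$
BEGIN
    -- route purely on sid; bucket sizes will be uneven
    EXECUTE 'INSERT INTO subtable_' || (NEW.sid % 50) || ' SELECT ($1).*' USING NEW;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

The attraction is that the routing never needs editing; the downside is the uneven bucket sizes.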


Chapter 5.9 in the manual is a bit brief on examples (particularly the maintenance of complex partitions).  So any pointers or tips would be appreciated.

Stephen
