Thread: creating a subset DB efficiently ?


From: David Gauthier
Here's the situation....

- The DB contains data for several projects.  
- The tables of the DB contain data for all projects (data is not partitioned on project name or anything like that)
- The "project" identifier (table column) exists in a few "parent" tables, with many child, grandchild, ... tables under them connected by foreign keys defined with "on delete cascade".  So if a record in one of the parent tables is deleted, all of its underlying, dependent records get deleted too.
- New projects come in, and old ones need to be removed and "archived" in DBs of their own.  So there's a DB called "active_projects" and there's a DB called "project_a_archive" (identical metadata).
- The idea is to copy the data for project "a" from "active_projects" to the "project_a_archive" DB AND delete the project "a" data out of "active_projects".
- Leave "project_a_archive" up and running if someone needs to attach to that and get some old/archived data.  

The brute-force method I've been using is...
1) pg_dump "active_projects" to a (huge) file, then populate "project_a_archive" using that (I don't have the privs to create a database; IT creates an empty one for me, so this is how I do it).
2) go into the "project_a_archive" DB and run... "delete from par_tbl_1 where project <> 'a' ", "delete from par_tbl_2 where project <> 'a' ", etc... leaving only project "a" data in the DB.  
3) go into the "active_projects" DB and "delete from par_tbl_1 where project = 'a' ", etc... removing project "a" from the "active_projects" DB.

Yeah, not very elegant; it takes a long time and a lot of resources.  So I'm looking for ideas on how to do this better.

Related question...
The "delete from par_tbl_a where project <> 'a' " is taking forever.  I fear it's because it's journaling everything in case I want to roll back.  But this is just in the archive DB, and I don't mind taking the risk if I can speed this up outside of a transaction.  How can I run a delete like this without the rollback/recovery overhead?

Re: creating a subset DB efficiently ?

From: Ron Johnson
On Fri, Mar 8, 2024 at 11:22 AM David Gauthier <dfgpostgres@gmail.com> wrote:


I wrote this function to list all of the FK "children" that reference a "parent" table.  It might help you determine all of the children, grandchildren, etc.

CREATE OR REPLACE FUNCTION dba.get_fk_referenced_by_child(_p_table_name TEXT)
    RETURNS TABLE (downstream_table TEXT, downstream_column NAME, downstream_index NAME)
    LANGUAGE plpgsql
    AS
$$
    DECLARE
        _v_name TEXT;
        -- All FK constraints whose referenced ("parent") table is _v_name.
        -- Note: att.attnum = co.conkey[1] matches only the first column of
        -- each FK, so multi-column foreign keys are reported by their
        -- leading column.
        _c_downstream CURSOR FOR
            select cla2.relnamespace::regnamespace::text||'.'||co.conrelid::regclass::text AS downstream_table
                    , co.conrelid AS downstream_oid
                    , att.attname AS downstream_column
            from pg_constraint co
                inner join pg_class cla1
                    on co.confrelid = cla1.oid
                inner join pg_class cla2
                    on co.conrelid = cla2.oid
                inner join pg_attribute att
                    on co.conrelid = att.attrelid
            where co.contype = 'f'
                and cla1.relnamespace::regnamespace::text||'.'||cla1.relname = _v_name
                and att.attnum = co.conkey[1]
            order by 1;
    BEGIN
        _v_name := TRIM(BOTH FROM _p_table_name);
        -- Verify that the (schema-qualified) parent table exists.
        PERFORM relname
        FROM pg_class
        WHERE relnamespace::regnamespace::text||'.'||relname = _v_name
          AND relkind = 'r';
        IF NOT FOUND THEN
            -- Return a single sentinel row rather than raising an error.
            downstream_table := _v_name;
            downstream_column := 'does not exist';
            downstream_index := NULL;
            RETURN NEXT;
            RETURN;
        END IF;
        FOR i IN _c_downstream
        LOOP
            downstream_table := i.downstream_table;
            downstream_column := i.downstream_column;
            -- Look for an index on the child table that covers the FK column;
            -- NULL here means the foreign key is unindexed.
            SELECT indexrelid::regclass
            into downstream_index
            FROM pg_index
                INNER JOIN pg_attribute ON indexrelid = attrelid
            WHERE indrelid = i.downstream_oid
              AND attname = i.downstream_column;
            IF downstream_index IS NULL THEN
                downstream_index := 'ALERT!!! MISSING INDEX';
            END IF;
            RETURN NEXT;
        END LOOP;
    END;
$$;
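Assuming the function is installed in a `dba` schema as above, it can be called with a schema-qualified parent table name; using one of the (hypothetical) parent tables from the thread:

```sql
-- List every child table referencing public.par_tbl_1, the FK column
-- involved, and the index on that column (or the missing-index alert).
SELECT downstream_table, downstream_column, downstream_index
FROM dba.get_fk_referenced_by_child('public.par_tbl_1');
```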


 

Re: creating a subset DB efficiently ?

From: hector vass
On Fri, Mar 8, 2024 at 4:22 PM David Gauthier <dfgpostgres@gmail.com> wrote:


>(I don't have the privs to create database, IT creates an empty one for me, so this is how I do it).   

That's a shame.  You can do something similar with schemas and tablespaces:
  Template your existing schema to create a new schema for the project (pg_dump -s)
  Create a tablespace for this new project/schema

You can then move the physical tablespace to cheaper disk and use symbolic links, or archive and/or back it up at the schema level with pg_dump -n.

...as long as you don't put anything in the public schema, all you are really sharing is roles; otherwise it's a bit like a separate database.
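A rough sketch of that idea in SQL, with placeholder names for the schema, tablespace, and directory (the directory must already exist and be writable by the postgres OS user):

```sql
-- Tablespace on cheaper storage for the archived project.
CREATE TABLESPACE project_a_ts LOCATION '/archive_disk/project_a';

-- Per-project schema; its tables would come from the pg_dump -s template.
CREATE SCHEMA project_a;

-- Move a project table onto the archive tablespace.
ALTER TABLE project_a.par_tbl_1 SET TABLESPACE project_a_ts;

-- The whole schema can later be dumped on its own:
--   pg_dump -n project_a active_projects > project_a_archive.dump
```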




Re: creating a subset DB efficiently ?

From: Adrian Klaver
On 3/8/24 08:22, David Gauthier wrote:
> Here's the situation....
> 
> - The DB contains data for several projects.
> - The tables of the DB contain data for all projects (data is not 
> partitioned on project name or anything like that)
> - The "project" identifier (table column) exists in a few "parent" 
> tables with many child... grandchild,... tables under them connected 
> with foreign keys defined with "on delete cascade".  So if a record in 
> one of the parent table records is deleted, all of its underlying, 
> dependent records get deleted too.

How many "... child... grandchild,... tables" ?

Do these tables constitute all the tables in the database?


> Related question...
> The "delete from par_tbl_a where project <> 'a' " is taking forever.  I 
> fear it's because it's trying to journal everything in case I want to 
> rollback.  But this is just in the archive DB and I don't mind taking 
> the risk if I can speed this up outside of a transaction.  How can I run 
> a delete command like this without the rollback recovery overhead ?

I am assuming that at the point you do "delete from par_tbl_a where 
project <> 'a' ", project a is no longer receiving data and its records 
are static. Further assuming there is a PK that you could order by, it 
would seem the way to go would be to delete in batches as determined by 
the PK.
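A batched delete along those lines might look like this (a sketch assuming a primary key column named `id`; the batch size is arbitrary):

```sql
-- Delete in PK-ordered batches so each statement holds locks and generates
-- WAL for a bounded number of rows; rerun until zero rows are affected.
DELETE FROM par_tbl_a
WHERE id IN (
    SELECT id
    FROM par_tbl_a
    WHERE project <> 'a'
    ORDER BY id
    LIMIT 10000
);
```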


-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: creating a subset DB efficiently ?

From: Shaheed Haque
I go more or less the other way: I have a script which follows the child...parent...grandparent... foreign keys in the source database and dumps only the records belonging to the selected "project" (your terminology; in my case it is "client"). I save the dumped data to an archive file.

The script has hardcoded knowledge of only the absolute minimum number of root ancestor tables (and certain special cases, mostly relating to the fact that I'm talking about a Django ORM-managed schema), and otherwise tries to avoid knowing too much, so that the schema can evolve mostly without bothering the script.

The script includes metadata about the dump in a "manifest". The manifest records, amongst other things, the order in which the tables were dumped; the restore operation uses this to correctly order the table restores.

I can then simply restore the archive, table by table, from the individual dumps using a script which walks the manifest.
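The table ordering such a manifest needs can be derived from the catalogs. A minimal sketch (not Shaheed's actual script; it assumes the FK graph has no cycles or self-references):

```sql
-- Walk foreign keys from root tables (tables with no outgoing FKs) down to
-- their children; restoring in ascending restore_order satisfies the FKs.
WITH RECURSIVE fk_depth AS (
    SELECT c.oid AS reloid, 0 AS depth
    FROM pg_class c
    WHERE c.relkind = 'r'
      AND NOT EXISTS (SELECT 1 FROM pg_constraint co
                      WHERE co.conrelid = c.oid AND co.contype = 'f')
    UNION ALL
    SELECT co.conrelid, fk_depth.depth + 1
    FROM pg_constraint co
    JOIN fk_depth ON co.confrelid = fk_depth.reloid
    WHERE co.contype = 'f'
)
SELECT reloid::regclass AS table_name, MAX(depth) AS restore_order
FROM fk_depth
GROUP BY reloid
ORDER BY restore_order, table_name;
```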






On Sat, 9 Mar 2024, 14:56 hector vass, <hector.vass@gmail.com> wrote:




Re: creating a subset DB efficiently ?

From: "Peter J. Holzer"
On 2024-03-08 11:22:17 -0500, David Gauthier wrote:
> Related question...
> The "delete from par_tbl_a where project <> 'a' " is taking forever.

Have you checked whether all the FK columns have an index? Otherwise
checking for references means a full table scan which may take a long
time.
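One way to spot single-column FKs with no supporting index on the child side (a sketch; multi-column foreign keys and partial indexes would need more care):

```sql
-- Foreign keys whose child-side column is not the leading column of any index.
SELECT co.conrelid::regclass AS child_table,
       att.attname           AS fk_column
FROM pg_constraint co
JOIN pg_attribute att
  ON att.attrelid = co.conrelid AND att.attnum = co.conkey[1]
WHERE co.contype = 'f'
  AND NOT EXISTS (
      SELECT 1
      FROM pg_index i
      WHERE i.indrelid = co.conrelid
        AND i.indkey[0] = co.conkey[1]   -- index leads with the FK column
  );
```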

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
