Thread: creating a subset DB efficiently ?


From: David Gauthier
Here's the situation....

- The DB contains data for several projects.  
- The tables of the DB contain data for all projects (data is not partitioned on project name or anything like that)
- The "project" identifier (table column) exists in a few "parent" tables, with many child, grandchild, ... tables under them connected by foreign keys defined with "on delete cascade".  So if a record in one of the parent tables is deleted, all of its underlying, dependent records get deleted too.
- New projects come in, and old ones need to be removed and "archived" in DBs of their own.  So there's a DB called "active_projects" and there's a DB called "project_a_archive" (identical metadata).
- The idea is to copy the data for project "a" from "active_projects" to the "project_a_archive" DB AND delete the project "a" data out of "active_projects".
- Leave "project_a_archive" up and running if someone needs to attach to that and get some old/archived data.  

The brute-force method I've been using is...
1) pg_dump "active_projects" to a (huge) file, then populate "project_a_archive" using that (I don't have the privs to create a database; IT creates an empty one for me, so this is how I do it).
2) go into the "project_a_archive" DB and run... "delete from par_tbl_1 where project <> 'a' ", "delete from par_tbl_2 where project <> 'a' ", etc... leaving only project "a" data in the DB.  
3) go into the "active_projects" DB and "delete from par_tbl_1 where project = 'a' ", etc... removing project "a" from the "active_projects" DB.

Yeah, not very elegant; it takes a long time and a lot of resources.  So I'm looking for ideas on how to do this better.

Related question...
The "delete from par_tbl_a where project <> 'a' " is taking forever.  I fear it's because it's journaling everything in case I want to roll back.  But this is just in the archive DB, and I don't mind taking the risk if I can speed this up outside of a transaction.  How can I run a delete like this without the rollback/recovery overhead?

Re: creating a subset DB efficiently ?

From: Ron Johnson
On Fri, Mar 8, 2024 at 11:22 AM David Gauthier <dfgpostgres@gmail.com> wrote:


I wrote this function to list all of the FK "children" that reference a "parent" table.  It might help you determine all of the children, grandchildren, etc.

CREATE OR REPLACE FUNCTION dba.get_fk_referenced_by_child(_p_table_name TEXT)
    RETURNS TABLE (downstream_table TEXT, downstream_column NAME, downstream_index NAME)
    LANGUAGE plpgsql
    AS
$$
    DECLARE
        _v_name TEXT;
        -- All FK constraints whose referenced ("parent") table is _v_name.
        -- Note: att.attnum = co.conkey[1] matches only the first column of
        -- each FK, so multi-column foreign keys are reported by their
        -- leading column.
        _c_downstream CURSOR FOR
            select cla2.relnamespace::regnamespace::text||'.'||co.conrelid::regclass::text AS downstream_table
                    , co.conrelid AS downstream_oid
                    , att.attname AS downstream_column
            from pg_constraint co
                inner join pg_class cla1
                    on co.confrelid = cla1.oid
                inner join pg_class cla2
                    on co.conrelid = cla2.oid
                inner join pg_attribute att
                    on co.conrelid = att.attrelid
            where co.contype = 'f'
                and cla1.relnamespace::regnamespace::text||'.'||cla1.relname = _v_name
                and att.attnum = co.conkey[1]
            order by 1;
    BEGIN
        _v_name := TRIM(BOTH FROM _p_table_name);
        -- Verify that the (schema-qualified) parent table exists.
        PERFORM relname
        FROM pg_class
        WHERE relnamespace::regnamespace::text||'.'||relname = _v_name
          AND relkind = 'r';
        IF NOT FOUND THEN
            -- Return a single sentinel row rather than raising an error.
            downstream_table := _v_name;
            downstream_column := 'does not exist';
            downstream_index := NULL;
            RETURN NEXT;
            RETURN;
        END IF;
        FOR i IN _c_downstream
        LOOP
            downstream_table := i.downstream_table;
            downstream_column := i.downstream_column;
            -- Look for an index on the child table that covers the FK column;
            -- NULL here means the foreign key is unindexed.
            SELECT indexrelid::regclass
            into downstream_index
            FROM pg_index
                INNER JOIN pg_attribute ON indexrelid = attrelid
            WHERE indrelid = i.downstream_oid
              AND attname = i.downstream_column;
            IF downstream_index IS NULL THEN
                downstream_index := 'ALERT!!! MISSING INDEX';
            END IF;
            RETURN NEXT;
        END LOOP;
    END;
$$;
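Assuming the function is installed in a `dba` schema as above, it can be called with a schema-qualified parent table name; using one of the (hypothetical) parent tables from the thread:

```sql
-- List every child table referencing public.par_tbl_1, the FK column
-- involved, and the index on that column (or the missing-index alert).
SELECT downstream_table, downstream_column, downstream_index
FROM dba.get_fk_referenced_by_child('public.par_tbl_1');
```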


 

Re: creating a subset DB efficiently ?

From: hector vass
On Fri, Mar 8, 2024 at 4:22 PM David Gauthier <dfgpostgres@gmail.com> wrote:


>(I don't have the privs to create database, IT creates an empty one for me, so this is how I do it).   

That's a shame.  You can do something similar with schemas and tablespaces:
  Template your existing schema to create a new schema for the project (pg_dump -s)
  Create a tablespace for this new project/schema

You can then move the physical tablespace to cheaper disk and use symbolic links, or archive and/or back it up at the schema level with pg_dump -n.

...as long as you don't put anything in the public schema, all you are really sharing is roles; otherwise it's a bit like a separate database.
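A rough sketch of that idea in SQL, with placeholder names for the schema, tablespace, and directory (the directory must already exist and be writable by the postgres OS user):

```sql
-- Tablespace on cheaper storage for the archived project.
CREATE TABLESPACE project_a_ts LOCATION '/archive_disk/project_a';

-- Per-project schema; its tables would come from the pg_dump -s template.
CREATE SCHEMA project_a;

-- Move a project table onto the archive tablespace.
ALTER TABLE project_a.par_tbl_1 SET TABLESPACE project_a_ts;

-- The whole schema can later be dumped on its own:
--   pg_dump -n project_a active_projects > project_a_archive.dump
```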




Re: creating a subset DB efficiently ?

From: Adrian Klaver
On 3/8/24 08:22, David Gauthier wrote:
> Here's the situation....
> 
> - The DB contains data for several projects.
> - The tables of the DB contain data for all projects (data is not 
> partitioned on project name or anything like that)
> - The "project" identifier (table column) exists in a few "parent" 
> tables with many child... grandchild,... tables under them connected 
> with foreign keys defined with "on delete cascade".  So if a record in 
> one of the parent table records is deleted, all of its underlying, 
> dependent records get deleted too.

How many "... child... grandchild,... tables" ?

Do these tables constitute all the tables in the database?


> Related question...
> The "delete from par_tbl_a where project <> 'a' " is taking forever.  I 
> fear it's because it's trying to journal everything in case I want to 
> rollback.  But this is just in the archive DB and I don't mind taking 
> the risk if I can speed this up outside of a transaction.  How can I run 
> a delete command like this without the rollback recovery overhead ?

I am assuming that at the point you do "delete from par_tbl_a where 
project <> 'a' ", project a is no longer receiving data and its records 
are static. Further assuming there is a PK that you could order by, it 
would seem the way to go would be to delete in batches as determined by 
the PK.
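A batched delete along those lines might look like this (a sketch assuming a primary key column named `id`; the batch size is arbitrary):

```sql
-- Delete in PK-ordered batches so each statement holds locks and generates
-- WAL for a bounded number of rows; rerun until zero rows are affected.
DELETE FROM par_tbl_a
WHERE id IN (
    SELECT id
    FROM par_tbl_a
    WHERE project <> 'a'
    ORDER BY id
    LIMIT 10000
);
```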


-- 
Adrian Klaver
adrian.klaver@aklaver.com




Re: creating a subset DB efficiently ?

From: Shaheed Haque
I go more or less the other way: I have a script which follows the child...parent...grandparent... foreign keys in the source database and dumps only the records belonging to the selected "project" (your terminology; in my case it is "client"). I save the dumped data to an archive file.

The script has hardcoded knowledge of only the absolute minimum number of root ancestor tables (and certain special cases, mostly relating to the fact that I'm talking about a Django ORM-managed schema), and otherwise tries to avoid knowing too much, so that the schema can evolve mostly without bothering the script.

The script includes metadata about the dump in a "manifest". The manifest records, amongst other things, the order in which the tables were dumped; the restore operation uses this to correctly order the table restores.

I can then simply restore the archive, table by table, from the individual dumps using a script which walks the manifest.
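The table ordering such a manifest needs can be derived from the catalogs. A minimal sketch (not Shaheed's actual script; it assumes the FK graph has no cycles or self-references):

```sql
-- Walk foreign keys from root tables (tables with no outgoing FKs) down to
-- their children; restoring in ascending restore_order satisfies the FKs.
WITH RECURSIVE fk_depth AS (
    SELECT c.oid AS reloid, 0 AS depth
    FROM pg_class c
    WHERE c.relkind = 'r'
      AND NOT EXISTS (SELECT 1 FROM pg_constraint co
                      WHERE co.conrelid = c.oid AND co.contype = 'f')
    UNION ALL
    SELECT co.conrelid, fk_depth.depth + 1
    FROM pg_constraint co
    JOIN fk_depth ON co.confrelid = fk_depth.reloid
    WHERE co.contype = 'f'
)
SELECT reloid::regclass AS table_name, MAX(depth) AS restore_order
FROM fk_depth
GROUP BY reloid
ORDER BY restore_order, table_name;
```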






On Sat, 9 Mar 2024, 14:56 hector vass, <hector.vass@gmail.com> wrote:




Re: creating a subset DB efficiently ?

From: "Peter J. Holzer"
On 2024-03-08 11:22:17 -0500, David Gauthier wrote:
> Related question...
> The "delete from par_tbl_a where project <> 'a' " is taking forever.

Have you checked whether all the FK columns have an index? Otherwise
checking for references means a full table scan which may take a long
time.
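One way to spot single-column FKs with no supporting index on the child side (a sketch; multi-column foreign keys and partial indexes would need more care):

```sql
-- Foreign keys whose child-side column is not the leading column of any index.
SELECT co.conrelid::regclass AS child_table,
       att.attname           AS fk_column
FROM pg_constraint co
JOIN pg_attribute att
  ON att.attrelid = co.conrelid AND att.attnum = co.conkey[1]
WHERE co.contype = 'f'
  AND NOT EXISTS (
      SELECT 1
      FROM pg_index i
      WHERE i.indrelid = co.conrelid
        AND i.indkey[0] = co.conkey[1]   -- index leads with the FK column
  );
```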

        hp

--
   _  | Peter J. Holzer    | Story must make more sense than reality.
|_|_) |                    |
| |   | hjp@hjp.at         |    -- Charles Stross, "Creative writing
__/   | http://www.hjp.at/ |       challenge!"
