Thread: Separating data sets in a table

Separating data sets in a table

From
Andreas Tille
Date:
Hello,

I want to solve the following problem:

CREATE TABLE Ref    ( Id int ) ;
CREATE TABLE Import ( Id    int,
                      Other varchar(42),
                      Flag  int,
                      Ts    timestamp ) ;
CREATE TABLE Data   ( Id    int,
                      Other varchar(42) ) ;

The table Import will be filled by a COPY FROM statement and contains
no checks for referential integrity.  The columns Id and Other have to
be moved to the table Data if the table Ref contains the Id; if not,
Flag should be set to a certain value indicating that something went
wrong.  Moreover, Import should contain only one representation of a
data set with equal Id and Other columns, and I would like to store the
newest one (this is the reason for the timestamp).
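
For the duplicate part, what I have in mind is roughly the following
(an untested sketch; it assumes that Ts differs between duplicates with
equal Id and Other):

  -- keep only the newest row per (Id, Other) pair
  DELETE FROM Import
   WHERE EXISTS (SELECT 1 FROM Import newer
                  WHERE newer.Id    = Import.Id
                    AND newer.Other = Import.Other
                    AND newer.Ts    > Import.Ts);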

I tried to do the following approach:
   CREATE LOCAL TEMPORARY TABLE ImportOK () INHERITS (Import) ;

   INSERT INTO ImportOK SELECT * FROM Import i
          INNER JOIN  Ref r ON i.Id = r.Id;

   DELETE FROM Import WHERE Id IN (SELECT Id FROM ImportOK) ;

The idea was that the last statement should get rid of all valid
data sets from Import.  The valid data sets could then be moved to Data,
and I could afterwards check Import for duplicated data sets.
Unfortunately the last statement is so terribly slow that I can't
imagine there isn't a better way to do this.

It seems like a very beginner question, but I have no real clue how
to do this right.  Probably the solution has to be done in a completely
different way.

Thanks for your patience
         Andreas.



Updatable View (Was: Separating data sets in a table)

From
Andreas Tille
Date:
On Thu, 22 Aug 2002, Andreas Tille wrote:

> I want to solve the following problem:
>
> CREATE TABLE Ref    ( Id int ) ;
> CREATE TABLE Import ( Id    int,
>                       Other varchar(42),
>                       Flag  int,
>                       Ts    timestamp ) ;
> CREATE TABLE Data   ( Id    int,
>                       Other varchar(42) ) ;
>
> The table Import will be filled by a COPY FROM statement and contains
> no checks for referential integrity.  The columns Id and Other have to
> be moved to the table Data if the table Ref contains the Id; if not,
> Flag should be set to a certain value indicating that something went
> wrong.  Moreover, Import should contain only one representation of a
> data set with equal Id and Other columns, and I would like to store the
> newest one (this is the reason for the timestamp).
> ...
While I fiddled around with my previous solution I found out that perhaps
an updatable view could solve my problem.  Thus I tried to implement:
   CREATE TABLE ViewImport ( Id    int,
                             Other varchar(42),
                             Flag  int,
                             Ts    timestamp ) ;

   CREATE RULE "_RETviewimport" AS
      ON SELECT TO ViewImport
      DO INSTEAD
         SELECT i.* FROM Import i
                INNER JOIN Ref r ON i.Id = r.Id;

Now I wonder whether it is possible to find a clever rule for UPDATE,
containing the JOIN from this working example, that sets the Flag
for the appropriate data sets right away.
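
The kind of rule I imagine would look roughly like this (an untested
sketch; the rule name and the exact condition are just my guess):

   CREATE RULE viewimport_upd AS
       ON UPDATE TO ViewImport
       DO INSTEAD
          UPDATE Import SET Flag = NEW.Flag
           WHERE Import.Id = OLD.Id
             AND EXISTS (SELECT 1 FROM Ref r WHERE r.Id = Import.Id);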

Any help for this approach or any other solution for this problem?

Kind regards
        Andreas.



Re: Separating data sets in a table

From
Mark Stosberg
Date:
On Thu, 22 Aug 2002, Andreas Tille wrote:
> Hello,
>
> I want to solve the following problem:
>
> CREATE TABLE Ref    ( Id int ) ;
> CREATE TABLE Import ( Id    int,
>                       Other varchar(42),
>                       Flag  int,
>                       Ts    timestamp ) ;
> CREATE TABLE Data   ( Id    int,
>                       Other varchar(42) ) ;

Hello Andreas,

I think it might be easier to help you with your problem if you take a
step back from it. What you are describing here is an attempt at solving
your particular problem, but you don't clearly define exactly what the
larger problem is. I get the sense that you have data you are importing
on a regular basis from outside Postgres, and you want to check it before
it gets moved into production, but I'm not exactly sure what's happening.

Could you clarify what you are trying to do, not in the SQL, but just
the English version of what's happening? I think it might be easier to
help you solve your problem with that knowledge.

Thanks,
 -mark


http://mark.stosberg.com/



Re: Separating data sets in a table

From
Andreas Tille
Date:
On Sat, 24 Aug 2002, Mark Stosberg wrote:

> On Thu, 22 Aug 2002, Andreas Tille wrote:
> > Hello,
> >
> > I want to solve the following problem:
> >
> > CREATE TABLE Ref    ( Id int ) ;
> > CREATE TABLE Import ( Id    int,
> >                       Other varchar(42),
> >                       Flag  int,
> >                       Ts    timestamp ) ;
> > CREATE TABLE Data   ( Id    int,
> >                       Other varchar(42) ) ;
> larger problem is. I get the sense that you have data you are importing
> on a regular basis from outside Postgres, and you want to check it before
> it gets moved into production, but I'm not exactly sure what's happening.

You are completely right.  I just do an import from an external database.
The person I obtain the data from exports the table in a form suitable
for a "COPY FROM".  The problem is that it might happen that some data
rows violate referential integrity, and I then have to ask the data
provider for the additional data that are referenced by the Ids mentioned
above.  So I have to sort out those data sets which have no known Id in
my production data.
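
In SQL terms, the rows I have to catch are essentially these (just a
sketch):

   SELECT i.*
     FROM Import i
    WHERE NOT EXISTS (SELECT 1 FROM Ref r WHERE r.Id = i.Id);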

Kind regards
             Andreas.



Re: Separating data sets in a table

From
Mark Stosberg
Date:
On Sun, 25 Aug 2002, Andreas Tille wrote:

> On Sat, 24 Aug 2002, Mark Stosberg wrote:
>
> > On Thu, 22 Aug 2002, Andreas Tille wrote:
> > > Hello,
> > >
> > > I want to solve the following problem:
> > >
> > > CREATE TABLE Ref    ( Id int ) ;
> > > CREATE TABLE Import ( Id    int,
> > >                       Other varchar(42),
> > >                       Flag  int,
> > >                       Ts    timestamp ) ;
> > > CREATE TABLE Data   ( Id    int,
> > >                       Other varchar(42) ) ;
> > larger problem is. I get the sense that you have data you are importing
> > on a regular basis from outside Postgres, and you want to check it before
> > it gets moved into production, but I'm not exactly sure what's happening.
>
> You are completely right.  I just do an import from an external database.
> The person I obtain the data from exports the table in a form suitable
> for a "COPY FROM".  The problem is that it might happen that some data
> rows violate referential integrity, and I then have to ask the data
> provider for the additional data that are referenced by the Ids mentioned
> above.  So I have to sort out those data sets which have no known Id in
> my production data.

Andreas,

Thanks for the clarification. Here's an idea about how to solve your
problem. As you are importing your data, instead of doing it all at
once, try importing it a row at a time into a table that has the RI turned
on. Check each insert to see if it's successful. If it's not
successful, then insert that row into a table that /doesn't/ have RI
(maybe "import_failures"),
perhaps also including the error that Postgres returned. (This may be
stored in $DBI::errstr). Then when you are done, you can look in
import_failures for a report of which rows need some assistance. If you
need every row that's imported into the production table to succeed, you
can do all this inside of a transaction, and roll it back if any of the
inserts fail. [ thinks for a moment. ] Of course, that would normally
roll back your inserts into import_failures too, so perhaps you can use a
second database connection to make sure those always happen.
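
A minimal sketch of the side table I have in mind (the column names are
made up for illustration; the per-row INSERT and the error check would
live in your import script, not in SQL):

   CREATE TABLE import_failures (
       Id     int,
       Other  varchar(42),
       Ts     timestamp,
       errmsg text          -- error string returned by the failed INSERT
   );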

I hope that helps. Perhaps thinking in terms of "row-at-a-time
processing" will help you solve your problem.
   -mark

http://mark.stosberg.com/



Re: Separating data sets in a table

From
Andreas Tille
Date:
On Sun, 25 Aug 2002, Mark Stosberg wrote:

> Thanks for the clarification. Here's an idea about how to solve your
> problem. As you are importing your data, instead of doing it all at
> once, try importing it a row at a time into a table that has the RI turned
> on. Check each insert to see if it's successful. If it's not
> successful, then insert that row into a table that /doesn't/ have RI
> (maybe "import_failures"),
> perhaps also including the error that Postgres returned. (This may be
> stored in $DBI::errstr). Then when you are done, you can look in
> import_failures for a report of which rows need some assistance. If you
> need every row that's imported into the production table to succeed, you
> can do all this inside of a transaction, and roll it back if any of the
> inserts fail. [ thinks for a moment. ] Of course, that would normally
> roll back your inserts into import_failures too, so perhaps you can use a
> second database connection to make sure those always happen.
>
> I hope that helps. Perhaps thinking in terms of "row-at-a-time
> processing" will help you solve your problem.
Well, for sure this might be an option, but as I said I receive the data
in the dump format appropriate for "COPY <tablename> FROM <file>".  Would
you really suggest that I split those data sets into single lines?
Moreover, I'm not sure how to catch the error messages of failed
COPY statements.

I had thought that importing all data and handling them afterwards would
be a good idea, and it is just my lack of SQL knowledge which prevents
me from finding a clever solution to sort the stuff out.

Kind regards
        Andreas.



Re: Separating data sets in a table

From
Oliver Elphick
Date:
On Mon, 2002-08-26 at 07:26, Andreas Tille wrote:
> Well, for sure this might be an option, but as I said I receive the data
> in the dump format appropriate for "COPY <tablename> FROM <file>".  Would
> you really suggest that I split those data sets into single lines?
> Moreover, I'm not sure how to catch the error messages of failed
> COPY statements.

How about this approach:

     Create a temporary table (no constraints)

         CREATE TEMP TABLE temptable AS
           (SELECT * FROM tablename LIMIT 1);
         DELETE FROM temptable;

     Copy all data into the temporary table

         COPY temptable FROM 'filepath';

     Select from the temporary table all items that satisfy the
     constraints, insert them into the real table and delete them from
     the temporary table:

         BEGIN;
         INSERT INTO tablename (SELECT * FROM temptable WHERE ...);
         DELETE FROM temptable WHERE ...;
         COMMIT;

     All good data should now be in place.  The temporary table should
     now contain only those items that do not satisfy the constraints for
     the real table.
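
Applied to the tables from your original post, that would be roughly
(an untested sketch; the Flag/error handling is left out here):

         BEGIN;
         INSERT INTO Data (Id, Other)
             SELECT t.Id, t.Other
               FROM temptable t
              WHERE EXISTS (SELECT 1 FROM Ref r WHERE r.Id = t.Id);
         DELETE FROM temptable
          WHERE EXISTS (SELECT 1 FROM Ref r WHERE r.Id = temptable.Id);
         COMMIT;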
 

-- 
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight, UK                             http://www.lfix.co.uk/oliver
GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839  932A 614D 4C34 3E1D 0C1C
========================================
 "Seeing then that all these things shall be dissolved, what manner of
  persons ought ye to be? You ought to live holy and godly lives as you
  look forward to the day of God and speed its coming."
                                               II Peter 3:11,12
 



Re: Separating data sets in a table

From
Andreas Tille
Date:
On 26 Aug 2002, Oliver Elphick wrote:

>     Create a temporary table (no constraints)
>
>         CREATE TEMP TABLE temptable AS
>           (SELECT * FROM tablename LIMIT 1);
>         DELETE FROM temptable;
>
>     Copy all data into the temporary table
>
>         COPY temptable FROM 'filepath';
Up to this point I have no problems.  The difference is that I have to
mark the invalid data sets with a flag which represents a "reason" why
the data were invalid.  That's why I use an additional flag in the
table, and I do not use a temporary table, because I have to store the
"history" of invalid data (to bother the provider of the data to
fix it).

>     Select from the temporary table all items that satisfy the
>     constraints, insert them into the real table and delete them from
>     the temporary table:
>
>         BEGIN;
>         INSERT INTO tablename (SELECT * FROM temptable WHERE ...);
>         DELETE FROM temptable WHERE ...;
>         COMMIT;
>
>     All good data should now be in place.  The temporary table should
>     now contain only those items that do not satisfy the constraints for
>     the real table.
This was my first attempt here.
The problem I have is that I need a JOIN to a further table, and
I got errors from the parser which made me guess that joins are not
allowed in INSERT statements ... at least I do not know how to do it
right, if it should be possible at all.  That's why I had the idea just to
set a certain flag and then do the insert of all data sets where Flag = OK.
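
Roughly like this (an untested sketch; the flag value is made up):

   UPDATE Import SET Flag = 1        -- 1 = "Id not found in Ref"
    WHERE NOT EXISTS (SELECT 1 FROM Ref r WHERE r.Id = Import.Id);

   INSERT INTO Data (Id, Other)
       SELECT Id, Other FROM Import
        WHERE Flag IS NULL;          -- rows that never got a flag are OK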

Kind regards
        Andreas.



Re: Separating data sets in a table

From
Oliver Elphick
Date:
On Mon, 2002-08-26 at 10:46, Andreas Tille wrote:
> On 26 Aug 2002, Oliver Elphick wrote:
> 
> >     Select from the temporary table all items that satisfy the
> >     constraints, insert them into the real table and delete them from
> >     the temporary table:
> >
> >         BEGIN;
> >         INSERT INTO tablename (SELECT * FROM temptable WHERE ...);
> >         DELETE FROM temptable WHERE ...;
> >         COMMIT;
> >
> >     All good data should now be in place.  The temporary table should
> >     now contain only those items that do not satisfy the constraints for
> >     the real table.
> This was my first attempt here.
> The problem I have is that I need a JOIN to a further table, and
> I got errors from the parser which made me guess that joins are not
> allowed in INSERT statements ... at least I do not know how to do it
> right, if it should be possible at all.

A natural join seems to work fine:

   bray=# insert into junk (select b.* from batch as b, product as p
          where b.product = p.id and p.eancode is not null);
   INSERT 0 7552
 

I don't have any earlier messages from this thread; what was the syntax
you were trying to use?

-- 
Oliver Elphick                                Oliver.Elphick@lfix.co.uk
Isle of Wight, UK                             http://www.lfix.co.uk/oliver
GPG: 1024D/3E1D0C1C: CA12 09E0 E8D5 8870 5839  932A 614D 4C34 3E1D 0C1C
========================================
 "Seeing then that all these things shall be dissolved, what manner of
  persons ought ye to be? You ought to live holy and godly lives as you
  look forward to the day of God and speed its coming."
                                               II Peter 3:11,12
 



Re: Separating data sets in a table

From
"OU"
Date:
"Andreas Tille" <tillea@rki.de> a �crit dans le message de news:
Pine.LNX.4.44.0208221126490.1478-100000@wr-linux02.rki.ivbb.bund.de...

...
> I tried to do the following approach:
>
>    CREATE LOCAL TEMPORARY TABLE ImportOK () INHERITS (Import) ;
>
>    INSERT INTO ImportOK SELECT * FROM Import i
>           INNER JOIN  Ref r ON i.Id = r.Id;
>
>    DELETE FROM Import WHERE Id IN (SELECT Id FROM ImportOK) ;
...

> Unfortunately the last statement is so terribly slow that I can't
> imagine there isn't a better way to do this.
>

You must use EXISTS if you work with big tables:
EXISTS uses indexes, while IN uses temporary tables.

-- this changes nothing for IN:
CREATE INDEX import_id_index ON import(id);
CREATE INDEX import_ok_id_index ON import_ok(id);

-- slow:
-- DELETE FROM import WHERE id IN (SELECT id FROM import_ok) ;

DELETE FROM import WHERE EXISTS
  ( SELECT id FROM import_ok AS ok
    WHERE ok.id = import.id );