Re: Dependency graph of all tuples relied upon in a query answer - Mailing list pgsql-general

From Randall Lucas
Subject Re: Dependency graph of all tuples relied upon in a query answer
Date
Msg-id 20060901015446.GB17085@ontology.tercent.com
Whole thread Raw
In response to Re: Dependency graph of all tuples relied upon in a query answer  ("Merlin Moncure" <mmoncure@gmail.com>)
Responses Re: Dependency graph of all tuples relied upon in a query answer  ("Merlin Moncure" <mmoncure@gmail.com>)
List pgsql-general
On Thu, Aug 31, 2006 at 09:17:54AM -0400, Merlin Moncure wrote:
> On 8/30/06, Randall Lucas <rlucas@tercent.com> wrote:
>
> >An inverted way of thinking about the problem is the notion of getting a
> >source document (say, a "company registration form") and parsing and
> >storing it in a normalized format.  How do you "tag" the origin of each
> >and every row in every table that resulted from that source document?
>
> your form should have a code, either entered by the customer or by the
> preparer who enters it into the database, which becomes the key that
> identifies the registration document.  Put that key into other tables.

Yes -- I have flirted with this idea for the case of importing a set of
facts from a single source.  But where it breaks down is in being able
to do the reverse -- ex post facto, to certify the results of a query
as being accurate, and thereby attesting to the underlying facts.

> be careful, you are flirting with EAV thinking.  I think EAV designs
> are terrible.

(Though religion is not at issue here, I am pretty devoutly relational.
I just want a good way to perform audits and other functions on my
relations.)

> > It seems to me that the elegant way to do this would be to get the
> >entire graph of dependencies for not only all tables that reference the
> >entity, but only those rows within those tables that refer specifically
> >to my entity.
> >
> >The query engine should have a pretty good idea about which tables and
> >which rows would actually be used in forming the responses to a given
> >query.  How can I get this information?  Or am I off the deep end (-ency
> >graph) with this one?
>
> I am not sure where you are going with this.  Maybe you should mock up
> some simple tables and repost your question.

Imagine that I am the SEC.  I look for cases of insider trading, and
when I find a case, I have to go back and double-check all the pieces
of evidence I am using against them.

    create table company (
        id      serial primary key,
        name    text not null
    );

    create table officer (
        id      serial primary key,
        company_id  int not null references company(id) on delete cascade,
        name    text not null,
        title   text not null
    );

    create table insider_trade (
        id      serial primary key,
        officer_id int not null references officer(id) on delete cascade,
        shares_traded   numeric,
        share_price     numeric,
        trade_date      timestamptz
    );

    insert into company (name) values ('goodco');
    insert into company (name) values ('badco');
    insert into officer (company_id, name, title) values (1, 'johnny b. good', 'ceo');
    insert into officer (company_id, name, title) values (1, 'mother teresa', 'saint');
    insert into officer (company_id, name, title) values (2, 'leroy brown', 'ceo');
    insert into insider_trade (officer_id, shares_traded, share_price, trade_date) values
        (3, '50000', '6.66', '2006-07-04 1:23 PM PST');

Now, my database looks like this:

    select * from company left join officer on company.id=officer.company_id left join insider_trade on
officer.id=insider_trade.officer_id;
     id |  name  | id | company_id |      name      | title | id | officer_id | shares_traded | share_price |
trade_date       

----+--------+----+------------+----------------+-------+----+------------+---------------+-------------+------------------------
      1 | goodco |  1 |          1 | johnny b. good | ceo   |    |            |               |             |
      1 | goodco |  2 |          1 | mother teresa  | saint |    |            |               |             |
      2 | badco  |  3 |          2 | leroy brown    | ceo   |  1 |          3 |         50000 |        6.66 |
2006-07-0414:23:00-07 
    (3 rows)

If I want to know whom to investigate, I might do a query like this:

    select * from company left join officer on company.id=officer.company_id left join insider_trade on
officer.id=insider_trade.officer_idwhere insider_trade.id is not null; 
     id | name  | id | company_id |    name     | title | id | officer_id | shares_traded | share_price |
trade_date       

----+-------+----+------------+-------------+-------+----+------------+---------------+-------------+------------------------
      2 | badco |  3 |          2 | leroy brown | ceo   |  1 |          3 |         50000 |        6.66 | 2006-07-04
14:23:00-07
    (1 row)

Now that I have this query, in order to make my case, I need to "sign
off" on all of the individual data that went into it.  I would like to
do something like:

    select last_query_shown_tuples();
     schema |  table_name   | pk_columns | pk_values
    --------+---------------+------------+-----------
     public |   company     |    [id]    |   [2]
     public |   officer     |    [id]    |   [3]
     public | insider_trade |    [id]    |   [1]
     (3 rows)

(I am aware that, at least theoretically, my query is just as reliant
upon the tuples that it *didn't* show, since they were acted upon by
the where clause.)

(I am also aware that this example is sub-optimal in that the primary
keys are all single ints called 'id' and that they are present
individually in the query result -- imagine that I used an aggregate
function in a GROUP BY, for example, and you can see how the ids relied
upon would become opaque, or imagine a multicolumn pk and you
understand my odd notation.)

Finally, I'd want to end up with something like:

    select last_query_shown_tuples();
     schema |  table_name   | pk_columns | pk_values | audited_ts
    --------+---------------+------------+-----------+------------
     public |   company     |    [id]    |   [2]     | 2006-08-31 18:52
     public |   officer     |    [id]    |   [3]     | 2006-08-31 18:55
     public | insider_trade |    [id]    |   [1]     | 2006-08-31 18:57
     (3 rows)

Is something like this even possible, much less doable at present?  I
have an inkling that with all of the capabilities of information_schema,
rules, and triggers, it should be pretty close to doable.

Best,

Randall

--
Randall Lucas       Tercent, Inc.       DF93EAD1

pgsql-general by date:

Previous
From: Bruce Momjian
Date:
Subject: Re: [pgsql-advocacy] Thought provoking piece on
Next
From: Alvaro Herrera
Date:
Subject: Re: [pgsql-advocacy] Thought provoking piece on