Thread: Query involving views

Query involving views

From
Laurent Martelli
Date:
Hello again,

This question is related to my previous one (Unused table of view, see
http://archives.postgresql.org/pgsql-performance/2004-06/msg00043.php).
For the moment, I have queries where I join tables by hand. Since a
few tables are always joined together, I thought I could define a view
to centralize this and make my queries more readable. But I then
observe a drop in performances on some queries because it seems the
view is not "broken" by the planner, so some optimizations cannot
occur anymore. Maybe this assertion is plain wrong, it's just my
feeling of the situation.

I'm using postgresql 7.4.2 on Debian GNU/Linux.

Here are the details of my tables, queries and views:

CREATE TABLE pictures (
    PictureID serial PRIMARY KEY,
    RollID character varying(64) NOT NULL REFERENCES rolls,
    FrameID character varying(64) NOT NULL,
    Description character varying(255),
    Filename character varying(255),
    Owner integer NOT NULL REFERENCES users,
    EntryDate datetime DEFAULT now(),
    Date datetime,
    NbClick integer DEFAULT 0,
    NbRates integer DEFAULT 0,
    MaxRate int2,
    MinRate int2,
    AverageRate float4 DEFAULT 5,
    SumRates integer DEFAULT 0);

-- Each picture can belong to a number of topics
CREATE TABLE topicscontent (
    TopicID integer REFERENCES topics ON DELETE cascade,
    PictureID integer REFERENCES pictures ON DELETE cascade,
    Direct boolean NOT NULL,
    PRIMARY KEY (TopicID,PictureID) );

-- Each picture can be viewed by a number of groups
CREATE TABLE permissions (
    GroupID integer NOT NULL REFERENCES groups ON DELETE cascade,
    PictureID integer NOT NULL REFERENCES pictures ON DELETE cascade,
    UNIQUE (GroupID, PictureID));

-- Each user can belong to a number of groups
CREATE TABLE groupsdef (
    UserID integer REFERENCES users,
    GroupID integer REFERENCES groups,
    PRIMARY KEY (UserID,GroupID));

-- Each picture can have a number of keywords
CREATE TABLE keywords (
        Type integer,
        PictureID integer NOT NULL REFERENCES pictures ON DELETE cascade,
        Value character varying(128) NOT NULL,
    UNIQUE (Type,PictureID,Value));


Without views, if I want all the picture with a keyword value of
'laurent' that a user with ID of 2 can see, sorted by AverageRate:

SELECT DISTINCT ON (AverageRate,PictureID) P.*
       FROM Pictures AS P, GroupsDef AS G, Permissions AS A, Keywords AS K
       WHERE P.PictureID=A.PictureID
         AND G.GroupID=A.GroupID
         AND K.Value in ('laurent')
         AND K.PictureID=P.PictureID
         AND UserID=2
       ORDER BY AverageRate,PictureID;

QUERY PLAN

 Unique  (cost=528.93..532.71 rows=504 width=97) (actual time=32.447..33.062 rows=274 loops=1)
   ->  Sort  (cost=528.93..530.19 rows=504 width=97) (actual time=32.443..32.590 rows=505 loops=1)
         Sort Key: p.averagerate, p.pictureid
         ->  Hash Join  (cost=297.36..506.31 rows=504 width=97) (actual time=12.495..29.312 rows=505 loops=1)
               Hash Cond: ("outer".groupid = "inner".groupid)
               ->  Hash Join  (cost=292.14..466.79 rows=900 width=101) (actual time=12.056..26.180 rows=750 loops=1)
                     Hash Cond: ("outer".pictureid = "inner".pictureid)
                     ->  Seq Scan on permissions a  (cost=0.00..125.05 rows=8305 width=8) (actual time=0.007..6.271
rows=8305loops=1) 
                     ->  Hash  (cost=291.43..291.43 rows=285 width=101) (actual time=11.961..11.961 rows=0 loops=1)
                           ->  Hash Join  (cost=110.26..291.43 rows=285 width=101) (actual time=6.378..11.573 rows=274
loops=1)
                                 Hash Cond: ("outer".pictureid = "inner".pictureid)
                                 ->  Seq Scan on pictures p  (cost=0.00..68.33 rows=2933 width=97) (actual
time=0.007..2.426rows=2933 loops=1) 
                                 ->  Hash  (cost=109.55..109.55 rows=285 width=4) (actual time=6.163..6.163 rows=0
loops=1)
                                       ->  Seq Scan on keywords k  (cost=0.00..109.55 rows=285 width=4) (actual
time=0.032..5.929rows=274 loops=1) 
                                             Filter: ((value)::text = 'laurent'::text)
               ->  Hash  (cost=5.19..5.19 rows=12 width=4) (actual time=0.217..0.217 rows=0 loops=1)
                     ->  Seq Scan on groupsdef g  (cost=0.00..5.19 rows=12 width=4) (actual time=0.038..0.197 rows=11
loops=1)
                           Filter: (userid = 2)
 Total runtime: 33.554 ms

Now, if I use the following view to abstract access rights:

CREATE VIEW userpictures (
       PictureID,RollID,FrameID,Description,Filename,
       Owner,EntryDate,Date,
       NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates,
       UserID)
   AS SELECT DISTINCT ON (Permissions.PictureID,UserID)
         Pictures.PictureID,RollID,FrameID,Description,Filename,Owner,
         EntryDate,Date,NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates,
         UserID
     FROM Permissions
          JOIN Groupsdef using (GroupID)
          JOIN pictures using (PictureID);

The query I use is:

SELECT P.*
       FROM UserPictures AS P, Keywords AS K
       WHERE P.PictureID=K.PictureID
         AND K.Value in ('laurent')
         AND UserID=2
       ORDER BY AverageRate,PictureID;

QUERY PLAN

 Sort  (cost=1126.98..1128.54 rows=622 width=438) (actual time=142.053..142.132 rows=274 loops=1)
   Sort Key: p.averagerate, p.pictureid
   ->  Merge Join  (cost=995.75..1098.12 rows=622 width=438) (actual time=116.412..140.481 rows=274 loops=1)
         Merge Cond: ("outer".pictureid = "inner".pictureid)
         ->  Subquery Scan p  (cost=874.58..955.99 rows=4652 width=438) (actual time=108.709..130.049 rows=2361
loops=1)
               ->  Unique  (cost=874.58..909.47 rows=4652 width=105) (actual time=108.685..119.661 rows=2361 loops=1)
                     ->  Sort  (cost=874.58..886.21 rows=4652 width=105) (actual time=108.676..110.185 rows=4403
loops=1)
                           Sort Key: permissions.pictureid, groupsdef.userid
                           ->  Hash Join  (cost=388.35..591.19 rows=4652 width=105) (actual time=32.031..63.322
rows=5076loops=1) 
                                 Hash Cond: ("outer".pictureid = "inner".pictureid)
                                 ->  Seq Scan on pictures  (cost=0.00..68.33 rows=2933 width=97) (actual
time=0.011..2.836rows=2933 loops=1) 
                                 ->  Hash  (cost=376.72..376.72 rows=4652 width=8) (actual time=31.777..31.777 rows=0
loops=1)
                                       ->  Merge Join  (cost=5.40..376.72 rows=4652 width=8) (actual time=0.285..27.805
rows=5076loops=1) 
                                             Merge Cond: ("outer".groupid = "inner".groupid)
                                             ->  Index Scan using permissions_groupid_key on permissions
(cost=0.00..280.77rows=8305 width=8) (actual time=0.031..13.018 rows=7633 loops=1) 
                                             ->  Sort  (cost=5.40..5.43 rows=12 width=8) (actual time=0.237..1.762
rows=5078loops=1) 
                                                   Sort Key: groupsdef.groupid
                                                   ->  Seq Scan on groupsdef  (cost=0.00..5.19 rows=12 width=8) (actual
time=0.034..0.203rows=11 loops=1) 
                                                         Filter: (userid = 2)
         ->  Sort  (cost=121.17..121.88 rows=285 width=4) (actual time=6.987..7.065 rows=274 loops=1)
               Sort Key: k.pictureid
               ->  Seq Scan on keywords k  (cost=0.00..109.55 rows=285 width=4) (actual time=0.056..6.656 rows=274
loops=1)
                     Filter: ((value)::text = 'laurent'::text)
 Total runtime: 144.045 ms

To me the only difference between the queries is that second one
includes a UserID column. One strange thing is that width from 105
jumps to 438 during "Subquery Scan p".

I think it's because of DISTINCT ON in the view. If I use this view:


CREATE VIEW userpictures (
       PictureID,RollID,FrameID,Description,Filename,
       Owner,EntryDate,Date,
       NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates,
       UserID)
   AS SELECT Pictures.PictureID,RollID,FrameID,Description,Filename,Owner,
         EntryDate,Date,NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates,
         UserID
     FROM Permissions
          JOIN Groupsdef using (GroupID)
          JOIN pictures using (PictureID);

and this query:

SELECT DISTINCT ON (AverageRate,PictureID) P.*
       FROM UserPictures AS P, Keywords AS K
       WHERE P.PictureID=K.PictureID
         AND K.Value in ('laurent')
         AND UserID=2
       ORDER BY AverageRate,PictureID;

The result is similar to the query without the view:

QUERY PLAN

 Unique  (cost=525.39..529.17 rows=504 width=101) (actual time=34.689..35.287 rows=274 loops=1)
   ->  Sort  (cost=525.39..526.65 rows=504 width=101) (actual time=34.686..34.828 rows=505 loops=1)
         Sort Key: pictures.averagerate, pictures.pictureid
         ->  Hash Join  (cost=297.36..502.76 rows=504 width=101) (actual time=11.786..31.739 rows=505 loops=1)
               Hash Cond: ("outer".groupid = "inner".groupid)
               ->  Hash Join  (cost=292.14..466.79 rows=807 width=101) (actual time=11.409..28.389 rows=750 loops=1)
                     Hash Cond: ("outer".pictureid = "inner".pictureid)
                     ->  Seq Scan on permissions  (cost=0.00..125.05 rows=8305 width=8) (actual time=0.004..9.177
rows=8305loops=1) 
                     ->  Hash  (cost=291.43..291.43 rows=285 width=101) (actual time=11.319..11.319 rows=0 loops=1)
                           ->  Hash Join  (cost=110.26..291.43 rows=285 width=101) (actual time=5.919..10.961 rows=274
loops=1)
                                 Hash Cond: ("outer".pictureid = "inner".pictureid)
                                 ->  Seq Scan on pictures  (cost=0.00..68.33 rows=2933 width=97) (actual
time=0.004..2.258rows=2933 loops=1) 
                                 ->  Hash  (cost=109.55..109.55 rows=285 width=4) (actual time=5.700..5.700 rows=0
loops=1)
                                       ->  Seq Scan on keywords k  (cost=0.00..109.55 rows=285 width=4) (actual
time=0.029..5.471rows=274 loops=1) 
                                             Filter: ((value)::text = 'laurent'::text)
               ->  Hash  (cost=5.19..5.19 rows=12 width=8) (actual time=0.198..0.198 rows=0 loops=1)
                     ->  Seq Scan on groupsdef  (cost=0.00..5.19 rows=12 width=8) (actual time=0.031..0.178 rows=11
loops=1)
                           Filter: (userid = 2)
 Total runtime: 35.657 ms


--
Laurent Martelli
laurent@aopsys.com                                Java Aspect Components
http://www.aopsys.com/                          http://jac.objectweb.org


Re: Query involving views

From
Tom Lane
Date:
Laurent Martelli <laurent@aopsys.com> writes:
> Now, if I use the following view to abstract access rights:

> CREATE VIEW userpictures (
>        PictureID,RollID,FrameID,Description,Filename,
>        Owner,EntryDate,Date,
>        NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates,
>        UserID)
>    AS SELECT DISTINCT ON (Permissions.PictureID,UserID)
>          Pictures.PictureID,RollID,FrameID,Description,Filename,Owner,
>          EntryDate,Date,NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates,
>          UserID
>      FROM Permissions
>           JOIN Groupsdef using (GroupID)
>           JOIN pictures using (PictureID);

> [ performance sucks ]

Find a way to get rid of the DISTINCT ON.  That's essentially an
optimization fence.  Worse, the way you are using it here, it doesn't
even give well-defined results, since there's no ORDER BY constraining
which row will be selected out of a set of duplicates.  (I think it may
not matter to you, since you don't really care which groupsdef row is
selected, but in general a view constructed like this is broken.)

It might work to do the view as

SELECT ... all that stuff ...
FROM pictures p, users u
WHERE
  EXISTS (SELECT 1 FROM permissions prm, groupsdef g
          WHERE p.pictureid = prm.pictureid AND prm.groupid = g.groupid
                AND g.userid = u.userid);

I'm not sure offhand about the performance properties of this either,
but it would be worth trying.

A cruder answer is just to accept that the view may give you multiple
hits, and put the DISTINCT in the top-level query.

I think though that in the long run you're going to need to rethink this
representation of permissions.  It's nice and simple but it's not going
to scale well.  Even your "fast" query is going to look like a dog once
you get to many thousands of permission entries.

It might work to maintain a derived table (basically a materialized
view) of the form (userid, groupid, pictureid) signifying that a user
can access a picture through membership in a group.  Put a nonunique
index on (userid, pictureid) on it.  This could then drive the EXISTS
test efficiently.

            regards, tom lane

Re: Query involving views

From
Laurent Martelli
Date:
>>>>> "Tom" == Tom Lane <tgl@sss.pgh.pa.us> writes:

  Tom> Laurent Martelli <laurent@aopsys.com> writes:
  >> Now, if I use the following view to abstract access rights:

  >> CREATE VIEW userpictures (
  >> PictureID,RollID,FrameID,Description,Filename,
  >> Owner,EntryDate,Date,
  >> NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates, UserID) AS
  >> SELECT DISTINCT ON (Permissions.PictureID,UserID)
  >> Pictures.PictureID,RollID,FrameID,Description,Filename,Owner,
  >> EntryDate,Date,NbClick,NbRates,MaxRate,MinRate,AverageRate,SumRates,
  >> UserID FROM Permissions JOIN Groupsdef using (GroupID) JOIN
  >> pictures using (PictureID);

  >> [ performance sucks ]

  Tom> Find a way to get rid of the DISTINCT ON.  That's essentially
  Tom> an optimization fence.  Worse, the way you are using it here,
  Tom> it doesn't even give well-defined results, since there's no
  Tom> ORDER BY constraining which row will be selected out of a set
  Tom> of duplicates.  (I think it may not matter to you, since you
  Tom> don't really care which groupsdef row is selected,

That's true. I do not use columns from groupsdef in the end.

  Tom> but in general a view constructed like this is broken.)

  Tom> It might work to do the view as

  Tom> SELECT ... all that stuff ...  FROM pictures p, users u WHERE
  Tom> EXISTS (SELECT 1 FROM permissions prm, groupsdef g WHERE
  Tom> p.pictureid = prm.pictureid AND prm.groupid = g.groupid AND
  Tom> g.userid = u.userid);

  Tom> I'm not sure offhand about the performance properties of this
  Tom> either, but it would be worth trying.

This one does not yield very good performance. In fact, the best
performances I have is when I use a where clause like this one:

    WHERE PictureID IN
       (SELECT PictureID FROM permissions JOIN groupsdef USING(GroupID)
              WHERE groupsdef.UserID=2)

But it's not as elegant to write as the initial view using "distinct
on". I could create a view like this:

    CREATE VIEW userpictures (PictureID,UserID)
        AS SELECT pictureid,userid
            FROM permissions JOIN groupsdef USING(GroupID)

and then do queries like this:

    SELECT * FROM pictures
        WHERE PictureID IN (SELECT PictureID FROM userpictures WHERE UserID=2)

but it's stillnot as elegant as

    SELECT * FROM userpictures WHERE UserID=2

I think I'll try a function:

CREATE FUNCTION picturesID(int) RETURNS SETOF int AS '
    SELECT PictureID FROM permissions JOIN groupsdef USING(GroupID)
    WHERE groupsdef.UserID=$1
' LANGUAGE sql;

SELECT * FROM pictures WHERE PictureID IN (select * from picturesID(2));

Here's something funny: using a function seems gives slihtly better results
than inlining the query (I did a dozen of runs and the timings were consistent):

SELECT * FROM pictures WHERE PictureID IN (select * from picturesID(2));
QUERY PLAN
 Hash Join  (cost=15.50..100.49 rows=200 width=97) (actual time=28.609..46.568 rows=2906 loops=1)
   Hash Cond: ("outer".pictureid = "inner".picturesid)
   ->  Seq Scan on pictures  (cost=0.00..68.33 rows=2933 width=97) (actual time=0.018..2.610 rows=2933 loops=1)
   ->  Hash  (cost=15.00..15.00 rows=200 width=4) (actual time=28.467..28.467 rows=0 loops=1)
         ->  HashAggregate  (cost=15.00..15.00 rows=200 width=4) (actual time=23.698..26.201 rows=2906 loops=1)
               ->  Function Scan on picturesid  (cost=0.00..12.50 rows=1000 width=4) (actual time=16.202..19.952
rows=5076loops=1) 
 Total runtime: 48.601 ms

SELECT * FROM pictures WHERE PictureID IN (
    SELECT PictureID FROM permissions JOIN groupsdef USING(GroupID)
        WHERE groupsdef.UserID=2);
QUERY PLAN

 Hash Join  (cost=394.93..504.24 rows=2632 width=97) (actual time=35.770..53.574 rows=2906 loops=1)
   Hash Cond: ("outer".pictureid = "inner".pictureid)
   ->  Seq Scan on pictures  (cost=0.00..68.33 rows=2933 width=97) (actual time=0.014..2.543 rows=2933 loops=1)
   ->  Hash  (cost=388.35..388.35 rows=2632 width=4) (actual time=35.626..35.626 rows=0 loops=1)
         ->  HashAggregate  (cost=388.35..388.35 rows=2632 width=4) (actual time=30.988..33.502 rows=2906 loops=1)
               ->  Merge Join  (cost=5.40..376.72 rows=4652 width=4) (actual time=0.247..26.628 rows=5076 loops=1)
                     Merge Cond: ("outer".groupid = "inner".groupid)
                     ->  Index Scan using permissions_groupid_key on permissions  (cost=0.00..280.77 rows=8305 width=8)
(actualtime=0.031..11.629 rows=7633 loops=1) 
                     ->  Sort  (cost=5.40..5.43 rows=12 width=4) (actual time=0.207..1.720 rows=5078 loops=1)
                           Sort Key: groupsdef.groupid
                           ->  Seq Scan on groupsdef  (cost=0.00..5.19 rows=12 width=4) (actual time=0.030..0.182
rows=11loops=1) 
                                 Filter: (userid = 2)
 Total runtime: 54.748 ms

  Tom> A cruder answer is just to accept that the view may give you
  Tom> multiple hits, and put the DISTINCT in the top-level query.

I thought of that. But it has the drawback that if you use an ORDER
BY, you must have the same columns in the DISTINCT.

  Tom> I think though that in the long run you're going to need to
  Tom> rethink this representation of permissions.  It's nice and
  Tom> simple but it's not going to scale well.  Even your "fast"
  Tom> query is going to look like a dog once you get to many
  Tom> thousands of permission entries.

  Tom> It might work to maintain a derived table (basically a
  Tom> materialized view) of the form (userid, groupid, pictureid)
  Tom> signifying that a user can access a picture through membership
  Tom> in a group.  Put a nonunique index on (userid, pictureid) on
  Tom> it.  This could then drive the EXISTS test efficiently.

I'll probably do that if perf goes down when the database grows
bigger.

Thanks for all the advice.

Best regards,
Laurent

--
Laurent Martelli
laurent@aopsys.com                                Java Aspect Components
http://www.aopsys.com/                          http://jac.objectweb.org