Re: Tricky SQL - assistance appreicated. DDL and DML supplied. - Mailing list pgsql-novice

From Matt Clement
Subject Re: Tricky SQL - assistance appreicated. DDL and DML supplied.
Date
Msg-id CAEhQjh5w-sGwr2sddDuDmeox+nr71P5QKEWDc1C1ef_BRsF2mw@mail.gmail.com
Whole thread Raw
In response to Tricky SQL - assistance appreicated. DDL and DML supplied.  (Pál Teleki <ellenallhatatlan@gmail.com>)
Responses Re: Tricky SQL - assistance appreicated. DDL and DML supplied.
Re: Tricky SQL - assistance appreicated. DDL and DML supplied.
List pgsql-novice
Hey there,

I think you were pretty close!

Let's start with the query that you came up with:

SELECT t1.c_id AS theid1, t1.guid AS theguid1, MAX(t1.ts) AS mydate1 FROM ex t1
INNER JOIN ex t2
  ON t1.c_id = t2.c_id AND t1.guid = t2.guid
GROUP BY t1.c_id, t1.guid
ORDER BY t1.guid, MAX(t1.ts)


This will get the c_id, guid, and ts of the rows you want to keep. I'll clean this up a bit, as the join isn't really useful at this point (none of the selected columns are from the joined table).

SELECT c_id, guid, MAX(ts) AS mydate1 FROM ex
GROUP BY c_id, guid
ORDER BY guid, MAX(ts);

We can use this as a subquery to find the ex_ids that we want to keep and bring back the same join conditions that you had before:

SELECT ex_id from ex t1
INNER JOIN (
SELECT c_id, guid, MAX(ts) AS mydate1 FROM ex
GROUP BY c_id, guid
ORDER BY guid, MAX(ts);
) t2 ON t1.c_id = t2.c_id AND t1.guid = t2.guid AND t1.ts = t2.mydate1

Now we have all of the ex_id's that we want to keep. We can use a simple NOT IN condition to get all of the rows that should be deleted:

SELECT * from ex
WHERE ex_id NOT IN (
SELECT ex_id from ex t1
INNER JOIN (
SELECT c_id, guid, MAX(ts) AS mydate1 FROM ex
GROUP BY c_id, guid
ORDER BY guid, MAX(ts);
) t2 ON t1.c_id = t2.c_id AND t1.guid = t2.guid AND t1.ts = t2.mydate1
)

If you actually want to delete those rows you could do a DELETE FROM instead of SELECT FROM. Here's the query that I actually ran and the resulting rows:

matt=# DELETE from ex where ex_id not in (SELECT t1.ex_id FROM ex t1 join (SELECT c_id, guid, max(ts) mts from ex group by c_id, guid) as t2 on t1.c_id = t2.c_id AND t1.guid = t2.guid AND t1.ts = t2.mts);
DELETE 4
matt=# SELECT * from ex;
 ex_id | c_id | guid | supplier |         ts         
-------+------+------+----------+---------------------
     3 |    1 | xxxx |       50 | 2016-07-15 22:05:01
     4 |    1 | xxxx |       50 | 2016-07-15 22:05:01
     7 |    2 | yyyy |       71 | 2016-07-17 22:05:01
     8 |    2 | yyyy |       74 | 2016-07-17 22:05:01
     9 |    3 | zzzz |       60 | 2016-07-01 22:05:01
    10 |    4 | aaaa |       61 | 2016-07-01 22:05:01
(6 rows)


I believe this solution should work with mysql as well, although I've only done some light testing.
Hope that helps!

On Sat, Jul 2, 2016 at 8:05 PM, Pál Teleki <ellenallhatatlan@gmail.com> wrote:



Hi all,

I have an SQL problem and I'm stumped.


I have the following data (see DDL and DML at end of post)

"ex_id"; "c_id";   "guid"; "supplier";                      "ts"
-------- -------  -------- -----------     ---------------------
   42;        4;   "aaaa";         61;     "2016-07-01 22:05:01"
   34;        1;   "xxxx";         50;     "2016-07-01 22:05:01"
   33;        1;   "xxxx";         54;     "2016-07-01 22:05:01"
   35;        1;   "xxxx";         50;     "2016-07-15 22:05:01"
   36;        1;   "xxxx";         50;     "2016-07-15 22:05:01"
   37;        2;   "yyyy";         78;     "2016-07-01 22:05:01"
   38;        2;   "yyyy";         79;     "2016-07-01 22:05:01"
   39;        2;   "yyyy";         71;     "2016-07-17 22:05:01"
   40;        2;   "yyyy";         74;     "2016-07-17 22:05:01"
   41;        3;   "zzzz";         60;     "2016-07-01 22:05:01"


and the resulting table - i.e. the records I want to retain in the table are:


"ex_id";   "c_id";   "guid"; "supplier";                      "ts"
--------   -------  -------- -----------     ---------------------
     42;        4;   "aaaa";         61;     "2016-07-01 22:05:01"
     35;        1;   "xxxx";         50;     "2016-07-15 22:05:01"
     36;        1;   "xxxx";         50;     "2016-07-15 22:05:01"
     39;        2;   "yyyy";         71;     "2016-07-17 22:05:01"
     40;        2;   "yyyy";         74;     "2016-07-17 22:05:01"
     41;        3;   "zzzz";         60;     "2016-07-01 22:05:01"



i.e. Take the combination of c_id, guid and date which is the last, no matter
how many records there are with a given combination of same. The (guid) 'aaaa'
and the 'zzzz' records obviously stay since there's only 1 of each.

But the records with 'xxxx' and 'yyyy' have to be "pruned" - only keep those
with the latest date where c_id matches and guid matches.

1) This query will also have to work with MySQL, so CTEs and Analytic functions
are out - good old fashioned SQL only.

2) Ideally, I would like two queries, one under the assumption that ex_id is
sequential with time and the other that ex_id is not necessarily in sync with
the timestamp.

3) This appears to be a "top-n-per-group" problem of some sort - I would
appreciate an explanation of the query, not just the raw query itself - I'm
trying to learn rather than just a bald answer. References, URLs &c appreciated.

Just to show that I'm not totally lazy :-), I have formulated what I think is
the first step, but I'm stuck. It would be easy enough to do this with CTEs but,
as I've said, the query also has to work with MySQL. If you want to throw in
solutions based on CTEs and/or SET operators and/or Analytic functions,
for pedagological purposes, great, but I do need a MySQL solution.



SELECT t1.c_id AS theid1, t1.guid AS theguid1, MAX(t1.ts) AS mydate1 FROM ex t1
INNER JOIN ex t2
  ON t1.c_id = t2.c_id AND t1.guid = t2.guid
GROUP BY t1.c_id, t1.guid
ORDER BY t1.guid, MAX(t1.ts)

"theid1";   "theguid1";               "mydate1"
---------   -----------   ---------------------
       4;       "aaaa";   "2016-07-01 22:05:01"
       1;       "xxxx";   "2016-07-15 22:05:01"
       2;       "yyyy";   "2016-07-17 22:05:01"
       3;       "zzzz";   "2016-07-01 22:05:01"



Now, these are the records that **shouldn't** be deleted - without
the supplier data - which also has to be retained - i.e. I need to
retain more records than this - but this should be a start. The data
are correct for guid, datetime and c_id.


I just can't figure out how to formulate the statement to delete the
other records.

Maybe with SET operators, if MySQL had those?

Anyway, all and any input appreciated. If any further info. is required, please
let me know.


============= DDL and DML for the table ===============


CREATE TABLE ex
(
  ex_id serial,
  c_id int,
  guid varchar(32),
  supplier int,
  ts timestamp,
  PRIMARY KEY (ex_id)
);



insert into ex (c_id, guid, supplier, ts) VALUES(1, 'xxxx', 54, '2016-07-01 22:05:01');
insert into ex (c_id, guid, supplier, ts) VALUES(1, 'xxxx', 50, '2016-07-01 22:05:01');
insert into ex (c_id, guid, supplier, ts) VALUES(1, 'xxxx', 50, '2016-07-15 22:05:01');
insert into ex (c_id, guid, supplier, ts) VALUES(1, 'xxxx', 50, '2016-07-15 22:05:01');


insert into ex (c_id, guid, supplier, ts) VALUES(2, 'yyyy', 78, '2016-07-01 22:05:01');
insert into ex (c_id, guid, supplier, ts) VALUES(2, 'yyyy', 79, '2016-07-01 22:05:01');
insert into ex (c_id, guid, supplier, ts) VALUES(2, 'yyyy', 71, '2016-07-17 22:05:01');
insert into ex (c_id, guid, supplier, ts) VALUES(2, 'yyyy', 74, '2016-07-17 22:05:01');

insert into ex (c_id, guid, supplier, ts) VALUES(3, 'zzzz', 60, '2016-07-01 22:05:01');
insert into ex (c_id, guid, supplier, ts) VALUES(4, 'aaaa', 61, '2016-07-01 22:05:01');



--

Pál Teleki

pgsql-novice by date:

Previous
From: Pál Teleki
Date:
Subject: Tricky SQL - assistance appreicated. DDL and DML supplied.
Next
From: Tom Lane
Date:
Subject: Re: Tricky SQL - assistance appreicated. DDL and DML supplied.