Re: Retrieve most recent 1 record from joined table - Mailing list pgsql-sql

From agharta
Subject Re: Retrieve most recent 1 record from joined table
Date
Msg-id 53FD9F21.7040908@gmail.com
Whole thread Raw
In response to Re: Retrieve most recent 1 record from joined table  (Vik Fearing <vik.fearing@dalibo.com>)
List pgsql-sql
On 08/23/2014 02:15 PM, Vik Fearing wrote:
> On 08/22/2014 10:05 AM, agharta wrote:
>> Hi all,
>> This is my first question, don't hate me please if it is the wrong place.
> This is the right place.  See below for my answer.
>
>
>> Mi question is relative seimple, but goes me crazy.
>>
>> First, create some example tables:
>>
>> ----
>> create table table1(
>>      t1_ID SERIAL primary key,
>>      t1_value text
>> );
>>
>> create table table2(
>>      t2_ID SERIAL primary key,
>>      t2_value text,
>>      t1_id integer
>> );
>>
>>
>> create table table3(
>>      t3_ID SERIAL primary key,
>>      t3_value text,
>>      t3_date timestamp,
>>      t2_id integer
>> );
>>
>> ALTER TABLE table2 ADD FOREIGN KEY (t1_id) REFERENCES table1 (t1_ID) ON
>> DELETE CASCADE;
>> ALTER TABLE table3 ADD FOREIGN KEY (t2_id) REFERENCES table2 (t2_ID) ON
>> DELETE CASCADE;
>>
>> CREATE INDEX IDX_TABLE1_T1_value ON table1 (t1_value);
>> CREATE INDEX IDX_TABLE2_T2_value ON table2 (t2_value);
>> CREATE INDEX IDX_TABLE3_T3_value ON table3 (t3_value);
>> CREATE INDEX IDX_TABLE3_T3_DATE ON table3 (t3_date);
>> -----
>>
>> As you can see, table3 is referenced to table2 and table2 is referenced
>> to table1
>>
>> Next, insert some data on tables:
>>
>> --one millon records into table1
>> insert into table1(t1_value) select md5(random()::text) from (select
>> generate_series(1,1000000) as a) as c;
>>
>> --100.000 records on table2 where t1_id is a random number between 1 and
>> 10.001. This guarantees many records pointing to same table1 t1_id
>> insert into table2(t2_value, t1_id) select md5(random()::text),
>> trunc((random()+1)*10000)::integer from (select
>> generate_series(1,100000) as a) as c;
>>
>> --again, 1.022.401 records on table3 where t2_id is a random number
>> between 1 and 10.001. This guarantee many records pointing to same
>> table2 t2_id.
>> -- random numbers (1.022.401) are generated by generated_series function
>> with timestamp (1 minute) interval
>> insert into table3(t3_value, t2_id, t3_date) select md5(random()::text),
>> trunc((random()+1)*10000)::integer, c.date_val from (select
>> generate_series(timestamp '2013-01-01',timestamp '2014-12-12', interval
>> '1 minute') as date_val) as c;
>>
>>
>> So, now we should have table3 with many rows per table2(t2_id) and many
>> rows in table2 with same t1_id.
>>
>>
>> Now, the question:
>>
>>
>> Joining the tables, how to get ONLY most recent record per
>> table3(t3_date)??
>>
>> Query example:
>>
>> select * from table1 as t1
>> inner join table2 t2 on (t1.t1_id = t2.t1_id and t2.t2_value like('%ab%') )
>> inner join table3 t3 on (t2.t2_id = t3.t2_id and t3.t3_date <= timestamp
>> '2014-08-20')
>> order by t3.t2_id, t3.t3_date desc
>>
>> produces to me this dataset (extract)
>>
>>
>> t1_id       t1_value      t2_id      t2_value           t1_id
>> t3_id         t3_value                             t3_date
>>                t2_id
>> 17098    74127bc80ca759678892c957b6a34fc7    10020
>> 9182c1f48cd008e31d781abc45723a10    17098    857683
>> 0fb6cb380522dd83b6ac4beba2c6e98f    2014-08-19 14:42:00.0    10020
>> 17098    74127bc80ca759678892c957b6a34fc7    10020
>> 9182c1f48cd008e31d781abc45723a10    17098    857088
>> 9c3481bfc0bdea51e62b338a1777cde6    2014-08-19 04:47:00.0    10020
>> 17098    74127bc80ca759678892c957b6a34fc7    10020
>> 9182c1f48cd008e31d781abc45723a10    17098    853208
>> 253dc2083e70dd9e276867f43889973f    2014-08-16 12:07:00.0    10020
>> 17098    74127bc80ca759678892c957b6a34fc7    10020
>> 9182c1f48cd008e31d781abc45723a10    17098    851237
>> 970acf901c4232c178b5dbeda4e44ac8    2014-08-15 03:16:00.0    10020
>> 17098    74127bc80ca759678892c957b6a34fc7    10020
>> 9182c1f48cd008e31d781abc45723a10    17098    847436
>> cb8d52255eaa850f9d7f203092a2ce13    2014-08-12 11:55:00.0    10020
>> 17098    74127bc80ca759678892c957b6a34fc7    10020
>> 9182c1f48cd008e31d781abc45723a10    17098    819339
>> d9e36ad622b1db499b3f623cdd46a811    2014-07-23 23:38:00.0    10020
>> 17098    74127bc80ca759678892c957b6a34fc7    10020
>> 9182c1f48cd008e31d781abc45723a10    17098    818022
>> b9d13239f522e3a22f20d36ea6dab8ad    2014-07-23 01:41:00.0    10020
>> 17098    74127bc80ca759678892c957b6a34fc7    10020
>> 9182c1f48cd008e31d781abc45723a10    17098    803046
>> c36164f3077894a1986c4922dfb632ec    2014-07-12 16:05:00.0    10020
>> 17098    74127bc80ca759678892c957b6a34fc7    10020
>> 9182c1f48cd008e31d781abc45723a10    17098    788129
>> 32c966feab2212a29f86bebbaa6dfec9    2014-07-02 07:28:00.0    10020
>>
>>
>> As you can see, there are many t3_id  per single t2_id.
>>
>> I need the same dataset, but i want only the most recent (one) record
>> per table3, and, of course, directly in the join rule (in production i
>> have a long and complex query).
>>
>> My personal solution (very slow, i can't use it):
>>
>> select * from table1 as t1
>> inner join table2 t2 on (t1.t1_id = t2.t1_id and t2.t2_value like('%ab%') )
>> inner join table3 t3 on (
>> t3.t3_id = (select t3_id from table3 where t2_id = t2.t2_id and t3_date
>> <= timestamp '2014-08-20' order by t3_date desc fetch first 1 rows only)
>> and t3.t2_id = t2.t2_id
>> )
>> order by t3.t2_id, t3.t3_date desc
>>
>>
>> it gives me the right result, but performances are poor........
>>
>> Same bad performance result with a function that performs the table3 query.
>>
>> Suggestions?
>>
>> Thanks to anyone who can answer to me!!!!
> You didn't say what version of PostgreSQL you're using.  The following
> solution works with 9.3.
>
> select *
> from table1 as t1
> join table2 as t2 on t2.t1_id = t1.t1_id and t2.t2_value like '%ab%'
> join lateral (select *
>                from table3
>                where t2_id = t2.t2_id
>                  and t3_date <= '2014-08-20'
>                order by t3_date desc
>                limit 1) as t3 on true
> order by t3.t2_id, t3.t3_date desc;
>
> You will need the following indexes for it to give good performance:
>
> create index on t2 using gist (t2_value gist_trgm_ops);
> create index on t3 using btree (t2_id, t3_date desc);
>
> That first index will require the pg_trgm extension that you can get with:
>
> create extension pg_trgm;
>
>
> If you are not using 9.3, I can give you a (much) more complicated
> version that will perform well, but ideally you'll want to use the above.


Hi Vik,

Sorry for late, i was testing your solution and.....what that performance!!!


I'm using 9.3, i've don't say it, my mistake.


Using join lateral with a lot of analize command to see where put the 
right index, performance becomes amazing!!


Thanks again for the priceless suggestion!!!


.....join lateral.... what a beautiful keywords :-)

Cheers,

Agharta







pgsql-sql by date:

Previous
From: Thomas Kellerer
Date:
Subject: Re: Retrieve most recent 1 record from joined table
Next
From: John Lamb
Date:
Subject: xpath does not seem to escape HTML correctly