Re: Query on indexed table too slow - Mailing list pgsql-novice

From Shmagi Kavtaradze
Subject Re: Query on indexed table too slow
Date
Msg-id CAHY6mazDUs-jUC0zoMVOfVkCe-BhPv2k8sZwZavgcznp5cjwGA@mail.gmail.com
In response to Re: Query on indexed table too slow  (Fábio Moreira <fabio@dias.moreira.nom.br>)
List pgsql-novice
Actually I have a dataset with at least 4623 sentences, and each bag-of-words representation will consist of 10,000 characters, i.e. 250 chunks of 40-character strings. I tried creating a table instead of displaying the results. For a small dataset, creating a table takes less time than just selecting, but as the size grows towards 250 chunks, the time for displaying and for just creating the table becomes almost the same.
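
Roughly, what I tried was this (a sketch; the target table name "similarity" is just an example):

CREATE TABLE similarity AS
SELECT
    a.sentenceid AS sent_a,
    b.sentenceid AS sent_b,
    a.chunkid,
    Length(Replace(Cast(a.chunk & b.chunk AS TEXT), '0', ''))::float
        / Length(a.chunk)::float AS score
FROM chunks2 a
INNER JOIN chunks2 b
    ON a.sentenceid < b.sentenceid AND a.chunkid = b.chunkid;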

On Tue, Mar 8, 2016 at 1:20 AM, Fábio Moreira <fabio@dias.moreira.nom.br> wrote:
Hi Shmagi,

Your table is tiny -- tiny enough that the actual working set (the rows of chunks2, without the "doc" field) easily fits in memory; you're unlikely to get a large improvement here, especially considering that, by the looks of the query plan, you have relatively few distinct chunkid values (is that so? Can you post SELECT COUNT(DISTINCT chunkid) FROM chunks2?).

Two things come to mind here:

1) That similarity comparison you're doing looks expensive, with a cast to TEXT, string manipulation, and whatnot. I'm not sure whether PostgreSQL has a native bit-count function to speed this part of the code up; you could implement one in C to see if that improves things, but it might be easier (if indeed you only have 10 bits) to do this instead:

CREATE TEMPORARY TABLE popcount AS
SELECT
  i::BIT(10) AS value,
  -- count the 1 bits by casting to text and stripping the zeros
  Length(Replace(Cast(i::BIT(10) AS TEXT), '0', '')) AS popcount
FROM
  generate_series(0, 1023, 1) AS temp(i);  -- all 1024 possible BIT(10) values

and join against that table (again, it's tiny, so no real need for an index here).
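
Something along these lines (an untested sketch, reusing the shape of your query):

SELECT
    a.sentenceid,
    b.sentenceid,
    a.chunkid,
    p.popcount::float / Length(a.chunk)::float  -- table lookup instead of string ops
FROM chunks2 a
INNER JOIN chunks2 b
    ON a.sentenceid < b.sentenceid AND a.chunkid = b.chunkid
INNER JOIN popcount p
    ON p.value = (a.chunk & b.chunk);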

2) Do you really need all 100M rows of the output? I assume this is some document-similarity metric you're after? What will be done with this data after it is generated?

[]s, Fábio.

On Tue, Mar 8, 2016 at 2:12 AM, Shmagi Kavtaradze <kavtaradze.s@gmail.com> wrote:
I have a table with 46230 rows (1 doc, 4623 sentences, and 10 chunks per sentence):

create table chunks2(
    doc varchar,
    sentenceid int,
    chunkid int,
    chunk bit(10)
);

With the query below I want to compare sentence chunks against other sentence chunks that have the same chunkid:

SELECT
    a.sentenceid,
    b.sentenceid,
    a.chunkid,
    Length(Replace(Cast(a.chunk & b.chunk AS TEXT), '0', ''))::float / Length(a.chunk)::float
FROM chunks2 a
INNER JOIN chunks2 b
    ON a.sentenceid < b.sentenceid AND a.chunkid = b.chunkid;

I ran EXPLAIN ANALYZE on the unindexed table, with a composite index, and with both columns indexed separately, but the time is the same in all cases:
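
The indexes were roughly these (a sketch; the exact names don't matter):

CREATE INDEX ON chunks2 (sentenceid, chunkid);  -- composite
-- or, separately:
CREATE INDEX ON chunks2 (sentenceid);
CREATE INDEX ON chunks2 (chunkid);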

Indexed on (sentenceid, chunkid):

Hash Join  (cost=1335.17..4549476.28 rows=71249559 width=26) (actual time=144.376..1156178.110 rows=106837530 loops=1)
   Hash Cond: (a.chunkid = b.chunkid)
   Join Filter: (a.sentenceid < b.sentenceid)
   Rows Removed by Join Filter: 106883760
   ->  Seq Scan on chunks2 a  (cost=0.00..757.30 rows=46230 width=15) (actual time=0.039..77.275 rows=46230 loops=1)
   ->  Hash  (cost=757.30..757.30 rows=46230 width=15) (actual time=142.954..142.954 rows=46230 loops=1)
         Buckets: 65536  Batches: 1  Memory Usage: 2680kB
         ->  Seq Scan on chunks2 b  (cost=0.00..757.30 rows=46230 width=15) (actual time=0.031..64.340 rows=46230 loops=1)
 Planning time: 1.209 ms
 Execution time: 1212779.012 ms


I know the plans have the same operations and no index was used. Where is my mistake, and how can I speed up the query with an index? Or how do I use indexes efficiently in my case?

--
Fábio Dias Moreira
