Re: Postgres refusing to use >1 core - Mailing list pgsql-performance

From Aren Cambre
Subject Re: Postgres refusing to use >1 core
Date
Msg-id BANLkTim9Gg15bQsaNOpoUKZjmyH-+1vNYQ@mail.gmail.com
In response to Re: Postgres refusing to use >1 core  (Craig Ringer <craig@postnewspapers.com.au>)
Responses Re: Postgres refusing to use >1 core  (David Boreham <david_list@boreham.org>)
Re: Postgres refusing to use >1 core  (Eric McKeeth <eldin00@gmail.com>)
List pgsql-performance
I suspect your app is doing lots of tiny single-row queries instead of
efficiently batching things. It'll be wasting huge amounts of time
waiting for results. Even if every query is individually incredibly
fast, with the number of them you seem to be doing you'll lose a LOT of
time if you loop over lots of little SELECTs.

So here's what's going on.

I have a table of about 12,000,000 traffic tickets written by the Texas Department of Public Safety (TxDPS). Each ticket has a route name and a reference marker. On Interstate highways, reference marker = mile post. On all other roads, from US highways down to Farm to Market roads, the reference marker is based on a grid superimposed over the state. Basically that reference marker increments as the road crosses a grid line, so unless the road is perfectly N-S or E-W, these reference markers are more than a mile apart.

I have a separate table with data from the Texas Department of Transportation (TxDOT). It is a database of almost all the state's reference markers, along with latitude and longitude for each.

I am trying to geolocate each ticket by correlating the ticket's route and reference marker to the corresponding route and reference marker in the TxDOT database. And it's not straightforward for a few reasons:

1. TxDPS and TxDOT formats are different.

TxDPS uses 1-5 to denote route type: 1 = Interstate, 2 = US or any state highway except Farm to Market, 3 = Farm to Market, 4 = county road, 5 = local road. So if the route name is 0071 and route type is 2, it could mean US 71 or TX 71, both of which really exist in Texas but are in different parts of the state.

I haven't proven it yet, but it is possible that no two routes of the same number are in the same county. You wouldn't find both TX 71 and US 71 in the same county.

For now, I am querying the TxDOT database by route type, name, and county, and I may need to repeat the lookup with different candidate route names until I get a match.

In the above example, if the ticket is written for route_name = 0071, route_type = 2, and county = 206, then I need to do searches against the TxDOT database for:
  1. rte_nm = 'US71' AND county_num='206'
  2. rte_nm = 'SH71' AND county_num='206'
  3. rte_nm = 'UA71' AND county_num='206'
  4. rte_nm = 'UP71' AND county_num='206'
  5. ...
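
For what it's worth, I could probably collapse those candidate lookups into a single round trip instead of looping. A rough sketch, with a made-up table name since I haven't shown the real TxDOT schema here:

  SELECT DISTINCT rte_nm
  FROM txdot_markers                                 -- hypothetical table name
  WHERE county_num = '206'
    AND rte_nm IN ('US71', 'SH71', 'UA71', 'UP71');  -- all candidate spellings at once

If the order of preference among the candidates matters, they could be ranked with a CASE expression in an ORDER BY instead of being issued one at a time. I haven't tried this yet.
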
2. Not all TxDPS reference markers correspond to TxDOT reference markers.

Now, if I've matched a route, I have to find the reference marker.

The TxDOT database is pretty good but not 100% complete, so some TxDPS tickets' reference markers may not exist in the TxDOT table. Plus, it's possible that some TxDPS tickets have the wrong marker.

To compensate, I am looking for the closest reference marker along the route that is no more than 50 marker units away in either direction. I've again implemented that with multiple queries, and I don't stop until I find a match. Suppose I am searching for reference marker 256 on TX 71. The queries will be like this:
  1. rte_nm = 'SH71' AND rm = '256' (base marker)
  2. rte_nm = 'SH71' AND rm = '257' (+1)
  3. rte_nm = 'SH71' AND rm = '255' (-1)
  4. rte_nm = 'SH71' AND rm = '258' (+2)
  5. rte_nm = 'SH71' AND rm = '254' (-2)
  6. ...
  7. rte_nm = 'SH71' AND rm = '306' (+50)
  8. rte_nm = 'SH71' AND rm = '206' (-50)
Assuming a matching route name was found in the prior step, the app will run anywhere from 1 to 101 of these queries for each ticket.
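
Same idea here: the whole +/-50 probe could probably be one query that orders by distance from the base marker. A rough sketch, again with a made-up table name, and assuming rm is stored as a number (if it's text, it would need a cast):

  SELECT gid, rm, ABS(rm - 256) AS distance
  FROM txdot_markers                        -- hypothetical table name
  WHERE rte_nm = 'SH71'
    AND rm BETWEEN 256 - 50 AND 256 + 50    -- no more than 50 marker units away
  ORDER BY ABS(rm - 256)                    -- closest marker first
  LIMIT 1;

That would replace up to 101 round trips per ticket with one.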

Assuming steps 1 and 2 above worked out, now I have a reference marker. So I write to a third table that has four columns:
  1. HA_Arrest_Key (varchar(18) that refers back to the TxDPS tickets table)
  2. gid (integer that refers to the unique identity of the reference marker in the TxDOT table)
  3. distance (integer that is the distance, in reference markers, between the marker noted in the TxDPS ticket and the nearest marker found in the TxDOT table)
  4. hasLatLong (Boolean that is true if TxDPS also recorded latitude and longitude for the ticket, presumably from an in-car device. These don't appear to be that accurate, plus a substantial portion of tickets have no lat/long.)
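
In other words, the table looks roughly like this (table name made up for illustration, since I didn't give the real one above):

  CREATE TABLE ticket_marker_match (   -- hypothetical name
      ha_arrest_key varchar(18),       -- refers back to the TxDPS tickets table
      gid           integer,           -- the matched marker's id in the TxDOT table
      distance      integer,           -- marker-unit distance between ticket and matched marker
      haslatlong    boolean            -- true if TxDPS also recorded lat/long for the ticket
  );
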
Right now, I am doing a separate INSERT for each of the 12,000,000 rows going into this table.
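
Presumably I could batch these instead: a multi-row INSERT (or COPY) would cut the round trips down dramatically. A rough sketch, reusing the made-up table name from above and placeholder values:

  INSERT INTO ticket_marker_match (ha_arrest_key, gid, distance, haslatlong)
  VALUES
      ('placeholder-key-001', 48213, 0, false),
      ('placeholder-key-002', 51877, 2, true),
      ('placeholder-key-003', 49310, 1, false);   -- many rows per statement

  -- For the full 12,000,000 rows, COPY ... FROM STDIN would likely be faster still.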

I guess the app is chatty like you suggest? HOWEVER, if I am reading system activity correctly, the master thread that walks through the 12,000,000 tickets appears to have its own Postgres process, and based on how quickly RAM usage shoots up during the first ~60 seconds or so the app runs, it may be reading all of those rows into memory. But I am consulting with the Npgsql developers separately to make sure I am really understanding this correctly. They suspect that the PLINQ stuff (basically "multithreading in a can") may be misreading the workload and therefore not dispatching threads as expected.

By using a producer/consumer model like that you can ensure that thread
1 is always talking to the database, keeping Pg busy, and thread 2 is
always working the CPUs.

Thanks for the example and illustration.

... or you can have each worker fetch its own chunks of rows (getting
rid of the producer/consumer split) using its own connection and just
have lots more workers to handle all the wasted idle time. A
producer/consumer approach will probably be faster, though.

That's what PLINQ is supposed to do. In theory. :-) Working with Npgsql folks to see if something is tripping it up.

Aren
