How to influence query planning in Postgresql

This post is about how to use Aurora's Query Plan Management feature in RDS.

Aurora's Query Plan Management (QPM) solves a very specific and painful problem: Postgresql's query planner stabbing you in the back with its terrible choices. The difference between a good plan and a bad plan can mean absolutely massive swings in execution time. Something that previously executed in sub-millisecond time can suddenly take minutes or hours (or even an effectively unbounded amount of time!). The bigger problem is that Postgres gives you little recourse for correcting these transient planning failures. In fact, the Postgresql philosophy seems to be that it isn't a problem at all – "you're holding it wrong".

"To design a spacecraft right takes an infinite amount of effort. This is why it's a good idea to design them to operate when some things are wrong." – Akin's Laws of Spacecraft Design

QPM on the other hand lives in the gross and brutal real world. It's about maintaining operations even when things go wrong, because things do go wrong. Theoretical purity and waxing poetic about how things should have been done to avoid the problem don't matter when you're currently dealing with the problem. QPM gives you the tools to punch into the planner and quickly correct what's going wrong. Root cause analysis is for when things are stable.

I love it and I want to share how it works.

How to use Aurora Query Plan Management

Operationally, reacting to a production issue with QPM boils down to three steps:

  1. Identify the problem query
  2. Connect to the DB and force a good plan (via hinting, flags, child sacrifice, etc.)
  3. Use QPM to lock in the good plan as Approved and the bad plan(s) as Rejected

That's actually it! Once you've Approved/Rejected the plans, the work is done. Nothing to deploy. Nothing to change. This is what makes it so delightful. The time from detection to correction is measured in minutes. While it's no Silver Bullet, it is pretty dang close. It gets you back to a stable state so that you can do your deep dive into the why of the problem without the "prod is down" time pressure hanging over your head.

The rest of this doc is a high-level overview of what those steps look like in practice.

Note: How to enable QPM and which bits to flip in your parameter groups are ignored here because they're covered extensively in the AWS docs. The remainder assumes you've got a cluster set up with the apg_plan_mgmt extension running.
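
For reference, the moving pieces look roughly like this. Treat it as a sketch rather than a setup guide – the parameter names come from the AWS docs, but the values you want (and whether you set them per session or in the cluster parameter group) depend on your environment:

-- apg_plan_mgmt must be in shared_preload_libraries (cluster parameter group, needs a restart),
-- then the extension gets created in each database you want managed:
CREATE EXTENSION IF NOT EXISTS apg_plan_mgmt;
-- capture plans as queries are executed
set apg_plan_mgmt.capture_plan_baselines = automatic;
-- enforce the Approved/Rejected statuses at planning time
set apg_plan_mgmt.use_plan_baselines = true;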

Setting up the example: Making Postgresql plan poorly

"Why the fuck isn't Postgres using my index?!" – literally everyone at some point

As a motivating example, we're going to tackle the super common problem of Postgresql suddenly flipping from a plan which uses your index (and runs quickly) to one which doesn't (and runs slowly).

One of the reasons this tends to happen is that your table statistics get out of whack from a large modification of data. We're going to simulate that scenario by forcing pg to have really, really terrible stats. Then we'll swap back to the original good plan by using QPM.

In short, we're going to make this query:

explain (analyze) select * from users where customer_id < 'ID_HERE'::uuid

  1. Slow (before indexing)
  2. Then fast (after indexing)
  3. Then slow again (due to a (forced) plan flip)
  4. Then fast again (by rejecting the slow plan)

Creating a test table

To demo how QPM can fix planning problems, we have to make a table that has some planning problems. customer_id is the main column we'll manipulate to force our plan flips. The other columns are there just for statistical interest.

CREATE TABLE users (
    account_id uuid primary key,
    customer_id uuid,
    num bigint
);

Next we fill it up with 10m UUIDs. We don't actually need all those records to demo the behavior, but if you're following along at home, it has the nice effect of making you "feel" the plan flip via its extra execution time.

INSERT INTO users SELECT gen_random_uuid(),gen_random_uuid(), seq
FROM generate_series (1,10000000) as seq;

A brief digression into table stats

When you run ANALYZE, what it's doing behind the scenes is computing statistics about the distribution of data in your tables. One of those stats is called histogram_bounds: a list of boundary values, built from a random sample of the table, that splits the column's data into buckets of roughly equal population. When you issue a query with a range condition, for instance:

select * from foo where bar > 1;

The planner uses that histogram to figure out which bucket the value falls into and estimates how many rows the condition will return. The manual has a detailed overview of how selectivity is computed.

What's super neat is that we can see what's in that histogram via the pg_stats view:

SELECT array_length(histogram_bounds, 1) histogram_size,
       tablename,
       attname,
       histogram_bounds
FROM pg_stats WHERE tablename = 'users';

out: (output omitted – with the default statistics target, the histogram_bounds array for customer_id contains 101 boundary values)

And we can query for those values to see how well the estimates line up with the rows actually returned. If we run with the first value in the histogram, which in my case is 00001c0f-b250-442c-b678-d2f84dec4ca0:

explain (analyze) select * from users where customer_id < '00001c0f-b250-442c-b678-d2f84dec4ca0'::uuid

We get something pretty unexpected – it's off by several orders of magnitude!

"Seq Scan on users  (cost=0.00..208335.00 rows=50000 width=40) (actual time=22.203..784.207 rows=17 loops=1)"

Postgres estimated that it'd find 50,000 rows, but it only found 17! The edges of the distribution seem to be reliably unreliable with how far off their estimates will be. Thankfully, this doesn't hold for everything in the histogram. As soon as you move away from the edges, row estimates quickly snap back to reality – usually within 2x even for random data types like UUIDs.
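
As for where that 50,000 number comes from, here's a plausible back-of-the-envelope reading based on the manual's row estimation docs (a sketch, not something I've traced through the source): UUIDs can't be linearly interpolated the way numbers can, so the planner falls back to assuming the value sits halfway through whichever bucket it lands in.

-- ~100 equal-population buckets (101 bounds), and our value lands in the very first bucket:
--   selectivity ≈ (buckets entirely below + assumed position within the bucket) / buckets
--               ≈ (0 + 0.5) / 100 = 0.005
--   estimated rows ≈ 0.005 * 10,000,000 = 50,000
-- actual rows: 17 – hence the multiple-orders-of-magnitude miss at the edge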

Why does all this matter? Because these statistics are what we'll exploit in order to get Postgresql into a state where it flips away from using an index. Speaking of which...

Indexing our data

So far we've just been doing table scans. Now we need to add our index so that we can have two different performance flavors.

create index on users (customer_id); 

And check the performance improvement:

explain (analyze) select * from users where customer_id < '00011a0e-adf2-4305-bc24-77f53d147d23'::uuid

out:

"Bitmap Heap Scan on users  (cost=1159.93..77243.43 rows=50000 width=40) (actual time=0.029..0.205 rows=148 loops=1)"
"  Recheck Cond: (customer_id < '00011a0e-adf2-4305-bc24-77f53d147d23'::uuid)"
"  Heap Blocks: exact=148"
"  ->  Bitmap Index Scan on users_customer_id_idx  (cost=0.00..1147.43 rows=50000 width=0) (actual time=0.010..0.011 rows=148 loops=1)"
"        Index Cond: (customer_id < '00011a0e-adf2-4305-bc24-77f53d147d23'::uuid)"
"Planning Time: 0.184 ms"
"Execution Time: 0.223 ms"

Less than a millisecond. That's pretty dang fast.

We're almost there now. We've indexed our data. Our access patterns are fast. All is well.

Simulating bad stats

Now we're going to mangle the statistics to simulate what can happen after a bad data load. We know from the experiments above that, even with the default statistics target, row estimates can still be off by more than 1000x. We're going to make that even worse – so bad that Postgres will assume there's nothing to be gained by using the index.

-- we're setting the statistics target to the bare minimum
alter table users alter column customer_id set statistics 1;
-- Then running analyze again to recompute the histogram
analyze users;   

If we look up our stats now:

SELECT array_length(histogram_bounds, 1) histogram_size,
       tablename,
       attname,
       histogram_bounds
FROM pg_stats WHERE tablename = 'users';

The histogram for our customer_id column has gone from 101 items down to just 2!

Now let's see what the planner does given these new statistics:

explain (analyze) select * from users where customer_id < '0a4a6ba7-e6be-4696-a33e-6935c9d3e2d1'::uuid

It flipped! We're back to doing a sequential scan! Our execution time has gone from <1ms to almost a full second!

"Seq Scan on users  (cost=0.00..208335.00 rows=5000040 width=40) (actual time=0.006..800.367 rows=402247 loops=1)"
"  Filter: (customer_id < '0a4a6ba7-e6be-4696-a33e-6935c9d3e2d1'::uuid)"
"  Rows Removed by Filter: 9597753"
"Planning Time: 0.324 ms"
"Execution Time: 817.803 ms"

Note: you may see different behavior depending on what kind of mood Postgresql is ultimately in. When in doubt, you can always force the plan change by setting random_page_cost to something absurdly high, so that the planner falls back to a sequential scan no matter what.
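
Something along these lines usually does the trick (a sketch – the exact number is arbitrary, it just has to make index access look prohibitively expensive):

set random_page_cost = 1000000;  -- make random (index) page reads look ruinously expensive
explain (analyze) select * from users where customer_id < '00001c0f-b250-442c-b678-d2f84dec4ca0'::uuid;
reset random_page_cost;          -- put the knob back once you've seen the flip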

The stage is now set. We've simulated a dreaded plan flip in our database. For the rest of the example, we'll pretend we don't know why it flipped (i.e., that we mangled the stats) – that's for figuring out after prod is stabilized. This is now about the operations side. You've been paged. Requests are timing out. You've pinned it down to this particular query.

Here's how you fix it with QPM.

Query Plan Management

The first thing we need to do is tell QPM we want it to start keeping track of things.

Note: It's only off here for the sake of the example, so we can show how queries get picked up. In reality, you should already have it enabled and capturing plans automatically.

set apg_plan_mgmt.capture_plan_baselines = on; 

This tells QPM to keep track of new plans as Postgresql comes up with them. Now that it's enabled, anytime a query is run more than once, its plan will be recorded in QPM.

So, let's run our query again two more times.

select * from users where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid
-- run it again so it gets picked up by QPM 
select * from users where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid

And now we can check out what QPM knows about the planning via its dba_plans view:

SELECT sql_hash, plan_hash, enabled, status, sql_text, plan_outline, queryid 
FROM apg_plan_mgmt.dba_plans;

And we find the plan for the query we just did!

-[ RECORD 1 ]+------------------------------------------------------------------
sql_hash     | 1518555144
plan_hash    | -1889604132
enabled      | t
status       | Approved
sql_text     | select *                                                         +
             | from users                                                       +
             | where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid;
plan_outline | {                                                                +
             |   "Fmt": "01.00",                                                +
             |   "Outl": {                                                      +
             |     "Op": "SScan",                                               +
             |     "QB": 1,                                                     +
             |     "S": "public",                                               +
             |     "Tbl": "users",                                              +
             |     "Rid": 1                                                     +
             |   }                                                              +
             | }
queryid      | 1340028882142985320

The AWS docs have a full breakdown of what every field in the dba_plans view means, so I'll just highlight the most useful ones here:

  • sql_hash - this is a hash of the sql statement after being normalized. How normalization works and what influences this hash is super duper important to understand, so a thorough read of the docs (and then experimentation to confirm your understanding) is encouraged. It's easy to fool yourself and end up accidentally managing the plan of a different query than the one you expect! (There's a quick sanity-check query just after this list.)
  • plan_hash - like the SQL hash, this is the thumbprint for the plan itself. This is used primarily as an identifier when making function calls to QPM
  • enabled - what it says on the tin – whether or not this plan is enabled
  • status - this is the main variable we manipulate as part of managing plans with QPM. It can be Approved, Unapproved, Rejected, or Preferred
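
Speaking of not fooling yourself: before approving or rejecting anything, it's worth confirming which normalized statement a given sql_hash actually belongs to. One way is to just search the sql_text column of the view we queried above (the ILIKE pattern here is whatever uniquely identifies your query):

-- find the managed statement by its text instead of eyeballing hashes
SELECT sql_hash, plan_hash, status, left(sql_text, 80) AS sql_snippet
FROM apg_plan_mgmt.dba_plans
WHERE sql_text ILIKE '%from users%customer_id%';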

Now that QPM is tracking our plans, we have to enact step 2 of our 3-step plan: force the planner to use a good plan at least twice so that QPM can pick it up.

Forcing a good plan from Postgresql

This part can be more art than science. The goal here isn't understanding why Postgresql started making these planning choices – that's for later, when we have more time. The goal at this phase is to do whatever we can to get back to a good plan so that we can resolve our production issue.

In practice, that means following a general approach of running the problematic query with explain analyze, and then speculatively flipping the planner's enable_* flags (enable_seqscan, enable_nestloop, and friends) to see if something useful happens.
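
That poking around tends to look something like this (a sketch – which flags are worth flipping depends entirely on the plan shape you're fighting):

-- flip one flag at a time, re-run EXPLAIN (ANALYZE), and see if the plan improves
set enable_seqscan = off;       -- discourage full table scans
-- set enable_nestloop = off;   -- or: discourage nested-loop joins
-- set enable_hashjoin = off;   -- or: discourage hash joins
explain (analyze, buffers) select * from users where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid;
-- when you're done experimenting (and QPM has captured the good plan):
reset enable_seqscan;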

For our case, we "don't know" what happened – we only know we've been paged and our app's performance has tanked. So, we run our query with explain:

EXPLAIN (ANALYZE, BUFFERS) select * 
from users 
where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid

which outputs:

"Seq Scan on users  (cost=10000000000.00..10000208335.00 rows=5000040 width=40) (actual time=1.270..781.934 rows=292 loops=1)"
"  Filter: (customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid)"
"  Rows Removed by Filter: 9999708"
"  Buffers: shared hit=83334"
"Planning:"
"  Buffers: shared hit=12"
"Planning Time: 0.141 ms"
"Execution Time: 781.988 ms"

While our plan is super simple, for anything involving more than a few lines you'll definitely want to visualize the plan with a tool like explain.depesz.com.

Row estimation being off by a factor of 17,000x is what we'd call "a red flag". Something is funky in the database. We also know we've got an index on that column. So, a safe first pass here would be just disabling sequential scans and seeing what happens.

You can find a full list of all the available planner options here

set enable_seqscan = off;

Now we run our query again and see what happened.

EXPLAIN (ANALYZE, BUFFERS) select * 
from users 
where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid

Aha!

"Index Scan using users_customer_id_idx on users  (cost=0.43..190087.87 rows=5000040 width=40) (actual time=0.021..0.323 rows=292 loops=1)"
"  Index Cond: (customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid)"
"  Buffers: shared hit=14"
"Planning:"
"  Buffers: shared hit=10"
"Planning Time: 0.183 ms"
"Execution Time: 0.357 ms"

We're back to our index scan and original performance. Progress!

Now, we need to make sure that the query is visible to QPM, and that means running it by itself two more times.

-- one
select * from users where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid
-- two 
select * from users where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid

Now we again check what QPM knows about:

SELECT sql_hash, plan_hash, enabled, status, sql_text, plan_outline, queryid 
FROM apg_plan_mgmt.dba_plans;

This time we find two records!

-[ RECORD 1 ]+------------------------------------------------------------------
sql_hash     | 1518555144
plan_hash    | 2091457234
enabled      | t
status       | Unapproved
sql_text     | select *                                                         +
             | from users                                                       +
             | where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid;
plan_outline | {                                                                +
             |   "Fmt": "01.00",                                                +
             |   "Outl": {                                                      +
             |     "Op": "IScan",                                               +
             |     "QB": 1,                                                     +
             |     "S": "public",                                               +
             |     "Idx": "users_customer_id_idx",                              +
             |     "Tbl": "users",                                              +
             |     "Rid": 1                                                     +
             |   }                                                              +
             | }
queryid      | 619019368
-[ RECORD 2 ]+------------------------------------------------------------------
sql_hash     | 1518555144
plan_hash    | -1889604132
enabled      | t
status       | Approved
sql_text     | select *                                                         +
             | from users                                                       +
             | where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid;
plan_outline | {                                                                +
             |   "Fmt": "01.00",                                                +
             |   "Outl": {                                                      +
             |     "Op": "SScan",                                               +
             |     "QB": 1,                                                     +
             |     "S": "public",                                               +
             |     "Tbl": "users",                                              +
             |     "Rid": 1                                                     +
             |   }                                                              +
             | }
queryid      | 619019368

One is our slow, sequentially scanning plan. This is the one that Postgresql currently thinks is optimal. The other is our fast index plan, which is the one we just forced.

What we did here is the bulk of the work. Getting back to a sane plan can often take some trial and error, but once you've got the good plan firing, the battle is won. All that's left is getting the planner to comply.

Rejecting the bad plan and locking in the good plan

The last thing to do is to tell QPM which of the two available plans is the one we want to use. This is done with a function called apg_plan_mgmt.set_plan_status. You can check out all the available functions here. This is where sql_hash and plan_hash come into play.

We want to Reject the current scan-based plan so that it's never used. From the output above, we can see that the sql and plan hashes for the slow plan are 1518555144 and -1889604132, respectively. To reject the plan, all we need to do is feed those to the set_plan_status function:

select apg_plan_mgmt.set_plan_status(1518555144, -1889604132, 'Rejected')

Finally, we want to Approve the plan we know performs well – our index-based one. From the output above, we can see that it has a sql and plan hash of 1518555144 and 2091457234. So, we again feed those to set_plan_status, but this time setting the status to Approved:

select apg_plan_mgmt.set_plan_status(1518555144, 2091457234, 'Approved')

And now we can find those changes reflected in QPM:

-[ RECORD 1 ]+------------------------------------------------------------------
sql_hash     | 1518555144
plan_hash    | -1889604132
enabled      | t
status       | Rejected
sql_text     | select *                                                         +
             | from users                                                       +
             | where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid;
plan_outline | {                                                                +
             |   "Fmt": "01.00",                                                +
             |   "Outl": {                                                      +
             |     "Op": "SScan",                                               +
             |     "QB": 1,                                                     +
             |     "S": "public",                                               +
             |     "Tbl": "users",                                              +
             |     "Rid": 1                                                     +
             |   }                                                              +
             | }
queryid      | 619019368
-[ RECORD 2 ]+------------------------------------------------------------------
sql_hash     | 1518555144
plan_hash    | 2091457234
enabled      | t
status       | Approved
sql_text     | select *                                                         +
             | from users                                                       +
             | where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid;
plan_outline | {                                                                +
             |   "Fmt": "01.00",                                                +
             |   "Outl": {                                                      +
             |     "Op": "IScan",                                               +
             |     "QB": 1,                                                     +
             |     "S": "public",                                               +
             |     "Idx": "users_customer_id_idx",                              +
             |     "Tbl": "users",                                              +
             |     "Rid": 1                                                     +
             |   }                                                              +
             | }
queryid      | 619019368

The bad plan is rejected, the good plan is approved!

Now check this out! If we query again, in addition to the planner now using the plan we want, we can see that a new piece of information is returned.

EXPLAIN (ANALYZE, BUFFERS) select * 
from users 
where customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid
"Index Scan using users_customer_id_idx on users  (cost=0.43..190087.87 rows=5000040 width=40) (actual time=0.018..0.334 rows=292 loops=1)"
"  Index Cond: (customer_id < '00020eb1-5492-4bbd-9ef3-b0388bbf5f63'::uuid)"
"  Buffers: shared hit=14"
"Planning:"
"  Buffers: shared hit=27"
"Planning Time: 0.307 ms"
"Execution Time: 0.360 ms"
"Note: An Approved plan was used instead of the minimum cost plan."
-------------------------
^^ !LOOK AT THIS LINE! ^^
-------------------------

It notes that "An Approved plan was used instead of the minimum cost plan"

This is what's so cool! Even though the planner still thinks the scan option is the superior choice, QPM overrides that and convinces it to use the "inferior" index plan!

And that's it!

All Done!

That's the whole process. No deployments. No code or query changes. All we have to do is what we've always wanted to do: tell postgresql to not use a crummy plan. QPM gives us that capability.
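
One bit of housekeeping if you've been following along at home: you'll want to undo the damage we did to the demo table's statistics, plus any session flags still hanging around. Roughly (a sketch; -1 puts the column back on default_statistics_target):

-- restore the default statistics target for the column and rebuild the histogram
alter table users alter column customer_id set statistics -1;
analyze users;
-- clear any leftover session settings from the experiments
reset enable_seqscan;
reset random_page_cost;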

Limitations and considerations

QPM is no Silver Bullet (even though it can feel like one). Ultimately, it exists outside of the core planner as an extension, so the best it can ever be is a polite suggestion. It can only say, of the plans that are currently available, which one to favor or reject. If postgresql doesn't consider the approved plan to be a viable choice, it'll blissfully ignore QPM's recommendation.

QPM can also be finicky to use because of how it handles sql normalization and plan hashing. The exact same SQL statement can produce different hashes depending on how its arguments get bound by the client. This leads to baffling behavior where it can seem like QPM isn't working at all. In reality it's working fine; you're just telling it to control a completely different query from the one you're actually trying to influence.

Related to the above, the tooling is still primitive. Everything is managed from a sql prompt, which comes with all the normal limitations. When all you've got is the humble table for viewing data, it can be a big challenge just to find the right plan id / hash so that you know you're looking at the right thing. The more queries being tracked by QPM, the harder this becomes. It can take a lot of flipping back and forth between different tables, and tediously searching through data that doesn't fit on screen, to pin down what's actually under QPM's control versus what isn't. Better tooling, like integration with the Enhanced Monitoring Console (hint hint / nudge nudge), would make managing plans much easier.

Lastly, if you're making QPM a core part of your operations strategy, a "how to" in a runbook somewhere isn't enough. This all needs to be practiced ahead of time. QPM's quirks need to be experienced firsthand so that you're prepared when you're under the gun. So run drills where one team member forces a bad plan, and another has to fix it.

Footnotes & Useful Links