Published 31/01/2020 at 11:01pm

Optimising Databases

For a long time I found database optimisation a mystery, until I had a tutorial from a guy called Kenny at Percona. His method of teaching was superb and he gave us some useful and usable concepts to ensure that every query we ever write can be "fully optimised". This blog post aims to help you out in the same way he helped me out. Trust me, see this through and at the end you'll be wanting to get your hands on every query. So, let's get started...

We'll look at the following:

  • How to know and find the queries you need to optimise

  • Interrogating the query to derive a solution

  • Methods to optimise the query and knowing which one(s) to choose

I should note at this point that the purpose of this post is to offer software solutions, but there are other things that could help you. For example, you can change MySQL config options on the server, which may help in your particular scenario. There are also database architecture concepts such as sharding (splitting data into multiple smaller chunks - smaller data pools are quicker to read). This post doesn't cover those; they would need special attention to do them the justice they deserve.

How do you know that something needs to be optimised?

  • Somebody reports a part of the system running slowly

  • Technical Debt - you've previously chosen to leave a query in a state that's not fully optimised and know you need to come back to fix it

  • System usage has increased over time and, with larger amounts of data, it no longer performs the same way

  • A feature in the system is used differently to when it was first introduced and now doesn't perform in the same way

  • Tools are telling you there's something wrong (e.g. Profiling tools, Slow query log, diagnostics tools)

Each of the above is a source of information that will help you find the start of the thread. It should at least give you enough to locate the query in your code.

I've found a few queries, which one do I start with?

Suppose your monitoring surfaces two candidates: on the left, a query that takes around 2 minutes per run and accumulates 100 minutes of execution time; on the right, a much quicker query that runs 9000 times for a total of 30 minutes. Which of these would you optimise first?

It might seem obvious, but actually there is a lot more to consider than just the numbers. Admittedly, the numbers are important and play the biggest part in deciding which one to fix, but you also need to consider the things below, which are harder to quantify.

  • Looking at the numbers, the 2-minute query on the left uses 100 minutes of execution time whereas the right-hand side uses 30 minutes. That is considerably less time, but 9000 queries is a lot. More on that shortly.

  • Secondly, the left-hand query could be a complex query that is already optimised as much as possible.

  • The right-hand query might sit in a critical or frequently used area of the system and therefore warrant more attention

  • The right-hand query is much quicker in comparison, but it runs so many times. What if we could get that down to a negligible amount? You still have to make 9000 connections, which is expensive as you go through the database connection cycle each time.

  • What would happen if those 9000 queries happened in a very short space of time?

There is no clear answer; you will need to use monitoring, statistics, your own knowledge of the system and feedback from users.

I found the problem, how do I know how to fix it?

Before we can know what needs to be done, we need to use some keywords and methods to tell us what our query is doing.

  • Using EXPLAIN
  • Using Handler Statistics

EXPLAIN

  • Prefix any query with the keyword EXPLAIN to investigate that query execution.
  • Shows you MySQL's execution plan for the query

So if we had a query like this

SELECT * FROM table WHERE id = 1;

we can use EXPLAIN like this

EXPLAIN SELECT * FROM table WHERE id = 1;

and it will show us something like this
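
Assuming id is an integer primary key, the output will be along these lines (exact columns vary a little between MySQL versions):

+----+-------------+-------+-------+---------------+---------+---------+-------+------+----------+-------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref   | rows | filtered | Extra |
+----+-------------+-------+-------+---------------+---------+---------+-------+------+----------+-------+
| 1  | SIMPLE      | table | const | PRIMARY       | PRIMARY | 4       | const | 1    | 100.00   | NULL  |
+----+-------------+-------+-------+---------------+---------+---------+-------+------+----------+-------+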

EXPLAIN shows you a whole host of information. It shows MySQL's predicted execution plan: we can find out which indexes are available to it and which one it will use, and we get an estimate of how many rows it will examine and the percentage of those it expects to filter down to. This is really helpful for checking we're on the right lines and ensuring our query is well optimised.

Handler Statistics

  • Use them to check the number of rows read by a query
  • Can be used to validate that you are reading the right number of rows in the right way

We can wrap our query with the following statements to check the number of rows it reads. The first statement resets all the counters back to zero, and the third reads the handler statistics after our query has executed.

FLUSH STATUS;
SELECT * FROM table WHERE id = 1;
SHOW STATUS LIKE 'ha%';

This will output something like this
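
For a primary key lookup like the one above, the interesting part of the output would look roughly like this (only the Handler_read_% counters are shown; the full list contains every Handler_% variable):

+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Handler_read_first    | 0     |
| Handler_read_key      | 1     |
| Handler_read_last     | 0     |
| Handler_read_next     | 0     |
| Handler_read_prev     | 0     |
| Handler_read_rnd      | 0     |
| Handler_read_rnd_next | 0     |
+-----------------------+-------+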

This method is our most powerful tool. It tells us exactly what happened after the query was run, rather than EXPLAIN, which tells us how MySQL plans to execute the query.

There is full documentation here about what each of the variables means. As a general rule, though, we want the number of rows read to be as close as possible to the number of rows we return. This gets more complicated to work out when you have multiple joins or subqueries, but in a simple one-table query, if I were to return 20 of the 3,000 rows in my table, I should expect to read 20 rows if my table was "fully optimised" for that query.

You also generally only want to see Handler_read_key and Handler_read_next with numbers next to them, as these mean rows are being found via an index (or primary key) lookup and then read in index order, rather than the whole table being scanned.

Ways you can optimise your application

  • Rewrite our query (WHERE clause change, UNION, subqueries)
  • Manipulate the query using the keywords FORCE, IGNORE & USE INDEX (sketched below)
  • Add single, composite or partial indexes
  • Caching - Software solution
  • Moving queries out of loops and getting the data upfront
  • Modify the database schema
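
To give a flavour of the index hint keywords, here is a minimal sketch. The index name idx_birth_date is hypothetical, and whether a hint helps at all depends entirely on your schema and data - most of the time the optimiser gets it right without them:

-- Force the optimiser to use a specific index (hypothetical name idx_birth_date)
SELECT * FROM employees FORCE INDEX (idx_birth_date)
WHERE birth_date BETWEEN '1960-01-01' AND '1960-12-31';

-- Or tell it to ignore an index that performs badly for this particular query
SELECT * FROM employees IGNORE INDEX (idx_birth_date)
WHERE birth_date BETWEEN '1960-01-01' AND '1960-12-31';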

Some Practical Examples

Example 1 - single index

In our first example we simply want to find all employees who were born in 1960. There are 23,197 people in our database who match that criteria. If you look at the handler statistics on the left, we are reading every single row in the table, meaning our query is running inefficiently. If we add a single index like the code block below, you can see in the right-hand image that we are only reading the number of rows that we return. You can also see in the EXPLAIN output that we are using the index that we added.

ALTER TABLE `employees` ADD INDEX (`birth_date`);
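
For reference, the query being optimised is shaped roughly like this - note the range on birth_date rather than wrapping the column in a function such as YEAR(), which would prevent the index being used:

SELECT * FROM employees
WHERE birth_date BETWEEN '1960-01-01' AND '1960-12-31';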

Example 2 - composite index

In our 2nd example we are looking for all male employees who were born in 1960. There are 13,782 people who match that criteria. If you look at the handler statistics on the left, we are reading fewer rows than in our first example because of the index added for the first query, but we are still reading over 10 times the number of rows that we return. We can use a composite index, which combines two columns into a single index.

ALTER TABLE `employees` ADD INDEX (`gender`, `birth_date`);
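
The query itself would look roughly like this. Because the composite index leads with gender, MySQL can seek straight to the male employees and then range-scan their birth dates:

SELECT * FROM employees
WHERE gender = 'M'
AND birth_date BETWEEN '1960-01-01' AND '1960-12-31';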

Example 3 - Caching

In this example we are looking to retrieve an employee's hire date using their unique employee number. We should expect to return one row and read one row. You can see from the screenshot that, according to the handler statistics, we are only reading one row, and the EXPLAIN says that we are using the primary key and filtering 100%. There is nothing more we can do in the database here.
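
The query is something along these lines (the employee number 10001 is just an illustrative value):

SELECT hire_date FROM employees WHERE emp_no = 10001;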

However, let's assume that this query (with different parameters) gets called 100,000 times every hour. Creating all those connections would be very intensive on the database and, in this case, unnecessary. The employee number won't change and the hire date is unlikely to change.

We could use a key/value pair cache to store this data (taking care to store it somewhere secure). A simple Google search for "key/value pair cache" will turn up many options. When it comes to retrieving this data, we can read it from the cache and bypass the database connection cycle completely.
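
Here is a minimal sketch of the idea in PHP. The $cache object with its get()/set() methods is hypothetical (Redis, Memcached or similar would fill that role in practice), and $db->query() follows the same style used in the next example:

  function getHireDate($db, $cache, int $empNo) {
      // Hypothetical cache key per employee
      $key = 'employee:' . $empNo . ':hire_date';

      // Try the cache first and skip the database entirely on a hit
      $hireDate = $cache->get($key);
      if ($hireDate !== null) {
          return $hireDate;
      }

      // Cache miss - fall back to the database and remember the result
      $hireDate = $db->query("SELECT hire_date FROM employees WHERE emp_no = ?", [$empNo]);
      $cache->set($key, $hireDate);

      return $hireDate;
  }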

Example 4 - taking queries out of loops

In this (slightly contrived) example we get the dates of birth of employees who have an even employee number. You'll notice in the code block below that we are running the query 8 times when we could run it once. Imagine the list being even longer, hundreds or thousands of entries; we'd needlessly be running extra queries.

  $employees = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16];
  foreach ($employees as $employee) {
      // Only query for employees with an even employee number
      if ($employee % 2 === 0) {
          $dateOfBirth = $db->query("SELECT date_of_birth FROM employees WHERE emp_no = ?", [$employee]);
          ...
      }
  }

We can simplify the code and run one query by firstly applying the modulus filter to the input array and secondly passing that array into the query using an IN statement, removing the foreach completely.

  $employees = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16];
  // Keep only the even employee numbers
  $employees = array_filter($employees, function ($var) { return !($var % 2); });
  $employees = implode(',', $employees);
  $dateOfBirths = $db->query(
      sprintf(
          "SELECT date_of_birth FROM employees WHERE emp_no IN (%s)",
          $employees
      )
  );

Summary

  • Never assume you know the answer to the problem, use evidence to be sure
  • Use EXPLAIN and handler statistics to validate changes
  • Use real data - size, variety and other factors can vastly affect performance. Using something as close to the real thing as possible will ensure your diagnosis is as accurate as possible and gives you a realistic benchmark of how the change will improve your live application.
  • You may need a combination of optimisations to ensure a query is "fully optimised".

© Louis Rickman 2021