Scaling PostgreSQL Performance Using Table Partitioning
Here’s a classic scenario:
You work on a project that stores transactional data in a database. The application gets deployed to production, and early on the performance is great: selecting data from the database is snappy and insert latency goes unnoticed. Over the following days, weeks, or months the database grows and queries slow down.
There are various approaches that can help make your application and database run faster. A Database Administrator (DBA) will take a look and see that the database is tuned. They offer suggestions to add certain indexes, move logging to separate disk partitions, adjust database engine parameters, and verify that the database is healthy. You can also add Provisioned IOPS on EBS volumes or obtain faster (not just separate) disk partitions. This will buy you more time and may resolve the issues to a degree.
At a certain point you realize the data in the database is the bottleneck.
In many application databases, some of the data is historical information that becomes less important after a certain amount of time. If you can get rid of this data, your queries will run faster, backups will complete quicker, and you'll use less disk space. You could run a slew of DELETE statements, but that causes a ton of logging, consumes database engine resources, and the deleted data is gone forever. So how do we get rid of old data efficiently without losing it forever?
Table Partitioning
Table partitioning is a good solution to this very problem. You take one massive table and split it into many smaller tables - these smaller tables are called partitions or child tables. Operations such as backups, SELECTs, and DELETEs can be performed against individual partitions or against all of the partitions. Partitions can also be dropped or exported as a single transaction requiring minimal logging.
Terminology
Let’s start with some terminology that you will see in the remainder of this blog.
Master Table
Also referred to as a Master Partition Table, this table is the template from which child tables are created. This is a normal table, but it doesn’t contain any data and requires a trigger (more on this later). There is a one-to-many relationship between a master table and its child tables: one master table, many child tables.
Child Table
These tables inherit their structure (in other words, their Data Definition Language, or DDL for short) from the master table and belong to a single master table. The child tables contain all of the data. These tables are also referred to as Table Partitions.
Partition Function
A partition function is a stored procedure (here, a trigger function) that determines which child table should accept a new record. The master table has a trigger which calls the partition function. There are two typical methodologies for routing records to child tables:
- By Date Values - An example of this is purchase order date. As purchase orders are added to the master table, this function is called by the trigger. If you create partitions by day, each child partition will represent all the purchase orders entered for a particular day. This method is covered by this article.
- By Fixed Values - An example of this is a geographic location, such as US states. In this case, you can have 50 child tables, one for each state. As INSERTs are fired against the master, the partition function sorts each new row into one of the child tables. This methodology isn’t covered by this article because it doesn’t help us remove older data.
Let’s Try Configuring Table Partitions!
The example solution demonstrates the following:
- Automatically create database table partitions based on date
- Schedule the export of older table partitions to compressed flat files in the OS
- Drop old table partitions without impacting performance
- Reload older partitions so they are available to the master table
Take the time to let that last piece of the solution sink in. Most of the documentation on partitioning I’ve read simply uses partitioning to keep the database lean and mean; if you needed older data, you’d have to keep an old database backup. I’ll show you how to keep the database lean and mean through partitioning while still having the data available when you need it, without db backups.
Conventions
Commands run in a shell as the root user will be prefixed by:
root#
Commands run in a shell as a non-root user, e.g. postgres, will be prefixed by:
postgres$
Commands run within the PostgreSQL database system will look as follows:
my_database>
What you’ll need
The examples below use PostgreSQL 9.2 on Engine Yard. You will also need to download and build the plsh extension (the example below fetches it with curl).
Summary of Steps
Here’s a summary of what we are going to do:
- Create a master table
- Create a trigger function
- Create a table trigger
- Create a partition maintenance function
- Schedule the partition maintenance
- Reload old partitions as needed
Create a Master Table
For this example we’ll be creating a table to store basic performance data (cpu, memory, disk) about a group of servers (server_id) every minute (time).
CREATE TABLE myschema.server_master
(
id BIGSERIAL NOT NULL,
server_id BIGINT,
cpu REAL,
memory BIGINT,
disk TEXT,
"time" BIGINT,
PRIMARY KEY (id)
);
Notice that in the code above the column name time is in quotes. This is because time is a keyword in PostgreSQL, so quoting it avoids ambiguity. For more information on date/time keywords and functions visit the PostgreSQL Manual.
Create Trigger Function
The trigger function below does the following:
- Creates child tables (partitions) with dynamically generated “CREATE TABLE” statements if the required child table does not already exist.
- Partitions (child tables) are determined by the values in the “time” column, with one partition per calendar day.
- Time is stored in epoch format, an integer representation of the number of seconds since 1970-01-01 00:00:00+00. More information on epoch time can be found at http://en.wikipedia.org/wiki/Unix_time
- Each day has 86400 seconds, so midnight for a particular day is an epoch value divisible by 86400 without a remainder (see the quick check just after this list).
- The name of each child table will be in the format myschema.server_YYYY-MM-DD.
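For example, here is a quick sanity check of that midnight arithmetic (the epoch value is illustrative and reappears in the inserts later in this article; integer division between integers truncates in PostgreSQL):
my_database> SELECT (1359457620 / 86400) * 86400 AS midnight_epoch; -- 1359417600
my_database> SELECT to_timestamp((1359457620 / 86400) * 86400);     -- 2013-01-29 00:00:00+00 (shown here in UTC)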
CREATE OR REPLACE FUNCTION
myschema.server_partition_function()
RETURNS TRIGGER AS
$BODY$
DECLARE
_new_time int;
_tablename text;
_startdate text;
_enddate text;
_result record;
BEGIN
--Takes the current inbound "time" value and determines when midnight is for the given date
_new_time := ((NEW."time"/86400)::int)*86400;
_startdate := to_char(to_timestamp(_new_time), 'YYYY-MM-DD');
_tablename := 'server_'||_startdate;
-- Check if the partition needed for the current record exists
PERFORM 1
FROM pg_catalog.pg_class c
JOIN pg_catalog.pg_namespace n ON n.oid = c.relnamespace
WHERE c.relkind = 'r'
AND c.relname = _tablename
AND n.nspname = 'myschema';
-- If the partition needed does not yet exist, then we create it:
-- Note that || is string concatenation (joining two strings to make one)
IF NOT FOUND THEN
_enddate := _startdate::timestamp + INTERVAL '1 day';
EXECUTE 'CREATE TABLE myschema.' || quote_ident(_tablename) || ' (
CHECK ( "time" >= EXTRACT(EPOCH FROM DATE ' || quote_literal(_startdate) || ')
AND "time" < EXTRACT(EPOCH FROM DATE ' || quote_literal(_enddate) || ')
)
) INHERITS (myschema.server_master)';
-- Table permissions are not inherited from the parent.
-- If permissions change on the master be sure to change them on the child also.
EXECUTE 'ALTER TABLE myschema.' || quote_ident(_tablename) || ' OWNER TO postgres';
EXECUTE 'GRANT ALL ON TABLE myschema.' || quote_ident(_tablename) || ' TO my_role';
-- Indexes are defined per child, so we assign a default index that uses the partition columns
EXECUTE 'CREATE INDEX ' || quote_ident(_tablename||'_indx1') || ' ON myschema.' || quote_ident(_tablename) || ' (time, id)';
END IF;
-- Insert the current record into the correct partition, which we are sure will now exist.
EXECUTE 'INSERT INTO myschema.' || quote_ident(_tablename) || ' VALUES ($1.*)' USING NEW;
RETURN NULL;
END;
$BODY$
LANGUAGE plpgsql;
Create a Table Trigger
Now that the partition function has been created, an INSERT trigger needs to be added to the master table; it will call the partition function when new records are inserted.
CREATE TRIGGER server_master_trigger
BEFORE INSERT ON myschema.server_master
FOR EACH ROW EXECUTE PROCEDURE myschema.server_partition_function();
At this point you can start inserting rows against the master table and see them routed into the correct child tables.
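For example, a quick check (the values are illustrative; the epoch value corresponds to 2013-01-29 UTC, so the child table name follows from the trigger function above):
postgres$ psql my_database
my_database> INSERT INTO myschema.server_master (server_id, cpu, memory, disk, "time") VALUES (1, 10.5, 2048000, '{sda1:250000}', 1359457620);
my_database> SELECT tablename FROM pg_tables WHERE schemaname = 'myschema' ORDER BY tablename; -- should list server_2013-01-29 alongside server_master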
Create Partition Maintenance Function
Now let’s put the master table on a diet. The function below was built generically to handle the partition maintenance, which is why you won’t see any direct references to the server tables in its body.
How it works
- All of the child tables for a particular master table are scanned, looking for any partitions whose names correspond to a date older than 15 days ago.
- Each “too old” partition is exported/dumped to the local file system by calling the db function myschema.export_partition(text, text). More on this in the next section.
- If and only if the export to the local filesystem was successful is the child table dropped.
- This function assumes that the folder /db/partition_dump exists on the local db server. More on this in the next section. If you are wondering where the partitions are exported to, this is where you should look!
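As a quick illustration of how the function below picks the date out of a child table name (the literal is just an example):
my_database> SELECT substring('server_2013-01-29' from '[0-9-]*$'); -- returns 2013-01-29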
CREATE OR REPLACE FUNCTION
myschema.partition_maintenance(in_tablename_prefix text, in_master_tablename text, in_asof date)
RETURNS text AS
$BODY$
DECLARE
_result record;
_current_time_without_special_characters text;
_out_filename text;
_return_message text;
BEGIN
-- Get the current date in YYYYMMDD_HHMMSS.ssssss format
_current_time_without_special_characters :=
REPLACE(REPLACE(REPLACE(NOW()::TIMESTAMP WITHOUT TIME ZONE::TEXT, '-', ''), ':', ''), ' ', '_');
-- Initialize the return_message to empty to indicate no errors hit
_return_message := '';
--Validate input to function
IF in_tablename_prefix IS NULL THEN
RETURN 'Child table name prefix must be provided'::text;
ELSIF in_master_tablename IS NULL THEN
RETURN 'Master table name must be provided'::text;
ELSIF in_asof IS NULL THEN
RETURN 'You must provide the as-of date, NOW() is the typical value';
END IF;
FOR _result IN SELECT * FROM pg_tables WHERE schemaname='myschema' LOOP
IF POSITION(in_tablename_prefix IN _result.tablename) > 0
   AND char_length(substring(_result.tablename FROM '[0-9-]*$')) <> 0
   AND (in_asof - INTERVAL '15 days') > to_timestamp(substring(_result.tablename FROM '[0-9-]*$'), 'YYYY-MM-DD') THEN
_out_filename := '/db/partition_dump/' || _result.tablename || '_' || _current_time_without_special_characters || '.sql.gz';
BEGIN
-- Call function export_partition(child_table text) to export the file
PERFORM myschema.export_partition(_result.tablename::text, _out_filename::text);
-- If the export was successful drop the child partition
EXECUTE 'DROP TABLE myschema.' || quote_ident(_result.tablename);
_return_message := _return_message || 'Dumped table: ' || _result.tablename::text || ', ';
RAISE NOTICE 'Dumped table %', _result.tablename::text;
EXCEPTION WHEN OTHERS THEN
_return_message := _return_message || 'ERROR dumping table: ' || _result.tablename::text || ', ';
RAISE NOTICE 'ERROR DUMPING %', _result.tablename::text;
END;
END IF;
END LOOP;
RETURN _return_message || 'Done'::text;
END;
$BODY$
LANGUAGE plpgsql VOLATILE COST 100;
ALTER FUNCTION myschema.partition_maintenance(text, text, date) OWNER TO postgres;
GRANT EXECUTE ON FUNCTION myschema.partition_maintenance(text, text, date) TO postgres;
GRANT EXECUTE ON FUNCTION myschema.partition_maintenance(text, text, date) TO my_role;
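With the function and grants in place, you can also run maintenance by hand; this is the same call the scheduled job below will make:
my_database> SELECT myschema.partition_maintenance('server'::text, 'server_master'::text, now()::date);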
The function below is again generic: you pass in the name of the table you would like to export to the operating system and the name of the compressed file that will contain the exported table.
-- Helper Function for partition maintenance
CREATE OR REPLACE FUNCTION myschema.export_partition(text, text) RETURNS text AS
$BASH$
#!/bin/bash
tablename=${1}
filename=${2}
# NOTE: pg_dump must be available in the path.
pg_dump -U postgres -t myschema."${tablename}" my_database | gzip -c > "${filename}"
$BASH$
LANGUAGE plsh;
ALTER FUNCTION myschema.export_partition(text, text) OWNER TO postgres;
GRANT EXECUTE ON FUNCTION myschema.export_partition(text, text) TO postgres;
GRANT EXECUTE ON FUNCTION myschema.export_partition(text, text) TO my_role;
Note that the code above uses the plsh language extension, which is explained below. Also note that on our systems bash is located at /bin/bash; this may vary.
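Once plsh is installed (see the next section), you can sanity-check the export function by hand; the child table name and output path here are illustrative:
my_database> SELECT myschema.export_partition('server_2013-01-29'::text, '/db/partition_dump/server_2013-01-29_manual_test.sql.gz'::text);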
That Was Fun. Where Are We?
Almost there. So far we’ve made all the necessary changes within the database to accommodate table partitions:
- Created a new master table
- Created the trigger and trigger function for the master table
- Created partition maintenance functions to export older partitions to the os and drop the old partitions
You could stop here if you’d like and proceed to the section “Now Let’s See it in Action”, but be sure to continue below to configure automated maintenance.
What we have left to do for automated maintenance is:
- Install the plsh extension
- Set up the OS to store partition dumps
- Create a cron job to automate calling the partition maintenance function
Configure PostgreSQL and OS
Enabling PLSH in PostgreSQL
The PLSH extension is needed for PostgreSQL to execute shell commands. This is used by myschema.export_partition(text, text) to dynamically create a shell string that executes pg_dump. Starting as root, execute the following commands:
root# cd /usr/local/src # Build the extension .so files for postgresql
root# curl -L https://github.com/petere/plsh/archive/9a429a4bb9ed98e80d12a931f90458a712d0adbd.tar.gz -o plsh.tar.gz
root# tar zxf plsh.tar.gz
root# cd plsh-*/
root# make all install # Note that the postgres header files must be available
root# su - postgres # Or whatever account postgresql is running under
postgres$ psql my_database # Substitute the name of your database with the partitioned tables
my_database> CREATE EXTENSION plsh; # NOTE: This must be done once for each database
Create the directory:
root# mkdir -p /db/partition_dump
Ensure that the postgres user owns the directory and that your deployment user’s group has permissions on it as well, so that it can read the files. The default deploy user is ‘deploy’ on Engine Yard Cloud.
root# chown postgres:deploy /db/partition_dump
Further information on PL/sh can be found in the plsh project’s README.
Schedule Partition Maintenance
The commands below will schedule the partition_maintenance job to run at midnight every day:
root# su - postgres ## This is the OS user that will run the cron job
postgres$ mkdir -p $HOME/bin/pg_jobs ## This is the folder that will contain the script below
postgres$ cat > $HOME/bin/pg_jobs/myschema_partition_maintenance
#!/bin/bash
# NOTE: psql must be available in the path.
psql -U postgres my_database <<SQL
SELECT myschema.partition_maintenance('server'::text, 'server_master'::text, now()::date );
SQL
## Now press the <ctrl+d> keys to terminate the “cat” command’s input mode
postgres$ exit ## Exit the postgres os user and execute as root:
root# chmod +x /home/postgres/bin/pg_jobs/myschema_partition_maintenance ## Make the script executable
root# crontab -u postgres -e ## Add the line:
0 0 * * * /home/postgres/bin/pg_jobs/myschema_partition_maintenance
View the cron jobs for the postgres user to ensure the crontab line is correct:
root# crontab -u postgres -l
0 0 * * * /home/postgres/bin/pg_jobs/myschema_partition_maintenance
Make sure your /db/partition_dump folder is backed up if you are not using an Engine Yard Cloud instance; if you ever need the data again, you’ll need these files to restore the old partitions. This may be as simple as rsyncing (copying) these files to another server just to be sure. We find that sending these to S3 works well for archival purposes.
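As a minimal sketch, assuming a reachable backup host with SSH access (the hostname and destination path are illustrative):
postgres$ rsync -av /db/partition_dump/ backup-host:/backups/partition_dump/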
Now your master tables are scheduled for partition maintenance and you can rest knowing that you’ve created something special: a nimble database that will keep itself on a weight loss program!
Reload Old Partitions
If you have separation anxiety about your old data, or maybe a dull compliance request landed on your desk, you can reload the old partitions from the file system.
To reload a partition, first navigate to /db/partition_dump on the local db server and identify the file; then, as the postgres user, import the file back into the database (gunzip it first if it is still compressed).
postgres$ cd /db/partition_dump
postgres$ ls # find the filename of the partition dump you need
postgres$ psql my_database < name_of_partition_dump_file
After the partition file is loaded it will be queryable from the master table. Be aware that the next time the partition maintenance job runs, the newly imported partition will be exported (and dropped) again.
Now Let’s See it in Action
Create Child Tables
Let’s insert two rows of data to see creation of new child partitions in action. Open a psql session and execute the following:
postgres$ psql my_database
my_database> INSERT INTO myschema.server_master (server_id, cpu, memory, disk, "time") VALUES (123, 20.14, 4086000, '{sda1:510000}', 1359457620); --Will create "myschema"."server_2013-01-29"
my_database> INSERT INTO myschema.server_master (server_id, cpu, memory, disk, "time") VALUES (123, 50.25, 4086000, '{sda1:500000}', 1359547500); --Will create "myschema"."server_2013-01-30"
So what happened? Assuming this is the first time you’ve run this, two new child tables were created; see the inline comments on the INSERT statements above for the names of the child tables. The first insert can be seen by selecting against either the parent or the child:
SELECT * FROM myschema.server_master; --Both records seen
SELECT * FROM myschema."server_2013-01-29"; --Only the first insert is seen
Note the use of double quotes around the child partition table name. They aren’t there because the table is inherited; they are needed because of the hyphens between year, month, and day, which are not valid in an unquoted identifier.
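For example (illustrative; the exact error text may vary by version):
my_database> SELECT * FROM myschema.server_2013-01-29;   -- fails: without quotes the hyphens are parsed as minus operators
my_database> SELECT * FROM myschema."server_2013-01-29"; -- returns the partition's rows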
Perform Partition Maintenance
The two rows we inserted are more than 15 days old (as of the article’s example date). Manually running the partition maintenance job (the same job cron would run) will export these two partitions to the OS and drop them:
postgres$ /home/postgres/bin/pg_jobs/myschema_partition_maintenance
When the job is done you can see the two exported files:
postgres$ cd /db/partition_dump
postgres$ ls -alh
…
-rw------- 1 postgres postgres 1.0K Feb 16 00:00 server_2013-01-29_20130216_000000.000000.sql.gz
-rw------- 1 postgres postgres 1.0K Feb 16 00:00 server_2013-01-30_20130216_000000.000000.sql.gz
Selecting against the master table should now yield 0 rows, and the two child tables will no longer exist.
Reload Exported Partitions
If you want to reload the first child partition from the exported file, gunzip it then reload it using psql:
postgres$ cd /db/partition_dump
postgres$ gunzip server_2013-01-29_20130216_000000.000000.sql.gz
postgres$ psql my_database < server_2013-01-29_20130216_000000.000000.sql
Selecting against the master table will now yield 1 row, and the first child table will exist again.
Notes
Our database files reside on a partition mounted at /db, which is separate from our root (‘/’) partition.
For more information on PostgreSQL extensions visit the extensions documentation.
The database engine doesn’t report the number of rows affected correctly (always 0 rows affected) when performing INSERTs and UPDATEs against the master table, because the rows actually land in a child table and the trigger returns NULL. If you use Ruby, be sure to adjust your code for the fact that the pg gem won’t have the correct value when reporting cmd_tuples. If you are using an ORM, hopefully it adjusts for this accordingly.
Make sure you are backing up the exported partition files in /db/partition_dump; these files lie outside of the database backup path.
The database user performing the INSERT against the master table also needs DDL permissions so that the trigger can create child tables.
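A minimal sketch of the kind of grant involved, assuming the inserting role is my_role (CREATE on the schema is what allows new child tables to be created there):
my_database> GRANT CREATE ON SCHEMA myschema TO my_role;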
There is a small performance impact when performing an INSERT against the master table, since the trigger function executes for every row.
Ensure that you are running the latest point release for your version of PostgreSQL; this ensures you are running the most stable and secure version available.
This solution works for my situation, your requirements may vary so feel free to modify, extend, mutilate, laugh hysterically or copy this for your own use.
Next Steps
One of the original assumptions was the creation of partitions for each 24-hour period, but this can be any interval of time (1 hour, 1 day, 1 week, every 435 seconds) with a few modifications. Part II of this blog post will discuss the changes needed to the partition_maintenance function and table trigger. I’ll also explore how to create a second “archive” database that you can use to automatically load old partition data, keeping the primary database lean and mean for everyday use.
Share your thoughts with @engineyard on Twitter