Using MySQL's ARCHIVE storage engine: Backups and partitions (Part I)
I still didn't use the ARCHIVE storage engine in production but am preparing to. A lot of existing projects with a need for large amounts of historical data use a combination of multiple MyISAM tables merged into one using the MERGE engine: Historical data is kept in packed read-only tables, the current data is added to an unpacked table. Now the question arises how you could implement a similar combination using the ARCHIVE engine for historical data, given that partitioning is not yet available in MySQL 5.0 and won't support multiple storage engines in 5.1. That's what this first article on ARCHIVE deals with, the second will take a closer look at backup strategies.
The initial idea for the combination problem is simple: Just create two (or more) almost identical tables. The table for the current data uses the MyISAM, InnoDB or MEMORY engine to get fast look-ups on keys, the older data is kept in compressed ARCHIVE tables without keys.
CREATE TABLE t_current ( d date NOT NULL PRIMARY KEY, c char(1) NOT NULL ) ENGINE=MyISAM; CREATE TABLE t_archive ( d date NOT NULL, c char(1) NOT NULL ) ENGINE=ARCHIVE;
Then create a VIEW with a UNION to emulate something like a MRG_MYISAM or a partitioned table where you can access current and archived data conveniently in one place.
CREATE ALGORITHM=MERGE VIEW t AS SELECT d, c FROM t_current UNION SELECT d, c FROM t_archive;
You'll notice the first problem right away when you execute this command. The server will respond with a Query OK, 0 rows affected, 1 warning (0.00 sec). Though the VIEW has been defined correctly, it couldn't make use of the MERGE algorithm.
mysql> SHOW WARNINGS\G *************************** 1. row *************************** Level: Warning Code: 1354 Message: View merge algorithm can't be used here for now (assumed undefined algorithm) 1 row in set (0.00 sec)
Due to the UNION MySQL currently (5.0.18 as of writing) still has to use the TEMPTABLE algorithm. Building a temporary table from your emulated partitions on every query actually renders this approach with a VIEW useless.
That's a pity because a more advanced approach is doomed as well at the level of MySQL's current query optimiser: You don't want the server to scan tables/partitions where there can't be any matching rows. This is called partition pruning. In our emulation scenario we could tell the optimiser to not scan a particular table using a WHERE restriction on the range key.
CREATE OR REPLACE VIEW t AS SELECT d, c FROM t_current WHERE d >= '2006-01-01' UNION SELECT d, c FROM t_archive WHERE d < '2006-01-01';
When we do a SELECT on this VIEW that only touches one of the two data ranges, the optimiser could theoretically prune the other of the two SELECT statements in the UNION. Unfortunately the query optimiser won't listen to that hint. We can prove this using a stored function that logs every row accessed from one of the two tables.
CREATE TABLE log (accessed ENUM('t_archive', 't_current'));
delimiter //
CREATE FUNCTION log_access(accessed ENUM('t_archive', 't_current'))
RETURNS ENUM('t_archive', 't_current')
DETERMINISTIC
BEGIN
INSERT INTO log VALUES (accessed);
RETURN accessed;
END //
delimiter ;
Insert some data into the tables and change the VIEW to call the access log function with the appropriate argument for each of the two SELECT statements.
INSERT INTO t_archive (d, c) VALUES ('2005-05-13', 'a');
INSERT INTO t_current (d, c) VALUES ('2006-03-01', 'c');
CREATE OR REPLACE VIEW t
AS
SELECT d, c, log_access('t_current') AS accessed
FROM t_current
WHERE d >= '2006-01-01'
UNION
SELECT d, c, log_access('t_archive') AS accessed
FROM t_archive
WHERE d < '2006-01-01';
Execute the following commands to prove that still both tables are accessed even when actually only data from one of the two is used in the final result set:
mysql> SELECT * FROM t WHERE d > '2006-02-01'; +------------+---+-----------+ | d | c | accessed | +------------+---+-----------+ | 2006-03-01 | c | t_current | +------------+---+-----------+ 1 row in set (0.01 sec) mysql> SELECT * FROM log; +-----------+ | accessed | +-----------+ | t_current | | t_archive | +-----------+ 2 rows in set (0.00 sec)
This behaviour doesn't only exist in VIEWs where it could be explained by the fact that they run with ALGORITHM=TEMPTABLE but also in simple subqueries. That's why I posted a feature request.
The question remains on how to combine ARCHIVE tables with other engines. Two feasible solutions come to my mind:
- Check the data range in your application code and just replace the table name in the statement accordingly. If the queried range spans more than one table combine the result sets of multiple queries or build a
UNIONquery dynamically. - Encapsulate your queries in a stored procedure and do the range check there. You should make use of dynamic SQL to build the
UNIONquery or combine multiple result sets using a temporary table.
These solutions could be implemented without too much of a hassle. This means you can still combine the advantages of different storage engines before partitioning allows this probably in MySQL 5.2.