Add the missing ADD_TO_SET() AND REMOVE_FROM_SET() for MySQL's SET type

MySQL has a datatype for the storage of sets in one simple column. It provides a FIND_IN_SET() function to deal with this in SELECT queries. However it still lacks an ADD_TO_SET() and REMOVE_FROM_SET() to use in UPDATEs. The usual workarounds you find on the web generate warnings or even wrong results with some special cases. So with MySQL 5.0 it's definitely the time to build a clean emulation for these two functions using stored routines. Or is it not (yet)?

Some thoughts on the SET type first: I find it to be very convenient in certain situations (see the Reference Manual or Mike Hillyer's article The MySQL SET Datatype for more on this MySQL specific feature). Of course using it means that your data is not normalized, as in this example:

-- a simple example with a set
CREATE TABLE `data_with_set` (
    `id` INT NOT NULL auto_increment PRIMARY KEY,
    `data` VARCHAR(255),
    `set` SET('a', 'b', 'c')
);

-- query your data using a simple select
SELECT * FROM data_with_set;

Taking a purist approach you should always use two additional tables instead of the SET type to implement this.

-- a normalized version of the same table without using the set type
CREATE TABLE `data_with_set` (
    `id` INT NOT NULL auto_increment PRIMARY KEY,
    `data` VARCHAR(255)
);

-- we keep all the possible members of a set in a separate table
CREATE TABLE `set_values` (
    `set_val_id` INT NOT NULL auto_increment PRIMARY KEY,
    `set_value` VARCHAR(255)
);
INSERT INTO `set_values` (`set_value`) VALUES ('a'), ('b'), ('c');

-- and need one more table to store all selected set members
-- for any given row in our main table `data_with_set`
CREATE TABLE `set_members` (
    `id` INT NOT NULL,
    `set_val_id` INT NOT NULL,
    PRIMARY KEY (`id`, `set_val_id`)
);

-- query your data using joins
SELECT d.id, d.data, GROUP_CONCAT(v.set_value)
FROM data_with_set AS d
LEFT JOIN set_members AS m USING (id)
INNER JOIN set_values AS v ON m.set_val_id = v.set_val_id
GROUP BY d.id;

You'll understand now what I mean when I talk about the convenience of the SET type. But you still should not give into the temptation to use a SET for ordinary user data. Adding a new possible SET value ('d' for example in the tables above) means fiddling around with your original table definition, something you (and definitely your users) should never have to do in a clean database design. Stick with the normalized approach here.

So where do I use the SET type after all? Wherever I have to keep a set of flags with my data for administrative-only purposes, the list of possible SET values is static and the overhead of two additional tables would be too much of a hassle. Let me give you two real-world examples where I found the SET type especially helpful:

For the second application it's definitely necessary to add and remove members in sets without touching the others.

The ADD_TO_SET() and REMOVE_FROM_SET() challenge

I once saw these two functions on a to do list in the MySQL manual but this page seems to have gone some time ago. Instead you find some tips on how to add or remove set members in the "User Comments" section of the manual or in Mike Hillyer's article. It usually boils down to something like the following (I know you could take the binary approach as well, but for now let's stick with the strings, as for me a set is something that's by definition unordered, even if that's not the case with MySQL's specific implementation):

-- ADD_TO_SET('new_member', set)
-- using CONCAT_WS() to join together multiple strings with a separator

CONCAT_WS(',', set, 'new_member')

-- REMOVE_FROM_SET('member_to_delete', set)
-- using REPLACE() to replace a member by an empty string

REPLACE(set, 'member_to_delete', '')

This is outright dangerous. Imagine a definition of SET('bar', 'restaurant', 'minibar', 'pool', 'pool table') to hold features of hotels (yes, you should probably use the normalized approach here anyway). Now a hotel with all of this closes its bar and its swimming pool. By using the above REPLACE() example, the hotel will also loose it's minibar (wich ends up being just a 'mini' and not being in the set as such will just be deleted) and it's billiard table (now being just an ordinary ' table' that falls out of the set, too).

Instead we should add commas around the set member we want to delete, to make sure it gets only erased if it appers in full. Then of course we have to replace it with a comma instead of an empty string, so other set members won't be joined together accidentally:

REPLACE(set, CONCAT(',', 'member_to_delete', ','), ',')

This of course won't work if the member comes first or last in the set we want to change. So we have to add further commas around the whole set as well.

REPLACE(CONCAT(',', set, ','), CONCAT(',', 'member_to_delete', ','), '')

This works but will generate warnings from MySQL because of the surrounding commas, so we have to strip them again using MySQL's TRIM() function.

TRIM(BOTH ',' FROM
    REPLACE(CONCAT(',', set, ','), CONCAT(',', 'member_to_delete', ','), ',')
)

Additionally we get warnings from the CONCAT_WS(',', set, 'new_member') replacement for ADD_TO_SET() as well, if we let it do it's work on an empty set. As set just contains an empty string we end up with ',member_to_delete'. But CONCAT_WS() does the right thing if we feed it a NULL instead of an empty string and does skip the comma in that case. So let's add that:

CONCAT_WS(',', IF(set = '', NULL, set), 'new_member')

The two replacements don't really look nice but at least they should work.

Encapsulating in stored functions

But with MySQL 5.0 we can easily hide the complexity inside two stored functions, both being actually only one-liners with some additional settings around the core code.

CREATE FUNCTION ADD_TO_SET(`member` VARCHAR(64), `set` VARCHAR(4096))
    RETURNS VARCHAR(4096)
    COMMENT 'ADD_TO_SET(member, set)\nAdds member to a SET string'
    DETERMINISTIC NO SQL
    RETURN CONCAT_WS(',', IF(`set` = '', NULL, `set`), `member`);

CREATE FUNCTION REMOVE_FROM_SET(`member` VARCHAR(64), `set` VARCHAR(4096))
    RETURNS VARCHAR(4096)
    COMMENT 'REMOVE_FROM_SET(member, set)\nRemoves member from a SET string'
    DETERMINISTIC NO SQL
    RETURN TRIM(BOTH ',' FROM
        REPLACE(CONCAT(',', `set`, ','), CONCAT(',', `member`, ','), ',')
    );

We just added a comment to the functions and tell MySQL that they are DETERMINISTIC (they generate the same output from the same input every time they're called) and contain NO SQL, which means they don't access any tables. That's valuable information that can be used by you, the optimizer and the replication code in MySQL.

That's it (almost): UPDATE queries dealing with SET columns just got a much nicer look. And should these functions once pop up as built into MySQL, you can just DROP our own versions and your code should continue to work, provided MySQL chooses the same syntax (which hopefully should be the case, in analogy to FIND_IN_SET() and what I once saw on the to do list in the manual).

Massive performance penalty

That's it, I said, and added an "almost". The sad end to the story is that the call to any stored function generates a considerable overhead. You can easily test this using MySQL's BENCHMARK() function by just executing our functions versus the inline code without a stored routine around it 100'000 times:

mysql> DO BENCHMARK(100000, ADD_TO_SET('a', 'b,c'));
Query OK, 0 rows affected (5.98 sec)

mysql> DO BENCHMARK(100000, CONCAT_WS(',', IF('b,c' = '', NULL, 'b,c'), 'a'));
Query OK, 0 rows affected (0.14 sec)

mysql> DO BENCHMARK(100000, REMOVE_FROM_SET('a', 'a,b,c'));
Query OK, 0 rows affected (6.21 sec)

mysql> DO BENCHMARK(100000, TRIM(BOTH ',' FROM
REPLACE(CONCAT(',', 'a,b,c', ','), CONCAT(',', 'a', ','), ',')));
Query OK, 0 rows affected (0.36 sec)

The penalty of using a stored function here is a slow-down by a factor of about 17 for the remove and about 42 for the add operation.

A suggestion for an improvement to MySQL's stored functions

This example just made me think about stored functions in general: I use a lot of mathematical expressions in my queries to compute a result from values stored in different columns and planned to wrap most of these into stored functions in MySQL 5.0. This would save us from a lot of typing, make queries much clearer and less error-prone. But if the overhead for calling a stored function is that big I probably have to rethink that for all but those functions that contain very expensive code, as there the overhead will be negligible compared to the time the function body needs to execute.

Perhaps MySQL could be improved by using something as the MERGE algorithm used for views: These functions could easily be reverted to inline code on the level of the query parser/optimizer. This should even work for some functions containing access to tables by rewriting them as JOINs. Until then the difficult decision for every stored function you plan to write remains: How much speed do you want to sacrifice for the clarity of your code?

Updates