Discover "bad" use of ENUM and SET types using MySQL information_schema
After reading Sheeri Kritzer's blog post, the resulting comments and her own reply on the use of SET and ENUM in MySQL I wondered how these types are actually used in our databases (So, Sheeri, who's perfect anyway? Just read this post to see some other "bad" use at our site!). Luckily the information_schema in MySQL 5.0 now provides an easy way to find out how your users employ the available datatypes.
I agree with some of what Sheeri basically wrote about SET and ENUM in her Working Smarter, Not Harder post mentioned above. It was only the specific examples she gave us that completely contradict what I wrote in one of my earlier posts on the SET type (this holds for ENUM as well): "Adding a new possible SET value means fiddling around with your original table definition, something you (and definitely your users) should never have to do in a clean database design."
The only place, I wrote, where I use a SET is where I have to keep some flags with my data for administrative-only purpose, where the list of possible SET values is static and where the overhead of two additional tables for the administrative use thus would be too much of a hassle. Additionally I should have added that you should only use a SET if you don't run any general searches on that column or you can live with a full table scan for that (as indexes can only be used for specific cases on a SET).
In short, I see quite a few good applications for ENUM, especially now in 5.0 where you can use it to achieve real constraints when you run your server in strict mode (so I agree here with Sheeri as long as your possible values are static), but for the SET type I almost every time found the normalized approach to be the better solution.
But do we (I'm only one amongst four developers at our company, with the additional task of being the part time DBA) really do as I say? So let's finally dig into the information_schema database to find out.
SELECT DATA_TYPE,
COUNT(*) AS n,
ROUND(COUNT(*)
/ (SELECT COUNT(*) FROM information_schema.COLUMNS)
* 100, 2
) AS `%`
FROM information_schema.COLUMNS
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema')
GROUP BY DATA_TYPE
ORDER BY n DESC;
This quickly gives us a list of all datatypes used, their count and their overall percentage (exclude the mysql system schema and the information_schema itself to prevent biasing). Run the query on your own server to see how your users choose amongst the different MySQL datatypes or see db4free.net's statistics for an example of a hosting provider with a wide user base.
+-----------------+-----+-------+ | DATA_TYPE | n | % | +-----------------+-----+-------+ | varchar | 927 | 35.96 | | decimal | 284 | 11.02 | | int | 257 | 9.97 | | char | 198 | 7.68 | | tinyint | 159 | 6.17 | | smallint | 90 | 3.49 | | enum | 81 | 3.14 | | text | 76 | 2.95 | | timestamp | 57 | 2.21 | | bigint | 32 | 1.24 | | float | 21 | 0.81 | | set | 17 | 0.66 | | mediumint | 15 | 0.58 | | tinyblob | 14 | 0.54 | | time | 8 | 0.31 | | binary | 5 | 0.19 | | mediumtext | 5 | 0.19 | | date | 4 | 0.16 | | longtext | 4 | 0.16 | | blob | 3 | 0.12 | | mediumblob | 2 | 0.08 | | tinytext | 2 | 0.08 | | float unsigned | 2 | 0.08 | | double unsigned | 1 | 0.04 | | varbinary | 1 | 0.04 | | longblob | 1 | 0.04 | | datetime | 1 | 0.04 | +-----------------+-----+-------+ 27 rows in set (0.64 sec)
The first thing that hit me by surprise was the high count of VARCHAR columns in our databases, but this is due to the fact that we work a lot with human language strings (weather forecasts), due to some legacy tables (that really should have been designed better) and the multiplicative factor of some archives using VARCHAR columns and consisting of multiple MyISAM tables merged into one by the MRG_MYISAM engine.
Then of course the count of ENUM and SET seems pretty high for the narrow field of use I described above. So let's see what definition lists of possible values they contain.
SELECT DATA_TYPE,
LEFT(COLUMN_TYPE, 50) AS definition,
COUNT(*) AS n
FROM COLUMNS
WHERE TABLE_SCHEMA NOT IN ('mysql', 'information_schema')
AND DATA_TYPE IN ('set', 'enum')
GROUP BY COLUMN_TYPE
ORDER BY DATA_TYPE ASC, n DESC;
We only want to see the columns of type SET and ENUM this time. Additionally the query only displays the first 50 characters of the value definition lists (to fit everything nicely on a console) and it groups identical definitions with a count of their occurrences. Again we filter the mysql system schema and the information_schema itself.
I actually wanted to have an additional column in the result set to show me the tables wherein the definition lists are used, but GROUP_CONCAT() and ORDER BY seems to be a bad combination on the information_schema, so this experiment ended in bug report #15307.
But still we get a good overview of what these two datatypes are used for. Let's first check ENUM at the beginning of the resulting table (I left out some of the rows returned from our system, but kept the selection representative).
+-----------+----------------------------------------------------+---+
| DATA_TYPE | definition | n |
+-----------+----------------------------------------------------+---+
| enum | enum('yes','no') | 9 |
| enum | enum('snow','hail','rain') | 6 |
...
| enum | enum('insert','update','delete') | 3 |
| enum | enum('yes','no','auto') | 2 |
| enum | enum('Ouvert','Fermé') | 2 |
...
| enum | enum('ftp','sftp') | 1 |
| enum | enum('primary','backup') | 1 |
| enum | enum('','AFRICA / AFRIQUE','ASIA / ASIE','EUROPE / | 1 |
| enum | enum('de','fr','en','es','it') | 1 |
| enum | enum('','500 HPA','700 HPA','850 HPA','925 HPA','1 | 1 |
...
Most of the time you don't even need to look at the actual table/application to decide if the use of an ENUM was justified or not. Here I'd say that all but the last two or three examples seem to be okay. The set of possible values on those first few rows is almost by definition static: You wouldn't for example expect a third state being added to 'Ouvert' and 'Fermé' (French for 'open' and 'closed').
For such binary states you could of course choose a BIT(1) column as well now in MySQL 5.0 (that was just a synonym for TINYINT(1) before and didn't enforce a selection of only two values). The advantage is that you can easily check for truth in your code with those numeric types and don't need an additional comparison, the advantage of an ENUM on the other hand is clarity. So it's actually just a matter of taste here, as both an ENUM with two possible values and a BIT(1) will eat up one byte of storage.
The last two examples though really look like they should have been implemented with an additional table instead of an ENUM (the third but last with the list of continents being a border case, as you don't expect any new continents to pop up, but you probably want to store some additional information on them and would need another table for that anyway).
Of course I checked this in the tables and it turned out that the list of continents and of the pressure levels appeared in a table that was sort of a tabular representation of a plain text file from an external source. The data was probably once normalized there and only denormalized for the generation of the text file. As we didn't bother about that when parsing the files again, the ENUM was just the optimal datatype (as far as storage is concerned) suggested by the analyse procedure.
Remains the list of languages that was probably suboptimally implemented as an ENUM.
Now let's finally have a look at some of the uses of the SET datatype in our database.
...
| set | set('ltg','fog','CAVOK','NSC','SKC','CLR','err') | 7 |
...
| set | set('html','graph','csv','xml') | 1 |
| set | set('val','min','max','sum','avg','dev','n','avg2' | 1 |
| set | set('Gewitter','Regen','Schnee','Eisregen','Sturm' | 1 |
| set | set('tt','tt5','ff','fxx','dd','rh','qfe','rrr','s | 1 |
+-----------+----------------------------------------------------+---+
65 rows in set (0.33 sec)
Most of the uses here actually contradict what I wrote on the use of the SET type.
The first example contains a set of almost independent binary flags in rapidly growing tables (many millions of rows per month). These flags are not used for any searches on that table and they are almost never used at all besides possibly for debugging. Each of the flags could of course have been implemented in a separate column using ENUM or with the normalized approach and two additional tables. The decision to use a SET here was solely motivated by storage optimization. All the flags fit nicely into one byte and then of course as this is an archive it's easier to handle it if everything is just in one instead of three tables. But of course if another flag should be added here in the future the table will be offline (during the necessary ALTER TABLE) for quite a time due to its size.
The second but last row is a SET of different possible severe weather conditions. At least the definition list seems to be static here as well. The other rows come from a user rights table of a data extraction tool: The SET was chosen here as these tables contain only few rows and are currently edited by hand in phpMyAdmin (which has much nicer support for SET columns than for a normalized schema with three tables).
What do we learn from this?
- Sometimes maybe most of us take the bait and use
SETandENUMin an environment where a normalized approach would have been the cleaner solution. - I just found the
information_schema(one of the features I neglected the most in MySQL 5.0 until now and only used for some dynamic code) to be a valuable tool for a DBA to quickly check the design of databases in an environment where not every new table or query can be checked beforehand. With MySQL 5.0 and theinformation_schemathe developers now definitely have more to fear from DBAs than the occasional look into the slow query log.