Three weeks with MySQL 5.0 in production: A field report

It's now already three weeks ago since we did the big step: The upgrade from MySQL 4.1.14 to 5.0.15 on our company's production master server. That's three weeks of experience with MySQL 5.0 in the real word. In my field report I want to tell you why I believe that MySQL 5.0 is the best available database solution for our enterprise, but I also have to tell you why in my opinion 5.0.15 was not yet the production release.

The actual upgrade was almost painless at first. I felt pretty safe, too, as we carefully checked all the points I mentioned in my own article "Safe transition to MySQL 5.0" (written just before the upgrade of our own master server). At the time of the master's upgrade from 4.1.14 to 5.0.15 we already had our 5.0.13 slave running in sync for three weeks without a single crash and we also checked the key applications against a separate 5.0.13 server.

This work payed off: Besides one single JOIN that was handled differently by 5.0 we didn't miss any compatibility issues. There was a crash although of the 5.0.15 master server just a minute after it's first launch and applying mysql_fix_privilege_tables.sql. But that was probably due to it being already under load (we couldn't turn off all the applications for the upgrade, so I just replaced the symlink to the MySQL installation directory from 4.1.14 to 5.0.15 and did an /etc/init.d/mysql.server restart). This crash was not too bad, as you would have to restart the server after the fix of the privilege tables anyway according to the documentation. We did a check of all the tables and just watched how everything ran smoothly for the first few hours.

Things gone worse

Then, as we already started to apply prepared code using the new 5.0 features, the picture changed. It wasn't only those new features that crashed the server, but some ordinary queries as well that ran fine on 4.1.14 and even on the 5.0.13 slave that was kept in sync before. Some busy days were ahead of us.

Then, five days, 9728 segmentation faults and three crashed tables later I was not really in the mood to call 5.0.15 a production release anymore. Okay, that's unfair. We actually had 8 server crashes on the master and some more than a dozen on the replication slave. Because one nice thing about MySQL is its mysqld_safe script that instantly restarts the mysqld server process after a crash. This saved my life: The service interruptions on the master were usually too short to be noticed by our customers. But this same mysqld_safe script resulted in my high count of 9720 crashes on the replication slave: After a crash it will automatically be relaunched and picks up the same query that made it crash before... you get it.

So things were actually not all that bad as it sounds at first, the worst thing resulting from the crashes being three broken MyISAM tables that could (as it's usually the case with MySQL) easily be fixed with the REPAIR TABLE or the myisamchk command. InnoDB did its job in keeping all the data accurate even with multiple crashes. So even after this busy first week I could still say: I never ever lost one single row of data to MySQL.

But that was just not what I was used to with MySQL before: During three years of heavy use I encountered only two crashing bugs in a production release. But then of course we never got so many new features at once...

Sorting everything out

So instead of filling my blog with all the nice things I could finally do with those really great new features of the 5.0 release I actually spent most of the first week (including the nights) fixing things, cleaning up and keeping the servers running. Whenever I had a quiet minute I tried of course to sort things out and could in the end attribute all of the crashes to one of these five points (plus the first mysql_fix_privilege_tables.sql crash):

  1. The replication slave crashed on some BEFORE UPDATE triggers if certain updating queries weren't run from the database where the table with the trigger resides in. The master was not affected from this crash. I could reproduce this one in a test environment and filed it as Bug #14614. We stopped development of new triggers on our production server as a safety measure after this.
  2. Master and slave crashed on a query of the structure UPDATE ..., (SELECT...) SET... with a further unnamed view inside the SELECT statement. This same query already crashed replication slaves before 4.1.13 (I filed it some time ago as Bug #10442), but this time masters were also affected (could be a coincidence and something completely different inside the same query caused the crash).
    I tried for hours to reproduce this in a test environment with exactly the same data set but didn't succeed. Table checks showed that everything on the production server was okay. Then it came to my mind that this could possibly be an optimizer bug and the reason I couldn't reproduce this on a test server was that all the data was read in from a dump and thus the table was in an optimized state. I ran an OPTIMIZE TABLE on the production server as well to make the optimizer choose a different approach to solve the query - and really, the problem disappeared. With it unfortunately the possibility to finally reproduce this.
  3. We had two crashes on this really simple query from our Wiki: /* User::idFromName */ SELECT user_id FROM `user` WHERE user_name='Bvontob' LIMIT 1
    As this happened twice (with two days in between) this is probably no coincidence. However the crash doesn't happen every time the query is run. So no results here either.
  4. One corrupted table was the result after I accidentally crashed the production server doing a CREATE PROCEDURE with no database selected. I actually knew about this Bug #13514 (now fixed in 5.0.16) and tried to avoid it. But if you really work on a production server in the command line interface it's only a matter of time until you hit this once. (I have no default databases selected on my command line clients to make sure I don't accidentally do something stupid in a wrong database. This turned out to be a bad move with this bug.)
  5. One crash on the slave happened on a simple DROP FUNCTION, the same query executed fine on the master and it never happened before and after again on the slave.

The bad thing was that after we implemented the first VIEW we couldn't take a consistent backup of our entire database anymore due to a bug in mysqldump (Bug #12838). This was a real problem as after all the crashes the replication slave got out of sync and we needed to initialize it again. There was no other practicable solution than to drop our VIEWs again for this.

Stabilized situation - but was this really the production release?

All the crashes (with the exception of the slave crash on the DROP FUNCTION) happened during the first week. After isolating and deactivating the suspect queries and stopping development of queries with new features things calmed down. When I did the update to 5.0.16 yesterday the master already had an uptime of 18 days and the slave of 8 days.

Of course we still hit some other (non-crashing) bugs and restrictions while developing new stuff (most of them already in the bugs database), but I think that's only normal if you want to be an early adopter. I know MySQL and what great job they do in fixing bugs (just look at the median time on the bugs statistics page which tells us that actually half of the reports are closed within 8 days) so I was never really concerned.

But do those people know it too who just now for the first time really check out MySQL as it starts to provide those enterprise level features needed by them? I think you could turn them away easily if they find out that a production quality release doesn't provide a clean way for a consistent backup and can be crashed by an everyday administrative command such as a CREATE PROCEDURE (both bugs actually known well before the release date), just to give you two examples I mentioned above.

So I think it was too daring to declare 5.0.15 production quality after only one release candidate and after skipping 5.0.14 as a public release. That has nothing to do with the product quality in general (which I think is great - and I know for sure that before christmas we'll have a really solid build on our servers) but only with the time the production label was applied.

That said I have to admit that I understand the marketing pressure and that I even personally was not unhappy with the production label, as I really wanted to finally start using MySQL 5.0 and I couldn't have sold this to my boss without that.

The reasons to choose MySQL 5.0

When we talk about selling a database product I have to give you the reasons why we chose MySQL 5.0. I don't want to talk about the technical features here too much as I believe that with 5.0 we're nearing the point (which we'll reach with partitions and online backups in 5.1) where you can really achieve any task with MySQL that probably made you choose products from the "big players" in the past. The points that really make MySQL stand out for me are the following:

Don't underestimate the last point when choosing a product!