Got a few spare Gigabytes of RAM? Fulfilling the memory needs of a leaking MySQL…
I knew right from the start I was in real trouble once again when I saw more and more error messages coming in from our applications. They kept telling me that our replication slave was not properly responding. Did I just encounter another bug I never met before? Anyway, I just cancelled my private plans for tonight as this looked like it could take some time to solve. How right I was.
The first shock: I couldn't even log in to the server using ssh. The connection was established, but the shell just wouldn't present its friendly prompt. Luckily I had an open connection from another machine I could use. It took almost a minute until the three letters of my first command showed up on that connection: t... o... p... and finally Return.
Another minute later the top screen suddenly appeared. A quick glance at the process list didn't show anything special, but the one minute load average was well above 20, though rapidly decreasing. The memory usage showed that the machine must have been heavily swapping just seconds ago, something it's actually never supposed to do. There sure must have been a serious problem, even if it just disappeared that instant.
Hooking up a mysql command line client on the host directly and checking the status seemed like the best idea at that moment (the server is dedicated to MySQL only, so this was the most probable cause of problems).
mysql3:localhost-(none) [root]> status -------------- mysql Ver 14.12 Distrib 5.0.18, for pc-linux-gnu (i686) using readline 5.0 Connection id: 30 Current database: Current user: root@localhost ... Uptime: 1 min 34 sec
Oops, the uptime should actually be something like a week. Seems like mysqld crashed. Or maybe not? I was puzzled for a second when I quickly checked the error log and didn't see the familiar mysqld got signal 11 followed by the stack dump and other information usually produced on every crash. Instead there were just some strange error messages telling me that the slave couldn't connect to the master, something I found to be pretty implausible as the master was running perfectly and there were no network problems.
060228 15:44:25 [ERROR] Error reading packet from server:
Lost connection to MySQL server during query
( server_errno=2013)
060228 15:44:35 [Note] Slave I/O thread: Failed reading log event,
reconnecting to retry, log 'mysql1-bin.000346'
position 1005255547
060228 15:51:54 [ERROR] Slave I/O thread: error reconnecting to master
'replication@10.1.1.101:3306': Error: 'Lost
connection to MySQL server during query'
errno: 2013 retry-time: 60 retries: 86400
Number of processes running now: 0
060228 16:09:32 mysqld restarted
060228 16:09:35 InnoDB: Database was not shut down normally!
...
However InnoDB complained about an abnormal shutdown. So, no crash information, but still not shut down normally? So something must have killed mysqld actively? Something? Let's check /var/log/messages.
Feb 28 16:09:24 mysql3 kernel: __alloc_pages: 0-order allocation failed
(gfp=0x1d2/0)
Feb 28 16:09:31 mysql3 kernel: VM: killing process mysqld
Indeed, the Linux kernel (running on a VMWare ESX virtual machine) must have thought that mysqld was a little bit too greedy. An allocation of another page of memory failed. Death sentence for MySQL.
While I was contemplating this fact (I've seen a lot of different crashes since the upgrade to 5.0, but this was the first time that my mysqld was executed by the kernel) I just saw MySQL's hunger for memory raise again. It would be only a question of time until it would pass the physical limits of memory.
It dawned on me that a recent query on the master could have been the cause: A data set of merely 22 million rows should have been moved to another table that just got a brand new trigger making use of some fancy stored procedures to check inserted data. The INSERT INTO … SELECT statement ran fine on the master (although it took over 40 minutes to execute) without any strange craving for memory.
At that time I went to search the bug database for anything on memory leak and replication, found bug #16175: memory leak in rpl_trigger.test, added a quick note and went back to solve my immediate problem, as the server went out of physical memory again and our staff started to complain.
Should I just skip the query? This meant that master and slave would go out of sync, a bad situation because I knew there were already other queries in the pipeline of the binary log that would make use of the freshly inserted data. So I needed to solve this in place. I redirected some applications to the master server (just to calm down the staff) and developed a plan.
If 2 GB were not enough, why not just throw more memory at it? From the estimated time I thought the query would need to execute and the time it actually took until the server ran out of physical memory (it was now in its third round, the slave picked up the same query again after every crash) I deduced that about 4 GB should be enough.
I was really lucky to have this instance of MySQL on a virtual machine (the master runs on a dedicated server). I started to shut down about every other virtual machine that was not absolutely needed and cut the memory assignments for the really important ones down to a minimum. Finally I got an additional 1.6 GB.
Would 3.6 GB in total be enough for the query to finish? I was prepared to restart mysqld after a success, as the Linux kernel couldn't reclaim the leaked memory, and reran the query. I was glued to the top screen for the next half hour, seeing the memory usage steadily increase.
The query finished at 86% of physical memory.