Too many connections: Handling of an emergency situation in MySQL
What an evening! Swisscom experienced some major internet connectivity problems tonight, heavily affecting our MySQL servers. I was hacking away on my laptop for hours while my date was sitting next to me in front of the fireplace (until she left after some time). But Perl and MySQL once more saved my life (albeit not the romantic evening). I'll give you an overview of the events, maybe there's something to learn from this for others as well.
I got the phone call from the office at 20:00 CET while at the same time my mailbox was filling up with messages like this one:
Subject: jobRunner@linuxautomat2 - Error executing job 'METAR Archiver'
Error: './metarArchiver.pl -c archiver.conf -aq' returned exit code 22
Output from './metarArchiver.pl -c archiver.conf -aq' (STDERR):
Error: Couldn't open mysql db 'meteonews' on 'mysql1.intern'
(Too many connections)
My workmate in the office first blamed the database to be the source of the problems. But seeing the infamous Too many connections from MySQL I quickly realized that there was probably something else going on. This could either be an application going crazy and connecting over and over again (but the applications at our site usually behave well and I never experienced something like that), some accidentally locked tables blocking other queries or primarily a network problem that results in too many lost and rebuilt connections.
When I tried to connect to our servers I quickly realized that the latter really must've been the case: I couldn't reach one single machine. I could perfectly ping them, yes, but all the bigger TCP packets got lost. My workmate in the office insisted he had internet connectivity, so something strange was definitely going on. I realized that I could reach most of the internet from home, but just not some hosts in Switzerland (unfortunately ours amongst them).
It took me 10 minutes to find a route through the net that at least worked for ssh via hops over 3 hosts through the networks of other providers and finally reached our servers through an ADSL line we have for backup.
That's the first thing I did when I saw the friendly prompt of our MySQL server:
mysql> SET GLOBAL max_connections = 1000;
Query OK, 0 rows affected (0.00 sec)
mysql> SET GLOBAL interactive_timeout = 10,
wait_timeout = 10,
net_read_timeout = 10,
net_write_timeout = 10;
Query OK, 0 rows affected (0.00 sec)
I doubled the maximum number of the allowed concurrent connections and drastically lowered all the timeouts for new network connections. MySQL would now drop all connections if they were stuck for 10 seconds and thus free them for other clients that weren't affected by Swisscom's network problems. This would at least give me a few minutes to check more things out and find a better solution.
A quick look at the output from SHOW PROCESSLIST and I knew that there were still too many connections stuck from before, so I desperately ran a few KILL CONNECTION commands on the MySQL prompt. Realizing that I could do this maybe forever I quit the MySQL command line client and hacked a few lines of Perl code into emacs.
#!/usr/bin/perl
use strict;
use warnings;
use DBI;
my $dbhost = "localhost";
my $dbuser = "root";
my $dbpass = "no_this_is_not_our_password";
my $db = DBI->connect("DBI:mysql:host=$dbhost", $dbuser, $dbpass);
$db or die(DBI::errstr);
my $list = $db->prepare("SHOW PROCESSLIST")
or die(DBI::errstr);
my $kill = $db->prepare("KILL CONNECTION ?")
or die(DBI::errstr);
$list->execute()
or die(DBI::errstr);
while(my $process = $list->fetchrow_hashref('NAME_lc')) {
if($process->{'user'} eq $ARGV[0] && $process->{'host'} =~ /^$ARGV[1]/) {
$kill->execute($process->{'id'});
}
}
Now I had a little tool I could call from the Linux prompt that would kill all connections with a user name and a host which I could specify as arguments on the command line. (Update: For a better way of providing username and password to your script see Giuseppe Maxia's article Hiding connection parameters with Perl/DBI that he's probably written in reaction to this blog post. The default user I have in my .my.cnf lacks the SUPER privilege that's necessary for KILL, so I just hardcoded the credentials into the code, which is actually bad practice.) I cursed myself for not having something like this prepared already as I should have known I would end one day solving such a problem on the command line with no other administrative tools at my hand.
I still saw a lot of connections coming in all the time that just did nothing but sit around (and eventually getting killed after the above defined 10 seconds) and I realized that we had at least one user account of the form user@% (allowed to connect from any host) that was as well being used for internal connections as for connections from a remote host. That was a stupid thing because my next step would have been to limit the maximum number of connections for all remote hosts in degraded networks, so I could finally lower the global maximum connection threshold again. But before I could do that I had to split that user@% account.
mysql -NBe "SHOW GRANTS FOR user@'%'" > user.sql
This gave me a text file with the grant definition of that user to edit in emacs (the B option strips the ASCII table layout from the output, N removes the column header, so you just need to add the semicolons after every line to get valid SQL).
I replaced all the occurrences of % in the file with the name of the bad remote host and added a WITH MAX_USER_CONNECTIONS to the GRANT USAGE command on the first line:
GRANT USAGE ON *.* TO 'user'@'badhost' IDENTIFIED BY PASSWORD 'xxx' WITH MAX_USER_CONNECTIONS 10;
Now I could feed this freshly created user back to MySQL again.
mysql < user.sql
Done. The bad remote host would now be limited to a maximum of 10 connections. That was all I could do here.
But we still remained with the clients that couldn't get any data due to the degraded network. I found out that the problem affected mainly the Zurich area and that the connections to our branch office in Lausanne were okay. Luckily I just moved our slave to that office a few days ago. So we redirected the degraded external connections from Zurich to Lausanne. This is not automatically done as some of the clients write data as well and this should only be done on the master, of course.
So I had to prepare the slave to accept writes as well in this extraordinary situation.
mysql> SET GLOBAL read_only = OFF; Query OK, 0 rows affected (0.00 sec)
I know that this will desynchronize the slave. But only two tables are affected and I will definitely deal with that tomorrow after getting some sleep first...
(By the way: Of course we had to handle a lot of other problems arising from the network degradation as well. I just reduced it to the MySQL part in this blog post.)
Read more on used commands
- Server System Variables (sections on
max_connections, timeout settings,read_only) - SHOW PROCESSLIST Syntax
- KILL Syntax
- SHOW GRANTS Syntax
- GRANT and REVOKE Syntax
- Options of the MySQL Command-Line Tool