This story is from a year ago. I kept telling myself that I got to write a blog post about it and here it is: a continuous deployment troubleshooting case.

One of the application server pools running a LAMP application (Linux / Apache/ MySQL / Memcached / PHP) was showing the following issue. Every time new code was deployed to the pool, Apache processes on 10-15 random servers out of 300+ would deadlock and stop serving requests. The problem was happening for some time but was not perceived as a big issue. Random Apache deadlocks did not impact the end users, and restarts of deadlocked Apache instances returned services to the operational state. I was not very comfortable with this behavior, so I decided to dig in and see what was going on.

Troubleshooting this case was not very simple because

  • The problem was occurring in production,
  • The number of servers in that production environment was 300+,
  • The problem was occurring sporadically on a small subset of servers,
  • No discernable pattern.

The first step was to gather more information about the deadlocks. The Apache process was locking up, so getting information about the deadlock required installing a gdb. In order to run a non-interactive session of gdb, I created the following gdb script:

set pagination 0
During the next code deployment, when the deadlocked instances showed up again, I run the following remote command on the deadlocked servers:
for i in $(ps -ef | grep http | awk '{ print $2 }' | tr '\n' ' '); do echo
"=== pid ${i} ===" | tee -a ; gdb -x -p $i | grep -v
"Loaded symbols" | grep -v "Reading symbols" | grep -v "no debugging symbols
found" >>; done
The command gets the process IDs of Apache processes (httpd), runs the script created above, filters out gdb nagging and appends the output to file.

After running this remote command across the deadlocked servers, I fetched the backtrace information, and here is what I saw:

=== pid 311 ===
GNU gdb Fedora (6.8-27.el5)
Copyright (C) 2008 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later 
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Attaching to process 311
[Thread debugging using libthread_db enabled]
[New Thread 0x7f29a49ba760 (LWP 311)]

0x00007f29a2e54e74 in __lll_lock_wait () from /lib64/
#0  0x00007f29a2e54e74 in __lll_lock_wait () from /lib64/
#1  0x00007f29a2e50874 in _L_lock_106 () from /lib64/
#2  0x00007f29a2e502e0 in pthread_mutex_lock () from /lib64/
#3  0x00007f299a342969 in apc_pthreadmutex_lock (lock=0x7f2981ef6088) at
#4  0x00007f299a33d0ef in apc_cache_find_slot (cache=0x7f29a4f3ce30,
key={data = {file = {device = 139816836143312, inode = 25}, u
ser = {identifier = 0x7f29a4da18d0 "/etc/php.auto_prepend.php",
identifier_len = 25}, fpfile = {fullpath = 0x7f29a4da18d0
"/etc/php.auto_prepend.php", fullpath_len = 25}}, mtime = 1290456278, type =
3 '\003', md5 = "\r\000\000\000\000\000\000>\r\000\000\000\000\000\000c"},
t=1290456278) at /usr/local/src/php-5.3.3/APC-3.1.4/apc_cache.c:578
#5  0x00007f299a33d444 in apc_cache_find (cache=0x7f2981ef6088, key={data =
{file = {device = 139816836143312, inode = 25}, user = {identifier =
0x7f29a4da18d0 "/etc/php.auto_prepend.php", identifier_len = 25}, fpfile =
{fullpath = 0x7f29a4da18d0 "/etc/php.auto_prepend.php", fullpath_len = 25}},
mtime = 1290456278, type = 3 '\003', md5 =
"\r\000\000\000\000\000\000>\r\000\000\000\000\000\000c"}, t=0) at
#6  0x00007f299a342005 in my_compile_file (h=0x7fff14967f40, type=8) at
#7  0x00007f299d579921 in phar_compile_file () from
#8  0x00007f299d6f98aa in zend_execute_scripts () from
#9  0x00007f299d6a59c8 in php_execute_script () from
#10 0x00007f299d792613 in php_handler () from /etc/httpd/modules/
#11 0x00007f29a4a1b7c1 in ap_run_handler () from /usr/sbin/httpd
#12 0x00007f29a4a1c084 in ap_invoke_handler () from /usr/sbin/httpd
#13 0x00007f29a4a2db7a in ap_process_request () from /usr/sbin/httpd
#14 0x00007f29a4a2a431 in ap_process_http_connection () from /usr/sbin/httpd
#15 0x00007f29a4a2522c in ap_run_process_connection () from /usr/sbin/httpd
#16 0x00007f29a4a25679 in ap_process_connection () from /usr/sbin/httpd
#17 0x00007f29a4a3497f in child_main () from /usr/sbin/httpd
#18 0x00007f29a4a34b37 in make_child () from /usr/sbin/httpd
#19 0x00007f29a4a34dc4 in perform_idle_server_maintenance () from
#20 0x00007f29a4a35300 in ap_mpm_run () from /usr/sbin/httpd
#21 0x00007f29a4a017ef in main () from /usr/sbin/httpd

This backtrace shows that the Apache process is trying to acquire apc_pthreadmutex_lock, and the lock is not getting released. apc_pthreadmutex_lock is a part of the Alternative PHP Cache (APC) module.

Given that the problem is happening only after code deployments, the deadlock in APC makes sense. New deployments flushes APC cache, all PHP code on the server is getting recompiled and written into APC cache, cache write operations need to be serialized, and the lock responsible for cache write serialization is not getting released.

My next step was to dig into Alternative PHP Cache to better understand troubleshooting options available to me. I found that APC has several locking mechanism implementations: pthread_mutex-based, IPC-semaphores-based, and file-based. The deployed version of APC was compiled with pthreadmutex_lock. I decided to rebuild APC with IPC semaphores locks because semaphores are easier to track and Apache itself uses the semaphores mechanism for locking.

After deploying an updated APC module with semaphores locking, I watched the next code deployment looking for deadlocked servers. The deadlocked servers showed up again and again in APC, but now the backtrace looked a little different because of the IPC semaphores changes.

Next, I went on to look at how servers are behaving. Listing all active IPC semaphores on a system is very simple:

# ipcs –s

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status

------ Semaphore Arrays --------
key        semid      owner      perms      nsems     
0x00000000 0          root       600        1         
0x00000000 65537      root       600        1         
0x00000000 262146     apache     600        1         
0x00000000 294915     apache     600        1         
0x00000000 327682     apache     600        1         
0x00000000 360451     apache     600        1         

------ Message Queues --------
key        msqid      owner      perms      used-bytes   messages

The view of the active semaphores on the system showed something interesting. Apache had 4 active semaphores. In the stock configuration, Apache creates 2 semaphores. APC adds another semaphore. Thus, the fourth semaphore was unexpected. Reviewing Apache configuration, I found out that mod_ssl was enabled on the systems. The fourth semaphore was created by mod_ssl. Secure HTTP connections in the environment were terminated at the load balancer. Mod_ssl on Apache application servers had no use.

I updated the server configuration to remove mod_ssl and watch the next code deployment. One code deployment - no deadlocked servers. Another code deployment - no deadlocked servers. Ten more code deployment servers - no deadlocked servers.

The source of the problem was apparently a condition where one apache child holding a lock on mod_ssl and another apache child holding a lock on APC. Because mod_ssl had no practical use in the installation, the solution was a simple removal of mod_ssl configuration.

Leave a Reply

Blogger Templates for WP 2 Blogger sponsored by Cinta.
Content Copyright © 2010 - 2011 Artem Veremey, All Rights Reserved
preload preload preload