Problem
Requests to a LAMP-based Facebook application are load balanced between server nodes by an F5 BIG-IP Local Traffic Manager (LTM). The F5 BIG-IP LTM has two components: an ASIC-based fast-switching component and an AMD Opteron-based software-switching component. Layer 4 load balancing between the application server nodes can be handled either exclusively by the F5 ASIC (Performance L4 mode), or exclusively by F5 software components (Standard mode), or by a combination of F5 ASIC and F5 software components. With the Standard mode turned on, F5 BIG-IP LTM capacity utilization is at 60%. With the Performance L4 mode turned on, F5 BIG-IP LTM capacity utilization is at 20%. However, with the Performance L4 mode turned on, developers report seeing an occasional blank Facebook canvas page returned in response to legitimate application requests. Serving blank Facebook canvas responses at any significant rate is unacceptable. The issue with the blank Facebook canvas responses in Performance L4 mode has to be addressed.
Analysis
The problem is operational because the issue directly impacts operational capacity. The operational solution is to increase the available capacity by installing more F5 BIG-IP LTMs. However, addressing the problem at the operational level will increase the setup complexity and lead to significant expenditures of money and time. A more efficient solution is to troubleshoot the issue and make the application run correctly under the Performance L4 mode.
The operations department will have difficulty troubleshooting the issue because the problem is caused by the interaction between a network appliance and the application servers. The fact that the problem manifests itself as an infrequent Facebook application failure complicates troubleshooting even further. In a traditional operations environment, troubleshooting of such issues oftentimes turns into a finger-pointing match. Network engineers will claim that the problem is with systems, systems engineers will claim that the problem is with the network, quality assurance staff will not be able to reproduce the issue, and developers will watch from the sidelines, not eager to get involved. A DevOps team drawing on collective network, systems, QA, and development knowledge is more likely to succeed in troubleshooting the issue.
Approach
Because the issue occurs at an irregular rate, trying to make sense of it in the production environment will prove difficult, if possible at all. Thus, before attempting to establish the root cause of the issue, the problem has to be isolated and reproduced on a controlled subset of the production environment.
In the actual case of troubleshooting, a simplified Facebook application was created (a Development task), and a dedicated profile was set up on the F5 BIG-IP LTM (a Networking and Systems task). A script was written to make 10,000 requests to the simplified Facebook application and look for responses with a blank Facebook canvas (a QA task). Luckily, responses with a blank Facebook canvas were observed on the simplified setup. The failure rate was estimated to be 15%. The problem was successfully isolated to the controlled subset of the production environment.
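# An example shell script to make a given number of HTTP requests
# ITERATIONS, USERAGENT, APPURL, KEYWORD and RUN must be set beforehand
for i in $(seq 1 "${ITERATIONS}")
do
    echo "Iteration: ${i}"
    # -s silences the progress meter; -A sets the User-Agent header.
    # Double quotes are required so that the variables are expanded.
    curl -s -A "${USERAGENT}" "${APPURL}" 2>&1 \
        | grep "${KEYWORD}" | tee -a "${RUN}"
done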
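Once a run like the one above has finished, the failure rate can be derived from two counts: the number of blank responses and the total number of requests. The following is a minimal sketch of that arithmetic only. The function name `failure_rate` is hypothetical, and the figure of 1,500 blank responses is an illustrative number chosen to match the reported 15% of 10,000 requests, not a measured value from the original run.

```shell
#!/bin/sh
# failure_rate BLANK TOTAL  (hypothetical helper, not part of the original script)
# Prints the share of blank responses as a percentage with one decimal place.
failure_rate() {
    blank="$1"
    total="$2"
    # Guard against division by zero.
    if [ "$total" -le 0 ]; then
        echo "total must be positive" >&2
        return 1
    fi
    awk -v b="$blank" -v t="$total" 'BEGIN { printf "%.1f\n", 100 * b / t }'
}

# Illustrative numbers only: 1,500 blank pages out of 10,000 requests.
failure_rate 1500 10000
```

With these illustrative inputs the function prints 15.0.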
The next step in the investigation was to run tcpdump on the application server. The capture was filtered to show only packets with the TCP SYN flag set. For each blank Facebook canvas response, the filtered capture showed three TCP SYN packets arriving in a row without any TCP SYN,ACK reply sent back by the application server. The kernel running on the application server was not acknowledging the load balancer's attempts to establish a connection. The root cause of the problem was identified. A simple solution that addressed the issue was disabling TCP timestamps on the application server (setting the net.ipv4.tcp_timestamps kernel parameter to 0).
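The capture and the fix described above might look roughly like the sketch below. The function names are hypothetical, and the interface name `eth0`, the HTTP port 80, and the output file name are assumptions about the environment rather than details taken from the original setup. Both commands normally require root privileges, so they are wrapped in functions and nothing runs until a function is called.

```shell
#!/bin/sh
# Hypothetical helpers; interface, port, and file name are assumptions.

# Record HTTP traffic on the application server into a capture file.
# Filtering down to SYN packets is left to a later tshark pass.
capture_http() {
    iface="${1:-eth0}"               # assumed interface name
    outfile="${2:-tcpdump.output}"   # assumed capture file name
    tcpdump -n -i "$iface" -w "$outfile" 'tcp port 80'
}

# Disable TCP timestamps on the running kernel (not persistent across reboots).
disable_tcp_timestamps() {
    sysctl -w net.ipv4.tcp_timestamps=0
}
```

To keep the setting after a reboot, the line `net.ipv4.tcp_timestamps = 0` is typically added to /etc/sysctl.conf, although the exact location can differ between distributions.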
# Tcpdump analysis example
$ tshark -r tcpdump.output tcp.flags.syn == 1

# Good request
46 7.958271 66.220.153.246 -> 1.1.1.1 TCP 62142 > http [SYN] Seq=0 Win=5840 Len=0 MSS=1460 SACK_PERM=1 TSV=3790741991 TSER=0 WS=9
47 7.958300 1.1.1.1 -> 66.220.153.246 TCP http > 62142 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 SACK_PERM=1 TSV=1178123727 TSER=3790741991 WS=8

# Unanswered TCP SYN packets
61 8.964372 66.220.153.250 -> 1.1.1.1 TCP 60278 > http [SYN] Seq=0 Win=5840 Len=0 MSS=1460 SACK_PERM=1 TSV=581641555 TSER=0 WS=9
62 11.963010 66.220.153.250 -> 1.1.1.1 TCP 60278 > http [SYN] Seq=0 Win=5840 Len=0 MSS=1460 SACK_PERM=1 TSV=581644555 TSER=0 WS=9
63 13.958766 66.220.153.245 -> 1.1.1.1 TCP 60775 > http [SYN] Seq=0 Win=5840 Len=0 MSS=1460 SACK_PERM=1 TSV=581646550 TSER=0 WS=9
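Spotting unanswered SYN packets by eye becomes tedious on a large capture. The sketch below shows one way the check could be automated, assuming the tshark text output uses exactly the column layout shown above (source address in column 3, destination address in column 5, ports in columns 7 and 9, flags starting in column 10). Other tshark versions print a different layout, so the column numbers would need adjusting. The function name `unanswered_syns` is hypothetical.

```shell
#!/bin/sh
# unanswered_syns (hypothetical helper): reads tshark text output on stdin
# and prints "client_ip:port count" for every client endpoint that sent
# SYN packets but never received a SYN,ACK in the capture.
unanswered_syns() {
    awk '
        # Client SYN: remember it, keyed by source ip:port.
        $10 == "[SYN]" { syn[$3 ":" $7]++ }
        # Server SYN,ACK (printed as "[SYN, ACK]"): mark the destination
        # ip:port, which is the client endpoint, as answered.
        $10 == "[SYN," { answered[$5 ":" $9] = 1 }
        END {
            for (k in syn)
                if (!(k in answered))
                    print k, syn[k]
        }
    ' | sort
}
```

A possible invocation is `tshark -r tcpdump.output tcp.flags.syn == 1 | unanswered_syns`. Comment lines and blank lines are ignored because their tenth column never matches a flag token.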