I have a strange attachment to Nagios web interface. Even though we kicked Nagios itself out of our infrastructure some time ago, we are still keeping a more moden fork of Nagios, Icinga, around as a cornerstone of our event-monitoring infrastructure. With quite a bit of effort spent writing cookbooks and monitors, Icinga is fully automated, and I can enjoy watching and showing my Nagios Web.

Nagios is loved and hated in the industry. Nagios has a great brand recognition, but suffers from less than intuitive setup and difficult to automate configuration. With passive checks and right Chef cookbooks, Nagios/Icinga setup becomes very effective.

Another Nagios weakness is poor implementation. Fortunately, Nagios poorly implements right ideas, so not everything is lost. This war story is about fixing Nagios NSCA implementation. Nagios NSCA is a Nagios passive-check daemon written in C that runs on a Nagios server and processes all passive check sent by Nagios clients.


One of my favorite DevOps tools, Chef, is gaining popularity by the day. Github alone has over a thousand Chef cookbooks publicly accessible.

Starting to use Chef is easy. Realizing the full power of Chef is not quite as easy. Many Chef cookbooks that I looked at can be significantly improved. Cookbooks created by developers can be improved by using systems engineering patterns. Cookbooks created by system administrators can be improved by using software engineering patterns.

Opscode does a very good job describing building blocks of Chef cookbooks. Explaining how to write good Chef cooks is something I have not seen done yet. In this post I will go over my cookbook writing process using a syslog-ng cookbook as an example.


This story is from a year ago. I kept telling myself that I got to write a blog post about it and here it is: a continuous deployment troubleshooting case.

One of the application server pools running a LAMP application (Linux / Apache/ MySQL / Memcached / PHP) was showing the following issue. Every time new code was deployed to the pool, Apache processes on 10-15 random servers out of 300+ would deadlock and stop serving requests. The problem was happening for some time but was not perceived as a big issue. Random Apache deadlocks did not impact the end users, and restarts of deadlocked Apache instances returned services to the operational state. I was not very comfortable with this behavior, so I decided to dig in and see what was going on.



With software development teams universally embracing Agile project management methodology, the question about Agile in Operations comes - up - frequently. Many view "agile sysadmin" as a key component of the DevOps movement. Recently, I had to give an answer to the "agile in operations" questions to my coworkers. Here's my take on it.


My Linux-focused list of kernel management tasks includes:

  • Following LKML (Linux Kernel Mailing List) for at least the kernel branch of my choice,
  • Following NETDEV list because significant portion of my tasks revolves around network,
  • Using GitWeb to port drivers and updates to the custom tree,
  • Maintaining a custom kernel source tree,
  • Building and packaging custom kernels to run in production.


DevOps Days 2010 had a great pannel discussion on Infrastructure as Code. The responses to one question (asked at 27:30) did surprise me. The question was "Who here goes to kernel.org grabs the sources and compiles them?" No a single panelist raised a hand. I find it shocking that all distinguished DevOps evangelists on the panel are running stock distribution kernels.


Problem

Requests to a LAMP-based Facebook application are load balanced between server nodes by an F5 BIG-IP Local Traffic Manager (LTM). The F5 BIG-IP LTM has two components: an ASIC-based fast-switching component and an AMD Opteron-based software-switching component. Layer 4 load balancing between the application server nodes can be handled either exclusively by the F5 ASIC (Performance L4 mode), or exclusively by F5 software components (Standard mode), or by a combination of F5 ASIC and F5 software components. With the Standard mode turned on, F5 BIG-IP LTM capacity utilization is at 60%. With the Performance L4 mode turned on, F5 BIG-IP LTM capacity utilization is at 20%. However, with the Performance L4 mode turned on, developers report seeing an occasional blank Facebook canvas page returned in response to legitimate application requests. Serving blank Facebook canvas responses at any significant rate is unacceptable. The issue with the blank Facebook canvas responses in Performance L4 mode has to be addressed.



Blogger Templates for WP 2 Blogger sponsored by Cinta.
Content Copyright © 2010 - 2011 Artem Veremey, All Rights Reserved
preload preload preload