DevOps Notes

Living the Culture

Troubleshooting DHCP in AWS

ami, aws, cloud, linux, troubleshooting

Building AWS AMIs from scratch is a practice that many security-focused organizations strictly follow.

I have been building custom AWS AMIs from scratch for my personal AWS environment for many years. Looking at my personal AMI collection, I have custom AMIs dating back to CentOS 5.

A few days ago I decided to do another refresh of my AWS environment and started with a new and improved build of CentOS 7.9. I reworked my Packer build, merging in several improvements. OVA creation went flawlessly and the AWS vmimport finished successfully. When I booted my new AMI to upgrade the stock CentOS kernel to the latest kernel-lt 5.4, I noticed that DNS resolution was failing and that /etc/resolv.conf had not been updated correctly:

$ cat /etc/resolv.conf
# Generated by NetworkManager

I had to troubleshoot and fix DHCP on my new image.
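
As a rough illustration of the symptom, a resolv.conf that holds only the NetworkManager header comment has no nameserver entries at all, so nothing on the host can resolve names. The sketch below is not taken from the post: the function name nameservers and the sample address 10.0.0.2 are assumptions made for the example. It parses resolv.conf-style text and lists the nameservers it finds.

```python
def nameservers(resolv_conf_text):
    """Return the nameserver addresses found in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        stripped = line.strip()
        # Skip blank lines and comment lines ('#' or ';').
        if not stripped or stripped[0] in "#;":
            continue
        fields = stripped.split()
        # Keep only well-formed "nameserver <address>" entries.
        if fields[0] == "nameserver" and len(fields) >= 2:
            servers.append(fields[1])
    return servers


# The broken file from the excerpt: nothing but the header comment.
broken = "# Generated by NetworkManager\n"

# A hypothetical healthy file; the address is an assumption for illustration.
healthy = "# Generated by NetworkManager\nnameserver 10.0.0.2\n"

print(nameservers(broken))
print(nameservers(healthy))
```

An empty list for the first file means the DHCP-supplied DNS servers never reached resolv.conf, which points the investigation at the DHCP client rather than at DNS itself.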

Work Story: Fixing Nagios/Icinga NSCA

chef, linux, troubleshooting, work story

I have a strange attachment to the Nagios web interface. Even though we kicked Nagios itself out of our infrastructure some time ago, we still keep Icinga, a more modern fork of Nagios, around as a cornerstone of our event-monitoring infrastructure. With quite a bit of effort spent writing cookbooks and monitors, Icinga is fully automated, and I can enjoy watching and showing off my Nagios web interface.

Nagios is both loved and hated in the industry. It has great brand recognition, but suffers from a less-than-intuitive setup and a configuration that is difficult to automate. With passive checks and the right Chef cookbooks, a Nagios/Icinga setup becomes very effective.

Another Nagios weakness is poor implementation. Fortunately, Nagios poorly implements the right ideas, so not everything is lost. This war story is about fixing the Nagios NSCA implementation. Nagios NSCA is a passive-check daemon written in C that runs on a Nagios server and processes all passive checks sent by Nagios clients.

How to write a good Chef cookbook

chef, devops, howto

One of my favorite DevOps tools, Chef, is gaining popularity by the day. GitHub alone has over a thousand publicly accessible Chef cookbooks.

Starting to use Chef is easy. Realizing the full power of Chef is not quite as easy. Many Chef cookbooks that I looked at can be significantly improved. Cookbooks created by developers can be improved by using systems engineering patterns. Cookbooks created by system administrators can be improved by using software engineering patterns.

Opscode does a very good job of describing the building blocks of Chef cookbooks. Explaining how to write good Chef cookbooks is something I have not seen done yet. In this post I will go over my cookbook writing process using a syslog-ng cookbook as an example.

Work Story: Troubleshooting LAMP stack

continuous deployment, devops, lamp, troubleshooting, work story

This story is from a year ago. I kept telling myself that I had to write a blog post about it, and here it is: a continuous deployment troubleshooting case.

One of the application server pools running a LAMP application (Linux / Apache / MySQL / Memcached / PHP) was showing the following issue. Every time new code was deployed to the pool, Apache processes on 10-15 random servers out of 300+ would deadlock and stop serving requests. The problem had been happening for some time but was not perceived as a big issue. The random Apache deadlocks did not impact end users, and restarting the deadlocked Apache instances returned services to an operational state. I was not very comfortable with this behavior, so I decided to dig in and see what was going on.

Agile in Operations

agile, devops, pm

With software development teams universally embracing Agile project management methodology, the question about Agile in Operations comes up frequently. Many view "agile sysadmin" as a key component of the DevOps movement. Recently, I had to answer the "agile in operations" question for my coworkers. Here's my take on it.

Managing a Linux kernel build

howto, kernel, linux, rpm

My Linux-focused list of kernel management tasks includes:

Following LKML (Linux Kernel Mailing List) for at least the kernel branch of my choice,
Following the NETDEV list because a significant portion of my tasks revolves around networking,
Using GitWeb to port drivers and updates to the custom tree,
Maintaining a custom kernel source tree,
Building and packaging custom kernels to run in production.

Running a Custom Kernel

devops, kernel

DevOps Days 2010 had a great panel discussion on Infrastructure as Code. The responses to one question (asked at 27:30) surprised me. The question was "Who here goes to kernel.org, grabs the sources and compiles them?" Not a single panelist raised a hand. I find it shocking that all the distinguished DevOps evangelists on the panel are running stock distribution kernels.

Case Study: Troubleshooting

devops, troubleshooting, work story

Problem

Requests to a LAMP-based Facebook application are load balanced between server nodes by an F5 BIG-IP Local Traffic Manager (LTM). The F5 BIG-IP LTM has two components: an ASIC-based fast-switching component and an AMD Opteron-based software-switching component. Layer 4 load balancing between the application server nodes can be handled either exclusively by the F5 ASIC (Performance L4 mode), or exclusively by F5 software components (Standard mode), or by a combination of F5 ASIC and F5 software components. With the Standard mode turned on, F5 BIG-IP LTM capacity utilization is at 60%. With the Performance L4 mode turned on, F5 BIG-IP LTM capacity utilization is at 20%. However, with the Performance L4 mode turned on, developers report seeing an occasional blank Facebook canvas page returned in response to legitimate application requests. Serving blank Facebook canvas responses at any significant rate is unacceptable. The issue with the blank Facebook canvas responses in Performance L4 mode has to be addressed.

DevOps? Obligatory Post

devops, metaphysics

What's a DevOps blog without the discussion about what DevOps is?

If nothing else, DevOps is a buzzword that means different things to different people.

Ted Dziuba thinks that DevOps is a trend of nonsense "where system administrators start writing unit tests and other things to help the developers warm up to them." DevOps is certainly not about system administrators writing unit tests. DevOps is not about Ted Dziuba learning "Operations" (aka UNIX) either. DevOps is about system administrators, Ted Dziuba, and developers working closely together sharing knowledge about unit tests, UNIX, etc. to produce and deliver products.