Mike "Bear" Taylor

You’re looking at your todo list and pondering what code to write during one of the brief moments of free time in your daily schedule, when all of a sudden you get a message in team chat: “Is the site down for anyone else?”

It can be a frustrating experience, but never fear; you’re not alone. We here at &yet experienced this type of outage once before, and then again this week. In fact, nearly every operations team has experienced at least a variation on the above nightmare. It is just a matter of time before you have to deal with people thinking your site or service is down when the problem is really with the Domain Name System (DNS). Even shops that spend a lot of money to work with DNS vendors who themselves have serious redundancy and scale will eventually fall prey to an orchestrated Distributed Denial of Service (DDoS) attack.

Continue reading »

Mike "Bear" Taylor

“If you don’t monitor it, you can’t manage it.”

In the last installment of our Tao of Ops series I pointed out the above maxim as being a variation on the business management saying, “You can’t manage what you can’t measure” (often attributed to Peter Drucker). This has become one of the core principles I try to keep in mind while walking the Operations path.

Keeping this in mind, today I want to tackle testing the TLS certificates found everywhere in any shop doing web-related production work - something that needs to be done, and that can be rather involved to do properly.
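
To give a flavor of what such a check involves, here is a minimal sketch - not the full approach from the post - that uses Python’s standard ssl module to fetch a certificate and warn when it is close to expiring. The hostnames are placeholders; substitute whatever your shop actually serves.

```python
import socket
import ssl
from datetime import datetime, timezone

def days_until_expiry(hostname, port=443):
    """Fetch hostname's TLS certificate and return the days left before it expires."""
    context = ssl.create_default_context()
    with socket.create_connection((hostname, port), timeout=10) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as tls:
            cert = tls.getpeercert()
    # 'notAfter' is a GMT timestamp; cert_time_to_seconds converts it to epoch seconds.
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(cert["notAfter"]), tz=timezone.utc)
    return (expires - datetime.now(timezone.utc)).days

if __name__ == "__main__":
    # Placeholder hostnames - point this at the endpoints you actually serve.
    for host in ("example.com", "www.example.com"):
        remaining = days_until_expiry(host)
        status = "OK" if remaining > 30 else "WARNING"
        print(f"{status}: {host} certificate expires in {remaining} days")
```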

Continue reading »

Mike "Bear" Taylor

“If you don’t monitor it, you can’t manage it.”

That’s a variation on the business management saying “you can’t manage what you can’t measure” (often attributed to Peter Drucker). The saying might not always apply to business, but it definitely applies to Operations.

There are a lot of tools you can bring into your organization to help with monitoring your infrastructure, but they usually look at things only from the “inside perspective” of your own systems. To truly know if the path your Operations team is walking is sane, you need to also check on things from the user’s point of view. Otherwise you are missing the best chance to fix something before it becomes a problem that leads your customers to take their business elsewhere.
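
As a rough illustration of that outside-in idea, here is a minimal sketch that fetches a URL the way a user would and reports the status and response time. The URL is a placeholder, and a real check would run from a host outside your own network, on a schedule, with alerting attached.

```python
import time
import urllib.error
import urllib.request

def check_from_outside(url, timeout=10):
    """Fetch url the way a user would; return (ok, detail, seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200, f"HTTP {response.status}", time.monotonic() - start
    except (urllib.error.URLError, OSError) as exc:
        return False, str(exc), time.monotonic() - start

if __name__ == "__main__":
    # Hypothetical endpoint - point this at whatever your users actually hit.
    ok, detail, seconds = check_from_outside("https://example.com/")
    print(f"{'OK' if ok else 'FAIL'} ({detail}) in {seconds:.2f}s")
```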

Continue reading »

Bear

Every Operations Team needs to maintain the system packages installed on their servers. There are various paths toward that goal, with one extreme being to track the packages manually - a tedious, soul-crushing endeavor even if you automate it using Puppet, Fabric, Chef, or (our favorite at &yet) Ansible.

Why? Because even when you automate, you have to be aware of what packages need to be updated. Automating “apt-get upgrade” will work, yes - but you won’t discover any regression issues (and related surprises) until the next time you cycle an app or service.

A more balanced approach is to automate the tedious aspects and let the Operations Team handle the parts that require a purposeful decision. How the upgrade step is performed, via automation or manually, is beyond the scope of this brief post. Instead, I’ll focus on the first step: how to gather data that can be used to make the required decisions.
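
As a sketch of that first data-gathering step on a Debian/Ubuntu host (assuming apt-get is your package manager), the following only simulates an upgrade and reports which packages would change, leaving the actual decision to a human.

```python
import subprocess

def pending_upgrades():
    """Run a dry-run upgrade and return the packages apt would update."""
    # -s (simulate) makes no changes; it just reports what would happen.
    output = subprocess.run(
        ["apt-get", "-s", "upgrade"],
        capture_output=True, text=True, check=True,
    ).stdout
    pending = []
    for line in output.splitlines():
        # Simulated actions look like: "Inst openssl [1.0.1f-1] (1.0.1g-1 ...)"
        if line.startswith("Inst "):
            pending.append(line.split()[1])
    return pending

if __name__ == "__main__":
    packages = pending_upgrades()
    print(f"{len(packages)} packages have pending upgrades:")
    for name in packages:
        print(f"  {name}")
```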

Continue reading »

Marcus Stong

On the &yet Ops Team, we use Docker for various purposes on some of the servers we run. We also make extensive use of iptables so that we have consistent firewall rules to protect those servers against attacks.

Unfortunately, we recently ran into an issue that prevented us from building Dockerfiles from behind an iptables firewall.

Here’s a bit more information about the problem, and how we solved it.

The Problem

When we tried to run docker build on a host using our default DROP policy-based iptables rule set, apt-get was unable to resolve the repository hosts in Dockerfiles built FROM ubuntu or debian images.
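
The fix is covered in the full post; purely as a starting point for diagnosing this class of failure yourself, here is a small sketch - an assumption about where to look, not the solution from the post - that checks whether the FORWARD chain would drop traffic from Docker’s default docker0 bridge, which is the path a container’s DNS lookups take during a build. It needs to run as root.

```python
import subprocess

def forward_rules():
    """Return the host's FORWARD chain as a list of iptables rule specs."""
    result = subprocess.run(
        ["iptables", "-S", "FORWARD"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.splitlines()

if __name__ == "__main__":
    rules = forward_rules()
    policy_drop = "-P FORWARD DROP" in rules
    bridge_allowed = any("docker0" in rule and "-j ACCEPT" in rule for rule in rules)
    if policy_drop and not bridge_allowed:
        print("FORWARD policy is DROP and nothing accepts docker0 traffic;")
        print("container DNS lookups during `docker build` will likely fail.")
    else:
        print("FORWARD chain looks permissive enough for the docker0 bridge.")
```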

Continue reading »

Bear

One of the best tools to use every day for locking down your servers is iptables. (You do lock down your servers, right? ;-)

Not using iptables is akin to having fancy locks on a plywood door - sure, it looks secure, but you just cannot be confident that someone won’t be able to break through.

To this end I use a small set of bash scripts that ensure I always have a baseline iptables configuration and that let me add or remove items quickly.

Let me outline what they are before we get to the fiddly bits…
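
The scripts themselves are in the full post. Purely to illustrate what a baseline can contain, here is a Python sketch (the real thing is a set of bash scripts) that applies a minimal default-DROP rule set: flush INPUT, allow loopback and established traffic, and keep SSH open. The specific rules are assumptions for illustration, not the author’s actual configuration.

```python
import subprocess

# A minimal baseline: default DROP on INPUT, loopback allowed, established
# connections allowed, and SSH kept open so we don't lock ourselves out.
# Adjust ports and interfaces to match your own baseline.
BASELINE = [
    ["iptables", "-F", "INPUT"],
    ["iptables", "-P", "INPUT", "DROP"],
    ["iptables", "-A", "INPUT", "-i", "lo", "-j", "ACCEPT"],
    ["iptables", "-A", "INPUT", "-m", "state", "--state",
     "ESTABLISHED,RELATED", "-j", "ACCEPT"],
    ["iptables", "-A", "INPUT", "-p", "tcp", "--dport", "22", "-j", "ACCEPT"],
]

def apply_baseline():
    """Apply the baseline rules in order; must be run as root."""
    for rule in BASELINE:
        subprocess.run(rule, check=True)

if __name__ == "__main__":
    apply_baseline()
```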

Continue reading »

Bear

So Heartbleed happened, and if you’re a company or individual with public-facing assets behind anything that uses OpenSSL, you need to respond to this now.

The first thing we had to do at &yet was determine what was actually impacted by this disclosure. We had to make a list of which services are public-facing, which services use OpenSSL directly or indirectly, and which services use keys/tokens that are cryptographically generated. It’s easy to only update your web servers, but really that is just one of many steps.

Here is a list of what you can do to respond to this event.
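
The full checklist is in the post. As a first-pass example of the “determine what is impacted” step, here is a small sketch that flags OpenSSL versions in the range known to carry Heartbleed (1.0.1 through 1.0.1f; 1.0.1g fixed it). Keep in mind that distro packages often backport the fix without changing the version string, so this is a starting point, not a verdict.

```python
import re
import ssl
import subprocess

# OpenSSL 1.0.1 through 1.0.1f shipped with the Heartbleed bug; 1.0.1g fixed it.
# Distro packages may backport the fix without bumping the version string,
# so a match here means "check further", not "definitely vulnerable".
VULNERABLE = re.compile(r"OpenSSL 1\.0\.1[a-f]?(\s|$)")

def check(label, version_string):
    flagged = bool(VULNERABLE.search(version_string))
    print(f"{'CHECK' if flagged else 'ok   '} {label}: {version_string.strip()}")

if __name__ == "__main__":
    # What the Python interpreter itself is linked against.
    check("python ssl module", ssl.OPENSSL_VERSION)
    # What the system openssl binary reports.
    out = subprocess.run(["openssl", "version"], capture_output=True, text=True).stdout
    check("openssl binary", out)
```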

Continue reading »

Bear

As more and more people enjoy the Internet as part of their everyday lives, so too are they experiencing its negative aspects. One such aspect is that sometimes the web site you are trying to reach is not accessible. While sites can be out of reach for many reasons, recently one of the more obscure causes has moved out of the shadows: the Denial of Service attack, also known as a DoS attack, along with its bigger sibling, the Distributed Denial of Service (DDoS) attack.

Why these attacks are able to take web sites offline is right there in the name: they deny you access to a web site. How they make web sites unavailable varies, though, and quickly gets into the more technical aspects of how the Internet works. My goal is to describe what happens during these attacks and to identify and clarify the key aspects of the problem.

Continue reading »

Nathan LaFreniere

Deploying a production application can be quite the chore. On the road to &! 2.0, our processes have changed significantly. In the beginning stages of &! 1.0, I hate to say it, but deploys were a completely manual process. We logged in to the server over SSH, pulled from Git, and restarted processes all by hand. Less than ideal, to say the least.

Managing those processes was just as bad; we were using forever and a simple SysVInit script (those things in /etc/init.d for you non-ops types) to run the app. When the process would crash, forever would restart it and we’d be happy. Everything seemed great, but then one day we accidentally pushed broken code live. What did forever do? Kept trying to help us by restarting the process - the process that now crashed instantly. Several CPU usage warning emails from our hosting provider later, we realized what had happened and fixed the broken code. That’s when we realized that blindly restarting the app when it crashes wasn’t a great idea.
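
What we actually moved to is covered in the full post. Just to illustrate the lesson, here is a sketch of a supervisor that backs off and eventually gives up when a process keeps crashing immediately, instead of restarting it forever. The command and thresholds are placeholders.

```python
import subprocess
import time

def supervise(command, max_quick_failures=5, min_uptime=10, backoff=5):
    """Restart command when it exits, but give up if it keeps crashing immediately."""
    quick_failures = 0
    while quick_failures < max_quick_failures:
        started = time.monotonic()
        result = subprocess.run(command)
        uptime = time.monotonic() - started
        if uptime < min_uptime:
            # Exited almost immediately - likely broken code, not a transient hiccup.
            quick_failures += 1
            print(f"exit {result.returncode} after {uptime:.1f}s; backing off")
            time.sleep(backoff * quick_failures)  # back off harder each consecutive failure
        else:
            quick_failures = 0  # it ran for a while, so reset the failure counter
    print(f"{command!r} crashed {max_quick_failures} times in a row; giving up.")

if __name__ == "__main__":
    # Hypothetical entry point - substitute your app's actual start command.
    supervise(["node", "server.js"])
```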

Continue reading »

Nathan Fritz

The Problem

When I was at FOSDEM last weekend, I talked to several people who couldn’t believe that I would use Redis as a primary database in single-page webapps. When I mentioned that on Twitter, someone said, “Redis really only works if it’s acceptable to lose data after a crash.”

For starters, read http://redis.io/topics/persistence. What makes Redis different from other databases in terms of reliability is that a command can return “OK” before the data is written to disk (I’ll get to this). Beyond that, it is easy to take snapshots, compress append-only log files, and configure fsync behavior in Redis. There are tests for dealing with disk access being suddenly cut off while writing, and steps are taken to prevent this from causing corruption. In addition, you have redis-check-aof for dealing with log file corruption.
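
As a small illustration of those knobs (using the redis-py client; the same settings normally live in redis.conf), here is a sketch that turns on the append-only file and picks how aggressively Redis fsyncs it. The values are illustrative, not a recommendation for any particular workload.

```python
import redis

# Connect to a local Redis instance (adjust host/port for your setup).
r = redis.Redis(host="localhost", port=6379)

# Turn on the append-only file and fsync it once per second - a common
# middle ground between "every write" (safest, slowest) and "let the OS
# decide" (fastest, riskiest).
r.config_set("appendonly", "yes")
r.config_set("appendfsync", "everysec")

# Also keep RDB snapshots: save if at least 1000 keys changed within 60 seconds.
r.config_set("save", "60 1000")

print(r.config_get("appendonly"), r.config_get("appendfsync"))
```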

Continue reading »