It was an exceptionally long week, and you managed to get to bed around midnight. You’re a system admin, and the core of your job is keeping the systems running. Tonight you are on call: if something goes wrong, you’re the first to know and are responsible for responding.
You are specifically responsible for ensuring the availability of your company’s website. Unlike other sites, yours is an online commerce store that serves a global audience. The company is yielding $10k in new sales an hour.
It’s imperative that customers can access your site.
It’s 2 am. Your phone starts to light up like a Christmas tree. PagerDuty is having a meltdown and you’re on the receiving end. Your Slack notifications are hitting Slack’s notification thresholds. Text messages are pouring in.
Little does the chaos know you forgot to turn on your notifications. There you lie, peacefully, thinking the world is anything but what it is.
The flickering lights and vibration finally get the attention of your dog, who starts to growl at the inconvenience. The break in the evening noise catches your attention. You open your weary eyes and see your phone dancing in the mist of the evening grog.
It hits you. You grab your phone, and it takes a fraction of a second to realize what has happened: you’re down.
For the better part of 10 years, that’s the world that Daniel and I lived in, and continue to live in with our projects. We serviced hundreds of thousands of businesses, of all sizes, around the world with incident detection, compromise mitigation services and availability assurances through our CDN / WAF. But through that entire experience, outages happened. It’s the harsh reality of working on networks.
What we realized is that we needed a better solution for detecting, mitigating and recovering from these availability incidents. That’s why we are introducing NOC.org.
With NOC.org, the scenario above would have been identified and mitigated seamlessly for the user via some of the platform’s smart routing features.
Automating Detection of Incidents, Mitigating Issues, and Seamless Recovery
One of the weakest aspects of monitoring availability incidents is that it almost always requires manual intervention. Not because the technologies don’t exist, but because users often lack the knowledge and expertise to implement the appropriate mitigating controls. In many more instances it’s because the platforms themselves make it too complicated.
NOC.org works to modernize the approach by integrating technologies together. Similar tools, but integrated to help make better decisions for users. If there is one thing we have learned over the years, it is that the world isn’t lacking in tools; it’s lacking in the ability to parse through the noise and make decisions.
Using Authoritative DNS and the NOC.org smart routing features, a user is able to create enhanced records. These records allow you to create a failover and recovery construct between two nodes that works for you in any incident.
How NOC.org Would Respond to an Availability Incident
In the following illustrations I’ll show you what would have happened in the scenario above:
1 – Normal traffic flow to your web server:
2 – NOC.org detects an issue with the Primary and redirects traffic to the Failover within minutes:
3 – NOC.org detects recovery and routes traffic back to the Primary:
To do this, NOC.org merges different technologies to a) detect issues, and b) automatically respond and recover on behalf of the organization, all through the use of Authoritative DNS and smart routing features.
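To make the three steps above concrete, here is a minimal Python sketch of the decision a monitor-backed DNS record has to make. It is not NOC.org’s implementation or API: the class name `FailoverRecord`, the `observe` method, the thresholds, and the documentation-range IP addresses are all assumptions made up for illustration. The idea is only that a run of failed checks flips the answer to the failover node, and a run of healthy checks flips it back.

```python
from dataclasses import dataclass, field


@dataclass
class FailoverRecord:
    """Hypothetical 'enhanced' record: one name, two nodes, one active answer."""

    name: str
    primary: str
    failover: str
    failure_threshold: int = 3    # assumed: consecutive failed checks before failing over
    recovery_threshold: int = 2   # assumed: consecutive healthy checks before recovering
    active: str = field(default="", init=False)
    _fails: int = field(default=0, init=False, repr=False)
    _oks: int = field(default=0, init=False, repr=False)

    def __post_init__(self) -> None:
        # Normal traffic flow: answer with the primary node.
        self.active = self.primary

    def observe(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the address to answer with."""
        if primary_healthy:
            self._oks += 1
            self._fails = 0
            if self.active == self.failover and self._oks >= self.recovery_threshold:
                self.active = self.primary      # recovery detected
        else:
            self._fails += 1
            self._oks = 0
            if self.active == self.primary and self._fails >= self.failure_threshold:
                self.active = self.failover     # issue detected
        return self.active


record = FailoverRecord("www.example.com", "192.0.2.10", "198.51.100.20")
checks = [True, False, False, False, False, True, True]
answers = [record.observe(healthy) for healthy in checks]
```

Requiring several consecutive results in either direction is a design choice in this sketch: it keeps a single dropped check from bouncing traffic back and forth between the two nodes.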
Binding Monitors with Authoritative DNS Services
One way to tackle availability incidents is to leverage the Domain Name System (DNS), specifically Authoritative DNS (quick primer on DNS).
Authoritative DNS servers are a critical part of how the web works. They contain all the information associated with a domain, known as records. These records are stored in a container known as a zone.
Every domain (e.g., perezbox.com) has a set of records. These records tell the web where to find information for a domain.
For example, I leverage email@example.com as my email. I use what is known as an MX record in my domain’s zone file to tell the web how to route email to my inbox. Additionally, I have a website that leverages an A record, which tells the internet where to find the content of my site. That’s about as deep as I’ll go into zones here, but understand that every domain has one, and the piece of the DNS ecosystem that controls these zones is known as the Authoritative DNS.
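As a rough illustration of what a zone holds, the Python sketch below models a few records as plain data and looks them up by name and type. The `Record` type, the `lookup` helper, the TTLs, and the addresses are hypothetical values chosen for the example; they are not taken from any real zone.

```python
from typing import List, NamedTuple


class Record(NamedTuple):
    name: str    # fully qualified name the record belongs to
    rtype: str   # record type, e.g. "A" or "MX"
    ttl: int     # how long resolvers may cache the answer, in seconds
    value: str   # the answer itself


# A tiny, made-up zone for example.com.
ZONE = [
    Record("example.com.", "A", 300, "192.0.2.10"),            # where the website lives
    Record("example.com.", "MX", 3600, "10 mail.example.com."),  # where email is delivered
    Record("mail.example.com.", "A", 3600, "192.0.2.25"),      # address of the mail host
]


def lookup(zone: List[Record], name: str, rtype: str) -> List[str]:
    """Return every value in the zone matching the given name and record type."""
    return [r.value for r in zone if r.name == name and r.rtype == rtype]


website = lookup(ZONE, "example.com.", "A")
mail = lookup(ZONE, "example.com.", "MX")
```

An authoritative server does essentially this on every query: it finds the records in the zone that match the question and returns them.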
These zones are typically a feature embedded within a platform like a Registrar or a CDN provider.
Registrars are those that sell you the domain; think of a NameCheap. A Content Distribution Network (CDN), meanwhile, helps ensure performance and availability, something like our alma mater, Sucuri. Both have their own reasons for wanting to retain a domain’s zone information, and in doing so treat it as an embedded feature.
Note: Some CDNs don’t allow you to use other Authoritative DNS providers. While an antiquated approach, this would make it impossible to use them with NOC.org.
As the domain owner, you have the ability to choose who you want to manage your zone. You have the ability to move your authoritative DNS to another provider. Doing so will often help provide failover and redundancy, especially when you have all your eggs in one basket: Registrar, DNS, CDN, WAF, etc.
It all works great, until it doesn’t.
Ensuring Business Continuity
Things go down. That is a hard lesson we learned running our own CDN / WAF for years. You can do everything in your power to ensure the service is never disrupted, but Murphy often has other plans, whether it’s a partner disruption or something as innocuous as an oversight during a PR.
Leveraging an independent Authoritative DNS can add considerable peace of mind to an organization that depends heavily on its online presence.
NOC.org is here to help provide that. Think of us as a complementary solution, not a replacement.