rundeck | Tags | PagerDuty

What is Runbook Automation? by Catherine Craglow

Catherine Craglow — Wed, 03 May 2023 15:00:35 +0000

The post What is Runbook Automation? appeared first on PagerDuty.

Rundeck by PagerDuty: State of the Community 2023 by Nisha Prajapati

Nisha Prajapati — Thu, 30 Mar 2023 21:37:52 +0000

The post Rundeck by PagerDuty: State of the Community 2023 appeared first on PagerDuty.

Jobs 101: Workflow Best Practices by Nisha Prajapati

Nisha Prajapati — Thu, 30 Mar 2023 21:09:25 +0000

The post Jobs 101: Workflow Best Practices appeared first on PagerDuty.

Getting Started Workshop: Rundeck By PagerDuty by Nisha Prajapati

Nisha Prajapati — Mon, 12 Dec 2022 20:49:04 +0000

The post Getting Started Workshop: Rundeck By PagerDuty appeared first on PagerDuty.

Getting Started with the Rundeck Ansible Integration by Nisha Prajapati

Nisha Prajapati — Tue, 22 Nov 2022 19:11:53 +0000

The post Getting Started with the Rundeck Ansible Integration appeared first on PagerDuty.

Analyst Report: PagerDuty unifies IT operations with a blend of insights and automation by

Tue, 11 Oct 2022 18:06:50 +0000

The post Analyst Report: PagerDuty unifies IT operations with a blend of insights and automation appeared first on PagerDuty.

Introduction to PagerDuty Process Automation by Traci Myers

Traci Myers — Tue, 09 Aug 2022 15:44:48 +0000

The post Introduction to PagerDuty Process Automation appeared first on PagerDuty.

Automating Common Diagnostics for Kubernetes, Linux, and other Common Components by Joseph Mandros

Joseph Mandros — Wed, 27 Jul 2022 13:00:45 +0000

Watch our Automated Diagnostics webinar on demand to learn about common diagnostics for common components and how we provide out-of-the-box job templates for you to get started right away.

This is the second piece in a series about automated diagnostics, a common use case for the PagerDuty Process Automation portfolio.

In the last piece, we talked about the basics around automated diagnostics and how teams can use the solution to reduce escalations to specialists and empower responders to take action faster. In this blog, we’re going to talk about some basic diagnostics examples for components that are most relevant to our users.

But before we jump in, let’s make clear what automated diagnostics isn’t, based on some audience feedback on Twitter from the last article:

Automated diagnostics is different from alert correlation. Alert correlation depends on a specified depth of signals, as well as an engine that can properly identify said correlated signals. Automated diagnostics is meant to help the first responder triangulate the source of the issue to either fix the issue faster themselves, or escalate more accurately.
Automated diagnostics is different from monitoring. Monitoring is purpose-built to identify undesired states in performance or activity. This means that most monitoring is not purpose-built to emulate a first-responder’s activities to validate a true positive, or identify the first actions to take. Monitoring is focused on raising the alert. Automated diagnostics is focused on determining how to fix an issue once the alert is already created.

That said, automated diagnostics can certainly make use of data collected by monitoring tools, since most people don’t apply thresholds to every data point they collect. In fact, one of our more commonly used diagnostics integrations queries CloudWatch logs. While we might consider a log aggregator a monitoring tool, sometimes the first step of an investigation is to look at data in the monitoring tool that exists purely for diagnosing issues.

Providing responders with on-demand or pre-run diagnostic capabilities for their own environments can help a first responder quickly determine probable cause, thereby pulling in fewer individuals to assist with the incident. By providing first-responders with “diagnostic” data that is typically only retrievable by domain experts, the need to pull in more people for troubleshooting incidents is reduced significantly. This in turn drives down the cost of incidents and reduces mean time to response (MTTR) by automating the investigative steps that are typically manual in nature.

The status quo: Automation in incident response

Operations managers are often excited about the idea of enabling self-healing or auto-remediation. It’s a natural inclination to assume that speeding up resolution through automation means “applying a cure.” But more often than not, the industry theory of “no two incidents are truly identical” rears its head. When you have a high degree of variability, this reduces the value of such potential automation since it’s less likely to be run. For example, restarting a core service may be the right way to fix today’s issue, but it could lead to a cascading failure—and an even bigger incident—tomorrow.

*The reader now switches cognitive gears to the initial stages of a response.*

But you know what tends to be highly repetitive? The investigative steps a responder takes to begin diagnosing what went wrong. More repetitive action means more value to gain from applying automation. For example, let’s say an incident kicks off within your Kubernetes distribution. No matter the nature of the incident, whether it involves your image repository or your load balancer, you’re likely still going to take the same diagnostic step of pulling your Kubernetes logs.

These diagnostic steps remain largely static for a given component, no matter the priority of the incident. Automated diagnostics can be applied to heterogeneous incidents: it doesn’t have to be purpose-built for a single recurring incident, and it can be customized around all sorts of common incident types and severities, specific to your environment, for almost any common component. Think of it like going to the doctor’s office. Whether you are going to urgent care for a specific complaint or just an annual checkup, they still take your temperature, blood pressure, and weight when you walk in.

Common Examples

Every developer environment is different, but many environments are also quite similar once you really pop the hood. In the beginning stages of a response, most diagnostics will come from three main data sources:

Application data
System data
Environment data

There are several examples of common diagnostics and components that can be automatically pulled at the beginning of a response. This not only helps the responder better understand the severity of the incident, but also helps ensure the responder doesn’t pull in too many specialists and interrupt their normal day of work. For example, let’s look at Kubernetes (k8s) as a component for a responder during an incident. When an incident happens within a k8s environment, the infrastructure engineer who maintains the technology would typically perform actions such as:

Tail logs from k8s pod
Retrieve logs from k8s by selector label
Check image repo
Describe deployment
Execute command in pod

One thing all of these actions have in common? A typical L1 responder ACK’ing an incident doesn’t know how to orchestrate these actions; it’s just not their area of expertise. But with the out-of-the-box jobs from PagerDuty’s Automated Diagnostics, the L1 responder can automatically run these diagnostics and execute these jobs, which speeds up the response and reduces escalations to the infrastructure engineer responsible for the k8s environment.
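To make the k8s actions listed above concrete, here is a minimal Python sketch that builds the `kubectl` command lines a job might wrap. It is an illustration only, not PagerDuty's job templates: the function name, the namespace, pod, deployment, and selector values, and the `df -h` command run inside the pod are all hypothetical. The image repository check is left out because it depends on which registry you use.

```python
import shlex


def k8s_diagnostic_commands(namespace, pod, deployment, selector, tail=100):
    """Return a mapping of diagnostic name -> kubectl argv list.

    All argument values are supplied by the caller; nothing is executed here.
    """
    base = ["kubectl", "--namespace", namespace]
    return {
        # Tail logs from a single pod
        "tail_pod_logs": base + ["logs", pod, f"--tail={tail}"],
        # Retrieve logs from every pod matching a selector label
        "logs_by_selector": base + ["logs", "--selector", selector, f"--tail={tail}"],
        # Describe a deployment (replicas, events, image, conditions)
        "describe_deployment": base + ["describe", "deployment", deployment],
        # Execute a command in a pod (df -h is just a placeholder command)
        "exec_in_pod": base + ["exec", pod, "--", "df", "-h"],
    }


if __name__ == "__main__":
    # Hypothetical names, for illustration only
    commands = k8s_diagnostic_commands("prod", "web-7d9f", "web", "app=web")
    for name, argv in commands.items():
        print(f"{name}: {shlex.join(argv)}")
```

Returning argument lists rather than running the commands keeps the sketch safe to try anywhere; a real job would pass each list to `subprocess.run` on a node that has cluster access.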

Some common diagnostics and alert examples include:

CPU/Memory Consuming Services
- Common alert: High CPU/Memory
- Common question: Which service(s) are consuming CPU/Memory?
File size / Disk Consumption
- Common alert: High disk usage
- Common question: Which files/directories are consuming the most space?
System Logs: Linux/Windows Commands
- Common alert: Server/service issues
- Common question: Is it an OS issue or app issue?
SQL Database Commands
- Common alert: Database blocks/deadlocks
- Common question: Is there a long-running query blocking other database requests?
Host Availability
- Common alert: Host down
- Common question: Is it actually down or is it a false-positive reachability issue?
Application Error: Application Logs or traces
- Common alert: 400/500 error codes
- Common question: What is the stack-trace?
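As one example of answering a "common question" from the list above, the following Python sketch ranks the immediate children of a directory by size, which addresses "Which files/directories are consuming the most space?" It uses only the standard library; the function name and the `/var/log` path in the usage line are assumptions for illustration, not part of any PagerDuty template.

```python
import os


def largest_entries(root, top_n=5):
    """Return the top_n immediate children of root as (name, bytes) pairs."""
    sizes = {}
    for entry in os.scandir(root):
        if entry.is_file(follow_symlinks=False):
            sizes[entry.name] = entry.stat(follow_symlinks=False).st_size
        elif entry.is_dir(follow_symlinks=False):
            total = 0
            # os.walk does not follow symlinked directories by default
            for dirpath, _dirnames, filenames in os.walk(entry.path):
                for name in filenames:
                    try:
                        total += os.lstat(os.path.join(dirpath, name)).st_size
                    except OSError:
                        # File vanished or is unreadable; skip it
                        continue
            sizes[entry.name] = total
    return sorted(sizes.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    # /var/log is only an example path
    for name, size in largest_entries("/var/log"):
        print(f"{size:>12} {name}")
```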

A few examples of some common diagnostics for known components:

CloudWatch Logs: Surface specific application and VPC logs.
ECS: View stopped ECS task errors.
ELB: Debug unavailable target-group instances.
Kubernetes: Retrieve logs from Pods by selector label.
Linux: Retrieve service status.
Nginx: Retrieve error logs.
Redis: Retrieve slow log entries.

And these are just some of the more than 30 out-of-the-box job templates we have built for our users, which you can find in the Automated Diagnostics solution guide. To use the Automated Diagnostics Solution, you must have either a PagerDuty Runbook Automation license or a Process Automation (previously Rundeck Enterprise) license. See the FAQ for details on how to use it. If you do not have a license for either of these products, contact us to learn more.

Automating diagnostics within PagerDuty

Incidents that notify responders are filled with information provided by monitoring tools that have a “myopic” view of the alert(s). A common example is that high CPU usage triggers an alert, and this notifies a responder. But the information contained in the alert is surface-level in that it does not specify what might be causing the CPU spike.

Diagnostic data is the deeper-level information that helps answer the “why” and “where” questions of incidents. Even though some monitoring and correlation tools provide some help in providing root-cause analysis for users, most fall short in their ability to emulate a responder’s investigative/troubleshooting steps of collating disparate data-sources into a unified view. By providing responders with on-demand or pre-run diagnostic capabilities, the odds of the first responder resolving the issue on their own increase, as well as the probability of pulling in fewer individuals to assist with the incident. Enter Automated Diagnostics.

Want to learn more about common diagnostics for the components you use? Register for our September 14th webinar event of the same name, hosted by Justyn Roberts, Senior Solutions Consultant, PagerDuty. New to Process Automation? Request a demo. Already using PagerDuty Process Automation? Check out the automated diagnostics solution guide to see the end-to-end process of achieving the full solution. Questions? Reach out to me directly on Twitter @sordnam and let’s chat!

The post Automating Common Diagnostics for Kubernetes, Linux, and other Common Components appeared first on PagerDuty.

What is Automated Diagnostics and Why Should You Care? by Joseph Mandros

Joseph Mandros — Fri, 03 Jun 2022 13:00:35 +0000

How do you measure the cost of an incident?

A lot of people in technology talk about the cost of an incident solely from the perspective of downtime, or the number of customers and employees impacted. And on the surface, that is often a fair angle to take. It makes the headlines, and customer reputation and trust are obviously critical to the success of any business.

But another direct cost of incidents that is infrequently acknowledged is the number of people that need to get involved during an incident; whether that’s to help investigate the root-cause, troubleshoot and resolve the incident, or absolve their team of responsibility—regardless of whether the incident is severe enough to impact your customers.

According to PagerDuty data, 50% of a responder’s time is spent determining who is best to pull in for additional support (and trying to figure out if there’s actually a problem) in x environment, or with y service. Given this statistic, this means that 50% of an incident’s lifespan is spent on the beginning stages of an incident (the diagnostic and triage phases), rather than on actual remediative actions.

The bottom line? The cost of people-hours and the number of manual actions taken per incident can get steep—fast.

Automating Your Incident Response

Applying automation to the early, recurring stages of the incident, including diagnosing the severity of the incident and understanding the genetic makeup of what went awry (and how), is critical to the success of the eventual remediation of the incident.

Automation is also important from a people perspective, ensuring your teams aren’t getting burnt out by the same, repetitive actions every time an incident kicks off. Ensuring the diagnostic data is available to first responders is paramount to the routing efficiency and overall workflow of the incident response.

Before we go any further, let’s first define diagnostic data. Diagnostic data is data retrieved by incident responders that is typically more specific than the information provided by monitoring tools. For example, whereas monitoring tools will alert you when there is a spike in CPU or memory, incident responders investigate by looking at the processes consuming the most CPU and memory. In this case, the process names or IDs and their associated compute consumption are the “diagnostic data.”
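The CPU and memory example above can be sketched in a few lines of Python. This is a hedged illustration: it parses text in the shape produced by `ps -eo pid,comm,%cpu,%mem` on Linux, and the sample output below is invented for the example rather than taken from a real system.

```python
def top_consumers(ps_output, key="cpu", top_n=3):
    """Rank processes from `ps -eo pid,comm,%cpu,%mem` style text.

    key is "cpu" or "mem"; returns a list of dicts, highest consumer first.
    """
    rows = []
    for line in ps_output.strip().splitlines()[1:]:  # skip the header row
        parts = line.split()
        if len(parts) < 4:
            continue
        pid, cpu, mem = parts[0], parts[-2], parts[-1]
        name = " ".join(parts[1:-2])  # command names may contain spaces
        rows.append({"pid": int(pid), "name": name, "cpu": float(cpu), "mem": float(mem)})
    return sorted(rows, key=lambda r: r[key], reverse=True)[:top_n]


# Made-up sample output, for illustration only
SAMPLE = """\
  PID COMMAND         %CPU %MEM
    1 systemd          0.0  0.1
  812 postgres        71.5 12.3
  990 nginx            3.2  1.0
 1204 java            22.4 48.9
"""

if __name__ == "__main__":
    for proc in top_consumers(SAMPLE, key="cpu"):
        print(proc["pid"], proc["name"], proc["cpu"])
```

The alert only says "CPU is high"; the ranked list is the diagnostic data that tells the responder which process to look at.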

So now that we have defined diagnostic data, why should you care about Automated Diagnostics? Because implementing an Automated Diagnostics practice can drive down the cost of incidents through both reduced incident duration and fewer responders paged.

The Problem with MTTR

Perhaps “problem” is the wrong word here, but hear me out: MTTR as a metric is too broad to return granular, actionable insights. Mean time to repair (MTTR) has been a staple maintainability metric in the IT universe for decades. And while it has many applications and does a great job of explicating the rate of general recovery, its Achilles’ heel is just that: generality. And now that we can safely infer that 50% of a responder’s time is spent determining who is best to pull in for additional support, we’ve started looking at other metrics within the MTTR timeline, such as MTTT (mean time to triage) or MTTI (mean time to investigate).

MTTI/MTTT: The average time between the detection of an IT incident and when the organization begins to investigate its cause and solution. This denotes the time between MTTD (mean time to detect) and the start of MTTR (mean time to repair).

At PagerDuty, we measure this as the time span between when your first responder “acks” and when your resolver “acks.” This metric helps us click into what’s actually happening under the hood during an incident. After observing our own data, we’ve been able to infer that MTTI is one of the most time-consuming components of MTTR. And in modern business, when a task requires time and attention from engineers, that task is an expensive one for the business. Really expensive.
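As a worked example of that measurement, the Python sketch below averages the gap between the first responder's ack and the resolver's ack across incidents. The field names `first_ack` and `resolver_ack` and the two sample incidents are hypothetical; they are not a PagerDuty data format.

```python
from datetime import datetime
from statistics import mean


def mean_time_to_investigate(incidents):
    """Mean seconds between first-responder ack and resolver ack.

    Incidents missing either timestamp are ignored; returns None if none qualify.
    """
    gaps = [
        (inc["resolver_ack"] - inc["first_ack"]).total_seconds()
        for inc in incidents
        if inc.get("first_ack") and inc.get("resolver_ack")
    ]
    return mean(gaps) if gaps else None


# Invented sample data: gaps of 20 minutes and 40 minutes
incidents = [
    {"first_ack": datetime(2022, 6, 1, 10, 0), "resolver_ack": datetime(2022, 6, 1, 10, 20)},
    {"first_ack": datetime(2022, 6, 2, 14, 5), "resolver_ack": datetime(2022, 6, 2, 14, 45)},
]

if __name__ == "__main__":
    mtti_seconds = mean_time_to_investigate(incidents)
    print(f"MTTI: {mtti_seconds / 60:.0f} minutes")
```

With the two sample incidents, the gaps are 1,200 and 2,400 seconds, so the mean is 1,800 seconds, or 30 minutes.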

Using Automated Diagnostics

Now let’s bring this back around to MTTI and automated diagnostics. MTTI is not only lengthened by the technical tasks of responders manually pulling diagnostic data and having to decipher which team to escalate to based on x service and y incident. It’s also about the people and their limitations, depending on the specific expertise that is required to begin resolution. For example, in many cases, the first responder doesn’t know how to investigate the issue from the database or network perspective. That may be due to a lack of skills (such as a background in databases or networks), access, or tribal knowledge (e.g., that a specific app component depends on a complex integration with a third-party service).

By automating these investigative and debugging tasks, in addition to having the ability to delegate these actions across teams and responders, you will experience a positively cascading effect on MTTI, and eventually, MTTR.

So why should you care about automated diagnostics?

With automated diagnostics, you can:

Reduce escalations to scarce experts by designing paths to provide the first-responders with information that would typically be manually gathered
Distribute subject matter expertise across response teams
Invoke secure automation behind firewalls and VPCs
Troubleshoot and resolve faster without a human-assisted action required
Improve the speed of enablement to new engineers and ensure optimal efficiency at all levels of the incident response organization

Getting Started

You made your decision. Now it’s time to blaze the trail, but where do you start?

To use some marketing slang: don’t try to boil the ocean. Trial some actions that are low in both complexity and risk. This could be taking a deeper look at some of your noisiest services, or you could run some simple data pulls from various monitoring applications, disk usage, etc. But it’s important to have a strategy for the long-term rollout and vision of this functionality. Sure, you can write a script that pulls data from numerous sources and appends that to an incident. But that is far from scalable.

It’s important to think about the various infrastructure pieces and tools you will want to pull diagnostic data from. You will want a standardized approach for interfacing with your heterogeneous and dynamic environments.
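One way to picture a standardized approach is a small registry that maps each component to a diagnostic with the same call signature. The Python sketch below is an assumption-laden illustration, not how PagerDuty or Rundeck implements jobs: the decorator, the alert dictionary shape, and the log path are all hypothetical, and the handlers only return the command they would run.

```python
from typing import Callable, Dict

# component name -> function that takes an alert dict and returns a text result
DIAGNOSTICS: Dict[str, Callable[[dict], str]] = {}


def diagnostic(component):
    """Decorator that registers a diagnostic for one component type."""
    def register(fn):
        DIAGNOSTICS[component] = fn
        return fn
    return register


@diagnostic("linux")
def linux_service_status(alert):
    # A real job would execute this on the affected host
    return f"would run: systemctl status {alert['service']}"


@diagnostic("nginx")
def nginx_error_log(alert):
    # The log path is an assumption; it varies by installation
    return "would run: tail -n 50 /var/log/nginx/error.log"


def run_diagnostic(alert):
    """Look up and run the diagnostic registered for the alert's component."""
    handler = DIAGNOSTICS.get(alert.get("component"))
    if handler is None:
        return "no diagnostic registered for this component"
    return handler(alert)


if __name__ == "__main__":
    print(run_diagnostic({"component": "linux", "service": "sshd"}))
    print(run_diagnostic({"component": "redis"}))
```

Adding a new component means registering one more function, rather than growing a single script that knows about every data source.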

To learn more about automated diagnostics, check out some of our how-to articles, which we will be continuing to publish throughout the year. Additionally, look out for a session on all things Automated Diagnostics from Jake Cohen during PagerDuty Summit next week!

For more resources about PagerDuty’s Process Automation portfolio, visit this page and get in touch with your account manager today.

Any questions? Feel free to ask away on Twitter @sordnam

The post What is Automated Diagnostics and Why Should You Care? appeared first on PagerDuty.

Democratize Your Team’s Automation Capabilities With PagerDuty® Automation Actions by Joseph Mandros

Joseph Mandros — Mon, 04 Apr 2022 13:00:55 +0000

Let’s face it. Incidents can be expensive. Really expensive. But the high cost of incidents within a production environment isn’t always due to a compromised service or negative customer experience. According to PagerDuty response data, over 50% of an incident’s lifespan is spent with first responders in the investigation and mobilization phases (what we call ‘triage’); in other words, determining what might have gone wrong and calling the right person to fix it.

With the above statistic in mind, it’s clear that the shadow expense of your incident lifecycle is that of your people’s time—the engineer who discovered the incident, the on-call engineer who responded to the issue and determined root cause, and every other subject matter expert that gets looped into the incident lifecycle. And when you sprinkle in manual processes to the entire response timeline, things can get pricey. Very pricey.

The fact of the matter is, your developer organization’s time is just as valuable and important as the business’s bottom line. And as service and application development continues to grow in complexity, “time saved” becomes an even more important metric to track, quantify, and continuously improve. Finding a way to automate different aspects of the incident response process can help save your team’s time and bolster efficiency across the board. How can you do this, you ask? Enter PagerDuty® Automation Actions (formerly PagerDuty Rundeck Actions).

PagerDuty® Automation Actions

The PagerDuty® Automation Actions add-on connects your first-line responders to corrective automation directly within PagerDuty. Instead of pushing escalations to specialists when an incident kicks off, responders can triage and resolve incidents themselves using safely delegated automation. As a result, teams reduce MTTR, lower interruptions to specialists, and quickly diagnose and remediate incidents.

PagerDuty® Automation Actions connects automated diagnostics and remediation to the incident response workflow. Automated Diagnostics are a set of actions for production services that your responders can automatically invoke when an incident occurs. Rather than having to escalate to expert specialists who manually run common tests, responders can safely and securely invoke this automation themselves from within PagerDuty and see responses delivered in real time back to your incident timeline.

Run designated actions such as service restarts, diagnostics, and more

With these diagnostic tests, responders can more efficiently escalate the incident to the right specialist for resolution, rather than involving a large group or escalating up the typical responder ladder. The specialists will be able to see the results of those common diagnostics and can get started right away.

Additionally, teams can also invoke these actions and collaborate on the incident directly from their Slack instance. This eliminates the need to access a service through a terminal and context switch between windows, creating a faster and more efficient way to resolve incidents—while also reducing escalations to specialists. As you mature your use of automated diagnostics, you can start using it for things like automated remediation and triggering using Event Intelligence.

PagerDuty® Automation Actions helps solve four main problem areas within an organization’s response process:

Siloed expertise. First-line responders don’t know the genetic makeup of every single application or service within an organization’s environment.
Consistent interruptions to specialists. Responders escalate to the engineer they think is the specialist of that application or service, taking time away from innovation and slowing time-to-resolution.
Repetitive and manual diagnostic steps. The first steps when an incident kicks off are often the same. These same manual steps have to be carried out before you can begin resolving the incident.
Complex and sprawling production environments. Knowing which systems to access and what actions to take can take time. Additionally, not every responder has the authority to access specific production systems, often making the escalation process difficult and time-consuming.

PagerDuty® Automation Actions solves the above issues by:

Delegating automation across teams. Deploy automated procedures to first-line responders that are typically invoked by specialists.
Resolving incidents faster with fewer escalations. By creating automation for common requests and operations, teams can spend less time figuring out who to escalate to and more time on a fix.
Triggering human-assisted/self-healing automation. Invoke diagnostic actions before responders are even paged using PagerDuty’s Event Orchestration.
Safely invoking automation with security in mind. Responders only see actions they have the authorization to invoke for impacted systems in an incident, and all actions are logged to maintain a strong security posture.

To summarize the above with some quick bullets, PagerDuty® Automation Actions helps teams:

Decrease response times by up to 30 minutes and MTTR by up to 25%
Reduce the volume of incidents that get escalated up the ladder
Distribute subject matter expertise across response teams
Trigger human-assisting and self-healing automation before responders even get paged
Invoke secure automation behind firewalls and VPCs
Deploy automated actions in place of manual procedures
Enrich incident documentation for smoother postmortems and reduced operator work

To learn more about the PagerDuty automation portfolio, visit our automation hub. If you want to learn more about PagerDuty Automation Actions and how it can help your team save time and money, contact your account manager or learn more today.

The post Democratize Your Team’s Automation Capabilities With PagerDuty® Automation Actions appeared first on PagerDuty.