Operations Health | Categories | PagerDuty
https://www.pagerduty.com/blog/category/operations-health/

How to Ace Your Services with PagerDuty
by Débora Cambé | Wed, 06 Sep 2023 | https://www.pagerduty.com/blog/how-to-ace-your-services-with-pagerduty/

It’s finals week for the US Open, one of the most celebrated sports events in the world. Tennis is my favorite sport to watch, as I’m fascinated by the strength, composure, and endurance each player displays while standing alone on the court, sometimes during incredibly long matches – the current record is 11 hours and 5 minutes.

Tennis players are fully accountable for the outcome of their matches at every single stage. Their performance directly impacts whether they win or lose. If this sounds familiar, that’s because it is. Service Ownership follows the same approach: “you build it, you own it”. In the context of DevOps, you’re not working alone. But there are definitely lessons to learn from tennis when it comes to building healthy, resilient services. 

The parallel took shape while interviewing Leeor Engel, Director of Engineering for the Incident Response product line. Keep reading to find out his take on how to ace services, and how the PagerDuty team used PagerDuty’s own Service Standards functionality to improve the overall maturity of its services.

What is Service Standards?

When pivoting to a Service Ownership model, organizations often struggle to get clear visibility into their many services and to standardize how those services are configured. Launched a year ago for all PagerDuty plans, Service Standards guides teams to better configure their services, while helping managers and administrators scale these standards across the organization.

With Service Standards, PagerDuty provides nine standards that each service should fulfill to have the depth and context required to be considered well configured, each of which can be toggled on and off.

PagerDuty’s Customer Zero: PagerDuty

After the launch of Service Standards, PagerDuty was its own customer zero. Leeor walks us through the motivation behind this effort: “You wanna get adoption and figure out what the gaps are, get feedback, figure out ways to improve [the product]. Then there was an organizational goal. We talk a lot about what makes a service well configured and what does good look like. So we did a big push to get PagerDuty to be customer zero for that feature. We basically got every team to review all their services. And we actually found that many services did not meet the standards.”

Services varied considerably in their standard compliance, but “under 50%” were fully compliant. Approximately four months later, the compliance goal was achieved. But it’s a constant work in progress to keep it that way: “It can be very difficult, depending on the type of service, to get 10 out of 10 [standards]. So our goal was to get 100% of services to be at least 80% compliant. We got there. But then there’s an ongoing effort to maintain that because new services are created all the time, and it’s easy to forget this. And so our continuous process is what catches those stragglers and gets them compliant.”

If you also want to ace your services, here are four lessons you can draw from tennis dynamics to get there:

Warm-Up

You might have identified the need to standardize your services to play on the best-practices court. But maybe your organization has dozens, even hundreds, of services, and that feels overwhelming. Where and how should you start?

Lesson #1: Start with the baseline

In tennis, the baseline is where each game begins. It’s where players serve and it’s the foundation for their positioning and strategy. Without well-developed baseline play, there’s no chance of winning. But it needs to be built gradually.

Similarly, standards work as a service’s baseline level of quality, consistency, and functionality. It’s not about achieving perfection from the outset but rather about having a structured foundation to build upon. Take it from Leeor: “You want to focus on systemic things and define any standard as a starting point. Don’t worry about it being perfect. Just get it in place and have a continuous monitoring regime. And that’s gonna move the needle the most, because that’s going to expose all these other problems you might have in your processes that you need to improve, whatever it might be. It’ll be sort of the gateway to exposing those things and then addressing them, continuously improving.”

Lesson #2: Adapt to the surface

Every tennis player has their own style of play, but they must adapt to the surface they’re playing on, each enabling different dynamics. On grass, for example, rallies are usually shorter, as the ball bounces low and players need to get to it faster – playing the net successfully and mastering the volley is key to success.

In the context of services, recognizing each team’s unique circumstances is a crucial first step when determining which standards that team’s service should follow. As Leeor explains, “teams can have pretty different needs in terms of their services. Sometimes their integration set up is a little bit different. Sometimes they’re not monitoring things that are directly based on code deployments. For example, one of our Service Standards is having at least one change integration – we may have services that don’t. They may be triage services that have email integrations or things like that. Those services still provide value and they need a standard, but they need a slightly different one. There isn’t a one-size-fits-all that works for everyone.”

Win the game

The foundations are set: you have defined your service’s boundaries and standards according to the needs of the team that owns it. Now you need to ensure those standards are complied with. How?

Lesson #3: Avoid unforced errors

An unforced error happens when a player loses a point on a shot that was completely within their control, i.e., not forced by the opponent.

Teams are responsible for keeping their service standards in check, but in the fast-paced DevOps world that can be tough; services change or new ones might be created depending on business needs. Leeor highlights three essential steps to successfully maintain the balance of your service standards and avoid the unforced error trap:

  • Monitor: With the new PagerDuty Service Standards API you can pull your service standards on a regular basis (see the sketch after this list). This lets you confirm whether the standards are in line with the service’s needs, whether they need to change, or whether it makes sense to create exemptions.
  • Report: Create a reporting regime where you define a regular cadence to assess the state of all the services. With PagerDuty Service Standards it’s easy to do so, as the service performance data can be exported out of PagerDuty by the admins and shared as needed to drive accountability and show progress. Admins also have the option to make standards publicly available for the rest of the organization to view. 
  • Educate and be educated: Leeor explains how talking directly and frequently with team owners can raise awareness and educate on the importance of complying with service standards: “For example, business services were not uniformly used across all teams and it’s actually pretty useful. Even just to have a parent business service for your area. Then you can leverage capabilities like the Service Graph or Business Impact features. A system where you can see all your services at a bird’s eye view.” It can also help surface different use cases: “Over time, we developed this process where we could have some exemptions. An example would be testing a service that isn’t in production yet, and it doesn’t yet have the escalation policy. So we set up an exemption process – which ideally was temporary – and we set up some exclusions around specific standards.” 
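To make the monitoring step above concrete, here is a minimal sketch in Python that polls compliance scores for a few services and flags the ones below a threshold. The endpoint paths, response fields, and service IDs shown are assumptions based on PagerDuty’s published Service Standards API; check the current REST API reference before relying on them.

```python
# Sketch: poll Service Standards compliance and flag low-scoring services.
# The /standards/scores path and the "score" response fields are assumptions;
# verify against the current PagerDuty REST API reference.
import os
import requests

API_BASE = "https://api.pagerduty.com"
HEADERS = {
    "Authorization": f"Token token={os.environ['PAGERDUTY_API_TOKEN']}",
    "Content-Type": "application/json",
}

def service_standard_score(service_id: str) -> dict:
    """Return the standards score payload for one technical service (assumed schema)."""
    url = f"{API_BASE}/standards/scores/technical_services/{service_id}"
    resp = requests.get(url, headers=HEADERS, timeout=10)
    resp.raise_for_status()
    return resp.json()

def flag_noncompliant(service_ids, threshold=0.8):
    """Yield services whose passing-standard ratio falls below the threshold."""
    for sid in service_ids:
        score = service_standard_score(sid)
        passed = score["score"]["passing"]
        total = score["score"]["total"]
        if total and passed / total < threshold:
            yield sid, passed, total

if __name__ == "__main__":
    # Hypothetical service IDs for illustration.
    for sid, passed, total in flag_noncompliant(["PABC123", "PDEF456"]):
        print(f"{sid}: only {passed}/{total} standards met")
```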

Win the match

Lesson #4: Continuously improve

The beauty of tennis is that the course of a match can change instantly. There is no time limit to a game or even a set, and players don’t depend only on variables they can control: there’s the opponent’s focus and physical condition, the weather, and even the audience. Are they cheering you on?

Tennis is a game of continuous improvement and the same happens with services. Well configured services help scale Service Ownership best practices which, in turn, drive the organization’s operational maturity level.

Here’s Leeor’s number one piece of advice to get there: “The key thing is reporting. Of course you need to establish what your standard is and that may look a little different depending on the business. But really the critical thing is the continuous monitoring and reporting. Mistakes happen, things get missed, humans are humans, right? So you need some process that catches the things that fall through the cracks. Define a standard and continuously monitor it, like you would do with any other process. You’re trying to continuously improve. You need to monitor it.”

Start Acing Your Services

Put all these lessons into practice with the PagerDuty Operations Cloud, the essential platform to get your services in shape and manage all unplanned, time-sensitive, critical work across the enterprise. Learn more here and try our free 14-day trial.

3 Ways You Might Have a NOC Process Hangover
by Hannah Culver | Mon, 24 Oct 2022 | https://www.pagerduty.com/blog/3-ways-you-might-have-a-noc-process-hangover/

NOC, or network operations center, processes have been set in stone for decades. But it’s time for some of these processes to evolve. Digital transformation and the cloud era have led to the rise of DevOps, and with it, service ownership. Service ownership means that developers take responsibility for supporting the software they deliver at every stage of the life cycle. This brings development teams closer to their customers, the business, and the value they deliver.

It also requires a departure from the traditional NOC incident handling methods. Yet, as organizations transition towards service ownership, some old NOC processes remain. Here are three common NOC process hangovers and how to replace or update them.

Process hangover: L1 responders aren’t able to resolve issues

NOCs used to be the command center for technology issues. They functioned like a brain, sending out signals to relevant appendages. Issue with networking? Route to networking. Issue with security? Route to security. The NOC’s central function was to involve the correct SME to resolve an issue. This meant digging through spreadsheets (and sometimes physical contact books!) to figure out who was responsible for what.

When everything was on-premises and in person, this made sense. There were fewer services, and incidents could be neatly separated by departments. If the database was having an issue, you could call up the database on-call responder. The responder (who would likely be in the office or close enough to respond in person) could then go to the data center and take a look.

Now, in the remote-work, cloud era, where organizations have hundreds or thousands of services maintained by dozens or even hundreds of teams spread across the globe, the Rolodex method has outlived its purpose. It’s next to impossible to maintain accurate spreadsheets tracking which teams are responsible for which services. And, as the organization changes, records grow stale quickly. Services can move between teams. Teams change as people move between them, or leave or join the company. The result: an L1 responder has to work far too hard to identify the right person quickly.

Organizations need a way to remove these manual steps to find the right person and route incidents directly to SMEs who can jump in to respond to any issues. This can happen in a variety of ways. For some organizations, a DevOps service ownership model is the right path forward. Those who write the code are assigned to respond and fix the service during an incident. The alert is routed directly to the on-call person on the development team that supports the service, and the SME takes it from there.

For other organizations, it might make sense to have a hybrid approach where L1 responders serve as the first line of defense before escalating to distributed, on-call teams for their services. L1 responders shouldn’t be a routing center that connects the issue with another team. Instead, they should be empowered to resolve an incident themselves. You can set up your L1 responders to be more effective by giving them the ability to both troubleshoot and selectively resolve incidents. Access to automation and resources like runbooks can help L1 responders accelerate diagnosis and remediation, often without needing an escalation that disrupts the subject matter experts in charge of a given service. By putting automation in the hands of L1 responders, organizations can avoid unnecessary escalations and empower L1s to resolve issues faster.

Process hangover: Major incidents aren’t called or are called too late

We’ve heard it before: time is money. And when NOCs were the primary method of ensuring incidents were responded to, they had an additional responsibility. An NOC needed to ensure that resources were well managed. This meant no unnecessary personnel responding to problems. NOCs often took the blame if they called a major incident too soon and interrupted people for a minor problem. These disruptions took SMEs away from their innovation work. So it was crucial for NOC responders to only call major incidents when it was clear there was a much bigger issue at play.

But now, time isn’t money – uptime is money. The cost of a major incident that’s flown under the radar is larger than the cost of tagging in some extra help. Imagine you’re an online retailer and your shopping cart function is down. Every minute your customers can’t add items to their cart, you’re losing hundreds of thousands of dollars. Plus, customer expectations have increased over the last few years. Customers expect that their app, tool, platform, streaming service, etc. works without interruption. And it erodes customer trust when it doesn’t. In fact, according to PwC, 1 in 3 customers would stop doing business with a brand they loved after one bad experience.

Organizations need to call major incidents sooner to mitigate customer impact. Yes, this may mean waking someone unnecessarily once in a while. But, that’s far less likely with service ownership. SMEs responsible for a service have a better understanding of when to call a major incident than an L1 responder would. So there are fewer false alarms.

Process hangover: Come-and-go war rooms

NOCs often serve as the communication hub for a major incident. This helps responders working to resolve an issue keep on task. Back when many companies had everything (and everyone) on-premise, there was a war room. People came there and the NOC coordinator kept everyone up to date. Now, with distributed teams and systems, physical war rooms are a thing of the past. Many companies instead have virtual war rooms with a video conferencing bridge or chat channel that remains open during an incident.

Other stakeholders may want to treat this war room like a physical one, dropping in as they please. But, in this virtual world, this means that these stakeholders are asking the incident responders questions. This delays the resolution. Companies with come-and-go virtual war rooms may experience more miscommunications and frustration. Responders feel frustrated by interruptions and stakeholders feel frustrated with the lack of communication.

One way to mitigate this is to close the war room to non-participants. If someone isn’t a part of the incident response team, they don’t need access to the response team’s virtual war room. Instead, what they need is an internal liaison. This is a designated communicator from the incident response team.

The internal communication liaison consolidates incident information and relays it to relevant stakeholders. To make this easier, communication liaisons can use status update notification templates. These templates dictate how to craft communications for a specific audience. They ensure that stakeholders receive any information necessary to make decisions. And no responders have to stop working on the incident at hand to share updates.
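As a rough illustration (the fields and wording below are invented, not a PagerDuty artifact), a liaison’s template for an executive audience might look something like this:

```python
# Sketch: a stakeholder status-update template a communications liaison could fill in.
# Field names and wording are made up for illustration.
from string import Template

EXEC_UPDATE = Template(
    "[$severity] $service impact update ($time)\n"
    "Customer impact: $impact\n"
    "Current status: $status\n"
    "Next update by: $next_update"
)

print(EXEC_UPDATE.substitute(
    severity="SEV-1",
    service="Checkout",
    time="14:30 UTC",
    impact="Customers in the EU cannot complete purchases.",
    status="Responders have identified a failing database node; failover is in progress.",
    next_update="15:00 UTC",
))
```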

Hangovers aren’t fun, but they always end

NOCs are a tried-and-true way of managing incidents for many organizations. But some NOC methods have become outdated in this era of digital transformation. Seamless communication and rapid response are key to preserving customer trust. Looking forward, teams will involve SMEs immediately and call major incidents sooner rather than later. They’ll also communicate with key stakeholders throughout an incident while setting boundaries.

And often teams need a digital operations platform to help support this transition. PagerDuty allows teams to bring major incident best practices to their organization, resolving critical incidents faster and preventing future occurrences. Try us for free for 14 days.

How service ownership can help you grow your operational maturity
by Hannah Culver | Mon, 01 Nov 2021 | https://www.pagerduty.com/blog/how-full-service-ownership-improves-operational-maturity/

Digital operations management is about harnessing the power of data to act when it matters the most. It’s also about having the right processes and procedures to support teams when every second is critical. Maturing your digital operations takes time, iteration, and commitment. The change won’t happen overnight. But, if you put in the effort, you’ll reap outsized benefits. You’ll be able to learn from incidents and proactively improve your services over time.

One way to improve your digital operations maturity is to adopt service ownership. In this blog post, we’ll share what service ownership is, how to make the transition once your organization announces the pivot, and how your teams will grow in maturity along the way.

So, what is service ownership?

Service ownership means that people take responsibility for supporting the software they deliver at every stage of the software/service lifecycle. That level of ownership brings development teams much closer to their customers, the business, and the value being delivered.

Benefits of service ownership are varied, but here are some of the most important:

  • Your teams will know who is on call and when. This helps them feel more confident in on-call, and builds accountability for the services they build.
  • Service reliability improves. When a team focuses on a particular service, trends are easier to notice. Issues with reliability bubble up faster, and improvements can be prioritized.
  • Customers experience less service degradation and downtime. Happier customers mean a more successful business. With service ownership, you can respond to incidents faster and can even resolve them before any significant customer impact.

Many organizations make this move to service ownership to innovate faster and gain a competitive advantage. The flexibility of service ownership allows you to pivot in new directions and adapt to change at a rapid pace. But this isn’t something that can be completed in isolation. Service ownership is part of a new cultural and operating model that must be adopted organization-wide to be successful. Let’s look at how to get started.

How can I adopt service ownership?

Like any worthwhile culture change, service ownership will not be an initiative you can complete within a single sprint. And you’ll need the whole organization to move in this direction for this initiative to succeed. For the purposes of this blog post, we’ll assume that your organization is ready to adopt service ownership, and your team is looking for the best way to make the change. To get started, there are a few things you can do.

  • Create a list of services. If you haven’t created a list of all the services in your system, work cross-functionally with other teams to understand all the moving pieces. While eventually you’ll want to include business services, you should take it step-by-step and focus on those owned by technology teams first. Once you have a list of services, it’s time to start on the “ownership” part.
  • Define the team that will own the service. Start by considering who is responsible for the service you are defining. A service should be wholly owned by the team that will be supporting it via an on-call rotation. If multiple teams share responsibility for a service, it’s better to split up that service into separate services (if possible). Some organizations call this “service mitosis”—splitting one cell into two separate cells, each looking very similar to the former whole. There are several methods for deciding how to separate services – for example, splitting them up based on team size or the volume of code each team manages. You can read more about how we did that at PagerDuty.
  • Set up the on-call rotation for this service. Make sure the people on the team share responsibility for keeping the service available in production. Create on-call schedules that rotate primary and back-up responders on a regular cadence, as well as policies that include escalation contacts (a minimal rotation sketch follows this list).
  • Ensure the team is sized correctly. Services should be scoped granularly enough that the members of the owning team can quickly identify the source of problems. Avoid creating a service whose scope is so large that the knowledge needed to support it goes beyond what the team holds. But the opposite problem exists too: a scope that is too small. For example, if two microservices effectively behave as one, and fixing a problem on one means also fixing it on the other, then it might make sense to combine them.
  • Start small. It’s important to roll this change out incrementally. That way, you can show success early and inspire other teams to adopt this mindset. This also gives teams time to learn from others before implementing service ownership themselves. Ideally, the change should roll out smoother with each team.
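To make the on-call rotation step concrete, here is a minimal, tool-agnostic sketch that rotates a roster through weekly primary and secondary shifts. It is illustrative only – in practice the schedule and escalation policy live in your on-call tooling, not in an ad-hoc script.

```python
# Sketch: generate a weekly primary/secondary on-call rotation from a roster.
# Illustrative only; real schedules and escalation policies belong in your on-call tooling.
from datetime import date, timedelta

def weekly_rotation(roster, start, weeks):
    """Yield (week_start, primary, secondary) tuples, rotating through the roster."""
    n = len(roster)
    for i in range(weeks):
        primary = roster[i % n]
        secondary = roster[(i + 1) % n]   # the next person backs up the primary
        yield start + timedelta(weeks=i), primary, secondary

roster = ["ana", "bo", "chen", "devi"]   # hypothetical team members
for week_start, primary, secondary in weekly_rotation(roster, date(2021, 11, 1), 8):
    print(f"{week_start}: primary={primary}, secondary={secondary}")
```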

As your system grows and changes, make sure to adjust services, teams, and on-call rotations accordingly. This isn’t a set-it-and-forget-it motion. Instead, you should expect to change as your business does. Bake time into quarterly planning to understand how your team is faring. If you’re feeling overwhelmed, bubble up the need for more support. Teams need to make sure this feedback is given to managers, and managers are responsible for escalating accordingly.

Don’t we need some documentation for this?

Each service needs documentation, no matter how small it is. Documentation helps everyone better understand what the service is and does, how it interacts with other services, and what to do when problems arise. With this in mind, these are the most important points to touch on when creating documentation.

Naming and describing: The best service names aren’t the cleverest ones. When naming a service, aim for the simplest, most descriptive way to say what it does. This helps eliminate confusion down the line as you grow and scale. Make sure your description is equally informative. The description should answer questions like the following (a simple descriptor sketch follows the list):

  • What is the intent of this service, component, or slice of functionality?
  • How does this thing deliver value?
  • What does it contribute to?
  • If this is part of a customer-facing feature, explain how this will impact customers and how it rolls up to the larger business component.
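One lightweight way to capture these answers consistently is a small structured descriptor kept alongside each service. The fields below are illustrative, not a PagerDuty schema:

```python
# Sketch: a structured service descriptor capturing the questions above.
# Field names, service names, and URLs are hypothetical examples.
checkout_service = {
    "name": "checkout-api",                       # simple and descriptive, not clever
    "description": "Accepts cart checkouts and forwards validated orders to billing.",
    "owning_team": "payments",
    "customer_facing": True,
    "business_impact": "Blocks all purchases when unavailable.",
    "depends_on": ["billing-api", "inventory-db"],
    "depended_on_by": ["storefront-web"],
    "runbook": "https://wiki.example.com/runbooks/checkout-api",
}
```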

Determining dependencies: Services don’t operate in a vacuum. Our jobs would be much easier if an issue in one service was isolated and didn’t affect any other services. Yet, this is not the case as we move more towards microservices. You need to know which services yours depends on and what services depend on yours.

At this point, it’s extremely valuable to create a service graph that shows both the technical and business services and how they map to each other. Ideally, this would be a dynamic tool that would allow you to understand how failure in one part of the system affects the rest of the system as a whole.

Beyond mapping these dependencies, you should have communication plans for them. How will you alert dependent services when you experience an incident? How will you communicate technical problems to other line-of-business stakeholders? Laying out these plans ahead of time can help you think of incidents in terms of business response.

Runbooks: Runbooks are an important tool for teams. They’re like a cheat sheet for each service. Make sure you document how to complete common tasks and resolve common incidents. As you become more familiar with your service, you can even include automation into your runbooks. This automation can range from advanced auto-remediation sequences that can eliminate the need for human involvement for some incidents, to lightweight context gathering and script running.
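As a minimal sketch of the “lightweight context gathering” end of that spectrum, here is the kind of script a responder (or an automation job) might run to collect basic diagnostics before digging in. The service name, commands, and paths are hypothetical:

```python
# Sketch: lightweight context gathering for a runbook step.
# Service names, log sources, and paths are hypothetical.
import subprocess

CHECKS = [
    ("service status", ["systemctl", "status", "checkout-api", "--no-pager"]),
    ("recent errors", ["journalctl", "-u", "checkout-api", "--since", "15 minutes ago", "-p", "err"]),
    ("disk usage", ["df", "-h", "/var/lib/checkout"]),
]

def gather_context() -> str:
    """Run each diagnostic command and collect its output for the incident channel."""
    report = []
    for label, cmd in CHECKS:
        result = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
        report.append(f"== {label} ==\n{result.stdout or result.stderr}")
    return "\n\n".join(report)

if __name__ == "__main__":
    print(gather_context())
```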

Whatever stage your runbooks are at, it’s key to update these regularly. If you notice something is incorrect in a runbook during response, flag it and go back to it later. Runbooks only work if they’re reflective of the current state. Create time and space to keep these assets up to date.

And remember that runbooks aren’t a cure-all. You can’t plan for and map out resolution instructions for every incident. As your system grows, you’ll encounter novel incidents. A runbook is a tool, not a silver bullet.

How do I know what success looks like?

True success comes from the entire organization adopting service ownership. You’re never done with this initiative, as services and their needs and dependencies are constantly changing. However, you can use metrics to understand how your service is performing. And you can talk to your team and understand qualitatively how they feel about this change.

To understand service performance, you can look at a variety of tools. First, you can use analytics to understand how noisy it is, how often your team is paged, and when those interruptions occur. This can give you an understanding of how healthy your service is in the eyes of the team supporting it.

If you want to know how your service is performing in the eyes of your customers, there’s a tool for that as well. SLOs, or service level objectives, are internal metrics used to measure the reliability of a service. SLOs define how much failure a service can experience before customers become unhappy, and they are built from SLIs (service level indicators).

If you’re within the acceptable level of failure (also known as the error budget), your service will be perceived by customers as reliable. If you are not meeting your SLO, it’s likely your customers are unhappy with your performance.
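To make the error budget concrete: a 99.9% availability SLO over a 30-day window allows roughly 43 minutes of downtime. A quick sketch of that arithmetic, with a made-up downtime figure:

```python
# Sketch: translate an availability SLO into an error budget and track how much is left.
# The SLO target and downtime figure are made-up example numbers.
WINDOW_MINUTES = 30 * 24 * 60        # 30-day rolling window
SLO_TARGET = 0.999                   # 99.9% availability

error_budget = WINDOW_MINUTES * (1 - SLO_TARGET)   # ~43.2 minutes allowed
downtime_so_far = 12.5                              # minutes of downtime this window

remaining = error_budget - downtime_so_far
print(f"Error budget: {error_budget:.1f} min, remaining: {remaining:.1f} min "
      f"({remaining / error_budget:.0%} of budget left)")
```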

SLOs are great tools for putting metrics to reliability and demonstrating the value of service ownership. But they’re not the only way to measure success. You also need to speak to your teams to understand their feelings.

Open discussion with teams can help bolster confidence and increase psychological safety. This is extremely important as you will encounter failure along the way. You may not size your teams correctly at the beginning of your journey, and some services might be strapped for support. You may not have the right SLOs, and need to recalibrate. Whatever the challenge you encounter, you need to stay blameless.

These hurdles mean you’re learning and improving. If you can approach them with a positive attitude and listen to the service owners, you’ll improve the reliability of your services, your system as a whole, and the happiness of your teams.

What’s my next step?

Increasing your digital operations maturity is a long road, but one worth traveling. It’s beneficial for your team, the services you run, and your customers. Adopting a service ownership mindset isn’t the only way to make these improvements, but it is a key component.

If you’re looking to learn more about service ownership, you can read our Ops Guide or watch this on-demand webinar. If you want to learn more about planning for digital operations maturity, check out our eBook. And, if you’d like to see how PagerDuty can help you move the needle on initiatives like full-service ownership (FSO) and operational maturity, try us for free for 14 days.

ChatOps and Mobile Adoption: The Power of Teams Working Where They Are
by Hannah Culver | Thu, 28 Oct 2021 | https://www.pagerduty.com/blog/chat-ops-and-mobile-adoption-in-2020/

The way we socialize, learn, shop, and receive care has changed drastically over the last 18 months. For many of us, perhaps one of the biggest changes was the way we work. While work from home (WFH) was an option before the pandemic, NCCI states, “only 6% of the employed worked primarily from home and about three-quarters of workers had never worked from home.” Fast forward to 2021, and data from NorthOne shows just how much things have changed.

Even when remote work is no longer a necessity for public health, it is here to stay and that flexibility for fully remote or hybrid work will increasingly be an option that current and future employees will look for. As Bloomberg stated, “A May survey of 1,000 U.S. adults showed that 39% would consider quitting if their employers weren’t flexible about remote work.” With the Great Resignation costing organizations top talent and heavy recruitment costs, it’s important to keep current employees happy, and the option to have options is a new standard.

But many employers worry about the productivity of remote work. Technology leaders are accountable for developer velocity and time to market. Balancing the need for innovation with maintaining high availability and reliability for the services they’re responsible for is a big task.

In the event of an incident, detecting problems and driving to resolution quickly and without customer impact is crucial. This has been a challenge for many distributed teams. While documentation, training, and knowledge sharing are all important, technology that helps teams feel closer and work where they like is an exceptional advantage.

We analyzed our own platform data and compared how teams handled urgent work in 2019 and 2020. The results showed that there were important differences between the two years, and that certain tools and practices helped teams adapt to working within distributed teams.

ChatOps tools and mobile application adoption have helped teams throughout the last 18 months work collaboratively while remote to resolve incidents faster, and these trends are only becoming more important looking forward. Here’s what the new future looks like with ChatOps and mobile applications for incident response.

Mobile adoption brings incident response to you

As we become more comfortable with working remotely, sometimes it means taking a walk to clear our heads. Or running out for coffee. Or perhaps spending the day working from the park. This flexibility allows employees to be their best selves and stay passionate about their work. Additionally, as remote work can tilt work-life balance, opportunities like this can right the scales.

But, teams still need to be ready for anything. And mobile applications help them respond faster when failure happens. For instance, with mobile adoption, an engineer can acknowledge an incident while walking their dog. In a report created from our own platform data, we took a look at how teams fared in 2020 compared to 2019. According to our data, we saw that organizations with higher mobile adoption rates had 40-50% faster MTTA (mean time to acknowledge) than those with lower mobile adoption. This benefit continues to increase as an organization or account grows in size.

An improvement in MTTA can benefit the organization in a few ways. First, when an incident is quickly acknowledged, it avoids being lost, forgotten, or overlooked. Second, when teams can acknowledge an alert faster, fewer escalations are triggered and your teams have more time to work uninterrupted. Last but not least, the faster a team can jump on an alert and trigger an incident, the faster a potentially customer-impacting issue is resolved.

Teams are downloading mobile applications to assist with incident response. PagerDuty gives teams an extra leg up when responding to alerts. With a mobile app, teams can acknowledge alerts and even kick off incident response from their phone.

And once the alert is acknowledged and incident response kicks off, teams are able to respond to incidents in the tools they know best with ChatOps.

ChatOps isn’t just a buzzword

ChatOps is all about conversation-driven development and incident response. While in a chat room, team members type commands that the chatbot is configured to execute through custom scripts and plugins. These can range from code deployments, to security event responses, to team member notifications. While this method of collaboration has been around for a while, it’s grown in popularity over the last few years.

Especially with remote work as the norm for many teams, the ability to collaborate efficiently during incidents within the tools your team already uses has become invaluable. Microsoft Teams and Slack, two of the most popular communication tools, are now host to a variety of slash commands and custom configurations. These actions can help teams automate context gathering, incident creation, and even execute runbook automation sequences to speed up response.
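As a hedged illustration of what one of these commands might do behind the scenes, the sketch below takes the text of a hypothetical “/page” slash command and triggers a PagerDuty alert through the Events API v2 enqueue endpoint. The command name, routing-key mapping, and chat-platform wiring are assumptions; production ChatOps integrations for Slack and Microsoft Teams are configured through the official apps.

```python
# Sketch: the back end of a hypothetical "/page <service> <summary>" chat command.
# It triggers a PagerDuty alert through the Events API v2 enqueue endpoint.
# Routing keys and the chat-platform plumbing are assumptions for illustration.
import os
import requests

EVENTS_API = "https://events.pagerduty.com/v2/enqueue"
# Hypothetical mapping from chat-friendly names to integration routing keys.
ROUTING_KEYS = {"checkout": os.environ.get("CHECKOUT_ROUTING_KEY", "")}

def handle_page_command(text: str, user: str) -> str:
    """Parse '<service> <summary...>' from a slash command and trigger an alert."""
    service, _, summary = text.partition(" ")
    routing_key = ROUTING_KEYS.get(service)
    if not routing_key:
        return f"Unknown service '{service}'."
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary or f"Paged from chat by {user}",
            "source": "chatops",
            "severity": "critical",
        },
    }
    resp = requests.post(EVENTS_API, json=event, timeout=10)
    resp.raise_for_status()
    return f"Triggered an alert for {service} (dedup key: {resp.json().get('dedup_key')})"
```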

Our platform data showed an increase in ChatOps adoption by 22% over the last year. While this is a significant change, it makes sense. Without the ability to swivel in your chair and ask a teammate something, ChatOps became the next best option for problem solving.

And, as the number of integrations grows, more teams are able to use the tools they love to drive faster response. PagerDuty, for example, has ChatOps integrations with tools like Atlassian, atSpoke, and many more. As integrations continue to grow in both number and quality, the work you can do via collaboration tools expands.

While this growth doesn’t happen overnight, it’s clear that with time and the right processes, teams can be remote and still excel at responding to incidents. This faster response translates into happier customers and less downtime for your services. As remote work is here to stay, more teams will need to pivot to ChatOps in order to streamline incident response and limit context switching between tools.

A connected, mobile, remote future

Chat and collaboration applications sit at the center of efficient DevOps and ITOps teams, especially distributed ones. Applications like Slack and Microsoft Teams enable responders to quickly and easily collaborate during incidents, reducing their time to resolution. We give responders more flexibility by offering chat as a PagerDuty incident contact method. Teams can truly work where they are by acknowledging and resolving incidents all within the tools they already use.

If you want to learn more about our State of Digital Operations Report, you can download the full version here. Or, if your teams are ready for a solution that incorporates both ChatOps and mobile use into an incident response process, try PagerDuty for free for 14 days.

What Operational Maturity Looks Like Today With PagerDuty’s Kyle Duffy
by Hannah Culver | Thu, 14 Oct 2021 | https://www.pagerduty.com/blog/what-operational-maturity-looks-like-today-with-pagerdutys-kyle-duffy/

Companies that underwent accelerated digital transformations during the past 18 months are looking to understand how they can improve their operational maturity to handle the increase in complexity. This is paramount to an organization’s future success. In fact, research that PagerDuty conducted with IDG found that, on average, organizations with a mature digital operations approach achieve significantly better business outcomes.

We sat down with Kyle Duffy, Vice President of Solutions Consulting at PagerDuty, to hear his observations about operational maturity and how customers embrace the model to level-up their organizations.

Q: You must spend a lot of time talking about digital transformation and how technology leaders tackle their aspirations. Is there an inflection point that you’ve noticed during your career?

KD: What I’ve noticed is the pace of change consistently accelerating every year. Market dynamics change incredibly quickly, which puts pressure on companies to innovate, and new technologies allow organizations to keep pace. The companies that are thriving haven’t just transformed their technology though, they’ve modernized the way they operate and reinvented their culture.

Q: PagerDuty has been championing digital operations maturity for years. Have you seen a change in organizations’ digital maturity in recent years?

KD: On the less mature end of the spectrum, we still see plenty of organizations that frequently learn about problems from their customers and operate from a reactive position. Reaching the most mature end of the spectrum is still elusive for most—and the bar keeps creeping up higher. What has changed is that the differentiation in maturity levels is more refined. Solving for handling real-time work has emerged as a cornerstone of being able to reliably and efficiently keep critical digital services and customer experiences working seamlessly.

Additionally, systems built for queued, ticketed work simply bog down the workflow and increase the risk of sustained service disruption. We used to talk about the challenges of event volumes in fairly broad terms. Now, a lot of companies have solved for event noise and the associated disruption from alerts, and have moved on to the next stage of identifying more context around events. For example, how are these events related? What are the dependencies? Where is the root cause?

Q: As leaders and organizations mature along the model, what are the key areas of marked growth/change? Are they still bogged down by the same problems?

KD: The challenges remain the same. Organizations that get bogged down tend to hesitate on making the biggest change, which is a cultural transformation to a service ownership model. For many, the people changes are actually harder than technology changes. But the payoff is huge—truly reinventing culture is the only way I’ve seen companies accelerate innovation, increase uptime, and reduce costs at the same time.

Another growth area, from a technology perspective, is shifting away from managing incident response processes in systems that are built for queued work. Today’s world requires systems built for real-time work. In that regard, PagerDuty is a big enabler of the cultural transformation to service ownership and all the benefits that come with it.

Q: What is the hardest part for organizations trying to uplevel their operational maturity? Any tips on how to ease that transition?

KD: Resistance to change. You’ve got to find and embrace the people who are believers in transformational change and have the respect of the organization. They’ll be a big part of leading your team through this journey whether they have a title or not.

Q: Is there any low hanging fruit or advice you’d give to organizations who are struggling to break through from a reactive approach to adopt a more proactive posture?

KD: You don’t need to do it all at once. Change should be incremental. For example, when putting people on call for the first time as you shift to a service ownership model, start by putting them on call only during business hours so it doesn’t feel like a burden. What teams often realize is that it actually improves their quality of life, which makes them much more excited to go on call after hours. You should also focus first on the small subset of applications and teams that are needle movers for the business. You can’t afford for those to move slowly, so concentrate your energy there.

Statistics and data about past incidents and practitioner response patterns, fed continually into an AI model, can also help teams confidently make assumptions about new issues when they closely resemble past incidents.

______________________________________

If you’re interested in learning more about how to benchmark and improve your operational maturity, download this eBook. If you want to learn how PagerDuty can help you achieve these goals, contact your account manager and sign up for a 14-day free trial.

The Cost of Increasing Incidents: How COVID-19 Affected MTTR, MTTA, and More
by Hannah Culver | Wed, 06 Oct 2021 | https://www.pagerduty.com/blog/covid-19-affected-incidents-levels-mttr-mtta/

Digital transformation accelerated for many companies during the last 18 months. While it may have been on the agenda prior to COVID-19, teams were pushed to extreme speeds to digitize and meet the rising online demand. During this time, organizations learned important lessons that they’ll carry on with them into this new future. Leaders can take these learnings and use them to build better products, healthier and more efficient teams, and a happier customer base.

Our team aggregated some of these key findings in our State of Digital Operations Report. One important lesson we learned was that critical incidents increased by 19% YoY between 2019 and 2020, and it doesn’t look like incident volumes will be slowing down anytime soon.

Some organizations had more opportunities to learn and grow than others during this period. For instance, the highest lift in critical incident volume was seen in the Travel & Hospitality and Telecom industries, with 20% more critical incidents. In late March 2020, we saw that highly stressed cohorts, including online learning platforms, collaboration services, travel, non-essential retail, and entertainment services, were experiencing up to 11x the number of critical incidents.

In this installment of our State of Digital Operations blog series, we’ll chat through how 2020 affected metrics like MTTR (mean time to resolve) and MTTA (mean time to acknowledge), burnout and attrition rates, and what leaders can do to improve the lives of their teams and their customers looking towards a digital future.

How did MTTA and MTTR change?

MTTA is the time it takes for a responder to acknowledge the alert. MTTR is the time it takes to actually resolve the incident. These aren’t the sole metrics that determine operational excellence, yet many organizations use them as a proxy and derive important insights. These insights are useful when pinpointing strengths and weaknesses in incident response processes.
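For readers who want to compute these metrics from their own incident exports, here is a minimal sketch. It assumes each incident record carries created, acknowledged, and resolved timestamps; the field names are illustrative.

```python
# Sketch: compute MTTA and MTTR from incident timestamps.
# Field names and timestamps are illustrative; adapt to your own incident data export.
from datetime import datetime
from statistics import mean

incidents = [
    {"created": "2020-03-01T02:00:00", "acknowledged": "2020-03-01T02:04:00", "resolved": "2020-03-01T02:50:00"},
    {"created": "2020-03-02T14:10:00", "acknowledged": "2020-03-02T14:12:00", "resolved": "2020-03-02T15:02:00"},
]

def minutes_between(start: str, end: str) -> float:
    """Elapsed minutes between two ISO-8601 timestamps."""
    return (datetime.fromisoformat(end) - datetime.fromisoformat(start)).total_seconds() / 60

mtta = mean(minutes_between(i["created"], i["acknowledged"]) for i in incidents)
mttr = mean(minutes_between(i["created"], i["resolved"]) for i in incidents)
print(f"MTTA: {mtta:.1f} min, MTTR: {mttr:.1f} min")
```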

Our platform data showed that, while MTTR is improving, total time spent resolving incidents is still increasing. This is likely due to the increased amount of critical incidents. As incident numbers rise, even as teams get better at resolving them, total time spent on incidents is still increasing. This takes a toll on technical teams as they see their workloads shift from planned work to unplanned work.

MTTA is decreasing alongside MTTR. As teams onboard PagerDuty, they’re able to achieve a higher level of digital operations maturity via the platform. Digital operations maturity is the level of proficiency teams have in handling urgent work, ranging from manual to preventative. Each stage is characterized by key capabilities. As teams standardize incident response, their MTTR improves. As they create more efficient on-call and alerting rules, their MTTA improves.

Another aspect of MTTA is the ack%, or the percentage of critical alerts that get acknowledged after they fire. This is another way to demonstrate operational maturity: the higher the ack%, the more responsive and accountable your teams are. PagerDuty users were able to increase ack% over an account’s lifetime – the longer an account had been using PagerDuty, the better its ack% and MTTA were. Even with performance cohorts split out – the 10th percentile was nearly twice as fast at acknowledging incidents as the 25th percentile – all accounts saw improved MTTA over time.

Mobile adoption of the PagerDuty application helps improve MTTA and ack%, as on-call team members are rarely more than an arm’s reach away from being able to respond to an alert. This means customer-impacting issues are being handled faster than ever. But, it also means that engineers are never really away from work. As the lines between work and home blur, it’s important to understand the weight of these alerts on technical teams.

How were burnout and attrition affected?

An abrupt 2 AM wakeup call might be an inconvenience if it happens once every few months. But, if it’s happening multiple times per week, the effect is more pronounced; teams begin burning out, their mental health suffers, and eventually they leave the organization in the hopes of being able to achieve a better work/life balance elsewhere. During this period coined The Great Resignation, it’s imperative that organizations are able to attract and retain talent.

Leaders looking to understand their teams’ pain points can examine on-call both qualitatively and quantitatively to determine who is at risk of burning out, and why. Our platform data has given us some insight into what these triggers are.

Compared to 2019, organizations saw 4% more interruptions in 2020. However, when digging into the spread across time categories, there was a 9% increase in off-hours interruptions and a 7% lift in holiday/weekend-hour interruptions, compared to a 5% increase in business-hour interruptions and a 3% decrease in sleep-hour interruptions.

While it’s good that fewer engineers are being woken up during their sleep, the 9% increase in off-hours means that family time, dinners, evening workouts, and more are being put aside to respond to interruptions. Over time, this irregular schedule adds up to about 12 additional weeks worked per year from each on-call team member.

Our platform data also showed that the more frequently engineers were paged off-hours, the more burned out they became. The median user receives two non-working-hour interruptions a month. On the other end of the spectrum, burned-out users were experiencing 19 non-working-hour interruptions per month. It’s no surprise that these burned-out users were the most likely to leave the company.

We saw that responder profiles leaving the platform (our proxy for attrition) experienced a higher than average off-hour incident load. Using regression analysis, we looked at material off-hour incident work volume for both deleted users and remaining users and found a statistically significant positive correlation between off-hour volume and a user’s odds of deletion.

In other words, to retain employees, leaders need to understand how to decrease interruptions (especially non-working-hour interruptions) for their teams. One way to do this is with intelligent noise reduction.

Reducing the noise to keep responders healthy

These off-hours interruptions are sometimes unavoidable. After all, if your checkout cart stops working at 7 PM, you can’t simply accept lost revenue until your team is back online the next morning. But sometimes on-call engineers are paged at 2 AM for things they can do nothing about. Noise reduction helps because it allows teams to focus on what’s really important.

Production systems generate a lot of events; only some of them rise to the level of an alert – a sign that something could be wrong. The rest can simply be logged in your monitoring system for further inspection. Additionally, some alerts can be irrelevant: repeat alerts, non-actionable ones, or ones that could be resolved through auto-remediation with no human intervention.
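One very simple, generic form of this filtering is deduplicating repeat alerts inside a short time window so they roll up into a single actionable signal. The sketch below only illustrates the idea – PagerDuty’s actual event compression and alert grouping use far richer content- and machine-learning-based techniques.

```python
# Sketch: naive time-window deduplication of repeat alerts.
# Illustrative only; real event compression and alert grouping are far more sophisticated.
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=10)

def dedupe(alerts):
    """Keep one alert per (service, check) key within each 10-minute window."""
    last_seen = {}
    kept = []
    for alert in sorted(alerts, key=lambda a: a["time"]):
        key = (alert["service"], alert["check"])
        if key not in last_seen or alert["time"] - last_seen[key] > WINDOW:
            kept.append(alert)
        last_seen[key] = alert["time"]
    return kept

alerts = [
    {"service": "checkout", "check": "high_latency", "time": datetime(2020, 5, 1, 2, 0)},
    {"service": "checkout", "check": "high_latency", "time": datetime(2020, 5, 1, 2, 3)},
    {"service": "checkout", "check": "high_latency", "time": datetime(2020, 5, 1, 2, 20)},
]
print(f"{len(alerts)} raw alerts -> {len(dedupe(alerts))} actionable alerts")
```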

Our platform data showed that, through event compression and alert grouping techniques, we’re able to help customers reduce event-to-incident noise by 98%. Alert storms are reduced to the minimum necessary number of actionable alerts. If you want to learn more about this, you can hear from Etsy on how we helped the team proactively identify noisy, non-actionable alerts and control what got to disrupt the team’s flow state or deep sleep.

When alerts are meaningful, your teams are able to handle less but with more care. This limits the amount of time they need to spend away from the things they love during non-working hours and can protect against burnout and attrition.

It also means they’re able to focus on the critical issues at hand and provide excellent service to your customers. As organizations continue to focus on providing excellent customer experience in a digital world, this becomes even more important.

What does the future look like?

2020 changed the pace of acceleration for many companies making a digital transformation. But the pace won’t slow down now. Companies need to be prepared for this level of digital reliance from here on out.

If you think your teams are ready for a digital operations management platform, try PagerDuty free for 14 days. If you’d like to learn more about our findings, check out the State of Digital Operations Report.

PagerDuty’s Engineering Management Handbook for Healthier Teams and Services
by Hannah Culver | Tue, 05 Oct 2021 | https://www.pagerduty.com/blog/engineering-managment-handbook-for-healthier-teams-and-services/

This July, we launched The State of Digital Operations, which sheds light on the volume of real-time work, its growth over time, and how that increasingly burdens technical teams. We wanted to see how engineering leaders in our own organization approach some of the challenges surfaced in the report, so we had our Director of Product Marketing, Julian Dunn, sit down with two of our own engineering managers at PagerDuty, Leeor Engel and Dileshni Jayasinghe, for a roundtable to discuss real-world tactics for approaching topics like:

  • Managing unplanned, real-time work and building the on-call muscle
  • Understanding team and service health
  • Conducting operational reviews and sharing knowledge

If you’d rather watch or listen to the webinar, you can check out the recording here. For those who prefer to scan or read, we’ll share some of the highlights from their discussion in this blog post.

Managing real-time, unplanned work and building the on-call muscle

Our report findings show that incidents increased for our customers by about 19% from 2019-2020. Both Engel and Jayasinghe shared that their teams faced an increase in noise and signals. Establishing a better understanding of the alerts teams receive can help ease the load of on-call teams.

Jayasinghe shared that she’s been encouraging her teams to fine-tune their tooling, including how and when they get alerted and paged, which mirrored Engel’s philosophy that it’s important to rethink monitoring thresholds and whether or not the alerts team members are getting are actionable. Tuning for the “right level of actionable noise” is something we’ve been hearing across our customer base, especially with the change in work modality.

Just like many teams around the world, PagerDuty’s engineers made the shift to remote work, and as a part of this change, the entire organization has had to rethink how alerts are handled. Previously, teammates could turn in their chair and ask for help with triaging or ask a question before kicking off an incident. Now, Jayasinghe says it’s important to err on the side of caution and trigger incidents early so coordination can begin.

Like our customers, PagerDuty’s own teams are also constantly pedaling on their own digital operational maturity journey, and a key learning that we embrace ourselves has been the importance of building up an on-call muscle that can support the increase in alerts.

Whether you’re fresh out of school or a bootcamp, or simply never had to take an on-call shift in previous roles, going on-call for the first time can be intimidating. In the webinar, Dunn recalls from his own days as a software engineer, “They never talk about the operational side of it—being responsible for a service and being on call.” So, how are engineers supposed to get up to speed with going on-call?

At PagerDuty, the philosophy is to start from a culture of ownership, psychological safety, blamelessness, and continuous learning. In short, Jayasinghe said that the best way to help engineers build up their on-call muscle is by making sure that they feel supported. She lets her teams know that they can always escalate with no judgement, and a secondary on-call person is always at the ready to help the primary triage and walk through the issue if needed.

She also believes that engineering managers should carry the pager and be on the on-call rotation. “As a manager, it’s important to be on-call and show that you understand to build empathy for your teams. This shows new engineers that everyone has ownership over their services.”

As a best practice, both Engel and Jayasinghe suggest setting up shadowing between months two and three of an engineer’s tenure. Engel also emphasized reverse shadowing, where the engineer in training is in the driver’s seat and has support along the way. He noted that practice makes perfect and it helps new teammates become familiar with the tools and dashboards.

“You want as little novelty as possible when you get paged. That way, you have the things you need at your fingertips. If you can mentally rehearse that by getting those tools down, that’s a huge help.”

Understanding team and service health

During 2020, our platform data showed that users worked longer and less consistent hours than in 2019, with a third of our users working an extra 12 weeks per year! Additionally, we found that the more often an engineer was paged outside of business hours, the more likely they were to leave the platform (our proxy for attrition). With statistics like these, it’s clear that managing team health is paramount. But what does it look like in practice?

Engel thinks of health in two key dimensions: the people perspective and the service perspective. The people perspective means understanding how your team is doing mentally, how frequently they are interrupted, and when these interruptions occur. The service perspective (operating with a full-service ownership model) accounts for the load on a per-service basis.

He notes that it’s important to think about how to get the most “bang for your buck” by prioritizing noisy services and making changes that will have the highest impact on your team.

“One thing I definitely keep my eye out for is, did someone get woken up in the night, or worse multiple times in the night? That’s something you’ll want to address quickly,” Engel said.

Jayasinghe and Engel both spoke about the importance of having procedures that address nights like these. Jayasinghe recommends that managers create documentation that determines when someone needs an override for the rest of their on-call shift or when an on-call engineer should be given a day off to recover.

“As a manager, you should have these policies written so people are empowered to say, ‘I was woken up, I am going to take the time to recover and come back fresh,’” Jayasinghe said.

She also suggested teams take a look at their monitoring tooling. At PagerDuty, all teams share a dashboard of key services and metrics that helps us spot anomalies and increased load, so we can get ahead of issues before someone is paged. With this proactive approach, Jayasinghe and her team are able to keep their unplanned work under 20%.

Jayasinghe said that managers looking to get a more qualitative view on team health should ensure they’re scheduling regular 1-1s with their team members. She recommends the Plucky 1:1 Starter Pack, especially the questions pertaining to work-life balance for getting a pulse on how teams are doing.

Conducting operational reviews and sharing knowledge

As teams grow and mature, it’s important to create processes that support analyzing health and sharing knowledge. This helps teams across engineering keep up to date and learn from one another. Here is some advice our panelists gave on making sure learnings are shared widely.

Operational reviews are an excellent way for teams to understand how they’re performing. We even use PagerDuty’s analytics for this, specifically the operational report cards. We’ve created an on-call handoff scorecard that covers key things like interruptions per person and per service. Not only does this give the team a better idea of what happened during the rotation, it also helps build empathy between teammates. One thing these operational reviews also look at is the service’s SLOs.
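
As a toy illustration of the handoff scorecard idea (the record fields here are hypothetical and this is not PagerDuty’s actual report cards), a few lines of Python can summarize a rotation’s interruptions per person and per service:

    from collections import Counter

    def handoff_scorecard(interruptions):
        # Each interruption is a dict with 'responder' and 'service' keys,
        # e.g. exported from whatever incident data the team already collects.
        per_person = Counter(i["responder"] for i in interruptions)
        per_service = Counter(i["service"] for i in interruptions)
        return per_person, per_service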

SLOs (service level objectives) are metrics that show how a service’s reliability is performing against a customer-centric goal. Availability and latency are some of the most common SLOs. If there are anomalies in the monitoring that affect the SLOs, the team might take note of action items that can help them protect customer experience. This also determines which incidents are the most important to focus on, though it will take time and iteration.

“You pick your SLOs as a representative proxy of the impact to customers. It takes time to find out what that proxy is because it has to be something that actually matters to the customers,” Dunn reiterated.

Another aspect of SLOs is the corresponding error budget, or the acceptable amount of failure a service can have in a particular time window. Engel noted that error budgets help his teams understand how to calibrate risk taking and experimentation.
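
The arithmetic behind an error budget is simple; for example, a 99.9% monthly availability SLO leaves roughly 43 minutes of tolerated downtime:

    slo = 0.999
    minutes_in_month = 30 * 24 * 60                # 43,200 minutes
    error_budget = (1 - slo) * minutes_in_month    # about 43.2 minutes
    print(f"{error_budget:.1f} minutes of downtime allowed this month")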

You can spend the error budget remaining in each window on chaos engineering. At PagerDuty, we call this Failure Friday. Teams intentionally break parts of a service in a planned, safe way to understand how it will respond to failure. This prepares teams for the event of a real failure, and can surface learning opportunities to mitigate that failure entirely.
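
One way to keep such experiments inside the budget is a simple guardrail check before running anything; this is a sketch under the assumption that the team tracks its remaining budget in minutes (the function name and threshold are illustrative, not how PagerDuty gates Failure Fridays):

    def can_run_experiment(budget_minutes_remaining, estimated_impact_minutes):
        # Only run the planned chaos experiment if its estimated impact fits
        # comfortably (here, within half) of the error budget still left.
        return estimated_impact_minutes <= 0.5 * budget_minutes_remaining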

Beyond Failure Fridays, Engel also suggests learning from postmortems. All teams should be encouraged to share postmortems with one another and make the postmortem meetings as open as possible. Beyond reading current postmortems, teams can also review historical postmortems to see what happened, what action items were derived, and how it affected the system as a whole. Engel also suggests doing postmortems with engineering leadership across teams.

“This is where we’re looking for systemic issues that might have affected this team, but might be a theme we’re also seeing in other incidents. Maybe we can address this and save other teams from ever running into this problem,” Engel explained.

Reading postmortems is a great way to learn from past failures for both old team members and new. But, if you’re specifically looking for ways to share knowledge across teams as you scale and grow, Jayasinghe shared her advice.

She suggests staffing new teams with at least some existing engineers to maintain the culture. New managers coming to an organization should join mature teams so they can learn from their new reports. This helps keep existing practices in place. Additionally, new teams should shadow old teams during on-call handoffs to become familiar with tools and the monitoring dashboards.

Jayasinghe and Engel noted that their managerial peer group is crucial to learning. PagerDuty managers work to standardize tooling, processes, and dashboards and document these in our Ops Guides. Each service has an Ops Guide located in a Github repository and the links are available to everyone. For example, you can check out our on-call Ops Guide.

Further learning from our engineering leaders

Last but not least, Engel and Jayasinghe shared the industry resources they find most helpful. These include:

If you want to hear more from Leeor Engel, Dileshni Jayasinghe, and Julian Dunn, watch their on-demand webinar, “Perspectives on Digital Operations: The Volume and Human Impact of On-Call and Real-Time Work.” If you’d like to see what PagerDuty can do for your teams, begin your 14-day free trial.

The post PagerDuty’s Engineering Management Handbook for Healthier Teams and Services appeared first on PagerDuty.

Has the firefighting stopped? The effect of COVID-19 on on-call engineers by PagerDuty https://www.pagerduty.com/blog/has-the-firefighting-stopped-effect-covid-19-on-engineers/ Mon, 30 Aug 2021 13:00:58 +0000 https://www.pagerduty.com/?p=71138 With digital becoming the primary channel for work, education, shopping, and entertainment in the last 18 months, it’s no surprise that workloads for technical teams...

With digital becoming the primary channel for work, education, shopping, and entertainment in the last 18 months, it’s no surprise that workloads for technical teams and on-call engineers have increased.

Data from PagerDuty’s inaugural platform insights report, The State of Digital Operations, highlights this reality. As of July 2021, the average number of events managed daily by PagerDuty is 37 million, with 61,000 of those being critical incidents. Critical incidents are defined as those from high urgency services, not auto-resolved within five minutes, but acknowledged within four hours and resolved within 24 hours. According to our data, the number of critical incidents grew by 19% from 2019-2020.

For many teams responsible for supporting this always-on world, “firefighting” has become the typical mode of operation. But this digital shift is here to stay, and the workload is not going to reduce. Over the next few blogs, we’re going to dig further into the findings from our platform data and explore how the growing volume of real-time work is increasingly burdening technical teams. In this first blog, we’ll share how this firefighting affects burnout levels, how to classify and quantify interruptions, and what teams can do to avoid attrition.

Risk of burnout a real threat

Life as an on-call engineer is always hectic, but we looked specifically at what the experience was like in the last 18 months. Comparing the hours worked in the first 12 months of the pandemic (March 2020-March 2021) to the preceding 12 months (March 2019-March 2020), we can see that more than a third of PagerDuty users worked far less consistent schedules in 2020 than in 2019. On average, those individuals worked the equivalent of two extra hours per day, which adds up to an extra 12 weeks of work over the course of a year (two extra hours across roughly 250 working days is about 500 hours, or twelve 40-hour weeks).

Humans sit at the heart of incident response. Being aware of overwork is critical for businesses, managers, and technical teams alike. The continual pressure, disruption to responders’ routines, and the impact on individuals’ lives is a recipe for burnout. And it’s important to remember that not all interruptions are created equally. Some take a bigger toll on the wellbeing of on-call engineers.

Interruptions around the clock

An interruption is a non-email notification—including a push notification to a mobile phone, an SMS, or a phone call—generated by an incident. Looking into our platform data, it’s clear that both how many interruptions a responder faces and the time of day they are interrupted affect their level of burnout.

The total volume of interruptions increased 4% in 2020 from 2019, with some teams hit harder than others. This is especially true of smaller companies where 46% of users are interrupted each month compared to 30% of enterprise users. Smaller organizations are often in hypergrowth mode and may lack the resources of larger businesses, but managers must balance the drive to grow against the risk of burned-out technical staff.

The time of day an interruption happens is also important. Between 2019 and 2020, there was a 9% increase in off-hour interruptions and a 7% lift in holiday and weekend hour interruptions. We define the types of interruptions as follows (a small classification sketch appears after the list):

  • Business Hours Interruptions: Sent between 8 a.m. and 6 p.m. Monday to Friday in the user’s local time.
  • Off Hours Interruptions: Sent between 6 p.m. and 10 p.m. Monday to Friday or during 8 a.m. to 10 p.m. over the weekend in the user’s local time.
  • Sleep Hours Interruptions: Sent between 10 p.m. and 8 a.m. in the user’s local time.
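
The sketch below applies these definitions to a notification timestamp already converted to the user’s local time (the function name is illustrative, not part of any PagerDuty API):

    from datetime import datetime

    def classify_interruption(ts: datetime) -> str:
        hour, weekday = ts.hour, ts.weekday()     # weekday(): Mon=0 ... Sun=6
        if hour >= 22 or hour < 8:
            return "sleep"                        # 10 p.m.-8 a.m., any day
        if weekday < 5 and 8 <= hour < 18:
            return "business"                     # 8 a.m.-6 p.m., Mon-Fri
        return "off"                              # evenings and weekend daytime

    classify_interruption(datetime(2021, 3, 3, 15))   # 'business'
    classify_interruption(datetime(2021, 3, 3, 3))    # 'sleep'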

When engineers are on call, they understand that they might get interrupted. But there is a clear difference between an interruption sent at 3.p.m. and one at 3.a.m, and the subsequent impact on the person. We broke down the analysis of off-hours interruptions further and identified three distinct cohorts.

Responders in the “good” cohort, at the median, experienced two non-working-hour interruptions per month. Those in the “bad” 75th percentile, whom we identify as “overworked,” have seven non-working-hour interruptions a month. And for those in the 90th percentile, it certainly is “ugly”: these responders are on the receiving end of 19 non-working-hour interruptions a month. That is nearly three times as many as the “overworked,” and roughly ten times as many as the median responder.
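
Those cohort cut-offs are just percentiles over monthly interruption counts; a minimal sketch using Python’s standard library (with made-up sample data) would be:

    import statistics

    monthly_off_hour_interruptions = [0, 1, 1, 2, 2, 3, 5, 7, 9, 14, 19, 26]
    cuts = statistics.quantiles(monthly_off_hour_interruptions, n=100)
    median = statistics.median(monthly_off_hour_interruptions)
    p75, p90 = cuts[74], cuts[89]   # the "overworked" and "ugly" thresholds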

Tackling the Great Resignation

Operating under this kind of stress is clearly not sustainable. The result can be employee attrition. Our data shows that the more often people were disturbed in their off hours, the more likely they were to leave the PagerDuty platform (our proxy for attrition). The profiles of responders leaving the platform showed they experienced off-hour incidents every 12 days compared to every 15 days for remaining users.

Currently, many sectors are in the midst of what economists are calling The Great Resignation. Employers can’t afford to lose talented and skilled technical staff because they are burned out. Organizations need to actively manage incident response workloads and mature their on-call processes to promote better team health and avoid overworking their people. Here are three ways teams can take back control.

  1. Measure on-call qualitatively and quantitatively with operational analytics. Teams can measure on-call workloads by looking at the volume of interruptions and the time spent on-call. They can then combine this data with other metrics, such as time of day, severity, and number of escalations, to identify the individuals most at risk of burnout and contextualize their on-call experience. PagerDuty Analytics collates data across incidents, services, and teams, and turns it into insights and recommendations to help managers understand the burden on on-call teams.
  2. Stop getting interrupted by non-actionable alerts. When responders are being bombarded with alerts, it creates a stressful environment where everything is “urgent.” Intelligent alert reduction cuts down on this noise, allowing responders to focus on the incidents that really need attention. You can tune alerts to share the right amount of information your teams want, even if that means letting certain specific noise cut through. Event Intelligence is PagerDuty’s AI-powered tool for digital operations. Its adaptive learning algorithms separate signals from noise and only alert teams on genuine incidents that require human intervention.
  3. Create automation sequences that can auto-remediate without human intervention (see the sketch after this list). Another way of taking back control is to give responders access to self-service capabilities to resolve an issue without needing to escalate to a subject matter expert, or even to involve a human at all. Teams can document incident response processes (e.g., scripts, tools, API calls, manual commands) into a runbook that can be automatically triggered to resolve an incident. Incidents are resolved in real-time, with minimal stress. Check out this eBook on Runbook Automation from PagerDuty and Rundeck to learn more.
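
Here is a minimal sketch of the third idea (the host and cleanup command are hypothetical, and a real runbook would be triggered by your automation tooling rather than called by hand): an automated first-response step that tries a known fix and only escalates to a human if it fails.

    import subprocess

    def remediate_full_disk(host):
        # Known-good fix for a recurring "disk almost full" alert: trim old
        # journal logs. Returns True if the fix ran cleanly, False to escalate.
        result = subprocess.run(
            ["ssh", host, "sudo", "journalctl", "--vacuum-size=500M"],
            capture_output=True,
        )
        return result.returncode == 0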

As we adjust to the new normal, firefighting mode must be matured into a more proactive and preventative model of incident response to mitigate burnout and attrition. An always-on world needs a new approach that helps businesses to respond effectively when an incident does strike, and reduces negative impacts on the teams responsible for supporting digital services. Proactively managing workloads means that incidents are dealt with in real-time, every time, while reducing the burden on on-call engineers.

To learn more about our platform data learnings, check out the rest of our State of Digital Operations report or watch our Perspectives on Digital Operations: The Volume and Human Impact of On-Call and Real-Time Work webinar.

The post Has the firefighting stopped? The effect of COVID-19 on on-call engineers appeared first on PagerDuty.

The top 4 key levers to build towards long-lasting digital operations maturity by PagerDuty https://www.pagerduty.com/blog/4-key-levers-for-digital-operations-maturity/ Tue, 17 Aug 2021 13:00:16 +0000 https://www.pagerduty.com/?p=70864 Digital operations maturity is a journey. The first step is to understand where you are, where you want to get to, and what’s keeping you...

Digital operations maturity is a journey. The first step is to understand where you are, where you want to get to, and what’s keeping you from getting there. Only then can you make strategic decisions and lay out a plan for how to approach any hurdles and land where you want your organization to be. For many organizations, upleveling operational maturity requires investment in driving cultural change with fundamental shifts to operating models.

Change is hard, but accepting two facts can help your team embrace the need to adapt to your increasingly complex technology ecosystem:

  1. Incidents are going to happen.
  2. There are ways to prepare your team and your technology stack to ease the pain and impact when things go wrong.

There are four key levers that can help businesses accelerate their journey towards adopting a more proactive posture for digital operations. Companies may be at varying degrees of sophistication in these areas, but investing in any or all of these levers and building them into your strategic roadmaps will set your teams up for success.

Lever One: Leverage AI/ML & Automation Across The Incident Response Lifecycle

One of the key differences between reactive and proactive organizations is the use of artificial intelligence/machine learning (AI/ML) and automation. Not only can these technologies help reduce and collate noise so that only the most urgent and significant signals come through, they can also help with root cause analysis and auto-remediation. Applying automation and advanced technology like AI/ML across the phases of the incident response lifecycle can dramatically cut down on repetitive, highly manual tasks, reduce the number of false positives, and streamline processes to empower more individuals to take action.

Mature organizations are looking to technologies such as AIOps and runbook automation for more efficiency and improved productivity. AIOps uses big data, machine learning, and analytic insights to suppress noise, correlate events, and automate the identification and resolution of IT issues, while runbook automation takes repetitive manual tasks out of the equation using SOPs containing expert knowledge for common actions.
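
Under the hood, much of the noise suppression boils down to grouping related events; the sketch below is a deliberately simplified stand-in (not how any AIOps product actually works) that emits one alert per service-and-signature pair per rolling window:

    from datetime import timedelta

    def suppress_noise(events, window=timedelta(minutes=10)):
        # Each event is a dict with 'service', 'signature', and 'timestamp' keys.
        last_alerted, alerts = {}, []
        for e in sorted(events, key=lambda e: e["timestamp"]):
            key = (e["service"], e["signature"])
            if key not in last_alerted or e["timestamp"] - last_alerted[key] > window:
                alerts.append(e)                   # new or stale group: alert
                last_alerted[key] = e["timestamp"]
        return alerts                              # duplicates inside the window are dropped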

To learn more check out these resources:

Lever Two: Shift Towards Full-Service Ownership

Full-service ownership, commonly known as “You Build It, You Own It” or “code ownership,” can improve digital operations maturity because it shifts organizations towards DevOps practices: developers take responsibility for supporting the software they write in production. This methodology makes the people closest to the technology from a design and implementation perspective responsible for the code throughout the entire product development lifecycle.

Mature, proactive teams reap the benefits of this cultural shift in the form of bringing developers closer to their customers, the business, and the value being delivered by the service or application. It also means they will have to be on call for their own work, which involves some change management, but ultimately it puts accountability directly into the hands of that engineer or team. When ownership is established, this direct connection helps to orchestrate the incident response lifecycle, and makes escalation and routing of an incident more straightforward.

To learn more see these resources:

Lever Three: Establish a Blameless Culture of Knowledge Sharing and Continuous Learning

A feature of mature, proactive organizations compared to their more reactive peers is a commitment to knowledge sharing and continuous learning. Sharing information may sound easy, but building the right foundation for pervasive continuous learning requires cultural change and cannot be achieved overnight. Making this shift involves a change in philosophy and an intentional effort to create a blameless culture and psychological safety based on the acceptance that with complex systems, incidents are inevitable and will happen. Collectively, these efforts help to ensure that ITOps and DevOps teams have access to the right information to do their jobs and operate effectively.

Establishing this blameless culture starts with breaking down silos of knowledge and encouraging sharing and productive conversation around how to solve issues and, furthermore, prevent them in the future. Otherwise, engineers will hesitate to speak up when incidents occur for fear of being blamed. This silence increases overall mean time to acknowledge (MTTA) and mean time to resolve (MTTR), and exacerbates the impact of incidents. The mindset must be one of accepting that failure is inevitable in complex systems, while being aware that how we respond to failure is what matters. Once you have that, you can leverage practices like blameless post-mortems to proactively plan for preventing repeat events in the future.

For additional resources, see these Ops Guides:

Lever Four: Collaborate Across the Enterprise as a Unified Front for Customer Experience

In a time when customer and enterprise service expectations have never been higher, technical teams don’t want to be learning about issues from their customers. An invaluable trait of more digitally mature organizations is improved communication and collaboration with cross-functional partners in the business. This creates a united front for handling updates to external stakeholders (such as partners or customers) to manage that end-user experience.

Organizations can then be more proactive about handling any customer-impacting issues. It keeps all involved stakeholders on the same page and improves internal coordination among developers, IT, operations, and customer service. Better alignment enables each segment of the business to keep their respective leadership teams up to date on resolution status and proactively make any plans necessary to address real-time issues.

To learn more check out these resources:

Your investment in any one of these levers may vary by your maturity level or unique organizational needs. However, at some point during your digital transformation, you’ll need to evaluate how you’re pulling on each of them to build towards long-lasting digital operations maturity. This process is a marathon rather than a sprint, and any effort put towards these initiatives will allow you to reap the benefits for the long term.

If you want more information about how to plan for and begin improving your digital operations maturity, take a look at this eBook. If you want to learn how PagerDuty can help you achieve these goals, contact your account manager and sign up for a 14-day free trial.

The post The top 4 key levers to build towards long-lasting digital operations maturity appeared first on PagerDuty.

Answer to the Ultimate Question of (On-Call) Life, the Universe, and Everything: 71 by Lisa Yang https://www.pagerduty.com/blog/answer-ultimate-question-on-call-life/ Thu, 06 Dec 2018 13:00:43 +0000 https://www.pagerduty.com/?p=51028 In The Hitchhiker’s Guide to the Galaxy, a group of scientist mice built a mega-computer named “Deep Thought” to Answer “The Ultimate Question of Life,...

In The Hitchhiker’s Guide to the Galaxy, a group of scientist mice built a mega-computer named “Deep Thought” to Answer “The Ultimate Question of Life, the Universe, and Everything.” After 7.5 million years, the machine produced “42.”

At PagerDuty, we did something similar, except we didn’t have scientist mice or wait 7.5 million years. Instead, we had a data scientist and nine years of PagerDuty on-call notification data, which we analyzed across 10,000 PagerDuty customers, 50,000 responders, and 760 million notifications—and our number was “71.”

What this means: Our Operations Health Management Service (OHMS) found that responders who maintained an average health score of 71 or higher were more likely to stay at their companies for more than 18 months.

Don’t Panic

Let me back up a little bit and explain “71.” Over the past year, my team (aka the Digital Insights team) created an algorithm to contextualize on-call pain.

The output of the algorithm was a number from 0 to 100. A health score of 100 means you’ve never received a notification within a specific time period (week, month, or year)—therefore, you might not be a responder, and we remove folks like you from our study calculations (and you’re perfectly healthy). In contrast, the closer to 0 you are, the more on-call pain you’re experiencing.

This health score is a product of 16 different facets. We took into consideration the following (a toy sketch of such a scoring function appears after the list):

  • The time of day you’re being notified—if it’s dinner hours, evening hours, sleep hours, or the weekend
  • How frequently the notifications are coming in
  • How many days in a row you’re receiving notifications
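
A toy version of such a scoring function, assuming only the time-of-day facet and illustrative penalty weights (the real algorithm blends 16 facets, including multi-day streaks), might look like this:

    SLEEP_PENALTY, OFF_HOURS_PENALTY, BUSINESS_PENALTY = 6, 3, 1

    def weekly_health_score(notification_times):
        # notification_times: datetime objects in the responder's local time.
        score = 100
        for ts in notification_times:
            if ts.hour >= 22 or ts.hour < 8:
                score -= SLEEP_PENALTY
            elif ts.weekday() >= 5 or not (8 <= ts.hour < 18):
                score -= OFF_HOURS_PENALTY
            else:
                score -= BUSINESS_PENALTY
        return max(score, 0)

    from datetime import datetime
    weekly_health_score([datetime(2021, 3, 2, 3), datetime(2021, 3, 4, 14)])   # 93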

Two people with the same health score might have completely different contribution points, as seen in these screenshots:

The person on the left received only three sleep notifications yet has a health score of 64, compared to the person on the right, who received seven. The algorithm not only takes the current day into consideration, but also looks at the volume of notifications in the days before. Looking at long-term on-call pain trends is the only way to accurately tell the story of a responder’s on-call health.

What Does On-Call Pain Look Like?

On-call pain manifests differently to different people, but in short:

  • It’s having sleep disrupted, night after night
  • It’s dinners ruined and Netflix marathons cut short
  • It’s family time, interrupted
  • It’s not wanting to do fun things because of sleep deprivation

What Problems Can On-Call Pain Cause?

On-call pain can lead to a number of problems, including persisting grouchiness, loss of productivity, responder burnout (leading to them ignoring pages/alerts), and abhorrent misuse of pop culture references.

So we want to avoid all those problems, right? After all, better work-life balance = happier workers = better productivity.

With our OHMS study, we were able to triangulate the responders who were beyond burnt out and most likely to leave. Replacing an average on-call responder can add up to $300k; see the source below for a breakdown of why it costs so much.

Source: https://www.glassdoor.com/employers/blog/calculate-cost-per-hire/

Share and Enjoy

Some (those who are not on call) may argue that being on call is part of the job and ask: “What’s the big deal?” If you work with on-call responders, I invite you to add yourself as a shadow on Escalation Policies for one week to understand their pain.

Because, yes, being woken up one night a week might not be that big of a deal. But what about two nights in a row? Or three? On-call responders and new parents know that, despite how sleep-deprived they might be, they’re still expected to show up to work on time the next morning, carry on with project delivery, be a sociable coworker, and still respond to incidents as they come in.

This is where the health score comes in: Putting a number to your on-call pain lets someone know when you need help and also informs your managers that someone else needs to take over an on-call shift.

This is also beneficial to your team as a whole because, as I explained earlier, employees experiencing excessive on-call pain (an average health score below 71) are more likely to leave and find another job—which further exacerbates the pain for everyone else staying behind.

Only YOU Can Prevent Forest Fires

Oh, that’s the wrong cultural reference. Oops. Anyway, now that you know on-call pain has real consequences (and hopefully you’re going to try being on call yourself), did you also know you can do something about it?

Check out the health scores of your team using PagerDuty’s Operations Health Management Service (OHMS). With OHMS, you’ll receive a weekly email that calls out the top 3 responders, top 3 teams, and top 3 services with a health score of below 71. You’ll also have access to consultants who work with you to maximize your PagerDuty investment by recommending best practices and helping implement the features that best fit the needs of your teams.

So what is the Answer to the Ultimate Question of Life, the Universe and Everything?

More stable systems and happier employees. That’s exactly 42 characters! WOW!

The post Answer to the Ultimate Question of (On-Call) Life, the Universe, and Everything: 71 appeared first on PagerDuty.
