PagerDuty’s Engineering Management Handbook for Healthier Teams and Services
This July, we launched The State of Digital Operations, which sheds light on the volume of real-time work, its growth over time, and how that increasingly burdens technical teams. We wanted to see how engineering leaders in our own organization approach some of the challenges surfaced in the report, so we had our Director of Product Marketing, Julian Dunn, sit down with two of our own engineering managers at PagerDuty, Leeor Engel and Dileshni Jayasinghe, for a roundtable to discuss real-world tactics for approaching topics like:
- Managing unplanned, real-time work and building the on-call muscle
- Understanding team and service health
- Conducting operational reviews and sharing knowledge
If you’d rather watch or listen to the webinar, you can check out the recording here. For those who prefer to scan or read, we’ll share some of the highlights from their discussion in this blog post.
Managing real-time, unplanned work and building the on-call muscle
Our report findings show that incidents increased for our customers by about 19% from 2019 to 2020. Both Engel and Jayasinghe shared that their teams faced an increase in noise and signals. Establishing a better understanding of the alerts teams receive can help ease the load on on-call teams.
Jayasinghe shared that she’s been encouraging her teams to fine-tune their tooling, including how and when they get alerted and paged. This mirrors Engel’s philosophy that it’s important to rethink monitoring thresholds and to ask whether the alerts team members receive are actionable. Tuning for the “right level of actionable noise” is something we’ve been hearing across our customer base, especially with the change in work modality.
Just like many teams around the world, PagerDuty’s engineers made the shift to remote work, and as a part of this change, the entire organization has had to rethink how alerts are handled. Previously, teammates could turn in their chair and ask for help with triaging or ask a question before kicking off an incident. Now, Jayasinghe says it’s important to err on the side of caution and trigger incidents early so coordination can begin.
Like our customers, PagerDuty’s own teams are still progressing on their digital operations maturity journey, and a key lesson we’ve embraced is the importance of building an on-call muscle that can support the increase in alerts.
Whether you’re fresh out of school or a bootcamp, or simply haven’t had to take an on-call shift in previous roles, going on-call for the first time can be intimidating. In the webinar, Dunn recalled his own days as a software engineer: “They never talk about the operational side of it: being responsible for a service and being on call.” So, how are engineers supposed to get up to speed with going on-call?
At PagerDuty, the philosophy is to start from a culture of ownership, psychological safety, blamelessness, and continuous learning. In short, Jayasinghe said that the best way to help engineers build up their on-call muscle is by making sure that they feel supported. She lets her teams know that they can always escalate with no judgement, and a secondary on-call person is always at the ready to help the primary triage and walk through the issue if needed.
She also believes that engineering managers should carry the pager and be on the on-call rotation. “As a manager, it’s important to be on-call and show that you understand to build empathy for your teams. This shows new engineers that everyone has ownership over their services.”
As a best practice, both Engel and Jayasinghe suggest setting up shadowing between months two and three of an engineer’s tenure. Engel also emphasized reverse shadowing, where the engineer in training is in the driver’s seat and has support along the way. He noted that practice makes perfect and that it helps new teammates become familiar with the tools and dashboards.
“You want as little novelty as possible when you get paged. That way, you have the things you need at your fingertips. If you can mentally rehearse that by getting those tools down, that’s a huge help.”
Understanding team and service health
During 2020, our platform data showed that users worked longer and less consistent hours than in 2019, with a third of our users putting in an extra 12 weeks of work per year! Additionally, we found that the more often an engineer was paged outside of business hours, the more likely they were to leave the platform (our proxy for attrition). With statistics like these, it’s clear that managing team health is paramount. But what does it look like in practice?
Engel thinks of health in two key dimensions: the people perspective and the service perspective. The people perspective means understanding how your team is doing mentally, how frequently they are interrupted, and when these interruptions occur. The service perspective (operating with a full-service ownership model) accounts for the load on a per-service basis.
He notes that it’s important to think about how to get the most “bang for your buck” by prioritizing noisy services and making changes that will have the highest impact on your team.
“One thing I definitely keep my eye out for is, did someone get woken up in the night, or worse multiple times in the night? That’s something you’ll want to address quickly,” Engel said.
Jayasinghe and Engel both spoke about the importance of having procedures that address nights like these. Jayasinghe recommends that managers create documentation that determines when someone needs an override for the rest of their on-call shift or when an on-call engineer should be given a day off to recover.
“As a manager, you should have these policies written so people are empowered to say, ‘I was woken up, I am going to take the time to recover and come back fresh,’” Jayasinghe said.
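To make a written policy like this easy to apply, a team could flag night-time pages automatically at the end of each shift. The snippet below is only a minimal sketch and not something the panelists described: the quiet-hours window, the threshold, and the responder names are all assumptions chosen for illustration.

```python
from collections import Counter
from datetime import datetime, time

# Hypothetical quiet hours; a real policy would define its own window.
NIGHT_START = time(22, 0)
NIGHT_END = time(6, 0)


def is_night(ts: datetime) -> bool:
    """True when a local timestamp falls inside the quiet-hours window."""
    t = ts.time()
    # The window wraps past midnight, so it is the union of two ranges.
    return t >= NIGHT_START or t < NIGHT_END


def night_pages(pages):
    """Count night-time pages per responder.

    `pages` is an iterable of (responder, local_datetime) tuples.
    """
    return Counter(who for who, ts in pages if is_night(ts))


def needs_recovery(pages, threshold=1):
    """Responders paged at night at least `threshold` times (assumed rule)."""
    counts = night_pages(pages)
    return sorted(who for who, n in counts.items() if n >= threshold)


# Made-up sample data for one night of a rotation.
pages = [
    ("alex", datetime(2021, 3, 2, 2, 15)),
    ("alex", datetime(2021, 3, 2, 4, 40)),
    ("sam", datetime(2021, 3, 2, 14, 5)),
]
print(needs_recovery(pages, threshold=2))
```

The output of a check like this is a prompt for a conversation, not a replacement for one. The policy itself still decides whether the outcome is an override or a day off.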
She also suggested teams take a look at their monitoring tooling. At PagerDuty, we have a dashboard all teams share with key services and metrics that helps us see anomalies and increased load so we can proactively approach issues before someone is paged. With this proactive approach, Jayasinghe and her team are able to keep their unplanned work at less than 20%.
Jayasinghe said that managers looking to get a more qualitative view of team health should ensure they’re scheduling regular 1:1s with their team members. She recommends the Plucky 1:1 Starter Pack, especially the questions pertaining to work-life balance, for getting a pulse on how teams are doing.
Conducting operational reviews and sharing knowledge
As teams grow and mature, it’s important to create processes that support analyzing health and sharing knowledge. This helps teams across engineering keep up to date and learn from one another. Here is some advice our panelists gave on making sure learnings are shared widely.
Operational reviews are an excellent way for teams to understand how they’re performing. We even use PagerDuty’s analytics for this, specifically the operational report cards. We’ve created an on-call handoff scorecard that covers key things like interruptions per person and per service. Not only does this give the team a better idea of what happened during the rotation, it also helps build empathy between teammates. These operational reviews also look at the service’s SLOs.
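For teams that want to try the scorecard idea before adopting any tooling, the two counts mentioned above are simple to compute by hand. This is a hedged sketch, not PagerDuty’s analytics or its API: the `responder` and `service` field names and the sample records are hypothetical.

```python
from collections import Counter


def handoff_scorecard(incidents):
    """Summarize one rotation: interruptions per person and per service.

    `incidents` is a list of dicts with hypothetical "responder" and
    "service" keys, one dict per interruption.
    """
    return {
        "per_person": Counter(i["responder"] for i in incidents),
        "per_service": Counter(i["service"] for i in incidents),
    }


# Made-up records standing in for one on-call rotation.
incidents = [
    {"responder": "alex", "service": "billing-api"},
    {"responder": "alex", "service": "billing-api"},
    {"responder": "sam", "service": "search"},
]

card = handoff_scorecard(incidents)
# List the noisiest services first so the review starts with them.
for service, count in card["per_service"].most_common():
    print(f"{service}: {count}")
```

Sorting by count puts the noisiest services at the top of the handoff, which lines up with Engel’s advice to prioritize the changes that will have the highest impact on the team.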
SLOs (service level objectives) are reliability targets that show how a service is performing against a customer-centric goal. Availability and latency are some of the most common SLOs. If there are anomalies in the monitoring that affect the SLOs, the team might take note of action items that can help them protect customer experience. This also helps determine which incidents are the most important to focus on, though getting it right takes time and iteration.
“You pick your SLOs as a representative proxy of the impact to customers. It takes time to find out what that proxy is because it has to be something that actually matters to the customers,” Dunn reiterated.
Another aspect of SLOs is the corresponding error budget, or the acceptable amount of failure a service can have in a particular time window. Engel noted that error budgets help his teams understand how to calibrate risk taking and experimentation.
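As a rough illustration of the arithmetic, an availability SLO translates into an error budget by multiplying the allowed failure fraction by the length of the window. The example below is a sketch under assumed numbers (a 99.9% target over a 30-day window, with 12.5 minutes of downtime so far). The function names are ours, not part of any PagerDuty product.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an availability SLO."""
    if not 0.0 < slo_target <= 1.0:
        raise ValueError("slo_target must be in (0, 1]")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def remaining_budget_minutes(slo_target, window_days, downtime_minutes):
    """Budget left after subtracting observed downtime (may go negative)."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


# Assumed figures: 99.9% availability target, 30-day window.
budget = error_budget_minutes(0.999, 30)
left = remaining_budget_minutes(0.999, 30, downtime_minutes=12.5)
print(f"budget: {budget:.1f} min, remaining: {left:.1f} min")
# prints "budget: 43.2 min, remaining: 30.7 min"
```

A negative remainder means the budget for the window is already spent, which is a reasonable signal to pause risky changes until the window rolls over.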
You can spend the error budget remaining in each window on chaos engineering. At PagerDuty, we call this Failure Friday. Teams intentionally break parts of their services in a planned, safe way to understand how those services respond to failure. This prepares teams for a real failure and can surface ways to prevent that failure entirely.
Beyond Failure Fridays, Engel also suggests learning from postmortems. All teams should be encouraged to share postmortems with one another and make the postmortem meetings as open as possible. Beyond reading current postmortems, teams can also review historical postmortems to see what happened, what action items were derived, and how it affected the system as a whole. Engel also suggests doing postmortems with engineering leadership across teams.
“This is where we’re looking for systemic issues that might have affected this team, but might be a theme we’re also seeing in other incidents. Maybe we can address this and save other teams from ever running into this problem,” Engel explained.
Reading postmortems is a great way to learn from past failures for both old team members and new. But, if you’re specifically looking for ways to share knowledge across teams as you scale and grow, Jayasinghe shared her advice.
She suggests staffing new teams with at least some existing engineers to maintain the culture. New managers coming to an organization should join mature teams so they can learn from their new reports. This helps keep existing practices in place. Additionally, new teams should shadow old teams during on-call handoffs to become familiar with tools and the monitoring dashboards.
Jayasinghe and Engel noted that their managerial peer group is crucial to learning. PagerDuty managers work to standardize tooling, processes, and dashboards and document these in our Ops Guides. Each service has an Ops Guide located in a GitHub repository and the links are available to everyone. For example, you can check out our on-call Ops Guide.
Further learning from our engineering leaders
Last but not least, Engel and Jayasinghe shared the industry resources they find most helpful. These include:
If you want to hear more from Leeor Engel, Dileshni Jayasinghe, and Julian Dunn, watch their on-demand webinar, “Perspectives on Digital Operations: The Volume and Human Impact of On-Call and Real-Time Work.” If you’d like to see what PagerDuty can do for your teams, begin your 14-day free trial.