NOC | Tags | PagerDuty

Three Teams That Can Use AIOps to Work Smarter, Not Harder by Hannah Culver

Hannah Culver — Mon, 28 Aug 2023 12:00:29 +0000

There isn’t a boardroom today that isn’t asking how AI and generative AI applications can help drive efficiency and accelerate the business. For organizations looking to capitalize on ML and automation to improve their efficiency during incidents, AIOps is a tangible, proven application and an exciting opportunity for ITOps teams.

As we’ve seen across market landscape evaluations, there are a number of ways that solutions can be implemented. Despite this, the problems AIOps solutions aim to address remain fairly consistent: fewer incidents and faster resolution. But which teams can stand to benefit from this powerful technology and how will AIOps help them achieve their desired business outcomes?

Understanding how different teams can implement best practices to see a reduction in MTTR, total incidents, and time to adopt automation will help ensure that each team is taking value from your investment. Here are three teams that stand out as having much to gain from leveraging AIOps: Network Operation Center (NOC) teams, Major Incident Management (MIM) teams, and distributed service owning teams. Let’s cover each.

NOC teams

If you have a NOC, it acts as your central nervous system. You may also be in the middle of undertaking modernization efforts to reduce both cost and risk.

Many of our NOC customers tell us about challenges such as:

Eyes-on-glass operational style causes incidents to go undetected
Catch and dispatch means too many escalations to SMEs or routing incidents to the wrong team
Manual work drives up MTTR
L1/L2 teams experience high turnover and blame culture is common

To move beyond this, organizations can create L0 automation. This is automation that serves as the first responder, only bringing in humans when necessary. For well-understood, well-documented issues, L0 automation can auto-remediate incidents without a responder intervening. But for other more complex issues that require a hands-on approach, NOC teams can create L0 automation that immediately pulls in diagnostic information before the responder looks at an incident, routes incidents intelligently according to event data, and populates the incident notes with pertinent documentation and runbooks.
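The L0 flow described above can be sketched in a few lines. This is a minimal illustration, not PagerDuty's implementation: the `Incident` type, the `AUTO_REMEDIATIONS`, `ROUTES`, and `RUNBOOKS` tables, and the `noc-l1` fallback are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical catalog: well-understood issues mapped to an auto-remediation action.
AUTO_REMEDIATIONS = {"disk_full": "rotate_logs", "service_hung": "restart_service"}

# Hypothetical routing table and runbook index, keyed by the affected component.
ROUTES = {"database": "db-oncall", "network": "net-oncall"}
RUNBOOKS = {"database": ["runbooks/db-failover.md"], "network": ["runbooks/link-flap.md"]}


@dataclass
class Incident:
    issue_type: str
    component: str
    notes: List[str] = field(default_factory=list)
    assigned_to: Optional[str] = None
    status: str = "triggered"


def l0_first_response(incident: Incident) -> Incident:
    """Act as the first responder and bring in a human only when necessary."""
    action = AUTO_REMEDIATIONS.get(incident.issue_type)
    if action is not None:
        # Well-understood issue: remediate without a responder intervening.
        incident.notes.append(f"auto-remediation run: {action}")
        incident.status = "resolved"
        return incident
    # Otherwise prepare the incident before a responder looks at it:
    # record diagnostics, attach runbooks, and route based on event data.
    incident.notes.append(f"diagnostics collected for {incident.component}")
    incident.notes.extend(RUNBOOKS.get(incident.component, []))
    incident.assigned_to = ROUTES.get(incident.component, "noc-l1")
    incident.status = "escalated"
    return incident
```

In this sketch a `disk_full` incident is resolved with no human involved, while an unrecognized database issue reaches `db-oncall` with diagnostics and the failover runbook already in its notes.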

PagerDuty AIOps helps NOCs modernize and move away from eyes-on-glass methods. These NOCs are a center of excellence within their organizations, spearheading data-driven optimization, enabling best practices, and ensuring incident readiness.

MIM teams

When critical, customer-impacting incidents happen, you don’t have time to waste. But, with complexity and noise on the rise, how do Major Incident Management teams improve to meet growing customer expectations?

We see MIM teams with common challenges such as:

Finding out about major incidents from an overwhelming number of customers/users calling in or from delayed team escalations
Lack of context as initial triage takes too long to assess severity and business impact
Long MTTR waiting for the right people, the right diagnostics, the right runbooks, etc.
Disjointed tooling leading to communication barriers for responders and corresponding teams

MIM teams can overcome these challenges with a variety of automation and ML tactics. First, organizations can create automation that immediately routes high priority or severity incidents to a MIM team and tags in the appropriate teams needed via incident workflows. Additionally, ML can gather key context such as how rare an incident like this is, if it happened before and how it was resolved, and change events that might be correlated to the failure.
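As a rough illustration of the two tactics above, the sketch below routes by priority and then gathers context from past incidents and recent change events. It uses exact matching as a stand-in for ML-based similarity, and every name in it (`MIM_PRIORITIES`, `CHANGE_WINDOW_S`, the dictionary fields, the team names) is an assumption made for the example.

```python
# Assumption: only P1 incidents page the major incident team.
MIM_PRIORITIES = {"P1"}
# Assumption: changes within one hour before the incident count as possibly related.
CHANGE_WINDOW_S = 3600


def gather_context(incident, history, changes):
    """Summarize past occurrences, their resolutions, and recent change events."""
    similar = [
        h for h in history
        if h["service"] == incident["service"] and h["summary"] == incident["summary"]
    ]
    recent_changes = [
        c["id"] for c in changes
        if c["service"] == incident["service"]
        and 0 <= incident["at"] - c["at"] <= CHANGE_WINDOW_S
    ]
    return {
        "past_occurrences": len(similar),
        "past_resolutions": sorted({h["resolution"] for h in similar}),
        "recent_changes": recent_changes,
    }


def route_incident(incident, history, changes):
    """Send high-priority incidents to the MIM team; others go to the service's on-call."""
    if incident["priority"] in MIM_PRIORITIES:
        assigned_to = "major-incident-team"
    else:
        assigned_to = incident["service"] + "-oncall"
    return {
        "assigned_to": assigned_to,
        "context": gather_context(incident, history, changes),
    }
```

A low `past_occurrences` count signals a rare incident, and `past_resolutions` tells the responder what worked last time.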

PagerDuty AIOps helps MIM teams detect major incidents faster, improve MTTR and customer experience, and save SMEs time. This reduces the cost of each incident and mitigates risk.

Distributed service owning teams

DevOps and distributed service owning teams are under more pressure than ever to deliver exceptional customer experiences. But with competing priorities and fewer resources, this is easier said than done.

Many of our customers share challenges they are facing such as:

Disparate monitoring tools with no central pane of glass
Too much noise leading to incorrect escalations and false incidents
Lack of context and information silos
Toil and time taken away from value-add initiatives

For service owning teams looking to overcome these challenges, an AIOps tool that can aggregate data from all the monitoring sources in the technical ecosystem can help bring clarity to incident response. Additionally, with ML, teams can reduce noise by automatically grouping together alerts based on context, time, and previous event data that the model has trained on. With this and the ML-surfaced triage information, incident response is streamlined so teams can get back to innovating faster.
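To make the noise-reduction idea concrete, here is a deliberately simple, rule-based stand-in for the grouping step: alerts for the same service are merged when they arrive close together in time. A trained model would weigh far more context; the `group_alerts` function, the `service` and `at` fields, and the 300-second window are assumptions for illustration only.

```python
def group_alerts(alerts, window_s=300):
    """Group alerts per service when each arrives within window_s of the previous one."""
    groups = []
    open_group = {}  # service -> index of that service's most recent group
    for alert in sorted(alerts, key=lambda a: a["at"]):
        idx = open_group.get(alert["service"])
        if idx is not None and alert["at"] - groups[idx][-1]["at"] <= window_s:
            # Same service and close in time: treat as part of the same incident.
            groups[idx].append(alert)
        else:
            # New service, or too long since the last alert: start a new group.
            groups.append([alert])
            open_group[alert["service"]] = len(groups) - 1
    return groups
```

Each resulting group would become a single incident, so a burst of related alerts pages a responder once rather than many times.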

PagerDuty AIOps helps service owning teams spend less time firefighting, reduce MTTR, and create exceptional customer experiences. This improves culture and team retention while increasing revenue for the entire organization.

Ready to get started?

With PagerDuty AIOps, teams like the ones we looked at see 87% fewer incidents, 14% faster MTTR, and 9x faster automation adoption. This helps organizations move faster, focus on the work that matters most to customers, and reduce risk and team burnout. Best of all, teams from dev to IT can see value from PagerDuty AIOps.

PagerDuty AIOps works in conjunction with the rest of the PagerDuty Operations Cloud to help organizations manage their operations by leveraging AI and automation to supercharge their digital transformation. With over 700 integrations, GenAI capabilities, and end-to-end event-driven automation, PagerDuty gives customers a 400% ROI and the right tools to leapfrog the competition.

To try PagerDuty AIOps out yourself, you can take an interactive product tour or try us for free for 14 days.

The post Three Teams That Can Use AIOps to Work Smarter, Not Harder appeared first on PagerDuty.

3 Ways You Might Have a NOC Process Hangover by Hannah Culver

Hannah Culver — Mon, 24 Oct 2022 13:00:33 +0000

NOC, or network operation center, processes have been set in stone for decades. But it’s time for some of these processes to evolve. Digital transformation and the cloud era have led to the rise of DevOps, and with it, service ownership. Service ownership means that developers take responsibility for supporting the software they deliver at every stage of the life cycle. This brings development teams closer to their customers, the business, and the value they deliver.

It also requires a departure from the traditional NOC incident handling methods. Yet, as organizations transition towards service ownership, some old NOC processes remain. Here are three common NOC process hangovers and how to replace or update them.

Process hangover: L1 responders aren’t able to resolve issues

NOCs used to be the command center for technology issues. They functioned like a brain, sending out signals to relevant appendages. Issue with networking? Route to networking. Issue with security? Route to security. The NOC’s central function was to involve the correct SME to resolve an issue. This meant digging through spreadsheets (and sometimes physical contact books!) to figure out who was responsible for what.

When everything was on premise and in person, this made sense. There were fewer services, and incidents could be neatly separated by departments. If the database was having an issue, you could call up the database on-call responder. The responder (who would likely be in office or close enough to respond in person) could then go to the datacenter and take a look.

Now, in the remote work, cloud era, where organizations have hundreds or thousands of services maintained by dozens or even hundreds of teams spread across the globe, the rolodex method has outlived its purpose. It’s next to impossible to maintain accurate spreadsheets to know which teams are responsible for which services. And, as the organization changes, records grow stale quickly. Services can move between teams. Teams change as people move between them, or leave/join the company. Now, an L1 responder has to work too hard to identify the right person in an efficient and timely manner.

Organizations need a way to remove these manual steps to find the right person and route incidents directly to SMEs who can jump in to respond to any issues. This can happen in a variety of ways. For some organizations, a DevOps service ownership model is the right path forward. Those who write the code are assigned to respond and fix the service during an incident. The alert is routed directly to the on-call person on the development team that supports the service, and the SME takes it from there.
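The direct-routing model replaces the spreadsheet lookup with a maintained mapping from service to owning team to on-call person. A minimal sketch, assuming hypothetical `SERVICE_OWNERS` and `ON_CALL` tables and a `noc-l1` fallback for services with no known owner:

```python
# Hypothetical ownership and on-call data; in practice this would come from
# a service catalog and an on-call schedule rather than hard-coded tables.
SERVICE_OWNERS = {"checkout": "payments-team", "search": "discovery-team"}
ON_CALL = {"payments-team": "alice", "discovery-team": "bob"}


def route_alert(service, fallback="noc-l1"):
    """Return the on-call responder for the team that owns the service."""
    team = SERVICE_OWNERS.get(service)
    # Unknown services or teams without an on-call entry go to the fallback.
    return ON_CALL.get(team, fallback)
```

Because ownership lives in one table, moving a service between teams is a single update rather than a stale spreadsheet row.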

For other organizations, it might make sense to have a hybrid approach where L1 responders serve as the first line of defense before escalating to distributed, on-call teams for their services. L1 responders shouldn’t be a routing center that connects the issue with another team. Instead, they should be empowered to resolve an incident themselves. You can set up your L1 responders to be more effective by giving them the ability to both troubleshoot and selectively resolve incidents. Access to automation and resources like runbooks can empower L1 responders to accelerate diagnosis and remediation, oftentimes without needing to disrupt the subject matter experts in charge of a given service via an escalation. By putting automation in the hands of L1 responders, organizations can avoid unnecessary escalations and empower L1s to resolve issues faster.

Process hangover: Major incidents aren’t called or are called too late

We’ve heard it before: time is money. And when NOCs were the primary method of ensuring incidents were responded to, they had an additional responsibility. A NOC needed to ensure that resources were well managed. This meant no unnecessary personnel responding to problems. NOCs often took the blame if they called a major incident too soon and interrupted people for a minute problem. These disruptions took SMEs away from their work innovating. So it was crucial for NOC responders to only call major incidents when it was clear there was a much bigger issue at play.

But now, time isn’t money; uptime is money. The cost of a major incident that’s flown under the radar is larger than the cost of tagging in some extra help. Imagine you’re an online retailer and your shopping cart function is down. Every minute your customers can’t add items to their cart, you’re losing hundreds of thousands of dollars. Plus, customer expectations have increased over the last few years. Customers expect that their app, tool, platform, streaming service, etc. works without interruption. And it erodes customer trust when it doesn’t. In fact, according to PwC, 1 in 3 customers would stop doing business with a brand they loved after one bad experience.

Organizations need to call major incidents sooner to mitigate customer impact. Yes, this may mean waking someone unnecessarily once in a while. But, that’s far less likely with service ownership. SMEs responsible for a service have a better understanding of when to call a major incident than an L1 responder would. So there are fewer false alarms.

Process hangover: Come-and-go war rooms

NOCs often serve as the communication hub for a major incident. This helps responders working to resolve an issue keep on task. Back when many companies had everything (and everyone) on-premise, there was a war room. People came there and the NOC coordinator kept everyone up to date. Now, with distributed teams and systems, physical war rooms are a thing of the past. Many companies instead have virtual war rooms with a video conferencing bridge or chat channel that remains open during an incident.

Other stakeholders may want to treat this war room like a physical one, dropping in as they please. But, in this virtual world, this means that these stakeholders are asking the incident responders questions. This delays the resolution. Companies with come-and-go virtual war rooms may experience more miscommunications and frustration. Responders feel frustrated by interruptions and stakeholders feel frustrated with the lack of communication.

One way to mitigate this is to close the war room to non-participants. If someone isn’t a part of the incident response team, they don’t need access to the response team’s virtual war room. Instead, what they need is an internal liaison. This is a designated communicator from the incident response team.

The internal communication liaison consolidates incident information and relays it to relevant stakeholders. To make this easier, communication liaisons can use status update notification templates. These templates dictate how to craft communications for a specific audience. They ensure that stakeholders receive any information necessary to make decisions. And no responders have to stop working on the incident at hand to share updates.
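A status update template can be as simple as a fill-in-the-blanks string per audience. The sketch below uses Python's standard `string.Template`; the audiences, wording, and field names are hypothetical examples, not PagerDuty's templates.

```python
from string import Template

# Hypothetical templates: one per audience, each listing the fields it needs.
TEMPLATES = {
    "executives": Template(
        "[$severity] $service: $impact. Next update in $next_update_min min."
    ),
    "support": Template(
        "[$severity] $service is degraded. Customer impact: $impact. "
        "Workaround: $workaround."
    ),
}


def render_update(audience, **fields):
    """Fill in the template for an audience; raises KeyError if a field is missing."""
    return TEMPLATES[audience].substitute(**fields)
```

Because `substitute` fails loudly on a missing field, the liaison cannot send an update that omits information the audience needs.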

Hangovers aren’t fun, but they always end

NOCs are a tried and true way of managing incidents for many organizations. But NOC methods become out of date when moving into this era of digital transformation. Seamless communication and rapid response are key to preserving customer trust. Looking forward, teams will involve SMEs immediately and call major incidents sooner rather than later. They’ll also communicate with key stakeholders throughout an incident while setting boundaries.

And often teams need a digital operations platform to help support this transition. PagerDuty allows teams to bring major incident best practices to their organization, resolving critical incidents faster and preventing future occurrences. Try us for free for 14 days.

6 Ways to Modernize Your Network Operations Center (NOC) by Joseph Mandros

Joseph Mandros — Fri, 26 Feb 2021 09:00:49 +0000

4 Ways to Save Money in Digital Operations by Joseph Mandros

Joseph Mandros — Thu, 09 Jul 2020 20:00:01 +0000

Network Operations Center Best Practices and Functions by Joseph Mandros

Joseph Mandros — Tue, 09 Jun 2020 22:28:43 +0000

A network operations center (NOC) is typically defined as the centralized location for an organization’s networking team. This team typically manages company servers, firewalls, databases, and IoT devices—anything related to the company network. The support they provide includes assigning and handling customer tickets, maintaining security, monitoring various tools and alerts for quality assurance, reporting and dashboards, and more.

The primary goal of the NOC is to ensure that network uptime remains stable and error-free 24/7 and that it meets service-level agreements.

Before we dive into the best practices of the NOC (pronounced as “knock”), let’s first cover the day-to-day operations.

Managing a network operations center

While a network operations center may live in one or more locations, it functions as the central point to monitor an organization’s network landscape. This includes:

Monitoring the company network health
Responding to events, issues, and downtime
Troubleshooting connectivity issues
Ensuring the network remains secure
Managing the firewall
Monitoring communication services, including email, digital video, and VoIP
Optimizing network performance
Deploying remote software installations
Managing and deploying software patches
Backing up and storing data, including for risk and compliance purposes

What are best practices for a network operations center?

As with any IT team, clearly defining goals and breaking down siloed communications can help in managing a NOC. Consider policies that establish:

Prioritization for the most urgent events, as defined by their potential to impact business operations and visibility to both stakeholders and end-users
An efficient incident response and triaging hierarchy, which proactively designates which team member should handle P1, P2, and P3 incidents so everyone remains on the same page
Steps to outline timing on when to continue with remediation and when to escalate
Timing for events before escalation occurs (e.g., 10 minutes for P1, 30 minutes for P2, etc.)
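The escalation timing policy can be written down as data so it is applied consistently. The sketch below uses the P1 and P2 values from the example above; the 60-minute default for any other priority and the function name `should_escalate` are assumptions for illustration.

```python
# Minutes an incident may stay unresolved before escalation.
# P1 and P2 values come from the example above; other priorities use the default.
ESCALATION_TIMEOUT_MIN = {"P1": 10, "P2": 30}


def should_escalate(priority, minutes_open, resolved=False, default_timeout=60):
    """Return True when an unresolved incident has exceeded its priority's timeout."""
    if resolved:
        return False
    # default_timeout is a hypothetical value for priorities not listed above.
    timeout = ESCALATION_TIMEOUT_MIN.get(priority, default_timeout)
    return minutes_open >= timeout
```

For instance, a P1 open for 12 minutes escalates, while a P2 open for the same time stays with the current responder.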

Effectively implementing these best practices across your NOC team

Teams work best with defined roles to keep everyone in sync, which helps streamline workflows and increase efficiency when minutes count the most, such as when downtime or other incidents happen.

To encourage alignment across the business, you will want to proactively:

Talk to your NOC team – Schedule 1:1 meetings at timely intervals. These should be casual conversations. Ask how operations could be improved. Listen. Discover team members’ strengths and weaknesses. The purpose of these meetings isn’t to focus on the negative, but rather to find out who’s best at what role, which will help streamline processes at your network operations center.
Keep up the training – In the world of network management, change is constant. Managing high-level incidents (i.e., those with the highest potential business impact) in a timely manner demands equally high-level training. Asking your team to keep up during off-hours isn’t realistic. Consider providing paid training in a phased process, so your operations center is always fully covered. This training benefits both you and the team members who keep things running smoothly behind the scenes.
Automate repetitive tasks – This can prove invaluable in reducing alert fatigue. No matter how highly trained your team may be, the right automation can detect and analyze incidents faster, with greater visibility into the data before, during, and after an incident, helping you maintain resilience for mission-critical services.

Managing an effective network operations center requires robust communications across your environment. The ability to have full visibility into incidents is always high on the list because it helps detect issues, enabling teams to proactively prevent incidents that could lead to network downtime and broken SLAs. Following the best practices mentioned in this article can help.

You can further empower your network operations response team by helping them resolve issues with Event Intelligence, which blends machine and human telemetry to provide teams with real-time data, enabling you to cut through the noise by providing the exact machine and human context your responders need.

See for yourself with a 14-day free trial today. No credit card is required.

Announcing PagerDuty’s Solution for HybridOps by Julian Dunn

Julian Dunn — Mon, 06 May 2019 18:00:17 +0000

What Is HybridOps?

For years, traditional infrastructure provisioning and management followed a specific operating model that depended on Network Operations Centers (NOCs) to process operational events. As enterprise companies started to undergo digital transformation, the cloud created a different operating model: One that was much more agile and, some would argue, more efficient—and would replace all other operating models to create IT homogeneity.

This hope, however, was unrealistic—particularly when you take into account how IT works in most large enterprises. Ever-changing customer demands and a combination of cloud technologies, legacy on-premises solutions, and microservices mean that today, no “single” IT department exists. Instead, business needs have led to a proliferation of operating models, which we at PagerDuty call hybrid operations, or HybridOps.

HybridOps is an operating model that consists of many people and teams doing a variety of different functions, with the main goal of keeping up with the speed of business. At its core, HybridOps is the ability to orchestrate intelligent, real-time response across distributed teams—including NOCs, DevOps, security, and customer support—that span different operating models. It uses data, analytics, and automation to break down silos and empower teams to focus on delivery at the speed of business.

HybridOps already exists today at most large organizations, whether they know it or not. To create greater alignment with the business and happier customers and teams, these organizations should embrace HybridOps by letting teams choose the way they want to work, with the autonomy to use the tools they’re comfortable with, while holding them accountable to business-impacting outcomes rather than output-oriented metrics.

Embracing HybridOps at Your Organization

To successfully and effectively orchestrate people and teams, modern enterprises must provide visibility across complex and far-reaching technology stacks, and balance that with investments in both centralized operations teams and DevOps culture and agility. So how can you achieve that at your organization?

After more than 10 years of working with sector leaders undergoing digital transformation, PagerDuty combed through both quantitative and qualitative data and found that high-performing HybridOps organizations share the following characteristics:

Seamless communication among multiple teams and the use of automation like machine learning to orchestrate sophisticated incident response
Using data platforms that encourage information sharing and continuous improvement
A culture that promotes trust among teams with different operating models

PagerDuty for HybridOps

A platform like PagerDuty can act as the hub for real-time operations at the center of enterprise HybridOps operating models, which helps break down silos between ITOps, DevOps, and other teams using bi-directional integrations and easy-to-deploy machine learning capabilities. It can also provide insights into the health of critical business services—insights that keep operations running by enabling teams and people across the organization to effectively orchestrate a response to quickly resolve issues so they can spend more time focusing on enhancing the customer experience.

Additionally, PagerDuty can help connect disparate teams by allowing technical and business leaders and other stakeholders to use the tools of their choice to access real-time insights into incidents and how they impact the business. For example, the newest version of the ServiceNow integration enables multiple ITSM and NOC teams to coordinate a real-time response without leaving the ServiceNow interface, while the Slack integration allows DevOps teams to drive complex actions from within a ChatOps interface.

With the right platform in place to facilitate HybridOps, teams, regardless of the operating models they use, can focus on resolving incidents as quickly as possible and keep up with the speed of business.

Interested in learning more? Check out our HybridOps white paper for best practices and tips on how you can work effectively with HybridOps in your organization.

The Future of the NOC by David Hayes

David Hayes — Tue, 21 Nov 2017 13:00:14 +0000

One of the best things about working at PagerDuty is that our customers, our users, our champions, and our buyers are all the same people. With this year’s push into major incident response, we’ve spent a lot of time talking to Network Operation Centers (NOCs) about what the future holds for them.

Every job changes with new technology — some, like long-distance trucking will be completely disrupted by self-driving trucks — but after all the discussions we’ve had with the best NOCs around, it looks like their evolution will be significant but manageable.

I’ve always thought about PagerDuty as helping your Mean Time To Promotion. In keeping with that, here are some of the possible futures we see for NOCs.

Site Reliability Engineer

One of the most straightforward paths is towards becoming a Site Reliability Engineer (SRE).

If you want a job doing this, you need all the troubleshooting skills of a systems admin, layered on with a deep understanding of monitoring. The goal of an SRE is to detect glitches before they develop into problems that users can notice. And if that doesn’t work, SREs move heaven and earth to get everything back online. You’ll frequently see SRE positions at big cloud or online companies, like Amazon, Google, Heroku, and even Etsy. People get really cranky if they can’t buy things immediately, and SREs are there to make sure they can.

SREs keep the world online (ok, that’s kind of a big ask). As an SRE, you would work with a team to predict needs and build scale in a way that is fluid and invisible from the front end. Site Reliability Engineering is the art of never letting the user see you sweat, as a company. You’re working to make sure there is always enough capacity, enough uptime, enough pipe, and enough monitoring to make sure something isn’t falling apart invisibly.

Instead of firefighting, you want to be a building inspector, designing wider hallways, doors that always swing out, and multiple staircases (metaphorically). It may look heroic to jump in with a fire ax and a hose and tear down doors and fight flashovers, but it’s better to never need the heroics if you have smart policies around building materials and building sprinklers.

Ops becomes QA

Historically, quality assurance (QA) at software companies has had an unfair reputation. In fact, there are lots of great companies like Microsoft where there’s a parallel track for Software Development Engineers in Test (SDET). Click testing has long since become automated unit tests which are now automated click & API tests against the staging server.

Operations and QA are the formalizations of, “Eek! Things are broken.” If you have a solid QA team checking things in test before you deploy, there are far fewer surprise outages. If you have an Operations team, they design and build things mindfully, considering risk and performance, rather than simply installing and hoping things work right.

At their core, DevOps and Operations are about getting servers or containers to meet the “three R requirements”:

Reliable: stays up or fails over to something else gracefully
Replaceable: you can start a new instance of the server with no special steps
Routine: server provisioning and decommissioning should be so easy that you can create a web form to do it

To me, that also sounds a lot like QA.

DevOps means if something broke and woke you up, you are empowered to write the test that ensures it never makes it to production again — you’re already the best part of QA.

As you get better at preventing downtime or outages and streamlining requests, you can scale volume more easily because you’re not responding to one-off requests. Think about the difference between manually resetting user logins and offering an automated system to do it. You may spend the same amount of time fixing user login problems, but for ten to twenty times as many users.

NOC as point to all of tech

One of my favorite NOCs I’ve visited is at a telecommunications company in Los Angeles. It’s a classical NOC with an unconventional feel. Starting from the massive wall of dashboards, the room is arranged in rows, with each row representing a promotion in their operations org. Promotions come on average 6-12 months apart, with clear milestones, and the path can end in the back row (as a de facto SRE) or lead into other parts of the org. With so many companies lamenting how hard it is to find talent these days, I expect this will become more common.

At PagerDuty we treat our support team in much the same way: employees in our support org have gone on not only to become managers or take more technical roles inside that org, but also to join the engineering, marketing, and sales teams, and I don’t see any sign of that stopping (unsurprisingly, this makes it easier for us to hire great people).

Change isn’t always bad, but it always comes

Predictions are hard, especially about the future; but it’s clear that the future of the NOC will not be humans watching screens waiting to press buttons. For many classes of always-on applications, it will still make sense to keep people ready to jump into action — the question is what to do with the other 99% of their time.

The NOC has undergone quite a bit of change in recent years and will continue to do so. Those that adapt to the changing digital landscape will position themselves for success, and we look forward to navigating that transition with you.

The Transformers by Rachel Obstler

Rachel Obstler — Tue, 16 May 2017 13:00:15 +0000

I recently had the privilege of spending a full day with a small group of our customers. The attendees were leaders in their development and IT operations organizations and spanned a wide variety of industries, including technology, media, finance, retail, healthcare, and more. Every single one of them is a recognized leader in their space. One of the questions we asked our customers was, “Do you have any specific transformation programs underway or planned?” And indeed, most customers had a story of their transformation journey.

Transitioning to Cloud

The largest and most established companies talked about transitioning to the public cloud — and for these companies, in particular, it’s no easy task. Some have industry regulations to contend with, while others are faced with years of traditional IT processes, rules, and habits to change. Their transition will be painful, potentially with multiple years of an in-between hybrid state with cloud applications that depend on applications still running on internal infrastructure, and vice versa.

The Role of Central Operations

Others talked about the modernization or automation of the NOC and central operations functions. These plans ranged from ensuring the NOC could add value to independent DevOps teams sprouting up across the organization, to fully transforming the role of operations to providing the platform that enables developers to not just write but also operate their code in production.

Monoliths to Microservices

Refactoring code from monoliths to microservices was another theme. These were typically companies already in the cloud, purpose built for SaaS, but with code origins from more than five years ago, before cloud, containers, or deployment automation were established technologies. Some of them had already gone through significant transformations, or were what are commonly called “digital natives” (born in the cloud), fully utilizing DevOps best practices that many had adapted and improved for their own purposes.

Digital Operations Transformation

While these customers have varied situations and different plans, they are all focused on improving agility to support greater innovation in their customer-facing digital services. There is no great surprise that the IT industry is undergoing tremendous change at a rapid pace, and these companies are all adapting accordingly. They are moving to a developer ownership model and adopting DevOps practices because they believe (as do we) that this is the best way to both ensure high quality customer experiences and maximize innovation.

Another common element across all of these transformation stories stood out: it not only takes the right tools to support digital operations transformation, but also the right culture of collaboration, knowledge sharing, learning, and improving. Tools by themselves can’t change a culture, but they can support and promote a culture of learning and improving. One of our customers alluded to his transformation as a “hearts and minds” effort as much as anything else. This sentiment was perhaps said most succinctly in Bridget Kromhout’s post, “Containers Can’t Fix Your Broken Culture.”

Our goal at PagerDuty is to build products that help customers implement best practices used by thousands of the most operationally mature teams, such as enabling individual teams to operate autonomously while ensuring consistency with well-defined processes around major incident response. We look forward to supporting and accelerating our customers’ transformation efforts by making it easier than ever to adopt modern incident management and DevOps practices.

The post The Transformers appeared first on PagerDuty.

Twitter Killed The Call Center by Zachary Flower

Zachary Flower — Thu, 02 Feb 2017 12:00:20 +0000

Why External Variables Matter in Incident Management

When it comes to incident management, it’s easy to fall into an insular mindset. We spend months planning and configuring systems that alert us to any issues within the system, and to cover our bases, we establish traditional customer support channels to identify issues we don’t catch ourselves. While this train of thought isn’t wrong, it overlooks a growing trend: users now report the issues they are experiencing in public forums like Twitter and Facebook.

Social Media Paving a New Path

Social media has become a great way for organizations to connect directly with their users in a casual setting, which has opened up the door for closer two-way communication. Beyond simply being a more personal way to communicate with an organization, social media has also become a more effective way to get help.

Because of the public nature of the medium, users have found that when they aren’t able to get help via traditional routes, such as a phone call or support ticket, sending a frustrated tweet can often yield much quicker results. I can personally attest to the effectiveness of using social media when all other avenues have failed. It’s not just about fielding complaints, though. Many users prefer to use social media to communicate with organizations because it makes them feel like their input is heard.

I’ve personally shot a few courtesy tweets to companies when something looks wrong.

Is their site throwing 500 errors? Send a tweet.
Is there a wonky JavaScript interaction that makes the experience less than stellar? Send a tweet.

Sure, it’s probable that they are aware of the issue, but as a developer, I know that I appreciate receiving a heads-up on problems I might not be aware of. “If you see something, say something,” applies perfectly to this type of situation.

Synchronous vs. Asynchronous

While the “reporting” aspect of social media support is incredibly powerful, there’s more to it than ease of use. Social media platforms — Twitter in particular — have a “real-time” feel to them.

Email, for example, is an asynchronous communication medium which, in the context of customer support, means that your issues are added to a queue. There’s no sense of urgency, and the non-disruptive experience can leave you feeling like you’re waiting in line at the DMV. Twitter, on the other hand, has a more synchronous flow to it. While it’s not exactly a real-time support channel like a phone call or live chat, there’s more of a sense of urgency associated with it than email because the company’s time to response is recorded publicly. Feedback is more immediate and personable, which makes your issue feel just as important as anyone else’s.

Separating the Needles from the Haystack

So how do we stay on top of external variables without getting buried in the non-incident management-related input? At a high level, the solution to this problem is the same as any support channel: triage and delegate.

Triage

Because social media isn’t intended to be used solely for customer service, accounts can get jammed with non-support chatter. It is important to physically isolate the bug reports from the rest of the feed in order to effectively respond to them. Beyond simply isolating the bugs from the non-bugs, identifying commonalities between the reports is also key. This will allow you and your team to detect patterns and escalate issues that affect multiple users before they get out of hand. The Operations Command Console, for instance, can correlate data sources such as tweets to a specific event in your infrastructure as well as visualize the blast radius. This way, you can understand if a failed deploy or outage directly led to a customer reaction on social, and if so, the extent of the customer impact.
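
As a rough illustration of the two triage steps described above (isolating issue reports, then spotting commonalities), here is a minimal Python sketch. It is not how the Operations Command Console works; the keyword list, the `triage` function, and the `min_reports` threshold are all assumptions made for the example, and a real feed would need a far richer vocabulary.

```python
import re
from collections import Counter

# Hypothetical keyword list; tune it to the words your users actually use.
ISSUE_PATTERN = re.compile(
    r"\b(down|outage|error|broken|bug|500|timeout)\b", re.IGNORECASE
)


def is_issue_report(text: str) -> bool:
    """True if the post mentions at least one issue keyword."""
    return bool(ISSUE_PATTERN.search(text))


def triage(posts: list[str], min_reports: int = 2) -> tuple[list[str], list[str]]:
    """Return (issue reports, keywords shared by at least min_reports posts)."""
    reports = [p for p in posts if is_issue_report(p)]
    counts: Counter[str] = Counter()
    for report in reports:
        # Count each keyword once per post so a single long rant
        # cannot create a pattern on its own.
        counts.update({m.lower() for m in ISSUE_PATTERN.findall(report)})
    common = sorted(k for k, n in counts.items() if n >= min_reports)
    return reports, common


if __name__ == "__main__":
    feed = [
        "Love the new dashboard!",
        "Getting a 500 error on checkout",
        "Is the site down? 500 everywhere",
        "Great webinar today",
    ]
    reports, common = triage(feed)
    print(reports)  # the two posts that mention issue keywords
    print(common)   # keywords that more than one user reported
```

Keywords that show up across several independent reports are the ones worth escalating first, since they suggest an issue affecting multiple users rather than a one-off complaint.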

Delegate

Once an issue has been identified and isolated from the rest of the group, it needs to be assigned to the appropriate person or team. This can be accomplished in a myriad of ways, either by simply forwarding the report to your existing help desk application, or using a more dedicated social media support application.

No longer solely owned by marketing departments, social media platforms are an excellent way to get real-time information about trouble within your application infrastructure, and an absolutely essential source of data in today’s digital world to provide complete visibility. By giving these platforms the same level of process and dedication as more traditional support channels, you can respond to incidents as the initial reports come in and maximize customer happiness, rather than after a problem has grown into something more damaging.

While the call center isn’t necessarily dead, in order to ensure happy customers, organizations must be proactive instead of reactive in leveraging the wealth of customer data that is now available through a public medium such as social. Those who include social media as a critical component of their monitoring and incident management strategy will have deeper visibility into the quality of their users’ experience and reap the benefits of improving customer loyalty.

The post Twitter Killed The Call Center appeared first on PagerDuty.

Rethink. Become a Modern NOC. by Tony Albanese

Tony Albanese — Tue, 18 Feb 2014 17:03:40 +0000

It’s easy to feel underutilized as an engineer working in a NOC. Especially in larger organizations, you may find yourself siloed into owning highly specific responsibilities.

At PagerDuty, we don’t believe that any engineer should sit around, wasting time, watching lines on graphs move up and down. You’re too smart to waste your talents. Instead, whenever you are on the clock you should be an integral part of your team, pushing the needle towards the future.

If you ever find yourself frustrated that you are sitting idly around waiting to make phone calls while your company’s servers catch on fire, we encourage you to actively rethink your role and start breaking down the silos in your organization.

Be a Resource, Not a Call Center

As an engineer in a NOC, you have the unique ability to touch several areas in an organization and play a vital role in the exchange of information. You can help solve problems faster and grow your business. But this is nearly impossible to accomplish if you spend your time dialing phones instead of utilizing your skills.

Human Brute Force, Meet Automation Accuracy

You are the first line of defense when an incident occurs within your infrastructure, which is an incredibly noble role to play. As the first person to encounter an issue in your company’s infrastructure, you can do one of two things: take action and coordinate your team’s knowledge and efforts to fix the problem, or simply make phone calls to find someone to research and resolve the incident.

Spending time flipping through lists of team members can be excruciating, especially when contact information isn’t up to date. With the right tools you can automate the painful part of escalating incidents to your on-call teams. These basic tasks can easily be eliminated with automation to increase the value you provide.
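
To make the idea of automated escalation concrete, here is a minimal Python sketch of walking an on-call chain until someone acknowledges. It is an illustration only, not PagerDuty’s implementation: the `Responder` type, the `escalate` function, and the `notify` callback (which stands in for “send a page and wait for an acknowledgement”) are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Responder:
    name: str
    contact: str


def escalate(
    chain: list[Responder],
    notify: Callable[[Responder], bool],
) -> Optional[Responder]:
    """Notify responders in order; return the first one who acknowledges.

    `notify` is a hypothetical callback that pages one responder and
    returns True if they acknowledged before a timeout.
    """
    for responder in chain:
        if notify(responder):
            return responder
    # Nobody acknowledged: the caller decides what happens next.
    return None


if __name__ == "__main__":
    on_call = [
        Responder("Ana", "ana@example.com"),
        Responder("Bo", "bo@example.com"),
    ]
    # Simulated acknowledgement: only Bo answers the page.
    owner = escalate(on_call, notify=lambda r: r.name == "Bo")
    print(owner)
```

Keeping the chain as data means that fixing out-of-date contact information is an edit in one place rather than a phone list to flip through during an incident.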

Empower Yourself, Increase Everyone’s Productivity

By supplying vital information that will help resolve incidents you are ultimately empowering yourself and taking initiative that will surely be noticed. Not to mention, you will probably be exponentially happier after knowing you have positively impacted the success of your company.

Supplying the on-call engineer responsible for an incident with the correct runbook, or being able to identify a network latency or load balancer issue between sites that wasn’t apparent from your monitoring tool’s report, can greatly reduce an incident’s mean time to repair (MTTR). This assistance will be invaluable and greatly appreciated by your team.

Filter, With a Human Touch

At PagerDuty our alert bundling and deduplication is one of our most valued feature sets to eliminate alert fatigue. But tools are only as smart as their users. As an engineer in a NOC you have the unique ability to apply a human touch to filter alerts for your company’s on-calls.

As the first line of defense to encounter these alerts, you have a unique single view of your system from which to determine whether or not action is required. In response, you can adjust the thresholds that are required to trigger an event. You have the power to make automation tools, like PagerDuty, smarter without having to unnecessarily wake up anyone on your team.

By implementing this severity-based alerting for your on-calls, non-critical alerts that occur at 2:00 AM can wait. Your team will thank you for letting them handle the issue in the morning, instead of being woken up in the middle of the night. And by prioritizing your time, you can help resolve incidents before anyone even notices, alleviating some of your teams’ headaches.
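
The severity-based rule described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a PagerDuty feature or API: the `routing_decision` function, the severity labels, and the 22:00 to 07:00 quiet window are all hypothetical choices.

```python
from datetime import datetime, time


def routing_decision(
    severity: str,
    now: datetime,
    quiet_start: time = time(22, 0),  # assumed start of quiet hours
    quiet_end: time = time(7, 0),     # assumed end of quiet hours
) -> str:
    """Return 'page', 'notify', or 'defer' for an alert.

    Critical alerts always page. Anything else is deferred during
    quiet hours and sent as a low-urgency notification otherwise.
    """
    t = now.time()
    if quiet_start <= quiet_end:
        in_quiet_hours = quiet_start <= t < quiet_end
    else:
        # The quiet window wraps past midnight (e.g. 22:00 to 07:00).
        in_quiet_hours = t >= quiet_start or t < quiet_end

    if severity == "critical":
        return "page"
    return "defer" if in_quiet_hours else "notify"


if __name__ == "__main__":
    two_am = datetime(2024, 1, 1, 2, 0)
    print(routing_decision("warning", two_am))   # non-critical at 2:00 AM
    print(routing_decision("critical", two_am))  # critical at 2:00 AM
```

Deferred alerts still need somewhere to go, such as a morning queue, so that waiting until daylight does not turn into never being handled.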

The post Rethink. Become a Modern NOC. appeared first on PagerDuty.