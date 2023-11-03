LONDON, ENGLAND – MAY 11: Fuel pumps with “sorry out of use” signs on a petrol station forecourt on May 11, 2022 in Kingston upon Thames, London, England. (Photo by Peter Daisley/Getty Images)

Websites fail. We have all come across websites, software and services that have failed to work, jammed in one form or another, hung or appeared to be corrupted in various ways. It’s easy to blame the cloud and the hyperscaler Cloud Services Provider (CSP) organizations that provide these services, or to point to some gremlin-like anomaly that has created a bug somewhere down the pipe.

In enterprise software application development and the related field of data science, we refer to these events as ‘SaaS outages’, when one or more software-as-a-service (SaaS) functions fail to exit the datacenter and cross the expanse of the cloud and web to reach us on our mobile device, desktop or other machine.

When is an outage not an outage?

But not every SaaS outage is actually an outage. According to Mike Hicks, principal solutions analyst at Cisco ThousandEyes, the inherent internal complexity of cloud applications means we must look at what is actually happening with any given cloud service and application before deciding who is to blame, what went wrong and how to fix it. It is clear that as SaaS adoption has increased, these applications have become more complex and distributed in nature.

“Today, most applications rely on a vast web of interdependencies to function,” Hicks reminds us. “If one of these dependencies [sections, components or libraries of software code required by another part of the software code structure in order to function], for example something like a search function, is interfered with through updates or planned maintenance, it can actually create a single point of failure that can render an application unusable. This is why single points of failure (SPOF) in applications are often confused with outages.”

This means that when a small part of the now very cloud-interconnected codebase experiences an anomaly (say an update was installed but an incomplete set of software code was distributed, or the code is unavailable for some reason), it is not a question of ‘the cloud being down’; it is a more subtle internal cause of what looks like total disconnection. Hicks says the recent disruptions at Slack and X are good examples of this. The inability to send or load messages on Slack and brief server timeouts on X were initially thought to be the result of back-end connectivity issues. However, after taking a closer look, the Cisco ThousandEyes team says it saw that the user disruption was apparently being caused by bug fixes and system changes to certain functions within these services.
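
To make the single-point-of-failure idea concrete, here is a minimal Python sketch. The dependency map and every component name in it are hypothetical and are not drawn from Slack, X or Cisco ThousandEyes; the point is only to show how one small failing function can ripple up to the whole application while the hosting infrastructure stays online.

```python
from collections import deque

# Hypothetical dependency map: each component lists the components it
# cannot function without. All names are illustrative only.
DEPENDS_ON = {
    "web_app": ["auth", "messaging"],
    "messaging": ["search", "storage"],
    "auth": ["storage"],
    "search": [],
    "storage": [],
}


def impacted_by(failed, depends_on):
    """Return every component that breaks, directly or transitively, when `failed` is down."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted


def single_points_of_failure(entry_point, depends_on):
    """List components whose failure alone takes down `entry_point`."""
    return sorted(
        name for name in depends_on
        if name != entry_point and entry_point in impacted_by(name, depends_on)
    )


if __name__ == "__main__":
    print(impacted_by("search", DEPENDS_ON))
    print(single_points_of_failure("web_app", DEPENDS_ON))
```

In this toy map nothing has a redundant alternative, so every component beneath `web_app` counts as a single point of failure. A real dependency graph would also need to model fallbacks and optional dependencies.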

“Overcoming this issue starts with engineers and IT teams being able to see the bigger picture and see how planned maintenance work from other teams may impact an application down the road. This is hard to do when you don’t own the infrastructure. Without the right tools to overcome it, IT teams are often faced with an onslaught of recurring issues with widespread user-facing impacts,” Hicks explained.

Havoc of outages

Through its own research at Cisco ThousandEyes, the team notes that these SaaS disruptions (perhaps a slightly softer term than ‘outage’, as SaaS vendors would argue that many other services are up and online) are becoming increasingly common as businesses grow more dependent on cloud applications, a positive development in one sense, but one that also creates a greater level of complexity overall. Without proper visibility into and understanding of these types of system disruptions, these issues will continue to impact business performance.

“Some of the most business-critical applications in an organization today are SaaS apps. Nowadays, even the most traditional on-premises applications have transitioned, or are starting to transition, to SaaS-based offerings. This trajectory is very good and there is no question about cloud-powered apps outperforming legacy apps,” enthused Hicks. “But we rely on apps that are served from SaaS network infrastructure that the enterprise itself doesn’t actually ‘own’, because it is maintained by external service providers outside the company’s perimeter of control. These applications also sit on cloud-based and connected Internet networks that the organization cannot see. So then, how do you troubleshoot outages and disruptions that are impacting your users, whether they are employees or customers?”

Speaking from experiences gained through customer interactions occurring at exactly this level, Hicks points to the ‘status page’ as a good place to start analyzing the status of any SaaS app. It is here that we can find a long list of specific services like login information, application programming interfaces (APIs), messaging protocols, etc.

Drowning in a ‘sea of green’ indicators

“But, as I’m sure any technology operator has experienced, it is not unusual to encounter a status page that declares all services are online and working, displaying a ‘sea of green’ indicators, despite growing complaints from users and the obvious presence of issues. Why so? This is because it is in the ‘stitching’ within the distributed architecture powering the SaaS app where many problems arise,” Hicks explained. “In summary, monitoring SaaS app infrastructure is important, but you need to consider the completeness of the service delivery chain.”
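
A rough way to picture the gap Hicks describes is to set the vendor’s component-level status next to the outcome of a full user journey. The Python sketch below is illustrative only: the status values, step names and the `diagnose` helper are all assumptions made for this example, not part of any real status page API.

```python
def diagnose(status_page, journey_results):
    """Compare vendor-reported component status with the steps of an end-to-end user journey.

    status_page: dict mapping component name to a reported state string.
    journey_results: list of (step_name, succeeded) tuples, in order.
    """
    all_green = all(state == "operational" for state in status_page.values())
    failed_steps = [step for step, ok in journey_results if not ok]

    if not failed_steps:
        return "healthy"
    if all_green:
        # Every component reports fine, yet the journey breaks: the problem
        # is likely in the 'stitching' between components.
        return "sea of green, but users fail at: " + ", ".join(failed_steps)
    down = sorted(name for name, state in status_page.items() if state != "operational")
    return "reported issue in: " + ", ".join(down)


if __name__ == "__main__":
    # Hypothetical data: the status page is entirely green...
    status = {"login": "operational", "api": "operational", "messaging": "operational"}
    # ...but a scripted journey fails when sending a message.
    journey = [("sign in", True), ("open channel", True), ("send message", False)]
    print(diagnose(status, journey))
```

The design choice here is simply to treat the user journey, rather than any individual component, as the source of truth for whether the service is working.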

The scenario described here is certainly real; companies like Cisco ThousandEyes would not exist if we did not need this kind of network intelligence and observability control to ‘see’ the increasingly abstract world of cloud virtualization.

The company itself works to monitor network infrastructure, troubleshoot application delivery and map Internet performance, all from its own SaaS-based platform (which presumably has its own mapping controls turned back on itself to ensure that the mapping process itself doesn’t break very often, if at all). Within its toolbox, Cisco ThousandEyes is able to simulate the experience of real users with technologies that observe interactions such as page loading and analyze multi-step transactions made by users. This enables its engineers to display snapshots, perform service segmentation and present detailed waterfalls (a waterfall chart is a time-based representation of data that displays relationships between events) and performance metrics.

Agreeing with many of these comments, but preferring to make the argument more comprehensive and clear, is Roman Spitzbart, VP EMEA Solutions Engineering at integrated observability and security platform company Dynatrace. Encouraging us to think about system health in the broadest sense possible, Spitzbart says that synthetic testing and real user monitoring capabilities are certainly required for IT operations teams to be able to understand and manage the experience provided by their SaaS applications.
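
For readers unfamiliar with waterfalls, the Python sketch below times a scripted multi-step transaction and draws a plain-text waterfall from the results. It is a generic illustration under stated assumptions, not a description of how Cisco ThousandEyes works: the function names are invented for this example and the steps are stand-in callables rather than real browser interactions.

```python
import time


def run_transaction(steps):
    """Run each (name, callable) step in order and record (name, start, duration) in seconds."""
    spans = []
    origin = time.perf_counter()
    for name, action in steps:
        start = time.perf_counter()
        action()
        end = time.perf_counter()
        spans.append((name, start - origin, end - start))
    return spans


def render_waterfall(spans, width=40):
    """Draw one bar per step, offset by its start time and scaled to `width` characters."""
    total = max(start + duration for _, start, duration in spans)
    scale = width / total if total > 0 else 0
    lines = []
    for name, start, duration in spans:
        offset = int(start * scale)
        length = max(1, int(duration * scale))  # always show at least one mark
        lines.append(f"{name:<12}|{' ' * offset}{'#' * length} {duration * 1000:.0f} ms")
    return "\n".join(lines)


if __name__ == "__main__":
    # Stand-in steps: sleeps simulate page loads of different lengths.
    demo_steps = [
        ("load page", lambda: time.sleep(0.05)),
        ("sign in", lambda: time.sleep(0.02)),
        ("search", lambda: time.sleep(0.10)),
    ]
    print(render_waterfall(run_transaction(demo_steps)))
```

Each bar starts where the previous step ended, so a slow step in the middle of a journey stands out immediately, which is the relationship between events that a waterfall is meant to expose.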

Dynatrace introduced these capabilities to its own platform in 2019 to enable IT departments to monitor SaaS application performance through users’ web browsers. It shed light on what was previously a ‘black box’ understood only by cloud service providers.

“However, these capabilities are only effective when they are part of a unified approach to monitoring. Fragmented monitoring is old world,” Spitzbart said definitively. “It doesn’t work if you only look at SaaS application performance in isolation, because there are countless other factors that can impact users’ experience. This can include anything from their network connection to whether the browser they are using is up to date, or whether functionality they are using is dependent on a third-party plug-in from another service provider.”

“To chart a clear path through this complexity, SaaS application monitoring needs to be included as part of an integrated approach to observability and security,” Spitzbart said. “Without it, IT teams will rely on piecing together insights from multiple monitoring tools to find the answers they need to effectively understand and manage the user experience for their SaaS applications.”

Is your cloud down? Maybe, but probably not; hyperscalers are really good at cleaning, balancing and strengthening SaaS delivery pipes. Any machine has plenty of ghosts in it, so it’s worth taking a deeper look inside first.