

Performance unpredictability is a major roadblock towards cloud adoption, and has performance, cost, and revenue ramifications. Predictable performance is even more critical as cloud services transition from monolithic designs to microservices. Detecting QoS violations after they occur in systems with microservices results in long recovery times, as hotspots propagate and amplify across dependent services.


Distributed tracing is a core component of cloud and datacenter systems, and provides visibility into their end-to-end runtime behavior. To reduce computational and storage overheads, most tracing frameworks do not keep all traces, but sample them uniformly at random. While effective at reducing overheads, uniform random sampling inevitably captures redundant, common-case execution traces, which are less useful for analysis and troubleshooting tasks. In this work we present Sifter, a general-purpose framework for biased trace sampling. Sifter captures qualitatively more diverse traces by weighting sampling decisions towards edge-case code paths, infrequent request types, and anomalous events. Sifter does so by using the incoming stream of traces to build an unbiased low-dimensional model that approximates the system's common-case behavior; it then biases sampling decisions towards traces that are poorly captured by this model. We have implemented Sifter, integrated it with several open-source tracing systems, and evaluated it with traces from a range of open-source and production distributed systems. Our evaluation shows that Sifter effectively biases towards anomalous and outlier executions, is robust to noisy and heterogeneous traces, is efficient and scalable, and adapts to changes in workloads over time.
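
To make the sampling mechanism concrete, below is a minimal sketch of loss-biased trace sampling in the spirit of Sifter. It assumes traces have already been reduced to fixed-length feature vectors and uses a windowed top-k PCA as the low-dimensional "common case" model; Sifter's actual model, feature extraction, and probability heuristic differ, so the class name and parameters here are illustrative assumptions rather than the paper's implementation.

import numpy as np
from collections import deque

class LossBiasedSampler:
    """Keep traces in proportion to how poorly a low-dimensional model explains them."""

    def __init__(self, dim, k=4, window=512, refit_every=128, budget=0.05, rng=None):
        self.k = k                                 # rank of the common-case model
        self.budget = budget                       # target overall sampling rate
        self.refit_every = refit_every
        self.window = deque(maxlen=window)         # recent traces used to refit the model
        self.recent_losses = deque(maxlen=window)  # losses used to normalize probabilities
        self.mean = np.zeros(dim)
        self.components = None                     # top-k principal directions, or None
        self.seen = 0
        self.rng = rng or np.random.default_rng()

    def _refit(self):
        # Rebuild the low-dimensional common-case model from the recent window.
        X = np.asarray(self.window)
        self.mean = X.mean(axis=0)
        _, _, vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.components = vt[: self.k]

    def _loss(self, x):
        # Reconstruction error: how badly the common-case model explains this trace.
        if self.components is None:
            return 1.0
        centered = x - self.mean
        recon = self.components.T @ (self.components @ centered)
        return float(np.sum((centered - recon) ** 2))

    def observe(self, features):
        """Return True if this trace should be kept (sampled)."""
        x = np.asarray(features, dtype=float)
        loss = self._loss(x)
        self.window.append(x)
        self.recent_losses.append(loss)
        self.seen += 1
        if self.seen % self.refit_every == 0 and len(self.window) > self.k:
            self._refit()
        # Spread the sampling budget in proportion to loss: poorly-modelled
        # (edge-case) traces get a higher probability, common-case traces a lower one.
        total = sum(self.recent_losses) or 1.0
        p = min(1.0, self.budget * len(self.recent_losses) * loss / total)
        return self.rng.random() < p

The design point this sketch tries to capture is that the expected keep rate stays near the configured budget while edge-case traces consume most of that budget, and the model is refit continuously so the notion of "common case" tracks workload changes.
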
Cloud services have recently started undergoing a major shift from monolithic applications to graphs of hundreds or thousands of loosely-coupled microservices. Microservices fundamentally change many of the assumptions current cloud systems are designed with, and present both opportunities and challenges when optimizing for quality of service (QoS) and cloud utilization. In this paper we explore the implications microservices have across the cloud system stack. We first present DeathStarBench, a novel, open-source benchmark suite built with microservices that is representative of large end-to-end services, modular, and extensible. DeathStarBench includes a social network, a media service, an e-commerce site, a banking system, and IoT applications for coordination control of UAV swarms. We then use DeathStarBench to study the architectural characteristics of microservices, their implications in networking and operating systems, their challenges with respect to cluster management, and their trade-offs in terms of application design and programming frameworks. Finally, we explore the tail-at-scale effects of microservices in real deployments with hundreds of users, and highlight the increased pressure they put on performance predictability.

Today's distributed tracing frameworks only trace a small fraction of all requests. For application developers troubleshooting rare edge cases, the tracing framework is unlikely to capture a relevant trace at all, because it cannot know which requests will be problematic until after the fact. Application developers thus depend heavily on luck. In this paper, we remove the dependence on luck for any edge case whose symptoms can be programmatically detected, such as high tail latency, errors, and bottlenecked queues. We propose Hindsight, a lightweight and always-on distributed tracing system in which each constituent node acts analogously to a car dash-cam that, upon detecting a sudden jolt in momentum, persists the last hour of footage. Hindsight implements a retroactive sampling abstraction: when the symptoms of a problem are detected, Hindsight retrieves and persists coherent trace data from all relevant nodes that serviced the request. Developers using Hindsight receive the exact edge-case traces they desire; by comparison, existing sampling-based tracing systems depend wholly on serendipity. Our experimental evaluation shows that Hindsight successfully collects edge-case symptomatic requests in real-world use cases. Hindsight adds only nanosecond-level overhead to generate trace data, can handle GB/s of data per node, transparently integrates with existing distributed tracing systems, and persists full, detailed traces when an edge-case problem is detected.
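
As a rough illustration of the retroactive-sampling idea, the single-node sketch below always buffers trace events in a bounded in-memory buffer and persists a trace only when a symptom is reported after the fact. The buffer layout, trigger API, and persist callback are assumptions made for the example; Hindsight itself generates data far more cheaply and coordinates retrieval across every node that serviced the request so the persisted trace is coherent end to end.

from collections import OrderedDict

class DashcamBuffer:
    """Always-on, bounded in-memory buffer; nothing is persisted unless triggered."""

    def __init__(self, capacity_events=100_000):
        self.capacity = capacity_events
        self.buffered_events = 0
        self.traces = OrderedDict()  # trace_id -> list of events, oldest trace first

    def record(self, trace_id, event):
        # Hot path: always generate trace data, evicting the oldest traces when full.
        self.traces.setdefault(trace_id, []).append(event)
        self.buffered_events += 1
        while self.buffered_events > self.capacity and self.traces:
            _, dropped = self.traces.popitem(last=False)
            self.buffered_events -= len(dropped)

    def trigger(self, trace_id, reason, persist):
        # Cold path: a symptom was detected after the fact; persist what is still buffered.
        events = self.traces.pop(trace_id, None)
        if events is None:
            return False  # data already evicted or never recorded on this node
        self.buffered_events -= len(events)
        persist(trace_id, reason, events)
        return True

# Example: a latency symptom observed at request completion triggers persistence.
buf = DashcamBuffer()
buf.record("req-42", {"span": "frontend", "latency_ms": 950})
buf.record("req-42", {"span": "storage", "latency_ms": 930})
end_to_end_latency_ms = 950
if end_to_end_latency_ms > 500:  # programmatically detectable symptom
    buf.trigger("req-42", "high tail latency",
                persist=lambda tid, why, events: print("persist", tid, why, len(events)))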
