Observability Implementation

We designed and deployed a scalable observability stack built on Prometheus and Thanos, giving a client real-time insight into its metrics for proactive decision-making and operational efficiency while reducing overall infrastructure cost by approximately 30%.

Tags: #aws, #prometheus, #thanos, #iot

Background

A rapidly growing tech company was struggling to manage an overwhelming volume of metrics data, with approximately 15 million data points processed daily. This data comprised 11 million standard metrics and 4 million custom metrics, generated across numerous development and production Kubernetes clusters. As the company’s infrastructure expanded, the engineering teams increasingly faced challenges in maintaining efficient operations. The existing telemetry system, responsible for data collection, processing, caching, and querying, was becoming a significant bottleneck.

Engineers frequently reported long query times when retrieving historical data, with queries often timing out before completion. This made timely analysis, troubleshooting, and service reliability work difficult. The teams were also grappling with missing metrics, which prevented a complete and accurate view of system performance. The lack of a cohesive strategy for long-term metrics storage compounded these issues, leading to data loss and complicating trend analysis over extended periods.

Moreover, the absence of centralized metrics querying, alerting, and notification management introduced inconsistencies and inefficiencies in monitoring and response efforts. Each cluster operated in isolation, with disparate configurations and tools, resulting in a fragmented and unreliable telemetry ecosystem. The disorganized state of the metrics stack slowed down the engineering teams and increased the risk of operational failures, as critical alerts were delayed or missed and essential performance data was lost or inaccessible.

Solution

After thoroughly analysing the company’s existing metrics stack, we proposed a proof of concept (POC) with a new, optimized metrics stack setup. Given the company’s multi-cluster Kubernetes environment, we introduced Prometheus for metrics collection and Thanos for long-term data querying, ensuring scalability and efficiency.

To address the query performance issues, we integrated Thanos Query Frontend with Memcached to cache query results and speed up subsequent queries. The Query Frontend was configured to align each query's time range with its step parameter, which makes results more cacheable because near-identical requests resolve to the same cache entry.
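The effect of step alignment on cacheability can be illustrated with a minimal sketch. It assumes timestamps and steps in seconds and uses an in-memory dictionary as a stand-in for Memcached; the function names and the cache-key layout are hypothetical and do not reproduce Thanos internals.

```python
from typing import Dict, Tuple


def align_range(start: int, end: int, step: int) -> Tuple[int, int]:
    """Snap start and end down to the nearest multiple of step (seconds)."""
    if step <= 0:
        raise ValueError("step must be positive")
    return start - start % step, end - end % step


# In-memory stand-in for Memcached (assumption for illustration only).
_cache: Dict[Tuple[str, int, int, int], str] = {}


def cached_query(query: str, start: int, end: int, step: int) -> str:
    """Return 'hit' if an aligned, identical request was seen before, else 'miss'."""
    aligned_start, aligned_end = align_range(start, end, step)
    key = (query, aligned_start, aligned_end, step)
    if key in _cache:
        return "hit"
    # A real frontend would run the query here and store the result.
    _cache[key] = "result"
    return "miss"
```

With a 60-second step, a dashboard refresh at `start=1000, end=4600` and another seven seconds later at `start=1007, end=4607` both align to `(960, 4560)`, so the second request can be served from the cache instead of being recomputed.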

Additionally, to manage the increasing volume of data, we implemented multi-shard partitioning in the Thanos Store Gateway. This involved creating multiple Store Gateway StatefulSets, each serving one shard, with blocks assigned to shards by a hashmod of the blocks. This setup allowed the system to handle large numbers of objects in the S3 bucket more efficiently, ensuring faster access to data and preventing bottlenecks.
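The idea behind hashmod partitioning can be sketched as follows. In Thanos this is normally configured declaratively, through a `hashmod` relabel action in the Store Gateway's selector configuration, rather than written by hand; the sketch below only illustrates the hash-then-modulo assignment. The use of MD5 and the block identifiers are assumptions for illustration, not a reproduction of the client's setup.

```python
import hashlib
from typing import Iterable, List


def shard_for_block(block_id: str, total_shards: int) -> int:
    """Map a block ID to a shard index in [0, total_shards)."""
    if total_shards <= 0:
        raise ValueError("total_shards must be positive")
    digest = hashlib.md5(block_id.encode("utf-8")).digest()
    # Use the low 8 bytes of the digest as an unsigned integer (illustrative choice).
    return int.from_bytes(digest[8:], "big") % total_shards


def blocks_for_shard(block_ids: Iterable[str], shard: int, total_shards: int) -> List[str]:
    """Return the blocks a given store shard would be responsible for."""
    return [b for b in block_ids if shard_for_block(b, total_shards) == shard]
```

Because the assignment depends only on the block ID and the shard count, every block lands on exactly one shard, and each Store Gateway instance only needs to load and index its own subset of the bucket.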

For long-term metrics storage, we selected AWS S3 due to its cost-effectiveness, scalability, and seamless integration with Thanos. 

To improve alerting and notification management, we used Prometheus Operator Custom Resource Definitions (CRDs) to manage Alertmanager configurations and Prometheus rules across all clusters. This centralized approach enabled consistent management of alerting and metric rules while still allowing each team to create its own isolated rules. As a result, an error in one team's configuration would not affect other teams, improving overall reliability.
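As a rough illustration of per-team rule isolation, the sketch below builds one `PrometheusRule` manifest per team as a plain dictionary. The `monitoring.coreos.com/v1` API group and the `PrometheusRule` kind come from the Prometheus Operator; the team name, namespace, label scheme, and alert expression are hypothetical and not taken from the client's configuration.

```python
import json
from typing import Any, Dict, List


def team_rule_manifest(team: str, namespace: str, rules: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a PrometheusRule manifest scoped to a single team."""
    return {
        "apiVersion": "monitoring.coreos.com/v1",
        "kind": "PrometheusRule",
        "metadata": {
            "name": f"{team}-rules",
            "namespace": namespace,
            # Hypothetical label that a rule selector could match on.
            "labels": {"team": team},
        },
        "spec": {"groups": [{"name": f"{team}.rules", "rules": rules}]},
    }


def example_manifest() -> Dict[str, Any]:
    """Example with a made-up alert; the metric name and threshold are assumptions."""
    rule = {
        "alert": "HighErrorRate",
        "expr": 'sum(rate(http_requests_total{status=~"5.."}[5m])) > 10',
        "for": "10m",
        "labels": {"severity": "warning", "team": "payments"},
    }
    return team_rule_manifest("payments", "payments", [rule])


if __name__ == "__main__":
    print(json.dumps(example_manifest(), indent=2))
```

Keeping each team's rules in a separate resource means a malformed rule is confined to that team's object rather than to one shared rules file.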

For visualization, we added Grafana as a centralized dashboard layer, backed by AWS RDS for storing Grafana data. A unified and reliable visualization layer rounds out the metrics stack and further strengthens the company's ability to monitor, analyze, and optimize its infrastructure.

The project timeline included 3 weeks for the POC, 1 month for building the new metrics stack, and 3 months for a company-wide rollout. Throughout the process, we provided ongoing training and shared best practices to ensure the client’s teams were well-equipped to manage and optimize the new metrics stack.

Results

After the new metrics stack was implemented, the company saw a dramatic improvement in its metrics management and query performance. The most notable impact was on long-range query times, which fell by over 70%. This addressed one of the engineers' most pressing concerns, allowing them to access historical data quickly without the frequent timeouts that had previously hindered their work. The integration of Memcached with Thanos Query Frontend played a crucial role, enabling faster and more reliable queries by serving cached results and optimizing the processing of complex queries.

The centralized metrics querying capability, made possible by integrating Prometheus and Thanos, significantly streamlined debugging and troubleshooting across the company's multiple Kubernetes clusters. Engineers could now access and analyze metrics from different environments through a unified interface, saving over 50% of the time they had previously spent navigating disparate systems and manually correlating data. These efficiency gains not only improved the speed and accuracy of issue resolution but also freed engineers to focus on more strategic tasks, boosting productivity across the board.

In addition to the operational benefits, the optimized metrics stack led to substantial cost savings. By using AWS S3 for long-term storage and implementing a more efficient query architecture, the company reduced its infrastructure costs by approximately 30%. The new system's scalability and efficiency mean the company can continue to grow without facing exponential increases in telemetry costs. Overall, the company now benefits from a metrics management system that is faster, more reliable, and more cost-effective, positioning it for sustained operational excellence and growth.
