The Question

Server Fleet Utilization Analysis

A cloud provider tracks server uptime through a series of status logs. You are given a table server_utilization with columns server_id (int), status_time (timestamp), and session_status (string, either 'start' or 'stop'). Each server can start and stop multiple times throughout the period. Calculate the total uptime for the entire fleet of servers. The result should be returned as the total number of full days (24-hour periods) of cumulative uptime across all servers, rounded down to the nearest integer. Ensure your solution accounts for the chronological order of events per server and correctly pairs start events with their subsequent stop events.

Snowflake

Window Function

CTE

LEAD

Questions & Insights

Clarifying Questions

What are the possible values for `session_status`? I assume the values are strictly 'start' and 'stop'.

How should we handle unmatched events? For example, if a server has a 'start' but no subsequent 'stop' (or vice versa), should it be ignored? I will assume every 'start' has a corresponding 'stop'; where it does not, only completed start/stop cycles are counted.

Does "full days" mean rounding down the final aggregate or calculating the sum of whole days? The problem statement asks for cumulative uptime rounded down, so I will take the total uptime across all servers, divide by 24 hours (86400 seconds), and floor the result once to get the total "server-days."

Can a server have overlapping sessions? Given the schema, I assume status_time is a linear log for each server_id, meaning sessions for a specific server are sequential and non-overlapping.
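To make the "full days" question concrete, here is a small arithmetic sketch with two hypothetical 12-hour sessions (43200 seconds each). The numbers and column aliases are illustrative only.

```sql
-- Two hypothetical sessions of 12 hours (43200 seconds) each.
SELECT
    -- Floor applied once to the fleet-wide total: one full day.
    FLOOR((43200 + 43200) / 86400)                AS floor_of_total,
    -- Floor applied per session, then summed: zero full days.
    FLOOR(43200 / 86400) + FLOOR(43200 / 86400)   AS sum_of_floors;
```

The first expression evaluates to 1 and the second to 0, which is why the interpretation matters. The rest of this solution uses the first one.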

Data Model Assumptions:

server_id: (Integer) Part of the composite primary key with status_time.

status_time: (Timestamp) The event logging time.

session_status: (String) Event type ('start' or 'stop').

Table Type: This is an Event Fact Table representing a state machine.

Relationships: 1:N relationship between a Physical Server (Dimension) and this Utilization Table (Fact).

Thinking Process

Identify State Pairs: We need to pair each 'start' event with its immediate subsequent 'stop' event for the same server.

Window Function Strategy: The LEAD() window function is ideal here. By partitioning by server_id and ordering by status_time, we can look ahead from a 'start' record to find its 'stop' record's timestamp.

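Filtering: We only care about the duration calculation when the current row is a 'start'. Compute LEAD over all events first (for example, in a CTE) and apply the 'start' filter in an outer query; a WHERE clause in the same SELECT would remove the 'stop' rows before the window is evaluated, so LEAD would return the next 'start' instead.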

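Time Calculation: Use Snowflake's DATEDIFF in seconds so that partial days are preserved until the final aggregation. DATEDIFF counts the unit boundaries crossed, so DATEDIFF(day, ...) would count calendar-day changes rather than elapsed 24-hour periods.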

Aggregation: Sum all durations across all servers.

Unit Conversion: Convert the total seconds to days by dividing by 86400 and applying FLOOR to get the count of "full days." Snowflake's / operator returns a decimal result even for integer operands, so the division does not truncate by itself.
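The pairing step can be illustrated on a few made-up rows. This is a sketch only: sample_events and its four rows are hypothetical, and next_event_time is an illustrative alias.

```sql
-- Hypothetical sample: one server with two start/stop cycles.
WITH sample_events AS (
    SELECT
        column1               AS server_id,
        TO_TIMESTAMP(column2) AS status_time,
        column3               AS session_status
    FROM VALUES
        (1, '2024-01-01 00:00:00', 'start'),
        (1, '2024-01-01 12:00:00', 'stop'),
        (1, '2024-01-02 00:00:00', 'start'),
        (1, '2024-01-03 00:00:00', 'stop')
)
SELECT
    server_id,
    session_status,
    status_time,
    -- Timestamp of the next event for the same server, in time order.
    LEAD(status_time) OVER (
        PARTITION BY server_id
        ORDER BY status_time
    ) AS next_event_time
FROM sample_events
ORDER BY server_id, status_time;
```

On the two 'start' rows, next_event_time holds the matching stop, giving sessions of 43200 and 86400 seconds. The total of 129600 seconds is 1.5 days, which floors to 1 full day.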

Implementation Breakdown

Problem Set

Requirement: Calculate total uptime across the fleet in full days.

Constraint: Logic must handle multiple start/stop cycles per server.

Edge Cases:

Servers starting and stopping within the same minute (handled by second precision).

Large datasets (handled by avoiding self-joins and using Window Functions).

Approach

Window Functions: LEAD() to peek at the next event time.

Temporal Functions: DATEDIFF(second, ...) for interval calculation.

CTEs: To organize the "pairing" logic vs. the "aggregation" logic.

Cost: This approach is O(N log N), dominated by the sort inside the window function. It processes the event log in a single pass and avoids the self-join that a naive pairing would require.

Implementation
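A minimal sketch of a query consistent with the steps described above, assuming the table and columns from the problem statement. The CTE names and output aliases (paired_events, session_durations, uptime_seconds, total_uptime_days) are illustrative choices, not given by the problem.

```sql
WITH paired_events AS (
    -- Attach to every event the timestamp of the next event on the same server.
    SELECT
        server_id,
        session_status,
        status_time AS start_time,
        LEAD(status_time) OVER (
            PARTITION BY server_id
            ORDER BY status_time
        ) AS stop_time
    FROM server_utilization
),
session_durations AS (
    -- Keep only 'start' rows that have a following event, and measure in seconds.
    SELECT
        server_id,
        DATEDIFF(second, start_time, stop_time) AS uptime_seconds
    FROM paired_events
    WHERE session_status = 'start'
      AND stop_time IS NOT NULL
)
-- Sum across the fleet, then floor once to whole 24-hour periods.
SELECT
    FLOOR(SUM(uptime_seconds) / 86400) AS total_uptime_days
FROM session_durations;
```

If the table contains no completed sessions, SUM returns NULL and so does the result; wrapping the sum in COALESCE(..., 0) would return 0 instead, if that is the preferred behavior.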

Wrap Up

Advanced Topics

Indexing & Clustering: Snowflake has no conventional indexes; instead, if this table is massive (multi-terabyte), we could define a Clustering Key on (server_id, status_time). This co-locates rows for the same server in the same micro-partitions, which improves partition pruning and may reduce the work needed for the PARTITION BY and ORDER BY operations in the window function.
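As a sketch, the clustering key could be defined and then inspected as follows. Whether it pays off depends on table size and query patterns, since automatic reclustering has its own compute cost.

```sql
-- Define a clustering key on the columns used by the window function.
ALTER TABLE server_utilization CLUSTER BY (server_id, status_time);

-- Inspect how well the table is clustered on those columns.
SELECT SYSTEM$CLUSTERING_INFORMATION(
    'server_utilization',
    '(server_id, status_time)'
);
```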

Handling Large Windows: If a server has millions of events, the sort for its partition may exceed warehouse memory and spill to disk, slowing the query. In such rare cases, a session-ization approach using a running sum (to create session IDs) might be more robust, though LEAD is standard for simple start/stop pairs.
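One possible shape of the session-ization approach, as a sketch under the same schema assumptions. The names numbered_events, sessions, and session_id are hypothetical. Each 'start' increments a running counter, so every event between one start and the next shares a session ID.

```sql
WITH numbered_events AS (
    SELECT
        server_id,
        status_time,
        session_status,
        -- Running count of 'start' events per server acts as a session ID.
        SUM(IFF(session_status = 'start', 1, 0)) OVER (
            PARTITION BY server_id
            ORDER BY status_time
            ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
        ) AS session_id
    FROM server_utilization
),
sessions AS (
    SELECT
        server_id,
        session_id,
        MIN(IFF(session_status = 'start', status_time, NULL)) AS start_time,
        -- First 'stop' after the start; NULL if the session never stopped.
        MIN(IFF(session_status = 'stop', status_time, NULL))  AS stop_time
    FROM numbered_events
    WHERE session_id > 0          -- drop stops logged before any start
    GROUP BY server_id, session_id
)
SELECT
    FLOOR(SUM(DATEDIFF(second, start_time, stop_time)) / 86400) AS total_uptime_days
FROM sessions
WHERE stop_time IS NOT NULL;
```

This variant also tolerates repeated 'stop' rows within a session, because only the earliest stop after each start is used.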

Data Quality: In a real production environment, you might have "zombie" starts (a start without a stop). Filtering for stop_time IS NOT NULL drops a start only when it is the server's last event; a start followed directly by another start would still be paired with that later start's timestamp. To be stricter, also verify that LEAD(session_status) is actually 'stop'.
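A sketch of the stricter variant, which counts a 'start' only when the very next event on that server is a 'stop'. The alias next_status is illustrative.

```sql
WITH paired_events AS (
    SELECT
        server_id,
        session_status,
        status_time AS start_time,
        LEAD(status_time) OVER (
            PARTITION BY server_id
            ORDER BY status_time
        ) AS stop_time,
        -- Status of the next event, used to validate the pairing.
        LEAD(session_status) OVER (
            PARTITION BY server_id
            ORDER BY status_time
        ) AS next_status
    FROM server_utilization
)
SELECT
    FLOOR(SUM(DATEDIFF(second, start_time, stop_time)) / 86400) AS total_uptime_days
FROM paired_events
WHERE session_status = 'start'
  AND next_status = 'stop';   -- also excludes starts with no following event
```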