A 2 AM alert fires. 5xx errors are spiking. Latency is through the roof. The instinct is to act immediately, but acting on the wrong diagnosis wastes the most critical minutes of the incident. At the application layer, a sophisticated Layer 7 DDoS attack looks identical to a bad deployment, a dependency failure, a CDN misconfiguration, or a DNS issue.
This guide gives Site Reliability Engineers (SREs) and DevOps engineers a structured path from “something is wrong” to “I know what this is” in under five minutes. Once you have a confident diagnosis, the response path is clear; if it points to a Layer 7 DDoS, our first-30-minutes action plan provides the response playbook.
The 2 AM Checklist: Is It a DDoS or an Internal Failure?
In the first five minutes of an availability drop, your only goal is to identify the nature of the beast. Wasted time in the “diagnosis” phase is the biggest contributor to high Mean Time to Isolate (MTTI). Use the table below to match your current dashboard symptoms to the most likely root cause.
Use this as a reference throughout the diagnostic process. Work through Sections 1 through 3 before drawing a conclusion from the Layer 7 DDoS row.
| Failure Mode | Primary Tell | Where to Check First |
|---|---|---|
| Bad Deployment | Error spike matches push window | CI/CD Change management logs |
| CDN Misconfig | High Origin TTFB vs. Low Edge Latency | CDN Performance/Cache-Miss dashboard |
| Dependency Failure | Selective 504s on specific API flows | Upstream Service Mesh/API metrics |
| DNS Failure | Traffic cliff + near-zero Server CPU | DNS Resolver logs & Ingress traffic |
| WAF or Edge Outage | Edge 5xx errors with zero origin traffic | Provider status page (e.g. Cloudflare) |
| Layer 7 DDoS | RPS spike + high variance in Source IPs | Ingress WAF/Load Balancer logs |
Once you have a baseline from your initial dashboard triage, you need a high-speed verification process. The table helps you narrow the field, but validate the specific cause before shifting into mitigation mode. Your first priority is to determine whether the service interruption is external by quickly auditing your internal state.
Minimizing MTTI: Ruling Out Internal Deployments and Configuration Changes
The goal of your first two minutes is to reduce your Mean Time to Isolate (MTTI), the metric that tracks how quickly you can point to the specific cause of a failure. In practice, that means determining whether the “attack” is actually a self-inflicted wound.
Start with the Change Management Check. Open your deployment logs and feature flag dashboard. Did a push happen in the last fifteen minutes? Did someone toggle a global configuration or update a WAF rule? If the timing of the latency spike aligns perfectly with a “Success” message in your CI/CD pipeline, the code is your primary suspect.
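As a rough sketch, this timing correlation can be automated against your deploy log. The timestamps, the `deploy_is_suspect` helper, and the 15-minute window below are all illustrative, not part of any specific CI/CD API:

```python
from datetime import datetime, timedelta

def deploy_is_suspect(deploy_times, spike_start, window_min=15):
    """Return deploys that landed within `window_min` minutes before the spike."""
    window = timedelta(minutes=window_min)
    return [t for t in deploy_times if timedelta(0) <= spike_start - t <= window]

# Hypothetical timestamps pulled from CI/CD logs and the alerting system.
deploys = [datetime(2024, 1, 10, 1, 52), datetime(2024, 1, 9, 22, 10)]
spike = datetime(2024, 1, 10, 2, 0)
# The 01:52 push falls inside the window, so the code is the primary suspect.
```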
Next, look at the Blast Radius. A true Layer 7 DDoS usually hits your ingress or a public-facing gateway, causing a broad failure across multiple nodes or regions. If the errors are localized to a specific microservice, a single pod, or a specific database shard, you are likely looking at an internal bottleneck or a logic error. Malicious traffic doesn’t usually pick and choose which pod to crash; it overwhelms the entire entry point.
Finally, use the Rollback Signal. If you have a recent deployment, trigger a rollback immediately. In a healthy environment, the error rate would begin to plateau or dip within two to three minutes as the old, stable code takes over. If you roll back and the 5xx errors continue to climb at the same trajectory, stop looking at your code. The call is coming from outside the house. Move on to analyzing the ingress traffic.
Differentiating Network and Dependency Outages from Layer 7 Attacks
If your internal environment is stable, check the infrastructure between your users and your servers. Three external failure modes frequently mask themselves as Layer 7 attacks. Identifying them early prevents you from wasting time on a mitigation strategy that will not work.
A CDN misconfiguration is the first suspect. A CDN, or Content Delivery Network, is the distributed system that caches your content near your users to reduce latency. When a CDN is configured incorrectly, your users experience high response times that look like a targeted attack. The concrete metric to check here is your Origin Time to First Byte, or Origin TTFB. This tracks how long it takes your server to respond to a request from the CDN. If your Client-to-Edge latency is normal but your Edge-to-Origin latency is spiking, your CDN is likely struggling with a cache-miss storm or a bad routing rule.
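A rough way to express this tell, assuming you can export per-request edge latency and Origin TTFB samples from your CDN; the 3x ratio is an arbitrary illustration, not a standard threshold:

```python
def p95(samples_ms):
    """Nearest-rank 95th percentile of a list of latency samples."""
    ordered = sorted(samples_ms)
    return ordered[max(0, int(round(0.95 * len(ordered))) - 1)]

def cdn_misconfig_signal(edge_latency_ms, origin_ttfb_ms, ratio=3.0):
    """Flag when Origin TTFB dwarfs edge latency: the cache-miss-storm signature."""
    return p95(origin_ttfb_ms) > ratio * p95(edge_latency_ms)
```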
Next, consider a WAF or Global Edge Outage. If a provider like Cloudflare experiences a regional or global incident, your site will disappear for a significant portion of your users. This looks like a total collapse in traffic at your origin, but your edge monitoring will show a massive spike in 5xx errors. The concrete tell is a mismatch between edge status codes and origin logs. If your edge is reporting gateway timeouts but your ingress controller sees no traffic, the security layer itself is the bottleneck. Verify this by checking the public status page of your WAAP (Web Application and API Protection) provider.
Next, investigate dependency failures. Modern applications rely on many third-party services, such as payment processors or authentication providers. When one of these external services fails, it triggers 504 Gateway Timeout errors. A dependency failure is almost always selective. You will see errors on specific API endpoints or user flows, while the rest of your site remains healthy. Check your upstream success rates for each specific service. If your login page is timing out but your homepage is loading perfectly, you are likely dealing with a provider outage rather than a broad infrastructure attack.
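This selectivity is easy to surface from access logs. A minimal sketch, assuming each log record reduces to a `(path, status_code)` pair (the sample paths are hypothetical):

```python
from collections import Counter

def endpoint_504_rates(requests):
    """requests: iterable of (path, status_code) pairs from access logs.
    Returns the 504 rate per path so selective failures stand out."""
    totals, timeouts = Counter(), Counter()
    for path, status in requests:
        totals[path] += 1
        if status == 504:
            timeouts[path] += 1
    return {path: timeouts[path] / totals[path] for path in totals}

# A selective pattern: /login (auth provider) timing out, homepage healthy.
sample = [("/login", 504), ("/login", 504), ("/login", 200),
          ("/", 200), ("/", 200), ("/", 200)]
```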
Finally, rule out a DNS failure. DNS, or Domain Name System, is the service that translates your web address into a machine-readable IP address. When your DNS provider has an outage, it creates a traffic cliff. You will see your request volume drop to near zero in seconds. Your server CPU usage will also drop to almost zero because no traffic is reaching your ingress controller. In a Layer 7 attack, your CPU usage would be rising due to the heavy volume of requests. If your metrics show a total collapse in traffic and a quiet server, your DNS resolution is the root cause.
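The traffic-cliff signature reduces to a simple predicate over two metrics. The 5 percent and 20 percent cutoffs here are illustrative choices, not standard values:

```python
def dns_cliff_signal(rps_now, rps_baseline, cpu_now, cpu_baseline):
    """DNS failure: requests and CPU collapse together because traffic never
    reaches the ingress. A Layer 7 attack shows the opposite: both climb."""
    return rps_now < 0.05 * rps_baseline and cpu_now < 0.2 * cpu_baseline
```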
Confirming the Layer 7 DDoS Signature through Telemetry
A sudden Requests Per Second (RPS) spike that has no obvious business cause is often the first indicator. If traffic triples in minutes and no marketing campaigns are active, the ingress logs usually show a surge that does not follow the typical daily curve.
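A crude way to flag that surge against the recent baseline; the three-sigma threshold is an illustrative stand-in for a real anomaly detector:

```python
import statistics

def rps_spike(recent_rps, current_rps, sigma=3.0):
    """Flag when current RPS exceeds the recent mean by `sigma`
    population standard deviations."""
    mean = statistics.mean(recent_rps)
    spread = statistics.pstdev(recent_rps)
    return current_rps > mean + sigma * spread
```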
The second signal appears as endpoint concentration on uncacheable routes like login pages, search queries, or checkout flows. These paths are targeted to bypass the CDN and hit the origin database directly. Telemetry often shows a disproportionate amount of traffic hitting a specific path, such as /api/v1/search, while homepage traffic remains flat. A quick check of the cache-miss ratio on edge nodes usually confirms a shift toward 100 percent misses on the targeted path.
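Measuring that concentration is a one-liner over a window of request paths. The sample window below is hypothetical:

```python
from collections import Counter

def hot_path_share(paths):
    """Return the hottest request path and its share of total traffic."""
    counts = Counter(paths)
    path, hits = counts.most_common(1)[0]
    return path, hits / len(paths)

# Hypothetical 60-second sample: search traffic dwarfing everything else.
window = ["/api/v1/search"] * 8 + ["/", "/about"]
```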
High variance in source IPs combined with uniform request behavior provides further confirmation. In a legitimate surge, users arrive at different times and browse different pages. A botnet is more coordinated, with thousands of unique IPs making the exact same request at fixed intervals. Counting the requests per IP over a sixty-second window often reveals a cluster of addresses all making an identical number of requests per minute.
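The per-IP cadence check above can be sketched as follows; the IP addresses and the 12-requests-per-minute cadence are invented for illustration:

```python
from collections import Counter

def uniform_cadence_ips(ip_log, tolerance=1):
    """ip_log: source IPs observed in a 60-second window.
    Returns IPs whose per-window request count clusters tightly around the
    modal count, a pattern organic traffic rarely produces."""
    per_ip = Counter(ip_log)
    modal_count = Counter(per_ip.values()).most_common(1)[0][0]
    return [ip for ip, n in per_ip.items() if abs(n - modal_count) <= tolerance]

# Five suspect IPs at exactly 12 requests each; two organic stragglers.
log = [f"10.0.0.{i}" for i in range(5) for _ in range(12)] \
      + ["198.51.100.7", "198.51.100.9", "198.51.100.9"]
```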
The final confirmation often comes from anomalous User-Agent patterns. While these strings identify the browser or device, a massive volume of identical browser strings across thousands of different IPs is statistically impossible for human traffic. When the most active IPs are all using the exact same version of a browser to hit a single uncacheable route, a Layer 7 DDoS is almost certainly the cause.
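A minimal sketch of that check, assuming log records reduce to `(source_ip, user_agent)` pairs; the 1000-IP default and the sample strings are arbitrary illustrations:

```python
from collections import defaultdict

def shared_user_agents(records, min_distinct_ips=1000):
    """records: iterable of (source_ip, user_agent) pairs.
    Flags User-Agent strings reused verbatim across an implausible
    number of distinct IPs."""
    ips_per_ua = defaultdict(set)
    for ip, ua in records:
        ips_per_ua[ua].add(ip)
    return {ua: len(ips) for ua, ips in ips_per_ua.items()
            if len(ips) >= min_distinct_ips}
```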
The 60-Second Verdict: A Layer 7 Confirmation Checklist
Once the telemetry has been analyzed, a quick final check helps ensure the mitigation path is the correct one. This list is designed to be answered using the dashboards already open in the middle of the incident.
- Internal and Path Stability: Have recent deployments and feature flag changes been ruled out? Are the status pages for your CDN, WAF, and DNS providers showing green?
- Unexplained RPS Surge: Is the current Requests Per Second (RPS) spike unrelated to any scheduled marketing events, product launches, or known viral traffic?
- Endpoint Hotspots: Is the traffic concentrated on expensive, uncacheable routes like /search, /login, or POST-heavy API endpoints?
- Uniform Behavior: Do the logs show thousands of unique IPs all making requests at the exact same cadence or following the same rigid path through the application?
- User-Agent Anomalies: Is there a massive volume of identical or implausibly generic browser strings—such as an outdated Chrome version—hitting your origin?
- Resource Exhaustion Order (L7 vs. L4): Is the application showing CPU or database connection exhaustion while your ingress bandwidth remains well below your pipe’s capacity? If you see pinned CPU with low throughput, it is a Layer 7 signature.
- The “Whack-a-Mole” Signal: Have initial attempts to block the most active IPs resulted in zero improvement to the overall 5xx error rate? If the error rate is indifferent to IP blocking, the attack is highly distributed.
Scoring the Verdict
A majority of “Yes” answers confirms a distributed Layer 7 attack. In this scenario, the standard response—scaling more pods or blocking individual IPs—will likely fail. The application logic itself is being exploited to exhaust the backend. The next move is to activate the DDoS mitigation layer or reach out to an emergency response provider.
If there is a majority of “No” answers or mixed results, the issue is likely a “look-alike.” A mixed result often points back to a dependency failure or a subtle CDN misconfiguration. In these cases, shifting to an attack mitigation posture could actually make the situation worse by blocking legitimate users or masking the real root cause in the infrastructure. Return to Section 2, re-examine the look-alike that most closely matches your current metrics, and validate it before shifting posture.
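The majority-vote rule is simple enough to codify. The answer keys below mirror the seven checklist items, and the sample reading is purely illustrative:

```python
def layer7_verdict(answers):
    """answers: dict mapping each checklist item to True (Yes) or False (No).
    A strict majority of Yes answers points to a distributed Layer 7 attack."""
    yes = sum(answers.values())
    return "layer7-ddos" if yes > len(answers) / 2 else "look-alike"

# Illustrative mid-incident reading of the seven checklist items.
checklist = {
    "internal_and_path_stability": True,
    "unexplained_rps_surge": True,
    "endpoint_hotspots": True,
    "uniform_behavior": True,
    "user_agent_anomalies": False,
    "l7_exhaustion_order": True,
    "whack_a_mole_signal": False,
}
```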
You Have a Diagnosis. Here Is What Happens Next.
Once the telemetry points to a confirmed Layer 7 signature, the diagnostic phase ends and the active defense phase begins. For an SRE, this is the most critical transition because every minute of indecision directly impacts the availability SLA. The goal is no longer to understand why the service is failing, but to restore stability by any means necessary.
If the internal and external checks confirm a malicious surge, the next step is a structured response. Go through the first-30-minutes action plan, which covers the specific technical maneuvers for filtering traffic at the ingress. This guide provides a parallel track for SRE, Security, and Communications teams to ensure the response is coordinated and the recovery is as fast as possible.
If the diagnosis remains unclear but the application is still bleeding, the most efficient path to recovery is to engage an Emergency Response Team through an “Under Attack” lifeline. This allows you to offload the filtering to experts so you can focus on the stability of your origin and your upstream dependencies.

