How embedding-based case similarity finds threats that rules miss

By Securaa

June 1, 2026

Every SOC has a folder somewhere. A shared drive, a Confluence page, a Notion workspace, a senior analyst’s brain. It’s full of incidents from the last two years. Phishing campaigns that almost worked. The week somebody’s service account started behaving strangely on a Thursday afternoon. The lateral movement attempt that got caught by an alert that almost didn’t fire. The insider exfil case that took four people three days to piece together.

That folder is, in a real sense, your SOC’s most valuable asset. It’s the record of every attack pattern your environment has actually seen. Every false positive that fooled an analyst for an hour before resolving. Every weird thing that turned out to matter, and every weird thing that turned out not to.

And the platform almost certainly cannot read it.

This is the gap that case similarity is supposed to close. The idea is simple enough to explain in a sentence. When a new case comes in, the platform finds the cases from your history that resemble it and surfaces them to the analyst working the queue. The analyst sees not just the alert in front of them, but also the seven times something that looked a lot like this happened before, what the SOC did about it, and how it resolved.

The trouble with this idea isn’t the concept. It’s what “resembles” means.

What rule-based similarity actually does

The way most platforms do this today is the way you’d expect, which is rules. The platform indexes cases on a set of fields. Source IP. Destination IP. User. Asset. Alert type. MITRE ATT&CK technique. When a new case comes in, the platform looks for prior cases that share enough of those fields to count as similar.

This works fine when the attacker is unimaginative. A brute force attempt against the VPN gateway today looks structurally identical to a brute force attempt against the VPN gateway last month, and the matcher will find the prior case without breaking a sweat. Same source country, same target, same alert type, same technique. Easy.

It falls apart the moment anything varies.

Consider what an attacker actually does between two campaigns. They rotate IPs. They use a different toolkit. The credential stuffing this week comes from a different ASN than the credential stuffing last week. The phishing email lands from a different domain, references a different urgent topic, and impersonates a different brand. The lateral movement uses WMI this time instead of PsExec. None of the indexable fields match. The matcher returns nothing. As far as the platform is concerned, this is a brand-new kind of incident the SOC has never seen.

A senior analyst looking at the two cases side by side knows they’re the same campaign. The platform sees two unrelated events.

This is the part rules can’t do. They match on what the case contains, not on what the case is about. And in a SOC, what a case is about is almost always the thing that matters.
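To make the failure concrete, here is a minimal sketch of a field-overlap matcher in Python. The `Case` fields mirror the list above, but the `min_shared` threshold of three, the alert type names, and the sample cases are all assumptions invented for illustration, not any particular platform's implementation.

```python
from dataclasses import dataclass

# Fields the matcher compares, taken from the list in the text.
MATCH_FIELDS = ("source_ip", "dest_ip", "user", "asset", "alert_type", "technique")


@dataclass(frozen=True)
class Case:
    case_id: str
    source_ip: str
    dest_ip: str
    user: str
    asset: str
    alert_type: str
    technique: str  # MITRE ATT&CK technique ID


def shared_fields(a: Case, b: Case) -> int:
    """Count how many indexed fields hold exactly the same value."""
    return sum(getattr(a, f) == getattr(b, f) for f in MATCH_FIELDS)


def rule_match(new_case: Case, history: list[Case], min_shared: int = 3) -> list[str]:
    """Return IDs of prior cases sharing at least `min_shared` fields."""
    return [c.case_id for c in history if shared_fields(new_case, c) >= min_shared]


# Hypothetical history: one lateral movement case that used PsExec.
prior = Case("C-100", "10.0.0.5", "10.0.0.20", "svc_backup", "FILESRV01",
             "remote_service_creation", "T1021.002")
history = [prior]

# Same tooling, same hosts: every field lines up.
repeat = Case("C-200", "10.0.0.5", "10.0.0.20", "svc_backup", "FILESRV01",
              "remote_service_creation", "T1021.002")

# Same intent, but WMI from a different host and account: nothing lines up.
varied = Case("C-201", "10.0.1.9", "10.0.0.31", "svc_deploy", "APPSRV02",
              "wmi_process_create", "T1047")

print(rule_match(repeat, history))
print(rule_match(varied, history))
```

The repeat attempt matches on all six fields. The WMI variant shares none, so the matcher comes back empty even though an analyst would read both cases as the same lateral movement.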

What embeddings change

The technical move that fixes this is straightforward, and it’s been quietly transforming search, recommendation, and document retrieval for years. Here’s the short version. Instead of indexing cases on a fixed set of fields, you convert each case into a vector. A vector is just a long list of numbers that captures the meaning of the case as a whole. Cases that mean similar things end up with similar vectors, even if none of the specific fields match.

The conversion is done by an embedding model that’s read enough text to have a sense of what concepts cluster together. It knows that “credential stuffing from a residential proxy” and “password spray from a botnet” are different in their indicators but related in their intent. It knows that an email about an urgent invoice from a misspelled vendor domain and an email about an urgent shipping notice from a freshly registered domain are the same kind of phishing attempt dressed in different clothes. It doesn’t know this because someone wrote a rule. It knows it because patterns of language and behavior that co-occur in attack data end up near each other in the vector space.

When a new case lands, the platform embeds it and asks which of the cases in its history have the closest vector. The match isn’t filtered through anyone’s view of which fields should count. It’s a direct measurement of conceptual closeness across everything the case contains. The alert types, the asset behavior, the sequence of events, the textual notes the analyst left, the resolution.

That last part is the one most worth thinking about. The analyst’s notes on a closed case are usually the highest-signal part of the entire record, and the rule-based matcher cannot read them. The embedding can. “Looked like exfil but turned out to be a Veeam backup misconfigured to write to S3” is a sentence that will never match a rule. But it’s exactly the kind of sentence that will pull up the right prior case six months later when the same thing happens on a different server.
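The lookup step itself fits in a few lines of Python. This sketch assumes the vectors already exist. The three-dimensional vectors and case names below are invented placeholders, where a real embedding model would emit hundreds of dimensions. Cosine similarity is one common measure of closeness, used here as an assumption rather than a statement of what any given platform runs.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query: list[float],
                  history: dict[str, list[float]],
                  k: int = 3) -> list[tuple[str, float]]:
    """Return the k prior cases whose vectors are closest to the query."""
    scored = [(case_id, cosine_similarity(query, vec))
              for case_id, vec in history.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder vectors standing in for embedding-model output.
history_vectors = {
    "veeam-backup-to-s3": [0.9, 0.1, 0.2],
    "vpn-brute-force": [0.1, 0.9, 0.1],
    "invoice-phish": [0.0, 0.2, 0.9],
}
new_case_vector = [0.8, 0.2, 0.3]

for case_id, score in top_k_similar(new_case_vector, history_vectors, k=2):
    print(f"{case_id}: {score:.3f}")
```

A linear scan like this is fine for a sketch. At real history sizes a platform would more likely sit on top of an approximate nearest-neighbor index, but the ranking idea is the same.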

What this looks like at the analyst’s desk

A case lands on a Wednesday afternoon. Unusual outbound traffic from a developer workstation to a cloud storage endpoint that hasn’t been seen in the environment before. The volume is moderate. The user is a real employee, not obviously compromised. The alert is medium severity.

In the old model, the analyst opens the case, pulls the user’s recent activity, checks the destination, runs a threat intel lookup on the IP, and starts the slow process of figuring out whether this is something or nothing.

In the model with embedding-based similarity, the case opens with a panel attached. Three similar prior cases. All resolved benign. In each, the developer was syncing a personal side project to their own cloud account. The pattern resolved when the analyst contacted the user.

The analyst still has to do the work. The point isn’t that the platform decided. The point is that the analyst now starts the investigation with the right hypothesis on the screen, instead of having to construct it from scratch. The first phone call is to the developer, not to the IR team. The case closes in twenty minutes instead of two hours.

Now flip the scenario. Same alert pattern, but the panel reads: One similar prior case. Resolved as confirmed data exfiltration. Threat actor used a misconfigured developer workstation as a staging point. The investigation that starts from that panel looks nothing like the investigation that starts from the first one. Same alert. Completely different posture. The difference is the memory.

Where this goes wrong

The obvious problem is the cold-start one. Embedding-based similarity is only as good as the cases it has to compare against. A SOC that has been live for two months has thin history, and the matches will be weak. By month nine the gap between a platform that does this and one that doesn’t is significant. By month eighteen it’s the surface analysts use most. This is also why a thirty-day pilot is the worst possible way to evaluate this capability. The pilot is exactly the window in which it can’t show what it does.

Then there’s the temptation to over-trust the panel. Similarity is not proof of a shared cause. Two cases can have nearly identical vectors and be completely unrelated underneath. A good system surfaces the similar cases, but doesn’t let the resemblance close the investigation by itself. The analyst still decides. What the similarity panel does is shape the first ten minutes of an investigation, which is where most of the wasted analyst time in a SOC actually lives.

There’s also the question of what’s actually getting embedded. If the platform only embeds alert metadata (source, destination, type, and technique), the result is a slightly fancier version of the rule-based matcher and the benefit will be marginal. The real gain comes from embedding the whole case: the alerts, the enrichments, the sequence, and especially the analyst’s notes and resolution. A platform that doesn’t ingest the unstructured parts of the case is doing about half the job.
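One way to picture the difference is to look at the text that would be handed to the embedding model in each case. The sketch below is Python, and the `CaseRecord` shape, the field names, and the sample events are all hypothetical. It only builds the input text and does not call any model.

```python
from dataclasses import dataclass, field


@dataclass
class CaseRecord:
    alert_type: str
    technique: str
    enrichments: list[str] = field(default_factory=list)
    events: list[str] = field(default_factory=list)  # chronological order
    analyst_notes: str = ""
    resolution: str = ""


def metadata_only_text(case: CaseRecord) -> str:
    """The thin version: structured alert fields and nothing else."""
    return f"alert_type: {case.alert_type}\ntechnique: {case.technique}"


def full_case_text(case: CaseRecord) -> str:
    """The whole case: metadata, enrichments, ordered events, notes, resolution."""
    parts = [metadata_only_text(case)]
    if case.enrichments:
        parts.append("enrichments: " + "; ".join(case.enrichments))
    if case.events:
        # Numbering the events keeps their order visible in the text.
        numbered = [f"{i}. {e}" for i, e in enumerate(case.events, start=1)]
        parts.append("sequence:\n" + "\n".join(numbered))
    if case.analyst_notes:
        parts.append("analyst notes: " + case.analyst_notes)
    if case.resolution:
        parts.append("resolution: " + case.resolution)
    return "\n".join(parts)


case = CaseRecord(
    alert_type="unusual_outbound_volume",
    technique="T1567.002",
    enrichments=["destination is an S3 endpoint", "host is a backup server"],
    events=["scheduled job starts", "large upload begins", "upload completes"],
    analyst_notes="Looked like exfil but turned out to be a Veeam backup "
                  "misconfigured to write to S3",
    resolution="benign, misconfiguration",
)

print(metadata_only_text(case))
print("---")
print(full_case_text(case))
```

The metadata-only text carries nothing a field index didn't already have. The sentence about the Veeam backup only reaches the model through the full version.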

And then there’s the unsettling part. This approach surfaces patterns the SOC didn’t know it had. A new case comes in and the similarity panel pulls up five prior cases that share something subtle. Same time of day, same kind of unusual parent process, same downstream behavior. Nobody had noticed they clustered together. Sometimes this is a real attack pattern the SOC has been quietly missing for months. Sometimes it’s a benign environmental quirk nobody documented. Either way, the platform is now telling you things about your own history that you didn’t tell it.

What to ask

When you’re evaluating a platform that claims to do this, and plenty of platforms will be making that claim over the next two years, the questions worth asking aren’t about the model. They’re about what gets embedded.

Does the system embed the analyst’s resolution notes, or only the structured fields? Does it embed the sequence of events in a case, or only the case as a flat bag of attributes? When it surfaces a similar prior case, can the analyst see why the system considers them similar, or is it a black box? Does the similarity score get used to rank cases in the queue, or only to enrich them once opened? Can the analyst tell the system that two cases shouldn’t be considered similar, and does that feedback actually change future matches?
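The feedback question is the easiest one to picture in code. What follows is a deliberately small Python sketch of the weakest useful form of feedback, where a pair the analyst has rejected is never shown together again. The class name, method names, case IDs, and scores are all hypothetical.

```python
class SimilarityFeedback:
    """Remembers case pairs an analyst has marked as not similar."""

    def __init__(self) -> None:
        # frozenset makes the pair order-independent: (a, b) equals (b, a).
        self._rejected: set[frozenset[str]] = set()

    def mark_not_similar(self, case_a: str, case_b: str) -> None:
        self._rejected.add(frozenset((case_a, case_b)))

    def filter_matches(self, new_case_id: str,
                       matches: list[tuple[str, float]]) -> list[tuple[str, float]]:
        """Drop any match the analyst has already rejected for this case."""
        return [(case_id, score) for case_id, score in matches
                if frozenset((new_case_id, case_id)) not in self._rejected]


feedback = SimilarityFeedback()
matches = [("C-120", 0.93), ("C-087", 0.88)]  # made-up similarity results

# The analyst working C-300 says C-120 is not actually related.
feedback.mark_not_similar("C-300", "C-120")

print(feedback.filter_matches("C-300", matches))
print(feedback.filter_matches("C-301", matches))
```

Note what this does not do. The rejection only affects the exact pair, so the next new case can still be matched to C-120. A system that genuinely learns from feedback would use those rejections to change how similarity is scored, which is the answer to listen for.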

If the answers add up to “we run an embedding model on the alert fields and show you the top-k matches,” that’s a slight improvement over keyword search and not much more. If the answers describe embedding the full case including the human-written content, surfacing the reasoning behind the match, and learning from analyst feedback over time, you’re looking at something that will materially change how the SOC works once it has six months of history behind it.

That’s the layer we’ve been building inside Securaa. It’s the part of the product that gets quieter and more useful the longer a customer runs it.

The folder of past incidents is already sitting on your shared drive. The only question is whether your platform knows it’s there.
