Designing Identity Graphs: Tools and Telemetry Every SecOps Team Needs

Daniel Mercer
2026-04-13
23 min read

A technical primer for building identity graphs with telemetry, ingestion pipelines, correlation keys, and anomaly detection for SecOps.

Security teams are no longer dealing with a single identity boundary. They are managing a moving map of employees, contractors, service accounts, APIs, devices, SaaS tenants, partner users, and machine identities across cloud, on-prem, and remote access layers. That is why the modern identity graph has become one of the most useful primitives in SecOps: it turns fragmented authentication, access, and behavior telemetry into a relationship model you can actually investigate. If you want broader context on why visibility matters so much in security programs, see the idea behind real-time operational visibility and the importance of visibility audits in complex digital systems.

Mastercard’s Gerber was right in principle: if you cannot see the system, you cannot protect it. In identity security, that “system” is not just users and passwords; it is the network of who authenticated, from where, on what device, with which role, and how those relationships changed over time. A well-designed identity graph gives your team a way to answer questions like: Is this admin account unusually close to a newly created service principal? Why did a dormant contractor suddenly gain graph proximity to finance data? Are there correlations between impossible travel, privilege escalation, and failed MFA that point to compromise?

This guide is a technical primer for building that capability. We will cover data sources, ingestion pipelines, correlation keys, graph design, anomaly detection, and the telemetry your team needs for threat hunting and compliance. Along the way, we will connect the architecture to adjacent operational patterns like webhook ingestion, automation trust controls, and compliance-aware data handling.

What an Identity Graph Actually Is

Identity graph basics: nodes, edges, and attributes

An identity graph is a data model that represents identities as nodes and relationships as edges. A node might be a human user, a device, a service account, a cloud role, a VPN session, or even a SaaS tenant. An edge represents a relationship such as “authenticated from,” “owns,” “delegates to,” “performed action on,” or “shared same device.” The value comes from combining many weak signals into a coherent picture of trust, access, and behavior.
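To make the model concrete, here is a minimal sketch of node and edge records in Python. All names here (Node, Edge, the field names, the example IDs) are illustrative, not any particular product's schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    id: str    # immutable identifier, e.g. a directory object ID
    kind: str  # "user", "device", "service_account", "role", ...

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    relation: str     # "authenticated_from", "member_of", "owns", ...
    observed_at: str  # ISO-8601 timestamp of the supporting event
    source: str       # telemetry source that produced this edge

# Example: a user authenticating from a device, observed in IdP logs.
alice = Node(id="u-9f2c", kind="user")
laptop = Node(id="d-11a7", kind="device")
edge = Edge(src=alice.id, dst=laptop.id,
            relation="authenticated_from",
            observed_at="2026-04-01T09:14:00Z",
            source="idp_signin_log")
```

Keeping the records frozen makes the "evidence" immutable by construction; mutable interpretation lives elsewhere.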

Unlike a traditional IAM directory, an identity graph is not limited to authoritative records. It is enriched by telemetry from authentication logs, endpoint security, SaaS audit trails, cloud control planes, and network activity. That makes it useful for both detection and investigation because it captures relationships that were never explicitly modeled in an HR system or directory service. For teams building dashboards and operational scorecards, the graph works much like high-signal telemetry dashboards in other domains: you are looking for the connections that explain behavior, not just the raw events.

Why graphs outperform flat tables for SecOps use cases

Flat tables are excellent for point-in-time reporting, but identity incidents are relational. An attacker often looks benign in one table: a successful login, a token issuance, a role assignment, a file access event. The suspicious pattern emerges only when you connect those records across systems and time. Graphs are therefore superior for questions involving lateral movement, privilege chaining, shadow admin paths, duplicate identities, and anomalous peer-group behavior.

Graphs also reduce analyst workload. Instead of correlating dozens of logs by hand, the graph can expose shortest paths, common neighbors, suspicious edge creation, and node centrality shifts. That makes it easier to spot whether a device is reused across multiple high-risk accounts, whether a contractor account has become a bridge into production systems, or whether a new service principal is acting like a human operator. This is similar to how analytics maturity models move from reporting to prediction and action.
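As a sketch of the graph operations mentioned above, the following stdlib-only Python implements shortest-path and common-neighbor lookups over a toy adjacency map. The node names and adjacency layout are hypothetical:

```python
from collections import deque

def shortest_path(adj, start, goal):
    """BFS shortest path over an undirected adjacency dict."""
    if start == goal:
        return [start]
    seen, parent = {start}, {}
    q = deque([start])
    while q:
        cur = q.popleft()
        for nxt in adj.get(cur, ()):
            if nxt in seen:
                continue
            seen.add(nxt)
            parent[nxt] = cur
            if nxt == goal:
                path = [goal]
                while path[-1] != start:
                    path.append(parent[path[-1]])
                return path[::-1]
            q.append(nxt)
    return None  # no path exists

def common_neighbors(adj, a, b):
    return set(adj.get(a, ())) & set(adj.get(b, ()))

# Toy graph: a shared device bridges a contractor and a prod admin.
adj = {
    "contractor": {"dev-1"},
    "dev-1": {"contractor", "admin"},
    "admin": {"dev-1", "prod-role"},
    "prod-role": {"admin"},
}
path = shortest_path(adj, "contractor", "prod-role")
# ['contractor', 'dev-1', 'admin', 'prod-role']
```

The four-hop path is exactly the kind of "contractor has become a bridge into production" finding described above, surfaced mechanically.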

Threat hunting and compliance are both graph problems

Threat hunting needs graph reasoning because adversaries exploit trust relationships. Compliance needs graph reasoning because auditors care about who had access to what, when, and under which controls. In practice, the same graph can support both if it is built with provenance, timestamps, and lineage. That means your model should preserve not only the current state of identity relationships, but also the history of how those relationships evolved.

This is where many programs fail: they build a single “golden record” and lose all the intermediate states. For SecOps, those intermediate states are often the signal. A temporary group membership, a short-lived token, or a transient cloud role may be the exact artifact that explains a breach path. A strong compliance posture also benefits from policies that resemble secure information-sharing architectures and compliance-aware document workflows—the principle is the same: preserve trust boundaries while maintaining traceability.

Data Sources Every Identity Graph Needs

Core sources: IAM, SSO, MFA, and directory telemetry

Your primary sources should include IdP logs, directory changes, authentication events, MFA challenges, password resets, and group membership changes. These are the canonical records that establish identity lifecycle and access posture. Common examples include Okta, Entra ID, Google Workspace, Ping, Active Directory, and LDAP-derived telemetry. You also want events around account creation, deprovisioning, role assignment, and privileged group changes because those edges carry high investigative value.

In practice, many teams underestimate the importance of negative events such as failed logins, denied MFA prompts, and conditional access blocks. Those failures often reveal reconnaissance, brute force attempts, token replay, or session hijacking. They also help build behavioral baselines that reduce false positives during investigations. If your team already ingests structured app events or message-based notifications, the same design lessons from webhook-driven reporting can be reused for identity pipelines.

Cloud and SaaS telemetry: the hidden majority of identity behavior

Most modern identity relationships are exercised in SaaS and cloud control planes, not just in the directory. You should ingest audit logs from Microsoft 365, Google Workspace, Slack, GitHub, AWS CloudTrail, Azure Activity Logs, GCP audit logs, Jira, Salesforce, and other business-critical systems. These sources reveal who granted app consent, who created tokens, which API key accessed which resource, and whether a service account interacted with a human-owned mailbox or ticketing system.

The reason cloud telemetry matters is simple: attackers increasingly prefer to operate through legitimate platform features instead of malware. A token with broad OAuth consent may be more useful than a stolen password, and a misconfigured CI role may be more valuable than an endpoint exploit. For teams dealing with cloud-native complexity, the operating challenge is similar to the one described in sustainable CI pipelines: optimize what you collect, reduce waste, and preserve the fidelity needed for decision-making.

Endpoint, network, and developer telemetry

Endpoint and network logs help tie identity events to devices and sessions. EDR telemetry can answer whether the login came from a managed laptop, an unmanaged personal device, or a suspicious host with malware indicators. VPN, ZTNA, proxy, and DNS logs show network context, while developer tooling such as Git logs, CI/CD logs, and container registry activity can reveal service account abuse or unauthorized code changes. These sources are especially important in environments where build agents and automation identities have broad permissions.

Secops teams should also look at systems that surface operational trust and quality signals. For example, patterns used in developer trust-signal instrumentation can be adapted to monitor commit provenance, repo access, and dependency changes. Likewise, practical guidance from rapid patch-cycle observability is relevant when identity-related changes need to be measured quickly and rolled back safely.

Designing Ingestion Pipelines for Identity Telemetry

Batch, stream, and hybrid ingestion patterns

An identity graph succeeds or fails on ingestion quality. Some telemetry is best ingested in batch, such as daily exports of directory objects or weekly CMDB snapshots. High-value security sources, however, should be streamed or near-real-time: sign-ins, MFA events, privileged role changes, API token creations, and access denials. A hybrid design is usually best because it balances freshness, cost, and complexity.

Streaming helps threat hunters detect fast-moving incidents, while batch jobs are useful for reconciliation and historical backfill. If a source only supports periodic export, treat it as an authoritative sync and attach ingestion timestamp, source timestamp, and latency metrics to every record. These operational concerns mirror the lessons in real-time data fabric design and SLO-aware automation: latency matters, but trust in the pipeline matters more.
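A minimal sketch of attaching ingestion timestamps and latency metrics to every record, assuming ISO-8601 source timestamps. The event_time field name is illustrative, and the clock is injectable so the behavior is deterministic:

```python
from datetime import datetime, timezone

def with_latency(record, source_ts_field="event_time", now=None):
    """Attach ingestion timestamp and source-to-ingest latency (seconds).
    `now` is injectable for testing; defaults to the wall clock."""
    ingested = now or datetime.now(timezone.utc)
    # Accept a trailing "Z" even on Pythons older than 3.11.
    src = datetime.fromisoformat(record[source_ts_field].replace("Z", "+00:00"))
    record["ingested_at"] = ingested.isoformat()
    record["ingest_latency_s"] = (ingested - src).total_seconds()
    return record

fixed_now = datetime(2026, 4, 1, 9, 14, 30, tzinfo=timezone.utc)
evt = with_latency({"event_time": "2026-04-01T09:14:00Z", "type": "signin"},
                   now=fixed_now)
# evt["ingest_latency_s"] == 30.0 for this fixed clock
```

Aggregating ingest_latency_s per source gives you the latency metric the paragraph above calls for, without touching the raw payload.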

Normalization, enrichment, and deduplication

Identity telemetry arrives in inconsistent formats and often contains duplicates. Normalization should standardize usernames, immutable IDs, email aliases, device IDs, tenant IDs, and timestamps. Enrichment should add critical context such as geo-IP, ASN, device posture, business unit, role tier, and asset criticality. Deduplication should be deterministic where possible, but careful not to collapse meaningful changes into a single record.

One useful strategy is to keep raw events immutable while producing normalized “fact” records and graph-ready edges as derived objects. That allows analysts and auditors to trace every relationship back to source evidence. It also reduces friction when regulatory teams need to verify the integrity of identity lineage. The same architecture philosophy appears in document capture systems, where raw artifacts are retained while structured records are generated downstream.
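One way to sketch that pattern: hash the raw payload to get an immutable evidence ID, then emit a derived edge that points back at it. The field names (actor_id, target_id, action) are assumptions for illustration, not a standard schema:

```python
import hashlib
import json

def raw_id(raw_event):
    """Content hash of the raw payload, used as an immutable evidence ID."""
    blob = json.dumps(raw_event, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

def to_edge_fact(raw_event):
    """Derive a graph-ready edge that traces back to its raw evidence."""
    return {
        "src": raw_event["actor_id"],
        "dst": raw_event["target_id"],
        "relation": raw_event["action"],
        "observed_at": raw_event["timestamp"],
        "evidence_id": raw_id(raw_event),  # join key back to raw storage
    }

raw = {"actor_id": "u-9f2c", "target_id": "grp-admins",
       "action": "member_of", "timestamp": "2026-04-01T10:00:00Z"}
fact = to_edge_fact(raw)
```

Because the evidence ID is a content hash, any tampering with the stored raw event breaks the link, which is exactly the integrity property auditors want.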

Pipeline governance, replay, and lineage

Because identity relationships are security-relevant, your pipeline needs replayability and lineage metadata. If a source mapping is wrong, you need to reprocess historical data without losing traceability. If a downstream detector triggers, you need to know exactly which source records and transformation rules contributed to the alert. That means versioning parsers, keeping schema evolution under control, and maintaining per-source quality metrics.

Good teams instrument ingestion like a product: success rate, lag, parse errors, null rates, identity-match rates, and edge-creation counts. These metrics should be visible in the same spirit as live analytics breakdowns, because pipeline health is part of security health. If ingestion silently degrades, your graph can become dangerously incomplete while still looking “green.”

Correlation Keys: How to Join Identities Without Creating Chaos

Prefer immutable identifiers over display names

The most common mistake in identity graph design is overreliance on mutable fields like display names or email addresses. Those change frequently and create false merges or missed joins. Instead, prioritize immutable identifiers where available: directory object IDs, subject GUIDs, tenant-local principal IDs, device UUIDs, cloud account numbers, and provider-issued unique IDs. Use emails and usernames as secondary, human-readable references rather than primary join keys.
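A hedged sketch of that key preference in Python; the field names (object_id, subject_guid, principal_id) stand in for whatever immutable IDs your sources actually expose:

```python
def join_key(record):
    """Pick the strongest available correlation key.
    Immutable IDs first; email only as a last resort, tagged with its
    key type so weak joins stay visible and reviewable."""
    for field in ("object_id", "subject_guid", "principal_id"):
        if record.get(field):
            return (field, record[field])
    if record.get("email"):
        # Normalize but never trust: emails change and get reassigned.
        return ("email", record["email"].strip().lower())
    raise ValueError("no usable correlation key")

# Prefers the immutable object_id over the mutable email.
strong = join_key({"email": "Alice@Example.com", "object_id": "obj-123"})
```

Returning the key type alongside the value means downstream merge logic can treat email-based joins as lower confidence instead of silently mixing them with GUID joins.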

You should also preserve alias history. An account that changed names twice in six months may be entirely legitimate, but it may also indicate an attacker trying to blend in or a business merger that created overlapping identities. Good correlation logic distinguishes stable identity from presentation metadata. This is analogous to how trust-signals audits separate underlying authority from visible branding.

Composite correlation across people, devices, sessions, and roles

Identity correlation becomes much stronger when you use multiple keys together. A human user can be linked to a device, a session, a token, a group membership, and a geographic pattern. A service account can be linked to a repo, a CI runner, a cloud role, and a deployment target. These composite links are where the graph starts to surface hidden relationships, especially when one key changes but the rest remain stable.

For example, if a user logs in from a managed device in New York on Monday, then the same principal appears from an unmanaged device in another region while retaining the same role and file access pattern, the graph should treat that as a meaningful shift. If a service identity suddenly acquires a new upstream dependency and begins to access secrets outside its usual scope, that may be an early indicator of abuse. It is the same kind of path analysis used in secure API exchange architectures, where trust is evaluated across multiple hops instead of a single credential.
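The multi-key comparison can be sketched as a small fingerprint diff; the key names and session records below are hypothetical:

```python
def key_shift(prev, cur, keys=("principal_id", "device_id", "geo", "role")):
    """Compare two session fingerprints across multiple correlation keys
    and report which ones changed. One or two changed keys while the
    rest stay stable is often more telling than any key in isolation."""
    return [k for k in keys if prev.get(k) != cur.get(k)]

monday = {"principal_id": "u-9f2c", "device_id": "d-11a7",
          "geo": "US-NY", "role": "finance-analyst"}
tuesday = {"principal_id": "u-9f2c", "device_id": "d-ffff",
           "geo": "EU-RO", "role": "finance-analyst"}
changed = key_shift(monday, tuesday)  # device and geo moved, identity stable
```

The "stable principal, new device, new region, same role" combination is the exact pattern described above: the graph should score the shift, not just the individual fields.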

Handle identity collisions, mergers, and shared accounts explicitly

Identity collisions happen constantly in enterprises. Two contractors may share a display name, a team may reuse a generic mailbox, or a merged company may bring in duplicate HR records. Do not suppress these realities; model them. The graph should support confidence scores, relationship types, and human review states so that ambiguous matches do not become invisible. Shared accounts should be labeled as shared accounts, not forced into the fiction of a single person node.

That distinction matters for both hunting and compliance. A shared admin account with no attribution is a blind spot; a service account with multiple owners and approved usage is not. In fact, a mature graph often improves accountability because it makes implicit trust explicit. This principle lines up with the operational transparency recommended in postmortem knowledge bases: write down what happened, preserve the evidence, and make ambiguity visible.

Graph Modeling Choices That Make or Break Detection

Choose the right node and edge types

Modeling should reflect how analysts think during an investigation. Useful node types include user, service account, device, host, application, role, group, session, token, repository, tenant, and resource. Useful edge types include authenticated_from, owns, member_of, assumed_role, consented_to, accessed, generated_token, initiated_session, and shared_device_with. Keep edge semantics precise because vague edges produce vague detections.

Edges should be time-bound wherever possible. A role assignment edge should have start and end timestamps, source system, and change ticket reference if available. This lets the graph support temporal reasoning such as “who was effectively an admin during the incident window?” or “which devices were shared before account compromise?” The same rigor appears in compliance-oriented document systems, where timestamps and provenance are non-negotiable.
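A sketch of temporal reasoning over time-bound edges, using a standard interval-overlap test. The edge fields and the member_of_admins relation name are illustrative:

```python
from datetime import datetime

def active_during(edges, relation, start_iso, end_iso):
    """Subjects whose time-bound edge of the given relation overlapped
    the window [start, end). An open-ended edge (end=None) is treated
    as still active."""
    s, e = datetime.fromisoformat(start_iso), datetime.fromisoformat(end_iso)
    hits = set()
    for edge in edges:
        if edge["relation"] != relation:
            continue
        es = datetime.fromisoformat(edge["start"])
        ee = datetime.fromisoformat(edge["end"]) if edge["end"] else datetime.max
        if es < e and ee > s:  # interval-overlap test
            hits.add(edge["src"])
    return hits

edges = [
    {"src": "u-1", "relation": "member_of_admins",
     "start": "2026-03-01T00:00:00", "end": "2026-03-02T00:00:00"},
    {"src": "u-2", "relation": "member_of_admins",
     "start": "2026-02-01T00:00:00", "end": None},
]
# Who was effectively an admin during the incident window?
admins = active_during(edges, "member_of_admins",
                       "2026-03-01T12:00:00", "2026-03-01T13:00:00")
```

The same query with a different window answers the "exact access posture on a given date" question auditors ask.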

Use confidence scores and provenance metadata

Not all relationships are equally trustworthy. A direct event from an IdP is more authoritative than an inferred relationship from a downstream app log. Provenance metadata should indicate source, transform version, extraction method, and confidence. Confidence scores do not need to be perfect; they need to be consistent enough for the analyst to prioritize high-risk paths and ignore weak, speculative ones unless the case warrants it.
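For illustration, provenance and confidence might be carried on each edge like this; the field names and the 0.7 review threshold are assumptions to tune locally:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Provenance:
    source: str             # e.g. "idp_signin_log"
    transform_version: str  # parser/mapping version that produced the edge
    method: str             # "direct_event" or "inferred"
    confidence: float       # consistent 0..1 scale, not a probability claim

# A direct IdP event outranks a relationship inferred from an app log.
direct = Provenance("idp_signin_log", "v12", "direct_event", 0.95)
inferred = Provenance("saas_audit_log", "v12", "inferred", 0.55)

def review_queue(edges, threshold=0.7):
    """Edges below the confidence threshold go to human review
    instead of being merged automatically."""
    return [e for e in edges if e[1].confidence < threshold]

queue = review_queue([("edge-a", direct), ("edge-b", inferred)])
```

Recording transform_version is what makes reprocessing honest: when a parser changes, you can tell which edges were produced under the old rules.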

This approach is especially useful in federated environments, where multiple tenants, partners, and subsidiaries contribute data. The graph can surface a likely relationship while still preserving the uncertainty. That is often better than pretending the world is exact. Teams that have learned to manage uncertainty in regulated data exchange workflows will recognize the value immediately.

Temporal graphs for before/after analysis

Most identity incidents are temporal, not static. The graph should support point-in-time snapshots as well as event timelines. This allows you to answer questions such as whether an edge existed before an alert, whether a permission chain was created minutes before exfiltration, or whether an account had abnormal access only during a short maintenance window. Temporal data also helps with compliance narratives, especially when an auditor asks for the exact access posture on a given date.

For teams planning resilience and long-lived observability, the lesson is similar to sustainable CI: systems that preserve time context are easier to operate, debug, and defend. Without time, graphs become decorative rather than investigative.

Anomaly Detection and Threat Hunting on Identity Graphs

Common identity anomalies to detect first

Start with anomalies that have clear operational meaning. Examples include impossible travel, first-time device use for privileged roles, new admin group membership, dormant account reactivation, service accounts authenticating interactively, token creation outside approved windows, and high-risk access to sensitive resources by low-history principals. These patterns are straightforward to explain to analysts and auditors, which makes them ideal first detections.

Another high-value pattern is relationship churn. If an identity suddenly gains many new edges in a short window, especially to sensitive systems or privileged groups, that is worth scrutiny. Graph-based alerts should also look for outlier path lengths, such as a contractor identity reaching production secrets through an unusual chain of role assumptions and cross-account delegation. That sort of reasoning is similar to how cost spikes are modeled in business systems: the change matters because the relationship structure changed.
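Relationship churn can be sketched as a windowed count of new edges per identity. The threshold and window below are illustrative and should be tuned per peer group:

```python
from collections import Counter
from datetime import datetime, timedelta

def churn_alerts(edge_events, window_end, window_hours=24, threshold=10):
    """Count new edges per identity inside a trailing window and flag
    identities at or above the threshold.
    edge_events: iterable of (identity, iso_timestamp) pairs."""
    end = datetime.fromisoformat(window_end)
    start = end - timedelta(hours=window_hours)
    counts = Counter(ident for ident, ts in edge_events
                     if start <= datetime.fromisoformat(ts) <= end)
    return {ident: n for ident, n in counts.items() if n >= threshold}

# A contractor identity suddenly gains 12 edges; a normal user gains 2.
events = ([("c-7", "2026-04-01T10:00:00")] * 12
          + [("u-1", "2026-04-01T09:00:00")] * 2)
alerts = churn_alerts(events, "2026-04-01T23:59:00")  # {"c-7": 12}
```

In production you would weight edges to sensitive systems more heavily than a flat count, but the windowed-delta shape stays the same.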

Peer group and neighborhood analysis

Analysts should compare identities against their nearest peers, not just global baselines. A finance analyst’s access pattern should resemble that of similar finance analysts, not the whole company. Graph neighborhoods make it easier to see whether one identity stands out in terms of shared devices, unusual SaaS scopes, or cross-functional access. Peer-group analysis dramatically reduces false positives because it grounds behavior in context.

For instance, a developer with temporary access to a production project may not be suspicious if several teammates have the same entitlement during a release. But if that developer also owns a new token, a new device, and a newly accepted OAuth grant, the combination becomes meaningful. This is where descriptive, diagnostic, and predictive analytics converge in one workflow.
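A simple peer-group sketch using z-scores over a hypothetical group of finance analysts. Note that in practice robust statistics (median/MAD) handle extreme outliers better, because a large outlier inflates the standard deviation it is measured against:

```python
from statistics import mean, pstdev

def peer_outliers(access_counts, z_threshold=2.0):
    """Flag identities whose access volume sits far above their peer
    group. The threshold is modest because the outlier itself inflates
    sigma; median/MAD is more robust in production."""
    values = list(access_counts.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return []  # perfectly uniform peer group, nothing to flag
    return [ident for ident, v in access_counts.items()
            if (v - mu) / sigma > z_threshold]

# Hypothetical finance peer group: one analyst far outside the pattern.
finance = {"a": 20, "b": 22, "c": 19, "d": 21,
           "e": 20, "f": 21, "g": 19, "h": 22, "x": 95}
flagged = peer_outliers(finance)  # ["x"]
```

The key design point is the input: access_counts is scoped to one peer group, not the whole company, which is what keeps the false-positive rate down.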

Graph queries for hunts and investigations

Your hunters should have a small set of repeatable graph queries they can run quickly. Examples include: all principals within two hops of a newly created admin role; all devices associated with multiple highly privileged identities; all service accounts that touched secrets after a password reset; and all identities whose access graph changed within the last 24 hours. These queries should be packaged into hunt playbooks so they can be reused during incidents.
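Two of those hunt queries can be sketched over a toy adjacency map; the dev- naming convention and the node IDs are purely illustrative:

```python
def within_two_hops(adj, start):
    """All nodes reachable within two hops of a node, e.g. every
    principal near a newly created admin role."""
    one_hop = set(adj.get(start, ()))
    two_hop = set()
    for n in one_hop:
        two_hop |= set(adj.get(n, ()))
    return (one_hop | two_hop) - {start}

def shared_privileged_devices(adj, privileged):
    """Devices linked to more than one highly privileged identity."""
    return {d for d in adj
            if d.startswith("dev-")  # illustrative device-ID convention
            and len(set(adj[d]) & privileged) > 1}

adj = {
    "role-new-admin": {"u-1", "u-2"},
    "u-1": {"role-new-admin", "dev-1"},
    "u-2": {"role-new-admin", "dev-1"},
    "dev-1": {"u-1", "u-2"},
}
near = within_two_hops(adj, "role-new-admin")   # u-1, u-2, and dev-1
shared = shared_privileged_devices(adj, {"u-1", "u-2"})  # {"dev-1"}
```

Packaging queries like these as named functions is the playbook pattern: analysts call them by intent during an incident instead of composing traversals under pressure.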

Operationally, treat the graph as a living investigation workspace. You are not just generating alerts; you are mapping relationships that explain why the alert happened. If your team already uses streaming operational dashboards, borrowing ideas from real-time scoreboards can help: the important part is who changed, when, and relative to what baseline.

Compliance, Retention, and Evidence Quality

Retention policies should reflect investigative value

Identity graph data is only as useful as your ability to look back in time. Retention should be aligned with investigative needs, regulatory obligations, and storage costs. Keep raw telemetry long enough to reconstruct the graph after schema changes or parser fixes, and maintain derived graph history for the periods most likely to matter in incident response. If legal or regulatory requirements are strict, consider immutable storage tiers for high-value source data.

Retention should not be a blind archive. It should be indexed enough to support evidence requests and chain-of-custody requirements. That means capturing source system, event IDs, ingestion version, transformation time, and analyst actions taken on the record. For organizations that need strong governance, the compliance lessons from AI and document management translate well to security telemetry.

Evidence integrity and auditability

Auditors and incident responders both need to trust the graph. Preserve raw payloads where possible, store hashes for integrity verification, and separate evidence from interpretation. When a detector flags a suspicious relationship, the underlying facts should remain available for review. This separation is especially important when the graph is used to support disciplinary action, regulatory reporting, or legal review.

A practical rule: every suspicious edge should be explainable in plain English and traceable to source logs. If an analyst cannot tell where the edge came from, the graph is not yet production-grade. The discipline is similar to how teams prove product quality in quality validation partnerships: evidence beats assertion.

Access controls for the graph itself

Ironically, the identity graph can become a high-value target. Restrict who can query it, who can modify enrichments, and who can export graph snapshots. Separate analyst access from engineering access, and log every action taken against the graph. If the graph includes HR-adjacent or privacy-sensitive fields, apply data minimization and masking where appropriate.

Strong access controls also protect operator trust. If users know the graph is curated and governed, they are more likely to rely on it during an incident. That trust-building effect is similar to how trust-signal audits improve confidence in public-facing systems.

Operational Metrics, Tooling, and Team Workflow

Metrics that tell you whether the graph is healthy

A mature identity graph program should track source coverage, ingestion lag, entity resolution accuracy, false merge rate, orphan node rate, edge freshness, and alert precision. These metrics tell you whether the graph reflects reality or merely collects data. A graph with 95% source coverage but stale access edges may be more dangerous than a smaller, fresher graph because it creates a false sense of completeness.
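A few of those metrics can be sketched directly. The record fields (source, observed_at, src, dst) are assumed names, not a standard schema:

```python
from datetime import datetime, timedelta

def graph_health(nodes, edges, expected_sources, now_iso, fresh_hours=24):
    """Compute source coverage, edge freshness, and orphan-node rate.
    A high-coverage graph with stale edges still scores poorly here,
    which is the point."""
    seen = {e["source"] for e in edges}
    cutoff = datetime.fromisoformat(now_iso) - timedelta(hours=fresh_hours)
    fresh = sum(1 for e in edges
                if datetime.fromisoformat(e["observed_at"]) >= cutoff)
    connected = {e["src"] for e in edges} | {e["dst"] for e in edges}
    orphans = [n for n in nodes if n not in connected]
    return {
        "source_coverage": len(seen & expected_sources) / len(expected_sources),
        "edge_freshness": fresh / len(edges) if edges else 0.0,
        "orphan_node_rate": len(orphans) / len(nodes) if nodes else 0.0,
    }

edges = [
    {"source": "idp", "observed_at": "2026-04-01T10:00:00",
     "src": "u-1", "dst": "d-1"},
    {"source": "idp", "observed_at": "2026-03-01T10:00:00",
     "src": "u-2", "dst": "d-1"},
]
health = graph_health(["u-1", "u-2", "d-1", "ghost"], edges,
                      {"idp", "edr"}, "2026-04-01T12:00:00")
```

Here the EDR source is expected but absent and one node is orphaned, so the scorecard flags exactly the "looks green but incomplete" failure mode described above.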

Teams should review these metrics alongside security outcomes. How many hunts used graph queries last month? How many detections were escalated because of a graph-derived relationship? How many incidents had to be re-investigated because a missing edge changed the conclusion? This is the security equivalent of tracking business outcomes rather than vanity metrics, as seen in effective metrics programs.

Tooling stack: from storage to visualization

Your stack usually needs four layers: collection, normalization, graph storage, and analysis. Collection can be done with log pipelines, agents, cloud connectors, or event buses. Normalization often lives in ETL or ELT jobs. Graph storage may be a dedicated graph database, a search platform with graph capabilities, or a lakehouse plus materialized relationship views. Analysis and visualization should support query exploration, timeline replay, and investigation notes.

Pick tools based on query patterns, not hype. If your main use case is path finding and neighborhood analysis, optimize for graph traversal performance. If your main use case is compliance reporting, optimize for lineage, historical snapshots, and exportability. If your analysts need quick context during incident response, choose a UI that makes relationship drill-down intuitive rather than forcing them to write specialized queries under pressure.

Team workflow: detection, triage, investigation, and feedback

An identity graph only gets better when feedback loops are built into the workflow. Analysts should be able to mark false positives, confirm suspicious relationships, and add context that improves future detections. Engineering should use that feedback to refine mappings, thresholds, and confidence scoring. Compliance should periodically validate that the data retained in the graph still satisfies reporting and audit requirements.

Cross-functional workflow matters because identity data is shared infrastructure. It supports SecOps, IAM, audit, legal, and platform engineering all at once. In mature orgs, this is similar to the way creative operations or postmortem systems scale through shared process, not isolated heroics.

Implementation Roadmap: From Zero to Useful Graph

Phase 1: establish authoritative sources and core joins

Start with the minimum viable graph: directory, SSO, MFA, device, and cloud sign-in telemetry. Define canonical identifiers, normalize timestamps, and create a small set of high-confidence edges. At this stage, the goal is not elegance; it is reliable visibility into who exists, who can authenticate, and which systems they touch. Make sure you can answer basic questions such as which privileged accounts exist and which devices they used recently.

This phase should also establish ownership. Decide who owns source mappings, who approves schema changes, and who responds when a source disappears. A graph program without governance tends to decay quickly, just like poorly managed change pipelines in other technical domains. The discipline used in fast rollback pipelines is a good analogy: small changes, rapid feedback, strict observability.

Phase 2: enrich with cloud, SaaS, and session context

Next, add cloud audit logs, SaaS activity, session metadata, and endpoint posture. This is where the graph becomes operationally valuable for hunting and incident response because it starts connecting identity to behavior. Add enrichment fields that help rank risk, such as geo, device compliance, tenant, app sensitivity, and access tier.

Once this phase is in place, begin building hunts for newly created privilege edges, atypical access sequences, and service account abuse. At the same time, start producing compliance views such as “who had access to sensitive systems during the last quarter?” The graph should now support both retrospective and real-time security questions.

Phase 3: automate detections and governance

In the final phase, automate anomaly scoring and response workflows. Alert when high-risk graph changes occur, such as unexpected role inheritance, cross-tenant access, or unusual token behavior. Attach playbooks to alerts so analysts know what to validate first. Feed analyst decisions back into the graph to improve confidence and reduce noise over time.

For organizations scaling this across multiple teams, it can help to think in terms of product maturity. The graph becomes a platform with users, SLAs, roadmaps, and technical debt. The more the team treats it as a durable security product, the more value it creates. That mindset is echoed in trustworthy automation programs and real-time operational fabrics.

Reference Comparison: Data Sources and What They Catch

| Data Source | What It Adds to the Identity Graph | Best Detection Use Cases | Typical Pitfall | Priority |
| --- | --- | --- | --- | --- |
| SSO / IdP logs | Login, MFA, session, group and role changes | Impossible travel, account takeover, privilege escalation | Overreliance on email aliases | Critical |
| Directory services | Authoritative user, group, and device records | Lifecycle tracking, deprovisioning gaps | Delayed sync and stale attributes | Critical |
| Cloud audit logs | Role assumptions, API activity, resource access | Token misuse, cross-account abuse, unusual admin actions | Noise from routine automation | Critical |
| EDR / endpoint telemetry | Device posture and user-device binding | Unmanaged devices, suspicious logins, malware-assisted access | Coverage gaps for BYOD and contractors | High |
| SaaS audit logs | App consent, sharing, mailbox, and collaboration behavior | OAuth abuse, data exfiltration, lateral movement | Inconsistent vendor schemas | High |
| CI/CD and developer telemetry | Build identities, repo access, deployment actions | Supply-chain abuse, secret theft, unauthorized releases | Service accounts poorly attributed | High |

FAQ: Identity Graph Design for SecOps

What is the minimum set of data sources needed for a useful identity graph?

At minimum, ingest your identity provider, directory, MFA, and cloud sign-in logs. That combination lets you model who authenticated, from where, with what device context, and under which access changes. If you can add endpoint telemetry and cloud audit logs early, your graph becomes much more useful for hunting and incident response.

Should we build the graph in a graph database or a data lake?

Use the platform that best matches your query patterns and operational requirements. A graph database is often the best fit when analysts need frequent traversal, path finding, and neighborhood analysis. A lakehouse can work well if your team prioritizes historical replay, batch enrichment, and integration with broader analytics workflows.

How do we avoid false joins when correlating identities?

Use immutable identifiers first, then attach aliases and human-friendly names as secondary attributes. Preserve provenance and confidence scores so uncertain matches can be reviewed rather than blindly merged. Also treat shared accounts and duplicate identities as first-class entities instead of forcing them into a single user record.

What anomalies should we hunt for first?

Start with impossible travel, dormant account reactivation, new admin group membership, service accounts used interactively, unusual role assumptions, and sudden relationship churn. These patterns are easy to explain and often map directly to real security risk. Once those are stable, add peer-group analysis and temporal path analysis for more advanced hunts.

How does an identity graph help compliance teams?

It creates a defensible history of who had access to what and when. That helps with audits, access reviews, incident reconstruction, and evidence requests. If you preserve source lineage and timestamps, the graph can serve as a reliable audit artifact rather than just a security visualization.


Related Topics

#Engineering #ThreatHunting #Identity

Daniel Mercer

Senior Security Engineering Editor

Senior editor and content strategist. Writing about technology, design, and the future of digital media. Follow along for deep dives into the industry's moving parts.
