Back to Blog
2026-04-28Knowledge Tree Team9 min readProduct

The Infrastructure Time Machine: Rewind, Replay, and Simulate Your Infrastructure

time-machinesimulationincident-responseplanning

The Fundamental Problem: Infrastructure Has No Undo Button

Version control changed software development forever. Before Git, reverting a bad change was a painful, manual process. After Git, it's git revert and you're done.

Infrastructure has never had this capability. When you change a security group, modify an IAM policy, or update a Kubernetes deployment, there's no git log for your infrastructure. If something breaks, you're left debugging with incomplete information:

  • What was the state before the change?
  • What exactly changed?
  • Who changed it?
  • What was the dependency tree at that moment?

Knowledge Tree's Time Machine solves this by maintaining a complete, versioned history of your infrastructure state.

How the Time Machine Works

The Time Machine captures your knowledge graph at regular intervals (configurable from every 5 minutes to every hour). Each capture is a full snapshot of every resource, every relationship, and every property.

Temporal Graph Queries

You can query your infrastructure at any point in time:

MATCH (ec2:EC2 {id: 'i-0abcd1234'})
WHERE TIMESTAMP('2026-04-15T14:00:00Z')
RETURN ec2.name, ec2.instance_type, ec2.security_groups

This returns the state of that EC2 instance at exactly 2:00 PM on April 15 — including its security group attachments, even if they've since changed.

Change Diff

Compare any two points in time:

$ kt-discover time-machine diff --from 2026-04-15T14:00:00Z --to 2026-04-15T15:00:00Z

Changes detected in the last hour:
  Added:    17 resources (3 deployments, 12 pods, 2 services)
  Modified: 43 resources (8 security groups, 15 config maps, 20 IAM policies)
  Deleted:  5 resources (3 terminated EC2, 2 deleted S3 buckets)

  Critical changes:
    - Security group 'sg-api-prod' added ingress from 0.0.0.0/0:3306
    - IAM policy 'admin-access' attached to service-account 'ci-runner'

The Incident Timeline

For every incident, the Time Machine automatically generates a timeline showing exactly what changed in the hours leading up to it:

Incident: P1 - Payment processing down
  Time: 2026-04-15 14:32 UTC
  Duration: 47 minutes

  Preceding changes:
    14:15 - Deployment 'payment-worker' updated image from v2.3.1 to v2.4.0
    14:18 - ConfigMap 'payment-config' modified: TIMEOUT_MS 5000 → 1000
    14:22 - 3 pods entered CrashLoopBackOff
    14:25 - Service 'payment-svc' health check failing
    14:30 - PagerDuty alert fired
    14:32 - Incident declared

This turns post-incident analysis from detective work into a simple review. You can literally see what happened, in what order, and how it propagated through the dependency graph.

The What-If Simulator

The Time Machine is retrospective. The What-If Simulator is prospective — it lets you ask "what would happen if..." and get an answer based on your actual graph topology.

Scenario 1: Dependency Impact

$ kt-discover simulate "What if we terminate RDS instance payments-db-primary?"

Simulation result:
  Directly affected resources:
    - 12 EC2 instances lose database connectivity
    - 8 services will return errors on DB queries
    - 3 Lambda functions will fail on invocation
  
  Cascading impact:
    - 3 user-facing APIs will return 500 errors
    - 2 background job processors will stall
    - 1 monitoring dashboard will lose data

  Recommendation: Deploy read replica first, then redirect traffic.
  Estimated safe migration time: 45 minutes.

Scenario 2: Capacity Planning

$ kt-discover simulate "What if we increase payment-worker replicas from 6 to 12?"

Simulation result:
  Resource impact:
    - Requires 6 more EC2 instances (current cluster has 4 free slots)
    - Cluster will need 2 additional nodes to accommodate
    - Additional monthly cost: ~$1,200 (6 x t3.medium)
  
  Dependency impact:
    - Database connections increase from 18 to 36 (under RDS max of 80)
    - Redis connection pool utilization goes from 40% to 80%
    - Downstream API rate limits: not exceeded

  Recommendation: Scale DB connection pool first, then deploy.

Scenario 3: Compliance Check

$ kt-discover simulate "What if we apply 'deny-all-ssh' policy to production?"

Simulation result:
  Resources affected: 847 resources across 3 accounts
  
  Expected violations:
    - 23 security groups will lose SSM access (non-compliant with internal policy)
    - 12 EC2 instances will become unreachable via SSH (expected)
    - 3 bastion hosts will be completely locked out (HIGH RISK)
  
  Recommendations:
    - Add explicit SSM Session Manager policy before applying
    - Configure bastion exceptions for security team
    - Create maintenance window for the change

Technical Architecture

The Time Machine is built on a few key technologies:

Event Sourcing

Every change to the knowledge graph is captured as an immutable event. Events are stored in a dedicated event store (backed by PostgreSQL) that supports temporal queries.

Snapshot Compression

Full snapshots are taken at configurable intervals. Between snapshots, only deltas are stored. This keeps storage costs manageable:

  • Full snapshot: ~500 MB for 50,000 resources
  • Daily deltas: ~5-20 MB depending on change rate
  • Monthly storage: ~2-5 GB for active environments

Temporal Query Engine

The query engine rewrites graph queries to operate on the state of the graph at a specific point in time. This is transparent to the user — the same Cypher queries work, just with a TIMESTAMP clause.

Real Scenario: Root Cause Analysis in 5 Minutes

A Knowledge Tree customer experienced a 23-minute production outage. The on-call team couldn't figure out what caused it — nothing in their change management system showed a deployment at that time.

Using the Time Machine:

  1. They queried the change diff for the 30 minutes before the incident
  2. Found: A ConfigMap change pushed by a CI pipeline that reduced a timeout from 30s to 3s
  3. The change wasn't logged as a "deployment" — it was a quick config update that no one thought to record
  4. The graph showed exactly which services depended on that ConfigMap value

Root cause identified in under 5 minutes. Compare this to the industry average of 2-4 hours for RCA.

Use Cases Beyond Incidents

The Time Machine is valuable for more than post-incident analysis:

  • Capacity planning: See how your infrastructure grows over time and project future needs
  • Cost optimization: Identify resources that were created, never used, and can be decommissioned
  • Compliance evidence: Show auditors the complete history of every resource and change
  • Architecture review: Replay the evolution of your infrastructure over months or years
  • Onboarding: Show new team members how the infrastructure evolved to its current state

Getting Started

The Time Machine is included in all plans:

FeatureTeamBusinessEnterprise
Snapshot interval1 hour15 minutes5 minutes
Retention period30 days90 days365 days
What-If Simulator--IncludedIncluded
Temporal queries7 daysFull retentionFull retention
Export history--PDF/CSVAll formats

What would you ask your infrastructure if you could look back in time? Book a demo to see the Time Machine in action.