The Fundamental Problem: Infrastructure Has No Undo Button
Version control changed software development forever. Before Git, reverting a bad change was a painful, manual process. After Git, it's git revert and you're done.
Infrastructure has never had this capability. When you change a security group, modify an IAM policy, or update a Kubernetes deployment, there's no git log for your infrastructure. If something breaks, you're left debugging with incomplete information:
- What was the state before the change?
- What exactly changed?
- Who changed it?
- What was the dependency tree at that moment?
Knowledge Tree's Time Machine solves this by maintaining a complete, versioned history of your infrastructure state.
How the Time Machine Works
The Time Machine captures your knowledge graph at regular intervals (configurable from every 5 minutes to every hour). Each capture is a full snapshot of every resource, every relationship, and every property.
Temporal Graph Queries
You can query your infrastructure at any point in time:
MATCH (ec2:EC2 {id: 'i-0abcd1234'})
WHERE TIMESTAMP('2026-04-15T14:00:00Z')
RETURN ec2.name, ec2.instance_type, ec2.security_groups
This returns the state of that EC2 instance at exactly 2:00 PM on April 15 — including its security group attachments, even if they've since changed.
Change Diff
Compare any two points in time:
$ kt-discover time-machine diff --from 2026-04-15T14:00:00Z --to 2026-04-15T15:00:00Z
Changes detected in the last hour:
Added: 17 resources (3 deployments, 12 pods, 2 services)
Modified: 43 resources (8 security groups, 15 config maps, 20 IAM policies)
Deleted: 5 resources (3 terminated EC2, 2 deleted S3 buckets)
Critical changes:
- Security group 'sg-api-prod' added ingress from 0.0.0.0/0:3306
- IAM policy 'admin-access' attached to service-account 'ci-runner'
The Incident Timeline
For every incident, the Time Machine automatically generates a timeline showing exactly what changed in the hours leading up to it:
Incident: P1 - Payment processing down
Time: 2026-04-15 14:32 UTC
Duration: 47 minutes
Preceding changes:
14:15 - Deployment 'payment-worker' updated image from v2.3.1 to v2.4.0
14:18 - ConfigMap 'payment-config' modified: TIMEOUT_MS 5000 → 1000
14:22 - 3 pods entered CrashLoopBackOff
14:25 - Service 'payment-svc' health check failing
14:30 - PagerDuty alert fired
14:32 - Incident declared
This turns post-incident analysis from detective work into a simple review. You can literally see what happened, in what order, and how it propagated through the dependency graph.
The What-If Simulator
The Time Machine is retrospective. The What-If Simulator is prospective — it lets you ask "what would happen if..." and get an answer based on your actual graph topology.
Scenario 1: Dependency Impact
$ kt-discover simulate "What if we terminate RDS instance payments-db-primary?"
Simulation result:
Directly affected resources:
- 12 EC2 instances lose database connectivity
- 8 services will return errors on DB queries
- 3 Lambda functions will fail on invocation
Cascading impact:
- 3 user-facing APIs will return 500 errors
- 2 background job processors will stall
- 1 monitoring dashboard will lose data
Recommendation: Deploy read replica first, then redirect traffic.
Estimated safe migration time: 45 minutes.
Scenario 2: Capacity Planning
$ kt-discover simulate "What if we increase payment-worker replicas from 6 to 12?"
Simulation result:
Resource impact:
- Requires 6 more EC2 instances (current cluster has 4 free slots)
- Cluster will need 2 additional nodes to accommodate
- Additional monthly cost: ~$1,200 (6 x t3.medium)
Dependency impact:
- Database connections increase from 18 to 36 (under RDS max of 80)
- Redis connection pool utilization goes from 40% to 80%
- Downstream API rate limits: not exceeded
Recommendation: Scale DB connection pool first, then deploy.
Scenario 3: Compliance Check
$ kt-discover simulate "What if we apply 'deny-all-ssh' policy to production?"
Simulation result:
Resources affected: 847 resources across 3 accounts
Expected violations:
- 23 security groups will lose SSM access (non-compliant with internal policy)
- 12 EC2 instances will become unreachable via SSH (expected)
- 3 bastion hosts will be completely locked out (HIGH RISK)
Recommendations:
- Add explicit SSM Session Manager policy before applying
- Configure bastion exceptions for security team
- Create maintenance window for the change
Technical Architecture
The Time Machine is built on a few key technologies:
Event Sourcing
Every change to the knowledge graph is captured as an immutable event. Events are stored in a dedicated event store (backed by PostgreSQL) that supports temporal queries.
Snapshot Compression
Full snapshots are taken at configurable intervals. Between snapshots, only deltas are stored. This keeps storage costs manageable:
- Full snapshot: ~500 MB for 50,000 resources
- Daily deltas: ~5-20 MB depending on change rate
- Monthly storage: ~2-5 GB for active environments
Temporal Query Engine
The query engine rewrites graph queries to operate on the state of the graph at a specific point in time. This is transparent to the user — the same Cypher queries work, just with a TIMESTAMP clause.
Real Scenario: Root Cause Analysis in 5 Minutes
A Knowledge Tree customer experienced a 23-minute production outage. The on-call team couldn't figure out what caused it — nothing in their change management system showed a deployment at that time.
Using the Time Machine:
- They queried the change diff for the 30 minutes before the incident
- Found: A ConfigMap change pushed by a CI pipeline that reduced a timeout from 30s to 3s
- The change wasn't logged as a "deployment" — it was a quick config update that no one thought to record
- The graph showed exactly which services depended on that ConfigMap value
Root cause identified in under 5 minutes. Compare this to the industry average of 2-4 hours for RCA.
Use Cases Beyond Incidents
The Time Machine is valuable for more than post-incident analysis:
- Capacity planning: See how your infrastructure grows over time and project future needs
- Cost optimization: Identify resources that were created, never used, and can be decommissioned
- Compliance evidence: Show auditors the complete history of every resource and change
- Architecture review: Replay the evolution of your infrastructure over months or years
- Onboarding: Show new team members how the infrastructure evolved to its current state
Getting Started
The Time Machine is included in all plans:
| Feature | Team | Business | Enterprise |
|---|---|---|---|
| Snapshot interval | 1 hour | 15 minutes | 5 minutes |
| Retention period | 30 days | 90 days | 365 days |
| What-If Simulator | -- | Included | Included |
| Temporal queries | 7 days | Full retention | Full retention |
| Export history | -- | PDF/CSV | All formats |
What would you ask your infrastructure if you could look back in time? Book a demo to see the Time Machine in action.