Machine Memory

The Machine Memory component forms the backbone of state persistence and concurrency control in the Arvo orchestration system. It embodies a carefully considered balance between reliability and simplicity, implementing an optimistic locking strategy that prioritizes system recovery over perfect consistency.

Design Philosophy

State management in distributed systems presents unique challenges that require careful consideration of failure modes and recovery patterns. The Machine Memory interface adopts an optimistic approach to handling transient failures, implementing targeted retry strategies for different operations based on their impact on system consistency. Read operations employ a measured retry strategy with backoff, acknowledging that temporary unavailability should not immediately fail orchestration attempts. In contrast, write operations fail fast to maintain state consistency, as retrying writes could lead to incorrect state transitions or duplicate processing.

The locking mechanism follows the principle of "fail fast on acquire, be tolerant on release." This approach recognizes that while acquiring a lock is a critical operation that must succeed definitively, lock release failures should not block system progress. A mandatory TTL mechanism ensures system recovery even when explicit lock releases fail, preventing resource deadlocks while maintaining processing guarantees.

Interface

The IMachineMemory interface encapsulates these principles through four core operations. Each operation has specific reliability characteristics and error handling requirements derived from extensive experience with distributed orchestration systems:

interface IMachineMemory<T extends Record<string, any>> {
  read(id: string): Promise<T | null>;
  write(id: string, data: T): Promise<void>;
  lock(id: string): Promise<boolean>;
  unlock(id: string): Promise<boolean>;
}

Simple Implementation

While the interface supports sophisticated distributed implementations, the SimpleMachineMemory class provides an in-memory reference implementation suitable for development and testing scenarios. This implementation maintains atomic operations and correct locking semantics within a single node, making it ideal for container applications and request-scoped workflows.

const memory = new SimpleMachineMemory();
const orchestrator = createArvoOrchestrator({
  memory,
  executionunits: 0.1,
  machines: [workflow],
});

Production Implementations

Production deployments require careful consideration of distributed system challenges. When implementing the IMachineMemory interface for production use, several key aspects demand attention. The locking mechanism must include TTL-based expiry to prevent permanent resource locks, while read operations should implement a carefully tuned retry strategy that balances availability with responsiveness. Write operations must prioritize consistency, failing fast rather than potentially corrupting state through retry attempts.

Example of a Redis-based implementation considering these factors:

class RedisMachineMemory implements IMachineMemory<MachineMemoryRecord> {
  constructor(private redis: Redis) {}

  async read(id: string): Promise<MachineMemoryRecord | null> {
    // Implementation with retry strategy
  }
  // Additional implementation...
}

Observability and Monitoring

The Machine Memory system plays a critical role in orchestration reliability, making observability essential. Implementations should integrate with OpenTelemetry to provide insights into operation latencies, failure rates, and lock acquisition patterns. This telemetry helps identify potential bottlenecks and system health issues before they impact reliability.

Implementation Guidelines

A successful Machine Memory implementation must carefully consider error handling, timeout configurations, and retry strategies. Read operations should implement quick retries with exponential backoff, while writes should fail fast to maintain consistency. Lock operations require careful timeout configuration with mandatory TTL mechanisms to prevent deadlocks. These guidelines emerge from practical experience with distributed orchestration systems and aim to balance reliability with operational simplicity.

The interface's design recognizes that perfect consistency is often less important than reliable system recovery in distributed orchestration scenarios. By providing clear behavioral contracts and implementation guidelines, it enables developers to create robust state management solutions while avoiding common distributed systems pitfalls.