I’m sharing for discussion a minimalist approach for migration & upgrades of Dstack applications. The main function of the Replicatoor module is to pass secret state (a secure file) from an existing node to a new node.
As usual, my goal is to push back against a tendency to overcomplicate things, which I think (though it's up for debate) would be the case with alternative approaches involving key managers, proxies, or RA-TLS.
This is essentially the same as the replication module from amiller/dstack-vm. Instead of placing this in the CVM build, it goes in the application container. This way even without requiring special support from the CVM, applications can do contract-mediated upgrades and migrations.
This process involves three rounds:
1. The new node prepares a request (a public key and quote).
2. The old node encrypts the secure file to the public key.
3. The new node decrypts the file.
In step 2 the old node also inspects the quote from the new node and compares it with reference values managed by the UpgradeOperator contract.
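To make the message flow concrete, here is a rough sketch of the three rounds using only the Python standard library. This is not the code in replicatoor.py. The function names, the dict-based mock quote, the `measurement` and `report_data` fields, and the toy Diffie-Hellman encryption are all assumptions for illustration. In particular, binding the public key into the quote's report data is my assumption about how the request would be tied together, a real quote is hardware-signed and has to be verified, and the set `reference_values` stands in for whatever the UpgradeOperator contract returns.

```python
import hashlib
import hmac
import secrets

# Toy Diffie-Hellman group, stdlib only. NOT secure: a real node would use a
# vetted public-key encryption scheme from an audited library.
P = 2**521 - 1  # a Mersenne prime, used here only as a toy modulus
G = 3


def _int_bytes(x: int) -> bytes:
    return x.to_bytes(66, "big")  # 521 bits fit in 66 bytes


def _derive_keys(shared: int):
    # Separate keys for encryption and authentication.
    seed = hashlib.sha256(_int_bytes(shared)).digest()
    return hashlib.sha256(b"enc" + seed).digest(), hashlib.sha256(b"mac" + seed).digest()


def _keystream(key: bytes, n: int) -> bytes:
    blocks = (hashlib.sha256(key + i.to_bytes(8, "big")).digest() for i in range(n // 32 + 1))
    return b"".join(blocks)[:n]


def new_node_request(measurement: str):
    """Round 1: the new node makes a keypair and a (mock) quote bound to it."""
    sk = secrets.randbelow(P - 3) + 2
    pk = pow(G, sk, P)
    quote = {  # mock quote: hypothetical fields, no hardware signature
        "measurement": measurement,
        "report_data": hashlib.sha256(_int_bytes(pk)).hexdigest(),
    }
    return sk, {"pubkey": pk, "quote": quote}


def old_node_respond(request, secret_file: bytes, reference_values: set):
    """Round 2: check the quote against reference values, then encrypt."""
    quote, pk = request["quote"], request["pubkey"]
    if quote["measurement"] not in reference_values:
        raise PermissionError("measurement is not an approved reference value")
    if quote["report_data"] != hashlib.sha256(_int_bytes(pk)).hexdigest():
        raise PermissionError("quote does not bind this public key")
    eph_sk = secrets.randbelow(P - 3) + 2
    enc_key, mac_key = _derive_keys(pow(pk, eph_sk, P))
    stream = _keystream(enc_key, len(secret_file))
    ciphertext = bytes(a ^ b for a, b in zip(secret_file, stream))
    tag = hmac.new(mac_key, ciphertext, hashlib.sha256).hexdigest()
    return {"eph_pubkey": pow(G, eph_sk, P), "ciphertext": ciphertext, "tag": tag}


def new_node_decrypt(sk: int, response) -> bytes:
    """Round 3: the new node authenticates and decrypts the secure file."""
    enc_key, mac_key = _derive_keys(pow(response["eph_pubkey"], sk, P))
    ciphertext = response["ciphertext"]
    expected = hmac.new(mac_key, ciphertext, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, response["tag"]):
        raise ValueError("ciphertext failed authentication")
    stream = _keystream(enc_key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream))
```

The point of the sketch is the shape of the exchange: the old node refuses to encrypt unless the quote matches an approved reference value, and only the holder of the private key from round 1 can recover the file.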
Because this is packaged as a single page endpoint service (just 250 lines in replicatoor.py), it can be included in an app container. It is also packaged as a microservice dockerfile on its own, so it can be “dropped in” to a Dstack docker compose next to the rest of the application.
See here for how the replicatoor is used as a drop-in module in the Teeheehee docker compose definition:
Help wanted: what’s still missing from Dstack upgrade process?
- Full checking of TCB info. Currently the replicatoor is not checking TCB info in the quote. Does this need the on-chain PCCS as a reference?
- How can an auditor run the entire VM and rootfs on a non-TDX machine for simulation? Ideally without running the entire teepod. Are there best practices applications can follow so it’s easier for this to take place?
- What version of rootfs are we using? The file image is available on the machine itself, but I’m not sure yet how to reproduce it.
- No clear process for generating the RTMR3 reference from a laptop. Right now I get it from the dashboard, but that doesn’t help auditors.
- The UpgradeOperator contract doesn’t have an “audit surface” dashboard yet. Are invalid configurations still allowed? Were invalid configurations allowed previously and now disabled? For the Teeheehee bot, we are trying to prove an entire “chain of custody”, so any gaps are a problem.
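On the RTMR3 item above: the extend operation itself is simple enough to replay on a laptop. A TDX runtime measurement register starts as 48 zero bytes, and each extend replaces it with SHA-384 of the old value concatenated with a 48-byte event digest. Below is a minimal sketch of that replay. The function name `replay_rtmr` is hypothetical, and the sketch deliberately does not say which events Dstack extends into RTMR3 or how each digest is formed. That list would have to come from the Dstack source or the CVM's event log, which is exactly the missing piece.

```python
import hashlib


def replay_rtmr(event_digests):
    """Recompute an RTMR value by replaying 48-byte event digests in order."""
    rtmr = bytes(48)  # registers start as all zeros
    for digest in event_digests:
        if len(digest) != 48:
            raise ValueError("each event digest must be 48 bytes (SHA-384)")
        # Extend: new value = SHA-384(old value || event digest)
        rtmr = hashlib.sha384(rtmr + digest).digest()
    return rtmr
```

An auditor could feed in the digests from an event log and compare the result with the RTMR3 value reported in the quote, assuming the event log format is known.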
How is this different than Dstack migration otherwise?
Right now the alternative approach in Dstack makes use of a Key Management Service (KMS), which is external to the CVM itself. The KMS currently isn’t in a TEE, so this Replicatoor is basically a speedrun of doing it within a TEE.
The Replicatoor also does not use the “tproxy”, which is currently in the Dstack TEE but involves more parts and is motivated mainly by serving HTTPS, which isn’t needed for applications like the Teeheehee bot.
My hope is that this Replicatoor approach helps keep Dstack as simple as possible.