Early Thoughts on Decentralized Root-of-Trust

This is a work-in-progress draft by Shunfan Zhou and Hang Yin. Comments and experiments with the idea are welcome.

Root-of-Trust (RoT) is the cornerstone of the TEE chain-of-trust. The end user verifies Remote Attestations signed by the CPU, which ultimately rely on keys derived from a set of hardware-held secrets. The hardware component that manages these root secrets, verifies firmware and applications, and issues Remote Attestations is called the Root-of-Trust.

Hardware Root-of-Trust

However, hardware RoT has several drawbacks due to the limitations of physical hardware:

  • Unrecoverable TCB: Bugs in a TEE can usually be fixed by a microcode upgrade through the TCB Recovery process. However, once the secret keys in the RoT are extracted, an attacker can simulate the RoT outside the hardware and issue arbitrary Remote Attestations with those secrets, which defeats the TCB Recovery process.
  • Vendor/Physical Chip Lock-in: The app running in the TEE relies on keys derived from the RoT to seal data and maintain identities (e.g., for signing messages and establishing TLS connections). When the RoT is lost, the corresponding data and identities are lost as well. This prevents workloads from migrating between TEEs and blocks users from building services with high availability.

Software Root-of-Trust

Given these significant drawbacks of hardware RoT, it’s imperative to explore alternative approaches that offer greater flexibility and resilience. One such approach is abstracting the RoT into software.

To understand how a software RoT can address these challenges, let’s reconsider its role in the context of a TEE and explore how it can enhance security:

  1. Measurement: Verify the authenticity of the child layer in the chain-of-trust.
    • In Intel SGX, RoT verifies the version of the CPU microcode, BIOS configuration, and the code loaded into the SGX enclave.
  2. Key Derivation Service: Provide a service to derive keys for the child layer. The derived keys can be used to implement data sealing, message signing, and Remote Attestation when combined with the measurements.

These two properties can also be implemented in software, offering better flexibility and resilience. A simple software RoT can be implemented as follows:

  • A KMS that maintains a set of secrets.
  • Implement Measurement: Verify the authenticity of a physical TEE by requesting and verifying the native Remote Attestation from the TEE (e.g., Intel SGX DCAP Remote Attestation). Once this measurement is taken, the RoT can trust that the desired software is running on the desired hardware environment with all the proper properties, usually integrity and confidentiality.
  • Implement Key Derivation Service: Once verification passes, the KMS can provide services allowing TEE applications to obtain derived keys from the RoT. The communication channel between the RoT and the TEE application can be secured using end-to-end encryption protocols like RA-TLS. Similar to a hardware RoT, the service provides capabilities to seal data, sign certificates, and implement Remote Attestation for TEE applications.
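
The three bullets above can be sketched in a few lines. This is a minimal illustration, not an actual KMS: the `Quote` record stands in for an already-verified native attestation, and the `SoftwareRoT` class, its allow-list, and the `purpose` labels are hypothetical names chosen for the example. Key derivation uses HKDF-SHA256 (RFC 5869) built from the Python standard library.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Quote:
    """Hypothetical, already-parsed attestation result (not a real DCAP structure)."""
    app_measurement: bytes   # hash of the code loaded into the TEE
    signature_valid: bool    # outcome of native quote verification, done elsewhere


def hkdf_sha256(root_secret: bytes, info: bytes, length: int = 32) -> bytes:
    """HKDF (RFC 5869) with SHA-256 and an all-zero salt."""
    prk = hmac.new(b"\x00" * 32, root_secret, hashlib.sha256).digest()  # extract
    okm, block, counter = b"", b"", 1
    while len(okm) < length:                                            # expand
        block = hmac.new(prk, block + info + bytes([counter]), hashlib.sha256).digest()
        okm += block
        counter += 1
    return okm[:length]


class SoftwareRoT:
    def __init__(self, root_secret: bytes, allowed_measurements: set[bytes]):
        self._root_secret = root_secret
        self._allowed = allowed_measurements

    def derive_key(self, quote: Quote, purpose: str) -> bytes:
        # Measurement: refuse to serve a TEE whose attestation does not check out.
        if not quote.signature_valid or quote.app_measurement not in self._allowed:
            raise PermissionError("attestation rejected")
        # Key derivation: bound to the app measurement and purpose, not to the chip.
        return hkdf_sha256(self._root_secret, quote.app_measurement + purpose.encode())
```

Because the derived key depends only on the root secret, the app measurement, and the purpose, the same app obtains the same key from any physical TEE that passes verification. A real deployment would also authenticate the channel (e.g. RA-TLS), which this sketch omits.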

Implementing a software RoT offers several advantages but also introduces new challenges that must be carefully considered:

  • Pros
    • Always Recoverable TCB: TEEs with a compromised hardware RoT can be rejected by the RoT system when measuring their Remote Attestations. In case of a hardware RoT breach, the software RoT remains secure. The software RoT can blacklist the entire product line of the affected hardware to recover the TCB of the system.
    • Dynamic Measurement: Instead of relying on static secrets stored in the hardware RoT, a software RoT can measure the physical TEE periodically and detect vulnerabilities as they occur. When a vulnerability is detected, the RoT can trigger a TEE application migration, remove the affected TEE from the network, and perform a key rotation to reduce the risk (with forward and backward secrecy).
    • Application Migration: The keys the application relies on (for sealing, signing, etc.) are no longer associated with the hardware RoT. Regardless of which hardware TEE the application runs on, it can always obtain the same set of derived keys from the software RoT, decoupling TEE applications from the hardware implementation.
  • Cons
    • Centralization Risk: It’s challenging to run a software RoT securely. The RoT must (1) keep the secrets confidential, (2) ensure the integrity of its code, and (3) maintain the liveness of the service. These three properties are especially hard to guarantee without trusting a centralized party.
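
To make the "Always Recoverable TCB" and "Dynamic Measurement" points concrete, here is a small sketch of a periodic re-measurement pass. Everything in it is an assumption for illustration: `PlatformReport` is a hypothetical summary of a verified attestation, and `platform_family` and `tcb_version` are simplified stand-ins for whatever identifiers the native attestation actually exposes.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PlatformReport:
    """Hypothetical summary of a verified attestation; field names are illustrative."""
    node_id: str
    platform_family: str   # e.g. a hardware product line identifier
    tcb_version: int       # higher means more recent firmware/microcode


@dataclass
class RotPolicy:
    min_tcb_version: int
    denied_families: set[str] = field(default_factory=set)

    def accepts(self, report: PlatformReport) -> bool:
        return (report.platform_family not in self.denied_families
                and report.tcb_version >= self.min_tcb_version)


def remeasure(policy: RotPolicy, reports: list[PlatformReport]) -> tuple[list[str], list[str]]:
    """Split nodes into those still trusted and those to evict and migrate away from."""
    trusted = [r.node_id for r in reports if policy.accepts(r)]
    evicted = [r.node_id for r in reports if not policy.accepts(r)]
    return trusted, evicted


policy = RotPolicy(min_tcb_version=5)
fleet = [
    PlatformReport("a", "family-x", 7),
    PlatformReport("b", "family-y", 7),
    PlatformReport("c", "family-x", 4),
]
policy.denied_families.add("family-y")   # a product line is found to be broken
print(remeasure(policy, fleet))          # (['a'], ['b', 'c'])
```

Evicted nodes would then trigger application migration and key rotation; those steps are outside this sketch.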

Decentralized Root-of-Trust

While a software-based RoT offers significant benefits, it introduces a centralization risk that weakens its security. A decentralized RoT addresses this risk by offering a more secure, distributed alternative.

We propose a decentralized RoT model. This approach distributes trust across multiple independent nodes, eliminating reliance on a single entity. A decentralized RoT serves as the foundational trust anchor for the entire TEE network, removing dependencies on specific TEE manufacturers or cloud providers.

A decentralized RoT can be implemented as follows:

  • A software RoT using Secure Multi-party Computation (MPC) to manage and use the root secrets. MPC distributes trust across multiple independent nodes, ensuring that no single entity can compromise the RoT. An attacker must compromise at least a threshold of the nodes (typically two-thirds) to access the secrets.
  • The RoT nodes run in TEEs to (1) measure the integrity of the code and (2) provide defense-in-depth to reduce the risk of MPC collusion attacks. With reproducible builds, anyone can verify that the RoT nodes have implemented the protocol properly.
  • Smart contracts on the blockchain act as the governance layer for the decentralized RoT, enforcing rules around RoT node behavior, including blacklisting certain hardware from joining the TEE network. By verifying Remote Attestations on-chain, smart contracts ensure that only nodes running approved code are part of the system.
  • Additionally, staking mechanisms can be introduced to add cryptoeconomic security, where node operators must stake tokens as collateral. Any malicious behavior risks losing this stake, incentivizing honest participation.
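
The threshold property in the first bullet can be illustrated with Shamir secret sharing. This is a simplification: a real MPC protocol would use the shares to compute on the root secret without ever reconstructing it in one place, whereas this sketch reconstructs it only to show that any `threshold` shares suffice. The prime, function names, and parameters are choices made for the example.

```python
import secrets

PRIME = 2**127 - 1  # a Mersenne prime; the field must be larger than the secret


def split(secret: int, threshold: int, shares: int) -> list[tuple[int, int]]:
    """Split `secret` into `shares` points; any `threshold` of them recover it."""
    if not (0 <= secret < PRIME and 1 <= threshold <= shares < PRIME):
        raise ValueError("invalid parameters")
    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x: int) -> int:
        y = 0
        for c in reversed(coeffs):   # Horner's rule
            y = (y * x + c) % PRIME
        return y

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def recover(points: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

For example, `split(root_secret, 8, 11)` gives each of 11 nodes one share, and `recover` succeeds with any 8 of them; fewer than 8 shares reveal nothing about the secret.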

The decentralized architecture solves the centralization risk in software RoT because:

  1. Confidentiality: The secrets are protected by threshold MPC, eliminating the single point of failure, with TEE providing defense-in-depth.
  2. Integrity: The code is protected by both TEE Remote Attestation and threshold MPC for redundancy.
  3. Liveness: The MPC setup can tolerate the failure of up to one-third of the nodes. Cryptoeconomics can be introduced to further incentivize operators to behave well.
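
The relationship between the confidentiality threshold and the liveness bound is simple arithmetic. The sketch below assumes the threshold is set to ceil(2n/3), which is one common choice and not something the design above fixes; the function name is hypothetical.

```python
def threshold_parameters(n: int) -> tuple[int, int]:
    """Return (threshold, tolerated_failures) for a two-thirds threshold over n nodes."""
    if n < 1:
        raise ValueError("need at least one node")
    threshold = (2 * n + 2) // 3          # ceil(2n / 3) using integer arithmetic
    return threshold, n - threshold


for n in (4, 9, 11):
    print(n, threshold_parameters(n))
# 4 (3, 1)
# 9 (6, 3)
# 11 (8, 3)
```

With 11 nodes, an attacker needs 8 of them to reach the secrets, and the service stays live as long as no more than 3 nodes are offline.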

This decentralized model effectively overcomes the vulnerabilities of centralized RoT by distributing control across a secure, blockchain-governed network of MPC nodes. Additionally, the architecture reduces the complexity for users in verifying the security properties of TEE applications by providing a chain-of-trust. Users can easily verify the chain-of-trust by checking the signatures originating from the smart contract.

Implementation

  1. Phala’s SGX Network
    • DAO-governed KMS using TEE and on-chain governance (on a Substrate blockchain).
    • KMS is secured by TEE but not MPC yet.
    • Key rotation is implemented, but it is triggered manually and only in catastrophic events.
    • Apps are decoupled from HW RoT.
    • As of Sep 24, 2024, the network has 11 KMS (called Gatekeepers) and around 39,570 TEE workers.
  2. (Maybe others who implemented Ekiden?)

Discussion

  • Possibility of introducing “Open Source TEE” or very secure RoT?
  • Forward and backward secrecy with key rotation
  • Automatic HTTPS certbot for TEE apps via an extension of the chain-of-trust
  • Storage migration, persistence, and state continuity

Conclusion

The decentralized RoT model not only addresses the limitations of both hardware and centralized software RoT systems but also ensures a more secure, scalable solution for TEE systems.

Essentially, the architecture implements a decentralized RoT where the user ultimately trusts the security of the blockchain. The trust is then extended through protocol governance, the KMS implementation, the hardware TEE, and finally to the applications. By abstracting RoT and distributing trust across the network, the user no longer relies on any specific hardware implementation, and the hardware TEE becomes hot-swappable.

This is a great and detailed post. Thank you very much.

My only real critique is that calling this MPC KMS a RoT can be misleading to some readers, since it doesn’t replace the need for the hardware RoT, although I understand why you called it that. The way I’ve conceptualised it to myself is as a decentralised governance layer that wraps the centralised attestation service (ultimately Intel), so that you have additional control over the network to exclude nodes or TCBs even if the centralised attestation service hasn’t. Does that make sense? I guess it doesn’t capture the ability to issue shared keys.

Do you have plans to do a kind of regular/automated key rotation in Phala to get forward secrecy?

Yes, this is an accurate description of the attestation side. More important than governance, the design decouples the apps from the TEE hardware by providing a portable identity (also called a live-migration-ready identity). This identity can be used to sign messages, seal data, and eventually enable state migration between different hardware TEE implementations.

I see how the term could be misleading. I’m now considering a better name to reflect the concept.

Automated key rotation is tricky because downstream systems also need to adapt to it, and it’s usually a nontrivial, invasive change that affects user experience.

For stateless apps, rotation is easier, but client-side changes are still needed. When a rotation is triggered, the app just needs to discard the old key and use a new one. It should always use the latest key for authentication and I/O encryption. The client interacting with the TEE app must enforce the key rotation by always following the latest key.

Client-side changes can be minimized if the API endpoint is based on RA-TLS (or another TLS protocol). The app can set the expiration date of the issued certificate to cover only the valid period of the current key. This way, the client enforces the key rotation policy as well.
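
As a rough sketch of that idea, the app can cap the certificate lifetime at the end of the current key period, so the client's ordinary validity check does the enforcement. The `KeyEpoch` record and the function names are hypothetical, not an existing API, and issuing the actual certificate is left out.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class KeyEpoch:
    """Hypothetical record of one rotation period of the app's key."""
    epoch: int
    valid_from: datetime
    valid_until: datetime


def certificate_window(epoch: KeyEpoch, now: datetime,
                       max_lifetime: timedelta) -> tuple[datetime, datetime]:
    """Pick (not_before, not_after) so the certificate never outlives the key."""
    if not (epoch.valid_from <= now < epoch.valid_until):
        raise ValueError("key epoch is not currently valid")
    return now, min(now + max_lifetime, epoch.valid_until)


def client_accepts(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """The ordinary TLS validity check a client already performs."""
    return not_before <= now <= not_after


start = datetime(2024, 9, 1, tzinfo=timezone.utc)
epoch = KeyEpoch(epoch=1, valid_from=start, valid_until=start + timedelta(days=7))
not_before, not_after = certificate_window(epoch, start + timedelta(days=6), timedelta(days=30))
print(not_after == epoch.valid_until)   # True: capped at the end of the key period
```

Once the epoch ends, the old certificate expires on its own and the client has to fetch one bound to the new key.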

Things are more complex for stateful apps, which rely on the current key to encrypt the state. To achieve backward secrecy, the app has to re-encrypt its state with the new key. This process is time-consuming and usually interrupts the running service, especially for apps with large states like a key-value database. Although the burden of re-encryption can be mitigated with high-performance TEEs like TDX, developers can decide whether to implement state re-encryption based on their requirements.

So, I believe we should perform key rotation regularly, but how it’s enforced on the app side is up to the developers.

Do I understand correctly that the proposal here is to have a “root authority” with its own set of keys, secret-shared among a network of nodes, signing their own abstracted remote attestation quotes? While having a decentralized authority over remote attestation policies seems like the right way, I’m not sure that this authority itself needs to sign quotes or derive keys. These seem like they should be decoupled.

My view is that different use cases/apps can also have different security trade-offs and thus different policies. The decentralized root authority should probably be decoupled from policy decisions. Then the question is what exactly is gained by having the root authority perform vendor-specific quote verification and then issuing its own quotes? Note that even with such a system, the actual root of trust of individual remote attestations is still the vendor’s (e.g. Intel’s) root key. So is it just to simplify verifier implementations?

Otherwise, one can just publish policies in a system that protects integrity (e.g. a blockchain). And the (per app?) governance deals with setting these policies, with on-chain and off-chain components verifying vendor quotes against these policies. This is basically what we have at Oasis.

Regarding “(Maybe others who implemented Ekiden?)” from the original post:

At Oasis we have a conceptually similar setup, but there is no central root beyond “the Oasis consensus layer protects integrity of attestation and key derivation policies”. E.g. there can be multiple key manager committees with their own policies, multiple on-chain confidential compute runtimes with their own policies and multiple off-chain compute runtimes with their own policies.

We’re using SGX (and TDX in development branches) as the TEEs that host all of these components that talk to each other via secure channels authenticated by remote attestation and information published in the consensus layer. The policies are fairly flexible and already allow blacklisting specific FMSPCs or enforcing TCBRs before Intel does it by default. Most recently we blacklisted some FMSPCs during disclosures from Mark Ermolov regarding compromise of encrypted fuse keys on one of the EOLed CPUs.

Besides the straightforward “each TEE instance stores a replica of the root key that is used during derivation”, the key manager runtime also supports a proactive secret sharing MPC scheme (CHURP) where the key shares are stored in TEEs. The policies around thresholds, committee participation, key derivation etc. are configurable for each key manager committee.

Quoting the earlier reply: “Automated key rotation is tricky because downstream systems also need to adapt to it, and it’s usually a nontrivial, invasive change that affects user experience.”

You’re right, it is important to make it as transparent as possible for the end-users. We have separate key types with different key rotation policies: ephemeral keys used for I/O encryption rotate quite frequently, with old keys being securely erased, while longer-term keys are used for state encryption, where it is up to the runtime to decide how and when to perform re-encryption. Light clients can be used to verify which set of keys is currently valid and to use them for communication (e.g. encrypting transactions).

For state, besides impacting user experience, there is also a cost associated with re-encryption which needs to be paid by someone, and this should probably be up to the application which knows best what the trade-offs are (e.g. some parts may be more sensitive than others).
