BONKbot's Secure Signer Architecture: Dedicated TPM-based System with WASM signing verification vs. TEEs

Overview

At BONKbot, we’ve developed a non-custodial signer system that combines a custom Trusted Computing Device with user-authorized WASM modules for transaction validation. Our goal is to achieve CEX-like execution speeds on Solana while maintaining non-custodial security.

Terminology

  • TPM: Trusted Platform Module
  • HSM: Hardware Security Module
  • TEE: Trusted Execution Environment

Our Approach: Dedicated TPM-based System vs. TEEs

We run a dedicated system with the TPM as its basis, rather than relying on other isolation methods (TEEs, VMs with SEV, etc.) layered on top of a shared host.

  1. Dedicated Hardware Environment: Unlike TEEs, which are typically used in shared or untrusted environments, our system operates on dedicated hardware. This allows for a more comprehensive and tailored security model; for example, we use the TPM to verify the integrity of the platform, the platform key, and the OS that is being run.

  2. Side-Channel Attack Resistance: Our TPM-based approach is inherently more resistant to on-device side-channel attacks. TEEs, including SGX and TDX, have a history of vulnerabilities in this area and require specific measures to prevent such attacks.

  3. Verifiability: TEEs essentially provide deep isolation of code within an untrusted environment. Our system, built on a custom minimal Linux kernel with TPM, offers better verifiability of the entire security stack.

  4. Customization: The dedicated hardware and software environment allows for tailored security measures that aren’t possible with general-purpose TEEs. For example, we can build a custom Linux kernel with added security measures.

  5. Fewer resource constraints: Bandwidth in and out of TEEs and their allocatable memory are severely limited. Using a dedicated system gives us as much RAM as we want.

  6. TPM and TEE roles: Base layer TPMs (dTPM or fTPM), if correctly designed, give high assurance about the code running and manage access to keys based on the state of the system (hardware, firmware, configuration, OS, application…); a minimal sketch of this state-bound key release follows this list. TEEs isolate execution from host environments, and sometimes add state measurement for key access as well (or use TEE-unique keys to do this).

  7. TEE advantages and limitations: The upside of the TEE is the reduction of the code required for the application to be useful, since hardware is abstracted away behind a message-passing interface with the host environment. This makes TEE code far easier to audit IFF the host environment is trustworthy - which it likely isn’t. To get a trustworthy environment, the host needs to be measured against a known-good configuration as well, and that known-good configuration must be established first.

  8. Use cases for TEEs vs TPMs: This means that TEEs are a quick way to trusted execution IFF one is willing to trust the host environment but does not control it (like cloud instances), and TPMs on dedicated systems provide the host trust. Unikernels are then the efficient way to translate trusted hosts into trusted applications.

  9. Cost considerations: TEEs are cheaper to run, given the known tradeoffs.

  10. Our decision rationale: We decided to use the TPM-unikernel-dedicated-host approach to prevent vendor lock-in, cloud-provider lock-in, and reliance on not-well-understood third-party stacks, while playing it safer on the side-channel risk (which is extremely hard to mitigate).

  11. Security approach: We consider it easier to secure a unikernel host and app, vs a general purpose host and a TEE.

  12. TEE usage: Now, this doesn’t mean that we do not use any TEE functionality - but we don’t rely on it as an integral part of our architecture.
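
To make the TPM's role concrete, here is a minimal, purely illustrative sketch of state-bound key release: a signing key sealed against the measured platform state is only released when the live measurements match a known-good policy. The `Tpm` and `SealedKey` types and their methods are hypothetical stand-ins for a real TPM stack, not BONKbot's code or any real TSS API.

```rust
// Illustrative sketch only: release a signing key only when the measured
// platform state (PCR values) matches a known-good policy. `Tpm` and
// `SealedKey` are hypothetical placeholders, not a real TSS interface.

/// Digest of the expected boot chain: firmware, bootloader, unikernel image.
const KNOWN_GOOD_POLICY: [u8; 32] = [0u8; 32]; // placeholder value

struct Tpm;

struct SealedKey {
    policy_digest: [u8; 32],
    ciphertext: Vec<u8>,
}

impl Tpm {
    /// Read the current composite digest over the selected PCRs
    /// (hardware, firmware, configuration, OS, application).
    fn pcr_composite_digest(&self) -> [u8; 32] {
        [0u8; 32] // placeholder: a real TPM returns the measured state
    }

    /// Release the key material only if the live PCR state satisfies
    /// the policy the key was sealed against.
    fn unseal(&self, sealed: &SealedKey) -> Option<Vec<u8>> {
        if self.pcr_composite_digest() == sealed.policy_digest {
            Some(sealed.ciphertext.clone()) // stand-in for TPM-side decryption
        } else {
            None // any change to the measured stack denies access to the key
        }
    }
}

fn main() {
    let tpm = Tpm;
    let sealed = SealedKey { policy_digest: KNOWN_GOOD_POLICY, ciphertext: vec![1, 2, 3] };
    match tpm.unseal(&sealed) {
        Some(_key) => println!("platform state matches policy: key released"),
        None => println!("platform state drifted: key stays sealed"),
    }
}
```

With a real TPM, the policy check and unsealing happen inside the TPM itself, so software never sees the key material unless the measured state matches the sealed policy.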

Auditable Security of Trading Intent

  1. WASM-based Transaction Validation: User-authorized WASM modules define and verify the “shape” of each transaction type. This ensures that every transaction matches the user’s intent precisely.

  2. Sub-millisecond Verification: Our WASM modules perform transaction verification in under 0.5ms, enabling high-speed, intent-based execution.

  3. Isolated Execution: WASM modules run in a WASI-targeted, sandboxed environment, communicating only via shared memory.

  4. Constraint-based Intent Definition: Each transaction type is defined by a set of constraints. This allows for precise intent capture and validation; a minimal sketch follows this list.
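
As a rough illustration of what constraint-based validation can look like, the sketch below checks a hypothetical swap transaction against a user-authorized intent before signing. The field names, constraint set, and `matches_intent` function are invented for this example and do not reflect BONKbot's actual WASM module interface or transaction shapes.

```rust
// Illustrative sketch of constraint-based transaction validation, as a
// WASI-style module might perform it. All names here are hypothetical.

/// Decoded view of the transaction the signer has been asked to sign.
struct SwapTx {
    program_id: [u8; 32],
    source_mint: [u8; 32],
    destination_mint: [u8; 32],
    amount_in: u64,
    min_amount_out: u64,
}

/// The user-authorized intent: every signed transaction must satisfy
/// all of these constraints, or signing is refused.
struct SwapIntent {
    allowed_program: [u8; 32],
    source_mint: [u8; 32],
    destination_mint: [u8; 32],
    max_amount_in: u64,
    min_amount_out: u64,
}

/// Returns true only if the transaction matches the intent's shape exactly.
fn matches_intent(tx: &SwapTx, intent: &SwapIntent) -> bool {
    tx.program_id == intent.allowed_program
        && tx.source_mint == intent.source_mint
        && tx.destination_mint == intent.destination_mint
        && tx.amount_in <= intent.max_amount_in
        && tx.min_amount_out >= intent.min_amount_out
}

fn main() {
    let intent = SwapIntent {
        allowed_program: [7u8; 32],
        source_mint: [1u8; 32],
        destination_mint: [2u8; 32],
        max_amount_in: 1_000_000,
        min_amount_out: 990_000,
    };
    let tx = SwapTx {
        program_id: [7u8; 32],
        source_mint: [1u8; 32],
        destination_mint: [2u8; 32],
        amount_in: 500_000,
        min_amount_out: 995_000,
    };
    println!("sign? {}", matches_intent(&tx, &intent));
}
```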

Link to full technical blog post for more details

Link to twitter thread also diving into WASM modules

We welcome feedback and discussion.

Thanks for sharing 🙂

You say that TEEs are cheaper; is this the cost of the hardware? Can you give us details of your AMD + TPM hardware, and how much does it cost?

And, will you share the code to study it?

What did you use to customize and build the unikernel? Are the resulting operating system and your applications reproducibly built?

There are two key differences:

  • You can have multiple TEEs on a single host, and the machine can be multi-tenant. This means you can get it as a cloud service with the associated cost savings (due to not having to manage colocation, host administration, etc.).
  • Assuming that you trust your host environment, you also only have to care about and secure your TEE software, without caring too much about the security of everything that makes up the host. So development time is less for a TEE.

Our hardware costs roughly 10k USD for purchase and an additional 300 USD/month for colocation stuff. Plus admin.

Intel would be slightly more expensive, maybe 15k USD purchase.

A mainline kernel with anything unused carefully removed (drivers, protocols, functionality), all hardening features enabled, compiled with all hardening flags, statically linked (no modules, no helpers), then assembled via objcopy into a unikernel.

Yes, the build is reproducible.

Heyo! I’m not an expert, so please excuse the number of questions.

Can you link some resources I could read to understand where TPM allows more comprehensive security? My understanding tapers off somewhere around TPM-based remote attestations

Interesting! How is that achieved? AFAIK TPMs would not help much against any physical intrusion; do you mean that it’s because the host is a part of the measurement?

Hm, so the trusted code part (security model) of SGX doesn’t even include the kernel, so I’d guess that sets tighter bounds on security, and for TDX we also use a similar minimal Linux kernel.
It’s a good point that TPM provides more guarantees over what in TEEs is the untrusted host. I wonder if we shouldn’t simply do both, although making the host itself reproducible is tricky (but I’d guess possible).

@socrates1024 I wonder what you think about this? TPM-based attestation on the host coupled with TEEs for separation of any workloads?
The host would have to be inaccessible via ssh and console which makes running arbitrary TEEs a bit more difficult but it’s similar to any of the other ideas floating around.

I’m not sure if I understand this point – I’m not aware of any such limitation for TDX, at least.

Why can’t it be said about TDX in particular that it gives high assurance about the code running? I’m not aware of major differences in what TPM provides. With SGX the same can be said, as long as you exclude the untrusted host’s kernel

I think this is true of SGX, but not of TDX – would you say this applies to TDX as well?

Agreed, although the trust in the host environment in TEEs is quite similar to the trust in the infrastructure around the TPM host - in particular, both can be selectively censored, made selectively unavailable, turned off, or tampered with at the hardware level. One difference you brought up is the OS-based side channels, which could be there, but that doesn’t really improve the guarantees, since you anyway trust the hosting infrastructure not to tamper physically.

Seems to me you thought it through thoroughly and arrived at the correct decision for your use case!

I wonder – how much of a difference is it between images that use TPM-based vs TDX-based attestations. My intuition is that they are almost the same image, since both have to be reproducibly built with a minimal kernel, both are performing attestations, both have a similar attack surface (sans OS-based side channels). Maybe the keys are handled differently in the TDX approach? I’d be curious to see what we can reuse between the two approaches.

I’d be curious to take a look at the code - I assume it’s open source if you provide remote attestation - but it doesn’t seem it’s linked in the blog post

> Interesting! How is that achieved? AFAIK TPMs would not help much against any physical intrusion; do you mean that it’s because the host is a part of the measurement?

Physical intrusion is not the only (and not the most dangerous) side channel. The biggest issue is on-system measurement via software, because those attacks:

  1. Leave fewer traces & are non-destructive
  2. Only require breaking into the system, not breaking into a datacenter
  3. Have a higher degree of success

Breaking into a datacenter and de-capping a running system that has tamper sensors activated is a much bigger accomplishment than breaking into a shared hosting environment (cloud) and running measurements on the system.

> I’m not sure if I understand this point – I’m not aware of any such limitation for TDX, at least.

TDX is not a TEE. TDX is a virtualization technology with added isolation. Also, TDX is NOT unlimited. It strongly depends on the processor you are using:

Random example:
https://www.intel.com/content/www/us/en/products/sku/237557/intel-xeon-silver-4514y-processor-30m-cache-2-00-ghz/specifications.html

Scroll down to Default Maximum Enclave Page Cache

Also, to move memory in/out of the virtual machine you have to share memory. This also requires context switching and locking. That reduces bandwidth.
We’re only very recently seeing systems like Hyperlight that optimize for function isolation on VMs. Even there, the latency introduced by moving memory in/out of the VM is around 0.01ms. We simply don’t think about any of these constraints because we don’t have to. TEEs and TDX are for SHARED systems. We’re using a dedicated system.

> Why can’t it be said about TDX in particular that it gives high assurance about the code running? I’m not aware of major differences in what TPM provides. With SGX the same can be said, as long as you exclude the untrusted host’s kernel

At no point do we claim that TDX or TEEs are bad. They are a different way to get to the same goal; they just have different expense and complexity tradeoffs. It’s like comparing vim vs emacs.

> I think this is true of SGX, but not of TDX – would you say this applies to TDX as well?

Yes, it applies to TDX as well. You can run a TDX domain purely with processor setup and shared memory; no need to do anything else.

> …in particular, both can be selectively censored, made selectively unavailable, turned off, or tampered with at the hardware level. One difference you brought up is the OS-based side channels, which could be there, but that doesn’t really improve the guarantees, since you anyway trust the hosting infrastructure not to tamper physically.

We have to provide either a trusted service or no service. If the service is present, we have to make sure that we can trust it.

One should first define one’s threat model before discussing solutions. Censorship is NOT in our threat model because we are a single provider. People can export their keys and continue somewhere else.

> I wonder – how much of a difference is it between images that use TPM-based vs TDX-based attestations.

Attestation protocols between TDX and TPM are very different. TDX attestation runs through an SGX-hosted attestation service. TPM allows direct attestation that doesn’t require SGX to be established as a trust domain (which, to be exact, is only publicly verifiable IFF the host itself is first attested through the TPM…).

On the other hand, TDX attestation allows one-shot, shareable attestation, while TPM attestation must be interactive (at least at the current date).
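
For illustration, here is a minimal sketch of the interactive (challenge-response) shape of TPM-style attestation: the verifier supplies a fresh nonce, the host returns a quote over its measured state bound to that nonce, and the verifier checks both. All types and functions are hypothetical placeholders rather than a real TSS or attestation-service API.

```rust
// Illustrative sketch of interactive TPM-style attestation. The verifier
// chooses a fresh nonce, the TPM signs (quotes) its PCR state over that
// nonce, and the verifier checks both. All names here are hypothetical.

struct Quote {
    nonce: [u8; 16],
    pcr_digest: [u8; 32],
    signature: Vec<u8>, // signed by the TPM's attestation key
}

/// Host side: produce a quote over the current PCR state and the
/// verifier-chosen nonce (placeholder logic).
fn tpm_quote(nonce: [u8; 16]) -> Quote {
    Quote { nonce, pcr_digest: [0u8; 32], signature: vec![0u8; 64] }
}

/// Verifier side: the nonce binds the quote to this session, so a quote
/// cannot be prepared in advance or replayed later.
fn verify(quote: &Quote, expected_nonce: [u8; 16], known_good_pcrs: [u8; 32]) -> bool {
    quote.nonce == expected_nonce
        && quote.pcr_digest == known_good_pcrs
        && !quote.signature.is_empty() // stand-in for real signature verification
}

fn main() {
    let nonce = [9u8; 16];        // freshly generated by the verifier
    let quote = tpm_quote(nonce); // produced by the attested host
    println!("attested: {}", verify(&quote, nonce, [0u8; 32]));
}
```

Because the quote is bound to a verifier-chosen nonce, it cannot be produced ahead of time or replayed, which is why the protocol is interactive rather than one-shot.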

Can you elaborate more on what you mean by “Intel TDX is not a TEE”?
To my understanding it provides hardware-isolated virtualization, which protects the confidentiality and integrity of the application running within it. It is part of Intel’s TEE offering, since the CPU must support the instructions that allow the TDX module to create those Trust Domains.

Intel TDX VMs are not necessarily limited to the EPC. I believe the reference you shared only concerns the SGX part, because those Xeon CPUs also support SGX; therefore the EPC is listed with its maximum possible size. But that is for SGX and not TDX, if I’m not wrong.
TDX virtual machines can utilize as much of the memory assigned to them as is available on the host machine. TDX uses Intel’s MK-TME to encrypt each TD VM’s memory and isolate the VMs from each other.

Could you please elaborate more on “IFF one is willing to trust the host environment”?
The main purpose of TEEs is to run sensitive applications without trusting the host environment or admin/hypervisor. This is part of the strict threat model of Intel SGX and TDX.
The trust boundary in Intel SGX would be the enclave code + CPU package, and in the case of Intel TDX it would be the guest OS + your built-in application + CPU package (including firmware and microcode). Enclaves and TD VMs are isolated from each other and from the host system.

If you mean moving memory in and out for the several TDX VMs running on the same hardware, then that could indeed cause a lot of context switching if the resources given to those VMs are not managed well. However, if you consider using bare-metal providers, then you would also have a dedicated system for your application and could launch a single large TDX VM, for example.
Or did I misunderstand something?