rewsr.
THOUGHT.LOG

GPU TEEs and the AWS Headache

2025.01.21 · ANANSI.ALPHA · CONFIDENTIAL.COMPUTE

ok so this has been driving me nuts for weeks now

AWS drops these new P6 instances with Blackwell GPUs and they're insanely fast. like we're talking about exactly what I need for Anansi's proof generation stuff. should be perfect right?

except here's the thing that makes me want to throw my laptop out the window. they didn't expose ANY gpu tee controls. none. zero. nothing in the docs about encrypted HBM, no gpu attestation api that customers can actually use. it's like they built this amazing race car but forgot to include the steering wheel

meanwhile azure has had confidential gpus working for months. MONTHS. their NCCads H100 v5 instances give you cpu AND gpu attestation right out of the box. they literally have a github repo that walks you through the whole thing. i can spin one up right now and get attestation tokens in like 10 minutes

google cloud? same deal. A3 confidential VMs with H100s, nvidia NRAS attestation, the whole nine yards. it just works

seriously AWS what are you doing

your competitors are shipping actual confidential computing features while you're over here writing blog posts about how fast your new gpus are. cool story but enterprises need more than speed benchmarks

[image: F1 pit stop with AWS branding]

this is what i'm talking about

every single piece working together perfectly. no missing parts, no "oh we'll add that later"
that's what cloud infrastructure should look like too

but whatever, can't sit around waiting for AWS to figure it out. so here's what we're actually doing for anansi alpha

the hack that actually works

two-cloud MPC. sounds fancy but it's pretty straightforward. we use AWS P6 instances for the raw compute power because yeah they're stupid fast. but we keep everything confidential by doing multi-party computation with azure or gcp confidential gpus

basically your data never exists in cleartext on any single cloud. each one only sees random garbage that looks like noise. even if AWS could somehow peek at the gpu memory (which they probably can't but still) all they'd see is meaningless random numbers
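the "random garbage" trick is just additive secret sharing. quick sketch so you can see why a single share is useless — illustrative python, not the actual anansi code:

```python
# additive secret sharing sketch (toy code, not the real anansi alpha impl)
import secrets

MOD = 2**64  # share arithmetic mod 2^64

def split(value: int) -> tuple[int, int]:
    """Split a secret into two additive shares; each alone is uniform noise."""
    share_a = secrets.randbelow(MOD)   # what one cloud holds
    share_b = (value - share_a) % MOD  # what the other cloud holds
    return share_a, share_b

def reconstruct(share_a: int, share_b: int) -> int:
    """Only someone holding BOTH shares can recover the value."""
    return (share_a + share_b) % MOD

secret = 42
a, b = split(secret)
assert reconstruct(a, b) == secret
```

point being: `share_a` on its own is a uniformly random number. peek at one cloud's memory all you want, there's literally nothing to learn.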

here's the setup:

  • AWS side: P6-B200 for speed (obviously)
  • other cloud: azure NCCads H100 v5 or gcp A3 confidential
  • secret sharing: both gpus work on shares, never see actual data
  • attestation: azure/gcp gives us proper gpu attestation via NRAS
  • aws side: nitro enclave handles keys and policy stuff
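
the "both gpus work on shares" bullet boils down to each side doing arithmetic on its own half independently. same caveat as before — toy python standing in for what actually runs on the gpus (and note: addition is the easy case; multiplication needs extra machinery like Beaver triples):

```python
# linear MPC on additive shares: each cloud computes on its half alone
# (illustrative sketch, not production code)
import secrets

MOD = 2**64

def split(v: int) -> tuple[int, int]:
    a = secrets.randbelow(MOD)
    return a, (v - a) % MOD

# two secret inputs, split across the clouds
x_aws, x_cc = split(10)
y_aws, y_cc = split(32)

# each side adds its own shares locally — zero communication needed
sum_aws = (x_aws + y_aws) % MOD  # computed on the P6 side
sum_cc = (x_cc + y_cc) % MOD     # computed on the confidential-GPU side

# only recombining the two halves reveals the result
assert (sum_aws + sum_cc) % MOD == 42
```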

the proof bundle we spit out has everything. azure/gcp cpu attestation, gpu NRAS tokens, aws nitro enclave attestation, mpc protocol transcripts. boom. cryptographic proof that no single cloud provider ever saw your secrets
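for a feel of the shape — field names here are my shorthand for this post, not the final schema:

```python
# rough shape of the proof bundle (placeholder field names and values)
import hashlib
import json

bundle = {
    "cc_cpu_attestation": "<azure/gcp CPU attestation token>",
    "gpu_nras_token": "<NVIDIA NRAS GPU attestation token>",
    "nitro_attestation_doc": "<AWS Nitro Enclave attestation document>",
    "mpc_transcript": ["<round 1 messages>", "<round 2 messages>"],
}

# a digest over the serialized bundle pins down exactly what was verified
digest = hashlib.sha256(
    json.dumps(bundle, sort_keys=True).encode()
).hexdigest()
print(digest)
```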

and here's the kicker - even if AWS could somehow read the pcie bus or gpu memory, doesn't matter. they're only seeing random shares that are useless without the other half

why this isn't just paranoia

look i know this sounds like overkill. "just trust the cloud provider" right? except when you're dealing with enterprises running their most sensitive AI workloads, trust isn't enough anymore

i've been in meetings with compliance teams. they don't want to hear about your security promises or your certifications. they want cryptographic proof. they want attestation tokens they can verify themselves. they want audit trails that would hold up if things go sideways

that's exactly what anansi gives them. every single computation comes with a proof bundle that anyone can verify independently. no trust required
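what "verify independently" looks like, sketched — structural checks only here, standing in for real signature and cert-chain validation against the NVIDIA/azure/AWS roots (names are placeholders again):

```python
# toy verifier: checks the bundle has every required piece. real verification
# would validate each token's signature chain and expected measurements.
REQUIRED = {
    "cc_cpu_attestation",
    "gpu_nras_token",
    "nitro_attestation_doc",
    "mpc_transcript",
}

def verify_bundle(bundle: dict) -> bool:
    if not REQUIRED <= bundle.keys():
        return False  # something's missing: reject
    return all(bundle[k] for k in REQUIRED)

assert verify_bundle({k: "x" for k in REQUIRED})
assert not verify_bundle({"gpu_nras_token": "x"})  # partial bundle fails
```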

bottom line

aws has the fastest hardware but no confidential computing story. azure and gcp have confidential computing but slower gpus. so we just use both. problem solved

this is what's shipping in anansi alpha. code's gonna be open source soon so you can see exactly how we're doing it. and hey if anyone from aws is reading this - please just expose those gpu tee controls already. would make all our lives easier

anyway that's how we're solving confidential ai compute right now. not waiting around for perfect solutions, just building what works with what we've got