Quantifying Info I: Quantifying The Information Leakage of Computing on Encrypted Transactions

Part 1

The following is an abstraction of a problem. For more context on the application, a brief description is attached below.

Description

The setting
An adversary is given a finite set of functions \mathcal{F} where every element f_i\in\mathcal{F} is a mapping from a set X to a set L. The adversary is also given a prior distribution, \mathcal{D}, over \mathcal{F}.

The game
A function f^* is drawn from \mathcal{D}. The adversary does not know f^*.

Now there are n rounds. In each round i, the adversary selects any x_i\in X and is told f^*(x_i).

After the nth round, the adversary attempts to guess f^*.

Example
X:={(1,1),(0,1),(0,0),(1,0)}
L:={0,1}
\mathcal{F}:={\text{AND, OR, NAND, XOR}}
\mathcal{D} is uniform over \mathcal{F}.
n=1

A valid policy for the adversary is to query (1,1). If f^*(1,1)=1, choose either OR or AND according to a coin toss. If f^*(1,1)=0, choose NAND or XOR in the same way.

The success rate of this policy is \frac{1}{2}, which is better than completely random guessing, which yields \frac{1}{4}.
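The policy above (and its exact 1/2 success rate) can be checked in a few lines. A minimal sketch, with the four gates written out by hand:

```python
from fractions import Fraction

# The four candidate functions over X = {(1,1),(0,1),(0,0),(1,0)}.
GATES = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "XOR":  lambda a, b: a ^ b,
}

def policy_success_rate():
    """Exact success rate of: query (1, 1), then guess uniformly at
    random among the functions consistent with the observed output."""
    total = Fraction(0)
    for f in GATES.values():                       # f* drawn uniformly
        leak = f(1, 1)
        consistent = [g for g in GATES.values() if g(1, 1) == leak]
        total += Fraction(1, len(GATES)) * Fraction(1, len(consistent))
    return total

print(policy_success_rate())  # → 1/2
```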

This is a relatively basic example with small sets. In general, |L|, n \ll |\mathcal{F}|, |X|. One can also reasonably assume \mathcal{D} is power-law distributed.

Question

  • Is there existing literature which studies this setting? Information theory and computational learning theory seem like the most obvious areas to investigate.
  • Given a setting, how much can the adversary learn from the best possible policy? One way of measuring this is the reduction in entropy from the prior distribution to the posterior belief after the last query, as a function of n (see part 2 for some more ideas).
  • Assuming the adversary is polynomially bounded (runs in probabilistic polynomial time), what is the best policy they can find given the setting?
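For the running example, the entropy-reduction measure is easy to compute exactly. A sketch, assuming the uniform prior: the prior entropy is 2 bits, and querying (1,1) leaves 1 bit in expectation:

```python
import math
from collections import defaultdict

GATES = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "XOR":  lambda a, b: a ^ b,
}

def expected_posterior_entropy(x, prior):
    """Expected Shannon entropy (bits) of the belief over F after
    observing the leak f*(x), starting from `prior`."""
    leak_mass = defaultdict(float)
    for name, p in prior.items():
        leak_mass[GATES[name](*x)] += p
    h = 0.0
    for name, p in prior.items():
        if p > 0:
            q = p / leak_mass[GATES[name](*x)]   # posterior weight of name
            h -= p * math.log2(q)
    return h

uniform = {name: 0.25 for name in GATES}
# Prior entropy is 2 bits; querying (1, 1) leaves 1 bit in expectation.
print(expected_posterior_entropy((1, 1), uniform))  # → 1.0
```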

The actual application

This question comes up when considering doing computation on private information. Feedback from this computation may degrade the privacy and this question aims to quantify to what degree privacy is degraded. More particularly, this problem comes up in a blockchain setting in which a transaction must be ordered in a block along with other transactions. This process requires doing computation on these transactions. In many cases the ordering of transactions within the block has an impact on how a block fares according to some objective function. In order to improve upon brute force search over all orderings, additional computation on private transactions may be required.

A transaction on a blockchain can be considered a function which takes as input the state of a VM and outputs the future state (although the input and output can be generalised). In this case \mathcal{F} is the set of all transactions (finite because of gas limits), X is the set of all possible VM states, and L is the information leaked (e.g. a boolean for success/revert of the transaction). One can simplify the problem by considering \mathcal{F} to be the set of all swaps, with each swap parameterised by pool, size and slippage limit.

Edit

  • The most concrete direction I have in mind at the moment is looking at bounds on concept learning with membership queries in computational learning theory (see this or this) and combining that with results from PAC-Bayes which is PAC learning with priors.

Part 2

Now we layer on some economics.

EDIT: Sorry, I skimmed through some of the writing and forgot about the polynomial bound. Clearly what I suggested below is a naive and not-so-polynomial-time algorithm.

I do have to ask a couple of questions here though:

  1. How much overlap is there among the functions in F each time the adversary has to solve? (Like, can you just take advantage of memoization, so they just have to query without solving a lot of equalities?)

With memoization and the right kind of data structure you can do it in linear time or better. I don’t think you even need a paper to refer to at that point.
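One way to cash out the memoization idea, as a sketch: precompute every f(x) once, and then each play of the game reduces to table lookups. The table layout here is just one possible choice:

```python
GATES = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "XOR":  lambda a, b: a ^ b,
}
X = [(1, 1), (0, 1), (0, 0), (1, 0)]

# Evaluate every (f, x) pair once up front; amortised over many plays of
# the game, each round is then a table lookup rather than a re-evaluation.
TABLE = {(name, x): f(*x) for name, f in GATES.items() for x in X}

def filter_consistent(candidates, x, observed):
    """One round of the game: drop candidates that disagree with the leak."""
    return [c for c in candidates if TABLE[(c, x)] == observed]

# E.g. if the hidden f* is XOR, querying (1, 1) then (0, 0):
survivors = filter_consistent(list(GATES), (1, 1), 0)  # ['NAND', 'XOR']
survivors = filter_consistent(survivors, (0, 0), 0)    # ['XOR']
```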

  2. Could you give an example of a function in F in closed-form?

Are you assuming that the adversary already knows what F and the ranges of the functions f_i are (and therefore the values in X)? Can adversaries evaluate f_i(x) before choosing x? Is F a finite set? If so, then I think it’s an easy trap to fall into thinking of this as a learning problem.

I think it could rather be seen as a “search” problem where you’re trying to find values of x where the f_i(x) values are mutually exclusive (regardless of whether X is a discrete or continuous set).

If you already know what each f_i looks like, then you can first solve each pair f_i, f_j for the x values where f_i(x) = f_j(x). Collect the values where there is overlap into a set X’, and simply pick x values that aren’t in this set. If X \ X’ isn’t empty, you can pretty much figure out which f was chosen in one iteration. If not (as in the example you posted), you can tag each point in X’ with the number of pairwise equalities it satisfies.

A downside of this approach, of course, is that you’d need a different way of generating X’ for each class of problems F. So really, the papers you’d want to read become ones on how to solve equalities efficiently when you have (|F| choose 2) equalities to solve.
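Over a finite X, "solving" each equality degenerates to checking agreement pointwise, so the whole procedure fits in a few lines. A sketch that tallies, for each x, how many of the pairwise equalities hold there:

```python
from itertools import combinations

GATES = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "XOR":  lambda a, b: a ^ b,
}
X = [(1, 1), (0, 1), (0, 0), (1, 0)]

def overlap_counts(functions, points):
    """For each x, count the pairs (f_i, f_j) with f_i(x) == f_j(x) --
    the discrete analogue of solving all (|F| choose 2) equalities."""
    return {
        x: sum(1 for f, g in combinations(functions.values(), 2) if f(*x) == g(*x))
        for x in points
    }

print(overlap_counts(GATES, X))  # → {(1, 1): 2, (0, 1): 3, (0, 0): 3, (1, 0): 3}
```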

2 Examples - 1 the same as you posted and 1 my own.

Example 1:
AND(x) = OR(x) → (1,1) and (0,0)
AND(x) = NAND(x) → empty
AND(x) = XOR(x) → (0,0)
OR(x) = NAND(x) → (0,1), (1,0)
OR(x) = XOR(x) → (0,0), (1,0), (0,1)
NAND(x)=XOR(x) → (1,0), (0,1), (1,1)

(1,1): 2 overlaps
(0,0): 3 overlaps
(1,0): 3 overlaps
(0,1): 3 overlaps

From here we can tell that we should probably choose (1,1) as our first query.
If n=2 and we’ve narrowed f down to OR and AND, then we repeat this again with just OR and AND:

AND(x) = OR(x) → (1,1) and (0,0)
(1,1): 1 overlap
(0,0): 1 overlap
(1,0): 0 overlap
(0,1): 0 overlap

So in the 2nd iteration we choose (1,0) or (0,1), which will give us a definitive answer.

Example 2:
F: x^2, x^3, -x, x
X: all real numbers

Again, just evaluate the equalities and solve for x; you get the following agreement points (x, y):

(0,0): 4 overlaps (all four functions)
(1,1): 3 overlaps (x^2, x^3, x)
(-1,1): 2 overlaps (x^2, -x)
(-1,-1): 2 overlaps (x^3, x)
(the rest): 0 overlaps

Now you just choose any x in X \ {-1, 0, 1} (the x-coordinates of those overlap points) and you’ve got yourself a solution.
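A quick check confirms this for Example 2: any x outside {-1, 0, 1} makes all four functions take distinct values, so a single query there identifies f*. A sketch:

```python
# Candidate functions from Example 2 (the names are just labels).
F = {
    "x^2": lambda x: x ** 2,
    "x^3": lambda x: x ** 3,
    "-x":  lambda x: -x,
    "x":   lambda x: x,
}

def fully_separating(x):
    """True iff all functions in F take distinct values at x, so a
    single query at x identifies f* outright."""
    vals = [f(x) for f in F.values()]
    return len(set(vals)) == len(vals)

# The overlap x-coordinates found above are {-1, 0, 1}; anything else works.
assert not any(fully_separating(x) for x in (-1, 0, 1))
assert fully_separating(2)       # values 4, 8, -2, 2 are all distinct
```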


Nice! Thanks for the response :slight_smile:

The VC dimension seems to be quite closely related to (or the opposite of?) what you are talking about, if you haven’t looked at it.

How much overlap is there among the functions in F each time the adversary has to solve? (Like, can you just take advantage of memoization, so they just have to query without solving a lot of equalities?)

We can assume they play the game many times with the same F (e.g. always trying to figure out what kind of swap a txn is)

Could you give an example of a function in F in closed-form?

I assume you mean an element of F. It’s easier to think of the functions as programs than as analytic functions. E.g. a swap could be thought of as code that takes a list of pools as input (each pool parameterised by a price, liquidity, and asset pair) and checks whether a set of these pools conforms to certain parameters (e.g. prices are within a certain bound) before returning true or false.
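A toy sketch of the kind of program meant here; the `Pool` fields, the `swap_ok` name, and the bound checks are all illustrative assumptions, not a real AMM or EVM interface:

```python
from dataclasses import dataclass

@dataclass
class Pool:                  # illustrative fields, not a real AMM interface
    price: float
    liquidity: float
    pair: tuple

def swap_ok(pools, pair, max_price, min_liquidity):
    """Toy 'transaction': returns True iff some pool for `pair` is within
    the price bound and deep enough. The returned boolean is exactly the
    one-bit leak L from the abstract game."""
    return any(
        p.pair == pair and p.price <= max_price and p.liquidity >= min_liquidity
        for p in pools
    )

# Example 'VM state': a single pool.
state = [Pool(price=1999.0, liquidity=2e5, pair=("ETH", "USDC"))]
print(swap_ok(state, ("ETH", "USDC"), max_price=2000.0, min_liquidity=1e5))  # → True
```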

Sorry it took forever for me to reply. Been busy with writing and people ops…

I was debating between just giving quick answers vs. long-form writing to have a convo, but decided on the latter. There’s a TL;DR at the end if you’re in a hurry…


Long-version:

Seems like the beauty is in the context - mixing info leakage and adversarial games (I read Pt 2).

So are you trying to access mempool data and figure out what kind of txn it is, just faster via a learning algo? Or preventing that by reverse-engineering the best way to achieve it? There are many more applications, I suppose. :upside_down_face:

I don’t know the domain as well as you do, but I see why you’re looking at learning algos vs. a deterministic approach. Writing a separate deterministic algo every time there’s a major change in the smart contracts that have been deployed would be a lot of (and unpredictable) work. Info-leakage prevention (and any other market-microstructure design choice) is baked into smart contract design. Really cool stuff…

---- Thinking out loud on the construct of the problem.

There seems to be limited time (or compute resources) to guess what kind of txn it is, and it seems like the distribution over txns is pretty much given. Plus you’d use the model quite often, so you’d want something that gets better with time.

What are you using concept learning for?

Correct me if I’m wrong on this. What I was thinking is that you’re doing concept learning with txn types as concepts and their parameters as the concepts’ features. In that case, I see why you’re thinking about concept modeling + computational learning theory + a PAC-Bayesian approach: learn the concepts, then use ideas from computational learning theory to find a “shortest” sequence of data points to pick to guess f*. And if you attach a prior, you can let your algo get better over time, since it would run in real time every day (a lot of data points…). PAC-Bayes is nice for not worrying about picking the right prior…
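The “shortest sequence of data points” idea can be sketched as a greedy one-step information-gain rule over the running toy example (just one heuristic, not necessarily the optimal policy):

```python
import math
from collections import defaultdict

GATES = {
    "AND":  lambda a, b: a & b,
    "OR":   lambda a, b: a | b,
    "NAND": lambda a, b: 1 - (a & b),
    "XOR":  lambda a, b: a ^ b,
}
X = [(1, 1), (0, 1), (0, 0), (1, 0)]

def best_query(prior, points, functions):
    """Greedy step: pick the x that minimises expected posterior entropy,
    i.e. maximises expected information gain from one query."""
    def expected_h(x):
        by_leak = defaultdict(list)
        for name, p in prior.items():
            by_leak[functions[name](*x)].append(p)
        h = 0.0
        for ps in by_leak.values():
            mass = sum(ps)
            h -= sum(p * math.log2(p / mass) for p in ps if p > 0)
        return h
    return min(points, key=expected_h)

uniform = {name: 0.25 for name in GATES}
print(best_query(uniform, X, GATES))  # → (1, 1), the only even 2-vs-2 split
```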

Not something I could have thought of personally. But I guess that’s the beauty of collaboration - you fill in each other’s gaps and you learn from each other!

— Thoughts on solution areas - I think there are some other approaches.

Exactly what’s the runtime limit? Is just polynomial time good enough?

Could try LDA, with Gibbs sampling or VI - welp, I’m a :sloth: right now… but it’s literally a parameter to choose in most packages. EVM data should really be a mix of numerical and code data if any of the addresses link to a smart contract at some point, so maybe not so many tweaks in the ETL pipeline (hopefully). Just plug in data and it’ll tell you with some statistical significance what type of txn it might be (a guess of f*).

Last thought: maybe getting explicit concepts for some categorization of txn types isn’t important (although a good artifact to have IMO). Then it may be possible to skip that and turn it into a one-part problem of just optimizing your reward (as in Pt. 2). Think black-box optimization methods here - I think Bayesian Opt would work nicely. But be wary that it suffers from the curse of dimensionality (although I don’t think you have that many dimensions in EVM data? - dimensionality reduction is nice too). So I’d focus more on the black-box optimization part, with the search criteria being robustness to data size, time complexity, and fitness to your data shape.

I’ll admit I’m a bit of a Bayesian ML fan.

Anyways, thanks for a fun problem - this was a fun pre-sleep exercise!


TL;DR:
2 additional ideas -

  1. Could try LDA (Latent Dirichlet Allocation). EVM data should really be a mix of numerical and code data if any of the addresses link to a smart contract at some point, so maybe not so many tweaks in the ETL pipeline (hopefully). Just plug in data and it’ll tell you with some statistical significance what type of txn it might be (a guess of f*).

  2. Pt. 1 and Pt. 2 don’t have to be different parts. If you don’t care about getting explicit concepts for some categorization of txn types (although a good artifact to have IMO), then it may be possible to just turn it into a one-part problem with black-box optimization. My favorite is Bayesian Optimization - it’ll constantly update and make a good guess within a few guesses. The complexity is really a matter of choosing the kernel and acquisition function. There’s a ton of good literature on it.

I’ll admit I’m a bit of a Bayesian fan. And personally, I couldn’t have thought of the computational learning theory approach. But I guess that’s the beauty of collaboration - you fill in each other’s gaps and learn from each other!
