Privacy Concerns with Data API

jtraglia · October 17, 2022, 8:45pm

I’d like to modify the data API validator registration endpoint to be more private. Given a list of validator pubkeys, it’s possible to download all of the registrations and run analytics on them. One particular concern is that these could be used to group validators that were previously not associated. Because validators send registrations in batches, registrations for validators on the same machine will most likely have the exact same timestamp. This information could facilitate DoS attacks.
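To make the concern concrete, here is a minimal sketch of the grouping analysis. The record layout (dicts with "pubkey" and "timestamp" keys) and the sample values are assumptions for illustration; the only thing taken from the post is that registrations carry a timestamp.

```python
from collections import defaultdict

def group_by_timestamp(registrations):
    """Group validator pubkeys by their registration timestamp.

    Each registration is assumed to be a dict with "pubkey" and
    "timestamp" keys (hypothetical layout).
    """
    groups = defaultdict(list)
    for reg in registrations:
        groups[reg["timestamp"]].append(reg["pubkey"])
    # Keep only timestamps shared by more than one validator.
    return {ts: keys for ts, keys in groups.items() if len(keys) > 1}

# Placeholder data, not real registrations.
sample = [
    {"pubkey": "0xaaa", "timestamp": 1666040700},
    {"pubkey": "0xbbb", "timestamp": 1666040700},
    {"pubkey": "0xccc", "timestamp": 1666040999},
]
linked = group_by_timestamp(sample)
```

In this sample, the two pubkeys that share a timestamp end up in one group, which is the association the endpoint currently makes easy to infer.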

There are multiple ways to go about this.

Remove the validator registration data API endpoint.
Return “yes” or “no” to whether the validator is registered, instead of the registration.
Require some type of signed message to get the registration.

mateusz · October 18, 2022, 9:29am

The endpoint is very useful to solo stakers that want to verify their once-a-year block will actually be built how they want.
I agree that signing the message would be the best way to go, but it’s also the most complex - can it be easily done?
Otherwise, how about confirming whether an exact registration is seen - pubkey + gas + fee recipient? I think that wouldn’t leak as much information as the current API while still being useful.

metachris · October 18, 2022, 11:16am

Yes, this is a useful endpoint for both solo stakers and staking pools. It seems it would still achieve its goal without exposing the feeRecipients.

One idea could be that the API returns only yes or no about whether this is an active (or known?) validator on the relay.

It could also be an idea to allow sending pubkey+feeRecipient/gasLimit, and have the API return whether that’s the latest known/active registration (for someone to check the latest feeRecipient without exposing it publicly).

Lastly, staking pools want a way to check many pubkeys at once. This API could also accept an array of pubkeys and return all the responses at once.

jtraglia · October 18, 2022, 2:44pm

That’s an interesting alternative. Just to be clear, something like (but joined into one line):

https://boost-relay.flashbots.net/relay/v1/data/validator_registration
    ?pubkey=<pubkey>
    &feeRecipient=<feeRecipient>
    &gasLimit=<gasLimit>

That would return something like:

{
  "found": true
}
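As a rough sketch, the single-line request could be assembled like this. The parameter names are the ones proposed in this post, not a documented relay API, and `build_check_url` is a hypothetical helper. Once joined, only the first parameter follows `?`; the others are separated by `&`.

```python
from urllib.parse import urlencode

# Endpoint from the post; the query parameters are only a proposal.
BASE_URL = "https://boost-relay.flashbots.net/relay/v1/data/validator_registration"

def build_check_url(pubkey, fee_recipient, gas_limit):
    """Build the proposed lookup URL for one registration."""
    params = {
        "pubkey": pubkey,
        "feeRecipient": fee_recipient,
        "gasLimit": str(gas_limit),
    }
    # urlencode joins the parameters with "&" and escapes the values.
    return BASE_URL + "?" + urlencode(params)

# Placeholder values, not a real validator.
url = build_check_url("0xabc", "0xdef", 30000000)
```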

This would resolve my biggest concern (timestamps) but it would still be possible for anyone to check if a validator is connected to a relay at a given point in time. That information (pubkey, fee recipient, gas limit) is mostly public. You can assume 30000000 for the gas limit and look at the most recent (post-merge) block to determine the fee recipient.

I think this is okay though. It’s a good balance between privacy and usability. Requesting a signed message would require several changes to validators/remote-signers and I don’t think that’s worth it.

jtraglia · October 18, 2022, 8:06pm

I like this idea too. Send a POST request with an array of registrations:

[
  {
    "pubkey": "<a_pubkey>",
    "fee_recipient": "<a_fee_recipient>",
    "gas_limit": "<a_gas_limit>"
  },
  {
    "pubkey": "<b_pubkey>",
    "fee_recipient": "<b_fee_recipient>",
    "gas_limit": "<b_gas_limit>"
  }
]

The gas_limit field could be optional. I don’t think the fee_recipient should be optional though. Only return found entries. For example, if the first one exists and the second one doesn’t:

[
  {
    "pubkey": "<a_pubkey>",
    "fee_recipient": "<a_fee_recipient>",
    "gas_limit": "<a_gas_limit>"
  }
]

Or, if you want to be minimalist, an array of found public keys:

[
  "<a_pubkey>"
]

There should be some limits/restrictions. For example, limit the number of entries to 1,000 per request. This would prevent someone from sending a request with 10,000,000 duplicate entries.
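A minimal sketch of how a relay might handle such a request, assuming a hypothetical in-memory store `known` that maps each pubkey to its latest registration. It enforces the 1,000-entry limit, requires fee_recipient, treats gas_limit as optional, and returns only the entries that match.

```python
MAX_ENTRIES = 1000  # per-request limit suggested above

def check_registrations(entries, known):
    """Return the submitted entries that match a known registration.

    `known` is a hypothetical store: pubkey -> {"fee_recipient", "gas_limit"}.
    """
    if len(entries) > MAX_ENTRIES:
        raise ValueError(f"too many entries: {len(entries)} > {MAX_ENTRIES}")
    found = []
    for entry in entries:
        reg = known.get(entry["pubkey"])
        if reg is None:
            continue  # pubkey not registered
        if reg["fee_recipient"] != entry["fee_recipient"]:
            continue  # fee_recipient is required and must match
        # gas_limit is only compared when the caller supplied it.
        if "gas_limit" in entry and str(entry["gas_limit"]) != str(reg["gas_limit"]):
            continue
        found.append(entry)
    return found
```

Rejecting oversized requests before doing any lookups keeps the cost of a request bounded regardless of how many duplicates it contains.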

metachris · October 18, 2022, 8:12pm

I like that API proposal a lot.

It’s an important detail whether fee_recipient should be optional!

It would make the API even more private, because you already need to know both, and can’t look anyone up just by pubkey. Downside is, it would probably make it a little harder to casually check a valid registration.

If there are no specific needs otherwise, then I think there’s a strong argument for going with the more private solution (requiring pubkey and fee_recipient as inputs to the APIs, with gas_limit optional).

metachris · October 18, 2022, 8:15pm

Another important question: Should the API only check active validators, or also all ever-registered ones?

I think active validators only sounds more useful, but would love to hear more use-cases.

12345 · October 18, 2022, 9:10pm

Ultimately, this API is useful to check that many validators are correctly/actively registered. Any of the proposals here would work for me!

As an idea to normalize the data a bit:

{
  "fee_recipient_1": [ "pubkey_1a,gas_limit", "pubkey_1b", ..., "pubkey_1n" ],
  "fee_recipient_2": [ "pubkey_2a", "pubkey_2b,gas_limit", ..., "pubkey_2n,<optional-gas-limit>" ],
}

Results in a response with immediately actionable information. Under ideal conditions, this response would be empty.

[
  "<missing_pubkey_a>",
  "<...>",
  "<missing_pubkey_n>"
]
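For comparison with the verbose format, here is a sketch of how the grouped request could be expanded into one entry per validator. It assumes each string is either "pubkey" or "pubkey,gas_limit", as in the example above; `expand_grouped` is a hypothetical helper name.

```python
def expand_grouped(grouped):
    """Expand {fee_recipient: ["pubkey[,gas_limit]", ...]} into a list of entries."""
    entries = []
    for fee_recipient, items in grouped.items():
        for item in items:
            # Split on the first comma; gas_limit is "" when absent.
            pubkey, _, gas_limit = item.partition(",")
            entry = {"pubkey": pubkey, "fee_recipient": fee_recipient}
            if gas_limit:
                entry["gas_limit"] = gas_limit
            entries.append(entry)
    return entries

# Placeholder values for illustration.
expanded = expand_grouped({"0xf1": ["0xa,30000000", "0xb"]})
```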

metachris · October 25, 2022, 10:36am

12345:

As an idea to normalize the data a bit:

{
  "fee_recipient_1": [ "pubkey_1a,gas_limit", "pubkey_1b", ..., "pubkey_1n" ],
  "fee_recipient_2": [ "pubkey_2a", "pubkey_2b,gas_limit", ..., "pubkey_2n,<optional-gas-limit>" ],
}

I prefer the more verbose version which @jtraglia proposed above, because I feel it’s more clear and expressive (at the cost of additional bytes):

[
  {
    "pubkey": "<a_pubkey>",
    "fee_recipient": "<a_fee_recipient>",
    "gas_limit": "<a_gas_limit>"
  },
  {
    "pubkey": "<b_pubkey>",
    "fee_recipient": "<b_fee_recipient>"
  }
]

Imo, the API should return the found entries, but it could make sense to add an optional flag to request a list of keys that were not found too.

12345 · October 25, 2022, 10:53am

Definitely agree, returning the set of found entries is more intuitive. However, the real actionable information we need is the complement of that set.

Perhaps returning two arrays, registered and unregistered?

metachris · October 25, 2022, 1:09pm

My gut feeling is that it’s better to return one, and have a query arg that allows returning the other. But open to discussion!
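A small sketch of that option: return the found entries by default, and the pubkeys that were not found when a flag is set. The flag name `return_missing` and the `is_registered` callback are hypothetical; nothing in this thread fixes their names or shape.

```python
def select_results(entries, is_registered, return_missing=False):
    """Return found entries, or the pubkeys of missing ones when the flag is set.

    `is_registered` is a caller-supplied predicate (hypothetical) that
    decides whether a submitted entry matches a known registration.
    """
    found, missing = [], []
    for entry in entries:
        if is_registered(entry):
            found.append(entry)
        else:
            missing.append(entry["pubkey"])
    return missing if return_missing else found

# Placeholder data: only "0xa" counts as registered here.
entries = [
    {"pubkey": "0xa", "fee_recipient": "0xf1"},
    {"pubkey": "0xb", "fee_recipient": "0xf2"},
]
registered = select_results(entries, lambda e: e["pubkey"] == "0xa")
unregistered = select_results(entries, lambda e: e["pubkey"] == "0xa", return_missing=True)
```

Under ideal conditions the second call would return an empty list, which is the actionable signal discussed above.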

metachris · November 8, 2022, 11:24am

Put together a more ‘formal’ spec here: [spec proposal] improved API to get validators · Issue #237 · flashbots/mev-boost-relay · GitHub

Will start implementation soon. If you have any remaining comments or thoughts, please post them soon.