Mempool Dumpster - a Free Mempool Transaction Archive

metachris · September 15, 2023, 7:56am

Mempool Dumpster is a free mempool transaction archive, brought to you by Flashbots in collaboration with various transaction providers (Infura, bloXroute, Chainbound, Eden, as well as local geth nodes):

Data: mempool-dumpster.flashbots.net
The data is published under the CC-0 public domain license (allowing any use whatsoever)
Source code: GitHub - flashbots/mempool-dumpster: Dump all the mempool transactions 🗑️ ♻️ (in Parquet + CSV)

The core output is a daily Parquet and CSV file, which contains:

Transaction metadata
Information about which sources it was received from
Raw transaction (RLP-encoded, only in the Parquet file)

You can find the files on the website for a particular month, like this one: 2023-09.

CSV example: Mempool Dumpster - Daily CSV Example · GitHub
Daily summary file, like 2023-09-14_summary.txt
For fast query execution, we recommend working with the Parquet output files (more details below).

ClickHouse

We strongly recommend ClickHouse to work with the datasets - it’s versatile, performant, and optimized for large datasets.

You can use ClickHouse locally or also ClickHouse Cloud, where you can upload the CSV files and have a nice user interface.
ClickHouse is open-source, and you can install it based on the installation instructions, or using Homebrew:

brew install clickhouse

Download data

Grab a dataset for a particular day from here:

wget https://mempool-dumpster.flashbots.net/ethereum/mainnet/2023-09/2023-09-14.parquet

Explore the dataset

# Show the schema
$ clickhouse local -q "DESCRIBE TABLE '2023-09-14.parquet';"

# Get exclusive transactions from bloxroute
$ clickhouse local -q "SELECT COUNT(*) FROM '2023-09-14.parquet' WHERE length(sources) == 1 AND sources[1] == 'bloxroute';"

# Count exclusive transactions from all sources (query takes <1 sec)
$ clickhouse local -q "SELECT sources[1], COUNT(*) FROM '2023-09-14.parquet' WHERE length(sources) == 1 GROUP BY sources[1];"
eden	22460
local	309
infura	133
apool	86
bloxroute	29871
chainbound	20008

# Get details for a particular hash
$ clickhouse local -q "SELECT timestamp,hash,from,to,hex(rawTx) FROM '2023-09-14.parquet' WHERE hash='0x09ec56414462c3980772999e0d27fa0a79dcb667a156e1a1532ed0f5eaa672f3';"

These are the available fields:

$ clickhouse local -q "DESCRIBE TABLE '2023-09-14.parquet';"
timestamp	Nullable(DateTime64(3))
hash	Nullable(String)
chainId	Nullable(String)
from	Nullable(String)
to	Nullable(String)
value	Nullable(String)
nonce	Nullable(String)
gas	Nullable(String)
gasPrice	Nullable(String)
gasTipCap	Nullable(String)
gasFeeCap	Nullable(String)
dataSize	Nullable(Int64)
data4Bytes	Nullable(String)
sources	Array(Nullable(String))
rawTx	Nullable(String)

Have fun, and please share your results, insights and feature requests!

vitaliy · September 15, 2023, 9:07am

I recommend using GitHub - dvush/mempool-dumpster-rs: Utils for flashbots/mempool-dumpster for downloading data.

It converts everything including sourcelog files and transaction-data files to parquet.
Its convenient and can be used to update dataset continuously (by rerunning this command every day).

For example,

mempool-dumpster get 2023-09

would download all sourcelog and transaction data for a month (2023-09). If data for some days is already there it will only download what is missing.
so you can rerun it every day and download the latest files only

0xEvan · September 22, 2023, 12:53am

Thanks for the data @metachris and the useful download tool @vitaliy .

I have some questions about the data accuracy for the month of September:

I took all of the sourcelog hash’s that appear more than 10 times and searched that against the transaction-data. There are about 222k unique hash’s that appear more than 10 times. The max count for some of these transactions ranges anywhere from 10-80 in a fairly distributed fashion.

Within the transaction-data files, the max count for duplicate hash’s that appear in the transaction-data tops out at 10.

Some of these transactions have quite a lot of ETH in them. Here is a subset of these transactions. I removed the timestamp column to do this groupby - it seems that the entire row is identical besides the timestamp.

Here is a subset of these questionable transactions:
shape: (222_716, 3)

hash	value	count
str	f64	u32
“0xd5fe940e36f0f1751c942d0a607a5eece6a8c703e4bbdff4643b289aa2130c85”	242.5	4
“0xb629b6b6c56a04e6b691d7f280e68352651338f2682ab54249edfd820d3efb63”	140.83454	10
“0xab8686aec7609924a4292bea3a2c77de017fc461c57fb00fc0dd96a86d35481c”	100.0019	10
“0x47ecdc4d7d21d26a4c689e03b34080f0aa5530e1cc661f25a177c5f8996be8e9”	96.0	5
“0xed39307ab815f96b2e6bded72276e2e42305df500b595aa0dc23ab83f5a392b0”	96.0	5
“0xc0d026c482810220444e2231dcdb9be03dd08fe6533f83e9d774c2240b827ba7”	84.71442	9
“0x229a59832b9dd174d6c6326adb9754aad9e7ec8055e32ef74d643c693589a5ef”	45.318917	9
“0xbc7b0dca0f8697b67ee33a7dd859fa293061d454449f4baa8cf716a44940e4da”	32.0	9
“0x9381f46465a175e69da23ecb87821f8574c317eaee64b55efa7e666b8afe1c3f”	27.992319	9
“0x2437a267466257bc2f15af6e9e5915ae916996d3f0bc9bac90f71e2dad4f697b”	26.398904	10

Are you able to verify whether this is accurate transaction data or if there is some sort of replication bug in the ingestion process?

metachris · September 22, 2023, 8:07am

Thanks for sharing your insights and questions!

Some notes:

Several sources resend large amounts of old, pending and already included transactions.
Mempool dumpster only checks inclusion status since a few days, and filters out already included transactions (we’ll backfill this and sanitize the old data in the next couple of days)
Transactions are not validity-checked yet (they could have an invalid nonce, or the sender not enough balance to actually transfer the value)
It does not seem like a bug in the ingestion process, these are transactions that were repeatedly sent by some sources, and might just be no good.

Looking at some of your examples with the highest value:

0xd5fe940e36f0f1751c942d0a607a5eece6a8c703e4bbdff4643b289aa2130c85 - already included on-chain
0xb629b6b6c56a04e6b691d7f280e68352651338f2682ab54249edfd820d3efb63 - pending since 20 days
0xab8686aec7609924a4292bea3a2c77de017fc461c57fb00fc0dd96a86d35481c - pending since 20 days
0x47ecdc4d7d21d26a4c689e03b34080f0aa5530e1cc661f25a177c5f8996be8e9 - probably being kicked around the mempool for a while

offerm · September 27, 2023, 10:14am

current way of presentation does not provide real value to users.
for example, check p99 - it looks like chainbound is much faster on top 1% of the transactions…
but 1% when chainbound first is 3120 txs while 1% when blox first is 6106 tx.
Same for median and the others.

Usually, you do this stats on all results together. So median also tells you how fast/slow the provider is.

With the current format you can’t actually tell which source is faster and by how much.

I suggest to change the report so it will look like this:

metachris · September 29, 2023, 8:27am

In the example above, the number and percentage of which source is faster is already clearly displayed, isn’t it?

Also “median” shows by how much 50% were faster, p90 shows how much the top 10% were faster, p99 shows the speed advantage of the top 1%.

I’m not sure what’s missing, and how you would like calculate the numbers specifically

offerm · September 29, 2023, 3:09pm

With the current report (the example above) you calculate seperate statistics for the 2 groups.
in the “bloxroute first” group the median is 5ms and also in the “chainbound first” the median is 5m
but what is teh real median?

It is clear that bloXroute provides first 50% of the transactons, but the time difference is not clear.

What I suggest is to calculate the timediff for each transaction and to take the median, p25, p75, p10 and p90 on that variable,.

This will show the full picture.

offerm · September 29, 2023, 3:17pm

Formatting - I leave that to you… The first 4 lines value - you have that already in current report. To calculate the next 5 lines, you create a new variable diff=bloxtime-chainTime.

negative value means that bloXroute was first, positive value means that chainbound was first. Now you sort all these values in ascending order and you check the value in 10%, 25%, 50% (median), 75%, 90%. If the number is negative you write bloxroute won p% of txs by at lease abs(value)' if the number is positive you write chainbound won (100-p)% by at least abs(value)`

metachris · September 29, 2023, 7:03pm

That’s already what’s happening: mempool-dumpster/common/analyzer.go at main · flashbots/mempool-dumpster · GitHub

sui414 · October 2, 2023, 7:25am

i think the original summary is helpful and pretty clear, to show the stats for the subset of each mempool’s winning tx set 's distribution - so to have an idea about “when X mempool see this % of txs first, how faster do they see it”.

but i think what @offerm suggested also has value to show the stats for the whole shared tx distribution.

it’s like a “bayesian conditional distribution” f(-delta|they_win) vs “global distribution” f(delta) if to make an analogy. And imo both are valuable to show.

I think ideally they meant something like this -

take the whole distribution. for each shared tx hash;
take the very first ts seen by each mempool;

image2186×992 343 KB
take the delta of the 2 mempools’ first ts;
then take the stats (p25…p99):

image2288×1318 286 KB

so the result summary table can add a column like this:

Also i think the way @offerm described in the summary table (e.g. BLOXROUTE won 10% of tx by at least xxx ms) is hard to comprehend / less intuitive for users altho it is strict accurate statistical description, imo prob just listing the stats in a column like above is easier to read.

sui414 · October 2, 2023, 8:03am

smol tweak, maybe move the global column to the first, and add the total count will make it more obv it’s the global stats:

0xEvan · October 12, 2023, 2:31am

hey side question, what kind of IDE are you using?

0xEvan · October 12, 2023, 2:44am

@metachris some additional questions,

what are the gasTipCap and ‘gasFeeCap’ columns in transactions-data?
what is the difference between “local” and “mempoolguru” source? Mempoolguru runs 7 full nodes geographically distributed. Mempoolguru also collects from infura. Does this mean that there is overlap between the sources? If so how much overlap?

sui414 · October 12, 2023, 12:42pm

it’s https://hex.tech/

metachris · October 16, 2023, 9:04am

gasTipCap: EIP-1559 tip per gas
GasFeeCap: EIP-1559 fee cap per gas
local vs mempool guru overlap: to find the specific overlap, you can download the dataset and dig into it yourself - Mempool Dumpster ♻️

0xEvan · October 16, 2023, 7:13pm

local vs mempool guru overlap: to find the specific overlap, you can download the dataset and dig into it yourself - Mempool Dumpster ♻️

i briefly did and it looks like there is a non-trivial amount of overlap. Additionally, it looks like mempool guru is being recorded for each mempool guru node local mempool the tx is being sent to so there were a number of transactions being shown as being submitted to mempool guru as 7 (their total number of local nodes).

I didn’t see further motivation to look into the issue more and I don’t think there is anything better that can be done, just wanted to note that this does appear to exist and needs to be taken into account for anyone doing mempool analysis

metachris · October 18, 2023, 9:29am

I’m not quite sure what you mean with “needs to be taken into account for analysis” – could you clarify?

There is a large amount of overlap between the sources, as is expected, and as exists across all sources. The number of mempool guru nodes doesn’t really matter because we have a single subscription where MD receives transactions from.

Here’s a few queries to show the amount of overall transactions for local and mempool guru, and the number of transactions that were seen by both (for 2023-10-17):

$ clickhouse local -q "SELECT COUNT(*) FROM '2023-10-17.parquet' WHERE has(sources, 'mempoolguru');"
948792
$ clickhouse local -q "SELECT COUNT(*) FROM '2023-10-17.parquet' WHERE has(sources, 'local');"
950397
$ clickhouse local -q "SELECT COUNT(*) FROM '2023-10-17.parquet' WHERE hasAll(sources, ['mempoolguru', 'local']);"
938537

0xEvan · October 19, 2023, 9:56pm

First of all, that’s a nifty keyword has from your queries! Secondly, while I agree that we should expect a large overlap between sources, I am questioning the reason for why that is. I think if someone is submitting a transaction to multiple sources, then it makes sense that it is recorded multiple times. However, if multiple sources are just picking up the same tx from the same mempool and then reporting it independently, then I think this is a (slight) issue and at the very least, needs to be filtered out for additional analysis hence my phrase “needs to be taken into account for analysis”.

It appears that you are talking about “local” and “mempool guru” as if they are different sources - at least this is my impression and I will assume this is true, but please correct me if I am wrong here.

The way that “mempool guru” collects data is defined as

"Currently, we collect mempool data from 7 full nodes across the globe (marked on the map below), as well as from RPC services such as Infura and QuickNode. "

The reason why I think it’s wrong to consider the “local” node different from mempool guru is that mempool guru transactions are already geographically distributed. So, for example, if the local node is located in Europe/USA next to one of the mempool guru full nodes, the transactions will practically be identical. However, I can understand the rational to run local nodes that are farther away than the mempool guru nodes and I hope that some thought went into the geographical location for where the local node is being run.

Additionally, my understanding is that you collect mempool txs from Infura and Quicknode separately than mempool guru. Or do you consider the mempool guru txs that come from Infura and Quicknode as the actual txs coming from Infura and Quicknode?

__
The overall issues I am seeing here are that there are multiple mempool transactions being collected from each source, but it’s not necessarily because the tx is being submitted to multiple mempools. Rather it seems to be because each source is just collecting the same tx from the same mempool source. However, if I am misunderstanding something, then please let me know!

metachris · October 20, 2023, 12:14pm

For MD the source doesn’t really matter, it’s just trying to get a most complete picture of mempool transactions from across all available sources.

The sources currently include Apool, Bloxroute, Chainbound, Eden, Infura, Local, Mempoolguru.

Transactions from all these sources provide a good, geographically distributed, picture of mempool transactions at any point in time.

alexchenzl · October 22, 2023, 2:25pm

Hi @metachris, the daily archive data is awesome. But I’m curious some information about how the daily data are collected, such as:

in which AWS region are those mempool APIs called ?
Which subscription plan is used with Bloxroute, since higher plan should get better performance.