Mempool Dumpster - a Free Mempool Transaction Archive

Mempool Dumpster is a free mempool transaction archive, brought to you by Flashbots in collaboration with various transaction providers (Infura, bloXroute, Chainbound, Eden, as well as local geth nodes).

The core output is a daily Parquet and CSV file, which contains:

  • Transaction metadata
  • The sources each transaction was received from
  • Raw transaction (RLP-encoded, only in the Parquet file)

You can find the files on the website for a particular month, like this one: 2023-09.

ClickHouse

We strongly recommend ClickHouse to work with the datasets - it’s versatile, performant, and optimized for large datasets.

  • You can use ClickHouse locally, or ClickHouse Cloud, where you can upload the CSV files and work with them through a convenient UI.
  • ClickHouse is open-source, and you can install it based on the installation instructions, or using Homebrew:
brew install clickhouse


Download data

Grab a dataset for a particular day from here:

wget https://mempool-dumpster.flashbots.net/ethereum/mainnet/2023-09/2023-09-14.parquet
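The daily files follow a predictable URL scheme, so a download URL for any day can be assembled from its date. A minimal sketch in Python (stdlib only; `parquet_url` is a hypothetical helper, with the URL pattern taken from the `wget` example above):

```python
from datetime import date

BASE = "https://mempool-dumpster.flashbots.net/ethereum/mainnet"

def parquet_url(day: date) -> str:
    """Build the download URL for a given day's Parquet file."""
    month = day.strftime("%Y-%m")          # e.g. "2023-09"
    return f"{BASE}/{month}/{day.isoformat()}.parquet"

print(parquet_url(date(2023, 9, 14)))
# https://mempool-dumpster.flashbots.net/ethereum/mainnet/2023-09/2023-09-14.parquet
```

You could feed the resulting URL to `wget`, `curl`, or `urllib.request.urlretrieve`.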

Explore the dataset

# Show the schema
$ clickhouse local -q "DESCRIBE TABLE '2023-09-14.parquet';"

# Get exclusive transactions from bloxroute
$ clickhouse local -q "SELECT COUNT(*) FROM '2023-09-14.parquet' WHERE length(sources) == 1 AND sources[1] == 'bloxroute';"

# Count exclusive transactions from all sources (query takes <1 sec)
$ clickhouse local -q "SELECT sources[1], COUNT(*) FROM '2023-09-14.parquet' WHERE length(sources) == 1 GROUP BY sources[1];"
eden	22460
local	309
infura	133
apool	86
bloxroute	29871
chainbound	20008
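The exclusivity query above can be mirrored in plain Python: a transaction counts as "exclusive" when its `sources` array has exactly one entry. A toy sketch with made-up rows (hashes shortened, not real dataset values):

```python
from collections import Counter

def exclusive_counts(rows):
    """Count transactions seen by exactly one source, grouped by that source.

    rows: iterable of (tx_hash, sources) pairs, mirroring the `hash` and
    `sources` columns of the daily Parquet file.
    """
    counts = Counter()
    for _tx_hash, sources in rows:
        if len(sources) == 1:          # seen by a single source only
            counts[sources[0]] += 1
    return counts

# Toy data for illustration only
rows = [
    ("0xaa", ["bloxroute"]),
    ("0xbb", ["bloxroute", "chainbound"]),   # not exclusive
    ("0xcc", ["local"]),
    ("0xdd", ["bloxroute"]),
]
print(exclusive_counts(rows))   # Counter({'bloxroute': 2, 'local': 1})
```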

# Get details for a particular hash
$ clickhouse local -q "SELECT timestamp,hash,from,to,hex(rawTx) FROM '2023-09-14.parquet' WHERE hash='0x09ec56414462c3980772999e0d27fa0a79dcb667a156e1a1532ed0f5eaa672f3';"

These are the available fields:

$ clickhouse local -q "DESCRIBE TABLE '2023-09-14.parquet';"
timestamp	Nullable(DateTime64(3))
hash	Nullable(String)
chainId	Nullable(String)
from	Nullable(String)
to	Nullable(String)
value	Nullable(String)
nonce	Nullable(String)
gas	Nullable(String)
gasPrice	Nullable(String)
gasTipCap	Nullable(String)
gasFeeCap	Nullable(String)
dataSize	Nullable(Int64)
data4Bytes	Nullable(String)
sources	Array(Nullable(String))
rawTx	Nullable(String)
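Note that `value`, `nonce`, `gas`, and the gas price fields are stored as strings. A small sketch of converting a row's `value` to ETH (assumption: `value` is a base-10 wei amount, consistent with the ETH figures quoted in the follow-up discussion):

```python
from decimal import Decimal

WEI_PER_ETH = Decimal(10) ** 18

def wei_str_to_eth(value: str) -> Decimal:
    """Convert a decimal wei string from the dataset to an ETH amount."""
    return Decimal(value) / WEI_PER_ETH

print(wei_str_to_eth("242500000000000000000"))  # 242.5
```

Using `Decimal` instead of `float` avoids precision loss on large wei amounts.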

Have fun, and please share your results, insights and feature requests! :zap:


I recommend using dvush/mempool-dumpster-rs on GitHub (utils for flashbots/mempool-dumpster) for downloading data.

  1. It converts everything, including sourcelog files and transaction-data files, to Parquet.
  2. It's convenient and can be used to update the dataset continuously (by rerunning the command every day).

For example,

mempool-dumpster get 2023-09

would download all sourcelog and transaction data for a month (2023-09). If data for some days is already present, it only downloads what is missing, so you can rerun it every day to fetch just the latest files.
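The incremental behavior can be sketched as a set difference between the month's days and the files already on disk (a hypothetical helper; the real tool's logic may differ):

```python
import calendar
from datetime import date

def missing_days(year, month, existing):
    """Return days of the month that have no local file yet.

    existing: set of 'YYYY-MM-DD' strings already downloaded.
    """
    _, ndays = calendar.monthrange(year, month)
    all_days = {date(year, month, d).isoformat() for d in range(1, ndays + 1)}
    return sorted(all_days - set(existing))

# With 2023-09-01 .. 2023-09-13 already present, only later days remain:
have = {f"2023-09-{d:02d}" for d in range(1, 14)}
print(missing_days(2023, 9, have)[0])   # 2023-09-14
```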


Thanks for the data @metachris and the useful download tool @vitaliy.

I have some questions about the data accuracy for the month of September:

I took all of the sourcelog hashes that appear more than 10 times and searched for them in the transaction-data. There are about 222k unique hashes that appear more than 10 times. The max count for some of these transactions ranges anywhere from 10 to 80 in a fairly distributed fashion.

Within the transaction-data files, the max count for duplicate hashes tops out at 10.
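The duplicate analysis boils down to counting how often each hash appears in a file and keeping those over a threshold. A toy sketch with `collections.Counter` (threshold of 10 as in the analysis above; the hashes are made up):

```python
from collections import Counter

def hashes_over_threshold(hashes, threshold=10):
    """Return {hash: count} for hashes appearing more than `threshold` times."""
    counts = Counter(hashes)
    return {h: c for h, c in counts.items() if c > threshold}

# Toy sourcelog: one hash repeated 12 times, another only 3 times
sourcelog = ["0xaa"] * 12 + ["0xbb"] * 3
print(hashes_over_threshold(sourcelog))   # {'0xaa': 12}
```

The same function run against the transaction-data hashes would reproduce the sourcelog-vs-transaction-data comparison described here.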

Some of these transactions have quite a lot of ETH in them. I removed the timestamp column to do this groupby; it seems that the entire row is identical besides the timestamp.

Here is a subset of these questionable transactions:
shape: (222_716, 3)

hash	value (ETH)	count
0xd5fe940e36f0f1751c942d0a607a5eece6a8c703e4bbdff4643b289aa2130c85	242.5	4
0xb629b6b6c56a04e6b691d7f280e68352651338f2682ab54249edfd820d3efb63	140.83454	10
0xab8686aec7609924a4292bea3a2c77de017fc461c57fb00fc0dd96a86d35481c	100.0019	10
0x47ecdc4d7d21d26a4c689e03b34080f0aa5530e1cc661f25a177c5f8996be8e9	96.0	5
0xed39307ab815f96b2e6bded72276e2e42305df500b595aa0dc23ab83f5a392b0	96.0	5
0xc0d026c482810220444e2231dcdb9be03dd08fe6533f83e9d774c2240b827ba7	84.71442	9
0x229a59832b9dd174d6c6326adb9754aad9e7ec8055e32ef74d643c693589a5ef	45.318917	9
0xbc7b0dca0f8697b67ee33a7dd859fa293061d454449f4baa8cf716a44940e4da	32.0	9
0x9381f46465a175e69da23ecb87821f8574c317eaee64b55efa7e666b8afe1c3f	27.992319	9
0x2437a267466257bc2f15af6e9e5915ae916996d3f0bc9bac90f71e2dad4f697b	26.398904	10

Are you able to verify whether this is accurate transaction data or if there is some sort of replication bug in the ingestion process?


Thanks for sharing your insights and questions!

Some notes:

  • Several sources re-send large numbers of old, pending, and already-included transactions.
  • Mempool Dumpster only started checking inclusion status a few days ago, and now filters out already-included transactions (we'll backfill this and sanitize the old data in the next couple of days).
  • Transactions are not yet validity-checked (they could have an invalid nonce, or the sender could lack the balance to actually transfer the value).
  • This does not look like a bug in the ingestion process; these are transactions that were repeatedly sent by some sources and may simply be no good.
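The backfill described above amounts to dropping rows whose hash is already in an on-chain inclusion set. A minimal sketch (the inclusion set would come from chain data; here it is just a plain Python set):

```python
def drop_included(rows, included_hashes):
    """Filter out transactions that already landed on-chain.

    rows: iterable of dicts with at least a 'hash' key.
    included_hashes: set of hashes known to be included in a block.
    """
    return [row for row in rows if row["hash"] not in included_hashes]

rows = [{"hash": "0xaa"}, {"hash": "0xbb"}, {"hash": "0xcc"}]
included = {"0xbb"}
print(drop_included(rows, included))   # keeps 0xaa and 0xcc
```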

Looking at some of your examples with the highest value: