Issue Details
- Number
- 29662
- Title
- More control of maintenance processes at startup/restart
- Description
- ### Please describe the feature you'd like to see added.
I'd like some way to control the LevelDB compaction processes a bit more. Specifically, I think it would help if I could control when they are scheduled (hourly in the background, for example) and possibly limit the resources they consume (I/O in particular).
### Is your feature related to a problem, if so please describe it.
The problem I'm having is that when I restart my Bitcoin Core node (which runs with `txindex=1`), the RPC listener comes up pretty promptly, so my healthchecks (currently just TCP) pass. However, the service is not in its usual baseline state: it is doing a lot of read I/O and logging LevelDB compaction activity. This condition lasts for about an hour.
Here is an illustration of the I/O level relative to baseline. The left side is the restart, and the lower right side is after this abates. Dark blue is read:
<img width="634" alt="image" src="https://github.com/bitcoin/bitcoin/assets/35895831/f0902e3d-a41f-4a45-bbd4-b22c18537fa6">
Here's a summary of the logs at this time:
<img width="500" alt="image" src="https://github.com/bitcoin/bitcoin/assets/35895831/4d42e318-5430-45a7-b761-5c16394d76ab">
The "Compacting" log line is extremely elevated during this period (though it does occur at a much lower level after):
<img width="765" alt="image" src="https://github.com/bitcoin/bitcoin/assets/35895831/815b361b-b242-4a89-a32a-a7c5c7e66155">
Here's a graph of RPC trace P50 duration before, during, and after this phase:
<img width="1200" alt="image" src="https://github.com/bitcoin/bitcoin/assets/35895831/f6c6ca27-6db2-4ef2-937f-d390da18e66d">
The real problem is shown in the last graph: the elevated RPC latency. The tail and head latencies are also much worse than normal, so it's not just a tail latency issue I could solve with timeouts / hedging. The server goes from microsecond/millisecond latency to tens of seconds, especially for `sendrawtransaction` (10-30s max), with `listunspent` a distant second (3-4s max).
This latency abates as soon as the high I/O and compaction logging stops. I am therefore making the **intuitive leap** (so experts, please consider this critically and I welcome other explanations) that resource utilization during compaction is causing some RPCs to be very slow. I also considered lock contention (perhaps `cs_main`) but I couldn't see it in the code.
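To make the gap in my current TCP-only healthcheck concrete, here is a minimal sketch of a latency-aware check that issues a real JSON-RPC call and fails when the reply is slow. The function names, the 1-second threshold, and the choice of `getblockchaininfo` as the probe are my own assumptions for illustration. A cheap probe like this may not reflect the latency of `sendrawtransaction` or `listunspent`.

```python
import base64
import json
import time
import urllib.request


def rpc_call(url, user, password, method, params=None, timeout=5.0):
    """POST one JSON-RPC request to bitcoind and return the decoded reply."""
    body = json.dumps(
        {"jsonrpc": "1.0", "id": "healthcheck", "method": method, "params": params or []}
    ).encode()
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json", "Authorization": f"Basic {token}"},
    )
    # urlopen raises on non-2xx responses and on timeouts.
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.loads(resp.read())


def check_health(call, max_latency_s=1.0, clock=time.monotonic):
    """Return (healthy, latency_s).

    `call` is any zero-argument function that performs one RPC and returns
    the decoded reply. The threshold is a hypothetical default, not a
    recommended value.
    """
    start = clock()
    try:
        reply = call()
    except Exception:
        # Connection refused, timeout, or HTTP error: unhealthy.
        return False, clock() - start
    latency = clock() - start
    healthy = reply.get("error") is None and latency <= max_latency_s
    return healthy, latency
```

Usage would look something like `check_health(lambda: rpc_call("http://127.0.0.1:8332/", "user", "password", "getblockchaininfo"))`, with the URL and credentials being placeholders. With a check like this, the node would be reported unhealthy during the compaction window instead of passing as soon as the port opens.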
### Describe the solution you'd like
I'd like the experts to recommend a solution. Intuitively, it seems like LevelDB could amortize this compaction work during normal operation as a background task ([the README seems to imply it already should?](https://github.com/google/leveldb?tab=readme-ov-file#read-performance)). Or maybe there could be some way to limit the resources used for compaction?
### Describe any alternatives you've considered
Currently, I'm looking at alternative ways to make the RPC call I enabled `txindex` for, which is `getrawtransaction` without the blockhash. However, that would only sidestep this issue with compaction and RPC latency.
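For reference, the alternative I have in mind is passing the optional `blockhash` argument to `getrawtransaction`, which lets the node look in that block rather than relying on `txindex`. Below is a rough sketch of building such a request body. The helper name and the `id` value are hypothetical, and it assumes the caller has recorded the hash of the block containing the transaction. The integer `verbosity` value may need adjusting for older versions.

```python
import json


def getrawtransaction_request(txid, blockhash=None, verbosity=1):
    """Build a JSON-RPC body for getrawtransaction (hypothetical helper).

    When `blockhash` is given it is passed as the third positional
    parameter, so the lookup does not depend on txindex.
    """
    params = [txid, verbosity]
    if blockhash is not None:
        params.append(blockhash)
    return json.dumps(
        {
            "jsonrpc": "1.0",
            "id": "tx-lookup",
            "method": "getrawtransaction",
            "params": params,
        }
    )
```

The trade-off is that every caller then has to track the block hash for each transaction it cares about, which is why I see this as a workaround rather than a fix.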
### Please leave any additional context
I am using a slower filesystem than most. It is a regionally-replicated NFS store, which we chose for resiliency reasons. Intuitively, I'd expect this problem to be less severe (or shorter duration) with lower-latency storage, but still present.
Command line args:
```
txindex="1", rpcworkqueue="1024", rpc_*="redacted", debug="coindb", debug="estimatefee", debug="reindex", debug="leveldb", debug="walletdb", debug="lock", debug="rpc", dbcache="5734", datadir="/home/bitcoin/data", chain="main"
```
- URL
-
https://github.com/bitcoin/bitcoin/issues/29662
- Closed by
-