Restoring state-sync snapshot takes a long time #698
Comments
Our state-sync scripts run every Sunday, so this issue is rather timely. Nois testnet state-sync took a bit over an hour today. In contrast, Stargaze mainnet state-sync took about 10 minutes. I am not knowledgeable enough to articulate the reasons behind it. As Simon mentioned, Evmos and Sei state-sync is also slow and at times unreliable.
Would it make sense to make an inventory of the number of application.db KV entries for all these networks and some other very busy networks, to see whether the rule holds or not?
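For anyone who wants to build such an inventory, here is a minimal Go sketch that counts entries in a node's application.db, assuming the node uses goleveldb as its backend. The path argument and the `s/k:<store>/` key-prefix grouping are assumptions about the Cosmos SDK multistore layout, not something confirmed in this issue, so treat the per-store breakdown as a heuristic.

```go
// count_kv.go: rough inventory of application.db entries (stop the node first).
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"

	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatal("usage: count_kv <path-to-application.db>")
	}
	db, err := leveldb.OpenFile(os.Args[1], nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	total := 0
	perStore := map[string]int{}
	iter := db.NewIterator(nil, nil)
	for iter.Next() {
		total++
		// Multistore keys typically look like "s/k:<store>/..."; best-effort grouping.
		key := iter.Key()
		if bytes.HasPrefix(key, []byte("s/k:")) {
			if end := bytes.IndexByte(key[4:], '/'); end >= 0 {
				perStore[string(key[4:4+end])]++
			}
		}
	}
	iter.Release()
	if err := iter.Error(); err != nil {
		log.Fatal(err)
	}

	fmt.Println("total entries:", total)
	for store, n := range perStore {
		fmt.Printf("  %s: %d\n", store, n)
	}
}
```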
Adding profiles. The binary can be built using: One thing to mention is that Stargaze still has fast node disabled by default (this will be changed to enabled in our next upgrade). It seems that fast node is contributing to more CPU cycles and garbage collection.
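The exact build command referenced above is not reproduced here. As a generic alternative, below is a minimal sketch of how CPU/heap profiles can be collected from any Go binary by wiring in net/http/pprof; node binaries may already expose an equivalent endpoint via their config, so this is an illustration rather than the build change @jhernandezb used. The port 6060 is an arbitrary choice.

```go
// Minimal pprof wiring sketch: exposes /debug/pprof/* on localhost:6060.
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	go func() {
		// Collect a 30s CPU profile with:
		//   go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
		// or a heap snapshot with:
		//   go tool pprof http://localhost:6060/debug/pprof/heap
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()
	select {} // stand-in for the node's main loop
}
```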
There are a couple of issues in iavl to assist with this. Here is one PR that aims to increase the speed: #664. This issue could also be due to how large your current state is. I'll transfer this issue to iavl so it can be focused on where the actual problem may lie. Lots of the compaction behavior is being cleaned up as we speak.
With iavl v1 this should be significantly faster. Would you want to test with this version?
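One way to test such a build would be to pin the chain binary to iavl v1 with a go.mod replace directive, roughly as sketched below. The version tag shown is an assumption and should be matched to whatever the iavl maintainers recommend for the chain's SDK version.

```
// go.mod of the chain binary (hedged sketch, version tag is an assumption)
replace github.com/cosmos/iavl => github.com/cosmos/iavl v1.0.0
```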
I recently started to play with using state-sync to create a fresh node in the Nois testnet. There we observed that applying the last chunk of the snapshot takes approximately 1 hour and requires between 13 and 14 GB of memory. As many community members consider this unexpectedly long, I thought it was worth debugging what's going on here.
Network
Snapshot
System and utilization
Logs
The following logs show how the snapshot is downloaded easily, the first 17 chunks are applied within 1 minute each, and the last chunk takes 62 minutes to apply.
Data/wasm directory
Size:
Element count:
Wasm files are restored using the extension from wasmd, but those are only 11 small files stored to disk:
Profiling
Some initial profiling data provided by @jhernandezb shows a lot of compacting activity. Maybe he can elaborate more on the profiling results.
Other notes
My gut feeling here is that beyond a certain number of database elements, things become slow. But I am far from a Go profiling or database expert and don't know how to debug this further. For us it is not a big deal right now and we can try to use fewer KV elements in the app. But maybe this is a good chance to prevent other mainnets from running into problems.