Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CHIP Persistent data storage writes are not power-loss safe #977

Open
kepstin opened this issue Nov 27, 2024 · 2 comments
Open

CHIP Persistent data storage writes are not power-loss safe #977

kepstin opened this issue Nov 27, 2024 · 2 comments

Comments

@kepstin
Copy link

kepstin commented Nov 27, 2024

In this discord thread: https://discord.com/channels/330944238910963714/1311124568081174628 a user reported an issue where following a power loss, all of their matter devices disappeared. From the Matter server logs:

2024-11-26 20:12:41.236 (MainThread) ERROR [chip.storage] Expecting ',' delimiter: line 29 column 111 (char 4111)
2024-11-26 20:12:41.236 (MainThread) CRITICAL [chip.storage] Could not load configuration from /data/chip.json - resetting configuration...

It looks like the configuration file was present, but had invalid data in it. The user was able to restore a backup correct the issue, but when I looked at the code to persist the storage: (edit: this code doesn't seem to be responsible for the user's issue, which happened in the Matter SDK code - but this code is incorrect.)

async def async_save(self) -> None:
"""Save persistent data to disk."""
assert self.server.loop is not None
def do_save() -> None:
# make backup before we write a new file
if self.filename.is_file():
self.filename.replace(self.filename_backup)
with open(self.filename, "w", encoding="utf-8") as _file:
_file.write(json_dumps(self._data))
LOGGER.debug("Saved data to persistent storage")
async with self._save_lock:
await self.server.loop.run_in_executor(None, do_save)

I noticed that the logic here is not power loss safe, and might result in a partially written or missing file for unlucky users.

The current logic does:

  • Rename chip.json to chip.json.backup. Following this, if power is lost, there may be no chip.json file present at all.
  • Write a new chip.json file. If the write is interrupted, or power is lost before the system flushes the write cache, a partially written or corrupt configuration file may be present on reboot, resulting in errors like this user saw.

I'd recommend the following logic instead:

  • (optional) Copy chip.json to chip.json.backup, leaving the existing file in place. (Ideally, run fsync on the newly written file to ensure the backup is updated before the original file gets replaced.)
  • Write the new data to a temporary file, like chip.json.new.
  • Run fsync (or fdatasync) on the newly written file to ensure the file contents are persisted on the disk. (see the notes on https://docs.python.org/3/library/os.html#os.fsync regarding flushing the python buffers). This is needed to ensure ordering - making sure that the file contents will be written before the file gets renamed.
  • Rename the temporary file to the final filename, atomically replacing the existing file.
  • (optional) Run fsync on the renamed file to ensure the rename is committed to storage. (This is only needed if data stored elsewhere - e.g. in a sqlite database - might depend on this data having been persisted)
@kepstin kepstin changed the title Persistent data storage writes are not power-loss safe CHIP Persistent data storage writes are not power-loss safe Nov 27, 2024
@kepstin
Copy link
Author

kepstin commented Nov 27, 2024

… Actually I think I'm looking at the wrong code here - the error messages don't match up. Is the code in matter-server/server/storage.py used? is it for something else?

@kepstin
Copy link
Author

kepstin commented Nov 27, 2024

Looking closer, the actual code responsible for this might be in the Matter SDK itself, here:

https://github.com/project-chip/connectedhomeip/blob/e98f2b239aae657a10043bad9aa61f4717d059be/src/controller/python/chip/storage/__init__.py#L164-L183

Unfortunately, this code has the same issue, where a power loss during write (prior to the fsync) can result in an incompletely written file which will be corrupt on restart. It's arguably worse in a different way, since it doesn't remove/truncate the file before doing a write, so on a power loss you can get a mix of the old and new files present.

I've opened project-chip/connectedhomeip#36654 regarding the Matter SDK code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant