Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encode LogMsg using protobuf #8347

Draft
wants to merge 17 commits into
base: main
Choose a base branch
from
Draft

Encode LogMsg using protobuf #8347

wants to merge 17 commits into from

Conversation

jprochazk
Copy link
Member

@jprochazk jprochazk commented Dec 6, 2024

Related

What

This PR introduces Serializer::Protobuf to re_log_encoding, and inverts the dependency graph of re_protos, which no longer depends on other re_* crates. This meant the conversion impls from protobuf types to rerun types and back had to be moved into their respective crates. For example, From<StoreId> for re_protos::common::v0::StoreIdis now inre_log_types`.

When encoding a file using this serializer, the data is encoded using a combination of:

  • A custom stream-level protocol
  • Protocol buffers
  • Arrow IPC

The stream-level protocol has changed only a bit, because compression is no longer done for all messages in the stream, and only the contents of ArrowMsg are ever compressed at all. This means the uncompressed_len and compressed_len could be unified to just len.

The actual layout of the messages has not changed, LogMsg is preserved and so are its semantics.

The stream of data stored in an example RRD file using this new encoding looks like:

FileHeader { b"RRIO", version, compression, serializer }     ;; 10 bytes
MessageHeader { kind, len }                                  ;; 8 bytes
SetStoreInfo { application_id, store_id, store_source, ... } ;; len bytes
MessageHeader { kind, len }                                  ;; 8 bytes
ArrowMsg { store_id, arrow_msg }                             ;; len bytes
MessageHeader { kind: End, len: 0 }                          ;; 8 bytes

Note that this stream-level protocol is only used for .rrd files. On the wire, we will use gRPC, which has its own protocol.

In the case of ArrowMsg, the schema+chunk is encoded using Arrow IPC into a byte payload, which may additionally be compressed. The compression setting is stored separately for every ArrowMsg, but per-message compression functionality is not yet exposed through re_log_encoding.

@jprochazk jprochazk added include in changelog 🪵 Log & send APIs Affects the user-facing API for all languages dataplatform Rerun Data Platform integration labels Dec 6, 2024
Copy link

github-actions bot commented Dec 6, 2024

Latest documentation preview deployed successfully.

Result Commit Link
0d3ec40 https://landing-hx1w29tq8-rerun.vercel.app/docs

Note: This comment is updated whenever you push a commit.

Copy link

github-actions bot commented Dec 6, 2024

Web viewer built successfully. If applicable, you should also test it:

  • I have tested the web viewer
Result Commit Link
3d405ff https://rerun.io/viewer/pr/8347

Note: This comment is updated whenever you push a commit.

@jprochazk jprochazk marked this pull request as ready for review December 6, 2024 18:40
Copy link
Contributor

@zehiko zehiko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for untangling these dependency issues!

generally looks ok to me, few comments and a question about more clear re-use of common arrow seriailzation logic to make re_log_encoding crate clearer.

crates/store/re_chunk_store/src/lib.rs Show resolved Hide resolved
crates/store/re_log_encoding/src/codec/mod.rs Outdated Show resolved Hide resolved
crates/store/re_log_encoding/src/codec/wire.rs Outdated Show resolved Hide resolved

/// Helper function that deserializes raw bytes into arrow schema and record batch
/// using Arrow IPC format.
fn read_arrow_from_bytes<R: std::io::Read>(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we have this duplication?

could common arrow serialialization logic be moved somewhere and used between "wire" and "file" protocol?

we also now have codec, decode, encoder and then in protobuf we also have encoder.rs and decoder.rs, not sure if this is temporary, but I wonder if it could be more obvious what is common logic and then what is specific to our file stream and our grpc stream.

Copy link
Member Author

@jprochazk jprochazk Dec 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: decoder encoder being in two places, that's because we still have to continue supporting msgpack for the time being. So all the new protobuf encoding stuff is under protobuf. But fundamentally we have 3 layers, our custom stream-level protocol (crate::decoder, crate::encoder), the message encoding (crate::protobuf, with some leftovers in crate::decoder and crate::encoder due to msgpack), and the arrow serialization (should be in crate::arrow).

re: codec, it's there separately because that's the format used by gRPC, with its own MessageHeader. the existing encoder and decoder will eventually only be used for files (either local or served over HTTP), so I thought it'd be better to keep them separate.

The naming could be clearer, I will reshuffle the modules a bit.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, I think just some naming reshuffling might help to make things more obvious. Perhaps even worth splitting grpc codec into encoder and decoder just for consistency...

UnsupportedEncoding,

#[error("Invalid file header")]
InvalidFileHeader,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this unused? ditto for few below

crates/store/re_chunk_store/src/protobuf_conversions.rs Outdated Show resolved Hide resolved
@jprochazk jprochazk marked this pull request as draft December 9, 2024 10:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dataplatform Rerun Data Platform integration include in changelog 🪵 Log & send APIs Affects the user-facing API for all languages
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Replace our serde/MsgPack encoding of enum LogMsg with protobuf
2 participants