-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Node metrics #948
base: master
Are you sure you want to change the base?
Node metrics #948
Conversation
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
@@ -207,7 +208,7 @@ func NewConfig(ctx *cli.Context) (*Config, error) { | |||
EnableNodeApi: ctx.GlobalBool(flags.EnableNodeApiFlag.Name), | |||
NodeApiPort: ctx.GlobalString(flags.NodeApiPortFlag.Name), | |||
EnableMetrics: ctx.GlobalBool(flags.EnableMetricsFlag.Name), | |||
MetricsPort: ctx.GlobalString(flags.MetricsPortFlag.Name), | |||
MetricsPort: ctx.GlobalInt(flags.MetricsPortFlag.Name), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this backward compatible? i.e. if node doesn't update the env var from string to int, does it continue to work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Bash variables are untyped. From bash's perspective, everything is just a string (regardless of quotation mark usage). Changing the flag from a GlobalString
to a GlobalInt
just causes golang to parse the data into an int when it reads it.
@@ -101,6 +110,8 @@ func (s *ServerV2) StoreChunks(ctx context.Context, in *pb.StoreChunksRequest) ( | |||
return | |||
} | |||
|
|||
s.metrics.ReportStoreChunksDataSize(size) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if the store operation gets reverted in L125?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a general rule of thumb, should we report incremental metrics if the operation as a whole fails? Or should we only report metrics for an operation if it is successful? (in another PR, you suggested that I should report latencies even when there are failures).
I can make this only report if the request ends up being valid, but I want to be consistent with the way we handle scenarios like this.
node/grpc/server_v2.go
Outdated
@@ -124,6 +135,10 @@ func (s *ServerV2) StoreChunks(ctx context.Context, in *pb.StoreChunksRequest) ( | |||
} | |||
|
|||
sig := s.node.KeyPair.SignMessage(batchHeaderHash).Bytes() | |||
|
|||
timeElapsed := time.Since(start) | |||
s.metrics.ReportStoreChunksLatency(timeElapsed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: s.metrics.ReportStoreChunksLatency(time.Since(start))
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
|
||
for m.isAlive.Load() { | ||
var size int64 | ||
err := filepath.Walk(m.dbDir, func(_ string, info os.FileInfo, err error) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this thread safe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is almost certainly not thread safe (i.e. if level DB deletes a file or a directory mid-walk, then filepath.Walk()
will return an error). My hope was that if the race condition was sufficiently rare, we could still extract meaningful metrics data.
Currently, this will log an error whenever this method is unable to fetch new data. Will an error log cause problems if it triggers every once in a while? If so, should this be downgraded to a a logger.info()
call?
Unfortunately, levelDB doesn't expose API that tells you the size of the DB (that I know of). My reasoning was that this metric would be sufficiently valuable to justify a hacky collection method.
In theory, we could have the levelDB wrapper track the quantity of data, at the cost of some extra book keeping (every DB modification would need to update a special size
key-value pair). This wouldn't tell us the size of the files on disk (which may vary depending on things like compaction and indexes), but would give us a very good idea of the approximate size if the DB. If I implemented such a thing, it would need to be in a stand alone PR.
The final option would be to just delete this metric entirely. I'll defer to your judgement on this.
|
||
// NewV2Metrics creates a new V2Metrics instance. dbSizePollPeriod is the period at which the database size is polled. | ||
// If set to 0, the database size is not polled. | ||
func NewV2Metrics( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this compatible with v1 metrics?
v1 metrics also registers mux and initializes its own server
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very good point. It passed the test, but I'm not sure what the metrics will actually look like (as discussed previously, my plan is to look at resulting metrics once we get a proper end-to-end test set up).
I will extract the metrics server logic into a common context.
Can't wait to delete v1, having to support both at the same time makes for some ugly code. 🤮
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, I've now made it so we use the same registry/server as v1. The result is a bit hacky since the v1 metrics code is in another repo.
node/grpc/server_v2.go
Outdated
@@ -166,6 +183,15 @@ func (s *ServerV2) GetChunks(ctx context.Context, in *pb.GetChunksRequest) (*pb. | |||
return nil, api.NewErrorInternal(fmt.Sprintf("failed to get chunks: %v", err)) | |||
} | |||
|
|||
var size uint64 | |||
for _, chunk := range chunks { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
len(chunk) * len(chunks)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
size := 0
if len(chunks) > 0 {
size = len(chunks[0]) * len(chunks)
}
s.metrics.ReportGetChunksDataSize(size)
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
@@ -98,6 +98,13 @@ var ( | |||
Required: true, | |||
EnvVar: common.PrefixEnvVar(EnvVarPrefix, "DB_PATH"), | |||
} | |||
DBSizePollPeriodFlag = cli.DurationFlag{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DBSizeMetricPollPeriodFlag? Better indicating it's for metrics
s.metrics.ReportGetChunksDataSize(size) | ||
|
||
elapsed := time.Since(start) | ||
s.metrics.ReportGetChunksLatency(elapsed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems "elapsed" can be inlined to Report call
const namespace = "eigenda_node" | ||
|
||
// V2Metrics encapsulates metrics for the v2 DA node. | ||
type V2Metrics struct { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MetricsV2? It's in "XxxV2" format for other components
|
||
func (m *V2Metrics) ReportStoreChunksLatency(latency time.Duration) { | ||
m.storeChunksLatency.WithLabelValues().Observe( | ||
float64(latency.Nanoseconds()) / float64(time.Millisecond)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
float64(latency.Milliseconds()) should work
@@ -19,7 +19,9 @@ const ( | |||
) | |||
|
|||
type StoreV2 interface { | |||
StoreBatch(batch *corev2.Batch, rawBundles []*RawBundles) ([]kvstore.Key, error) | |||
// StoreBatch stores a batch and its raw bundles in the database. Returns the keys of the stored data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any atomicity guarantees which should be commented?
Same for "DeleteKeys"
Why are these changes needed?
Adds metrics to the v2 DA node.
Checks