Node metrics #948

cody-littley · 2024-12-03T17:46:35Z

Why are these changes needed?

Adds metrics to the v2 DA node.

Checks

I've made sure the tests are passing. Note that there might be a few flaky tests; in that case, please comment that they are not relevant.
I've checked the new test coverage and the coverage percentage didn't drop.
Testing Strategy
- Unit tests
- Integration tests
- This PR is not tested :(

Signed-off-by: Cody Littley <[email protected]>

ian-shim · 2024-12-09T19:26:02Z

node/config.go

@@ -207,7 +208,7 @@ func NewConfig(ctx *cli.Context) (*Config, error) {
 		EnableNodeApi:                  ctx.GlobalBool(flags.EnableNodeApiFlag.Name),
 		NodeApiPort:                    ctx.GlobalString(flags.NodeApiPortFlag.Name),
 		EnableMetrics:                  ctx.GlobalBool(flags.EnableMetricsFlag.Name),
-		MetricsPort:                    ctx.GlobalString(flags.MetricsPortFlag.Name),
+		MetricsPort:                    ctx.GlobalInt(flags.MetricsPortFlag.Name),


Is this backward compatible? i.e. if node doesn't update the env var from string to int, does it continue to work?

Bash variables are untyped. From bash's perspective, everything is just a string (regardless of quotation mark usage). Changing the flag from a GlobalString to a GlobalInt just causes Go to parse the value into an int when it reads it.

https://tldp.org/LDP/abs/html/untyped.html
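
To illustrate the point above: the process always receives the environment value as a string, so the practical requirement after this change is that the string parses as an integer. The sketch below uses strconv.Atoi as a stand-in for whatever parsing the CLI library performs internally. parsePort is a hypothetical helper, not part of the node's code.

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePort is a hypothetical helper standing in for the flag library's
// integer parsing. The environment always hands the process a string.
func parsePort(raw string) (int, error) {
	port, err := strconv.Atoi(raw)
	if err != nil {
		return 0, fmt.Errorf("invalid metrics port %q: %w", raw, err)
	}
	return port, nil
}

func main() {
	// A value exported as 9100 or "9100" in bash reaches the process as the
	// same four characters, so an existing numeric setting keeps working.
	for _, raw := range []string{"9100", "not-a-port"} {
		port, err := parsePort(raw)
		fmt.Println(port, err)
	}
}
```

Under this assumption, the only configurations that would stop working are ones where the port value was never a valid integer in the first place.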

ian-shim · 2024-12-09T19:28:39Z

node/grpc/server_v2.go

@@ -101,6 +110,8 @@ func (s *ServerV2) StoreChunks(ctx context.Context, in *pb.StoreChunksRequest) (
 			return
 		}

+		s.metrics.ReportStoreChunksDataSize(size)


what if the store operation gets reverted in L125?

As a general rule of thumb, should we report incremental metrics if the operation as a whole fails? Or should we only report metrics for an operation if it is successful? (in another PR, you suggested that I should report latencies even when there are failures).

I can make this only report if the request ends up being valid, but I want to be consistent with the way we handle scenarios like this.
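
One possible convention for the question above, sketched under assumptions: report latency on every exit path, but report data size only when the operation as a whole succeeds. The recorder type and the handle function are hypothetical stand-ins, not the node's metrics API.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errStoreFailed simulates a store operation that fails or is reverted.
var errStoreFailed = errors.New("store failed")

// recorder is a hypothetical stand-in for the node's metrics object.
type recorder struct {
	latencies []time.Duration
	bytes     uint64
}

// handle reports latency on every exit path, but reports the data size only
// if the whole operation succeeded. The named return value lets the deferred
// function see the final error.
func handle(r *recorder, size uint64, store func() error) (err error) {
	start := time.Now()
	defer func() {
		r.latencies = append(r.latencies, time.Since(start))
		if err == nil {
			r.bytes += size
		}
	}()
	return store()
}

func main() {
	r := &recorder{}
	_ = handle(r, 100, func() error { return nil })
	_ = handle(r, 50, func() error { return errStoreFailed })
	fmt.Println(len(r.latencies), r.bytes) // prints "2 100"
}
```

Whether failed requests should count toward size metrics is a policy choice; the sketch only shows that either policy can be kept in one place with a deferred report.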

ian-shim · 2024-12-09T19:28:58Z

node/grpc/server_v2.go

@@ -124,6 +135,10 @@ func (s *ServerV2) StoreChunks(ctx context.Context, in *pb.StoreChunksRequest) (
 	}

 	sig := s.node.KeyPair.SignMessage(batchHeaderHash).Bytes()
+
+	timeElapsed := time.Since(start)
+	s.metrics.ReportStoreChunksLatency(timeElapsed)


nit: s.metrics.ReportStoreChunksLatency(time.Since(start))?

ian-shim · 2024-12-09T19:31:09Z

node/grpc/v2_metrics.go

+
+		for m.isAlive.Load() {
+			var size int64
+			err := filepath.Walk(m.dbDir, func(_ string, info os.FileInfo, err error) error {


is this thread safe?

It is almost certainly not thread safe (i.e. if LevelDB deletes a file or a directory mid-walk, then filepath.Walk() will return an error). My hope was that if the race condition was sufficiently rare, we could still extract meaningful metrics data.

Currently, this will log an error whenever this method is unable to fetch new data. Will an error log cause problems if it triggers every once in a while? If so, should this be downgraded to a logger.info() call?

Unfortunately, LevelDB doesn't expose an API that tells you the size of the DB (that I know of). My reasoning was that this metric would be sufficiently valuable to justify a hacky collection method.

In theory, we could have the LevelDB wrapper track the quantity of data, at the cost of some extra bookkeeping (every DB modification would need to update a special size key-value pair). This wouldn't tell us the size of the files on disk (which may vary depending on things like compaction and indexes), but would give us a very good idea of the approximate size of the DB. If I implemented such a thing, it would need to be in a standalone PR.

The final option would be to just delete this metric entirely. I'll defer to your judgement on this.
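
As a sketch of how the directory walk could tolerate files disappearing mid-walk, the example below skips "not exist" errors instead of aborting. dirSize is a hypothetical helper and filepath.WalkDir is swapped in for the filepath.Walk call in the diff; the result is still only an approximation, since the walk does not see a consistent snapshot of the directory.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// dirSize sums the sizes of regular files under root. Entries that vanish
// between being listed and being inspected are skipped, not treated as fatal.
func dirSize(root string) (int64, error) {
	var size int64
	err := filepath.WalkDir(root, func(_ string, d fs.DirEntry, err error) error {
		if err != nil {
			if errors.Is(err, fs.ErrNotExist) {
				return nil // removed mid-walk; ignore and continue
			}
			return err
		}
		if d.IsDir() {
			return nil
		}
		info, err := d.Info()
		if err != nil {
			if errors.Is(err, fs.ErrNotExist) {
				return nil // removed after being listed; ignore
			}
			return err
		}
		if info.Mode().IsRegular() {
			size += info.Size()
		}
		return nil
	})
	return size, err
}

func main() {
	dir, err := os.MkdirTemp("", "dbsize")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(dir)

	// Two files of 3 and 5 bytes stand in for database files.
	if err := os.WriteFile(filepath.Join(dir, "a.ldb"), []byte("abc"), 0o600); err != nil {
		panic(err)
	}
	if err := os.WriteFile(filepath.Join(dir, "b.ldb"), []byte("hello"), 0o600); err != nil {
		panic(err)
	}

	size, err := dirSize(dir)
	fmt.Println(size, err) // prints "8 <nil>"
}
```

With this approach, a deleted file no longer produces an error log on every occurrence; other I/O errors are still returned to the caller.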

ian-shim · 2024-12-09T19:34:24Z

node/grpc/v2_metrics.go

+
+// NewV2Metrics creates a new V2Metrics instance. dbSizePollPeriod is the period at which the database size is polled.
+// If set to 0, the database size is not polled.
+func NewV2Metrics(


Is this compatible with v1 metrics?
v1 metrics also registers a mux and initializes its own server.

Very good point. It passed the test, but I'm not sure what the metrics will actually look like (as discussed previously, my plan is to look at resulting metrics once we get a proper end-to-end test set up).

I will extract the metrics server logic into a common context.

Can't wait to delete v1, having to support both at the same time makes for some ugly code. 🤮

Ok, I've now made it so we use the same registry/server as v1. The result is a bit hacky since the v1 metrics code is in another repo.

jianoaix · 2024-12-10T00:57:10Z

node/grpc/server_v2.go

@@ -166,6 +183,15 @@ func (s *ServerV2) GetChunks(ctx context.Context, in *pb.GetChunksRequest) (*pb.
 		return nil, api.NewErrorInternal(fmt.Sprintf("failed to get chunks: %v", err))
 	}

+	var size uint64
+	for _, chunk := range chunks {


len(chunk) * len(chunks)?

done

size := 0
if len(chunks) > 0 {
	size = len(chunks[0]) * len(chunks)
}
s.metrics.ReportGetChunksDataSize(size)
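
The multiplication shortcut is only equivalent to summing every chunk if all chunks have the same length. The sketch below compares the two approaches; the function names are hypothetical, and treating chunks as [][]byte is an assumption about the type in the diff.

```go
package main

import "fmt"

// totalSizeSummed adds up the length of every chunk. Correct for any input.
func totalSizeSummed(chunks [][]byte) int {
	total := 0
	for _, c := range chunks {
		total += len(c)
	}
	return total
}

// totalSizeUniform multiplies the first chunk's length by the chunk count.
// Only correct when every chunk has the same length (an assumption here).
func totalSizeUniform(chunks [][]byte) int {
	if len(chunks) == 0 {
		return 0
	}
	return len(chunks[0]) * len(chunks)
}

func main() {
	uniform := [][]byte{make([]byte, 4), make([]byte, 4), make([]byte, 4)}
	ragged := [][]byte{make([]byte, 4), make([]byte, 1)}

	fmt.Println(totalSizeSummed(uniform), totalSizeUniform(uniform)) // prints "12 12"
	fmt.Println(totalSizeSummed(ragged), totalSizeUniform(ragged))   // prints "5 8"
}
```

If chunk lengths can differ within a response, the summed version is the safer choice for a size metric.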

jianoaix · 2024-12-11T20:59:40Z

node/flags/flags.go

@@ -98,6 +98,13 @@ var (
 		Required: true,
 		EnvVar:   common.PrefixEnvVar(EnvVarPrefix, "DB_PATH"),
 	}
+	DBSizePollPeriodFlag = cli.DurationFlag{


DBSizeMetricPollPeriodFlag? That better indicates it's for metrics.

jianoaix · 2024-12-11T21:00:46Z

node/grpc/server_v2.go

+	s.metrics.ReportGetChunksDataSize(size)
+
+	elapsed := time.Since(start)
+	s.metrics.ReportGetChunksLatency(elapsed)


It seems "elapsed" can be inlined into the Report call.

jianoaix · 2024-12-11T21:02:20Z

node/grpc/v2_metrics.go

+const namespace = "eigenda_node"
+
+// V2Metrics encapsulates metrics for the v2 DA node.
+type V2Metrics struct {


MetricsV2? It's in "XxxV2" format for other components

jianoaix · 2024-12-11T21:17:54Z

node/grpc/v2_metrics.go

+
+func (m *V2Metrics) ReportStoreChunksLatency(latency time.Duration) {
+	m.storeChunksLatency.WithLabelValues().Observe(
+		float64(latency.Nanoseconds()) / float64(time.Millisecond))


float64(latency.Milliseconds()) should work
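
One difference between the two forms is worth noting: Duration.Milliseconds() returns an integer count, so sub-millisecond precision is dropped before the conversion to float64, while the division in the diff keeps the fractional part. The helper names and the sample value below are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// sample is an arbitrary latency of 1.5 milliseconds.
var sample = 1500 * time.Microsecond

// wholeMillis converts via Duration.Milliseconds(), which truncates to an
// integer number of milliseconds.
func wholeMillis(d time.Duration) float64 {
	return float64(d.Milliseconds())
}

// fractionalMillis divides nanoseconds by the number of nanoseconds in a
// millisecond, preserving the fractional part.
func fractionalMillis(d time.Duration) float64 {
	return float64(d.Nanoseconds()) / float64(time.Millisecond)
}

func main() {
	fmt.Println(wholeMillis(sample), fractionalMillis(sample)) // prints "1 1.5"
}
```

Which form is preferable depends on whether the latency buckets need sub-millisecond resolution.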

jianoaix · 2024-12-11T21:18:51Z

node/store_v2.go

@@ -19,7 +19,9 @@ const (
 )

 type StoreV2 interface {
-	StoreBatch(batch *corev2.Batch, rawBundles []*RawBundles) ([]kvstore.Key, error)
+	// StoreBatch stores a batch and its raw bundles in the database. Returns the keys of the stored data


Any atomicity guarantees which should be commented?

Same for "DeleteKeys"

cody-littley added 18 commits November 26, 2024 15:28

Add metrics to relay.

362a390

Signed-off-by: Cody Littley <[email protected]>

Incremental progress.

3b4637e

Signed-off-by: Cody Littley <[email protected]>

Incremental progress.

60f015e

Signed-off-by: Cody Littley <[email protected]>

Incremental progress, need running averages.

2d7e9ef

Signed-off-by: Cody Littley <[email protected]>

Added running average metrics for GetChunks

b8c7d35

Signed-off-by: Cody Littley <[email protected]>

Merge branch 'master' into relay-metrics

a6692c4

Signed-off-by: Cody Littley <[email protected]>

Documentation

b9d71d6

Signed-off-by: Cody Littley <[email protected]>

Add time window to metrics doc

671f0c8

Signed-off-by: Cody Littley <[email protected]>

Added GetBlob metrics.

4adb7ea

Signed-off-by: Cody Littley <[email protected]>

Cleanup.

5579a88

Signed-off-by: Cody Littley <[email protected]>

Cleanup test

2b84f21

Signed-off-by: Cody Littley <[email protected]>

Add locking for running average metric.

fb0cad5

Signed-off-by: Cody Littley <[email protected]>

Merge branch 'master' into relay-metrics

a2c05cb

Signed-off-by: Cody Littley <[email protected]>

Add cache metrics.

dfd2925

Signed-off-by: Cody Littley <[email protected]>

Fix test bug

24f5f5d

Signed-off-by: Cody Littley <[email protected]>

Made suggested change.

c3adb70

Signed-off-by: Cody Littley <[email protected]>

Added metrics for v2 DA node.

5c8c173

Signed-off-by: Cody Littley <[email protected]>

Added metrics documentation.

1795654

Signed-off-by: Cody Littley <[email protected]>

cody-littley requested review from jianoaix and ian-shim December 3, 2024 17:46

cody-littley self-assigned this Dec 3, 2024

Merge branch 'master' into node-metrics

434c6b9

Signed-off-by: Cody Littley <[email protected]>

cody-littley marked this pull request as ready for review December 6, 2024 16:48

Revert deletions.

5c9274c

Signed-off-by: Cody Littley <[email protected]>

cody-littley marked this pull request as draft December 6, 2024 17:50

cody-littley added 4 commits December 6, 2024 11:53

Remove documentation.

4d4bfe9

Signed-off-by: Cody Littley <[email protected]>

Reimplement without metrics framework.

d9d898c

Signed-off-by: Cody Littley <[email protected]>

Cleanup.

8bd8ff1

Signed-off-by: Cody Littley <[email protected]>

Stop background thread when metrics are stopped.

2070eee

Signed-off-by: Cody Littley <[email protected]>

cody-littley marked this pull request as ready for review December 6, 2024 18:24

Revert unintentional change

cffa884

Signed-off-by: Cody Littley <[email protected]>

ian-shim reviewed Dec 9, 2024

jianoaix reviewed Dec 10, 2024

cody-littley added 5 commits December 10, 2024 10:07

Made suggested changes.

5c511c9

Signed-off-by: Cody Littley <[email protected]>

Don't start two metrics servers.

a15117f

Signed-off-by: Cody Littley <[email protected]>

Fix compile issue.

1076a8f

Signed-off-by: Cody Littley <[email protected]>

Merge branch 'master' into node-metrics

bbf9005

Signed-off-by: Cody Littley <[email protected]>

Merge branch 'master' into node-metrics

62ec4f6

Signed-off-by: Cody Littley <[email protected]>

jianoaix reviewed Dec 11, 2024
