Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Node metrics #948

Open
wants to merge 30 commits into
base: master
Choose a base branch
from
Open

Conversation

cody-littley
Copy link
Contributor

@cody-littley cody-littley commented Dec 3, 2024

Why are these changes needed?

Adds metrics to the v2 DA node.

Checks

  • I've made sure the tests are passing. Note that there might be a few flaky tests, in that case, please comment that they are not relevant.
  • I've checked the new test coverage and the coverage percentage didn't drop.
  • Testing Strategy
    • Unit tests
    • Integration tests
    • This PR is not tested :(

Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
Signed-off-by: Cody Littley <[email protected]>
@cody-littley cody-littley self-assigned this Dec 3, 2024
@cody-littley cody-littley marked this pull request as ready for review December 6, 2024 16:48
Signed-off-by: Cody Littley <[email protected]>
@cody-littley cody-littley marked this pull request as draft December 6, 2024 17:50
@cody-littley cody-littley marked this pull request as ready for review December 6, 2024 18:24
Signed-off-by: Cody Littley <[email protected]>
@@ -207,7 +208,7 @@ func NewConfig(ctx *cli.Context) (*Config, error) {
EnableNodeApi: ctx.GlobalBool(flags.EnableNodeApiFlag.Name),
NodeApiPort: ctx.GlobalString(flags.NodeApiPortFlag.Name),
EnableMetrics: ctx.GlobalBool(flags.EnableMetricsFlag.Name),
MetricsPort: ctx.GlobalString(flags.MetricsPortFlag.Name),
MetricsPort: ctx.GlobalInt(flags.MetricsPortFlag.Name),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this backward compatible? i.e. if node doesn't update the env var from string to int, does it continue to work?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bash variables are untyped. From bash's perspective, everything is just a string (regardless of quotation mark usage). Changing the flag from a GlobalString to a GlobalInt just causes golang to parse the data into an int when it reads it.

https://tldp.org/LDP/abs/html/untyped.html

@@ -101,6 +110,8 @@ func (s *ServerV2) StoreChunks(ctx context.Context, in *pb.StoreChunksRequest) (
return
}

s.metrics.ReportStoreChunksDataSize(size)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if the store operation gets reverted in L125?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general rule of thumb, should we report incremental metrics if the operation as a whole fails? Or should we only report metrics for an operation if it is successful? (in another PR, you suggested that I should report latencies even when there are failures).

I can make this only report if the request ends up being valid, but I want to be consistent with the way we handle scenarios like this.

@@ -124,6 +135,10 @@ func (s *ServerV2) StoreChunks(ctx context.Context, in *pb.StoreChunksRequest) (
}

sig := s.node.KeyPair.SignMessage(batchHeaderHash).Bytes()

timeElapsed := time.Since(start)
s.metrics.ReportStoreChunksLatency(timeElapsed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: s.metrics.ReportStoreChunksLatency(time.Since(start))?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


for m.isAlive.Load() {
var size int64
err := filepath.Walk(m.dbDir, func(_ string, info os.FileInfo, err error) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this thread safe?

Copy link
Contributor Author

@cody-littley cody-littley Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is almost certainly not thread safe (i.e. if level DB deletes a file or a directory mid-walk, then filepath.Walk() will return an error). My hope was that if the race condition was sufficiently rare, we could still extract meaningful metrics data.

Currently, this will log an error whenever this method is unable to fetch new data. Will an error log cause problems if it triggers every once in a while? If so, should this be downgraded to a a logger.info() call?

Unfortunately, levelDB doesn't expose API that tells you the size of the DB (that I know of). My reasoning was that this metric would be sufficiently valuable to justify a hacky collection method.

In theory, we could have the levelDB wrapper track the quantity of data, at the cost of some extra book keeping (every DB modification would need to update a special size key-value pair). This wouldn't tell us the size of the files on disk (which may vary depending on things like compaction and indexes), but would give us a very good idea of the approximate size if the DB. If I implemented such a thing, it would need to be in a stand alone PR.

The final option would be to just delete this metric entirely. I'll defer to your judgement on this.


// NewV2Metrics creates a new V2Metrics instance. dbSizePollPeriod is the period at which the database size is polled.
// If set to 0, the database size is not polled.
func NewV2Metrics(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this compatible with v1 metrics?
v1 metrics also registers mux and initializes its own server

Copy link
Contributor Author

@cody-littley cody-littley Dec 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very good point. It passed the test, but I'm not sure what the metrics will actually look like (as discussed previously, my plan is to look at resulting metrics once we get a proper end-to-end test set up).

I will extract the metrics server logic into a common context.

Can't wait to delete v1, having to support both at the same time makes for some ugly code. 🤮

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I've now made it so we use the same registry/server as v1. The result is a bit hacky since the v1 metrics code is in another repo.

@@ -166,6 +183,15 @@ func (s *ServerV2) GetChunks(ctx context.Context, in *pb.GetChunksRequest) (*pb.
return nil, api.NewErrorInternal(fmt.Sprintf("failed to get chunks: %v", err))
}

var size uint64
for _, chunk := range chunks {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

len(chunk) * len(chunks)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

	size := 0
	if len(chunks) > 0 {
		size = len(chunks[0]) * len(chunks)
	}
	s.metrics.ReportGetChunksDataSize(size)

@@ -98,6 +98,13 @@ var (
Required: true,
EnvVar: common.PrefixEnvVar(EnvVarPrefix, "DB_PATH"),
}
DBSizePollPeriodFlag = cli.DurationFlag{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

DBSizeMetricPollPeriodFlag? Better indicating it's for metrics

s.metrics.ReportGetChunksDataSize(size)

elapsed := time.Since(start)
s.metrics.ReportGetChunksLatency(elapsed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems "elapsed" can be inlined to Report call

const namespace = "eigenda_node"

// V2Metrics encapsulates metrics for the v2 DA node.
type V2Metrics struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MetricsV2? It's in "XxxV2" format for other components


func (m *V2Metrics) ReportStoreChunksLatency(latency time.Duration) {
m.storeChunksLatency.WithLabelValues().Observe(
float64(latency.Nanoseconds()) / float64(time.Millisecond))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

float64(latency.Milliseconds()) should work

@@ -19,7 +19,9 @@ const (
)

type StoreV2 interface {
StoreBatch(batch *corev2.Batch, rawBundles []*RawBundles) ([]kvstore.Key, error)
// StoreBatch stores a batch and its raw bundles in the database. Returns the keys of the stored data
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any atomicity guarantees which should be commented?

Same for "DeleteKeys"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants