
[Bug]: fail to search on QueryNode 216: Timestamp lag too large #36960

Open
zhoujiaqi1998 opened this issue Oct 17, 2024 · 4 comments

@zhoujiaqi1998

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: 2.4.1
- Deployment mode (standalone or cluster): cluster
- MQ type (rocksmq, pulsar or kafka): pulsar
- SDK version (e.g. pymilvus v2.0.0rc2):
- OS (Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:

Current Behavior

[2024/10/17 07:33:52.777 +00:00] [WARN] [proxy/task_scheduler.go:469] ["Failed to execute task: "] [traceID=aafa07524fa2fb9567204a960e5b68f7] [error="failed to search: failed to search/query delegator 216 for channel milvus-cluster-rootcoord-dml_2_453104445904537346v0: fail to search on QueryNode 216: Timestamp lag too large"] [errorVerbose="failed to search: failed to search/query delegator 216 for channel milvus-cluster-rootcoord-dml_2_453104445904537346v0: fail to search on QueryNode 216: Timestamp lag too large\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/proxy.(*searchTask).Execute\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:512\n | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:466\n | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).queryLoop.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:545\n | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n | \t/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67\nWraps: (2) failed to search\nWraps: (3) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:188\n | [...repeated from below...]\nWraps: (4) failed to search/query delegator 216 for channel milvus-cluster-rootcoord-dml_2_453104445904537346v0\nWraps: (5) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:696\n | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:180\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:154\n | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:218\n | golang.org/x/sync/errgroup.(*Group).Go.func1\n | \t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (6) fail to search on QueryNode 216\nWraps: (7) Timestamp lag too large\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) merr.milvusError"]
[2024/10/17 07:33:52.777 +00:00] [WARN] [proxy/impl.go:2919] ["Search failed to WaitToFinish"] [traceID=aafa07524fa2fb9567204a960e5b68f7] [role=proxy] [db=default] [collection=benchmark_faq_zlkt_] [partitions="[]"] [dsl=] [len(PlaceholderGroup)=4108] [OutputFields="[id]"] [search_params="[{"key":"topk","value":"100"},{"key":"anns_field","value":"dense_vector"},{"key":"metric_type","value":"IP"},{"key":"params","value":"{\"nprobe\":1}"}]"] [guarantee_timestamp=0] [nq=1] [error="failed to search: failed to search/query delegator 216 for channel milvus-cluster-rootcoord-dml_2_453104445904537346v0: fail to search on QueryNode 216: Timestamp lag too large"] [errorVerbose="failed to search: failed to search/query delegator 216 for channel milvus-cluster-rootcoord-dml_2_453104445904537346v0: fail to search on QueryNode 216: Timestamp lag too large\n(1) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/proxy.(*searchTask).Execute\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:512\n | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).processTask\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:466\n | github.com/milvus-io/milvus/internal/proxy.(*taskScheduler).queryLoop.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_scheduler.go:545\n | github.com/milvus-io/milvus/pkg/util/conc.(*Pool[...]).Submit.func1\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/conc/pool.go:81\n | github.com/panjf2000/ants/v2.(*goWorker).run.func1\n | \t/go/pkg/mod/github.com/panjf2000/ants/[email protected]/worker.go:67\nWraps: (2) failed to search\nWraps: (3) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:188\n | [...repeated from below...]\nWraps: (4) failed to search/query delegator 216 for channel milvus-cluster-rootcoord-dml_2_453104445904537346v0\nWraps: (5) attached stack trace\n -- stack trace:\n | github.com/milvus-io/milvus/internal/proxy.(*searchTask).searchShard\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/task_search.go:696\n | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry.func1\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:180\n | github.com/milvus-io/milvus/pkg/util/retry.Do\n | \t/go/src/github.com/milvus-io/milvus/pkg/util/retry/retry.go:44\n | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).ExecuteWithRetry\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:154\n | github.com/milvus-io/milvus/internal/proxy.(*LBPolicyImpl).Execute.func2\n | \t/go/src/github.com/milvus-io/milvus/internal/proxy/lb_policy.go:218\n | golang.org/x/sync/errgroup.(*Group).Go.func1\n | \t/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75\n | runtime.goexit\n | \t/usr/local/go/src/runtime/asm_amd64.s:1598\nWraps: (6) fail to search on QueryNode 216\nWraps: (7) Timestamp lag too large\nError types: (1) *withstack.withStack (2) *errutil.withPrefix (3) *withstack.withStack (4) *errutil.withPrefix (5) *withstack.withStack (6) *errutil.withPrefix (7) merr.milvusError"]
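For context on what the error means: the delegator on QueryNode 216 rejects the search because the newest timestamp it has consumed from its DML channel (its tSafe) has fallen too far behind the timestamp the request needs, which is what happens when the channel consumer stops making progress. Below is a minimal, purely illustrative sketch of that kind of lag check, not the actual Milvus source; the queryNode.maxTimestampLag config name mentioned in the comments is an assumption.

```go
// Illustrative sketch only (not the actual Milvus source) of the check behind
// "Timestamp lag too large": the query node tracks the newest timestamp it has
// consumed from the DML channel (its tSafe); when that value stops advancing
// (e.g. the Pulsar consumer is stuck) and lags the timestamp a search needs by
// more than a configured maximum, the request is rejected instead of blocking.
package main

import (
	"fmt"
	"time"
)

// physical extracts the physical part of a Milvus hybrid timestamp
// (milliseconds in the upper bits, 18 bits reserved for the logical counter).
func physical(ts uint64) time.Time {
	return time.UnixMilli(int64(ts >> 18))
}

// checkTSLag rejects a request whose required timestamp is ahead of the
// consumed tSafe by more than maxLag. The knob that would control maxLag
// (queryNode.maxTimestampLag in milvus.yaml) is an assumption here.
func checkTSLag(requiredTS, tSafe uint64, maxLag time.Duration) error {
	if requiredTS <= tSafe {
		return nil // channel consumption has caught up; safe to serve
	}
	if lag := physical(requiredTS).Sub(physical(tSafe)); lag > maxLag {
		return fmt.Errorf("Timestamp lag too large (lag=%s, max=%s)", lag, maxLag)
	}
	return nil
}

func main() {
	toTS := func(t time.Time) uint64 { return uint64(t.UnixMilli()) << 18 }
	now := time.Now()
	// tSafe stuck two hours in the past, e.g. because the consumer stalled.
	fmt.Println(checkTSLag(toTS(now), toTS(now.Add(-2*time.Hour)), time.Hour))
}
```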

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

milvus-log.zip

Anything else?

No response

@zhoujiaqi1998 zhoujiaqi1998 added kind/bug Issues or changes related a bug needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 17, 2024
@zhoujiaqi1998
Author

milvus-log.zip

@yanliang567
Contributor

/assign @aoiasd
/unassign

@sre-ci-robot sre-ci-robot assigned aoiasd and unassigned yanliang567 Oct 17, 2024
@yanliang567 yanliang567 added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Oct 17, 2024
@yanliang567 yanliang567 added this to the 2.4.14 milestone Oct 17, 2024
@xiaofan-luan
Contributor

@zhoujiaqi1998
It seems that, for some reason, the QueryNode could not consume from Pulsar for a long time.
Try restarting the QueryNode to see if that helps.
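
For a Kubernetes deployment, restarting the QueryNode typically means deleting its pods and letting the controller recreate them (kubectl delete pod works just as well). A minimal client-go sketch, assuming a milvus namespace and a component=querynode pod label, both of which depend on how the cluster was installed (Helm chart vs. milvus-operator):

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (~/.kube/config).
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs, err := kubernetes.NewForConfig(cfg)
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	namespace := "milvus"             // assumed namespace
	selector := "component=querynode" // assumed label set by the chart/operator

	pods, err := cs.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		panic(err)
	}
	for _, p := range pods.Items {
		// Deleting the pod makes the Deployment/StatefulSet spin up a fresh
		// replica, which re-creates the Pulsar consumers from scratch.
		fmt.Println("restarting", p.Name)
		if err := cs.CoreV1().Pods(namespace).Delete(ctx, p.Name, metav1.DeleteOptions{}); err != nil {
			panic(err)
		}
	}
}
```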

@xiaofan-luan
Contributor

It seems that once Pulsar falls into this state, the consumer cannot recover. Milvus 2.5 upgrades the Pulsar client to 0.12, and hopefully that will help.
