[Bug]: [benchmark][cluster] Inserting into a partition raises error "partition not found" in concurrent DQL & DML scene #36989

Open
1 task done
wangting0128 opened this issue Oct 18, 2024 · 0 comments
Labels: kind/bug, needs-triage, test/benchmark

wangting0128 (Contributor) commented Oct 18, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version:2.4-20241017-2bfd22f2-amd64
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):pulsar    
- SDK version(e.g. pymilvus v2.0.0rc2):2.4.5rc7
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

argo task: multi-vector-corn-1-1729173600
test case name: test_hybrid_search_locust_dql_dml_partition_hybrid_search_cluster

server:

NAME                                                              READY   STATUS      RESTARTS         AGE     IP              NODE         NOMINATED NODE   READINESS GATES
multi-vector-corn-1-1729173600-1-etcd-0                           1/1     Running     0                3h9m    10.104.26.98    4am-node32   <none>           <none>
multi-vector-corn-1-1729173600-1-etcd-1                           1/1     Running     0                3h9m    10.104.32.180   4am-node39   <none>           <none>
multi-vector-corn-1-1729173600-1-etcd-2                           1/1     Running     0                3h9m    10.104.19.137   4am-node28   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-datanode-5888f5b48568st   1/1     Running     3 (3h8m ago)     3h9m    10.104.1.225    4am-node10   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-indexnode-c5c8f5f6grfpg   1/1     Running     3 (3h8m ago)     3h9m    10.104.5.53     4am-node12   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-indexnode-c5c8f5f6lfvqg   1/1     Running     3 (3h9m ago)     3h9m    10.104.4.2      4am-node11   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-indexnode-c5c8f5f6ndtjp   1/1     Running     3 (3h8m ago)     3h9m    10.104.1.222    4am-node10   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-indexnode-c5c8f5f6v7psv   1/1     Running     3 (3h9m ago)     3h9m    10.104.9.174    4am-node14   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-mixcoord-554d75997bxspn   1/1     Running     3 (3h8m ago)     3h9m    10.104.1.220    4am-node10   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-proxy-57b7d4949f-7gv48    1/1     Running     3 (3h9m ago)     3h9m    10.104.1.223    4am-node10   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-querynode-5c77b5567frhd   1/1     Running     3 (3h9m ago)     3h9m    10.104.6.182    4am-node13   <none>           <none>
multi-vector-corn-1-1729173600-1-milvus-querynode-5c77b556xp86q   1/1     Running     3 (3h9m ago)     3h9m    10.104.23.42    4am-node27   <none>           <none>
multi-vector-corn-1-1729173600-1-minio-0                          1/1     Running     0                3h9m    10.104.19.133   4am-node28   <none>           <none>
multi-vector-corn-1-1729173600-1-minio-1                          1/1     Running     0                3h9m    10.104.32.176   4am-node39   <none>           <none>
multi-vector-corn-1-1729173600-1-minio-2                          1/1     Running     0                3h9m    10.104.26.99    4am-node32   <none>           <none>
multi-vector-corn-1-1729173600-1-minio-3                          1/1     Running     0                3h9m    10.104.18.192   4am-node25   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-bookie-0                  1/1     Running     0                3h9m    10.104.32.178   4am-node39   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-bookie-1                  1/1     Running     0                3h9m    10.104.19.134   4am-node28   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-bookie-2                  1/1     Running     0                3h9m    10.104.18.191   4am-node25   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-bookie-init-cdqts         0/1     Completed   0                3h9m    10.104.1.224    4am-node10   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-broker-0                  1/1     Running     0                3h9m    10.104.9.173    4am-node14   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-proxy-0                   1/1     Running     0                3h9m    10.104.9.175    4am-node14   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-pulsar-init-zxbrj         0/1     Completed   0                3h9m    10.104.1.219    4am-node10   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-recovery-0                1/1     Running     0                3h9m    10.104.9.172    4am-node14   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-zookeeper-0               1/1     Running     0                3h9m    10.104.32.177   4am-node39   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-zookeeper-1               1/1     Running     0                3h9m    10.104.25.183   4am-node30   <none>           <none>
multi-vector-corn-1-1729173600-1-pulsar-zookeeper-2               1/1     Running     0                3h8m    10.104.19.140   4am-node28   <none>           <none> 

Server log query (Loki):

{pod=~"multi-vector-corn-1-1729173600-1-milvus-.*"} |~ "partition not found|c2537709715697ef8aa3770fc1962c3a|scene_test_partition_hybrid_search_DX3SsnBo|453295868220979021"

Attachment: partition_not_found.log
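To grep the attached log locally with the same line filter, a minimal Python sketch is below. It assumes the log is plain text with one entry per line. LogQL's `|~` keeps a line if the regex matches anywhere in it, and for this pattern (four literals joined by `|`) Python's `re.search` behaves the same as Loki's RE2. The `{pod=~...}` stream selector is not modeled; `matching_lines` is an illustrative helper name.

```python
import re

# Same alternation as the Loki `|~` line filter above. The pattern is only
# plain literals joined by `|`, so Python `re` and RE2 agree on it.
PATTERN = re.compile(
    "partition not found"
    "|c2537709715697ef8aa3770fc1962c3a"
    "|scene_test_partition_hybrid_search_DX3SsnBo"
    "|453295868220979021"
)


def matching_lines(lines):
    """Yield each line that contains a match, like LogQL's unanchored `|~`."""
    for line in lines:
        if PATTERN.search(line):
            yield line
```

Usage: open partition_not_found.log with `open(path, encoding="utf-8")` and pass the file object to `matching_lines`, which iterates it line by line without loading the whole file.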

client log:

[2024-10-17 20:05:14,654 - ERROR - fouram]: RPC error: [batch_insert], <MilvusException: (code=200, message=partition not found[partition=scene_test_partition_hybrid_search_DX3SsnBo])>, <Time:{'RPC start': '2024-10-17 20:05:14.339202', 'RPC error': '2024-10-17 20:05:14.654836'}> (decorators.py:146)
[2024-10-17 20:05:14,655 - ERROR - fouram]: (api_response) : [Collection.insert] <MilvusException: (code=200, message=partition not found[partition=scene_test_partition_hybrid_search_DX3SsnBo])>, [requestId: 1f4c5a1c-8cc3-11ef-b43c-7e05d3331439] (api_request.py:57)
[2024-10-17 20:05:14,655 - ERROR - fouram]: [CheckFunc] insert request check failed, response:<MilvusException: (code=200, message=partition not found[partition=scene_test_partition_hybrid_search_DX3SsnBo])> (func_check.py:106)
[2024-10-17 20:05:14,656 - ERROR - fouram]: [func_time_catch] :  (api_request.py:127)
[2024-10-17 20:05:23,020 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-10-17 20:05:23,020 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-17 20:05:23,021 -  INFO - fouram]: grpc     hybrid_search                                                                   2035     0(0.00%) |   4470      23   35405   2600 |    0.30        0.00 (stats.py:789)
[2024-10-17 20:05:23,021 -  INFO - fouram]: grpc     query                                                                            255     0(0.00%) |   5339      97   83184    740 |    0.00        0.00 (stats.py:789)
[2024-10-17 20:05:23,021 -  INFO - fouram]: grpc     scene_test_partition_hybrid_search                                               237     1(0.42%) | 427450    1847  827518 408000 |    0.40        0.10 (stats.py:789)
[2024-10-17 20:05:23,021 -  INFO - fouram]: grpc     search                                                                          2045     0(0.00%) |  27488    3236   64599  27000 |    0.70        0.00 (stats.py:789)
[2024-10-17 20:05:23,021 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-17 20:05:23,021 -  INFO - fouram]:          Aggregated 

Expected Behavior

No response

Steps To Reproduce

Concurrent test that measures RT and QPS

        :purpose:  `DQL & DML(partition)`
            verify concurrent DQL & DML(partition) scenario,
            which has 4 vector fields (IVF_FLAT, HNSW, DISKANN, IVF_SQ8) and scalar fields: `int64_1`, `varchar_1`

        :test steps:
            1. create collection with fields:
                'float_vector': 128dim,
                'float_vector_1': 128dim,
                'float_vector_2': 128dim,
                'float_vector_3': 128dim,
                scalar field: int64_1, varchar_1
            2. build indexes:
                IVF_FLAT: 'float_vector'
                HNSW: 'float_vector_1',
                DISKANN: 'float_vector_2'
                IVF_SQ8: 'float_vector_3'
                INVERTED: 'int64_1', 'varchar_1'
                default scalar index: 'id'
            3. insert 1 million entities into 10 partitions
            4. flush collection
            5. build indexes again using the same params
            6. load collection
                replica: 1
            7. concurrent request:
                - scene_test_partition_hybrid_search
                    (partition: create->insert->flush->index again->load->hybrid_search->release->hybrid_search failed->drop)  <- insert raises error
                - search
                - hybrid_search
                - query
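Expected Behavior is left empty above, but the scene sequence implies it: the only step that should fail is the hybrid_search after release, while in this run the insert right after create failed. The sketch below is a hypothetical in-memory model of one scene run that encodes that order and the outcome each step should have. It does not call Milvus or fouram; `ModelCollection`, `run_scene`, and the exception classes are illustrative names, and flush and the second index build are omitted because they carry no state in the model.

```python
class PartitionNotFound(Exception):
    """Message format copied from the client log above."""


class PartitionReleased(Exception):
    """Raised by the model when searching a partition that is not loaded."""


class ModelCollection:
    """Hypothetical stand-in for a collection; tracks per-partition state only."""

    def __init__(self):
        self.partitions = {}  # partition name -> {"rows": int, "loaded": bool}

    def _get(self, name):
        if name not in self.partitions:
            raise PartitionNotFound(f"partition not found[partition={name}]")
        return self.partitions[name]

    def create_partition(self, name):
        self.partitions[name] = {"rows": 0, "loaded": False}

    def insert(self, name, n):
        self._get(name)["rows"] += n

    def load(self, name):
        self._get(name)["loaded"] = True

    def release(self, name):
        self._get(name)["loaded"] = False

    def hybrid_search(self, name):
        part = self._get(name)
        if not part["loaded"]:
            raise PartitionReleased(name)
        return part["rows"]

    def drop_partition(self, name):
        self._get(name)
        del self.partitions[name]


def run_scene(coll, name, ni=3000):
    """Run the scene steps in order and return (step, outcome) pairs."""
    log = []
    coll.create_partition(name)
    log.append(("create", "ok"))
    coll.insert(name, ni)  # the step that raised "partition not found" in the benchmark
    log.append(("insert", "ok"))
    coll.load(name)
    log.append(("load", "ok"))
    log.append(("hybrid_search", "ok" if coll.hybrid_search(name) == ni else "wrong count"))
    coll.release(name)
    log.append(("release", "ok"))
    try:
        coll.hybrid_search(name)
        log.append(("hybrid_search after release", "unexpected success"))
    except PartitionReleased:
        log.append(("hybrid_search after release", "failed as expected"))
    coll.drop_partition(name)
    log.append(("drop", "ok"))
    return log
```

`ni=3000` matches the `ni` parameter of the scene task in the report data. In the model, an insert right after create can only hit PartitionNotFound if the partition is missing, which is what the benchmark reported.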

Milvus Log

No response

Anything else?

test result:

[2024-10-17 20:42:59,475 -  INFO - fouram]: Print locust final stats. (locust_runner.py:56)
[2024-10-17 20:42:59,475 -  INFO - fouram]: Type     Name                                                                          # reqs      # fails |    Avg     Min     Max    Med |   req/s  failures/s (stats.py:789)
[2024-10-17 20:42:59,475 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-17 20:42:59,475 -  INFO - fouram]: grpc     hybrid_search                                                                   2569     0(0.00%) |   4356      23   35405   2600 |    0.24        0.00 (stats.py:789)
[2024-10-17 20:42:59,475 -  INFO - fouram]: grpc     query                                                                            325     0(0.00%) |   4863      90   83184    640 |    0.03        0.00 (stats.py:789)
[2024-10-17 20:42:59,475 -  INFO - fouram]: grpc     scene_test_partition_hybrid_search                                               303     1(0.33%) | 425817    1847  827518 411000 |    0.03        0.00 (stats.py:789)
[2024-10-17 20:42:59,476 -  INFO - fouram]: grpc     search                                                                          2605     0(0.00%) |  27140    3236   64599  27000 |    0.24        0.00 (stats.py:789)
[2024-10-17 20:42:59,476 -  INFO - fouram]: --------|----------------------------------------------------------------------------|-------|-------------|-------|-------|-------|-------|--------|----------- (stats.py:789)
[2024-10-17 20:42:59,476 -  INFO - fouram]:          Aggregated                                                                      5802     1(0.02%) |  36624      23  827518  16000 |    0.54        0.00 (stats.py:789)
[2024-10-17 20:42:59,476 -  INFO - fouram]:  (stats.py:790)
[2024-10-17 20:42:59,479 -  INFO - fouram]: [PerfTemplate] Report data: 
{'server': {'deploy_tool': 'helm',
            'deploy_mode': 'cluster',
            'config_name': 'cluster_2c8m',
            'config': {'queryNode': {'resources': {'limits': {'cpu': '32.0', 'memory': '32Gi'}, 'requests': {'cpu': '17.0', 'memory': '17Gi'}}, 'replicas': 2},
                       'indexNode': {'resources': {'limits': {'cpu': '8.0', 'memory': '8Gi'}, 'requests': {'cpu': '5.0', 'memory': '5Gi'}}, 'replicas': 4},
                       'dataNode': {'resources': {'limits': {'cpu': '2.0', 'memory': '8Gi'}, 'requests': {'cpu': '2.0', 'memory': '5Gi'}}},
                       'cluster': {'enabled': True},
                       'pulsar': {},
                       'kafka': {},
                       'minio': {'metrics': {'podMonitor': {'enabled': True}}},
                       'etcd': {'metrics': {'enabled': True, 'podMonitor': {'enabled': True}}},
                       'metrics': {'serviceMonitor': {'enabled': True}},
                       'log': {'level': 'debug'},
                       'image': {'all': {'repository': 'harbor.milvus.io/milvus/milvus', 'tag': '2.4-20241017-2bfd22f2-amd64'}}},
            'host': 'multi-vector-corn-1-1729173600-1-milvus.qa-milvus.svc.cluster.local',
            'port': '19530',
            'uri': ''},
 'client': {'test_case_type': 'ConcurrentClientBase',
            'test_case_name': 'test_hybrid_search_locust_dql_dml_partition_hybrid_search_cluster',
            'test_case_params': {'dataset_params': {'metric_type': 'L2',
                                                    'dim': 128,
                                                    'scalars_index': {'id': {}, 'int64_1': {'index_type': 'INVERTED'}, 'varchar_1': {'index_type': 'INVERTED'}},
                                                    'vectors_index': {'float_vector_1': {'index_type': 'HNSW',
                                                                                         'index_param': {'M': 8, 'efConstruction': 200},
                                                                                         'metric_type': 'L2'},
                                                                      'float_vector_2': {'index_type': 'DISKANN', 'index_param': {}, 'metric_type': 'IP'},
                                                                      'float_vector_3': {'index_type': 'IVF_SQ8',
                                                                                         'index_param': {'nlist': 2048},
                                                                                         'metric_type': 'L2'}},
                                                    'scalars_params': {'float_vector_1': {'params': {'dim': 128}, 'other_params': {'dataset': 'sift'}},
                                                                       'float_vector_2': {'params': {'dim': 128}, 'other_params': {'dataset': 'sift'}},
                                                                       'float_vector_3': {'params': {'dim': 128}, 'other_params': {'dataset': 'sift'}}},
                                                    'extra_partitions': {'partitions': ['_default', 'partition_1', 'partition_2', 'partition_3', 'partition_4',
                                                                                        'partition_5', 'partition_6', 'partition_7', 'partition_8',
                                                                                        'partition_9'],
                                                                         'data_repeated': False},
                                                    'dataset_name': 'sift',
                                                    'dataset_size': 1000000,
                                                    'ni_per': 10000},
                                 'collection_params': {'other_fields': ['float_vector_1', 'float_vector_2', 'float_vector_3', 'int64_1', 'varchar_1'],
                                                       'shards_num': 2},
                                 'resource_groups_params': {'reset': False},
                                 'database_user_params': {'reset_rbac': False, 'reset_db': False},
                                 'index_params': {'index_type': 'IVF_FLAT', 'index_param': {'nlist': 1024}},
                                 'concurrent_params': {'concurrent_number': 20, 'during_time': '3h', 'interval': 20, 'spawn_rate': None},
                                 'concurrent_tasks': [{'type': 'scene_test_partition_hybrid_search',
                                                       'weight': 1,
                                                       'params': {'nq': 1,
                                                                  'top_k': 1,
                                                                  'reqs': [{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
                                                                           {'search_param': {'ef': 64}, 'anns_field': 'float_vector_1', 'top_k': 10},
                                                                           {'search_param': {'search_list': 32}, 'anns_field': 'float_vector_2', 'top_k': 30},
                                                                           {'search_param': {'nprobe': 16}, 'anns_field': 'float_vector_3', 'top_k': 400}],
                                                                  'rerank': {'RRFRanker': []},
                                                                  'output_fields': ['*'],
                                                                  'ignore_growing': False,
                                                                  'guarantee_timestamp': None,
                                                                  'timeout': 600,
                                                                  'random_data': True,
                                                                  'data_size': 3000,
                                                                  'ni': 3000}},
                                                      {'type': 'search',
                                                       'weight': 8,
                                                       'params': {'nq': 1000,
                                                                  'top_k': 1,
                                                                  'search_param': {'nprobe': 1000},
                                                                  'expr': 'int64_1 >= 0',
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': ['_default', 'partition_1', 'partition_2', 'partition_3', 'partition_4',
                                                                                      'partition_5', 'partition_6', 'partition_7', 'partition_8',
                                                                                      'partition_9'],
                                                                  'output_fields': None,
                                                                  'ignore_growing': False,
                                                                  'group_by_field': None,
                                                                  'timeout': 600,
                                                                  'random_data': True,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'hybrid_search',
                                                       'weight': 8,
                                                       'params': {'nq': 1,
                                                                  'top_k': 100,
                                                                  'reqs': [{'search_param': {'nprobe': 128}, 'anns_field': 'float_vector', 'top_k': 100},
                                                                           {'search_param': {'ef': 64}, 'anns_field': 'float_vector_1', 'top_k': 10},
                                                                           {'search_param': {'search_list': 32}, 'anns_field': 'float_vector_2', 'top_k': 30},
                                                                           {'search_param': {'nprobe': 16}, 'anns_field': 'float_vector_3', 'top_k': 400}],
                                                                  'rerank': {'WeightedRanker': [0.85, 0.95, 0.51, 0.32]},
                                                                  'output_fields': ['*'],
                                                                  'ignore_growing': False,
                                                                  'guarantee_timestamp': None,
                                                                  'partition_names': ['_default', 'partition_1', 'partition_2', 'partition_3', 'partition_4',
                                                                                      'partition_5', 'partition_6', 'partition_7', 'partition_8',
                                                                                      'partition_9'],
                                                                  'timeout': 600,
                                                                  'random_data': True,
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}},
                                                      {'type': 'query',
                                                       'weight': 1,
                                                       'params': {'ids': None,
                                                                  'expr': 'int64_1 > -1 && ',
                                                                  'output_fields': ['*'],
                                                                  'offset': None,
                                                                  'limit': None,
                                                                  'ignore_growing': False,
                                                                  'partition_names': ['_default', 'partition_1', 'partition_2', 'partition_3', 'partition_4',
                                                                                      'partition_5', 'partition_6', 'partition_7', 'partition_8',
                                                                                      'partition_9'],
                                                                  'timeout': 600,
                                                                  'consistency_level': None,
                                                                  'random_data': True,
                                                                  'random_count': 20,
                                                                  'random_range': [0, 100000],
                                                                  'field_name': 'id',
                                                                  'field_type': 'int64',
                                                                  'check_task': 'check_response',
                                                                  'check_items': None}}]},
            'run_id': 2024101764162756,
            'datetime': '2024-10-17 17:33:36.785500',
            'client_version': '2.4.0'},
 'result': {'test_result': {'index': {'RT': 110.8804,
                                      'float_vector_1': {'RT': 0.5163},
                                      'float_vector_2': {'RT': 6.0471},
                                      'float_vector_3': {'RT': 0.538},
                                      'id': {'RT': 0.5305},
                                      'int64_1': {'RT': 0.5163},
                                      'varchar_1': {'RT': 0.5145}},
                            'insert': {'total_time': 148.2714, 'VPS': 6749.52, 'batch_time': 1.4827, 'batch': 10000.0},
                            'flush': {'RT': 3.0467},
                            'load': {'RT': 4.2753},
                            'Locust': {'Aggregated': {'Requests': 5802,
                                                      'Fails': 1,
                                                      'RPS': 0.54,
                                                      'fail_s': 0.0,
                                                      'RT_max': 827518.71,
                                                      'RT_avg': 36624.95,
                                                      'TP50': 16000.0,
                                                      'TP99': 510000.0},
                                       'hybrid_search': {'Requests': 2569,
                                                         'Fails': 0,
                                                         'RPS': 0.24,
                                                         'fail_s': 0.0,
                                                         'RT_max': 35405.13,
                                                         'RT_avg': 4356.86,
                                                         'TP50': 2600.0,
                                                         'TP99': 27000.0},
                                       'query': {'Requests': 325,
                                                 'Fails': 0,
                                                 'RPS': 0.03,
                                                 'fail_s': 0.0,
                                                 'RT_max': 83184.82,
                                                 'RT_avg': 4863.86,
                                                 'TP50': 640.0,
                                                 'TP99': 66000.0},
                                       'scene_test_partition_hybrid_search': {'Requests': 303,
                                                                              'Fails': 1,
                                                                              'RPS': 0.03,
                                                                              'fail_s': 0.0,
                                                                              'RT_max': 827518.71,
                                                                              'RT_avg': 425817.24,
                                                                              'TP50': 411000.0,
                                                                              'TP99': 776000.0},
                                       'search': {'Requests': 2605,
                                                  'Fails': 0,
                                                  'RPS': 0.24,
                                                  'fail_s': 0.0,
                                                  'RT_max': 64599.44,
                                                  'RT_avg': 27140.8,
                                                  'TP50': 27000.0,
                                                  'TP99': 52000.0}}}}}
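As a sanity check on the report, the request mix can be compared with the task weights. This assumes fouram passes each `weight` through to Locust's task weighting, where every user picks its next task with probability proportional to its weight. The sketch copies the weights from `concurrent_tasks` and the counts from the final Locust stats. It also recomputes the "# fails" percentages. `shares` and `failure_pct` are illustrative helper names.

```python
# Task weights from concurrent_tasks in the report data.
WEIGHTS = {
    "scene_test_partition_hybrid_search": 1,
    "search": 8,
    "hybrid_search": 8,
    "query": 1,
}
# Request and failure counts from the final Locust stats table.
REQUESTS = {
    "scene_test_partition_hybrid_search": 303,
    "search": 2605,
    "hybrid_search": 2569,
    "query": 325,
}
FAILS = {"scene_test_partition_hybrid_search": 1, "search": 0, "hybrid_search": 0, "query": 0}


def shares(values):
    """Normalize a name -> number mapping so the values sum to 1."""
    total = sum(values.values())
    return {name: v / total for name, v in values.items()}


def failure_pct(fails, reqs):
    """Failure percentage as shown in the '# fails' column, rounded to 2 decimals."""
    return round(100 * fails / reqs, 2)


expected = shares(WEIGHTS)
observed = shares(REQUESTS)
for name in WEIGHTS:
    print(f"{name:36s} expected {expected[name]:.3f}  observed {observed[name]:.3f}")
print("aggregated failure %:", failure_pct(sum(FAILS.values()), sum(REQUESTS.values())))
```

For example, search's expected share is 8/18 ≈ 0.444 and its observed share is 2605/5802 ≈ 0.449. All four observed shares are within 0.01 of the weight-based ones. The single failed scene request gives 1/303 ≈ 0.33% and 1/5802 ≈ 0.02% aggregated, matching the stats table.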
wangting0128 added the kind/bug, needs-triage, and test/benchmark labels Oct 18, 2024
wangting0128 added this to the 2.4.14 milestone Oct 18, 2024