test(robot): add test case Shutdown Volume Node And Test Auto Reattach To A New Node #2169
base: master
Conversation
…h To A New Node Signed-off-by: Yang Chiu <[email protected]>
Walkthrough

The pull request introduces significant changes across multiple files, primarily enhancing the resource management and testing capabilities within the end-to-end testing framework. Key modifications include the addition of numerous library imports, the introduction of new keywords, and the restructuring of variable management by centralizing variable definitions into a new resource file. The updates also involve renaming and modifying existing keywords to support batch processing of powered-off nodes, improving the flexibility and robustness of the testing environment.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (50)
e2e/libs/host/base.py (1)
26-26: Add docstring to document the waiting parameter.

Since this is an abstract base class that defines the interface, it would be helpful to document the expected behavior of the waiting parameter. Consider adding a docstring like this:

@abstractmethod
def power_off_node(self, node_name, waiting):
+    """Power off a node.
+
+    Args:
+        node_name: Name of the node to power off
+        waiting: If True, wait for the node to completely shut down before returning
+    """
    return NotImplemented

e2e/tests/regression/test_engine_image.robot (1)
Line range hint 1-24: Consider adding node shutdown test coverage.

While the existing test case covers engine upgrade scenarios, given the PR's objective to test node shutdown and auto-reattachment (issue #9858), consider adding a new test case in this file to verify engine image behavior during node shutdown scenarios.
Would you like me to help draft a new test case that covers:
- Engine image behavior when a node is shut down
- Verification of engine image state consistency after node recovery
- Integration with the share manager functionality for RWX PVCs
e2e/keywords/host.resource (1)
38-42: Consider improving error handling and naming.

While the implementation is functional, there are a few suggestions for improvement:
- The keyword name "Power on off nodes" is ambiguous. Consider renaming to something more descriptive like "Power On Previously Powered Off Nodes".
- The error handling could be enhanced to log failures.
- Consider adding validation for the ${powered_off_nodes} list.

Here's a suggested improvement:
-Power on off nodes
+Power On Previously Powered Off Nodes
    FOR    ${powered_off_node}    IN    @{powered_off_nodes}
-        Run keyword And Ignore Error    power_on_node_by_name    ${powered_off_node}
+        ${status}    ${error} =    Run Keyword And Ignore Error    power_on_node_by_name    ${powered_off_node}
+        Run Keyword If    '${status}' == 'FAIL'    Log    Failed to power on node ${powered_off_node}: ${error}    WARN
        Remove Values From List    ${powered_off_nodes}    ${powered_off_node}
    END

e2e/tests/regression/test_migration.robot (1)
Line range hint 19-47: Consider adding a test case for node shutdown scenario

While the current test case thoroughly validates RWX volume migration, it doesn't cover the specific scenario from issue #9858 where a node shutdown causes the share manager to become stuck. Consider adding a new test case that:
- Creates an RWX volume
- Shuts down the node containing the share manager pod
- Verifies auto-reattachment to a new node
- Validates that the share manager recovers properly
This would help prevent regression of the issue described in #9858.
Would you like me to help draft the new test case implementation?
e2e/keywords/sharemanager.resource (1)
29-32: Consider adding documentation for the new keyword.

While the implementation is correct, adding documentation would improve maintainability by clarifying:
- The purpose of the keyword
- Expected behavior during pod restart
- Any timeout or failure conditions
Example documentation:
Wait for sharemanager pod of deployment ${deployment_id} restart
+    [Documentation]    Waits for the sharemanager pod of the specified deployment to restart.
+    ...    The keyword uses the deployment ID to identify the volume and monitors
+    ...    the associated sharemanager pod until a restart is detected.
    ${deployment_name} =    generate_name_with_suffix    deployment    ${deployment_id}
    ${volume_name} =    get_workload_volume_name    ${deployment_name}
    wait_for_sharemanager_pod_restart    ${volume_name}
Line range hint 1-54: Missing test case for node shutdown and auto-reattachment scenario.

Based on the PR objectives (issue #9858), this file should include a new test case to verify that:
- The share manager correctly handles node shutdown scenarios
- RWX PVCs can successfully auto-reattach to new nodes
The current test cases don't cover these scenarios, which are the primary objectives of this PR.
Would you like me to help draft a new test case that covers these scenarios? The test case would:
- Set up a RWX PVC with share manager
- Shut down the node hosting the share manager
- Verify successful auto-reattachment to a new node
e2e/tests/regression/test_volume.robot (2)
Line range hint 1-60: Add test case for volume node shutdown and auto-reattach

Based on the PR objectives and issue #9858, this file should include a test case for validating the share manager's behavior when a volume node is shut down and testing the auto-reattachment to a new node. This is currently missing from the test suite.
Would you like me to help draft a new test case that:
- Creates a volume with RWX access mode
- Simulates a node shutdown
- Verifies the share manager's state
- Validates successful auto-reattachment to a new node?
Line range hint 1-60: Consider adding share manager state validation

To thoroughly test the scenario described in issue #9858, consider adding assertions to verify:
- Share manager's state transitions during node shutdown
- RWX PVC attachment behavior
- Recovery time metrics for auto-reattachment
This will help ensure the test suite effectively catches regressions related to the share manager getting stuck in stopping/error state.
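For the recovery-time metric above, a minimal Python sketch of one way to measure it is shown below. It assumes the Longhorn Volume CR is served at longhorn.io/v1beta2 in the longhorn-system namespace and that kubeconfig access is available; the e2e framework's own volume keywords would likely wrap this differently.

import time

from kubernetes import client, config


def wait_for_reattach_seconds(volume_name, timeout=300, interval=1):
    """Return how long the volume takes to report the 'attached' state again."""
    config.load_kube_config()
    api = client.CustomObjectsApi()
    start = time.time()
    while time.time() - start < timeout:
        volume = api.get_namespaced_custom_object(
            group="longhorn.io", version="v1beta2",
            namespace="longhorn-system", plural="volumes", name=volume_name)
        if volume.get("status", {}).get("state") == "attached":
            return time.time() - start  # recovery-time metric in seconds
        time.sleep(interval)
    raise TimeoutError(f"{volume_name} did not reattach within {timeout}s")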
e2e/libs/keywords/host_keywords.py (1)
49-52: Add docstring and error handling.

The implementation looks good and aligns well with the PR objectives for testing volume node shutdown scenarios. However, consider these improvements:
- Add a docstring explaining the parameters and behavior:

def power_off_volume_node(self, volume_name, waiting=True):
+    """Powers off the node containing the specified volume.
+
+    Args:
+        volume_name: Name of the volume whose node should be powered off
+        waiting: If True, waits for power-off completion. If False, returns immediately
+
+    Raises:
+        Exception: If volume node cannot be determined
+    """
- Add error handling for the node ID lookup:

- node_id = self.volume_keywords.get_node_id_by_replica_locality(volume_name, "volume node")
+ try:
+     node_id = self.volume_keywords.get_node_id_by_replica_locality(volume_name, "volume node")
+     if not node_id:
+         raise Exception(f"Could not determine node ID for volume {volume_name}")
+ except Exception as e:
+     raise Exception(f"Failed to get node ID for volume {volume_name}: {str(e)}")

e2e/tests/negative/kubelet_restart.robot (3)
Line range hint 15-35: Enhance test coverage for share manager scenarios

Given that this PR addresses issue #9858 regarding share manager getting stuck, consider adding specific validations:
- Verify share manager pod state transitions
- Validate successful RWX volume reattachment to new nodes
Restart Volume Node Kubelet While Workload Heavy Writing
    Given Set setting rwx-volume-fast-failover to ${RWX_VOLUME_FAST_FAILOVER}
    And Create storageclass longhorn-test with dataEngine=${DATA_ENGINE}
    And Create statefulset 0 using RWO volume with longhorn-test storageclass
    And Create statefulset 1 using RWX volume with longhorn-test storageclass
    FOR    ${i}    IN RANGE    ${LOOP_COUNT}
        And Keep writing data to pod of statefulset 0
        And Keep writing data to pod of statefulset 1
        When Stop volume nodes kubelet for 10 seconds    statefulset 0    statefulset 1
        And Wait for volume of statefulset 0 healthy
        And Wait for volume of statefulset 1 healthy
+       And Wait for share manager pod healthy    statefulset 1
        And Wait for workloads pods stable    statefulset 0    statefulset 1
+       And Verify RWX volume successfully reattached    statefulset 1
        Then Check statefulset 0 works
        And Check statefulset 1 works
    END
Line range hint 37-57: Consider parameterizing the kubelet stop duration

The 360-second duration for pod eviction timeout is hardcoded. Consider making this configurable through a variable to improve test maintainability.
Stop Volume Node Kubelet For More Than Pod Eviction Timeout While Workload Heavy Writing
    Given Set setting rwx-volume-fast-failover to ${RWX_VOLUME_FAST_FAILOVER}
    And Create storageclass longhorn-test with dataEngine=${DATA_ENGINE}
    And Create statefulset 0 using RWO volume with longhorn-test storageclass
    And Create statefulset 1 using RWX volume with longhorn-test storageclass
    FOR    ${i}    IN RANGE    ${LOOP_COUNT}
        And Keep writing data to pod of statefulset 0
        And Keep writing data to pod of statefulset 1
-       When Stop volume nodes kubelet for 360 seconds    statefulset 0    statefulset 1
+       When Stop volume nodes kubelet for ${POD_EVICTION_TIMEOUT}    statefulset 0    statefulset 1
        And Wait for volume of statefulset 0 healthy
        And Wait for volume of statefulset 1 healthy
+       And Wait for share manager pod healthy    statefulset 1
        And Wait for workloads pods stable    statefulset 0    statefulset 1
+       And Verify RWX volume successfully reattached    statefulset 1
        Then Check statefulset 0 works
        And Check statefulset 1 works
    END
Line range hint 1-57: Test structure aligns well with PR objectives

The test cases effectively validate the system's behavior during node shutdown scenarios. The structure is clean and follows Robot Framework best practices. The suggested improvements will help ensure comprehensive coverage of the share manager issue (#9858).
Consider adding these test cases to a dedicated test suite for share manager-related scenarios if one doesn't exist already. This would help organize tests by functionality and make it easier to maintain and extend share manager-specific test coverage.
e2e/tests/regression/test_ha.robot (1)
Line range hint 24-34: Add test case for share manager node shutdown scenario

The current test cases don't cover the specific scenario from issue #9858 where the share manager becomes stuck when its node is shut down. Consider adding a new test case that:
- Creates a RWX PVC with share manager
- Shuts down the node hosting the share manager
- Verifies successful auto-reattachment to a new node
Would you like me to help draft the new test case implementation?
e2e/keywords/common.resource (2)
Line range hint 4-24: Remove duplicate import of network_keywords library

The network_keywords.py library is imported twice (lines 8 and 13). Please remove one of the duplicate imports to maintain clean dependencies.

Library    ../libs/keywords/network_keywords.py
Library    ../libs/keywords/recurringjob_keywords.py
Library    ../libs/keywords/statefulset_keywords.py
Library    ../libs/keywords/stress_keywords.py
Library    ../libs/keywords/volume_keywords.py
Library    ../libs/keywords/workload_keywords.py
Library    ../libs/keywords/persistentvolumeclaim_keywords.py
-Library    ../libs/keywords/network_keywords.py
Library    ../libs/keywords/storageclass_keywords.py
Line range hint 1-67: Consider documenting the cleanup sequence dependencies

The cleanup sequence handles various resources in a specific order, which is crucial for proper test isolation. With the addition of powered-off node handling, it would be valuable to document:
- The rationale behind the cleanup sequence order
- Any dependencies between cleanup steps
- The interaction between powered-off node handling and other cleanup operations
Consider adding documentation in the following format:
*** Settings ***
Documentation    Common keywords
...    The cleanup sequence follows these dependencies:
...    1. Power on nodes first to ensure they're available for other cleanup operations
...    2. Uncordon nodes to allow pod scheduling
...    3. Clean up higher-level resources (deployments, statefulsets) before their dependencies
...    4. Clean up storage-related resources last

e2e/tests/negative/stress_cpu.robot (1)
6-6: LGTM! Consider documenting the variables.

The centralization of variables into a dedicated resource file is a good practice for maintainability. However, it would be helpful to document the expected values and purpose of these variables in the resource file.
e2e/tests/negative/stress_filesystem.robot (2)
Line range hint 1-100: Missing test coverage for node shutdown scenarios.

While the file contains various stress test scenarios, it doesn't address the primary PR objective of testing volume node shutdown and auto-reattachment scenarios described in issue #9858. Consider adding a new test case to validate this specific behavior.
I can help draft a new test case that would:
- Create a volume with multiple replicas
- Simulate a node shutdown
- Verify auto-reattachment to a new node
- Validate data integrity
Would you like me to propose the implementation?
Line range hint 31-43: Consider adding timeout handling for volume detach/attach cycles.

The test "Stress Volume Node Filesystem When Volume Is Detaching and Attaching" might benefit from explicit timeout handling to prevent test hangs.
Consider adding timeout parameters:
And Wait for volume 0 detached    timeout=300s
And Wait for volume 0 healthy    timeout=300s

e2e/tests/regression/test_v2.robot (1)
Line range hint 1-1: Missing test case for share manager node shutdown

Based on the PR objectives and issue #9858, this file should include a new test case that:
- Creates an RWX PVC with a share manager
- Simulates a node shutdown where the share manager pod is running
- Verifies that the PVC can successfully reattach to a new node
Consider adding a new test case with the following structure:
Share Manager Should Recover After Node Shutdown
    [Tags]    cluster
    Given Create storageclass longhorn-test with dataEngine=v2 accessMode=ReadWriteMany
    And Create persistentvolumeclaim 0 using RWX volume with longhorn-test storageclass
    And Create deployment 0 with persistentvolumeclaim 0
    And Wait for volume of deployment 0 healthy
    When Get node of share manager pod for deployment 0 volume
    And Shutdown node of share manager pod
    Then Wait for volume of deployment 0 attached to new node
    And Check deployment 0 works

e2e/tests/negative/uninstallation_checks.robot (1)
Line range hint 24-92: Consider enhancing error scenario coverage.

The test case thoroughly validates the happy path, but could be strengthened in two areas:
- Add explicit verification that backups are retained immediately after uninstallation
- Include error scenario handling (e.g., failed uninstallation)
Consider adding these verification steps after the uninstallation:
 Then Uninstall Longhorn
 And Check Longhorn CRD removed
+And Verify backup store contains backup 0    # Explicit verification post-uninstallation
+And Verify backup store contains backup 1
+And Verify no orphaned resources in namespace longhorn-system
 # Assume this is another Longhorn cluster
 Then Install Longhorn

e2e/tests/negative/node_delete.robot (2)
Line range hint 20-40: Consider optimizing test data size and adding timeouts

The test writes 2048 MB of data, which might increase test execution time unnecessarily. Consider:
- Reducing the data size for faster test execution
- Making the data size configurable via a variable
- Adding explicit timeouts to "Wait until" steps to prevent indefinite hangs
- And Write 2048 MB data to file data in deployment 0
- And Write 2048 MB data to file data in deployment 1
+ And Write ${TEST_DATA_SIZE} MB data to file data in deployment 0
+ And Write ${TEST_DATA_SIZE} MB data to file data in deployment 1
Line range hint 20-89: Consider adding specific test case for share manager recovery

While these tests cover node deletion scenarios well, they don't specifically address the share manager stuck issue mentioned in #9858. Consider adding a dedicated test case that:
- Specifically targets the share manager pod's node shutdown
- Verifies the share manager's state transitions
- Confirms successful reattachment of RWX PVCs
Would you like me to help draft a new test case specifically for the share manager recovery scenario?
e2e/tests/negative/pull_backup_from_another_longhorn.robot (1)
Line range hint 1-100: Test case doesn't address PR objectives

This test case focuses on backup/restore functionality across different Longhorn versions, but doesn't address the PR's main objective of testing "shutdown volume node and auto reattach to new node" scenarios (from issue #9858). The test should include steps to:
- Shut down a volume node
- Verify share manager behavior
- Test RWX PVC reattachment to new nodes
Would you like assistance in:
- Creating a new test case specifically for the node shutdown scenario, or
- Extending this test case to include the required node shutdown tests?
e2e/tests/negative/cluster_restart.robot (1)
Line range hint 19-89: Enhance test coverage for share manager scenarios

The test case provides excellent coverage of various storage configurations and workload types. However, given the PR objectives and issue #9858, consider these enhancements:
- Add specific validation for share manager state after cluster restart
- Include checks for RWX volume reattachment behavior
- Consider parallel execution for efficiency
Add these validations after the cluster restart:
 When Restart cluster
 And Wait for longhorn ready
 And Wait for workloads pods stable
 ...    deployment 0    deployment 1    deployment 2    deployment 3    deployment 4    deployment 5
 ...    statefulset 0    statefulset 1    statefulset 2    statefulset 3    statefulset 4    statefulset 5
+# Validate share manager recovery
+And Wait for share manager ready
+And Verify share manager not in error state
+And Verify RWX volumes successfully reattached
 Then Check deployment 0 works

e2e/libs/host/aws.py (2)
76-79: Add documentation for asynchronous behavior.

The waiting logic implementation is correct, but it would benefit from a brief comment explaining the behavior when waiting=False. This helps other developers understand the asynchronous nature of the operation.

 def power_off_node(self, power_off_node_name, waiting=True):
     instance_ids = [self.mapping[power_off_node_name]]
     resp = self.aws_client.stop_instances(InstanceIds=instance_ids, Force=True)
     assert resp['ResponseMetadata']['HTTPStatusCode'] == 200, f"Failed to stop instances {instance_ids} response: {resp}"
     logging(f"Stopping instances {instance_ids}")
+    # When waiting=False, return immediately after initiating shutdown
+    # without waiting for the instance to reach 'stopped' state
     if waiting:
         waiter = self.aws_client.get_waiter('instance_stopped')
         waiter.wait(InstanceIds=instance_ids)
         logging(f"Stopped instances")

🧰 Tools
🪛 Ruff (0.8.0)
79-79: f-string without any placeholders. Remove extraneous f prefix (F541)
79-79: Remove unnecessary f-string prefix.

The logging statement doesn't contain any placeholders, so the f-string prefix can be removed.
- logging(f"Stopped instances")
+ logging("Stopped instances")

🧰 Tools
🪛 Ruff (0.8.0)
79-79: f-string without any placeholders. Remove extraneous f prefix (F541)
e2e/tests/regression/test_backup.robot (2)
Line range hint 1-1: Consider adding node shutdown test cases or creating a separate test file.

Based on the PR objectives and issue #9858, this PR should include test cases for share manager behavior during node shutdown scenarios. However, this file focuses on backup/restore functionality. Consider either:
- Adding the node shutdown test cases to this file if backup functionality is involved, or
- Creating a separate test file specifically for node shutdown and auto-reattachment scenarios.
Would you like me to help draft the test cases for the node shutdown scenario? This would include:
- Testing share manager behavior when a node is shut down
- Verifying RWX PVC attachment to new nodes
- Validating auto-reattachment functionality
Line range hint 37-85: Consider enhancing error handling in backup volume tests.

The test implementation is well-structured, but consider these improvements:
- Add retry mechanisms for backup operations that might be flaky
- Add more assertions to verify the error states
- Add logging to help debug test failures
Example improvement:
 When Create backup 0 for volume 0
+# Add retry and logging
+Run Keyword And Continue On Failure    Wait Until Keyword Succeeds
+...    ${RETRY_COUNT}    ${RETRY_INTERVAL}
+...    Verify Backup Creation    0    0
 And Create backup 1 for volume 1
 Then Verify backup list contains no error for volume 0

e2e/libs/keywords/sharemanager_keywords.py (1)
70-95: Implementation aligns well with PR objectives

The new wait_for_sharemanager_pod_restart method and the enhanced logging in wait_for_share_manager_pod_running provide the necessary capabilities to:
- Monitor share manager pod restarts during node shutdown tests
- Debug potential issues with pod state transitions
- Validate the fix for issue #9858 where share manager gets stuck in stopping/error state
Consider adding additional checks for pod status conditions beyond just the creation timestamp to ensure the pod is not only restarted but also healthy and ready to serve RWX volumes.
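A rough sketch of such a health check follows; it assumes the share-manager-<volume> pod naming convention and direct access to the Kubernetes API through the official Python client, whereas the real helpers in e2e/libs resolve the pod through their own utilities.

from kubernetes import client, config


def sharemanager_pod_is_ready(volume_name, namespace="longhorn-system"):
    """Check the Ready condition, not just the creation timestamp."""
    config.load_kube_config()
    core = client.CoreV1Api()
    pod = core.read_namespaced_pod(name=f"share-manager-{volume_name}",
                                   namespace=namespace)
    if pod.status.phase != "Running":
        return False
    # A restarted but crash-looping pod should not be treated as recovered,
    # so require the Ready condition in addition to a fresh pod instance.
    for cond in pod.status.conditions or []:
        if cond.type == "Ready" and cond.status == "True":
            return True
    return False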
🧰 Tools
🪛 Ruff (0.8.0)
80-80: Comparison to None should be cond is None. Replace with cond is None (E711)
87-87: Do not assert False (python -O removes these calls), raise AssertionError(). Replace assert False (B011)
e2e/libs/host/harvester.py (2)
Line range hint 56-73: Consider optimizing batch node operations.

Now that power_off_node supports non-blocking operations, methods like reboot_all_nodes and reboot_all_worker_nodes could be optimized to power off all nodes concurrently before waiting.

Example optimization for reboot_all_nodes:

def reboot_all_nodes(self, shut_down_time_in_sec=NODE_REBOOT_DOWN_TIME_SECOND):
    node_names = [key for key in self.mapping.keys()]
    for node_name in node_names:
-       self.power_off_node(node_name)
+       self.power_off_node(node_name, waiting=False)
+
+   # Wait for all nodes to finish powering off
+   for node_name in node_names:
+       self.power_off_node(node_name)  # This call will only wait
    logging(f"Wait for {shut_down_time_in_sec} seconds before starting vms")
    time.sleep(shut_down_time_in_sec)
Line range hint 1-11: Review security implications of disabled SSL verification.

While this is test code, consider:
- Adding a comment explaining why SSL verification is disabled
- Ensuring this is only used in test environments
- Adding environment validation to prevent accidental use in production
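One possible shape for that environment validation is sketched below; the E2E_ALLOW_INSECURE_TLS variable is a hypothetical name used only for illustration, not an existing flag in this repository.

import os

import urllib3


def configure_tls_verification():
    """Make disabled TLS verification an explicit, test-only opt-in."""
    if os.environ.get("E2E_ALLOW_INSECURE_TLS", "false").lower() != "true":
        raise RuntimeError(
            "Refusing to disable TLS verification: set E2E_ALLOW_INSECURE_TLS=true "
            "only against disposable test clusters.")
    # Harvester test clusters commonly use self-signed certificates, so
    # verification is disabled deliberately; suppress the resulting warnings.
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    return False  # value to pass as `verify=` to the requests calls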
e2e/tests/regression/test_settings.robot (2)
Line range hint 20-82: Consider enhancing test documentation for concurrent rebuild scenarios.

The test case thoroughly validates concurrent rebuild limits, but could benefit from additional documentation about:
- Expected rebuild timing behaviors
- Data integrity verification methods
- Potential impact on system resources
Would you like me to help generate improved documentation?
Line range hint 84-116: Consider extending RWX volume tests to include node shutdown scenarios.

Given the PR objectives to test node shutdown and auto-reattachment scenarios (issue #9858), consider extending this test case to include:
- Node shutdown while RWX volume is attached
- Verification of auto-reattachment behavior
- Network connectivity validation after reattachment
Would you like me to help draft additional test scenarios that align with the PR objectives?
e2e/tests/negative/replica_rebuilding.robot (3)
Line range hint 133-135: Fix duplicate 'And' keyword in test steps.

There's a syntax error in these steps where "And" is duplicated:

- And And Wait until volume of deployment 0 replica rebuilding started on node 0
+ And Wait until volume of deployment 0 replica rebuilding started on node 0
Line range hint 22-35: Consider randomizing node selection for better test coverage.

The test cases use a sequential pattern (node 0, 1, 2) for deleting replicas. Consider randomizing the node selection to catch potential edge cases and ensure the system behaves correctly regardless of the order of node operations.
Example approach:
# Create a list of available nodes and shuffle them
@{nodes}=    Create List    0    1    2
Evaluate    random.shuffle($nodes)    random

# Use shuffled nodes in the test
FOR    ${i}    IN RANGE    ${LOOP_COUNT}
    FOR    ${node}    IN    @{nodes}
        When Delete volume 0 replica on node ${node}
        And Wait until volume 0 replica rebuilding started on node ${node}
        And Wait until volume 0 replica rebuilding completed on node ${node}
    END
    Then Wait for volume 0 healthy
    And Check volume 0 data is intact
END

Also applies to: 89-102
Line range hint 106-121: Consider adding more specific assertions during rebuilding.

While the test cases verify the final state and data integrity, consider adding assertions to verify the system state during the rebuilding process. This could help catch intermediate issues and provide better debugging information.
Example additions:
When Delete replica of deployment 0 volume on node 0
And Wait until volume of deployment 0 replica rebuilding started on node 0
# Add assertions about the system state during rebuilding
And Volume should have expected rebuild status
And Volume metrics should show active rebuild
And System events should indicate rebuilding progress

e2e/tests/negative/network_disconnect.robot (2)
Line range hint 1-1: Consider adding specific test cases for share manager issues.

While the current test cases cover network disconnection scenarios well, they don't fully address the share manager stuck issues mentioned in issue #9858. Consider adding test cases that specifically:
- Validate the share manager state during node shutdown
- Verify that the share manager doesn't get stuck in stopping/error state
- Confirm successful reattachment of RWX PVCs after node shutdown
This would help ensure the specific issues mentioned in #9858 are properly tested and prevent regressions.
Would you like me to help draft additional test cases focusing on these scenarios?
Line range hint 1-1: Enhance test verification and automation.

The test cases are well-structured, but consider these improvements:
- Add explicit verification steps for volume state transitions
- Automate more aspects of the manual test plans
- Add logging to capture state changes during network disconnection
This would improve test reliability and debugging capabilities.
Example verification steps to add:
And Verify volume state transition from "attached" to "detached"
And Verify no orphaned resources after reattachment
And Log volume manager state changes during network disconnect

e2e/tests/negative/test_backup_listing.robot (3)
Line range hint 159-171: Consider performance and reliability improvements for large volume operations.

The test steps for large volume operations (210GB, 150GB) might benefit from the following improvements:
- Add explicit timeout settings for large data operations
- Verify cleanup completion between volume operations to prevent resource exhaustion
- Consider adding progress logging for long-running operations
Example improvement for timeout and logging:
Write 210 GB large data to file 0 in pod 2
    [Timeout]    4h    # Adjust based on performance characteristics
    FOR    ${written}    IN RANGE    0    210    10
        Write 10 GB data to file 0 in pod 2
        Log    Written ${written+10}GB of 210GB
    END
Line range hint 1-171: Consider separating test concerns and adding node shutdown scenarios.

While the backup listing test improvements are valuable, they don't directly address the PR objectives regarding node shutdown and share manager scenarios. Consider:
- Creating a separate test file specifically for node shutdown and auto reattachment scenarios
- Adding test cases that validate share manager behavior during node shutdown
- Including specific assertions for RWX PVC attachment behavior
This would help maintain clear separation of concerns while ensuring comprehensive coverage of the scenarios described in issue #9858.
Test case is misaligned with PR objectives - additional test coverage needed
The current test files show that while there are tests for share manager pod recreation (component_resilience.robot) and node power-off scenarios (node_reboot.robot), none specifically validate the share manager getting stuck during node shutdown as described in issue #9858. The test case being reviewed focuses on backup listing for large volumes, which is unrelated to the PR's objectives.

Suggested actions:
- Add a new test case in node_reboot.robot specifically targeting the share manager stuck scenario during node shutdown
- Keep the current large volume backup test separate as it serves a different purpose
🔗 Analysis chain
Line range hint 148-158: Verify test case alignment with PR objectives.

While this test case focuses on backup listing for large volumes, the PR objectives mention testing volume node shutdown and auto reattachment scenarios. Consider adding specific test cases that validate the behavior described in issue #9858 regarding share manager getting stuck when a node is shut down.
Let's check if there are other test files that might be more relevant to the PR objectives:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for test files that might be more relevant to testing node shutdown scenarios
rg -l "shutdown.*node|node.*shutdown" "e2e/tests/"

# Search for test files that might be testing share manager functionality
rg -l "share.*manager|manager.*share" "e2e/tests/"

Length of output: 225
Script:
#!/bin/bash
# Let's examine the content of these files to see if they already cover the share manager node shutdown scenario
rg -A 5 -B 5 "share.*manager|manager.*share|shutdown.*node|node.*shutdown" "e2e/tests/negative/node_reboot.robot" "e2e/tests/regression/test_settings.robot" "e2e/tests/negative/component_resilience.robot"

# Also search for any existing test cases related to issue #9858
rg "9858|stuck" "e2e/tests/"

Length of output: 6872
e2e/tests/negative/component_resilience.robot (1)
Line range hint 208-241: Add test case for share manager node shutdown scenario.

The current RWX volume recovery test case doesn't cover the specific scenario from issue #9858 where the share manager becomes stuck when its node is shut down. Consider adding a new test case that:
- Creates an RWX volume
- Shuts down the node where the share manager pod is running
- Verifies that the share manager successfully moves to a new node
- Confirms the RWX PVC can be attached to a new node
Here's a suggested test case structure:
Test Share Manager Node Shutdown Recovery
    [Documentation]    Test share manager recovery when its node is shut down
    ...    Related to issue #9858
    ...
    ...    Test steps:
    ...    Create RWX volume using Longhorn storage class
    ...    Deploy workload using the RWX volume
    ...    Identify the node running share manager pod
    ...    Shutdown the node
    ...    Verify share manager moves to new node
    ...    Verify RWX volume remains accessible
    When Create storageclass longhorn-test with dataEngine=${DATA_ENGINE}
    And Create persistentvolumeclaim 0 using RWX volume with longhorn-test storageclass
    And Create deployment 0 with persistentvolumeclaim 0
    And Write test data to deployment 0
    ${node}=    Get node running sharemanager pod of deployment 0
    When Power off node ${node}
    Then Wait for sharemanager pod of deployment 0 running
    And Verify sharemanager pod of deployment 0 is on different node ${node}
    And Wait for deployment 0 pods stable
    And Check deployment 0 data is intact

e2e/keywords/workload.resource (3)
48-52: LGTM! Consider adding documentation for the powered-off nodes tracking.

The changes improve node state tracking by maintaining a list of powered-off nodes. This enhancement aligns well with the PR objectives for testing node shutdown scenarios.
Consider adding a comment explaining the purpose of the ${powered_off_nodes} list and the ${last_volume_node} variable for better maintainability.
53-59: LGTM! Consider documenting the difference between waiting and non-waiting variants.

The new non-blocking keyword enhances testing capabilities for concurrent scenarios. However, consider the following improvements:
- Add documentation explaining when to use this variant vs. the blocking version.
- Consider refactoring to reduce code duplication:
+Power off volume node of ${workload_kind} ${workload_id}
+    [Arguments]    ${waiting}=True
+    ${workload_name} =    generate_name_with_suffix    ${workload_kind}    ${workload_id}
+    ${volume_name} =    get_workload_volume_name    ${workload_name}
+    ${powered_off_node} =    get_volume_node    ${volume_name}
+    Append to list    ${powered_off_nodes}    ${powered_off_node}
+    ${last_volume_node} =    get_volume_node    ${volume_name}
+    power_off_volume_node    ${volume_name}    waiting=${waiting}
+    Set Test Variable    ${last_volume_node}
+
+Power off volume node of ${workload_kind} ${workload_id} without waiting
+    Power off volume node of ${workload_kind} ${workload_id}    waiting=False
48-59: Well-structured changes that enhance test coverage for node shutdown scenarios.

The modifications effectively support testing of share manager behavior during node shutdowns by:
- Maintaining accurate tracking of powered-off nodes for reliable cleanup
- Enabling concurrent testing through non-blocking operations
- Providing consistent node state management across different test scenarios
These improvements align well with the PR objectives of testing RWX PVC behavior during node shutdowns.
Consider documenting the test patterns and best practices for using these keywords in the test suite documentation, particularly focusing on:
- When to use blocking vs. non-blocking variants
- Guidelines for cleanup and recovery after node shutdowns
- Recommended patterns for testing concurrent scenarios
e2e/tests/negative/node_drain.robot (4)
Line range hint 1-24: Add test case for share manager node shutdown scenario

While the existing test cases cover various node drain scenarios, they don't specifically address the issue #9858 where the share manager becomes stuck when its node is shut down. Consider adding a new test case that:
- Creates an RWX PVC with share manager
- Shuts down the node where share manager pod is running
- Verifies that the share manager successfully moves to a new node
- Validates that the RWX PVC can be attached to a new node
This would help prevent regression of the issue described in #9858.
Would you like me to help draft this additional test case?
Line range hint 25-63: Enhance RWX volume assertions in force drain test

For RWX volumes (deployment 1), consider adding these additional checks:
- Verify share manager pod successfully moves to another node
- Assert share manager is not in stopping/error state
- Validate share manager service endpoints are updated
This would strengthen the test coverage for RWX-specific behaviors during node drain.
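As a library-level illustration of those assertions, a hedged Python sketch is shown below. It assumes the share-manager-<volume> pod naming convention and that the ShareManager CR is available at longhorn.io/v1beta2 under the plural "sharemanagers"; both assumptions should be confirmed against the cluster under test.

from kubernetes import client, config


def assert_sharemanager_moved(volume_name, old_node, namespace="longhorn-system"):
    """Assert the share manager left the drained node and is not stuck."""
    config.load_kube_config()
    core = client.CoreV1Api()
    custom = client.CustomObjectsApi()

    pod = core.read_namespaced_pod(f"share-manager-{volume_name}", namespace)
    assert pod.spec.node_name != old_node, \
        f"share manager pod still scheduled on drained node {old_node}"

    sm = custom.get_namespaced_custom_object(
        group="longhorn.io", version="v1beta2", namespace=namespace,
        plural="sharemanagers", name=volume_name)
    state = sm.get("status", {}).get("state", "")
    assert state not in ("stopping", "error"), \
        f"share manager for {volume_name} is stuck in state {state!r}"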
Line range hint 219-243: Improve robustness of node drain policy test

Consider enhancing the test case with:
- Explicit timeout handling for drain operations
- State verification between steps
- Cleanup verification after policy change
This would make the test more reliable and easier to debug.
Example improvements:
When Force drain node 2 and expect failure
And Check instance-manager pod is running on node 2
And Verify node 2 status is cordoned
And Verify all non-protected pods are evicted    timeout=300s
When Set setting node-drain-policy to always-allow
And Verify setting update is applied
And Force drain node 2 and expect success
And Wait until all pods are evicted    timeout=300s
And Check PDB not exist    instance_manger=${instance_manager_name}
And Verify cleanup of protection resources
Line range hint 219-219: Enhance test documentation for node drain policy test

The documentation could be improved by:
- Adding expected behavior for RWX volumes specifically
- Including verification steps for share manager state
- Documenting cleanup expectations
This would make the test's relationship to issue #9858 clearer.
e2e/tests/negative/node_reboot.robot (1)
56-56: Consider adding post-power-on verification

While the test powers on the nodes, consider adding explicit verification steps after power on to ensure complete system recovery.

 And Power on off nodes
+And Wait for longhorn ready
+And Wait for nodes ready
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (35)
- e2e/keywords/common.resource (2 hunks)
- e2e/keywords/host.resource (2 hunks)
- e2e/keywords/sharemanager.resource (1 hunks)
- e2e/keywords/variables.resource (1 hunks)
- e2e/keywords/workload.resource (1 hunks)
- e2e/libs/host/aws.py (1 hunks)
- e2e/libs/host/base.py (1 hunks)
- e2e/libs/host/harvester.py (2 hunks)
- e2e/libs/keywords/host_keywords.py (1 hunks)
- e2e/libs/keywords/sharemanager_keywords.py (1 hunks)
- e2e/tests/negative/cluster_restart.robot (1 hunks)
- e2e/tests/negative/component_resilience.robot (1 hunks)
- e2e/tests/negative/kubelet_restart.robot (1 hunks)
- e2e/tests/negative/network_disconnect.robot (1 hunks)
- e2e/tests/negative/node_delete.robot (1 hunks)
- e2e/tests/negative/node_drain.robot (1 hunks)
- e2e/tests/negative/node_reboot.robot (8 hunks)
- e2e/tests/negative/pull_backup_from_another_longhorn.robot (1 hunks)
- e2e/tests/negative/replica_rebuilding.robot (1 hunks)
- e2e/tests/negative/stress_cpu.robot (1 hunks)
- e2e/tests/negative/stress_filesystem.robot (1 hunks)
- e2e/tests/negative/stress_memory.robot (1 hunks)
- e2e/tests/negative/test_backup_listing.robot (3 hunks)
- e2e/tests/negative/uninstallation_checks.robot (1 hunks)
- e2e/tests/regression/test_backing_image.robot (1 hunks)
- e2e/tests/regression/test_backup.robot (1 hunks)
- e2e/tests/regression/test_basic.robot (1 hunks)
- e2e/tests/regression/test_engine_image.robot (1 hunks)
- e2e/tests/regression/test_ha.robot (1 hunks)
- e2e/tests/regression/test_migration.robot (1 hunks)
- e2e/tests/regression/test_replica.robot (1 hunks)
- e2e/tests/regression/test_scheduling.robot (1 hunks)
- e2e/tests/regression/test_settings.robot (1 hunks)
- e2e/tests/regression/test_v2.robot (2 hunks)
- e2e/tests/regression/test_volume.robot (1 hunks)
✅ Files skipped from review due to trivial changes (1)
- e2e/keywords/variables.resource
🧰 Additional context used
🪛 Ruff (0.8.0)
e2e/libs/host/aws.py
79-79: f-string without any placeholders. Remove extraneous f prefix (F541)
e2e/libs/keywords/sharemanager_keywords.py
80-80: Comparison to None should be cond is None. Replace with cond is None (E711)
87-87: Do not assert False (python -O removes these calls), raise AssertionError(). Replace assert False (B011)
🔇 Additional comments (49)
e2e/libs/host/base.py (1)
26-26: LGTM! Verify implementing classes.

The addition of the waiting parameter to the abstract method provides better control over shutdown behavior, which aligns well with the PR's objective of testing node shutdown scenarios.
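For reference, a minimal sketch of the resulting interface shape is given below; the Base class and power_off_node names come from the file under review, while the concrete implementation body is purely illustrative.

from abc import ABC, abstractmethod


class Base(ABC):

    @abstractmethod
    def power_off_node(self, node_name, waiting=True):
        """Power off a node, optionally returning before shutdown completes."""
        return NotImplemented


class ExampleHost(Base):  # stands in for the AWS or Harvester implementation

    def power_off_node(self, node_name, waiting=True):
        self._stop_instance(node_name)            # issue the shutdown request
        if waiting:
            self._wait_until_stopped(node_name)   # block until fully powered off

    def _stop_instance(self, node_name):
        print(f"stopping {node_name}")

    def _wait_until_stopped(self, node_name):
        print(f"waiting for {node_name} to stop")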
Let's verify that all implementing classes have been updated accordingly:
✅ Verification successful

All implementing classes have been properly updated with the new parameter.

The verification shows that all implementing classes (harvester.py and aws.py) have correctly updated their power_off_node method signatures to include the waiting parameter with a default value of True, matching the abstract method's signature in base.py.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if all implementing classes have updated the power_off_node method signature
# Search for classes that inherit from Base
ast-grep --pattern 'class $_ (Base):'
# Search for power_off_node implementations
rg -A 2 'def power_off_node\(' --type py
Length of output: 683
e2e/tests/regression/test_engine_image.robot (1)
6-6: Verify the variables defined in the new resource file.

The change moves variable definitions to a centralized variables.resource file. Please ensure that all previously defined variables (${LOOP_COUNT}, ${RETRY_COUNT}, ${RETRY_INTERVAL}, ${DATA_ENGINE}) are properly defined in the new resource file.
✅ Verification successful

All required variables are properly defined in the centralized resource file.

The verification confirms that all previously defined variables (${LOOP_COUNT}, ${RETRY_COUNT}, ${RETRY_INTERVAL}, ${DATA_ENGINE}) are properly defined in e2e/keywords/variables.resource with appropriate values:
${LOOP_COUNT} = 1
${RETRY_COUNT} = 300
${RETRY_INTERVAL} = 1
${DATA_ENGINE} = v1
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that all required variables are defined in the new resource file
# Check for variable definitions in the new resource file
echo "Checking variables.resource for required variable definitions..."
rg -A 1 "^\$\{(LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|DATA_ENGINE)\}" "../keywords/variables.resource"
# Check if these variables are still defined elsewhere in the test suite
echo "Checking for any remaining inline definitions of these variables..."
rg "^\$\{(LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|DATA_ENGINE)\}" "e2e/tests/"
Length of output: 531
Script:
#!/bin/bash
# Let's first find the correct path to the variables.resource file
echo "Finding the variables.resource file..."
fd "variables.resource" .
# Then check its content for variable definitions
echo -e "\nChecking content of the variables.resource file if found..."
fd "variables.resource" . --exec cat {}
# Also check for any variable definitions in the test file itself
echo -e "\nChecking current test file for variable definitions..."
cat e2e/tests/regression/test_engine_image.robot
Length of output: 1682
e2e/tests/regression/test_backing_image.robot (2)
Line range hint 13-31
: LGTM! Well-structured test case with comprehensive coverage.
The test case demonstrates good practices:
- Clear BDD-style steps
- Proper resource management
- Verification of error conditions
- Complete cleanup
6-6
: LGTM! Good practice to centralize variable definitions.
The addition of variables.resource improves maintainability by centralizing common variables used across test files.
Let's verify that all required variables are properly defined in the new resource file:
✅ Verification successful
All required variables are properly defined in the centralized resource file
The verification confirms that variables.resource contains all the essential variables mentioned in the test file:
- ${LOOP_COUNT} = 1
- ${RETRY_COUNT} = 300
- ${RETRY_INTERVAL} = 1
- ${DATA_ENGINE} = v1
The centralization of these variables is indeed a good practice that will help maintain consistency across test files.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that all required variables are defined in the new resource file
# Expected variables: LOOP_COUNT, RETRY_COUNT, RETRY_INTERVAL, DATA_ENGINE
# Check the variables.resource file for required variable definitions
rg -A 1 '^\$\{(LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|DATA_ENGINE)\}' ../keywords/variables.resource
Length of output: 172
Script:
#!/bin/bash
# Let's first find the correct path to the variables.resource file
fd variables.resource
# Then check the content of the file for variable definitions
fd variables.resource --exec cat {}
Length of output: 384
e2e/keywords/host.resource (2)
Line range hint 4-9
: LGTM: Library imports are well-organized and complete.
The imports provide all necessary dependencies for node management operations.
44-47
: Verify implementation against test requirements.
The implementation tracks powered-off nodes for later recovery, which aligns with the test objectives. However, consider adding parameter validation and error handling.
Let's verify the implementation against the test requirements:
Consider adding parameter validation:
Power off node ${node_id}
+ # Validate node_id is a positive integer
+ Should Be True ${node_id} >= 0 msg=Node ID must be a non-negative integer
+
${powered_off_node} = get_node_by_index ${node_id}
+ Should Not Be Empty ${powered_off_node} msg=Invalid node ID: ${node_id}
+
Append to list ${powered_off_nodes} ${powered_off_node}
power_off_node_by_name ${powered_off_node}
✅ Verification successful
Implementation is correctly integrated with test framework
The verification shows that the implementation is properly integrated with the test framework:
- powered_off_nodes is initialized as an empty list in variables.resource
- The keyword is used in node reboot and drain test scenarios
- Proper cleanup is implemented in the Cleanup test resources keyword, which powers on nodes and cleans the list
- Multiple test files track powered-off nodes consistently using the same list

The suggested parameter validation is not necessary because:
- get_node_by_index is a framework-level function that would fail if the index is invalid
- The list operation is safe as powered_off_nodes is pre-initialized
- Test teardown ensures proper cleanup of powered-off nodes
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the usage of this keyword in test cases
# and ensure it's properly integrated with the shutdown volume node test case
# Check for test cases using this keyword
rg -l "Power off node" --type robot
# Check for related test implementations
rg -l "shutdown.*volume.*node.*test.*auto.*reattach" --type robot
Length of output: 178
Script:
#!/bin/bash
# Let's examine the test files that use this keyword to understand the context
rg -B 5 -A 5 "Power off node" e2e/tests/negative/node_reboot.robot e2e/tests/negative/node_drain.robot
# Check if powered_off_nodes is initialized somewhere
rg "powered_off_nodes" --type robot
# Check the implementation of get_node_by_index
rg -B 2 -A 2 "get_node_by_index" --type robot
Length of output: 3494
Script:
#!/bin/bash
# Let's check the initialization of powered_off_nodes and related setup
rg -B 5 -A 5 "Test Setup|Suite Setup" e2e/tests/negative/node_reboot.robot e2e/tests/negative/node_drain.robot
# Check the implementation of power_off_node_by_name
rg -B 2 -A 2 "power_off_node_by_name" --type robot
# Check if there are any test teardown procedures handling powered off nodes
rg -B 5 -A 5 "Test Teardown|Suite Teardown" e2e/tests/negative/node_reboot.robot e2e/tests/negative/node_drain.robot
Length of output: 3561
Script:
#!/bin/bash
# Let's check the Set test environment keyword which might initialize powered_off_nodes
rg -B 2 -A 5 "Set test environment" ../e2e/keywords/
# Check the Cleanup test resources keyword for handling powered off nodes
rg -B 2 -A 5 "Cleanup test resources" ../e2e/keywords/
# Check for any other references to powered_off_nodes in keyword files
rg -B 2 -A 2 "powered_off_nodes" ../e2e/keywords/
Length of output: 332
Script:
#!/bin/bash
# Let's check the Set test environment keyword which might initialize powered_off_nodes
rg -B 2 -A 5 "Set test environment" e2e/keywords/
# Check the Cleanup test resources keyword for handling powered off nodes
rg -B 2 -A 5 "Cleanup test resources" e2e/keywords/
# Check for any other references to powered_off_nodes in keyword files
rg -B 2 -A 2 "powered_off_nodes" e2e/keywords/
# Let's also check the content of the host.resource file
cat e2e/keywords/host.resource
Length of output: 5101
e2e/tests/regression/test_migration.robot (1)
6-6
: LGTM! Good practice moving variables to a centralized resource file.
The addition of variables.resource improves maintainability by centralizing variable definitions that are used across multiple test files.
e2e/keywords/sharemanager.resource (2)
29-32
: LGTM! The new keyword follows consistent patterns.
The implementation correctly follows the established patterns in the file:
- Reuses existing helper functions for name generation
- Maintains consistent naming conventions
- Follows the same structure as similar keywords
29-32
: Verify the restart detection logic.
Since this keyword is crucial for testing the auto-reattachment functionality mentioned in issue #9858, we should verify the underlying implementation.
✅ Verification successful
Let's examine the implementation of the function in the identified file.
Let's check the retry configuration and pod utilities being used to ensure the restart detection is robust.
Let's check the retry configuration values and the get_pod implementation to complete our verification.
The restart detection logic is properly implemented
The implementation correctly detects pod restarts by:
- Tracking pod creation timestamps to detect new instances
- Using robust retry mechanism with configurable count and interval
- Properly handling pod absence during restart
- Using the official Kubernetes API for pod status checks
- Failing explicitly if restart is not detected within the retry window
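A condensed sketch of that detection pattern is shown below; get_pod is passed in as a callable and the retry values as plain arguments, whereas the actual helper in sharemanager_keywords.py resolves these through the framework's utilities.

import time


def wait_for_pod_restart(get_pod, original_created_at, retry_count=300, retry_interval=1):
    """Wait until a pod instance newer than original_created_at appears."""
    for _ in range(retry_count):
        pod = get_pod()
        # The pod may briefly disappear while the old instance is torn down;
        # treat that as "not restarted yet" rather than an error.
        if pod is not None and pod.metadata.creation_timestamp > original_created_at:
            return pod  # a new instance has been created
        time.sleep(retry_interval)
    raise AssertionError("pod was not restarted within the retry window")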
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check the implementation of wait_for_sharemanager_pod_restart
# to ensure it correctly detects pod restarts and handles timeouts
# Search for the implementation
ast-grep --pattern 'def wait_for_sharemanager_pod_restart($$$)'
# Look for related test cases
rg -l "wait_for_sharemanager_pod_restart" --type py
Length of output: 160
Script:
#!/bin/bash
# Get the implementation of wait_for_sharemanager_pod_restart and surrounding context
rg -A 20 "def wait_for_sharemanager_pod_restart" e2e/libs/keywords/sharemanager_keywords.py
# Also check for any restart detection related functions in the same file
rg "restart" e2e/libs/keywords/sharemanager_keywords.py
Length of output: 1432
Script:
#!/bin/bash
# Check the retry configuration
rg "get_retry_count_and_interval" --type py
# Check pod utilities
rg "def get_pod" --type py
# Look for any test cases using this function
rg -B 5 -A 10 "wait_for_sharemanager_pod_restart" e2e/
Length of output: 10658
Script:
#!/bin/bash
# Check the retry configuration implementation
rg -A 5 "def get_retry_count_and_interval" e2e/libs/utility/utility.py
# Check the get_pod implementation
cat e2e/libs/utility/utility.py | rg -A 10 "def get_pod"
Length of output: 641
e2e/tests/regression/test_scheduling.robot (1)
6-6
: LGTM! Good practice to centralize variable management.
The addition of variables.resource aligns with best practices by centralizing variable definitions, making the test suite more maintainable and consistent.
e2e/tests/regression/test_replica.robot (2)
6-6
: LGTM! Good practice to centralize variable management.
The addition of variables.resource improves code organization by centralizing variable definitions that were previously scattered across test files.
Let's verify the resource file exists and contains the required variables:
✅ Verification successful
Resource file contains all expected variables and follows Robot Framework conventions ✅
The variables.resource file exists and contains essential variables including:
- ${LOOP_COUNT}, ${RETRY_COUNT}, ${RETRY_INTERVAL} for test control flow
- ${DATA_ENGINE} for engine specification
- Additional configuration variables for volume types and network settings
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the variables.resource file exists and contains the required variables
# Expected: File exists and contains LOOP_COUNT, RETRY_COUNT, RETRY_INTERVAL, and DATA_ENGINE
# Check if the file exists and show its contents
cat e2e/keywords/variables.resource
Length of output: 326
Line range hint 1-24
: Consider adding node shutdown test cases to align with PR objectives.
The PR objectives mention testing node shutdown and auto-reattachment scenarios (issue #9858), but this file only contains replica rebuilding tests. Consider adding test cases for:
- Share manager pod behavior during node shutdown
- RWX PVC auto-reattachment to new nodes
Let's check if these test cases exist elsewhere:
e2e/tests/regression/test_volume.robot (1)
6-6
: Verify variables.resource contains all required variables
The change to centralize variables is good for maintainability. However, we should verify that all previously defined variables (${LOOP_COUNT}, ${RETRY_COUNT}, ${RETRY_INTERVAL}) are properly defined in the new resource file.
✅ Verification successful
All required variables are properly defined in variables.resource
The verification confirms that all previously defined variables (${LOOP_COUNT}, ${RETRY_COUNT}, ${RETRY_INTERVAL}) have been successfully moved to e2e/keywords/variables.resource with the same values:
${LOOP_COUNT} = 1
${RETRY_COUNT} = 300
${RETRY_INTERVAL} = 1
The centralization of these variables is a good practice and has been implemented correctly.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that all required variables are defined in variables.resource
# Check if the variables exist in the new resource file
rg -l "(?:\$\{LOOP_COUNT\}|\$\{RETRY_COUNT\}|\$\{RETRY_INTERVAL\})" "e2e/keywords/variables.resource"
Length of output: 98
Script:
#!/bin/bash
# Let's first check if the variables.resource file exists and see its content
fd "variables.resource" e2e/keywords/ --type f --exec cat {}
# Also, let's check the original test file to see what variables were previously defined
fd "test_volume.robot" e2e/tests/regression/ --type f --exec git show HEAD~1:{}
Length of output: 2590
e2e/libs/keywords/host_keywords.py (1)
49-52
: Implementation aligns with PR objectives.
The addition of the waiting parameter enhances the testing capabilities for issue #9858, allowing verification of share manager behavior during both immediate and delayed node shutdowns. This flexibility is crucial for thorough testing of RWX PVC auto-reattachment scenarios.
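To illustrate how the two modes differ at call time, a hypothetical usage sketch follows; only the power_off_volume_node signature comes from the change under review, while the surrounding helper and assertions are assumptions.

def exercise_power_off_modes(host, volume_name):
    """`host` is assumed to be an instance of the host_keywords class."""
    # Non-blocking: initiate the shutdown and observe the share manager while
    # the node is still going down (the scenario from issue #9858).
    host.power_off_volume_node(volume_name, waiting=False)
    # ... assert here that the share manager does not stay in stopping/error ...

    # Blocking (default): continue only once the node is fully powered off.
    host.power_off_volume_node(volume_name, waiting=True)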
Let's verify the usage of this method in test cases:
✅ Verification successful
Implementation aligns with PR objectives and is being utilized correctly
The verification confirms that the power_off_volume_node method with the new waiting parameter is being used in the test framework:
- In e2e/keywords/workload.resource: used with both the default behavior (waiting=True) and immediate shutdown (waiting=False)
- The implementation provides the flexibility needed for testing share manager behavior during node shutdowns as intended in issue #9858
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if the new waiting parameter is utilized in test cases
# Expected: Find test cases using power_off_volume_node with waiting=False
# Search for test cases using the new parameter
rg -l "power_off_volume_node.*waiting.*=.*False" "e2e/test"
Length of output: 108
Script:
#!/bin/bash
# Let's first find the correct test directory structure
fd -t d "test"
# Then search for any usage of power_off_volume_node
rg "power_off_volume_node" -A 2 -B 2
Length of output: 1641
e2e/tests/negative/kubelet_restart.robot (1)
6-6
: Verify variables in the new resource file
The variables previously defined in this file have been moved to variables.resource. Let's ensure all required variables are properly defined.
✅ Verification successful
Variables are properly defined in the centralized resource file
All required variables used in kubelet_restart.robot are correctly defined in e2e/keywords/variables.resource:
- ${LOOP_COUNT} = 1
- ${RETRY_COUNT} = 300
- ${RETRY_INTERVAL} = 1
- ${RWX_VOLUME_FAST_FAILOVER} = false
- ${DATA_ENGINE} = v1
The centralization of these variables maintains consistency and follows good practices for test maintenance.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that all required variables are defined in the new resource file
# Expected variables: LOOP_COUNT, RETRY_COUNT, RETRY_INTERVAL, RWX_VOLUME_FAST_FAILOVER, DATA_ENGINE
# Check for variable definitions in the resource file
rg -A 1 '^\$\{(LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|RWX_VOLUME_FAST_FAILOVER|DATA_ENGINE)\}' ../keywords/variables.resource
Length of output: 197
Script:
#!/bin/bash
# Let's first find the correct path to the variables.resource file
fd variables.resource
# Then check the content of the file for variable definitions
# We'll also search for these variables in the original test file to understand what we're looking for
rg -A 1 '^\$\{(LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|RWX_VOLUME_FAST_FAILOVER|DATA_ENGINE)\}' e2e/tests/negative/kubelet_restart.robot
Length of output: 192
Script:
#!/bin/bash
# Now that we found the correct path, let's check the content of variables.resource
cat e2e/keywords/variables.resource
# Also let's check the original test file to see how these variables were used before
cat e2e/tests/negative/kubelet_restart.robot
Length of output: 2688
e2e/tests/regression/test_ha.robot (2)
6-6
: Verify variable definitions in the new resource file
The migration of variables to variables.resource improves maintainability. However, we should ensure that all required variables (${LOOP_COUNT}, ${RETRY_COUNT}, ${RETRY_INTERVAL}) are properly defined in the new resource file.
✅ Verification successful
Variables are properly defined in the resource file
The verification confirms that all required variables (${LOOP_COUNT}, ${RETRY_COUNT}, ${RETRY_INTERVAL}) are correctly defined in e2e/keywords/variables.resource with appropriate default values:
- ${LOOP_COUNT} = 1
- ${RETRY_COUNT} = 300
- ${RETRY_INTERVAL} = 1
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that all required variables are defined in the new resource file
# Check for required variables in the resource file
echo "Checking for required variables in variables.resource..."
rg -A 1 "LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL" "../keywords/variables.resource"
Length of output: 274
Script:
#!/bin/bash
# Let's first find the correct path to the variables.resource file
fd -t f "variables.resource$"
# Then check the content of the file for our variables
echo "=== Checking test file for variable usage ==="
rg "LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL" "e2e/tests/regression/test_ha.robot"
Length of output: 289
Script:
#!/bin/bash
# Now that we found the correct path, let's check the content of variables.resource
echo "=== Content of variables.resource ==="
cat "e2e/keywords/variables.resource"
# Also check if these variables might be defined in other resource files
echo -e "\n=== Checking other resource files for these variables ==="
rg "LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL" "e2e/**/*.resource"
Length of output: 665
Line range hint 36-52
: Consider implementing the skipped test case
The "Don't Orphan Processes" test case has detailed documentation but is currently skipped. This test could be valuable for verifying proper cleanup of resources during node failures, which aligns with the PR's objectives around node shutdown scenarios.
e2e/tests/negative/stress_cpu.robot (1)
Line range hint 15-89
: Consider adding a test case for node shutdown scenario.
While the existing test cases provide good coverage for CPU stress scenarios, they don't directly address the PR objective of testing node shutdown and auto-reattachment behavior. Consider adding a new test case that specifically tests this scenario.
Example structure for the new test case:
Stress Volume Node CPU During Node Shutdown And Auto Reattachment
    Given Create volume 0 with    size=5Gi    numberOfReplicas=3
    And Attach volume 0
    And Write data to volume 0
    FOR    ${i}    IN RANGE    ${LOOP_COUNT}
        When Stress CPU of node with volume 0
        And Shutdown volume node
        And Wait for volume auto-reattachment to new node
        Then Check volume 0 data is intact
    END
Let's verify if similar test cases exist in other files:
e2e/tests/negative/stress_memory.robot (3)
6-6
: LGTM! Good practice centralizing variables.
The addition of variables.resource improves maintainability by centralizing variable definitions, which is particularly important for test parameters like ${LOOP_COUNT} that are used across multiple test cases.
Line range hint 19-31
: Add documentation about expected behavior under memory stress.
The test case verifies data integrity during replica rebuilding under memory stress, but it would be helpful to document:
- Expected memory thresholds
- Expected system behavior under stress
- Success criteria beyond data integrity
Let's verify the memory stress implementation:
✅ Verification successful
Add documentation about memory stress test configuration and behavior
The test uses stress-ng with the following configuration:
- Memory load: 80% of available memory
- Single VM worker process
- 1-hour timeout period
- Stress pod runs on the same node as the volume
Consider documenting:
- The memory stress configuration in test description
- Expected behavior: system should maintain data integrity while under 80% memory pressure
- Success criteria: volume replica rebuilding completes successfully under memory stress
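As a rough illustration of what such a note could state, the configuration reported above maps to a stress-ng invocation along these lines (flag values restated from the review findings; the constant name is made up for the example):

# Illustrative only: mirrors the reported 80% memory load, one VM worker, 1-hour timeout.
STRESS_NG_MEMORY_ARGS = [
    "stress-ng",
    "--vm", "1",           # single VM worker process
    "--vm-bytes", "80%",   # consume 80% of available memory
    "--timeout", "1h",     # stop the stressor after one hour
]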
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check how memory stress is implemented
# Expected: Find the stress memory implementation to understand thresholds
ast-grep --pattern 'Stress memory of node with volume $_'
rg -A 5 'Stress memory of node with volume'
Length of output: 1356
Script:
#!/bin/bash
# Let's check the implementation of stress_node_memory functions
rg -A 10 'stress_node_memory_by_volume|stress_node_memory_of_all_volumes'
# Also check if there are any configuration or constant files that might define thresholds
fd -e yaml -e py -e robot -e resource | xargs rg -l 'stress.*memory|memory.*stress'
Length of output: 2860
Script:
#!/bin/bash
# Let's check the actual memory stress implementation in stress.py
cat e2e/libs/node/stress.py
Length of output: 5186
Script:
#!/bin/bash
# Let's check the constant values used for memory stress
cat e2e/libs/node/constant.py
Length of output: 373
Line range hint 33-47
: Consider potential race conditions in stress test loops.
The remaining test cases follow similar patterns of stressing memory during volume operations. Consider:
- Adding retry mechanisms for volume operations that might fail under stress
- Implementing proper cleanup between iterations
- Adding health checks before proceeding to the next iteration
Let's verify the retry mechanisms:
Also applies to: 49-67, 69-89
✅ Verification successful
Based on the search results, I can see that there are already retry mechanisms and health checks implemented in the test framework:
- The Wait for volume healthy keyword is consistently used after volume operations
- Data integrity checks are performed using Check volume data is intact
- The test framework has proper cleanup between iterations with detach/attach cycles
- Health checks are performed before proceeding to next iterations
The review comment's concerns about retry mechanisms and health checks are already addressed in the codebase.
Let me verify the cleanup mechanisms between iterations:
Let me try a different search pattern to verify the cleanup mechanisms:
The review comment's concerns are already properly addressed in the codebase.
Looking at the test file and previous search results, I can confirm that:
- Retry mechanisms: The framework uses Wait for volume healthy and Wait for volume detached keywords consistently to ensure operations complete
- Cleanup between iterations: The test properly detaches and reattaches volumes between iterations with proper wait states
- Health checks: Each iteration includes data integrity verification with Check volume data is intact and volume health checks with Wait for volume healthy
The test follows best practices for stress testing with proper cleanup, verification, and health checks between iterations. No changes are needed.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for retry implementations in volume operations
# Expected: Find retry mechanisms for volume operations under stress
rg -A 5 '(Wait for|Wait until).*volume.*healthy'
ast-grep --pattern 'Wait for volume $_ healthy'
Length of output: 53636
Script:
#!/bin/bash
# Check cleanup implementations between iterations
rg -A 5 'Detach volume.*\s+.*Wait for volume.*detached'
Length of output: 57
Script:
#!/bin/bash
# Check cleanup implementations between iterations
rg -A 2 'Detach volume.*\s+And Wait for volume'
# Also check the test file directly
cat e2e/tests/negative/stress_memory.robot
Length of output: 2641
e2e/tests/negative/stress_filesystem.robot (2)
6-6
: LGTM: Good practice to centralize variable definitions.
The addition of variables.resource aligns with best practices by centralizing common variable definitions, making the test framework more maintainable.
Line range hint 19-29
: Consider enhancing error handling in the replica rebuilding test.
The test "Stress Volume Node Filesystem When Replica Is Rebuilding" should verify that the filesystem remains stable during rebuilding. Consider adding additional validation steps.
Consider adding these validation steps within the loop:
And Verify volume 0 filesystem is accessible
And Verify no I/O errors in volume 0 logs
e2e/tests/regression/test_v2.robot (2)
6-6
: Verify variables migration completeness
The addition of variables.resource and removal of inline variables is a good practice for maintainability. However, we should verify that all necessary variables (especially ${LOOP_COUNT}) are properly defined in the new resource file.
✅ Verification successful
Variables migration is complete and correct
The variables.resource file contains all essential variables including ${LOOP_COUNT}, ${RETRY_COUNT}, ${RETRY_INTERVAL}, and other global variables needed for testing. The migration to a centralized variable definition is properly implemented.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify all variables used in the test file are defined in the resource file
# Check if LOOP_COUNT and other variables are defined in the new resource file
rg -A 1 "^\*\*\* Variables \*\*\*" "../keywords/variables.resource"
rg "LOOP_COUNT" "../keywords/variables.resource"
Length of output: 256
Script:
#!/bin/bash
# Let's first find the correct path to the variables.resource file
fd "variables.resource"
# Then check the content of test_v2.robot to see what variables are being used
rg "\${[A-Z_]+}" "e2e/tests/regression/test_v2.robot"
Length of output: 203
Script:
#!/bin/bash
# First, let's check the content of the variables resource file
cat e2e/keywords/variables.resource
# Then check variables used in test_v2.robot with a corrected regex
rg "\\\${[A-Z_]+}" "e2e/tests/regression/test_v2.robot"
Length of output: 477
54-54
: Test coverage gap for node shutdown scenario
While the cluster tag addition is appropriate for this test case, I notice that this file doesn't include the test case mentioned in the PR objectives for testing volume node shutdown and auto-reattachment scenarios (issue #9858).
Would you like me to help draft a new test case that specifically addresses the share manager node shutdown scenario described in issue #9858?
e2e/tests/negative/uninstallation_checks.robot (2)
Line range hint 24-92
: LGTM on test structure and documentation!
The test case is well-documented with clear prerequisites, steps, and environment requirements. The implementation follows good practices:
- Tests both v1 and v2 data engines
- Includes proper data verification
- Validates cross-cluster DR functionality
6-6
: Verify variable definitions in the new resource file.
The addition of variables.resource and removal of local variable declarations is a good practice for centralization. However, we should ensure all required variables are properly defined.
✅ Verification successful
Resource file change verified successfully
The variables.resource file properly defines all the required variables including ${LOOP_COUNT}, ${RETRY_COUNT}, and ${RETRY_INTERVAL}. The file is correctly located at e2e/keywords/variables.resource and is used by multiple test files in the e2e test suite. The uninstallation_checks.robot file doesn't contain any local variable declarations that would conflict with the centralized variables.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that all required variables are defined in the new resource file
# and check for any potential impacts on other test files.
# Check the content of the variables resource file
echo "Checking variables.resource content:"
cat ../keywords/variables.resource
# Look for any remaining direct references to the moved variables
echo -e "\nChecking for any remaining references to variables in test files:"
rg "(\${LOOP_COUNT}|\${RETRY_COUNT}|\${RETRY_INTERVAL})" ./e2e/tests/
Length of output: 532
Script:
#!/bin/bash
# Let's fix the path and regex issues in our verification
# First, let's find the correct path to the variables.resource file
echo "Looking for variables.resource file:"
fd "variables.resource"
# Check for variable references with properly escaped regex
echo -e "\nChecking for variable references in test files:"
rg '\$\{LOOP_COUNT\}|\$\{RETRY_COUNT\}|\$\{RETRY_INTERVAL\}' e2e/tests/
# Also check the content of the uninstallation_checks.robot to understand variable usage
echo -e "\nChecking uninstallation_checks.robot content:"
cat e2e/tests/negative/uninstallation_checks.robot
Length of output: 7391
Script:
#!/bin/bash
# Let's check the content of the variables.resource file since we found its correct path
echo "Checking variables.resource content:"
cat e2e/keywords/variables.resource
# Also check if there are any variable declarations in the uninstallation_checks.robot
# that might have been moved to variables.resource
echo -e "\nChecking for any variable declarations in uninstallation_checks.robot:"
rg '^\$\{.*\}.*=' e2e/tests/negative/uninstallation_checks.robot
Length of output: 633
e2e/tests/negative/node_delete.robot (1)
6-6
: Verify all required variables are defined in variables.resource
The centralization of variables is a good practice. However, we need to ensure that all previously defined variables (${LOOP_COUNT}, ${RETRY_COUNT}, ${RETRY_INTERVAL}, ${RWX_VOLUME_FAST_FAILOVER}, ${DATA_ENGINE}) are properly defined in the new resource file.
✅ Verification successful
All required variables are properly defined in variables.resource
The verification confirms that all necessary variables have been successfully moved to the centralized resource file:
- ${LOOP_COUNT} = 1
- ${RETRY_COUNT} = 300
- ${RETRY_INTERVAL} = 1
- ${RWX_VOLUME_FAST_FAILOVER} = false
- ${DATA_ENGINE} = v1
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify all required variables are defined in the new resource file
# Check if all required variables are defined in variables.resource
echo "Checking for required variables in variables.resource..."
rg -A 1 "(\${LOOP_COUNT}|\${RETRY_COUNT}|\${RETRY_INTERVAL}|\${RWX_VOLUME_FAST_FAILOVER}|\${DATA_ENGINE})" "../keywords/variables.resource"
Length of output: 434
Script:
#!/bin/bash
# Description: Verify all required variables are defined in the new resource file
# Check if all required variables are defined in variables.resource
echo "Checking for required variables in variables.resource..."
# Search for each variable separately to avoid regex issues
echo "Searching for LOOP_COUNT..."
rg "LOOP_COUNT" "e2e/keywords/variables.resource"
echo "Searching for RETRY_COUNT..."
rg "RETRY_COUNT" "e2e/keywords/variables.resource"
echo "Searching for RETRY_INTERVAL..."
rg "RETRY_INTERVAL" "e2e/keywords/variables.resource"
echo "Searching for RWX_VOLUME_FAST_FAILOVER..."
rg "RWX_VOLUME_FAST_FAILOVER" "e2e/keywords/variables.resource"
echo "Searching for DATA_ENGINE..."
rg "DATA_ENGINE" "e2e/keywords/variables.resource"
# Also check the test file to see what variables were previously defined
echo "Checking git history for previously defined variables..."
git log -p "e2e/tests/negative/node_delete.robot" | rg "^\-.*(\$\{.*\})"
Length of output: 992
e2e/tests/negative/pull_backup_from_another_longhorn.robot (1)
6-6
: LGTM: Good refactoring of variable management
The addition of variables.resource aligns with best practices by centralizing variable definitions, making the test framework more maintainable.
e2e/tests/negative/cluster_restart.robot (1)
6-6
: Verify the variables.resource file setup
The centralization of variables into variables.resource is a good practice. However, let's ensure all required variables are properly defined.
✅ Verification successful
Resource file setup is correct and properly configured
The variables.resource file exists at e2e/keywords/variables.resource and contains all the required variables (LOOP_COUNT, RETRY_COUNT, RETRY_INTERVAL, CONTROL_PLANE_NODE_NETWORK_LATENCY_IN_MS, RWX_VOLUME_FAST_FAILOVER, DATA_ENGINE). The import path in cluster_restart.robot correctly references this file using a relative path.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the variables.resource file exists and contains all required variables
# Check if the file exists and contains the required variables
rg -U "(?s)\*\*\* Variables \*\*\*.*\\\${LOOP_COUNT}.*\\\${RETRY_COUNT}.*\\\${RETRY_INTERVAL}.*\\\${CONTROL_PLANE_NODE_NETWORK_LATENCY_IN_MS}.*\\\${RWX_VOLUME_FAST_FAILOVER}.*\\\${DATA_ENGINE}" ../keywords/variables.resource
Length of output: 502
Script:
#!/bin/bash
# Let's first check if the variables.resource file exists
fd "variables.resource" ../keywords/
# Then let's examine its content
rg "^\\\$\{.*\}" ../keywords/variables.resource
# Also check for Variables section marker
rg "\*\*\* Variables \*\*\*" ../keywords/variables.resource
Length of output: 388
Script:
#!/bin/bash
# Let's search for the variables.resource file in the entire repository
fd "variables.resource"
# Let's also check the actual path relative to the test file
ls -l e2e/tests/keywords/variables.resource
Length of output: 186
Script:
#!/bin/bash
# Let's examine the content of the found variables.resource file
cat e2e/keywords/variables.resource
# Also check the import statement in the test file
cat e2e/tests/negative/cluster_restart.robot | head -n 10
Length of output: 682
e2e/libs/host/aws.py (2)
71-71
: LGTM! Well-designed parameter addition.
The new waiting parameter with default value True maintains backward compatibility while enabling more flexible testing scenarios. This change aligns well with the PR objective of testing node shutdown behaviors.
77-78
: Verify AWS waiter timeout configuration.
Since this code is critical for testing share manager behavior during node shutdown (issue #9858), we should verify the default timeout settings of the AWS waiter are sufficient.
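For reference, a minimal sketch of how the stop waiter's timeout can be inspected or tuned with boto3 (instance id and region are placeholders; the Delay/MaxAttempts values shown are boto3's documented defaults for the instance_stopped waiter, roughly a 10-minute window):

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")  # placeholder region
waiter = ec2.get_waiter("instance_stopped")
# Defaults poll every 15s for up to 40 attempts; raise MaxAttempts if nodes shut down slowly.
waiter.wait(
    InstanceIds=["i-0123456789abcdef0"],  # placeholder instance id
    WaiterConfig={"Delay": 15, "MaxAttempts": 40},
)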
e2e/tests/regression/test_backup.robot (1)
6-6
: Verify the variables.resource file setup and variable accessibility.
The addition of variables.resource aligns with the effort to centralize variable management. This is a good practice for maintainability and consistency.
Let's verify the variables.resource file setup:
✅ Verification successful
Resource file setup is correct and contains the required variables
The variables.resource file exists at the correct location (e2e/keywords/variables.resource) and contains all the necessary variables including ${LOOP_COUNT}, ${RETRY_COUNT}, ${RETRY_INTERVAL}, and ${DATA_ENGINE}. The file path in the test file (../keywords/variables.resource) correctly references this centralized variable file.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that variables.resource exists and contains the necessary variables
# Check if the file exists and contains the required variables
rg -l "^\$\{(LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|DATA_ENGINE)\}" ../keywords/variables.resource
# Check for any remaining direct variable declarations in other test files
# that should have been moved to variables.resource
rg "^\$\{(LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|DATA_ENGINE)\}" "e2e/tests/**/*.robot"
Length of output: 318
Script:
#!/bin/bash
# Let's first find the correct path to the variables.resource file
fd "variables.resource" .
# Then check for any variable declarations in Robot Framework files
fd -e robot . --exec rg -l "^\$\{(LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|DATA_ENGINE)\}"
# Also check the content of the test_backup.robot file to understand the correct relative path
cat e2e/tests/regression/test_backup.robot
Length of output: 4327
Script:
#!/bin/bash
# Let's check the content of the variables.resource file
cat e2e/keywords/variables.resource
# Also check if these variables are used in the test file
rg "\${(LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|DATA_ENGINE)}" e2e/tests/regression/test_backup.robot
Length of output: 566
e2e/libs/keywords/sharemanager_keywords.py (1)
95-95
: Great addition of phase logging
The added logging statement improves test observability by showing the current phase of the share manager pod during the waiting period. This will be helpful for debugging test failures.
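A log line of roughly this shape is what the comment refers to (the exact message text is assumed, not quoted from the diff):

# Illustrative only: surfaces the pod phase on every retry while waiting for the share manager pod.
logging(f"Waiting for sharemanager pod {sharemanager_pod_name} to be running, currently {sharemanager_pod.status.phase} ... ({i})")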
e2e/libs/host/harvester.py (2)
56-56
: LGTM! The non-blocking power-off option is a good addition.
The new waiting parameter with a default value of True maintains backward compatibility while adding flexibility for batch operations. The early return when waiting=False is clean and straightforward.
Also applies to: 71-73
Line range hint 56-89
: Consider adding additional power state validation.
Given this code is critical for testing scenarios like share manager recovery after node shutdown, consider adding additional validation to ensure the VM is truly powered off before returning.
Suggested additional validation:
 def power_off_node(self, node_name, waiting=True):
     vm_id = self.mapping[node_name]
     url = f"{self.url}/{vm_id}"
+    initial_state = None
+    try:
+        resp = requests.get(url, cookies=self.cookies, verify=False)
+        initial_state = resp.json()['metadata']['fields'].get('status', None)
+    except Exception as e:
+        logging(f"Failed to get initial VM state: {e}")
+
     for i in range(self.retry_count):
         logging(f"Trying to stop vm {vm_id} ... ({i})")
e2e/tests/regression/test_settings.robot (1)
6-6
: LGTM! Good practice to centralize variable definitions.
The addition of variables.resource improves maintainability by centralizing variable definitions. This change aligns with Robot Framework best practices.
Let's verify that all required variables are properly defined in the new resource file:
✅ Verification successful
Let me try to locate the correct path for the variables resource file:
Let's check the content of the variables.resource file and look for any duplicate variable definitions:
Resource file addition verified and looks good
The verification shows that:
- The variables.resource file exists at the correct location (e2e/keywords/variables.resource)
- It contains global variable definitions that are properly scoped
- The test file test_settings.robot correctly references it using a relative path
- The test file uses ${DATA_ENGINE}, which is defined in the variables resource file
- No conflicting variable definitions were found in the test file
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify all required variables are defined in the new resource file
# and that no other test files still define these variables locally
# Check if all required variables are defined in the new resource file
echo "Checking variables.resource for required variables..."
cat ../keywords/variables.resource
# Check if any other test files still define these variables
echo "Checking for duplicate variable definitions..."
rg -l "^\*\*\* Variables \*\*\*" "e2e/tests/"
Length of output: 410
Script:
#!/bin/bash
# First, let's find the actual path of the variables.resource file
fd -e resource variables
# Check the content of test_settings.robot to understand the relative path
cat e2e/tests/regression/test_settings.robot
# List all resource files to understand the structure
fd -e resource
Length of output: 5504
Script:
#!/bin/bash
# Check the content of the variables resource file
cat e2e/keywords/variables.resource
# Look for any variables defined in test_settings.robot
echo -e "\nChecking for variables in test_settings.robot:"
rg "^\$\{.*\}" e2e/tests/regression/test_settings.robot
# Check for any duplicate variable definitions in other files
echo -e "\nChecking for duplicate variable definitions in other files:"
rg "^\$\{.*\}" e2e/tests/
Length of output: 654
e2e/tests/negative/replica_rebuilding.robot (1)
6-6
: Verify variable definitions in the new resource file.
The centralization of variables is a good practice. However, we should verify that all required variables (${LOOP_COUNT}, ${RETRY_COUNT}, ${RETRY_INTERVAL}, ${DATA_ENGINE}) are properly defined in the new resource file.
✅ Verification successful
All required variables are properly defined in the resource file
The variables.resource file contains all the necessary variable definitions that are used in replica_rebuilding.robot:
- ${LOOP_COUNT} = 1
- ${RETRY_COUNT} = 300
- ${RETRY_INTERVAL} = 1
- ${DATA_ENGINE} = v1
The centralization of these variables is correctly implemented and all test cases in replica_rebuilding.robot can access these variables through the resource import.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that all required variables are defined in the new resource file
# Check if the variables are defined in the resource file
rg -l "(?:\$\{LOOP_COUNT\}|\$\{RETRY_COUNT\}|\$\{RETRY_INTERVAL\}|\$\{DATA_ENGINE\})" ../keywords/variables.resource
Length of output: 185
Script:
#!/bin/bash
# Let's first find the correct path to the variables.resource file
fd variables.resource
# Then check the content of the file for the required variables
rg -l "(?:\$\{LOOP_COUNT\}|\$\{RETRY_COUNT\}|\$\{RETRY_INTERVAL\}|\$\{DATA_ENGINE\})"
Length of output: 139
Script:
#!/bin/bash
# Now that we found the file, let's check its content for the required variables
rg "(?:\$\{LOOP_COUNT\}|\$\{RETRY_COUNT\}|\$\{RETRY_INTERVAL\}|\$\{DATA_ENGINE\})" e2e/keywords/variables.resource -A 1
# Also check how these variables are defined
rg "^(?:\$\{LOOP_COUNT\}|\$\{RETRY_COUNT\}|\$\{RETRY_INTERVAL\}|\$\{DATA_ENGINE\}).*=.*" e2e/keywords/variables.resource
Length of output: 236
Script:
#!/bin/bash
# Let's try a different approach to check variable definitions in Robot Framework syntax
cat e2e/keywords/variables.resource
# Also check where these variables are currently used in the test file
cat e2e/tests/negative/replica_rebuilding.robot
Length of output: 5173
e2e/tests/regression/test_basic.robot (2)
6-6
: LGTM! Good refactoring of variable management.
The addition of variables.resource and removal of inline variable declarations is a positive change that:
- Centralizes variable management
- Reduces duplication across test files
- Improves maintainability
Line range hint 1-1
: Verify location of the new test case.
The PR objectives mention adding a test case for "Shutdown Volume Node And Test Auto Reattach To A New Node", but this test case is not present in this file. Please verify:
- If the test case was added to a different file
- If the test case still needs to be added
This verification is important as the test case is crucial for validating the fix for issue #9858 where the share manager becomes stuck when a node is shut down.
✅ Verification successful
Let me check the content of the identified file to verify if it contains the test case.
The test case has been found and is properly implemented.
The test case "Shutdown Volume Node And Test Auto Reattach To A New Node" is present in e2e/tests/negative/node_reboot.robot
. The test case verifies:
- Volume node shutdown scenario for both RWO and RWX volumes
- Auto reattachment of volumes to new nodes
- Data integrity after node shutdown and reattachment
- Share manager pod recovery for RWX volumes
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for the new test case across all robot files
echo "Searching for the new test case..."
rg -l "Shutdown.*Volume.*Node.*Auto.*Reattach" --type robot
# Search for any recent changes to robot test files
echo "Checking recent changes to robot test files..."
git diff --name-only HEAD~1 | grep "\.robot$"
Length of output: 1366
Script:
#!/bin/bash
# Check the content of node_reboot.robot for the test case
echo "Content of e2e/tests/negative/node_reboot.robot:"
cat e2e/tests/negative/node_reboot.robot
Length of output: 22057
e2e/tests/negative/network_disconnect.robot (2)
6-6
: LGTM! Good practice to centralize variable management.
The addition of variables.resource improves maintainability by centralizing variable definitions that were previously declared locally.
Line range hint 1-1
: Add more specific assertions for volume state and health.
While the test cases cover the scenarios well, consider adding more specific assertions to verify:
- Volume attachment state transitions
- Replica health status
- Data consistency after network recovery
- Resource cleanup after test completion
This would help catch subtle issues and ensure proper system recovery.
e2e/tests/negative/test_backup_listing.robot (1)
5-5
: LGTM! Good improvements to test organization.
The addition of the 'negative' tag and centralization of variables through variables.resource improves test organization and maintainability.
Also applies to: 7-7
e2e/tests/negative/component_resilience.robot (2)
6-6
: Verify the new resource file exists and contains the required variables.
The change to centralize variable management is good practice. However, we need to ensure all required variables (LOOP_COUNT, RETRY_COUNT, RETRY_INTERVAL, RWX_VOLUME_FAST_FAILOVER, DATA_ENGINE) are properly defined in the new resource file.
✅ Verification successful
Resource file verification successful
The variables.resource file exists at e2e/keywords/variables.resource and contains all the required variables with appropriate default values:
- ${LOOP_COUNT} = 1
- ${RETRY_COUNT} = 300
- ${RETRY_INTERVAL} = 1
- ${RWX_VOLUME_FAST_FAILOVER} = false
- ${DATA_ENGINE} = v1
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check if the variables resource file exists and contains the required variables
rg -A 1 "^\$\{(LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|RWX_VOLUME_FAST_FAILOVER|DATA_ENGINE)\}" "e2e/tests/keywords/variables.resource"
Length of output: 210
Script:
#!/bin/bash
# Let's first check if the file exists in a different location
fd -g "*variables.resource"
# Also check for any Robot Framework resource files that might contain these variables
rg -A 1 "^\$\{(LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|RWX_VOLUME_FAST_FAILOVER|DATA_ENGINE)\}" -g "*.resource"
Length of output: 172
Script:
#!/bin/bash
# Let's check the content of the variables.resource file
cat e2e/keywords/variables.resource
# Also check where these variables were defined previously in the component_resilience.robot file
rg -A 1 "^\$\{(LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|RWX_VOLUME_FAST_FAILOVER|DATA_ENGINE)\}" "e2e/tests/negative/component_resilience.robot"
Length of output: 467
Line range hint 142-166
: Clarify data engine restrictions on test cases.
Several test cases are only executed when DATA_ENGINE='v1', including backing image and RWX volume tests. Please clarify:
- Why are these tests restricted to v1 engine?
- Should we have equivalent tests for other data engines?
- Are there specific limitations that prevent these tests from running with other engines?
Also applies to: 208-241
✅ Verification successful
Based on the gathered information, I can now provide a final response:
Tests are correctly restricted to v1 data engine for backing image and RWX volume features
The restriction of certain tests to DATA_ENGINE='v1' is intentional and correct because:
- The backing image feature is specifically designed for the v1 data engine, as evidenced by dedicated v2 engine tests in test_v2.robot that don't include backing image operations
- The tests are properly structured with conditional blocks to handle engine-specific features
- The default data engine is set to 'v1' in variables.resource, with separate test files for v2-specific functionality
The test implementation follows good practices by:
- Using clear conditionals to separate engine-specific features
- Having dedicated test files for v2 engine features
- Maintaining proper test coverage for both engines in their supported feature sets
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Check for data engine related configurations and documentation
rg -A 5 "DATA_ENGINE" "e2e/tests/keywords/variables.resource"
rg -A 5 "dataEngine" "e2e/tests/keywords/"
Length of output: 238
Script:
#!/bin/bash
# Let's first find the correct paths for test files and keywords
fd "variables" "e2e"
# Then search for DATA_ENGINE usage across e2e tests
rg -A 5 "DATA_ENGINE" "e2e"
# Search for dataEngine configurations
rg -A 5 "dataEngine" "e2e"
# Look for documentation about data engines
fd "README|\.md" "e2e"
Length of output: 84892
Script:
#!/bin/bash
# Let's check the README for data engine documentation
cat e2e/README.md
# Let's also check for v2 data engine specific tests
rg "dataEngine=v2" "e2e/tests"
# And check for any comments or documentation about data engine restrictions
rg -B 2 -A 2 "DATA_ENGINE.*v1" "e2e"
Length of output: 8071
e2e/tests/negative/node_drain.robot (1)
6-6
: Verify variable definitions in the new resource file
The centralization of variables is a good practice. However, we should verify that all required variables (${LOOP_COUNT}, ${RETRY_COUNT}, ${RETRY_INTERVAL}, ${RWX_VOLUME_FAST_FAILOVER}, ${DATA_ENGINE}) are properly defined in variables.resource.
✅ Verification successful
All required variables are properly defined in variables.resource
The variables.resource file contains all the required variables mentioned in the test file:
- ${LOOP_COUNT} = 1
- ${RETRY_COUNT} = 300
- ${RETRY_INTERVAL} = 1
- ${RWX_VOLUME_FAST_FAILOVER} = false
- ${DATA_ENGINE} = v1
The centralization of these variables in variables.resource is correct and follows good practices. The test file properly imports this resource file and uses these variables consistently throughout the test cases.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if all required variables are defined in the new resource file
# Check for required variables in the resource file
rg -l "(?:LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|RWX_VOLUME_FAST_FAILOVER|DATA_ENGINE)" "e2e/tests/keywords/variables.resource"
Length of output: 204
Script:
#!/bin/bash
# Let's first find the correct path of the variables.resource file
fd "variables.resource"
# Then check the content of the file for the required variables
rg "(?:LOOP_COUNT|RETRY_COUNT|RETRY_INTERVAL|RWX_VOLUME_FAST_FAILOVER|DATA_ENGINE)" -A 1
Length of output: 105799
Script:
#!/bin/bash
# Let's check the content of the variables.resource file
cat e2e/keywords/variables.resource
# Also check if there are any other variable definitions in the test file
cat e2e/tests/negative/node_drain.robot
Length of output: 13088
e2e/tests/negative/node_reboot.robot (3)
6-7
: LGTM: Resource imports align with test requirements
The addition of variables.resource and sharemanager.resource appropriately supports the new test case requirements for RWX volume and share manager testing.
24-54
: Well-structured test implementation with comprehensive verification
The test case effectively validates the auto reattachment behavior with a thorough sequence of steps:
- Creates both RWO and RWX volumes
- Verifies initial deployment health
- Writes test data
- Simulates node shutdown
- Validates sharemanager recovery
- Verifies data integrity and functionality
This implementation directly addresses the issue #9858 regarding share manager recovery after node shutdown.
319-319
: LGTM: Consistent power on handling across test cases
The modifications maintain a consistent approach to power on handling across different test scenarios, ensuring proper system recovery and test cleanup.
Also applies to: 337-337, 358-358, 375-375, 396-396, 412-412
FOR    ${powered_off_node}    IN    @{powered_off_nodes}
    Run keyword And Ignore Error    power_on_node_by_name    ${powered_off_node}
    Remove Values From List    ${powered_off_nodes}    ${powered_off_node}
END
💡 Codebase verification
🛠️ Refactor suggestion
List modification is safe, but error handling can be improved
After analyzing the codebase, I found that:
- The @{powered_off_nodes} list is properly initialized as an empty list in variables.resource
- Nodes are added to the list via the Power off node keyword in host.resource
- The same list modification pattern is used in both common.resource and host.resource, indicating it's an established pattern
- The list modification during iteration is safe because:
- Each node is processed only once
- The list is only used for cleanup purposes
- The same pattern is consistently used across files
However, the error handling suggestion from the original review is still valid:
FOR ${powered_off_node} IN @{powered_off_nodes}
- Run keyword And Ignore Error power_on_node_by_name ${powered_off_node}
- Remove Values From List ${powered_off_nodes} ${powered_off_node}
+ ${status} ${error}= Run Keyword And Ignore Error power_on_node_by_name ${powered_off_node}
+ Remove Values From List ${powered_off_nodes} ${powered_off_node}
+ IF '${status}' == 'FAIL'
+ Log Failed to power on node ${powered_off_node}: ${error} WARN
+ END
END
🔗 Analysis chain
Improve error handling and list modification safety
While the loop effectively handles multiple powered-off nodes, there are a few concerns to address:
- The node is removed from the list regardless of whether the power-on operation succeeds.
- Modifying the list while iterating over it could lead to unexpected behavior.
- The @{powered_off_nodes} list's initialization is not visible in this file.
Consider this safer implementation:
- FOR ${powered_off_node} IN @{powered_off_nodes}
- Run keyword And Ignore Error power_on_node_by_name ${powered_off_node}
- Remove Values From List ${powered_off_nodes} ${powered_off_node}
- END
+ ${nodes_to_process}= Copy List ${powered_off_nodes}
+ FOR ${powered_off_node} IN @{nodes_to_process}
+ ${status} ${error}= Run Keyword And Ignore Error power_on_node_by_name ${powered_off_node}
+ IF '${status}' == 'PASS'
+ Remove Values From List ${powered_off_nodes} ${powered_off_node}
+ ELSE
+ Log Failed to power on node ${powered_off_node}: ${error} WARN
+ END
+ END
Let's verify the initialization of the powered_off_nodes variable:
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find where powered_off_nodes is initialized
# Expected: Find the variable declaration in variables.resource or similar file
rg -l "powered_off_nodes" --type robot
Length of output: 38
Script:
#!/bin/bash
# Let's try a broader search without type filtering to find any references
rg -l "powered_off_nodes"
Length of output: 144
Script:
#!/bin/bash
# Let's also check the specific file content to understand the context better
cat "e2e/keywords/common.resource"
Length of output: 2476
Script:
#!/bin/bash
# Search for any power-related keywords that might help understand the context
rg "power_on_node_by_name" -A 2 -B 2
Length of output: 1149
Script:
#!/bin/bash
# Let's check the variables.resource file to find the initialization
cat "e2e/keywords/variables.resource"
Length of output: 326
Script:
#!/bin/bash
# Let's also check host.resource since it has similar power-related code
cat "e2e/keywords/host.resource"
Length of output: 1735
def wait_for_sharemanager_pod_restart(self, name):
    sharemanager_pod_name = "share-manager-" + name
    sharemanager_pod = get_pod(sharemanager_pod_name, "longhorn-system")
    last_creation_time = sharemanager_pod.metadata.creation_timestamp

    retry_count, retry_interval = get_retry_count_and_interval()
    for i in range(retry_count):
        logging(f"Waiting for sharemanager for volume {name} restart ... ({i})")
        time.sleep(retry_interval)
        sharemanager_pod = get_pod(sharemanager_pod_name, "longhorn-system")
        if sharemanager_pod == None:
            continue
        creation_time = sharemanager_pod.metadata.creation_timestamp
        logging(f"Getting new sharemanager which is created at {creation_time}, and old one is created at {last_creation_time}")
        if creation_time > last_creation_time:
            return

    assert False, f"sharemanager pod {sharemanager_pod_name} isn't restarted"
🛠️ Refactor suggestion
Good addition, but some improvements needed
The new method effectively monitors share manager pod restarts with good logging. However, there are a few improvements to consider:
- Follow Python best practices:
- if sharemanager_pod == None:
+ if sharemanager_pod is None:
- Use proper assertion:
- assert False, f"sharemanager pod {sharemanager_pod_name} isn't restarted"
+ raise AssertionError(f"sharemanager pod {sharemanager_pod_name} isn't restarted")
- Consider extracting common code with delete_sharemanager_pod_and_wait_for_recreation, as both methods share similar logic for monitoring pod recreation.
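One possible shape for that shared helper, sketched under the assumption that both callers only need the "wait until a newer pod exists" part (the method name is illustrative, the helpers it calls appear in the reviewed code):

def _wait_for_newer_sharemanager_pod(self, name, last_creation_time):
    # Shared polling loop: wait until a share manager pod newer than last_creation_time appears.
    sharemanager_pod_name = "share-manager-" + name
    retry_count, retry_interval = get_retry_count_and_interval()
    for i in range(retry_count):
        logging(f"Waiting for sharemanager pod {sharemanager_pod_name} to be recreated ... ({i})")
        time.sleep(retry_interval)
        pod = get_pod(sharemanager_pod_name, "longhorn-system")
        if pod is None:
            continue
        if pod.metadata.creation_timestamp > last_creation_time:
            return pod
    raise AssertionError(f"sharemanager pod {sharemanager_pod_name} wasn't recreated")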
📝 Committable suggestion
def wait_for_sharemanager_pod_restart(self, name):
    sharemanager_pod_name = "share-manager-" + name
    sharemanager_pod = get_pod(sharemanager_pod_name, "longhorn-system")
    last_creation_time = sharemanager_pod.metadata.creation_timestamp

    retry_count, retry_interval = get_retry_count_and_interval()
    for i in range(retry_count):
        logging(f"Waiting for sharemanager for volume {name} restart ... ({i})")
        time.sleep(retry_interval)
        sharemanager_pod = get_pod(sharemanager_pod_name, "longhorn-system")
        if sharemanager_pod is None:
            continue
        creation_time = sharemanager_pod.metadata.creation_timestamp
        logging(f"Getting new sharemanager which is created at {creation_time}, and old one is created at {last_creation_time}")
        if creation_time > last_creation_time:
            return

    raise AssertionError(f"sharemanager pod {sharemanager_pod_name} isn't restarted")
🧰 Tools
🪛 Ruff (0.8.0)
80-80: Comparison to None should be cond is None. Replace with cond is None (E711)
87-87: Do not assert False (python -O removes these calls); raise AssertionError() instead. Replace assert False (B011)
Which issue(s) this PR fixes:
Issue longhorn/longhorn#9858
What this PR does / why we need it:
add test case Shutdown Volume Node And Test Auto Reattach To A New Node
Special notes for your reviewer:
Additional documentation or context
Summary by CodeRabbit
Release Notes
New Features
Improvements
Bug Fixes
Documentation