test(robot): add test case Shutdown Volume Node And Test Auto Reattach To A New Node #2169

Open: wants to merge 1 commit into master
6 changes: 5 additions & 1 deletion e2e/keywords/common.resource
@@ -1,6 +1,7 @@
*** Settings ***
Documentation Common keywords

Library Collections
Library OperatingSystem
Library ../libs/keywords/common_keywords.py
Library ../libs/keywords/deployment_keywords.py
@@ -38,7 +39,10 @@ Set test environment
END

Cleanup test resources
Run keyword And Ignore Error power_on_node_by_name ${powered_off_node}
FOR ${powered_off_node} IN @{powered_off_nodes}
Run keyword And Ignore Error power_on_node_by_name ${powered_off_node}
Remove Values From List ${powered_off_nodes} ${powered_off_node}
END
Comment on lines +42 to +45
💡 Codebase verification

🛠️ Refactor suggestion

List modification is safe, but error handling can be improved

After analyzing the codebase, I found that:

  1. The @{powered_off_nodes} list is properly initialized as an empty list in variables.resource
  2. Nodes are added to the list via Power off node keyword in host.resource
  3. The same list modification pattern is used in both common.resource and host.resource, indicating it's an established pattern
  4. The list modification during iteration is safe because:
    • Each node is processed only once
    • The list is only used for cleanup purposes
    • The same pattern is consistently used across files

However, the error handling suggestion from the original review is still valid:

    FOR    ${powered_off_node}    IN    @{powered_off_nodes}
-        Run keyword And Ignore Error    power_on_node_by_name    ${powered_off_node}
-        Remove Values From List    ${powered_off_nodes}    ${powered_off_node}
+        ${status}    ${error}=    Run Keyword And Ignore Error    power_on_node_by_name    ${powered_off_node}
+        Remove Values From List    ${powered_off_nodes}    ${powered_off_node}
+        IF    '${status}' == 'FAIL'
+            Log    Failed to power on node ${powered_off_node}: ${error}    WARN
+        END
    END
🔗 Analysis chain

Improve error handling and list modification safety

While the loop effectively handles multiple powered-off nodes, there are a few concerns to address:

  1. The node is removed from the list regardless of whether the power-on operation succeeds.
  2. Modifying the list while iterating over it could lead to unexpected behavior.
  3. The @{powered_off_nodes} list's initialization is not visible in this file.

Consider this safer implementation:

-    FOR    ${powered_off_node}    IN    @{powered_off_nodes}
-        Run keyword And Ignore Error    power_on_node_by_name    ${powered_off_node}
-        Remove Values From List    ${powered_off_nodes}    ${powered_off_node}
-    END
+    ${nodes_to_process}=    Copy List    ${powered_off_nodes}
+    FOR    ${powered_off_node}    IN    @{nodes_to_process}
+        ${status}    ${error}=    Run Keyword And Ignore Error    power_on_node_by_name    ${powered_off_node}
+        IF    '${status}' == 'PASS'
+            Remove Values From List    ${powered_off_nodes}    ${powered_off_node}
+        ELSE
+            Log    Failed to power on node ${powered_off_node}: ${error}    WARN
+        END
+    END

Let's verify the initialization of the powered_off_nodes variable:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find where powered_off_nodes is initialized
# Expected: Find the variable declaration in variables.resource or similar file

rg -l "powered_off_nodes" --type robot

Length of output: 38


Script:

#!/bin/bash
# Let's try a broader search without type filtering to find any references
rg -l "powered_off_nodes"

Length of output: 144


Script:

#!/bin/bash
# Let's also check the specific file content to understand the context better
cat "e2e/keywords/common.resource"

Length of output: 2476


Script:

#!/bin/bash
# Search for any power-related keywords that might help understand the context
rg "power_on_node_by_name" -A 2 -B 2

Length of output: 1149


Script:

#!/bin/bash
# Let's check the variables.resource file to find the initialization
cat "e2e/keywords/variables.resource"

Length of output: 326


Script:

#!/bin/bash
# Let's also check host.resource since it has similar power-related code
cat "e2e/keywords/host.resource"

Length of output: 1735

uncordon_all_nodes
cleanup_control_plane_network_latency
reset_node_schedule
11 changes: 7 additions & 4 deletions e2e/keywords/host.resource
@@ -1,6 +1,7 @@
*** Settings ***
Documentation Physical Node Keywords

Library Collections
Library ../libs/keywords/common_keywords.py
Library ../libs/keywords/host_keywords.py
Library ../libs/keywords/network_keywords.py
@@ -34,11 +35,13 @@ Restart cluster
reboot_all_nodes
setup_control_plane_network_latency

Power on off node
Run keyword And Ignore Error
... power_on_node_by_name ${powered_off_node}
Power on off nodes
FOR ${powered_off_node} IN @{powered_off_nodes}
Run keyword And Ignore Error power_on_node_by_name ${powered_off_node}
Remove Values From List ${powered_off_nodes} ${powered_off_node}
END

Power off node ${node_id}
${powered_off_node} = get_node_by_index ${node_id}
Append to list ${powered_off_nodes} ${powered_off_node}
power_off_node_by_name ${powered_off_node}
Set Test Variable ${powered_off_node}
5 changes: 5 additions & 0 deletions e2e/keywords/sharemanager.resource
@@ -26,6 +26,11 @@ Delete sharemanager pod of deployment ${deployment_id} and wait for recreation
${volume_name} = get_workload_volume_name ${deployment_name}
delete_sharemanager_pod_and_wait_for_recreation ${volume_name}

Wait for sharemanager pod of deployment ${deployment_id} restart
${deployment_name} = generate_name_with_suffix deployment ${deployment_id}
${volume_name} = get_workload_volume_name ${deployment_name}
wait_for_sharemanager_pod_restart ${volume_name}

Wait for sharemanager pod of deployment ${deployment_id} running
${deployment_name} = generate_name_with_suffix deployment ${deployment_id}
${volume_name} = get_workload_volume_name ${deployment_name}
13 changes: 13 additions & 0 deletions e2e/keywords/variables.resource
@@ -0,0 +1,13 @@
*** Settings ***
Documentation Global Variables

*** Variables ***
${LOOP_COUNT} 1
${RETRY_COUNT} 300
${RETRY_INTERVAL} 1
${VOLUME_TYPE} RWO
${CONTROL_PLANE_NODE_NETWORK_LATENCY_IN_MS} 0
${RWX_VOLUME_FAST_FAILOVER} false
${DATA_ENGINE} v1

@{powered_off_nodes}=
11 changes: 10 additions & 1 deletion e2e/keywords/workload.resource
@@ -45,9 +45,18 @@ Power off volume node of ${workload_kind} ${workload_id}
${workload_name} = generate_name_with_suffix ${workload_kind} ${workload_id}
${volume_name} = get_workload_volume_name ${workload_name}
${powered_off_node} = get_volume_node ${volume_name}
Append to list ${powered_off_nodes} ${powered_off_node}
${last_volume_node} = get_volume_node ${volume_name}
power_off_volume_node ${volume_name}
Set Test Variable ${powered_off_node}
Set Test Variable ${last_volume_node}

Power off volume node of ${workload_kind} ${workload_id} without waiting
${workload_name} = generate_name_with_suffix ${workload_kind} ${workload_id}
${volume_name} = get_workload_volume_name ${workload_name}
${powered_off_node} = get_volume_node ${volume_name}
Append to list ${powered_off_nodes} ${powered_off_node}
${last_volume_node} = get_volume_node ${volume_name}
power_off_volume_node ${volume_name} waiting=False
Set Test Variable ${last_volume_node}

Reboot volume node of ${workload_kind} ${workload_id}
9 changes: 5 additions & 4 deletions e2e/libs/host/aws.py
@@ -68,14 +68,15 @@ def reboot_all_worker_nodes(self, shut_down_time_in_sec=NODE_REBOOT_DOWN_TIME_SE
waiter.wait(InstanceIds=instance_ids)
logging(f"Started instances")

def power_off_node(self, power_off_node_name):
def power_off_node(self, power_off_node_name, waiting=True):
instance_ids = [self.mapping[power_off_node_name]]
resp = self.aws_client.stop_instances(InstanceIds=instance_ids, Force=True)
assert resp['ResponseMetadata']['HTTPStatusCode'] == 200, f"Failed to stop instances {instance_ids} response: {resp}"
logging(f"Stopping instances {instance_ids}")
waiter = self.aws_client.get_waiter('instance_stopped')
waiter.wait(InstanceIds=instance_ids)
logging(f"Stopped instances")
if waiting:
waiter = self.aws_client.get_waiter('instance_stopped')
waiter.wait(InstanceIds=instance_ids)
logging(f"Stopped instances")

def power_on_node(self, power_on_node_name):
instance_ids = [self.mapping[power_on_node_name]]
2 changes: 1 addition & 1 deletion e2e/libs/host/base.py
@@ -23,7 +23,7 @@ def reboot_all_worker_nodes(self, shut_down_time_in_sec):
return NotImplemented

@abstractmethod
def power_off_node(self, node_name):
def power_off_node(self, node_name, waiting):
return NotImplemented

@abstractmethod
5 changes: 4 additions & 1 deletion e2e/libs/host/harvester.py
@@ -53,7 +53,7 @@ def reboot_all_worker_nodes(self, shut_down_time_in_sec):
for node_name in node_names:
self.power_on_node(node_name)

def power_off_node(self, node_name):
def power_off_node(self, node_name, waiting=True):
vm_id = self.mapping[node_name]

url = f"{self.url}/{vm_id}"
@@ -68,6 +68,9 @@ def power_off_node(self, node_name):
logging(f"Stopping vm failed with error {e}")
logging(f"Stopping vm {vm_id}")

if not waiting:
return

stopped = False
for i in range(self.retry_count):
logging(f"Waiting for vm {vm_id} stopped ... ({i})")
6 changes: 3 additions & 3 deletions e2e/libs/keywords/host_keywords.py
@@ -46,10 +46,10 @@ def reboot_node_by_name(self, node_name, downtime_in_min=1):
logging(f'Rebooting node {node_name} with downtime {reboot_down_time_sec} seconds')
self.host.reboot_node(node_name, reboot_down_time_sec)

def power_off_volume_node(self, volume_name):
def power_off_volume_node(self, volume_name, waiting=True):
node_id = self.volume_keywords.get_node_id_by_replica_locality(volume_name, "volume node")
logging(f'Power off volume {volume_name} node {node_id}')
self.host.power_off_node(node_id)
logging(f'Power off volume {volume_name} node {node_id} with waiting = {waiting}')
self.host.power_off_node(node_id, waiting)

def power_on_node_by_name(self, node_name):
self.host.power_on_node(node_name)
20 changes: 20 additions & 0 deletions e2e/libs/keywords/sharemanager_keywords.py
@@ -67,12 +67,32 @@ def delete_sharemanager_pod_and_wait_for_recreation(self, name):

assert False, f"sharemanager pod {sharemanager_pod_name} not recreated"

def wait_for_sharemanager_pod_restart(self, name):
sharemanager_pod_name = "share-manager-" + name
sharemanager_pod = get_pod(sharemanager_pod_name, "longhorn-system")
last_creation_time = sharemanager_pod.metadata.creation_timestamp

retry_count, retry_interval = get_retry_count_and_interval()
for i in range(retry_count):
logging(f"Waiting for sharemanager for volume {name} restart ... ({i})")
time.sleep(retry_interval)
sharemanager_pod = get_pod(sharemanager_pod_name, "longhorn-system")
if sharemanager_pod == None:
continue
creation_time = sharemanager_pod.metadata.creation_timestamp
logging(f"Getting new sharemanager which is created at {creation_time}, and old one is created at {last_creation_time}")
if creation_time > last_creation_time:
return

assert False, f"sharemanager pod {sharemanager_pod_name} isn't restarted"
Comment on lines +70 to +87

🛠️ Refactor suggestion

Good addition, but some improvements needed

The new method effectively monitors share manager pod restarts with good logging. However, there are a few improvements to consider:

  1. Follow Python best practices:
-            if sharemanager_pod == None:
+            if sharemanager_pod is None:
  2. Use proper assertion:
-        assert False, f"sharemanager pod {sharemanager_pod_name} isn't restarted"
+        raise AssertionError(f"sharemanager pod {sharemanager_pod_name} isn't restarted")
  3. Consider extracting common code with delete_sharemanager_pod_and_wait_for_recreation, as both methods share similar logic for monitoring pod recreation; a sketch of this follows below.
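
A rough sketch of that extraction, under stated assumptions: the private helper name, the class wrapper, and the import paths below are hypothetical, while get_pod, get_retry_count_and_interval, and logging are the helpers already used in sharemanager_keywords.py. The actual refactor may look different.

```python
import time

# Assumed import paths; adjust to the repository's actual utility modules.
from utility.utility import get_retry_count_and_interval, logging
from workload.pod import get_pod


class sharemanager_keywords:  # class name assumed to match the existing module

    def _wait_for_pod_newer_than(self, pod_name, namespace, last_creation_time):
        # Poll until a pod with this name exists and its creation timestamp is
        # newer than last_creation_time, i.e. the pod has been recreated.
        retry_count, retry_interval = get_retry_count_and_interval()
        for i in range(retry_count):
            logging(f"Waiting for pod {pod_name} to be recreated ... ({i})")
            time.sleep(retry_interval)
            pod = get_pod(pod_name, namespace)
            if pod is None:
                continue
            if pod.metadata.creation_timestamp > last_creation_time:
                return
        raise AssertionError(f"pod {pod_name} was not recreated")

    def wait_for_sharemanager_pod_restart(self, name):
        pod_name = "share-manager-" + name
        old_pod = get_pod(pod_name, "longhorn-system")
        self._wait_for_pod_newer_than(pod_name, "longhorn-system",
                                      old_pod.metadata.creation_timestamp)
```

delete_sharemanager_pod_and_wait_for_recreation could then delete the pod first and call the same helper, so both keywords share a single polling loop.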
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
    def wait_for_sharemanager_pod_restart(self, name):
        sharemanager_pod_name = "share-manager-" + name
        sharemanager_pod = get_pod(sharemanager_pod_name, "longhorn-system")
        last_creation_time = sharemanager_pod.metadata.creation_timestamp
        retry_count, retry_interval = get_retry_count_and_interval()
        for i in range(retry_count):
            logging(f"Waiting for sharemanager for volume {name} restart ... ({i})")
            time.sleep(retry_interval)
            sharemanager_pod = get_pod(sharemanager_pod_name, "longhorn-system")
-            if sharemanager_pod == None:
+            if sharemanager_pod is None:
                continue
            creation_time = sharemanager_pod.metadata.creation_timestamp
            logging(f"Getting new sharemanager which is created at {creation_time}, and old one is created at {last_creation_time}")
            if creation_time > last_creation_time:
                return
-        assert False, f"sharemanager pod {sharemanager_pod_name} isn't restarted"
+        raise AssertionError(f"sharemanager pod {sharemanager_pod_name} isn't restarted")
🧰 Tools
🪛 Ruff (0.8.0)

80-80: Comparison to None should be cond is None; replace with cond is None (E711)

87-87: Do not assert False (python -O removes these calls); raise AssertionError() instead (B011)
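
To illustrate why Ruff raises these two findings, here is a small standalone snippet (not code from this PR):

```python
# Illustration only; not part of the PR.

class AlwaysEqual:
    # A custom __eq__ can make "== None" return True for a real object,
    # which is why E711 prefers the identity check "is None".
    def __eq__(self, other):
        return True


pod = AlwaysEqual()
print(pod == None)   # True, even though pod is a real object
print(pod is None)   # False, which is what we actually want to know

# B011: "assert False, ..." is stripped when Python runs with -O, so the
# failure would be silently skipped; raising AssertionError always executes.
try:
    raise AssertionError("sharemanager pod was not restarted")
except AssertionError as err:
    print(f"caught: {err}")
```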



def wait_for_share_manager_pod_running(self, name):
sharemanager_pod_name = "share-manager-" + name
retry_count, retry_interval = get_retry_count_and_interval()
for i in range(retry_count):
sharemanager_pod = get_pod(sharemanager_pod_name, "longhorn-system")
logging(f"Waiting for sharemanager for volume {name} running, currently {sharemanager_pod.status.phase} ... ({i})")
if sharemanager_pod.status.phase == "Running":
return

10 changes: 1 addition & 9 deletions e2e/tests/negative/cluster_restart.robot
@@ -3,6 +3,7 @@ Documentation Negative Test Cases

Test Tags negative cluster

Resource ../keywords/variables.resource
Resource ../keywords/common.resource
Resource ../keywords/deployment.resource
Resource ../keywords/longhorn.resource
@@ -16,15 +17,6 @@ Resource ../keywords/setting.resource
Test Setup Set test environment
Test Teardown Cleanup test resources

*** Variables ***
${LOOP_COUNT} 1
${RETRY_COUNT} 300
${RETRY_INTERVAL} 1
${CONTROL_PLANE_NODE_NETWORK_LATENCY_IN_MS} 0
${RWX_VOLUME_FAST_FAILOVER} false
${DATA_ENGINE} v1


*** Test Cases ***
Restart Cluster While Workload Heavy Writing
Given Set setting rwx-volume-fast-failover to ${RWX_VOLUME_FAST_FAILOVER}
8 changes: 1 addition & 7 deletions e2e/tests/negative/component_resilience.robot
@@ -3,6 +3,7 @@ Documentation Negative Test Cases

Test Tags negative

Resource ../keywords/variables.resource
Resource ../keywords/common.resource
Resource ../keywords/volume.resource
Resource ../keywords/backing_image.resource
@@ -18,13 +19,6 @@ Resource ../keywords/sharemanager.resource
Test Setup Set test environment
Test Teardown Cleanup test resources

*** Variables ***
${LOOP_COUNT} 1
${RETRY_COUNT} 300
${RETRY_INTERVAL} 1
${RWX_VOLUME_FAST_FAILOVER} false
${DATA_ENGINE} v1

*** Keywords ***
Delete instance-manager of volume ${volume_id} and wait for recover
When Delete instance-manager of volume ${volume_id}
8 changes: 1 addition & 7 deletions e2e/tests/negative/kubelet_restart.robot
@@ -3,6 +3,7 @@ Documentation Negative Test Cases

Test Tags negative

Resource ../keywords/variables.resource
Resource ../keywords/common.resource
Resource ../keywords/storageclass.resource
Resource ../keywords/persistentvolumeclaim.resource
@@ -14,13 +15,6 @@ Resource ../keywords/setting.resource
Test Setup Set test environment
Test Teardown Cleanup test resources

*** Variables ***
${LOOP_COUNT} 1
${RETRY_COUNT} 300
${RETRY_INTERVAL} 1
${RWX_VOLUME_FAST_FAILOVER} false
${DATA_ENGINE} v1

*** Test Cases ***
Restart Volume Node Kubelet While Workload Heavy Writing
Given Set setting rwx-volume-fast-failover to ${RWX_VOLUME_FAST_FAILOVER}
9 changes: 1 addition & 8 deletions e2e/tests/negative/network_disconnect.robot
@@ -3,6 +3,7 @@ Documentation Negative Test Cases

Test Tags negative

Resource ../keywords/variables.resource
Resource ../keywords/volume.resource
Resource ../keywords/storageclass.resource
Resource ../keywords/statefulset.resource
@@ -14,14 +15,6 @@ Resource ../keywords/setting.resource
Test Setup Set test environment
Test Teardown Cleanup test resources

*** Variables ***
${LOOP_COUNT} 1
${LATENCY_IN_MS} 0
${RETRY_COUNT} 300
${RETRY_INTERVAL} 1
${RWX_VOLUME_FAST_FAILOVER} false
${DATA_ENGINE} v1

*** Test Cases ***
Disconnect Volume Node Network While Workload Heavy Writing
Given Set setting rwx-volume-fast-failover to ${RWX_VOLUME_FAST_FAILOVER}
8 changes: 1 addition & 7 deletions e2e/tests/negative/node_delete.robot
@@ -3,6 +3,7 @@ Documentation Negative Test Cases

Test Tags negative

Resource ../keywords/variables.resource
Resource ../keywords/common.resource
Resource ../keywords/host.resource
Resource ../keywords/storageclass.resource
@@ -15,13 +16,6 @@ Resource ../keywords/setting.resource
Test Setup Set test environment
Test Teardown Cleanup test resources

*** Variables ***
${LOOP_COUNT} 1
${RETRY_COUNT} 300
${RETRY_INTERVAL} 1
${RWX_VOLUME_FAST_FAILOVER} false
${DATA_ENGINE} v1

*** Test Cases ***
Delete Volume Node While Replica Rebuilding
Given Set setting node-down-pod-deletion-policy to do-nothing
8 changes: 1 addition & 7 deletions e2e/tests/negative/node_drain.robot
@@ -3,6 +3,7 @@ Documentation Negative Test Cases

Test Tags negative

Resource ../keywords/variables.resource
Resource ../keywords/common.resource
Resource ../keywords/storageclass.resource
Resource ../keywords/persistentvolumeclaim.resource
@@ -18,13 +19,6 @@ Resource ../keywords/node.resource
Test Setup Set test environment
Test Teardown Cleanup test resources

*** Variables ***
${LOOP_COUNT} 1
${RETRY_COUNT} 300
${RETRY_INTERVAL} 1
${RWX_VOLUME_FAST_FAILOVER} false
${DATA_ENGINE} v1

*** Test Cases ***
Force Drain Volume Node While Replica Rebuilding
Given Set setting rwx-volume-fast-failover to ${RWX_VOLUME_FAST_FAILOVER}