Commit 8aaee14: Modify the docs to reflect suggested troubleshoot path
kat-lsg-dev committed Nov 26, 2024 (parent: d0f7b78)
Showing 1 changed file with 135 additions and 59 deletions: docs/run_test/troubleshoot_failures.rst
Troubleshoot Test Failures
==========================

- `Overview <#overview>`__
- `Analyze Test Results <#analyze-test-results>`__
- `Log Files <#log-files>`__
- `Search LISA Code for Issues <#search-lisa-code-for-issues>`__
- `Reproduce Failures Manually <#reproduce-failures-manually>`__

Overview
--------

To understand a test failure, follow these recommended troubleshooting
steps:

1. **Analyze Test Results**: Look for the error messages in the
console output. These messages are derived from assertion or
exception messages and are the easiest and fastest way to
understand a test failure.
2. **Check the Log Files**: Search the root log file, which contains
traces and command outputs, as well as the split log files, which are
smaller in size.
3. **Search the LISA Code for Issues**: Investigate the LISA codebase to
identify potential issues.
4. **Reproduce the Failure Manually**: Deploy the necessary resources
and run the commands to try to reproduce the failure manually.

These steps are ranked by ease and speed of resolution. The first two
are the cheapest to follow and should be sufficient to resolve most
issues. The last two are more advanced and require more effort, but can
be useful for complex problems, so start with the earlier steps before
moving on to the later ones.
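For steps 1 and 2, searching the root log for common failure keywords
often surfaces the relevant assertion message directly. A minimal
sketch, using a fabricated sample log in place of real LISA output:

```shell
# Sketch: fabricate a sample root log, then search it the way you would
# a real one. With a real run, point grep at the timestamped folder
# under runtime/log instead of this demo file.
mkdir -p demo_log
printf 'INFO  suite started\nERROR AssertionError: expected 2 nodes, found 1\n' \
  > demo_log/lisa-sample.log
grep -n -i -E 'error|exception|failed' demo_log/lisa-sample.log
```

The ``-n`` flag keeps the line numbers, which makes it easier to jump to
the surrounding context in the full log.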

Analyze Test Results
--------------------

- **Console Output**

The results of a test run are displayed in the console at the conclusion
of the run and saved in log files generated by LISA. The console
displays a summary containing the test suite and case name, the test
status, and a message if applicable. A summary is also generated that
tallies the results of all tests. Failures are categorized by similar
messages.

.. figure:: ../img/test_results_summary.png
:alt: test_results_summary

In the above example, 5 tests were run in total, with results of
2 PASSED and 3 SKIPPED. The SKIPPED tests failed to meet requirements
of the test environment, due to an insufficient number of nodes and an
OS type mismatch, as stated in the message column. See "Final
Results" below for more information on the meaning of PASSED and SKIPPED
results.

- **Test Result Categories**

It's essential to understand the results after running tests. LISA has 7
kinds of test results in total: 3 of which are intermediate results, and
4 of which are final results. A test case holds exactly one result at a
time; it cannot be in two or more results at the same time.
.. figure:: ../img/test_results.png
:alt: test_results

- **Final results**

A final result shows information of a terminated test. It provides more
valuable information than the intermediate result. It only appears in
the end of a successful test run.

- **FAILED**

FAILED tests are tests that did not finish successfully and
terminated because of failures like ``LISA exceptions`` or
``Assertion failure``. You can use them to trace where the problem
was and why the problem happened.

- **PASSED**

PASSED tests are tests that passed, or at least partially passed,
with a special ``PASSException`` that warns there are minor errors in
the run but they do not affect the test result.

- **SKIPPED**

    SKIPPED tests are tests that did not start and will not run. They
    suggest a failure to meet some requirements in the environments
    involved with the test.

- **ATTEMPTED**

ATTEMPTED tests are a special category of FAILED tests because of
known issues, which are not likely to be fixed soon.

- **Intermediate results**

An intermediate result shows information of an unfinished test. It will
be updated as the test proceeds toward a final result. In the case
of error or exception prior to running a test case, only the
intermediate result will be provided.

- **QUEUED**

QUEUED tests are tests that are created, and planned to run (but have
not started yet). They are pre-selected by extension/runbook
criteria. You can check log to see which test cases are included by
    the criteria. A QUEUED test will not run if its requirements
match none of the environments.

- **ASSIGNED**

ASSIGNED tests are tests that are assigned to an environment, and
will start to run, if applicable, once the environment is
deployed/initialized. They suggest some environmental setting up is
    in progress; the test starts once the environment is deployed
    successfully.

- **RUNNING**

    RUNNING tests are tests that are currently executing. RUNNING tests
    will end with one of the final results described above.
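The split between intermediate and final results can be summarized as a
small model. This is an illustrative sketch only, not LISA's actual
implementation:

```python
from enum import Enum


class TestResult(Enum):
    # Intermediate results: the test has not terminated yet.
    QUEUED = "queued"
    ASSIGNED = "assigned"
    RUNNING = "running"
    # Final results: the test has terminated.
    FAILED = "failed"
    PASSED = "passed"
    SKIPPED = "skipped"
    ATTEMPTED = "attempted"


INTERMEDIATE = {TestResult.QUEUED, TestResult.ASSIGNED, TestResult.RUNNING}
FINAL = {TestResult.FAILED, TestResult.PASSED,
         TestResult.SKIPPED, TestResult.ATTEMPTED}


def is_final(result: TestResult) -> bool:
    """Final results appear in the end-of-run summary; intermediate ones do not."""
    return result in FINAL


print(is_final(TestResult.RUNNING))    # False
print(is_final(TestResult.ATTEMPTED))  # True
```

A test case holds exactly one of these seven values at a time, which is
why the two sets are disjoint and together cover all members of the enum.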


Log Files
---------

After a test run, LISA generates log files under the ``runtime/log``
directory. Navigate sub-folders until you find the folder with a
timestamp corresponding to the time of the test run. Inside the
timestamped folder, the contents are further split by environment and
test case. If the test run has only a few cases, the full log
(``lisa-<timestamp>.log``) may be easier to read; if the run uses
concurrency, the split logs may be easier to read.
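Finding the right timestamped folder by hand can be tedious. The helper
below is a sketch; the folder names are fabricated stand-ins for a real
``runtime/log`` layout:

```python
import os
import tempfile
import time
from pathlib import Path


def newest_log_folder(log_root: str) -> Path:
    """Return the most recently modified sub-folder under the log root."""
    folders = [p for p in Path(log_root).iterdir() if p.is_dir()]
    if not folders:
        raise FileNotFoundError(f"no log folders under {log_root}")
    return max(folders, key=lambda p: p.stat().st_mtime)


# Demo with a fabricated layout standing in for runtime/log:
root = tempfile.mkdtemp()
for name in ("20241125", "20241126"):
    os.mkdir(os.path.join(root, name))
    time.sleep(0.05)  # ensure distinct modification times
print(newest_log_folder(root).name)  # 20241126
```

Sorting by modification time rather than by name keeps the helper
working even if the timestamp format in folder names changes.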

- **LOG FOLDER STRUCTURE**

* **environment** folder, which contains logs split for the
environment.
  * **test case** folders, each of which
    contains log files named ``<timestamp>-<testcase>.log``.

.. figure:: ../img/test_case_logs.png
   :alt: test_case_logs

Search LISA Code for Issues
---------------------------

If the test results and logs do not provide enough information to
resolve the issue, you may need to investigate the LISA codebase itself.
Use the stack trace information from the console output or logs to
locate the relevant code lines. Here’s how you can do it:

1. **Locate the Stack Trace**: Find the stack trace in the console
output or in the log files located in the ``runtime/log`` directory.
The stack trace will show the sequence of function calls that led to
the error.

2. **Identify Relevant Code Lines**: The stack trace includes file names
and line numbers where the error occurred. Use this information to
navigate to the corresponding lines in the LISA codebase.

3. **Understand the Flow**: Examine the functions and methods mentioned
in the stack trace to understand the flow of execution. This will
help you identify where the issue might be originating from.

4. **Search for Issues**: Look for any anomalies or potential issues in
the code around the lines mentioned in the stack trace. This could
include incorrect logic, unhandled exceptions, or other bugs.

5. **Contribute Back**: If you find areas that can be improved or
clarified, consider contributing back to LISA to help others
understand the issue through better error messages or code
improvements.
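Steps 1 and 2 above amount to reading the file and line pairs out of a
stack trace. A generic Python illustration (not LISA-specific) of
extracting exactly that information programmatically:

```python
import traceback


def failing_assertion() -> None:
    node_count = 1
    assert node_count >= 2, "environment requires at least 2 nodes"


try:
    failing_assertion()
except AssertionError as exc:
    # extract_tb yields one frame per call: filename, line number, and
    # function name -- the same data a stack trace in the logs gives you.
    for frame in traceback.extract_tb(exc.__traceback__):
        print(f"{frame.filename}:{frame.lineno} in {frame.name}")
    print(f"message: {exc}")
```

The last frame printed is the innermost one, which is usually the best
place to start reading the code.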

Reproduce Failures Manually
---------------------------

If the test results and logs do not provide enough information to
resolve the issue, you may need to reproduce the failure manually. Set
up your development environment as described in the :doc:`Development Setup<../write_test/dev_setup>`.
Deploy the necessary resources, such as virtual machines or cloud
services, then run the commands that caused the test failures and
observe the output.

Be aware that reproducing failures can incur costs, especially in cloud
environments, so monitor your resource usage and clean up resources when
they are no longer needed. Some issues may not be reproducible 100% of
the time, so examining error messages and logs might be more effective.

If you manage to reproduce the issue or find a solution, consider
contributing back to LISA by improving error messages, updating
documentation, or fixing bugs to help others who might encounter similar
issues.
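When re-running, it can help to narrow the runbook to just the failing
case so only the resources that case needs are deployed. The fragment
below is hypothetical: the criteria value is a placeholder, and the
exact schema is defined by LISA's runbook reference documentation.

```yaml
# Hypothetical runbook fragment: select a single test case by name so a
# re-run deploys only the environment that case requires.
testcase:
  - criteria:
      name: failing_case_name  # placeholder for the failing case's name
```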
