openai · JunShern · Mar 19, 2024 · Mar 15, 2024 · Mar 15, 2024 · Mar 19, 2024
@@ -0,0 +1,134 @@
+# Track the Stat
+
+This eval measures how well models can implicitly keep track of task state, by
+asking models to compute the rolling median or the rolling mode over a sequence
+of integers.
+
+## Usage
+
+Run with:
+
+```bash
+oaieval <solver> track_the_stat
+```
+
+We have found that `generation/direct/gpt-4-0125-preview` works well on this
+eval. For more examples of tested solvers, see
+[`./scripts/run_experiments.sh`](./scripts/run_experiments.sh).
+
+## Evaluation Process
+
+The evaluation process is as follows for a given sample from our dataset:
+
+1. The `TASK_DESCRIPTION` prompt is shown to the solver.
+2. The sample contains an integer to use as a seed for a random number
+   generator.
+3. The random number generator generates 300 random integers from 0 to 99 (the
+   upper bound of 100 is exclusive), with replacement.
+4. The integers are shown one by one to the solver.
+5. At each turn (i.e., after each integer is shown), the solver needs to respond
+   with the current rolling median or the current rolling mode of the integers
+   seen so far.
+6. The solver's response is parsed and compared to the correct rolling median or
+   rolling mode.
+7. If the solver's response is incorrect, or a violation is raised because the
+   answer is not in the expected format, the evaluation stops and we record how
+   many turns the solver lasted. If the solver's response is correct, we move
+   on to the next integer.
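
The steps above can be sketched as a small stand-alone loop. This is only an illustration under stated assumptions, not the eval's implementation: it uses the standard library's `random` in place of the eval's seeded generator, covers only the median variant, and `parse_answer`, `run_sample`, and the toy solvers are hypothetical names.

```python
import random
import re
from statistics import median

# Hypothetical parser for the `[median: <answer>]` response format.
ANSWER_RE = re.compile(r"\[median:\s*(-?\d+(?:\.\d+)?)\]")


def parse_answer(text):
    """Return the number inside `[median: <answer>]`, or None on a format violation."""
    match = ANSWER_RE.search(text)
    return float(match.group(1)) if match else None


def run_sample(solver, seed, n_turns=300):
    """Show integers one by one; stop at the first wrong or malformed answer."""
    rng = random.Random(seed)  # stand-in for the eval's seeded generator
    seen, turns_survived = [], 0
    for _ in range(n_turns):
        seen.append(rng.randrange(0, 100))
        answer = parse_answer(solver(seen))
        if answer is None:
            return {"max_length": turns_survived, "violation": True}
        if round(answer, 1) != round(median(seen), 1):
            break
        turns_survived += 1
    return {"max_length": turns_survived, "violation": False}


def perfect_solver(seen):
    """Toy solver with full access to the history; always answers correctly."""
    return f"[median: {median(seen)}]"


def forgetful_solver(seen):
    """Toy solver that only 'remembers' the last five numbers."""
    return f"[median: {median(seen[-5:])}]"


print(run_sample(perfect_solver, seed=0))
print(run_sample(forgetful_solver, seed=0))
print(run_sample(lambda seen: "no idea", seed=0))
```

The toy solvers receive the full history only for convenience. A real solver sees the numbers as chat messages and has to maintain the state implicitly.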
+
+## Prompts
+
+We refer readers to the [`./prompts/`](./prompts/) folder for the
+`TASK_DESCRIPTION` used in the eval.
+
+## Metrics
+
+Below are the metrics returned by the eval:
+
+<!-- prettier-ignore-start -->
+| **Metric**        	| **Notes**                                                                                                                                  	|
+|-------------------	|--------------------------------------------------------------------------------------------------------------------------------------------	|
+| avg_max_length    	| The maximum sequence length the model can handle before failing, averaged across the samples. Higher is better. Best possible is 300.      	|
+| stddev_max_length 	| The standard deviation on the above.                                                                                                       	|
+| median_max_length 	| The median of the maximum sequence length the model can handle before failing, across the samples. Higher is better. Best possible is 300. 	|
+| max_max_length    	| The maximum sequence length the model handled before failing across all samples.                                                           	|
+| min_max_length    	| The minimum sequence length the model handled before failing across all samples.                                                           	|
+| violation_rate    	| How often the model responds in an invalid format, i.e., not using the `[<task>: <answer>]` format.                                        	|
+<!-- prettier-ignore-end -->
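
As a rough illustration of how these aggregates relate to the per-sample results, the sketch below recomputes them with the standard library. The `aggregate` helper and the sample values are hypothetical. `pstdev` is used because it is the population standard deviation, which is what `np.std` computes by default.

```python
from statistics import mean, median, pstdev


def aggregate(per_sample):
    """Aggregate per-sample records of the form {"max_length": int, "violation": bool}."""
    lengths = [m["max_length"] for m in per_sample]
    return {
        "avg_max_length": float(mean(lengths)),
        "stddev_max_length": float(pstdev(lengths)),  # population std
        "median_max_length": float(median(lengths)),
        "max_max_length": float(max(lengths)),
        "min_max_length": float(min(lengths)),
        # fraction of samples that ended on a format violation
        "violation_rate": sum(m["violation"] for m in per_sample) / len(per_sample),
    }


# Made-up records, purely for illustration.
records = [
    {"max_length": 2, "violation": True},
    {"max_length": 4, "violation": False},
]
print(aggregate(records))
```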
+
+## Variants
+
+The eval has two variants: median and mode. In the median variant, the solver
+needs to track the rolling median. In the mode variant, the solver needs to
+track the rolling mode.
+
+```bash
+oaieval <solver> track_the_stat.<median/mode>
+```
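
For reference, the two statistics can be sketched as follows, using the tie-breaking rules stated in the task prompts: the mean of the two middle values for even-length lists, and the largest value among tied modes. The function names are hypothetical and this is not the eval's own `utils` code.

```python
from collections import Counter
from statistics import median


def rolling_median(nums):
    """Median of each prefix; even-length prefixes use the mean of the two middle values."""
    return [median(nums[: i + 1]) for i in range(len(nums))]


def rolling_mode(nums):
    """Mode of each prefix; ties are broken in favour of the largest value."""
    counts = Counter()
    modes = []
    for n in nums:
        counts[n] += 1
        top = max(counts.values())
        modes.append(max(v for v, c in counts.items() if c == top))
    return modes


# The six-number sequence used in the prompt examples.
seq = [1, 2, 1, 3, 3, 0]
print(rolling_median(seq))  # [1, 1.5, 1, 1.5, 2, 1.5]
print(rolling_mode(seq))    # [1, 2, 1, 1, 3, 3]
```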
+
+## Custom Solvers
+
+We implement 3 custom solvers for this eval in [./solvers.py](./solvers.py):
+
+1. `ExplicitStateSolver`: A nested solver that injects an explicit
+   representation of the task state after each number is seen. For example, for
+   the median task we inject the sorted list of numbers seen so far. For the
+   mode task, we inject a dictionary that maps each number seen so far to its
+   count. We view this solver as a baseline for the task, providing the
+   performance of the models on _explicit_ state tracking, rather than the
+   default _implicit_ state tracking.
+2. `RandomBaselineSolver`: A solver that randomly chooses a number from the
+   numbers seen so far as the rolling median or mode. In case of even length
+   lists in the median variant, it chooses two random numbers and returns their
+   arithmetic mean. We view this baseline as equivalent to randomly guessing.
+3. `TrackTheStatHuman`: A helper solver class that wraps the `HumanCliSolver`
+   class such that users do not have to wrap their answer in the
+   `[median: <answer>]` or `[mode: <answer>]` format and can instead just
+   directly type the number.
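
To make the first two solvers more concrete, here is a minimal sketch of the state they work with. The helper names are hypothetical and this is not the code in `solvers.py`. It also assumes the random baseline draws its two numbers with replacement, which the description above does not specify.

```python
import random
from collections import Counter


def explicit_state(seen, task):
    """Hypothetical helper: the state an explicit-state solver would be shown."""
    if task == "median":
        return sorted(seen)  # sorted list of numbers seen so far
    if task == "mode":
        return dict(Counter(seen))  # number -> count
    raise ValueError(f"unknown task: {task}")


def random_baseline(seen, task, rng):
    """Guess by sampling from the numbers seen so far."""
    if task == "median" and len(seen) % 2 == 0:
        # even-length list: mean of two random picks (assumed with replacement)
        return (rng.choice(seen) + rng.choice(seen)) / 2
    return rng.choice(seen)


seen = [3, 1, 3, 2]
print(explicit_state(seen, "median"))  # [1, 2, 3, 3]
print(explicit_state(seen, "mode"))    # {3: 2, 1: 1, 2: 1}
print(random_baseline(seen, "mode", random.Random(0)))
```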
+
+## Token Usage Estimates
+
+Below are token usage estimates for a given run (one run = all samples) of the
+eval.
+
+For the mode task:
+
+| Model (state tracking)        | Input     | Output    | Total      |
+| ----------------------------- | --------- | --------- | ---------- |
+| gpt-3.5-turbo-0125 (implicit) | 670,000   | 10,000    | 680,000    |
+| gpt-3.5-turbo-0125 (explicit) | 2,710,000 | 30,000    | 2,740,000  |
+| gpt-4-base (implicit)         | 9,030,000 | 2,110,000 | 11,150,000 |
+| gpt-4-base (explicit)         | 3,720,000 | 960,000   | 4,680,000  |
+| gpt-4-0125-preview (implicit) | 3,050,000 | 30,000    | 3,080,000  |
+| gpt-4-0125-preview (explicit) | 8,580,000 | 50,000    | 8,630,000  |
+
+For the median task:
+
+| Model (state tracking)        | Input     | Output  | Total     |
+| ----------------------------- | --------- | ------- | --------- |
+| gpt-3.5-turbo-0125 (implicit) | 430,000   | 10,000  | 440,000   |
+| gpt-3.5-turbo-0125 (explicit) | 880,000   | 10,000  | 890,000   |
+| gpt-4-base (implicit)         | 2,900,000 | 760,000 | 3,660,000 |
+| gpt-4-base (explicit)         | 3,250,000 | 810,000 | 4,060,000 |
+| gpt-4-0125-preview (implicit) | 690,000   | 10,000  | 700,000   |
+| gpt-4-0125-preview (explicit) | 1,430,000 | 20,000  | 1,450,000 |
+
+## Future modifications
+
+- Identify new variants of the task beyond median or mode, where the explicit
+  state is either impossible to represent or not useful for the task. This would
+  allow us to more comfortably measure the implicit state tracking, even on CoT
+  solvers.
+- Identify more realistic and/or complex tasks.
+- Introduce distractors.
+
+## Version History
+
+- v0: Initial version released
+
+## Contribution Statement
+
+Eval design, implementation, and results evaluation were primarily conducted by
+Giulio Starace, under the guidance of (alphabetically by last name) Steven
+Adler, Andrei Alexandru, James Aung, and Chan Jun Shern, who provided research
+input, report revisions, and project management support.
@@ -0,0 +1,96 @@
+import logging
+import random
+from typing import Any, Optional
+
+import numpy as np
+
+from evals.elsuite.track_the_stat import prompts, utils
+from evals.eval import SolverEval
+from evals.record import RecorderBase, record_metrics
+from evals.solvers.solver import Solver
+from evals.task_state import Message, TaskState
+
+logging.getLogger("httpx").setLevel(logging.WARNING)
+logger = logging.getLogger(__name__)
+
+
+class TrackTheStat(SolverEval):
+    def __init__(self, task: str, n_samples: Optional[int] = 250, *args, **kwargs):
+        super().__init__(*args, **kwargs)
+        assert task in [
+            "median",
+            "mode",
+        ], f"task must be either 'median' or 'mode', but got {task}"
+        self.task = task
+        # warn, color in yellow
+        logger.warning(
+            utils.yellow_string(
+                "By nature of what is being evaluated, this eval assumes that the "
+                "solver cannot make use of external scratchpads or similar solutions "
+                "to explicitly write down the task state at every step. Using solvers "
+                "that allow for this functionality will likely produce invalid results."
+            )
+        )
+        self.task_desc = prompts.TASK_DESCRIPTION.format(
+            task=task,
+            task_further_details=prompts.task_to_further_details[task],
+            task_example=prompts.task_to_example[task],
+        )
+        self.task_fn = utils.task_to_fn[task]
+        self.n_samples = n_samples
+        self.rng = random.Random(self.seed)
+
+    def eval_sample(self, solver: Solver, sample: Any, rng: random.Random) -> None:
+        capped_inf_list = np.random.default_rng(sample["seed"]).integers(0, 100, size=300)
+        metrics = self._eval_sample(solver, capped_inf_list)
+
+        record_metrics(**metrics)
+
+    def _eval_sample(self, solver: Solver, capped_inf_list: list[int]) -> dict:
+        violation = False
+        # number of consecutive turns answered correctly before the first failure
+        max_length = 0
+        task_state = TaskState(task_description=self.task_desc, messages=[])
+        for i, num in enumerate(capped_inf_list):
+            curr_list = capped_inf_list[: i + 1]
+            task_state.messages.append(Message(role="user", content=str(num)))
+            task_state.current_state = utils.compute_state(curr_list, self.task)
+            solver_output = solver(task_state).output
+            solver_response = utils.parse_solver_output(solver_output, self.task)
+            if solver_response is None:
+                violation = True
+                break
+            if round(solver_response, 1) != round(self.task_fn(curr_list), 1):
+                break
+            task_state.messages.append(Message(role="assistant", content=solver_output))
+            # counting here lets a run with no failures report the full length
+            max_length = i + 1
+
+        return {
+            "max_length": max_length,
+            "violation": violation,
+        }
+
+    def run(self, recorder: RecorderBase):
+        samples = self._get_samples()
+        self.eval_all_samples(recorder, samples)
+        logged_metrics: list[dict] = recorder.get_metrics()
+
+        agg_metrics = self._compute_agg_metrics(logged_metrics)
+        return agg_metrics
+
+    def _compute_agg_metrics(self, logged_metrics: list[dict]) -> dict:
+        max_lengths = np.array([x["max_length"] for x in logged_metrics])
+
+        agg_metrics = {
+            "avg_max_length": np.mean(max_lengths),
+            "stddev_max_length": np.std(max_lengths),
+            "median_max_length": np.median(max_lengths),
+            "max_max_length": np.max(max_lengths),
+            "min_max_length": np.min(max_lengths),
+            "violation_rate": np.mean([x["violation"] for x in logged_metrics]),
+        }
+        # necessary for serialization, json doesn't like np floats
+        agg_metrics = {k: float(v) for k, v in agg_metrics.items()}
+        return agg_metrics
+
+    def _get_samples(self) -> list[dict]:
+        samples = [{"seed": num} for num in range(self.n_samples)]
+        samples = self.rng.sample(samples, self.n_samples)  # shuffling
+        return samples
@@ -0,0 +1,27 @@
+from . import mode
+from . import median
+
+
+TASK_DESCRIPTION = """\
+In the following messages, you will be given a sequence of numbers.\
+ At each turn, you will be shown a number as input, and you should respond with the\
+ {task} of all the input numbers shown to you so far.
+
+{task_further_details}
+
+Here is an example of what this may look like.
+{task_example}
+
+Format your response as [{task}: <response>] (square brackets included), as shown in\
+ the transcript above. The task will begin now.
+"""
+
+task_to_example = {
+    "median": median.MEDIAN_EXAMPLE,
+    "mode": mode.MODE_EXAMPLE,
+}
+
+task_to_further_details = {
+    "median": median.MEDIAN_FURTHER_DETAILS,
+    "mode": mode.MODE_FURTHER_DETAILS,
+}
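
The template relies on backslash line continuations inside a triple-quoted string. As a minimal sketch of how these behave (the two constants below are illustrative only): a trailing backslash removes the newline without inserting any whitespace, so a continued line needs its own leading space to avoid fusing two words.

```python
# A trailing backslash joins the next line directly, with no space added.
TEMPLATE_FUSED = """\
as shown in\
the transcript above."""

# A leading space on the continued line keeps the words apart.
TEMPLATE_SPACED = """\
as shown in\
 the transcript above."""

print(repr(TEMPLATE_FUSED))   # 'as shown inthe transcript above.'
print(repr(TEMPLATE_SPACED))  # 'as shown in the transcript above.'
```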
@@ -0,0 +1,33 @@
+MEDIAN_EXAMPLE = """\
+```example
+input: 1
+ideal_response: [median: 1]\
+ # your response; 1 is the only number shown so far
+---
+input: 2
+ideal_response: [median: 1.5]\
+ # even number of numbers, so median = mean(1,2) = 1.5
+---
+input: 1
+ideal_response: [median: 1]\
+ # 1 is now the middle number when sorting the numbers
+---
+input: 3
+ideal_response: [median: 1.5]\
+ # middle numbers are now 1 and 2, so once again median = mean(1,2) = 1.5
+---
+input: 3
+ideal_response: [median: 2]\
+ # the sorted list is [1 1 2 3 3]; odd length, so median is the middle number, 2
+---
+input: 0
+ideal_response: [median: 1.5]\
+ # the sorted list is [0 1 1 2 3 3]; even length, so median is mean(1,2) = 1.5
+```\
+"""
+
+
+MEDIAN_FURTHER_DETAILS = """\
+NOTE: In case of lists containing an even number of elements, you should respond with the\
+ arithmetic mean of the middle two numbers of the sorted list.\
+"""
@@ -0,0 +1,29 @@
+MODE_EXAMPLE = """\
+```example
+input: 1
+ideal_response: [mode: 1]\
+ # your response; 1 is the only number shown so far
+---
+input: 2
+ideal_response: [mode: 2]\
+ # 1 and 2 are tied modes (both appeared once), 2 > 1
+---
+input: 1
+ideal_response: [mode: 1]\
+ # 1 now has appeared more than any other number
+---
+input: 3
+ideal_response: [mode: 1]
+---
+input: 3
+ideal_response: [mode: 3]\
+ # 3 is tied with 1 in terms of appearances, 3 > 1
+---
+input: 0
+ideal_response: [mode: 3]
+```\
+"""
+
+MODE_FURTHER_DETAILS = """\
+NOTE: In case of ties, you should respond with the largest number that is part of the tie.\
+"""