# Pipeline for experimenting with and evaluating LLM ability to audit smart contracts
- Set up two environment variables (recommended to do this inside your shell config):

  ```shell
  export ICL_USERNAME={imperial_username}
  export ICL_PASSWORD={imperial_password}
  ```

- Run `./start-job.sh` to reserve a job.
- Run `./run-server.sh -p {PORT} -e {ENV}` to start a Jupyter server on port `{PORT}` running your `{ENV}` Python environment.
This repo contains a web application that steps through model outputs and allows manual labeling of the results. The accuracy of the evaluating model is then approximated using a Sequential Massart algorithm (see `formal-verification/sequential_massart_smc.ipynb`).
- Run `make install`.
- Run `make run` to start the app.
In total, 213 rows were verified, where each row contains 3 criteria, giving 639 samples. Below is the resulting accuracy of `gpt-4o-mini` for evaluating audits:
    633: Interval: [0.9223155036427156, 0.9780027129062199] - Samples: 634.0 - Estimate: 0.9557661927330173
The model achieves an estimated evaluation accuracy of roughly 95.6%. More precisely, the accuracy lies between 92.2% and 97.8% at the confidence level used by the algorithm.
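
The sequential estimation loop behind this result can be sketched as follows. This is a minimal illustration that substitutes the looser Hoeffding bound for the sharper Massart bound used in the notebook, and the `delta` and `width` parameters are hypothetical, not the repo's actual settings:

```python
import math

def hoeffding_interval(successes: int, n: int, delta: float):
    """Two-sided confidence interval of level 1 - delta via Hoeffding's inequality."""
    p_hat = successes / n
    eps = math.sqrt(math.log(2 / delta) / (2 * n))
    return max(0.0, p_hat - eps), min(1.0, p_hat + eps)

def sequential_estimate(labels, delta=0.05, width=0.06):
    """Consume 0/1 correctness labels one at a time and stop as soon as
    the confidence interval is narrower than `width`."""
    successes = 0
    n = 0
    for label in labels:
        n += 1
        successes += label
        lo, hi = hoeffding_interval(successes, n, delta)
        if hi - lo <= width:
            break  # interval is tight enough; no more samples needed
    return n, successes / n, (lo, hi)
```

A Massart-style bound replaces the fixed `eps` with one that depends on the running estimate, which lets the loop stop earlier when the true accuracy is far from 0.5 — hence the fairly tight interval obtained above from a few hundred samples.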