Disclaimer: This guidance is highly opinionated and is currently a best-effort attempt to help Cosmos teams with their incident response processes. It is expected to be revised and improved over time with the help and feedback of other Cosmos teams and community members.
Confidential. Do not share your incident response plan with anyone outside your circle of trust.
This document offers recommendations and guidance for core teams to help them deal with major security incidents and control their damage.
- Incident response plan (IR): A combination of processes, documents and playbooks that standardize how a team reacts and responds to security incidents in the most effective way.
- Playbook: An incident response manual that helps a team address a specific situation, e.g. an on-chain security incident or a social media account takeover.
- War room: A meeting point (either virtual or in-person) where incident responders and other stakeholders within the circle of trust coordinate and work through a major security incident affecting an organization.
- Circle of trust: The set of trusted stakeholders, both internal team members and trusted external collaborators, that can access a team's IR plan and potentially join a war room when a major incident occurs.
It is recommended to define substitutes in case the primary team members are unavailable when an incident occurs. Spreading substitutes across different time zones improves availability.
- Lead Core Dev
  - Leads the vulnerability identification and evaluation process
  - Leads the creation of, and decisions on, emergency actions e.g. patches
  - Leads the planning and creation of long-term mitigations
- Secondary Core Dev
  - Carefully reviews and shadows every action and decision made by the lead core dev
- Web UI Lead
  - Updates or disables the website/app as needed
  - Creates and deploys UI banner alerts if applicable
- Facilitator
  - Coordinates the overall incident response process
  - Facilitates outreach to internal and external stakeholders
  - Coordinates emergency patches if applicable
  - Coordinates emergency governance proposals if applicable
  - Acts as multi-sig herder if applicable
- Ops
  - Sets up the war room
  - Records minutes and timeline
  - Assists the Facilitator in any required operational task
  - Leads community communications
  - Drafts the post-mortem
- External
  - Validators
  - Auditors
Name | Function | Phone | Signal | Telegram | Other |
---|---|---|---|---|---|
Alice | Lead Core Dev | xxx-xxx-xxxx | something | something | something |
Bob | Facilitator | yyy-yyy-yyyy | something | something | something |
Name | Function | Other | Security page | |
---|---|---|---|---|
Super Security | Audit Firm | [email protected] | something | something |
Mallory | Trusted Validator | [email protected] | something | n/a |
Protocol X | Protocol dependency | [email protected] | something | something |
Infra Partner Z | Protocol infra | [email protected] | something | something |
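
The contact tables above, combined with the recommendation to define substitutes across time zones, could also be kept in a machine-readable form so the Facilitator can quickly see who to page. The following is a minimal Python sketch under stated assumptions, not a prescribed tool: the `Responder` fields, the waking-hours rule in `is_awake` (08:00 to 22:00 local time) and the `pick_responder` helper are all hypothetical, and "Carol" and the UTC offsets are made-up placeholders.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Responder:
    name: str
    role: str              # e.g. "Lead Core Dev", "Facilitator"
    utc_offset_hours: int  # placeholder for the responder's time zone
    primary: bool = True   # False marks a substitute


def is_awake(responder, now_utc, start_hour=8, end_hour=22):
    """Assumed rule: reachable between start_hour and end_hour local time."""
    local = now_utc + timedelta(hours=responder.utc_offset_hours)
    return start_hour <= local.hour < end_hour


def pick_responder(roster, role, now_utc):
    """Return the first reachable responder for a role, primaries first.

    Falls back to the primary (or first listed) responder when nobody is
    within waking hours, since a major incident still needs an owner.
    Returns None if the role is not staffed at all.
    """
    candidates = [r for r in roster if r.role == role]
    if not candidates:
        return None
    candidates.sort(key=lambda r: not r.primary)  # stable sort: primaries first
    for responder in candidates:
        if is_awake(responder, now_utc):
            return responder
    return candidates[0]


# Placeholder roster; names, roles and offsets are illustrative only.
ROSTER = [
    Responder("Alice", "Lead Core Dev", utc_offset_hours=1),
    Responder("Carol", "Lead Core Dev", utc_offset_hours=-8, primary=False),
    Responder("Bob", "Facilitator", utc_offset_hours=9),
]

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    for role in ("Lead Core Dev", "Facilitator", "Ops"):
        chosen = pick_responder(ROSTER, role, now)
        print(role, "->", chosen.name if chosen else "UNSTAFFED")
```

A role that prints `UNSTAFFED` is a gap in the plan worth closing before an incident, and a role whose candidates all share one time zone is a hint that a substitute elsewhere would help.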
- Define internal communication toolkit
  - Example: Signal/Zoom (primary), Slack/Meet (backup)
- Define external communication toolkit
  - Example: Discord (primary), Telegram (backup)
- Assign roles in your team
- Define your circle of trust (internal roles + selected external participants e.g. past auditors)
- Document the contact details of internal roles and external participants, and keep them up to date
- Consider what circumstances would require emergency actions in your protocol
- Create potential hack scenarios
  - Example: How to Hack the Yield Protocol
- Define which sensitive logic and behaviour would be considered an anomaly, and integrate those checks with an off-chain monitoring solution
  - Consider integration with a monitoring solution e.g. Range
- Prepare mitigation scripts
- Implement circuit breakers for critical functionality
  - Example: Use x/circuit module
- Understand how long it would take to patch a vulnerability in a given component
- Describe the critical dependencies of your system and define how you will keep track of vulnerabilities and disclosures in those systems
  - Inventory of critical dependencies
  - Consider assigning one team member to be fully responsible for this task
- Drill: train employees and rehearse mock incidents
- Keep the incident response plan private, accessible exclusively to assigned roles
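
To make the checklist item on defining anomalous behaviour more concrete, here is a minimal Python sketch of one possible off-chain rule: flag an anomaly when total outflows within a sliding time window exceed a limit. Everything in it is an assumption for illustration. The `Transfer` fields, the `OutflowMonitor` class, the 10-minute window and the limit are hypothetical, and the sketch does not call any real monitoring product or chain API. It also assumes events arrive in non-decreasing timestamp order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Transfer:
    timestamp: int  # seconds since epoch (hypothetical event field)
    amount: int     # outflow in the token's base denomination


class OutflowMonitor:
    """Flags an anomaly when outflows in a sliding window exceed a limit."""

    def __init__(self, window_seconds, max_outflow):
        self.window_seconds = window_seconds
        self.max_outflow = max_outflow
        self._events = deque()
        self._total = 0

    def observe(self, transfer):
        """Record a transfer; return True if the window total breaches the limit."""
        self._events.append(transfer)
        self._total += transfer.amount
        # Drop events that fell out of the window ending at this transfer.
        cutoff = transfer.timestamp - self.window_seconds
        while self._events and self._events[0].timestamp <= cutoff:
            self._total -= self._events.popleft().amount
        return self._total > self.max_outflow


if __name__ == "__main__":
    # Assumed policy: more than 1,000,000 base units leaving in 10 minutes is anomalous.
    monitor = OutflowMonitor(window_seconds=600, max_outflow=1_000_000)
    for event in (Transfer(0, 400_000), Transfer(300, 500_000), Transfer(500, 200_000)):
        if monitor.observe(event):
            print("ANOMALY: outflow limit breached at t =", event.timestamp)
```

In a real setup a breach would typically page the Facilitator or trigger one of the prepared mitigation scripts. Choosing the window and limit is protocol-specific and is best derived from the hack scenarios listed above.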
This is a playbook for chain-related incidents. Although there may be some overlap, other security issues such as social media account takeovers or phishing attacks should have their own specific playbooks.
- Identify vulnerability
- Escalate and set up the war room with internal members
- Contact security partners and other assigned external roles (e.g. trusted validators/contributors)
- Add trusted external partners within your circle of trust to the war room
- Determine the full extent of the compromise
  - How the system was compromised
  - The breach timeline of the attack or bug
  - Initial root cause analysis
- Apply emergency mitigations: pause contracts or module functionality when relevant
- Execute an immediate patch or emergency action if relevant
  - e.g. frontrun the attacker
  - e.g. drain a compromised pool before the attacker does
- Monitor the effectiveness of the emergency remediation action
- Review related application logic to identify knock-on vulnerabilities
- Update UI e.g. disable withdrawals via the UI
- Contact other protocols and projects that may be affected by the same vulnerability
- Contact upstream maintainers if the vulnerability is at a lower level of the stack
  - Cosmos SDK
  - CometBFT
  - IBC
  - etc.
- Post a message on Discord, Twitter and other social media channels
- Prepare long-term remediation actions (e.g. patches)
  - Reviewed by validators and past auditors in your circle of trust
  - Consider a Code4rena contest (optional)
- Deploy long-term mitigation e.g. patches
- Monitor the correct functioning of the patches
- Draft post-mortem
  - To be performed by the Ops member
- Review post-mortem
  - To be performed by the lead and secondary devs and past auditors
- Publish the post-mortem on social media channels
  - Medium, Discord, Twitter, etc.
- Retrospective
  - What did we do well, and what did we not?
  - How did we handle the incident?
  - What can we improve?
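
The playbook depends on an accurate record of who did what and when, both for the breach timeline and for the post-mortem drafted by Ops. As one possible way to capture that during a war room, the Python sketch below keeps timestamped entries and renders them as a table in the same pipe style used elsewhere in this document. The `IncidentTimeline` class and its method names are hypothetical, and the sample entries are invented placeholders rather than a real incident.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentTimeline:
    title: str
    entries: list = field(default_factory=list)

    def record(self, actor, action, at=None):
        """Add an entry; defaults to the current time and normalises to UTC.

        Pass timezone-aware datetimes; naive ones are treated as local time.
        """
        at = (at or datetime.now(timezone.utc)).astimezone(timezone.utc)
        self.entries.append((at, actor, action))

    def to_markdown(self):
        """Render entries in chronological order as a pipe table."""
        lines = [
            f"Timeline: {self.title}",
            "",
            "Time (UTC) | Actor | Action |",
            "---|---|---|",
        ]
        for at, actor, action in sorted(self.entries, key=lambda e: e[0]):
            lines.append(f"{at:%Y-%m-%d %H:%M} | {actor} | {action} |")
        return "\n".join(lines)


if __name__ == "__main__":
    timeline = IncidentTimeline("Example incident (placeholder)")
    timeline.record("Bob", "War room opened")
    timeline.record("Alice", "Vulnerability identified")
    print(timeline.to_markdown())
```

Entries are sorted on output rather than on insert, so late additions reconstructed from chat logs still land in the right place.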