A curated list of awesome academic research and industrial materials about Artificial Intelligence for IT Operations (AIOps).
| China (& HK SAR) | | | |
|---|---|---|---|
| Michael R. Lyu, CUHK | Dongmei Zhang, Microsoft | Pengfei Chen, SYSU | Dan Pei, Tsinghua |
| Xin Peng, Fudan | | | |
| USA | | | |
| Ryan Huang, JHU | Yingnong Dang, Microsoft | Christina Delimitrou, MIT EECS | |
| Europe | | | |
| Odej Kao, TU Berlin | | | |
| Australia | | | |
| Hongyu Zhang, UON | | | |
- [AIOps Challenge] A series of AIOps competitions hosted by Tsinghua University
- [PAKDD2020] Alibaba AIOps Competition
- [VMware] Proactive Incident and Problem Management
- [GREATOPS Efficient Operations Community] White paper: "Enterprise AIOps Implementation Recommendations"
- [Awesome Open Source] AIOps Handbook
- [Moogsoft] What is AIOps?
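- [Tsinghua University] Dan Pei (Tsinghua): 15 Principles for Putting AIOps into Practice
- [Tsinghua University] Dan Pei (Tsinghua): The Last Mile of Delivering AIOps Results
- [Alibaba Cloud] Intelligent Network Analysis Based on Big Data: Qitian (齐天)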
- [Microsoft] Advancing Azure service quality with artificial intelligence: AIOps
- [Grafana] GrafanaCON: Grafana Observability Conference 2022
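- [InfoQ] Will 2023 Be the "Breakout Year" for Observability Demand?
- [Alibaba] Alibaba Cloud's Zhang Jianfeng on the New Computing Architecture: The Cloud Is Reshaping the Worlds of Hardware, Software, and Devices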
- [Cornell] DeathStarBench (An open-source benchmark suite for cloud microservices)
- [Google Cloud] Online Boutique (A microservices demo application)
- [Fudan] Train Ticket (A benchmark microservice system)
- [Weaveworks] Sock Shop (A microservices demo application)
- [Log Analytics] LogPAI
- [AI for Cloud Operation] OpsPAI
- [Outlier Detection] PyOD
- [Anomaly Detection] ADTK
- [Anomaly Detection] PySAD
- [Online Machine Learning] River
- [Online Machine Learning] scikit-multiflow
- [Fault Injection] Chaos Mesh
- [Fault Injection] ChaosBlade
- [Container Monitoring] cAdvisor
- [Performance Monitoring] Netdata
- [Anomaly Detection Labeling Tool] Microsoft TagAnomaly
- [Serverless App Dev. Framework] AWS Serverless Application Model (AWS SAM)
- [Performance Testing Tool] Locust
- [Alibaba Java Diagnostic Tool] Arthas
- Datadog: A monitoring and security platform for cloud applications
- BizSeer (必示)
- Rizhiyi (日志易)
- Bonree (博睿数据)
- TINGYUN (听云): An end-to-end, full-platform application performance management system
- Loom Systems
- Keep: Open-source alert management and AIOps platform
- ICSE '21 Workshop on Cloud Intelligence
- AAAI-20 Workshop on Cloud Intelligence
- AIOPS 2020 (International Workshop on Artificial Intelligence for IT Operations)
- [arXiv '24] A Survey on Failure Analysis and Fault Injection in AI Systems
- [arXiv '23] AI for IT Operations (AIOps) on Cloud Platforms: Reviews, Opportunities and Challenges
- [CSUR '22] Anomaly Detection and Failure Root Cause Analysis in (Micro) Service-Based Cloud Applications: A Survey
- [ASE '22] Going through the Life Cycle of Faults in Clouds: Guidelines on Fault Handling
- [arXiv '21] Experience Report: Deep Learning-based System Log Analysis for Anomaly Detection
- [CSUR '21] A Survey on Automated Log Analysis for Reliability Engineering
- [ESEC/FSE '20] Towards intelligent incident management: why we need it and how we make it
- [arXiv '20] A Systematic Mapping Study in AIOps
- [ICSE '19] AIOps: Real-World Challenges and Research Innovations
- [HotOS '19] What bugs cause production cloud incidents?
- [ISSRE '16] Experience Report: System Log Analysis for Anomaly Detection
- [ASE '13] Software analytics for incident management of online services: An experience report
- [arXiv '22] Constructing Large-Scale Real-World Benchmark Datasets for AIOps
- [ASPLOS '19] An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud and Edge Systems
- [ISSTA '24] LILAC: Log Parsing using LLMs with Adaptive Parsing Cache
- [arXiv '24] Exploring LLM-based Agents for Root Cause Analysis
- [arXiv '24] Nissist: An Incident Mitigation Copilot based on Troubleshooting Guides
- [arXiv '24] Automated Root Causing of Cloud Incidents using In-Context Learning with GPT-4
- [arXiv '23] Automatic Root Cause Analysis via Large Language Models for Cloud Incidents
- [arXiv '23] OpsEval: A Comprehensive Task-Oriented AIOps Benchmark for Large Language Models
- [arXiv '23] Xpert: Empowering Incident Management with Query Recommendations via Large Language Models
- [arXiv '23] Exploring the Effectiveness of LLMs in Automated Logging Generation: An Empirical Study
- [arXiv '23] Assess and Summarize: Improve Outage Understanding with Large Language Models
- [arXiv '23] Empower Large Language Model to Perform Better on Industrial Domain-Specific Question Answering
- [arXiv '23] Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models
- [SoCC '19] A System-Wide Debugging Assistant Powered by Natural Language Processing
- [ICSE-SEIP '22] Mining Root Cause Knowledge from Cloud Service Incident Investigations for AIOps
- [ICSE-SEIP '21] Neural knowledge extraction from cloud service incidents
- [arXiv '21] SoftNER: Mining Knowledge Graphs From Cloud Incidents
- [APPLSCI '20] A Causality Mining and Knowledge Graph Based Method of Root Cause Diagnosis for Performance Anomaly in Cloud Applications
- [ASPLOS '21] Sage: Practical & Scalable ML-Driven Performance Debugging in Microservices
- [ICDCS '21] Defuse: A Dependency-Guided Function Scheduler to Mitigate Cold Starts on FaaS Platforms
- [ESEC/FSE '20] Graph-based trace analysis for microservice architecture understanding and problem diagnosis
- [OSDI '20] FIRM: An Intelligent Fine-grained Resource Management Framework for SLO-Oriented Microservices
- [ESEC/FSE '19] Latent Error Prediction and Fault Localization for Microservice Applications by Learning from System Trace Logs
- [TSE '18] Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study
- [ASE '21] AID: Efficient Prediction of Aggregated Intensity of Dependency in Large-scale Cloud Systems [code]
- [NSDI '07] X-Trace: A Pervasive Network Tracing Framework
- [HotNets '06] Discovering Dependencies for Network Management
- [ICSE '23] CONAN: Diagnosing Batch Failures for Cloud Systems
- [ISSRE '22] Share or Not Share? Towards the Practicability of Deep Models for Unsupervised Anomaly Detection in Modern Online Systems [code]
- [ICSE '22] Adaptive Performance Anomaly Detection for Online Service Systems via Pattern Sketching [code]
- [KDD '19] Time-Series Anomaly Detection Service at Microsoft
- [ESEC/FSE '18] Identifying Impactful Service System Problems via Log Analysis
- [CCS '17] DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning
- [SIGCOMM '23] Murphy: Performance Diagnosis of Distributed Cloud Applications
- [ESEC/FSE '23] Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data
- [OSDI '18] Capturing and Enhancing In Situ System Observability for Failure Detection
- [USENIX ATC '23] AutoARTS: Taxonomy, Insights and Tools for Root Cause Labelling of Incidents in Microsoft Azure
- [ICSE '23] Incident-aware Duplicate Ticket Aggregation for Cloud Systems
- [SoCC '22] How to Fight Production Incidents? An Empirical Study on a Large-scale Cloud Service
- [DSN '22] Characterizing and Mitigating Anti-patterns of Alerts in Industrial Cloud Systems
- [USENIX ATC '21] Fighting the Fog of War: Automated Incident Detection for Cloud Systems
- [ASE '21] Graph-based Incident Aggregation for Large-Scale Online Service Systems
- [ASE '21] Groot: An Event-graph-based Approach for Root Cause Analysis in Industrial Settings
- [SIGCOMM '20] Scouts: Improving the Diagnosis Process Through Domain-customized Incident Routing
- [ASE '20] How Incidental are the Incidents?: Characterizing and Prioritizing Incidents for Large-Scale Online Service Systems
- [ESEC/FSE '20] Identifying linked incidents in large-scale online service systems
- [ESEC/FSE '20] Efficient incident identification from multi-dimensional issue reports via meta-heuristic search
- [ESEC/FSE '20] Real-time incident prediction for online service systems
- [ESEC/FSE '20] How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems
- [ICSE '20] Understanding and Handling Alert Storm for Online Service Systems
- [HotOS '19] What bugs cause production cloud incidents?
- [ASE '19] Continuous Incident Triage for Large-Scale Online Service Systems
- [ICSE '19] An empirical investigation of incident triage for online service systems
- [WWW '19] Outage Prediction and Diagnosis for Cloud Service Systems
- [KDD '14] Correlating Events with Time Series for Incident Diagnosis
- [FAST '23] Perseus: A Fail-Slow Detection Framework for Cloud Storage Systems [data]
- [DSN '21] General Feature Selection for Failure Prediction in Large-scale SSD Deployment
- [TOSEM '20] Predicting Node Failures in an Ultra-Large-Scale Cloud Computing Platform: An AIOps Solution
- [ICDCS '20] Toward Adaptive Disk Failure Prediction via Stream Mining
- [VLDB '20] Diagnosing root causes of intermittent slow queries in cloud databases
- [USENIX ATC '19] IASO: A Fail-Slow Detection and Mitigation Framework for Distributed Storage Services
- [NSDI '18] Deepview: Virtual Disk Failure Diagnosis and Pattern Detection for Azure
- [ESEC/FSE '18] Predicting Node Failure in Cloud Service Systems
- [USENIX ATC '18] Improving Service Availability of Cloud Systems by Predicting Disk Error
- [NSDI '22] CloudCluster: Unearthing the Functional Structure of a Cloud Service
- [OSDI '20] Predictive and Adaptive Failure Mitigation to Avert Production Cloud VM Interruptions
- [SOSP '21] Understanding and Detecting Software Upgrade Failures in Distributed Systems
- [NSDI '20] Gandalf: An Intelligent, End-To-End Analytics Service for Safe Deployment in Large-Scale Cloud Infrastructure
- [CUHK] Loghub
- [Microsoft Azure] Azure Public Dataset
- [Tsinghua] AIOps Challenge Dataset
- [Google] Cluster Traces
- [Backblaze] Hard Drive Dataset
- [Baidu] SMART Dataset of PAKDD CUP 2020
- [Red Hat] Ceph Device Telemetry Dataset
- [Alibaba] SSD SMART logs and failure data
- [Alibaba] Alibaba Cluster Trace Program
- [CloudWise] GAIA Dataset
- [Huawei Cloud] Serverless traces
- [Coursera] Cloud-Based Network Design & Management Techniques
- [Tsinghua] AIOps Course of Tsinghua