AIOps Platforms & Frameworks

Full-Stack AIOps Platforms

  • alibaba/SREWorks - SREWorks is Alibaba Cloud's cloud-native AIOps and DataOps platform, designed to improve IT operations and maintenance using AI and big data.
  • alibaba/UnifiedModel - UModel is a vendor-neutral semantic runtime that provides a unified object graph for enterprise AI agents, enabling them to understand and process fragmented IT operational data, services, and busi...
  • HoloInsight/holoinsight - HoloInsight is a cloud-native observability platform that emphasizes real-time log analysis and integrates AI for enhanced monitoring and insights.
  • keephq/keep - Keep is an open-source AIOps and alert management platform that centralizes alerts, deduplicates, enriches, filters them, and correlates incidents using AI-powered automation and various backend in...
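
As a rough illustration of the alert deduplication step that platforms in this category describe, the sketch below collapses alerts that share a fingerprint. It is not taken from Keep or any other listed project; the field names (`source`, `service`, `name`, `timestamp`) and the `fingerprint`/`deduplicate` helpers are hypothetical.

```python
import hashlib
import json


def fingerprint(alert, keys=("source", "service", "name")):
    """Return a stable hash over the fields assumed to identify 'the same' alert."""
    identity = {k: alert.get(k) for k in keys}
    payload = json.dumps(identity, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def deduplicate(alerts, keys=("source", "service", "name")):
    """Collapse alerts sharing a fingerprint; keep the first one plus a count."""
    merged = {}
    for alert in alerts:
        fp = fingerprint(alert, keys)
        if fp in merged:
            # Repeat of a known alert: bump the counter and remember when it last fired.
            merged[fp]["count"] += 1
            merged[fp]["last_seen"] = alert.get("timestamp")
        else:
            merged[fp] = {**alert, "count": 1, "last_seen": alert.get("timestamp")}
    return list(merged.values())
```

Which fields go into the fingerprint is the main design choice: too few and unrelated alerts merge, too many (for example, including the timestamp) and nothing deduplicates.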

Predictive Analytics for Operations

  • aqstack/sentinel - Sentinel is a self-healing edge computing agent for Kubernetes, providing predictive failure detection and partition-resilient orchestration for edge nodes using lightweight statistical models.

Incident Management & Response

  • Arvo-AI/aurora - Aurora is an open-source, AI-powered incident management platform that uses AI agents to autonomously investigate incidents, perform root cause analysis, and suggest remediations across multi-cloud...
  • ongridio/ongrid - Ongrid is an AI-powered ops agent that autonomously investigates alerts, performs root-cause analysis, and orchestrates fixes for IT infrastructure issues, interacting via chat platforms.
  • papadopouloskyriakos/agentic-chatops - This project implements a multi-agent ChatOps system that uses AI/ML to autonomously triage, investigate, and propose fixes for infrastructure alerts, including self-improving prompts and a causal ...
  • Tommy-yw/RunbookHermes - RunbookHermes is an AIOps agent built on the Hermes Agent framework, designed for evidence-driven incident response, approval-gated remediation, and self-evolving runbook learning.

Observability & Monitoring with AI

Log Analysis & Intelligence

  • logpai/Drain3 - Drain3 is an online log template miner that extracts structured templates from raw log messages for enhanced observability and anomaly detection.
  • logpai/Log3C - Log3C is a framework that identifies impactful service system problems by analyzing system logs and KPI metrics through a process of log parsing, sequence vectorization, cascading clustering, and c...
  • logpai/loglizer - Loglizer is an open-source machine learning toolkit designed for automated anomaly detection in system logs, supporting various supervised and unsupervised models.
  • logpai/logparser - Logparser is a machine learning toolkit that provides automated log parsing and benchmarks for structured log analytics via event template extraction.
  • salesforce/logai - LogAI is an open-source library that provides a comprehensive platform for log analytics and intelligence, including summarization, clustering, and anomaly detection.
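
To make "log template mining" concrete, here is a minimal sketch of the general idea: mask obviously variable tokens, then merge lines of equal length that are similar enough into one template. It is a simplification written for this list, not the Drain algorithm (which uses a fixed-depth parse tree) and not the API of Drain3 or logparser; `tokenize`, `similarity`, `mine_templates`, and the `threshold` default are all assumptions.

```python
import re

WILDCARD = "<*>"
# Tokens that are plain numbers, hex ids, or IPv4 addresses (optionally with a port)
# are treated as variable parts of the message.
VARIABLE = re.compile(r"^(0x[0-9a-fA-F]+|\d+(\.\d+)*(:\d+)?)$")


def tokenize(line):
    """Split on whitespace and mask tokens that look like variables."""
    return [WILDCARD if VARIABLE.match(tok) else tok for tok in line.split()]


def similarity(template, tokens):
    """Fraction of positions where the template and the new line agree."""
    same = sum(1 for a, b in zip(template, tokens) if a == b)
    return same / len(template)


def mine_templates(lines, threshold=0.5):
    """Group log lines into templates; returns (template, count) pairs."""
    buckets = {}  # token count -> list of [template_tokens, count]
    for line in lines:
        tokens = tokenize(line)
        if not tokens:
            continue
        bucket = buckets.setdefault(len(tokens), [])
        best = max(bucket, key=lambda e: similarity(e[0], tokens), default=None)
        if best is not None and similarity(best[0], tokens) >= threshold:
            # Merge: positions that differ become wildcards.
            best[0] = [a if a == b else WILDCARD for a, b in zip(best[0], tokens)]
            best[1] += 1
        else:
            bucket.append([tokens, 1])
    return [(" ".join(t), c) for bucket in buckets.values() for t, c in bucket]
```

The resulting template counts can then feed the downstream steps several entries mention, such as clustering or anomaly detection on event sequences.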

Infrastructure Monitoring

  • linkedin/cruise-control - Cruise Control is a self-healing and workload rebalancing tool for Apache Kafka clusters, optimizing resource utilization and detecting anomalies to simplify large-scale operations.

Root Cause Analysis

  • cuebook/CueObserve - CueObserve is an open-source platform for time-series anomaly detection and root cause analysis directly on data warehouses, designed to monitor key metrics and identify causative factors.
  • derisk-ai/OpenDerisk - OpenDeRisk is an AI-native risk intelligence system providing 24/7 comprehensive protection for application systems through multi-agent collaboration for deep root cause analysis.
  • HolmesGPT/holmesgpt - HolmesGPT is an open-source AI agent for SRE that investigates production incidents and finds root causes, and can run around the clock to identify and fix problems automatically.
  • kubeshark/kubeshark - Kubeshark provides eBPF-powered network observability for Kubernetes: it indexes L4/L7 traffic with full K8s context, decrypts TLS, and makes the results queryable by AI agents and humans.
  • openrca/orca - OpenRCA provides automated root cause analysis for Kubernetes clusters by constructing a real-time topology graph enriched with telemetry data from various sources.
  • scitix/siclaw - Siclaw is an open-source AI agent designed for DevOps and SRE teams to perform read-only infrastructure diagnostics and root-cause analysis through deep investigation workflows.
  • shaido987/riskloc - RiskLoc is an AI-powered method for localizing multi-dimensional root causes in time-series data, identifying the specific dimensions and values contributing to anomalies.
  • tangpan360/MicroRCA-Agent - MicroRCA-Agent is an LLM-agent-based solution for microservice root cause localization and fault analysis, processing multi-modal Log, Trace, and Metric data.
  • Tracer-Cloud/opensre - OpenSRE is an open-source framework for building, training, and evaluating AI SRE agents specifically designed for incident investigation and response, running on your own infrastructure.

Anomaly Detection

  • activecm/rita - RITA (Real Intelligence Threat Analytics) is an open-source framework that detects command and control (C2) communication by analyzing network traffic, identifying beaconing, long connections, DNS ...
  • d0ng1ee/logdeep - LogDeep is an open-source deep learning-based toolkit for automated log anomaly detection, implementing state-of-the-art models like DeepLog, LogAnomaly, and RobustLog.
  • datamllab/tods - TODS is a comprehensive automated machine learning system for multivariate time-series outlier detection, providing modules for preprocessing, feature extraction, and a wide array of detection algo...
  • earthgecko/skyline - Skyline is a real-time anomaly detection and time series analysis system designed for passive monitoring of numerous high-resolution metrics without pre-configured models or thresholds.
  • jixinpu/aiopstools - AIopstools is a Python toolkit offering fundamental AI-driven functionalities for IT operations, including anomaly detection, alarm convergence, time series forecasting, and alarm association analy...
  • khundman/telemanom - Telemanom is a framework using LSTMs and automatic thresholding for unsupervised anomaly detection in multivariate time series data, originally developed for spacecraft telemetry.
  • kLabUM/rrcf - rrcf provides a Python implementation of the Robust Random Cut Forest algorithm for anomaly detection on streaming data, designed for high-dimensional datasets.
  • MentatInnovations/datastream.io - datastream.io is an open-source framework for real-time anomaly detection in streaming data using Python, Elasticsearch, and Kibana.
  • sintel-dev/Orion - Orion is an open-source machine learning library from MIT's Data to AI Lab, focused on unsupervised time series anomaly detection using various AI-driven pipelines.
  • Stream-AD/MIDAS - MIDAS is a C++ implementation for real-time anomaly detection in dynamic, time-evolving graphs, designed to identify intrusions, fraud, and fake rating anomalies with high accuracy and speed.
  • xuhongzuo/DeepOD - DeepOD is an open-source Python library providing a unified API for 27 deep learning-based outlier and anomaly detection algorithms for tabular and time-series data.
  • yzhao062/pyod - PyOD is a comprehensive Python library for multi-modal anomaly detection, offering 60+ detectors and an agentic workflow for AI agents to drive investigations across various data types.
  • zillow/luminaire - Luminaire is a Python package from Zillow that provides ML-driven solutions for monitoring time series data through automated anomaly detection and forecasting.
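
For readers new to the area, the sketch below shows one simple baseline for unsupervised time-series anomaly detection without hand-set static thresholds: a rolling robust z-score built from the median and the median absolute deviation (MAD). It is not the method of any project listed above; the function name, the `window` size, and the 3.5 cutoff are assumptions chosen for illustration.

```python
from collections import deque
from statistics import median


def rolling_mad_anomalies(values, window=20, threshold=3.5):
    """Return indices whose robust z-score against the previous `window` points exceeds `threshold`."""
    history = deque(maxlen=window)
    flagged = []
    for i, x in enumerate(values):
        if len(history) == window:
            med = median(history)
            mad = median(abs(v - med) for v in history)
            if mad > 0:
                # 0.6745 rescales MAD so the score is comparable to a z-score
                # when the data are roughly normal.
                score = 0.6745 * abs(x - med) / mad
            else:
                # A perfectly flat window: any deviation at all is unusual.
                score = float("inf") if x != med else 0.0
            if score > threshold:
                flagged.append(i)
        # The point joins the history either way; median and MAD tolerate a few outliers.
        history.append(x)
    return flagged
```

Median and MAD are used instead of mean and standard deviation so that a single spike does not inflate the baseline and hide the next one. The libraries above go much further, for example with learned models, seasonality handling, and streaming forests.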

Automation & Self-Healing

AI-Powered Automation

  • bgdnvk/clanker - Clanker is an AI-powered CLI agent designed for autonomous systems engineering across various cloud environments, enabling intelligent infrastructure operations and agent-human collaboration.
  • bolivian-peru/os-moda - osModa is an AI-native operating system based on NixOS, allowing AI agents to manage server operations through typed, auditable tool access and atomic rollbacks.
  • getsavvyinc/savvy-cli - Savvy is a CLI tool that uses AI to create, share, and run command-line workflows, leveraging natural language for automation and explanation of commands and error messages.
  • microsoft/AIOpsLab - AIOpsLab is a comprehensive framework for designing, developing, and evaluating autonomous AIOps agents, providing reproducible and scalable benchmarks for AIOps solutions.

ChatOps & AI Assistants for Ops

  • BUAADreamer/EasyRAG - EasyRAG is an efficient Retrieval-Augmented Generation (RAG) framework designed for automated network operations, achieving top results in the CCF AIOps International Challenge 2024.
  • Higangssh/homebutler - HomeButler is a homelab management tool that uses AI agents and structured interfaces to monitor, diagnose, and automate operations for self-hosted applications and services.
  • WeOps-Lab/OpsPilot - OpsPilot is an open-source intelligent operation and maintenance assistant that uses deep learning and LLM technology to link various O&M systems for enhanced capabilities.

AI for Security Operations (SecOps)

Threat Intelligence

  • taranis-ai/taranis-ai - Taranis AI is an open-source intelligence (OSINT) tool that leverages AI and NLP to gather, analyze, and structure information from diverse sources for situational analysis and threat intelligence.
  • thalesgroup-cert/Watcher - Watcher is an AI-powered open-source platform for cybersecurity threat intelligence and hunting, designed to discover and monitor emerging cyber threats.

Security Monitoring

  • backbay-labs/clawdstrike - Clawdstrike is an AI-powered Endpoint Detection and Response (EDR) system providing policy enforcement, a signed audit chain, and threat detection for developer workstations and autonomous agent fl...
  • beenuar/AiSOC - AiSOC is an open-source, self-hostable AI-powered Security Operations Center that ingests, correlates, and investigates security events using AI, providing a transparent investigation ledger.

AI for Cloud & Infrastructure

Cloud Cost Optimization

  • infracost/infracost - Infracost provides cloud cost estimates and FinOps best practices for Infrastructure as Code (IaC) by integrating with CI/CD pipelines, IDEs, and AI coding agents to enable cost-aware development.
  • openops-cloud/openops - OpenOps is a no-code FinOps automation platform that uses AI to optimize cloud costs and streamline financial operations through customizable workflows and integrations.
  • realopslabs/kubeledger - KubeLedger is a Kubernetes cost accounting system that tracks CPU, memory, and GPU usage per namespace, making hidden non-allocatable overhead visible for precise financial analysis and optimization.

Container & Kubernetes Intelligence

  • aliyun/alibabacloud-ack-mcp-server - ACK MCP Server by Alibaba Cloud unifies container operations for AI assistants, enabling natural language interaction to manage Kubernetes resources, observability, and diagnostic tasks, facilitati...
