ApiCortex

Autonomous API failure prediction and contract testing SaaS platform with ML-powered analytics.

Built with FastAPI, Go, Rust, Python, Next.js, and Kafka.

GitHub: 0xarchit/ApiCortex
Live Demo: https://api-cortex.vercel.app

Overview

ApiCortex is an enterprise-grade SaaS platform that predicts API failures before they occur using machine learning analytics on real production traffic. The platform ensures API contract compliance and provides proactive failure detection through advanced anomaly detection algorithms.

Key Capabilities

  • Predictive Analytics: ML-powered failure prediction with 95%+ accuracy
  • Real-time Monitoring: Sub-second telemetry processing via Kafka streaming
  • Contract Validation: OpenAPI specification enforcement and drift detection
  • Multi-tenant Architecture: Organization-based isolation with RBAC
  • Time-series Analytics: Historical querying with TimescaleDB
  • Developer Dashboard: Interactive Next.js UI with live metrics

Deployment Status (MVP)

For the initial MVP launch, the platform runs as a hybrid-cloud deployment built on managed services.

| Component | Provider | Role |
| --- | --- | --- |
| Frontend | Vercel | Dashboard & Edge Proxy |
| Backend | HuggingFace | Unified Docker Orchestration |
| Metadata | NeonDB | Serverless PostgreSQL |
| Metrics | TigerData | Managed TimescaleDB |
| Streaming | Aiven Cloud | Managed Kafka |
| Caching | Upstash | Serverless Redis |

Architecture

System Flow Diagram

graph TB
    subgraph "Presentation Layer"
        A[Next.js Dashboard]
        B[REST API Clients]
    end
    
    subgraph "Control Plane"
        C[FastAPI Server]
        D[Auth Service]
        E[API Management]
        F[Contract Validator]
    end
    
    subgraph "Data Plane"
        G[Go Ingest Service]
        H[Kafka Producer]
        I[Rate Limiter]
    end
    
    subgraph "ML Plane"
        J[Python ML Service]
        K[Feature Engineering]
        L[XGBoost Predictor]
        M[Anomaly Detector]
    end
    
    subgraph "Execution Plane"
        Q[Rust Testing Engine]
        R[SSRF Shield]
        S[External APIs]
    end

    subgraph "Storage"
        N[(PostgreSQL)]
        O[(TimescaleDB)]
        P[Kafka Topics]
    end
    
    A --> C
    B --> C
    C --> D
    C --> E
    C --> F
    C <--> Q
    Q --> R
    R --> S
    G --> H
    H --> P
    J --> P
    J --> K
    K --> L
    L --> M
    C --> N
    G --> O
    J --> O

Features

Core Features

| Feature | Description | Status |
| --- | --- | --- |
| Real-time Telemetry | Collect API metrics with <10ms latency | ✔ Active |
| ML Failure Prediction | XGBoost-based anomaly detection | ✔ Active |
| Contract Validation | OpenAPI 3.0 specification enforcement | ✔ Active |
| Multi-tenant RBAC | Organization-based access control | ✔ Active |
| Time-series Analytics | Historical data querying | ✔ Active |
| Alerting System | Webhook-based notifications | ✔ Active |
| Developer Dashboard | Interactive UI with live metrics | ✔ Active |
| API Testing | High-performance Rust execution engine | ✔ Active |

Technical Specifications

  • Throughput: 10,000+ events/second
  • Latency: <50ms p99 for telemetry ingestion
  • Accuracy: 95%+ failure prediction accuracy
  • Retention: Configurable (default 30 days)
  • Scalability: Horizontal scaling with Kafka partitions

System Components

1. Data Plane (Go)

Location: ingest-service/

Responsible for high-throughput telemetry collection and streaming.

Key Files:

  • cmd/server/main.go - Application entry point
  • internal/api/handler.go - HTTP request handlers
  • internal/kafka/producer.go - Kafka producer
  • internal/buffer/batcher.go - Event batching
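The batcher flushes either when BATCH_SIZE events have accumulated or when FLUSH_INTERVAL_SECONDS has elapsed, whichever comes first. The real implementation is Go (internal/buffer/batcher.go); this is a minimal Python sketch of that flush policy, with illustrative names:

```python
import time

class Batcher:
    """Accumulate events; flush by size or age, whichever triggers first.

    The Go service also flushes on a background timer; for brevity this
    sketch only checks the interval when an event arrives.
    """

    def __init__(self, flush, batch_size=500, flush_interval=2.0):
        self.flush = flush                    # callback, e.g. a Kafka produce
        self.batch_size = batch_size          # mirrors BATCH_SIZE
        self.flush_interval = flush_interval  # mirrors FLUSH_INTERVAL_SECONDS
        self.buffer = []
        self.last_flush = time.monotonic()

    def add(self, event):
        self.buffer.append(event)
        if (len(self.buffer) >= self.batch_size
                or time.monotonic() - self.last_flush >= self.flush_interval):
            self._flush()

    def _flush(self):
        if self.buffer:
            self.flush(self.buffer)
            self.buffer = []
        self.last_flush = time.monotonic()

# Example: flush every 3 events
batches = []
b = Batcher(batches.append, batch_size=3, flush_interval=60)
for i in range(7):
    b.add(i)
# two full batches flushed; the seventh event is still buffered
```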

2. Control Plane (FastAPI)

Location: control-plane/

Handles authentication, API metadata, and contract management.

Key Files:

  • app/main.py - FastAPI application
  • app/routers/auth.py - Authentication endpoints
  • app/routers/apis.py - API management
  • app/services/contract_service.py - Contract validation
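Contract validation compares recorded responses against the registered OpenAPI schema and flags drift: required fields that went missing, undocumented fields that appeared, or fields whose type changed. A simplified, dependency-free sketch of that idea (the actual checks in app/services/contract_service.py may differ):

```python
# Map of OpenAPI primitive types to Python types (deliberately simplified)
TYPES = {"string": str, "integer": int, "number": (int, float), "boolean": bool}

def diff_against_schema(schema: dict, payload: dict) -> list[str]:
    """Return drift findings for a flat object schema."""
    findings = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in payload:
            findings.append(f"missing required field: {name}")
    for name, value in payload.items():
        if name not in props:
            findings.append(f"undocumented field: {name}")
            continue
        expected = TYPES.get(props[name].get("type"))
        if expected and not isinstance(value, expected):
            findings.append(f"type drift on {name}")
    return findings

schema = {
    "required": ["id", "status"],
    "properties": {"id": {"type": "integer"}, "status": {"type": "string"}},
}
print(diff_against_schema(schema, {"id": "42", "region": "eu"}))
```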

3. ML Plane (Python)

Location: ml-service/

Processes telemetry streams and generates failure predictions.

Key Files:

  • app/main.py - ML worker entry
  • workers/inference_worker.py - Inference pipeline
  • app/features/feature_engineering.py - Feature extraction
  • app/inference/predictor.py - Model prediction

4. Presentation Plane (Next.js)

Location: frontend/

Developer dashboard for monitoring and management.

5. Execution Engine (Rust)

Location: api-testing/

High-performance, secure engine optimized for executing REST, GraphQL, and WebSocket tests.

Key Files:

  • src/main.rs - Axum server entry
  • src/executor.rs - Core execution & security logic
  • src/protocols/ - WebSocket & HTTP handlers
  • src/models.rs - Result & Snapshot schemas
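The SSRF Shield's job is to stop a test definition from targeting internal infrastructure. The engine itself is Rust (src/executor.rs), but the core check — resolve the target host and refuse private, loopback, or link-local addresses — can be sketched in Python:

```python
import ipaddress
import socket
from urllib.parse import urlparse

def is_blocked_ip(ip: str) -> bool:
    """True if the address must never be reached by the test engine."""
    addr = ipaddress.ip_address(ip)
    return (addr.is_private or addr.is_loopback or addr.is_link_local
            or addr.is_reserved or addr.is_multicast)

def assert_safe_target(url: str) -> None:
    """Resolve every A/AAAA record and refuse any internal address."""
    host = urlparse(url).hostname
    if host is None:
        raise ValueError("URL has no host")
    for info in socket.getaddrinfo(host, None):
        ip = info[4][0]
        if is_blocked_ip(ip):
            raise ValueError(f"blocked target {host} -> {ip}")

print(is_blocked_ip("10.0.0.5"))       # private range -> blocked
print(is_blocked_ip("93.184.216.34"))  # public address -> allowed
```

Note that a production guard must also re-check redirects, since a public URL can 302 to an internal one.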

Data Flow

Telemetry Data Flow

sequenceDiagram
    participant Client as API Client
    participant Ingest as Ingest Service
    participant Kafka as Apache Kafka
    participant ML as ML Service
    participant DB as TimescaleDB
    participant UI as Dashboard
    
    Client->>Ingest: POST /v1/telemetry
    Ingest->>Ingest: Validate & Buffer
    Ingest->>Kafka: Publish telemetry.raw
    Ingest->>DB: Store telemetry
    Ingest-->>Client: 200 OK
    
    ML->>Kafka: Consume telemetry.raw
    ML->>ML: Feature Engineering
    ML->>ML: XGBoost Prediction
    ML->>DB: Store prediction
    ML->>Kafka: Publish alerts
    
    UI->>DB: Query metrics
    UI->>UI: Display charts
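The flow above begins with a single POST /v1/telemetry call from the instrumented client. A hedged Python example of what such a submission might look like — the payload fields and auth header here are illustrative assumptions, not the documented schema:

```python
import json
import urllib.request

def build_telemetry_event(endpoint: str, status: int, latency_ms: float) -> dict:
    # Illustrative payload shape; the real schema may differ.
    return {
        "endpoint": endpoint,
        "status_code": status,
        "latency_ms": latency_ms,
    }

def send_telemetry(base_url: str, api_key: str, event: dict) -> int:
    req = urllib.request.Request(
        f"{base_url}/v1/telemetry",
        data=json.dumps(event).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {api_key}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status

event = build_telemetry_event("/orders", 500, 132.4)
# send_telemetry("http://localhost:8080", "my-key", event)  # needs a running ingest service
```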

Prediction Flow

flowchart TD
    A[Telemetry Event] --> B{Kafka Consumer}
    B --> C[Feature Extraction]
    C --> D[1m Window Stats]
    C --> E[5m Window Stats]
    C --> F[15m Window Stats]
    D --> G[Feature Vector]
    E --> G
    F --> G
    G --> H{XGBoost Model}
    H --> I[Risk Score]
    I --> J{Threshold Check}
    J -->|Score > 0.8| K[Generate Alert]
    J -->|Score ≤ 0.8| L[Store Prediction]
    K --> M[Kafka Alerts Topic]
    L --> N[TimescaleDB]
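The flowchart above condenses to: compute rolling-window statistics, assemble a feature vector, score it, and compare against the threshold. A Python sketch of that pipeline — the production model is XGBoost; the `risk_score` function here is a stand-in marking where inference runs:

```python
import time

WINDOWS = (60, 300, 900)  # 1m / 5m / 15m, in seconds

def window_stats(events, now, seconds):
    """Error rate and mean latency over the trailing window."""
    recent = [e for e in events if now - e["ts"] <= seconds]
    if not recent:
        return {"error_rate": 0.0, "mean_latency": 0.0}
    errors = sum(1 for e in recent if e["status"] >= 500)
    return {
        "error_rate": errors / len(recent),
        "mean_latency": sum(e["latency_ms"] for e in recent) / len(recent),
    }

def feature_vector(events, now):
    feats = {}
    for w in WINDOWS:
        s = window_stats(events, now, w)
        feats[f"error_rate_{w}s"] = s["error_rate"]
        feats[f"mean_latency_{w}s"] = s["mean_latency"]
    return feats

def risk_score(feats):
    # Stand-in for the XGBoost model: weight the short window heavily.
    return min(1.0, 0.9 * feats["error_rate_60s"] + 0.1 * feats["error_rate_900s"])

now = time.time()
events = [{"ts": now - 10, "status": 500, "latency_ms": 900.0},
          {"ts": now - 20, "status": 200, "latency_ms": 80.0}]
feats = feature_vector(events, now)
alert = risk_score(feats) > 0.8  # ALERT_THRESHOLD
```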

Getting Started

Prerequisites

  • Go: 1.26 or later
  • Python: 3.11 or later
  • Node.js: 22 or later
  • PostgreSQL: 16+ or NeonDB
  • TimescaleDB: Latest version
  • Apache Kafka: 3.0 or later

Installation

# Clone repository
git clone https://github.com/0xarchit/apicortex.git
cd apicortex

# Set up environment variables
cp .env.example .env
# Edit .env with your credentials

# Start infrastructure (Docker)
docker-compose up -d

Running Services

# Ingest Service
cd ingest-service && go run cmd/server/main.go

# Control Plane
cd control-plane && uvicorn app.main:app --reload

# ML Service
cd ml-service && python app/main.py

# API Testing Engine (Rust)
cd api-testing && cargo run

# Frontend
cd frontend && npm run dev

Configuration

Environment Variables

| Variable | Service | Description | Default |
| --- | --- | --- | --- |
| DATABASE | Control Plane | PostgreSQL connection string | - |
| TIMESCALE_DATABASE | All | TimescaleDB connection string | - |
| KAFKA_SERVICE_URI | Ingest, ML | Kafka broker URI | - |
| ACTIVE_POLLING_ENABLED | Ingest | Enable active polling | true |
| BATCH_SIZE | Ingest | Kafka batch size | 500 |
| MODEL_PATH | ML | Path to XGBoost model | model/xgboost.pkl |
| ALERT_THRESHOLD | ML | Alert threshold (0-1) | 0.8 |
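A small sketch of how a service might read these variables, applying the defaults from the table and failing fast on required settings (illustrative; each service has its own config loader):

```python
import os

def load_ml_config(env=os.environ):
    """Read ML-service settings with the documented defaults."""
    return {
        "model_path": env.get("MODEL_PATH", "model/xgboost.pkl"),
        "alert_threshold": float(env.get("ALERT_THRESHOLD", "0.8")),
        "kafka_uri": env.get("KAFKA_SERVICE_URI"),  # required, no default
    }

cfg = load_ml_config({})
if cfg["kafka_uri"] is None:
    print("KAFKA_SERVICE_URI is not set")  # fail fast in a real service
```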

Service Configuration

Ingest Service (ingest-service/.env):

PORT=8080
KAFKA_SERVICE_URI=kafka:9092
BATCH_SIZE=500
FLUSH_INTERVAL_SECONDS=2
ACTIVE_POLLING_ENABLED=true

Control Plane (control-plane/.env):

DATABASE=postgresql://user:pass@host:5432/db
JWT_SECRET_KEY=your-secret-key
OAUTH_GITHUB_CLIENT_ID=your-client-id

ML Service (ml-service/.env):

KAFKA_TOPIC_RAW=telemetry.raw
MODEL_PATH=model/xgboost_failure_prediction.pkl
ALERT_THRESHOLD=0.8
ENABLE_SHAP=true

Usage

Dashboard Access

  1. Open browser: http://localhost:3000
  2. Sign in with OAuth (Google/GitHub)
  3. Navigate to Dashboard

API Endpoints

| Endpoint | Method | Description |
| --- | --- | --- |
| /auth/login | POST | User authentication |
| /apis | GET | List APIs |
| /apis/{id}/endpoints | GET | Get API endpoints |
| /telemetry | POST | Submit telemetry |
| /predictions | GET | Get predictions |
| /dashboard/metrics | GET | Dashboard metrics |
| /testing/execute | POST | Execute API test |

Monitoring

Health Checks

| Service | Endpoint | Port |
| --- | --- | --- |
| Ingest | /health | 8080 |
| API Testing | /health | 9090 |
| Control Plane | /health | 8000 |
| Frontend | / | 3000 |
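The table can be turned into a quick local health sweep. A Python sketch, assuming all services run on localhost with the ports above:

```python
import urllib.request

# (service, health URL) pairs taken from the health-check table
CHECKS = [
    ("Ingest", "http://localhost:8080/health"),
    ("API Testing", "http://localhost:9090/health"),
    ("Control Plane", "http://localhost:8000/health"),
    ("Frontend", "http://localhost:3000/"),
]

def sweep(checks=CHECKS, timeout=2.0):
    """Probe every endpoint; a service is 'up' only on HTTP 200."""
    results = {}
    for name, url in checks:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                results[name] = resp.status == 200
        except OSError:
            results[name] = False
    return results

# for name, ok in sweep().items():
#     print(f"{name}: {'up' if ok else 'DOWN'}")
```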

Logging

All services use structured logging in JSON format:

  • Ingest: Zerolog
  • Control Plane: Python logging
  • ML Service: Python logging

Troubleshooting

Common Issues

Services Won't Start

Solution:

# Check environment variables (names per the Configuration section)
printenv | grep -E 'DATABASE|KAFKA|BATCH_SIZE|MODEL_PATH'

# Verify database connectivity
psql $DATABASE -c "SELECT 1"

# Check Kafka connection
kafka-consumer-groups --bootstrap-server $KAFKA_SERVICE_URI --list

High Memory Usage

Solution:

# Reduce batch size
BATCH_SIZE=100

# Limit buffer capacity
MAX_BUFFER_CAPACITY=10000

Kafka Consumer Lag

Solution:

  • Increase consumer parallelism
  • Add more ML worker instances
  • Check network connectivity

Debug Mode

DEBUG=true
LOG_LEVEL=debug

Security

Authentication Flow

sequenceDiagram
    participant User
    participant Frontend
    participant ControlPlane
    participant OAuth
    participant DB
    
    User->>Frontend: Click "Login"
    Frontend->>ControlPlane: Initiate OAuth
    ControlPlane->>OAuth: Redirect
    User->>OAuth: Authenticate
    OAuth->>ControlPlane: OAuth Callback
    ControlPlane->>DB: Create/Update User
    ControlPlane->>Frontend: JWT Token
    Frontend->>User: Dashboard Access

API Key Management

  • Keys are hashed with pepper before storage
  • Keys are rotated every 90 days
  • Audit logging for all key operations
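"Hashed with pepper" means the stored digest mixes in a server-side secret, so a leaked database alone is not enough to verify or forge keys. One common construction is HMAC-SHA-256; the exact scheme used here isn't specified, so treat this as a sketch:

```python
import hashlib
import hmac

PEPPER = b"server-side-secret-from-env"  # kept out of the database

def hash_api_key(api_key: str) -> str:
    """Peppered digest stored instead of the raw key."""
    return hmac.new(PEPPER, api_key.encode(), hashlib.sha256).hexdigest()

def verify_api_key(presented: str, stored_digest: str) -> bool:
    # compare_digest avoids timing side channels
    return hmac.compare_digest(hash_api_key(presented), stored_digest)

digest = hash_api_key("ak_live_123")
print(verify_api_key("ak_live_123", digest))  # True
print(verify_api_key("ak_live_124", digest))  # False
```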

Contributing

  1. Fork the repository
  2. Create feature branch
  3. Submit pull request
  4. Pass CI/CD pipeline

Development Setup

# Install dependencies
go mod download
pip install -r requirements.txt
npm install

# Run tests
go test ./...
pytest
npm test

Support