Technical Brief · Oct 2025 · 8 min read

Technical Brief: ArgusAI Deployment — On-Prem LLMs, MCP Servers, and Private Inference

Tags: technical-brief, argusai, on-premises-ai, llm-deployment, mcp-server, private-inference, architecture, era-3

Overview

ArgusAI is an on-premises AI inference platform designed for industrial operations environments where data cannot leave the facility. This brief covers the deployment architecture: the hardware and software components, how the Model Context Protocol (MCP) server layer connects the AI to operational data, and the configuration choices that affect performance and capability.


Architecture Overview

An ArgusAI deployment consists of four layers:

```mermaid
graph TD
  A[User Interface] --> B[Inference Server]
  B --> C[LLM Runtime]
  B --> D[MCP Server Layer]
  D --> E[Operational Data]
```

User Interface: The Ask Argus interface in ArgusIQ, or a custom application using ArgusAI’s inference API. Sends natural language queries; receives answers with source citations.

Inference Server: Receives queries, orchestrates MCP context retrieval, assembles the full prompt (query + context), calls the LLM runtime for inference, returns the response. The inference server is the orchestration layer.

LLM Runtime: The large language model running on GPU hardware. Receives the assembled prompt and generates the response. The LLM runtime has no direct access to operational data — it receives structured context prepared by the MCP server layer.

MCP Server Layer: Domain-specific servers that translate natural language intent into structured queries, retrieve context data from operational data sources (ArgusIQ, external databases, document repositories), and format the context for LLM consumption. Without the MCP servers, the LLM has only its training knowledge — no live operational state.


The LLM Runtime

Model Selection

ArgusAI supports instruction-tuned open-weight models in the 7B–70B parameter range. Model selection affects inference capability and hardware requirements.

Practical capability tiers:

7B models (Mistral 7B Instruct, Llama 3.1 8B Instruct, similar):

  • Suitable for straightforward operational queries: current status, recent history, list retrieval
  • Response quality degrades for complex multi-step reasoning
  • Hardware requirement: single GPU with 16–24 GB VRAM
  • Appropriate for: small facilities, limited query complexity, hardware-constrained environments

13B–34B models (Llama 2 13B, Yi 34B, and similar mid-size instruction-tuned models):

  • Good performance for operational queries with moderate complexity: pattern analysis, comparison across multiple assets, trend summaries
  • Better handling of ambiguous queries and follow-up questions in a conversation
  • Hardware requirement: single GPU with 40–80 GB VRAM (or dual-GPU configuration)
  • Appropriate for: most production deployments

70B models (Llama 3.1 70B, similar):

  • Best performance for complex analytical queries, report synthesis, multi-document reasoning
  • Appropriate for enterprise deployments with demanding analytical use cases
  • Hardware requirement: multi-GPU (2–4 GPUs), 80–160+ GB total VRAM
  • Appropriate for: large enterprise deployments, complex analytical workloads

Quantization

Model quantization reduces memory requirements at a modest accuracy cost:

FP16 (half precision): Full model capability. Memory requirement = ~2 bytes per parameter. A 7B FP16 model requires ~14 GB VRAM.

INT8 (8-bit): ~10–15% accuracy reduction on complex tasks, negligible on most operational queries. Memory requirement = ~1 byte per parameter. A 7B INT8 model requires ~7 GB VRAM.

INT4 (4-bit): ~20–25% accuracy reduction on complex tasks. Memory requirement = ~0.5 bytes per parameter. A 7B INT4 model requires only ~3.5 GB VRAM.

For most operational query workloads (current status, maintenance history, alert summary), INT8 quantization provides acceptable quality with significantly reduced hardware requirements. INT4 is appropriate for hardware-constrained environments where the alternative is no AI capability.
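The memory arithmetic above is simple enough to sketch directly. A minimal helper, with the caveat that these are weights-only figures; real deployments also budget VRAM for the KV cache and activations:

```python
# Weights-only VRAM estimate by quantization level, matching the
# bytes-per-parameter figures in the text. Excludes KV cache/activations.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}

def estimate_vram_gb(params_billion: float, quant: str) -> float:
    """VRAM footprint of the model weights alone, in GB."""
    return round(params_billion * BYTES_PER_PARAM[quant], 1)

for q in ("fp16", "int8", "int4"):
    print(f"7B {q}: ~{estimate_vram_gb(7, q)} GB")  # 14.0, 7.0, 3.5
```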

Inference Engine

ArgusAI uses vLLM as the inference engine. vLLM’s paged attention architecture provides efficient memory management and batching for concurrent query requests. For hardware-constrained environments, llama.cpp is supported as an alternative inference engine with lower memory overhead at reduced throughput.
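Since vLLM exposes an OpenAI-compatible HTTP API, a client can query the inference server with plain standard-library code. The endpoint URL, model name, and token below are placeholders, not ArgusAI defaults:

```python
# Minimal sketch of posting a completion request to an OpenAI-compatible
# endpoint of the kind vLLM serves. URL, model, and token are illustrative.
import json
import urllib.request

def build_query(prompt: str, model: str = "llama-3.1-8b-instruct",
                max_tokens: int = 512) -> dict:
    """Assemble the JSON body for a /v1/completions request."""
    return {"model": model, "prompt": prompt,
            "max_tokens": max_tokens, "temperature": 0.2}

def ask(endpoint: str, token: str, prompt: str) -> str:
    """POST the prompt and return the first completion's text."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(build_query(prompt)).encode(),
        headers={"Content-Type": "application/json",
                 "Authorization": f"Bearer {token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["text"]

# ask("https://argusai.local/v1/completions", "TOKEN",
#     "Summarize active alerts on Press Line 2.")
```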


The MCP Server Layer

The Model Context Protocol (MCP) is the bridge between the LLM and operational data. Understanding the MCP architecture is essential for deploying ArgusAI effectively.

What MCP Servers Do

An MCP server is a service that:

  1. Exposes a set of “tools” — named functions with defined inputs and outputs
  2. Implements those tools as queries against specific data sources
  3. Returns structured data that the inference server formats as LLM context

When the inference server processes a query, it determines which tools to call (based on the query’s apparent data needs), calls those tools via the MCP servers, collects the results, and assembles them into a structured prompt context.

Example query flow:

User: “Which motors on Press Line 2 have health scores below 70?”

  1. Inference server identifies needed data: asset health scores, filtered by asset type (motor) and location (Press Line 2)
  2. Calls asset_hub_mcp.get_assets_by_filter(asset_type="motor", location="Press Line 2", health_score_max=70)
  3. MCP server queries ArgusIQ Asset Hub for matching assets
  4. Returns: list of motor assets with names, current health scores, and key metrics
  5. Inference server includes this data as context in the LLM prompt
  6. LLM generates the answer with the retrieved data as the factual basis
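The flow above can be sketched in a few lines. The tool name mirrors the example; the stubbed tool data, dispatch logic, and prompt template are simplified assumptions, not the actual inference server implementation:

```python
# Illustrative orchestration step for the example query: call the MCP tool,
# then assemble retrieved context plus the user query into the LLM prompt.

def get_assets_by_filter(asset_type, location, health_score_max):
    """Stand-in for the asset_hub_mcp tool; a real deployment queries Asset Hub."""
    return [{"name": "Motor-PL2-03", "health_score": 62},
            {"name": "Motor-PL2-07", "health_score": 58}]

def assemble_prompt(query: str, context: list) -> str:
    """Combine the user query with retrieved MCP context for the LLM."""
    lines = [f"- {a['name']}: health score {a['health_score']}" for a in context]
    return ("Context (from asset_hub_mcp):\n" + "\n".join(lines) +
            f"\n\nQuestion: {query}\nAnswer using only the context above.")

query = "Which motors on Press Line 2 have health scores below 70?"
context = get_assets_by_filter("motor", "Press Line 2", 70)
print(assemble_prompt(query, context))
```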

ArgusIQ MCP Servers

ArgusAI includes pre-built MCP servers for each ArgusIQ module domain:

asset_hub_mcp: Queries Asset Hub for asset records, health scores, telemetry history, baseline statistics, and asset relationships. Supports filtering by asset type, location, health score range, and time period.

cmms_mcp: Queries CMMS for work order records, maintenance history, PM schedules, and parts records. Supports filtering by asset, date range, work order type, and status.

alarm_mcp: Queries Alarm Engine history for alert events, active alerts, and acknowledgment records. Supports filtering by asset, severity, time period, and alert condition.

space_mcp: Queries Space Hub for asset locations, zone assignments, and RTLS location history.

ticketing_mcp: Queries Ticketing for service ticket records, SLA status, and resolution history.

Custom MCP Servers

For data sources outside ArgusIQ — ERP systems, external document repositories, proprietary production management systems — custom MCP servers can be developed using the MCP server SDK. Custom MCP servers follow the same interface contract as the ArgusIQ MCP servers and integrate transparently into the ArgusAI inference pipeline.

Custom MCP server development typically requires: familiarity with the target data source’s API or query interface, Python or TypeScript development capability, and 1–4 weeks of development time per data source depending on complexity.
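The shape of a custom MCP server is a named set of tools bound to one data source. The register/call interface below is illustrative only; the actual MCP server SDK defines its own API, and the ERP tool and its data are hypothetical:

```python
# Sketch of the custom-MCP-server pattern: tools are named functions with
# defined inputs/outputs, registered against a single data source.
from typing import Callable

class CustomMCPServer:
    """Toy stand-in for the MCP server SDK's registration interface."""
    def __init__(self, name: str):
        self.name = name
        self.tools: dict[str, Callable] = {}

    def tool(self, fn: Callable) -> Callable:
        """Decorator that exposes a function as a named MCP tool."""
        self.tools[fn.__name__] = fn
        return fn

    def call(self, tool_name: str, **kwargs):
        return self.tools[tool_name](**kwargs)

erp = CustomMCPServer("erp_mcp")

@erp.tool
def get_open_purchase_orders(part_number: str) -> list[dict]:
    """Would query the ERP's API; static data here for illustration."""
    return [{"po": "PO-1042", "part": part_number, "qty": 12, "status": "open"}]

print(erp.call("get_open_purchase_orders", part_number="BRG-6205"))
```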


Hardware Specifications

Minimum Production Configuration

Use case: Single facility, < 100 concurrent users, moderate query complexity

  • Server: Single GPU server, 2U rackmount
  • GPU: NVIDIA A10 (24 GB GDDR6) or RTX A5000 (24 GB)
  • CPU: 16-core x86-64 (AMD EPYC or Intel Xeon)
  • RAM: 128 GB system RAM
  • Storage: 2 TB NVMe SSD
  • Model: 7B INT8 or 13B INT4

Standard Production Configuration

Use case: Mid-size facility or multiple facilities sharing one deployment, < 500 concurrent users, standard query complexity

  • Server: Dual GPU server, 2U or 4U rackmount
  • GPU: 2× NVIDIA A100 (80 GB) or 2× A30 (24 GB)
  • CPU: 32-core x86-64
  • RAM: 256 GB system RAM
  • Storage: 4 TB NVMe SSD
  • Model: 13B–34B FP16 or 70B INT8 (dual A100)

High-Performance Configuration

Use case: Large enterprise, 1000+ concurrent users, complex analytical queries

  • Server: Multi-GPU cluster, 4–8 GPUs
  • GPU: 4× NVIDIA H100 (80 GB) or 4× A100
  • CPU: 64-core x86-64
  • RAM: 512 GB system RAM
  • Storage: 8 TB NVMe SSD, separate logging storage
  • Model: 70B FP16 (with NVLink GPU interconnect)


Network Architecture

ArgusAI runs within the facility network with no required external connectivity:

Inbound connections: The inference server accepts HTTPS connections from:

  • ArgusIQ (Ask Argus interface sends queries to the ArgusAI endpoint)
  • Authorized user workstations (direct API access if configured)

No outbound connections required: ArgusAI makes no outbound connections during operation. Model weights are downloaded once during deployment; operational queries are processed entirely from local resources.

Air-gapped deployment: For facilities with no internet connectivity, model weights are transferred via removable media or authorized file transfer. Once deployed, no external connectivity is needed.

Network segmentation: ArgusAI can be deployed on the same network segment as ArgusIQ, or on a separate segment with a controlled connection to ArgusIQ’s API. In OT/IT-segmented environments, ArgusAI is deployed on the OT network segment (alongside ArgusIQ), not on the corporate IT network.


Inference Latency and Throughput

Typical inference latency for operational queries (not counting MCP context retrieval):

| Model Size | GPU | Tokens/sec | Typical Query Time |
|---|---|---|---|
| 7B INT8 | A10 (24 GB) | 60–80 | 3–8 seconds |
| 13B FP16 | A100 (80 GB) | 40–60 | 5–12 seconds |
| 34B FP16 | 2× A100 | 30–50 | 8–18 seconds |
| 70B FP16 | 4× A100 | 20–35 | 12–25 seconds |

MCP context retrieval adds 0.5–3 seconds per data source queried, depending on query complexity and data volume.

For most operational status queries (< 500 output tokens), total response time including MCP retrieval is 5–20 seconds depending on configuration.
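These figures combine into a simple back-of-envelope estimate: generation time is roughly output tokens divided by tokens/sec, plus per-source MCP retrieval. The 1.5-second default retrieval time is an assumed midpoint of the 0.5–3 second range above:

```python
# Back-of-envelope total response time:
# total ≈ output_tokens / tokens_per_sec + sources × retrieval time.

def estimate_response_s(output_tokens: int, tokens_per_sec: float,
                        mcp_sources: int, retrieval_s: float = 1.5) -> float:
    """Generation time plus MCP context retrieval, in seconds."""
    return round(output_tokens / tokens_per_sec + mcp_sources * retrieval_s, 1)

# 400-token answer on a 7B INT8 model (~70 tok/s) with two MCP sources:
print(estimate_response_s(400, 70, 2))  # 8.7
```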


Security Configuration

Authentication: The ArgusAI inference endpoint authenticates requests via JWT tokens and integrates with ArgusIQ’s RBAC: the user permissions that govern which data ArgusIQ serves are enforced in the MCP servers, so the AI interface cannot bypass them.

Audit logging: All queries and responses are logged and stored on-premises; nothing is transmitted externally.

Model security: Model weights are stored on encrypted storage, and access to the model files requires the server’s encryption key. The weights do not change after initial deployment (no training, no fine-tuning), so the model in production remains exactly the artifact validated at install time.
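To make the JWT mechanism concrete, here is a standard-library sketch of HS256 token signing and verification. The secret and claims are illustrative, and a production deployment would use a vetted JWT library and also validate expiry and issuer claims:

```python
# HS256 JWT sign/verify sketch using only the standard library.
import base64, hashlib, hmac, json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, per the JWT convention."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def sign(header: dict, payload: dict, secret: bytes) -> str:
    """Produce header.payload.signature, HMAC-SHA256 over the first two parts."""
    signing_input = (f"{b64url(json.dumps(header).encode())}."
                     f"{b64url(json.dumps(payload).encode())}")
    sig = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return f"{signing_input}.{b64url(sig)}"

def verify(token: str, secret: bytes) -> bool:
    """Recompute the signature and compare in constant time."""
    signing_input, _, sig = token.rpartition(".")
    expected = hmac.new(secret, signing_input.encode(), hashlib.sha256).digest()
    return hmac.compare_digest(b64url(expected), sig)

tok = sign({"alg": "HS256", "typ": "JWT"}, {"sub": "operator-7"}, b"demo-secret")
print(verify(tok, b"demo-secret"))  # True
```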


Talk to our team about ArgusAI deployment specifications for your environment.
