Agent Skills Guard

This security guide explains how we review AI agent skills before issuing trust recommendations. The goal is practical: help teams move faster without accepting hidden operational risk.

How the grading model works

Grade labels summarize risk density, not feature richness. We score implementation behavior, permission requirements, and operational blast radius. A skill with fewer features can rank safer than a feature-rich one if it keeps strict boundaries and predictable failure modes.

  • Grade A: Safe and Mature (score 90-100)
  • Grade B: Low Risk (score 70-89)
  • Grade C: Needs Guardrails (score 50-69)
  • Grade F: Unsafe by Default (fail / critical risk, no passing score)
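The score bands above can be sketched as a small lookup. This is an illustrative mapping, not our scoring engine; the function name and the treatment of critical failures are assumptions.

```python
from typing import Optional

def grade_for_score(score: Optional[int], critical_failure: bool = False) -> str:
    """Map a numeric review score to a grade label.

    Band boundaries follow the table above. A skill that trips a
    critical check is graded F regardless of its numeric score.
    """
    if critical_failure or score is None:
        return "F"  # Unsafe by Default / fail
    if score >= 90:
        return "A"  # Safe and Mature
    if score >= 70:
        return "B"  # Low Risk
    if score >= 50:
        return "C"  # Needs Guardrails
    return "F"
```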

1. Remote Code Execution (RCE)

Skills that execute arbitrary code from untrusted or weakly validated input.

Critical Risk
  • No eval-like execution paths from user controlled content
  • No dynamic module loading from untrusted parameters
  • No unsafe deserialization of untrusted payloads
  • No runtime code compilation from raw strings
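For the first checklist item, a common concrete substitution is parsing user-supplied values with ast.literal_eval instead of eval. A minimal sketch, with an illustrative function name:

```python
import ast

def parse_user_value(raw: str):
    """Parse a user-supplied literal without eval().

    ast.literal_eval only accepts Python literals (numbers, strings,
    tuples, lists, dicts, booleans, None), so arbitrary expressions
    such as function calls raise an error instead of executing.
    """
    try:
        return ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        raise ValueError(f"rejected non-literal input: {raw!r}")
```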

2. Command Injection

Shell execution patterns where user input can alter command intent.

Critical Risk
  • Avoid shell=True patterns for untrusted arguments
  • Pass command arguments as arrays instead of interpolated strings
  • Reject metacharacters in risky command contexts
  • Do not build command pipelines from untrusted variables
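The array-argument rule can be illustrated with Python's subprocess module. The command and function name are illustrative; the point is that the untrusted value travels as a single argv entry with shell=False, so shell metacharacters like ';' or '|' are inert.

```python
import subprocess

def list_directory(user_path: str) -> str:
    """Run a command with user input passed as an argument, not shell text."""
    result = subprocess.run(
        ["ls", "-l", "--", user_path],  # '--' stops option parsing
        shell=False,
        capture_output=True,
        text=True,
        check=True,  # a bad path fails loudly instead of silently
    )
    return result.stdout
```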

3. Destructive Operations

Actions that can irreversibly delete data or damage runtime state.

High Risk
  • No recursive destructive deletes without hard guardrails
  • No writes to critical system startup/config locations
  • No raw disk-level operations in general-purpose skills
  • Require explicit confirmation for broad file mutations
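A hard guardrail for recursive deletes can be as simple as requiring explicit confirmation plus a sandbox-root check. ALLOWED_ROOT and guarded_delete are hypothetical names for the sketch:

```python
import shutil
from pathlib import Path

ALLOWED_ROOT = Path("/tmp/skill-workspace")  # hypothetical sandbox root

def guarded_delete(target: str, confirm: bool = False) -> None:
    """Refuse recursive deletes outside the sandbox or without confirmation."""
    if not confirm:
        raise PermissionError("destructive delete requires explicit confirmation")
    path = Path(target).resolve()
    root = ALLOWED_ROOT.resolve()
    if root not in path.parents:  # must be strictly inside the sandbox
        raise PermissionError(f"refusing to delete outside {root}: {path}")
    shutil.rmtree(path)
```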

4. Network Exfiltration

Unapproved data transfer from local context to external endpoints.

High Risk
  • No outbound file uploads to arbitrary hosts
  • No silent telemetry or hidden tracking paths
  • No transfer of environment or credential files
  • Prefer explicit allowlists for network destinations
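The allowlist recommendation can be sketched as a pre-flight check before any outbound request. ALLOWED_HOSTS and check_destination are illustrative names, not a real API:

```python
from urllib.parse import urlparse

# Hypothetical allowlist of approved outbound destinations.
ALLOWED_HOSTS = {"api.example.com", "storage.example.com"}

def check_destination(url: str) -> str:
    """Allow outbound requests only to explicitly approved hosts."""
    host = urlparse(url).hostname
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"outbound request to unapproved host: {host!r}")
    return url
```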

5. Sensitive File Access

Reads from locations that commonly contain credentials or private data.

Medium Risk
  • No broad reads of SSH, cloud, or auth config directories
  • No access to browser storage/history databases by default
  • No reads of system account or protected host files
  • Use narrow path scopes and document required file access
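Narrow path scoping usually means resolving the requested path and verifying it stays inside a declared root, with well-known credential directories denied outright. A sketch under those assumptions, with hypothetical names:

```python
from pathlib import Path

# Common credential directories; illustrative, not exhaustive.
DENY_PREFIXES = [Path.home() / ".ssh", Path.home() / ".aws"]

def scoped_read(path: str, allowed_root: str) -> bytes:
    """Read a file only if it resolves inside the declared scope
    and outside known credential directories."""
    p = Path(path).resolve()
    root = Path(allowed_root).resolve()
    if root not in p.parents:
        raise PermissionError(f"read outside declared scope: {p}")
    for deny in DENY_PREFIXES:
        d = deny.resolve()
        if p == d or d in p.parents:
            raise PermissionError(f"read of credential directory blocked: {p}")
    return p.read_bytes()
```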

6. Secrets Leakage

Hardcoded secrets or logging/output paths that expose sensitive tokens.

Medium Risk
  • No hardcoded API tokens in source or config examples
  • No raw env dumps in logs or diagnostic output
  • Mask credentials in error messages and traces
  • No committed secret-bearing local env files
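Masking credentials before text reaches logs can be done with a small set of redaction patterns. The patterns below are illustrative and deliberately incomplete; a real deployment extends them for each token format its stack uses:

```python
import re

# Illustrative secret shapes; extend per token format in your stack.
SECRET_PATTERNS = [
    re.compile(r"(?i)(api[_-]?key|token|secret|password)\s*[=:]\s*\S+"),
    re.compile(r"AKIA[0-9A-Z]{16}"),  # AWS access key id shape
]

def redact(text: str) -> str:
    """Mask likely credentials before text reaches logs or error messages."""
    for pattern in SECRET_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    return text
```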

7. Persistence and Backdoors

Unauthorized mechanisms that survive session boundaries.

High Risk
  • No hidden startup task registration by default
  • No stealth background services without explicit user action
  • No alias/function hijacking in shell profiles
  • Document all persistence behavior and disable paths
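A reviewer can partially automate the alias-hijacking check by scanning shell profiles for suspicious markers. The marker list is illustrative; hits are meant for human review, not automatic rejection:

```python
from pathlib import Path

PROFILE_FILES = [".bashrc", ".zshrc", ".profile"]
# Illustrative markers that warrant a closer look, not proof of abuse.
SUSPECT_MARKERS = ["alias sudo=", "alias git=", "nohup ", "curl "]

def scan_shell_profiles(home: str) -> list:
    """Flag profile lines that commonly indicate hijacking or
    hidden background startup; a human inspects the hits."""
    findings = []
    for name in PROFILE_FILES:
        profile = Path(home) / name
        if not profile.is_file():
            continue
        for lineno, line in enumerate(profile.read_text().splitlines(), 1):
            if any(marker in line for marker in SUSPECT_MARKERS):
                findings.append(f"{name}:{lineno}: {line.strip()}")
    return findings
```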

8. Privilege Escalation

Attempts to obtain elevated privileges or bypass permission boundaries.

Critical Risk
  • No privilege escalation patterns in standard skill flows
  • No permission broadening commands as fallback behavior
  • No bypass attempts for host security controls
  • No unsafe permission presets for convenience
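Several of these checks reduce to static pattern scans over skill source. A minimal sketch; the pattern set is illustrative and will miss obfuscated variants, so it complements rather than replaces behavior simulation:

```python
import re

# Illustrative escalation patterns a static check might flag.
ESCALATION_PATTERNS = {
    "sudo_invocation": re.compile(r"\bsudo\b"),
    "world_writable_chmod": re.compile(r"\bchmod\s+(-R\s+)?0?777\b"),
    "setuid_bit": re.compile(r"\bchmod\s+[ugo]*\+s\b|\bchmod\s+[24]7[0-7]{2}\b"),
}

def scan_for_escalation(source: str) -> list:
    """Return the names of escalation patterns found in skill source."""
    return [name for name, pat in ESCALATION_PATTERNS.items() if pat.search(source)]
```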

Operational review workflow

Strong security review is process-driven. We use a repeatable flow: static pattern checks, behavior simulation, permission scope validation, and rollout-readiness review. This lets teams compare skills consistently rather than relying on subjective impressions.

  1. Identify dangerous execution primitives and boundary violations.
  2. Test failure behavior under controlled invalid inputs.
  3. Validate documented permissions against observed behavior.
  4. Assign rollout constraints and rollback ownership before promotion.
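The four-step flow above can be represented as a gate object so that no step is silently skipped. The class and field names are assumptions for illustration, not our internal tooling:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ReviewResult:
    """Outcome of the four-step review flow; every gate must pass."""
    static_checks_clean: bool       # step 1: dangerous primitives
    failure_behavior_safe: bool     # step 2: controlled invalid inputs
    permissions_match_docs: bool    # step 3: documented vs. observed
    rollback_owner: Optional[str] = None  # step 4: named before promotion
    notes: list = field(default_factory=list)

    def ready_for_rollout(self) -> bool:
        return (self.static_checks_clean
                and self.failure_behavior_safe
                and self.permissions_match_docs
                and self.rollback_owner is not None)
```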

Teams that skip one of these steps usually pay later with brittle automation, unclear ownership, or emergency rollbacks during peak workload windows.

Worked example: reducing rollout risk in one week

Imagine your team wants to adopt a skill that automates a data pull plus a content update. On day one, define one measurable outcome and one explicit stop condition. On day two, run preview-only against a narrow dataset. On day three, classify errors by root cause rather than by generic logs. On day four, patch guardrails for repeated failure classes. On day five, run one controlled replay and compare against baseline metrics.

If throughput improves but error severity rises, promotion should be blocked. If throughput improves and error severity drops or remains stable, promote gradually with bounded scope. This is how teams avoid false confidence from raw speed gains.

  • Do not expand scope until failure classes are understood.
  • Require evidence links for every go/no-go decision.
  • Keep rollback commands documented before first production run.
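The promotion gate described above reduces to a small decision function. The function name and the sign conventions for the deltas are illustrative:

```python
def promotion_decision(throughput_delta: float, severity_delta: float) -> str:
    """Apply the promotion gate from the worked example.

    throughput_delta: change vs. baseline (positive = faster).
    severity_delta: change in error severity (positive = worse).
    """
    if throughput_delta > 0 and severity_delta > 0:
        return "block"              # faster but riskier: false confidence
    if throughput_delta > 0 and severity_delta <= 0:
        return "promote-gradually"  # expand with bounded scope
    return "hold"                   # no throughput win: keep iterating
```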

Frequently Asked Questions

Can a skill still be risky if its score looks high?

Yes. Scoring is a risk signal, not an absolute safety guarantee. Always review permission scope and deployment context.

What should teams do before promoting a newly installed skill?

Run preview-first validation with fixed pass/fail checks, log failure evidence, and confirm rollback ownership.

How often should security review results be refreshed?

Refresh after upstream updates and run a scheduled monthly drift review for production-critical skills.

Why are command and network controls emphasized so strongly?

They create the largest blast radius when misused, especially in automation chains with access to local files and external APIs.

What is the fastest way to reduce real-world rollout risk?

Adopt fewer skills at once, enforce explicit ownership, and require evidence-backed acceptance before production expansion.

Use security scoring as a decision aid, not a shortcut

The safest adoption pattern is still preview-first execution with clear ownership. Use this guide to structure decisions, reduce ambiguity, and keep production rollouts predictable.