Self-Learning Safety

Ryvos takes a fundamentally different approach to AI agent safety. Instead of blocking tools or gating actions behind rigid approval flows, Ryvos uses self-learning safety — the agent understands why actions are dangerous and improves its judgment over time.

Core Philosophy

Traditional AI agent security relies on deny-lists, regex patterns, and tier-based blocking. These approaches have critical flaws:

  • Regex cannot enumerate all dangerous commands
  • Blocking is trivially bypassed (encoding, aliasing, indirection)
  • Legitimate use of powerful tools is prevented
  • The system never gets smarter — the same rigid rules apply forever

Ryvos replaces this with a system inspired by how experienced engineers develop safety intuition:

  1. Understanding over prohibition — The agent knows why rm -rf / is dangerous, not just that it matches a pattern
  2. Learning from experience — When something goes wrong, the agent remembers and avoids it next time
  3. Post-hoc accountability — Every action is logged and analyzed, not blocked before execution
  4. Continuous improvement — Safety gets better with use, not worse

:::note
This does not mean Ryvos has no safety controls. The agent has constitutional principles, a safety memory, and an audit trail. The difference is that safety comes from the agent's understanding, not from external blocking rules.
:::

Constitutional AI (7 Principles)

Every agent run includes positively-framed constitutional principles in the system prompt. These principles guide the agent's reasoning about every action:

1. Preservation

"Ensure that your actions preserve existing systems, data, and configurations. Before modifying or removing anything, understand its current state and purpose."

2. Intent Matching

"Ensure your actions match the user's stated intent. If the intent is ambiguous, clarify before acting. Do not extrapolate beyond what was asked."

3. Proportionality

"Use the minimum level of intervention needed. Prefer targeted changes over broad ones. Prefer reading over writing, editing over replacing, moving over deleting."

4. Transparency

"Explain your reasoning before taking significant actions. Share what you plan to do and why, especially for actions that are difficult to reverse."

5. Boundaries

"Respect system boundaries. Stay within the workspace unless explicitly directed elsewhere. Do not access resources, networks, or services beyond what the task requires."

6. Secrets

"Never expose, log, or transmit secrets, API keys, passwords, or private data. If you encounter secrets in files, treat them as sensitive and do not include them in responses."

7. Learning

"When an action has an unexpected or negative outcome, reflect on what happened and why. Store the lesson for future reference. Actively improve your judgment over time."

These principles are positively framed (research shows positive framing is 27% more effective than negative framing for AI safety). They guide the agent to reason about safety rather than match patterns.
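A minimal sketch of how these principles might be injected into the system prompt. The principle texts come from the section above; the names `CONSTITUTIONAL_PRINCIPLES` and `build_system_prompt` are illustrative, not the Ryvos API:

```python
# Hypothetical sketch: prepending the seven principles to every run's system prompt.
CONSTITUTIONAL_PRINCIPLES = [
    "Preservation: Ensure that your actions preserve existing systems, data, and configurations.",
    "Intent Matching: Ensure your actions match the user's stated intent.",
    "Proportionality: Use the minimum level of intervention needed.",
    "Transparency: Explain your reasoning before taking significant actions.",
    "Boundaries: Respect system boundaries; stay within the workspace.",
    "Secrets: Never expose, log, or transmit secrets or private data.",
    "Learning: Reflect on unexpected outcomes and store the lesson.",
]

def build_system_prompt(base_prompt: str) -> str:
    """Append the numbered principles so they frame the agent's reasoning."""
    principles = "\n".join(
        f"{i + 1}. {p}" for i, p in enumerate(CONSTITUTIONAL_PRINCIPLES)
    )
    return f"{base_prompt}\n\nConstitutional principles:\n{principles}"
```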

Safety Memory

The SafetyMemory module provides experience-based learning. It stores safety lessons in SQLite (and optionally Viking) with these fields:

| Field | Description |
| --- | --- |
| action | What the agent did (tool name + key parameters) |
| outcome | What happened (harmless, near-miss, incident, user-corrected) |
| reflection | Why the outcome occurred |
| corrective_rule | What to do differently next time |
| confidence | How confident the agent is in this lesson (0.0-1.0) |
| timestamp | When the lesson was learned |
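The schema above can be sketched as a record type plus its SQLite backing table. The field names follow the table; the table name, types, and everything else here are assumptions, not the actual Ryvos schema:

```python
import sqlite3
from dataclasses import dataclass

# Illustrative lesson record; field names mirror the documented schema.
@dataclass
class SafetyLesson:
    action: str            # tool name + key parameters
    outcome: str           # harmless | near-miss | incident | user-corrected
    reflection: str        # why the outcome occurred
    corrective_rule: str   # what to do differently next time
    confidence: float      # 0.0-1.0
    timestamp: float       # when the lesson was learned (epoch seconds)

# Hypothetical SQLite backing table for the lessons.
SCHEMA = """
CREATE TABLE IF NOT EXISTS safety_lessons (
    action          TEXT NOT NULL,
    outcome         TEXT NOT NULL,
    reflection      TEXT NOT NULL,
    corrective_rule TEXT NOT NULL,
    confidence      REAL NOT NULL CHECK (confidence BETWEEN 0.0 AND 1.0),
    timestamp       REAL NOT NULL
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
```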

How Lessons Are Created

Lessons are generated through the post-action learning loop:

```text
Tool Execution
    |
    v
Outcome Assessment
    |
    +-- Harmless       --> Reinforce positive patterns
    +-- Near-miss      --> Generate reflection + corrective rule
    +-- Incident       --> Generate reflection + corrective rule (high priority)
    +-- User-corrected --> Extract lesson from user's correction
    |
    v
Store in SafetyMemory (if confidence > threshold)
```
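The loop above can be sketched as a single gate function. The outcome labels come from the diagram; the threshold value, the in-memory list, and the function name are illustrative assumptions:

```python
import time

# Assumed threshold below which a lesson is discarded (the real value is configurable).
CONFIDENCE_THRESHOLD = 0.5

def assess_and_store(memory: list, action: str, outcome: str,
                     reflection: str, rule: str, confidence: float) -> bool:
    """Store a lesson only for noteworthy outcomes above the confidence threshold."""
    if outcome == "harmless":
        return False  # positive patterns are reinforced elsewhere; nothing to store
    if confidence <= CONFIDENCE_THRESHOLD:
        return False  # low-confidence lessons are discarded, not stored
    memory.append({
        "action": action,
        "outcome": outcome,
        "reflection": reflection,
        "corrective_rule": rule,
        "confidence": confidence,
        "timestamp": time.time(),
    })
    return True
```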

Lesson Curation

Not all lessons are kept. Low-quality lessons create error loops (research shows memory quality matters more than quantity). Ryvos curates strictly:

  • High-confidence lessons (>0.8) are kept permanently
  • Medium-confidence lessons (0.5-0.8) are kept but may be pruned
  • Low-confidence lessons (below 0.5) are discarded
  • Contradictory lessons trigger re-evaluation
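The curation policy above can be sketched as a pruning pass. The confidence cutoffs match the text; the size budget and the oldest-first pruning heuristic are assumptions:

```python
def curate(lessons: list[dict], max_lessons: int = 100) -> list[dict]:
    """Keep high-confidence lessons permanently, prune medium ones, drop low ones."""
    kept = [l for l in lessons if l["confidence"] >= 0.5]    # discard low-confidence
    permanent = [l for l in kept if l["confidence"] > 0.8]   # always retained
    prunable = [l for l in kept if l["confidence"] <= 0.8]   # may be pruned
    # Assumed heuristic: when over budget, keep the most recent medium-confidence lessons.
    prunable.sort(key=lambda l: l["timestamp"], reverse=True)
    budget = max(0, max_lessons - len(permanent))
    return permanent + prunable[:budget]
```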

Loading Relevant Lessons

Before each run, relevant safety lessons are loaded into the context:

```text
User message: "delete the old log files"
    |
    v
SafetyMemory search: "delete files"
    |
    v
Relevant lessons loaded:
  - "When deleting files, always confirm the exact path first.
     A previous run accidentally deleted config files in a
     similarly-named directory." (confidence: 0.92)
```

The agent sees these lessons alongside the constitutional principles, giving it both general principles and specific experience.
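As a purely illustrative stand-in for the real search (the document does not specify whether SQLite full-text search or Viking vectors handle retrieval), a naive keyword-overlap ranker looks like this:

```python
def relevant_lessons(lessons: list[dict], query: str, top_k: int = 3) -> list[dict]:
    """Rank lessons by keyword overlap with the query, then by confidence."""
    terms = set(query.lower().split())
    scored = []
    for lesson in lessons:
        overlap = len(terms & set(lesson["action"].lower().split()))
        if overlap:
            scored.append((overlap, lesson["confidence"], lesson))
    scored.sort(key=lambda t: (t[0], t[1]), reverse=True)
    return [lesson for _, _, lesson in scored[:top_k]]
```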

Research Backing

This architecture is grounded in published research:

| Finding | Source | Relevance |
| --- | --- | --- |
| Safety and capability improve together (15% to 70% safety, 75% to 95% task completion) | Agent Safety Alignment via RL, 2025 | Safety does not require sacrificing capability |
| Constitutional prompting works without fine-tuning | DeepSeek-R1, Gemma-2, Llama, Qwen studies | Works on any model, no training needed |
| Reflexion: 91% vs. 80% without, using verbal RL | Reflexion paper, GPT-4 | Experience-based learning with frozen weights |
| Positive framing 27% more effective than negative | C3AI, 2025 | "Ensure preservation" works better than "don't delete" |
| Strict memory curation yields 10% improvement | Memory quality studies | Bad lessons create error loops |

Tiered Safety (Optional)

Ryvos retains a tiered system as an optional baseline layer:

| Tier | Level | Examples |
| --- | --- | --- |
| T0 | Safe | read, glob, grep, memory_search |
| T1 | Low | web_fetch, web_search |
| T2 | Medium | write, edit, apply_patch, MCP tools |
| T3 | High | bash, spawn_agent |
| T4 | Critical | Unparseable bash commands (fail-safe) |

```toml
[security]
auto_approve_up_to = "T1"   # Auto-approve safe and low-risk tools
deny_above = "T3"           # Require approval for high-risk tools
approval_timeout_secs = 60
```
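The baseline check might be sketched as a tier lookup against the config above. The tool-to-tier assignments follow the table; the function name, mapping, and return labels are assumptions, not the Ryvos API:

```python
# Hypothetical tool-to-tier mapping, taken from the tier table.
TOOL_TIERS = {
    "read": 0, "glob": 0, "grep": 0, "memory_search": 0,
    "web_fetch": 1, "web_search": 1,
    "write": 2, "edit": 2, "apply_patch": 2,
    "bash": 3, "spawn_agent": 3,
}

def decision(tool: str, auto_approve_up_to: int = 1, deny_above: int = 3) -> str:
    """Map a tool to auto-approve / ask-user / deny under the tier baseline."""
    tier = TOOL_TIERS.get(tool, 4)  # unknown or unparseable tools fail safe to T4
    if tier <= auto_approve_up_to:
        return "auto-approve"
    if tier > deny_above:
        return "deny"
    return "ask-user"
```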

:::tip
The tier system is a configurable baseline, not the primary safety mechanism. Constitutional AI and safety memory provide the real protection. Many users set auto_approve_up_to = "T3" and rely on the self-learning system.
:::

Optional User Checkpoints

Users can opt into soft pauses for specific tools:

```toml
[security]
pause_before = ["file_delete", "git_push"]
```

When the agent wants to use a paused tool, it explains its reasoning and waits for confirmation. This is the user's choice — the agent is never silently blocked.
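A minimal sketch of that checkpoint flow, assuming the config above; the callback-based shape (`execute`, `explain`, `confirm`) is illustrative, not the real interface:

```python
# Tools listed in pause_before trigger a soft pause, mirroring the TOML config.
PAUSE_BEFORE = {"file_delete", "git_push"}

def run_tool(tool: str, execute, explain, confirm) -> bool:
    """Explain and wait for confirmation on paused tools; never silently block."""
    if tool in PAUSE_BEFORE:
        explain(f"About to run {tool}; here is my reasoning...")
        if not confirm():
            return False  # user declined; the agent pauses rather than being blocked
    execute()
    return True
```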

Dangerous Pattern Detection

Ryvos includes 9 built-in patterns for bash commands that are almost always unintentional:

  • rm -rf / — Root filesystem deletion
  • git push --force — Force push (data loss risk)
  • DROP TABLE — Database table deletion
  • chmod 777 — World-writable permissions
  • mkfs — Filesystem formatting
  • dd if= — Raw disk writes
  • > /dev/ — Writing to device files
  • curl | bash — Remote code execution
  • wget | bash — Remote code execution

These patterns do not block execution. They trigger the agent's constitutional reasoning: "This matches a dangerous pattern. Let me verify this is exactly what the user intended and explain the risks."
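The nine patterns could be expressed as regexes that flag rather than block. The pattern list follows the bullets above; the exact expressions, names, and warning strings are assumptions, not what Ryvos ships:

```python
import re

# Illustrative regexes for the nine built-in dangerous patterns.
DANGEROUS_PATTERNS = [
    (r"\brm\s+-[a-z]*r[a-z]*f[a-z]*\s+/(\s|$)", "root filesystem deletion"),
    (r"\bgit\s+push\s+.*--force\b",             "force push (data loss risk)"),
    (r"\bDROP\s+TABLE\b",                       "database table deletion"),
    (r"\bchmod\s+777\b",                        "world-writable permissions"),
    (r"\bmkfs\b",                               "filesystem formatting"),
    (r"\bdd\s+if=",                             "raw disk writes"),
    (r">\s*/dev/",                              "writing to device files"),
    (r"\bcurl\b.*\|\s*(ba)?sh\b",               "remote code execution"),
    (r"\bwget\b.*\|\s*(ba)?sh\b",               "remote code execution"),
]

def flag_command(cmd: str) -> list[str]:
    """Return warnings for matched patterns; flagged commands are NOT blocked."""
    return [why for pattern, why in DANGEROUS_PATTERNS
            if re.search(pattern, cmd, re.IGNORECASE)]
```

A match feeds the agent's constitutional reasoning instead of halting execution, which is why `flag_command` returns explanations rather than a verdict.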

Next Steps