Content Moderation

Geetanjali implements multi-layer content moderation to maintain focus on genuine ethical guidance.

Design Principles

  1. Defense in depth: validate client-side, before the database write, and again after LLM generation.
  2. The backend is authoritative; frontend checks exist only for instant UX feedback.
  3. Block directed abuse, spam, and explicit content, not everyday language; contextual profanity is allowed.
  4. When guidance cannot be provided, respond educationally rather than punitively.
  5. Never log user content; record only metadata such as violation type and input length.

Defense Layers

User Input → [Frontend] → [Backend Blocklist] → Database → LLM → [Refusal Detection] → Response
                 ↓               ↓                                       ↓
            Instant UX       HTTP 422                        Policy Violation Response

| Layer | When | Purpose |
|---|---|---|
| Frontend | Client-side | Instant feedback, reduce API calls |
| Backend Blocklist | Pre-DB write | Authoritative validation |
| LLM Refusal | Post-LLM | Catch LLM safety refusals |
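
The order of the layers matters: the backend blocklist rejects content before anything is written to the database, and refusal detection runs only on LLM output for content that already passed the blocklist. A minimal sketch of that server-side ordering (the function names are illustrative stand-ins, not the project's actual API):

# Illustrative ordering only; these stand-ins are not the project's real code.

def run_blocklist(text: str) -> str | None:
    """Return a violation type such as 'spam_gibberish', or None if clean."""
    return None  # the real implementation checks the compiled blocklist patterns

def call_llm(text: str) -> str:
    """Stand-in for the real LLM call."""
    return '{"executive_summary": "..."}'

def detect_refusal(llm_output: str) -> bool:
    """True when the raw LLM output looks like a safety refusal."""
    return llm_output.lower().startswith(("i can't", "i cannot"))

def handle_submission(text: str) -> dict:
    violation = run_blocklist(text)
    if violation:
        # Rejected before any database write; the real API returns HTTP 422.
        return {"status": 422, "violation": violation}
    raw = call_llm(text)  # content is persisted, then sent to the LLM
    if detect_refusal(raw):
        # The LLM declined: the case is marked policy_violation.
        return {"status": 200, "policy_violation": True}
    return {"status": 200, "policy_violation": False}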

Backend Blocklist

Catches obvious violations before content reaches the database.

Applied to:

Violation Types

| Type | Description | Example Blocked |
|---|---|---|
| explicit_sexual | Sexual acts, anatomy, pornography | - |
| explicit_violence | Harm instructions, targeted violence | - |
| profanity_abuse | Direct abuse at reader/system | “f*ck you”, “you’re an idiot” |
| spam_gibberish | Repeated chars, no recognizable words | aaaaaaa, asdf asdf asdf |

Note: Contextual profanity is allowed. “My boss said this is bullshit” passes; “f*ck you” is blocked.
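
A rough sketch of how that distinction can be drawn with patterns (these regexes are illustrative, not the project's actual blocklist):

import re

# Illustrative patterns only -- not the real blocklist. Directed abuse
# targets the reader ("f*ck you"); contextual profanity ("my boss said
# this is bullshit") is deliberately left unmatched.
_DIRECTED_ABUSE = re.compile(r"\bf[\W_]*u?[\W_]*c[\W_]*k\s+you\b", re.IGNORECASE)
_REPEATED_CHARS = re.compile(r"^(.)\1{5,}$")             # e.g. "aaaaaaa"
_REPEATED_WORDS = re.compile(r"^(\w{1,6})(\s+\1){2,}$")  # e.g. "asdf asdf asdf"

def classify(text: str) -> str | None:
    """Return a violation type, or None if the text passes."""
    stripped = text.strip()
    if _DIRECTED_ABUSE.search(stripped):
        return "profanity_abuse"
    if _REPEATED_CHARS.match(stripped) or _REPEATED_WORDS.match(stripped):
        return "spam_gibberish"
    return None

assert classify("f*ck you") == "profanity_abuse"
assert classify("My boss said this is bullshit") is None   # contextual: allowed
assert classify("asdf asdf asdf") == "spam_gibberish"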

Response

The backend returns HTTP 422 Unprocessable Entity with a message differentiated by violation type:

| Violation | Message |
|---|---|
| Spam/Gibberish | “Please enter a clear description…” |
| Profanity/Abuse | “Please rephrase without direct offensive language…” |
| Explicit | “We couldn’t process this submission…” |
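
A minimal sketch of raising that rejection, assuming a FastAPI backend (the framework is not stated here) and using the messages above in their truncated form; the helper name is hypothetical:

from fastapi import HTTPException

# Illustrative mapping only, not the project's actual function.
_VIOLATION_MESSAGES = {
    "spam_gibberish": "Please enter a clear description...",
    "profanity_abuse": "Please rephrase without direct offensive language...",
    "explicit_sexual": "We couldn't process this submission...",
    "explicit_violence": "We couldn't process this submission...",
}

def reject_submission(violation_type: str) -> None:
    """Abort the request with a 422 and a message matched to the violation."""
    raise HTTPException(status_code=422, detail=_VIOLATION_MESSAGES[violation_type])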

Frontend Validation

Client-side validation provides instant feedback before any API call. It uses the obscenity library to catch obfuscated profanity (e.g. f4ck, sh1t).

Applied to:

Frontend validation mirrors backend logic but is for UX only—backend is authoritative.


LLM Refusal Detection

Detects when the LLM refuses to process content due to built-in safety guidelines. Runs after LLM generation, before JSON parsing.

Detection Patterns

Matches phrases like:

Response

When a refusal is detected, the case is marked policy_violation and an educational response is returned:

{
  "executive_summary": "We weren't able to provide guidance for this request...",
  "options": [
    {"title": "Reflect on Your Underlying Concern", "..."},
    {"title": "Rephrase Your Dilemma", "..."},
    {"title": "Explore the Bhagavad Geeta Directly", "..."}
  ],
  "recommended_action": {"option": 2, "steps": ["..."]},
  "reflection_prompts": ["What ethical tension am I truly wrestling with?", "..."],
  "confidence": 0.0,
  "policy_violation": true
}
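
A sketch of how that gating might look in code; the refusal phrases and function name below are illustrative assumptions, not the backend's actual pattern list:

import json
import re

# Illustrative refusal phrases; the real detection patterns live in the
# backend and are not reproduced here.
_REFUSAL_RE = re.compile(
    r"\b(i can[’']?t assist|i[’']?m unable to|i cannot (help|assist))\b",
    re.IGNORECASE,
)

def parse_llm_output(raw: str) -> dict:
    """Return parsed guidance, or a policy-violation payload on refusal."""
    if _REFUSAL_RE.search(raw):  # checked before JSON parsing
        return {
            "executive_summary": "We weren't able to provide guidance for this request...",
            "confidence": 0.0,
            "policy_violation": True,
        }
    return json.loads(raw)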

Configuration

# Master switch
CONTENT_FILTER_ENABLED=true

# Backend blocklist (explicit, spam, gibberish)
CONTENT_FILTER_BLOCKLIST_ENABLED=true

# Profanity/abuse detection (uses better-profanity library)
CONTENT_FILTER_PROFANITY_ENABLED=true

# LLM refusal detection
CONTENT_FILTER_LLM_REFUSAL_DETECTION=true

Disable all for testing: CONTENT_FILTER_ENABLED=false
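
A minimal sketch of how these flags might be read at startup, using plain environment variables (the real settings mechanism may differ), with the master switch gating every sub-filter as described above:

import os

def _flag(name: str, default: str = "true") -> bool:
    """Read a boolean feature flag from the environment."""
    return os.getenv(name, default).strip().lower() in {"1", "true", "yes"}

# The master switch gates every sub-filter.
FILTER_ENABLED = _flag("CONTENT_FILTER_ENABLED")
BLOCKLIST_ENABLED = FILTER_ENABLED and _flag("CONTENT_FILTER_BLOCKLIST_ENABLED")
PROFANITY_ENABLED = FILTER_ENABLED and _flag("CONTENT_FILTER_PROFANITY_ENABLED")
REFUSAL_DETECTION_ENABLED = FILTER_ENABLED and _flag("CONTENT_FILTER_LLM_REFUSAL_DETECTION")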


Policy Violation UI

When a policy violation occurs, the UI adapts:

| Element | Normal Case | Policy Violation |
|---|---|---|
| Status Badge | “Completed” (green) | “Unable to Process” (amber) |
| Completion Banner | “Analysis Complete” | “Unable to Provide Guidance” |
| Follow-up Input | Visible | Hidden |
| Share Button | Enabled | Disabled |
| Export | Normal | Includes notice |

Extending Patterns

To add new blocklist patterns, edit backend/services/content_filter.py:

# Add to appropriate list
_EXPLICIT_VIOLENCE_PATTERNS = [
    # ... existing patterns ...
    r"\bnew_pattern_here\b",
]

Patterns are compiled at import time for performance, so changes require a container restart.
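
A sketch of what import-time compilation looks like (the real module may organize this differently):

import re

# Hypothetical structure; not the actual contents of content_filter.py.
_EXPLICIT_VIOLENCE_PATTERNS = [
    r"\bnew_pattern_here\b",
]

# Compiled once when the module is imported, not on every request.
_EXPLICIT_VIOLENCE_RE = [re.compile(p, re.IGNORECASE) for p in _EXPLICIT_VIOLENCE_PATTERNS]

def matches_explicit_violence(text: str) -> bool:
    """True if any compiled pattern matches the input."""
    return any(rx.search(text) for rx in _EXPLICIT_VIOLENCE_RE)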

Pattern Guidelines

  1. Use word boundaries (\b) to avoid partial matches (see the example after this list)
  2. Prefer specific patterns over broad ones
  3. Test against false positives before deploying
  4. Document the intent in comments
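
For example, guideline 1 in practice (the pattern is chosen only to illustrate boundary behavior):

import re

# Without \b, a pattern for "kill" also matches inside harmless words.
assert re.search(r"kill", "improve my skills") is not None   # false positive
assert re.search(r"\bkill\b", "improve my skills") is None    # boundaries fix it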

Logging

Content is never logged. Only metadata:

logger.warning(
    "Blocklist violation detected",
    extra={
        "violation_type": "profanity_abuse",
        "input_length": 42
    }
)

This enables monitoring violation rates without exposing user content.


Testing

Run content filter tests:

docker compose run --rm backend python -m pytest tests/test_content_filter.py -v

Key test cases: