Araña is Vector Pacifico's agentic content-discovery system. It autonomously expands its own source catalog, sanitizes every item through a mandatory integrity gate, deduplicates by SHA-256 content hash, and scores each item for relevance against topic rules before passing it downstream. It processes over one million articles every day across 150,000+ sources globally, and it grows that reach on its own initiative.
No signal reaches Miranda without passing through Araña first. It is the immutable agentic foundation the rest of the stack depends on.
Structured-feed-first. Araña prefers clean data streams over scraping. It currently tracks 27 topic domains across Latin America, with automated discovery continuously expanding the source catalog. Combined with global event-stream aggregation, its reach spans 150,000+ sources across the region and the wider world.
Every item is sanitized before any downstream process touches it. If the sanitizer is down, the pipeline halts. This is an architectural rule, not a runtime check — integrity is enforced at the structural level.
Araña applies user-defined topic rules to score every sanitized item. Only content that clears the relevance threshold is passed downstream to Miranda, so noise is filtered at the discovery layer, not after synthesis.
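As a rough illustration of rule-based scoring against a threshold, the sketch below sums keyword weights per topic and keeps only the topics that clear the bar. The `TopicRule` shape, the keyword-weight scheme, and the function names are assumptions made for this example; the actual rule format is not described here.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class TopicRule:
    """Hypothetical rule shape: a topic name plus weighted keywords."""
    topic: str
    keywords: dict[str, float]  # keyword -> weight


def score_item(text: str, rule: TopicRule) -> float:
    """Sum the weights of the rule's keywords that occur in the text."""
    words = set(re.findall(r"\w+", text.lower()))
    return sum(weight for kw, weight in rule.keywords.items() if kw in words)


def relevant_topics(text: str, rules: list[TopicRule], threshold: float) -> dict[str, float]:
    """Return only the topics whose score clears the threshold."""
    scores = {rule.topic: score_item(text, rule) for rule in rules}
    return {topic: s for topic, s in scores.items() if s >= threshold}
```

An item for which `relevant_topics` returns an empty mapping would be dropped at this stage rather than forwarded.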
Four mandatory stages. Every item passes through every gate. No bypass paths exist.
RSS first. If a source publishes a structured feed, we consume the feed. Scraping is a fallback, not a preference — feeds are faster, cleaner, and less fragile.
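The feed-first preference can be sketched as a single branch: use the feed when the source declares one, and scrape only otherwise. The `Source` fields and the injected `fetch` and `scrape` callables are hypothetical names chosen for this example; only the standard-library XML parser is real.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Source:
    """Hypothetical catalog entry."""
    name: str
    page_url: str
    feed_url: Optional[str] = None  # None when no structured feed is published


def parse_rss_titles(xml_text: str) -> list[str]:
    """Extract the <item><title> values from an RSS 2.0 document."""
    root = ET.fromstring(xml_text)
    return [t.text or "" for t in root.findall("./channel/item/title")]


def collect(
    source: Source,
    fetch: Callable[[str], str],
    scrape: Callable[[str], list[str]],
) -> tuple[str, list[str]]:
    """Prefer the structured feed; fall back to scraping only without one."""
    if source.feed_url is not None:
        return "feed", parse_rss_titles(fetch(source.feed_url))
    return "scrape", scrape(source.page_url)
```

Passing `fetch` and `scrape` in as arguments keeps the choice of path testable without network access.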
Sanitizer is mandatory. There is no path around the sanitizer. If sanitization fails, the pipeline halts. Miranda never sees raw data — this is immutable.
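One way to make such a gate structural is to give raw and sanitized items different types, so that downstream code can only be handed the sanitized one. The sketch below assumes that approach; the type names, the `sanitizer_up` flag, and the deliberately naive tag-stripping are illustrative stand-ins, not Araña's actual sanitizer.

```python
import html
import re
from dataclasses import dataclass


class SanitizerUnavailable(RuntimeError):
    """Raised when sanitization cannot run; the pipeline halts."""


@dataclass(frozen=True)
class RawItem:
    source: str
    body: str


@dataclass(frozen=True)
class SanitizedItem:
    source: str
    text: str


def sanitize(item: RawItem, sanitizer_up: bool = True) -> SanitizedItem:
    """The only function in this sketch that produces a SanitizedItem."""
    if not sanitizer_up:  # hypothetical health flag for the example
        raise SanitizerUnavailable("sanitizer is down; halting pipeline")
    text = re.sub(r"<[^>]+>", " ", item.body)  # naive tag removal
    text = html.unescape(text)                 # decode HTML entities
    return SanitizedItem(item.source, " ".join(text.split()))


def run_pipeline(raw_items: list[RawItem], sanitizer_up: bool = True) -> list[SanitizedItem]:
    """No branch forwards a RawItem; a failure propagates and stops the run."""
    return [sanitize(item, sanitizer_up) for item in raw_items]
```

Python does not enforce these annotations at runtime, so in this sketch the guarantee comes from a static type checker plus the fact that `sanitize` is the sole constructor path.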
Pluggable by design. Every component — cleaner, matcher, store — implements a base interface. Adding a new source, a new cleaning strategy, or a new relevance rule is a plug-in operation, not a rewrite.
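A plug-in arrangement of this kind is often built from an abstract base class plus a registry. The sketch below shows that pattern for cleaners only; the class names, the `register` decorator, and the `CLEANERS` registry are assumptions for illustration, and matchers and stores would follow the same shape.

```python
from abc import ABC, abstractmethod


class Cleaner(ABC):
    """Hypothetical base interface every cleaning strategy implements."""

    @abstractmethod
    def clean(self, text: str) -> str:
        ...


CLEANERS: dict[str, type[Cleaner]] = {}  # name -> implementation


def register(name: str):
    """Class decorator that adds a Cleaner implementation to the registry."""
    def decorator(cls: type[Cleaner]) -> type[Cleaner]:
        CLEANERS[name] = cls
        return cls
    return decorator


@register("whitespace")
class WhitespaceCleaner(Cleaner):
    def clean(self, text: str) -> str:
        return " ".join(text.split())


@register("lowercase")
class LowercaseCleaner(Cleaner):
    def clean(self, text: str) -> str:
        return text.lower()


def apply_chain(names: list[str], text: str) -> str:
    """Run the named cleaners in order."""
    for name in names:
        text = CLEANERS[name]().clean(text)
    return text
```

Adding a new strategy then means writing one decorated class; `apply_chain` and its callers stay unchanged.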
Hash-anchored deduplication. Content-level SHA-256 hashing means the same story published across five outlets resolves to one record with five attributions — not five redundant items contaminating downstream analysis.
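A minimal sketch of hash-anchored deduplication, assuming the hash is taken over lightly normalized text (lowercased, whitespace collapsed). That normalization step, the `DedupStore` class, and the record layout are assumptions for this example. An exact hash only merges copies that are identical after normalization, so an edited rewrite of the same story would still produce a separate record.

```python
import hashlib


def content_hash(text: str) -> str:
    """SHA-256 over normalized text (normalization is an assumption here)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class DedupStore:
    """Keeps one record per content hash and accumulates attributions."""

    def __init__(self) -> None:
        self._records: dict[str, dict] = {}

    def add(self, text: str, outlet: str) -> bool:
        """Return True if the content is new, False if it was merged."""
        key = content_hash(text)
        record = self._records.get(key)
        if record is None:
            self._records[key] = {"text": text, "attributions": [outlet]}
            return True
        if outlet not in record["attributions"]:
            record["attributions"].append(outlet)
        return False

    def records(self) -> list[dict]:
        return list(self._records.values())
```

Feeding the same wire story from five outlets into `add` leaves a single record whose `attributions` list names all five.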
Araña's sanitized, deduplicated, relevance-scored output is Miranda's sole input. Miranda never touches raw data — that separation is the integrity guarantee.
Vista surfaces the full operational health of Araña's pipeline — source counts, sanitization rates, dedup ratios, relevance thresholds — so analysts always know the state of the data layer.