Data Miner Architecture
The Sentinel Scout Scraper is the core system responsible for acquiring, processing, and delivering web data.
Anyone can become a Scout AI Data Miner.
Sentinel Scout is designed so that everyday users can participate in global-scale data collection by simply running the software on their Android devices.
Unlike traditional scraping systems that rely heavily on centralized cloud proxies, Scout leverages mobile proxies. This approach has several practical advantages:
Mobile IPs are considered more trustworthy by websites.
They face fewer restrictions and lower block rates compared to cloud datacenter IPs.
They have a higher chance of bypassing CAPTCHAs and modern bot-detection systems.
Currently, the Scout AI Data Miner software runs on Android, enabling users to turn their smartphones into active nodes in the Scout network. Each Android device acts as a proxy, contributes bandwidth, rotates IPs, and becomes part of the distributed infrastructure that powers AI data pipelines.
This combination of user-hosted Android nodes and a distributed scraper engine makes Sentinel Scout a highly decentralized and resilient data acquisition system.
Core Architecture
The Sentinel Scout Scraper is built on a distributed and modular architecture, combining Android-based mobile nodes with a high-performance backend system. Together, these components enable robust, scalable, and censorship-resistant data acquisition for AI and ML workloads.
1. Android-Based Data Miner
Each Android device runs the Scout AI Data Miner app, turning it into a mobile proxy node.
Devices host a local HTTP proxy server with randomized ports and credentials.
IP rotation is supported by toggling airplane mode, which forces the device to reattach to the mobile network and typically obtain a new IP address, improving anonymity.
Nodes can connect to reverse tunneling servers, making local proxies globally accessible.
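As a rough illustration of the "randomized ports and credentials" idea, the sketch below generates a fresh port and one-time credentials for a local proxy. The names `ProxyConfig` and `new_proxy_config`, the `scout-` username prefix, and the choice of the dynamic port range (49152 to 65535) are assumptions made for this example, not details of the Scout app; a real node would also need to confirm the port is free before binding.

```python
import secrets
from dataclasses import dataclass

# Assumption: ports are drawn from the IANA dynamic/private range.
PORT_MIN = 49152
PORT_MAX = 65535


@dataclass(frozen=True)
class ProxyConfig:
    """Hypothetical settings for one local HTTP proxy session."""
    host: str
    port: int
    username: str
    password: str

    def url(self) -> str:
        # Proxy URL in the common user:password@host:port form.
        return f"http://{self.username}:{self.password}@{self.host}:{self.port}"


def new_proxy_config(host: str = "127.0.0.1") -> ProxyConfig:
    """Generate a random port and unpredictable credentials."""
    port = PORT_MIN + secrets.randbelow(PORT_MAX - PORT_MIN + 1)
    return ProxyConfig(
        host=host,
        port=port,
        username="scout-" + secrets.token_hex(4),   # 8 hex characters
        password=secrets.token_urlsafe(24),         # URL-safe characters only
    )


if __name__ == "__main__":
    print(new_proxy_config().url())
```

Using the `secrets` module rather than `random` keeps the credentials unpredictable, which matters once the proxy is exposed through a reverse tunnel.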
2. WebSocket & Heartbeat Connections
Persistent connections are maintained via WebSockets.
A heartbeat keepalive system ensures reliability by sending periodic ping messages.
If missed pings exceed a threshold, automatic reconnections are triggered.
This system keeps nodes stable even on unreliable mobile networks.
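The missed-ping logic described above can be sketched as a small counter that is independent of any particular WebSocket library. The class name `HeartbeatMonitor` and the default threshold of 3 are assumptions for illustration; the actual interval and threshold used by Scout are not stated here.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatMonitor:
    """Tracks consecutive missed pings and signals when to reconnect."""
    max_missed: int = 3   # assumed threshold, not a documented value
    missed: int = 0
    reconnects: int = 0

    def on_pong(self) -> None:
        # Any successful reply resets the consecutive-miss counter.
        self.missed = 0

    def on_timeout(self) -> bool:
        """Record one missed ping; return True if a reconnect is due."""
        self.missed += 1
        if self.missed > self.max_missed:
            # Threshold exceeded: count the reconnect and start over.
            self.missed = 0
            self.reconnects += 1
            return True
        return False


if __name__ == "__main__":
    monitor = HeartbeatMonitor(max_missed=2)
    for tick in range(3):
        if monitor.on_timeout():
            print(f"reconnect triggered after tick {tick}")
```

In a real client, `on_timeout` would be called whenever a ping goes unanswered within the keepalive interval, and a `True` result would close and reopen the WebSocket.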
3. Scraper Engine
The backend scraper engine is responsible for data acquisition and processing:
Bot Bypass – User-Agent spoofing, header customization, behavioral emulation, and full JavaScript rendering.
Rotating Proxy Pool – Outgoing requests are distributed across a large proxy set, including Android mobile nodes, reducing block rates.
Headless Chrome – Executes JavaScript-heavy websites, interacts with dynamic content, and ensures accurate data capture.
Task Queue – Powered by Redis/RabbitMQ, supports retries, prioritization, and scalable worker pools.
Web Parser – Extracts clean structured data using CSS selectors, XPath, and tag-stripping rules (e.g., remove <script>, <style>).
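To make the tag-stripping rule concrete, here is a minimal sketch that drops the contents of <script> and <style> elements and keeps the visible text. It uses only Python's standard `html.parser` module; the names `TextExtractor` and `strip_tags` are hypothetical, and the real Web Parser (with CSS selector and XPath support) is likely built on a fuller parsing library.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text while skipping <script> and <style> content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0   # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def strip_tags(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    page = "<p>Hello <b>world</b></p><script>track();</script>"
    print(strip_tags(page))
```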
4. Location-Aware Task Distribution
Tasks can be targeted by country, city, or ISP, ensuring geo-relevant scraping.
The system routes tasks to the most suitable proxy/node for bypassing regional restrictions.
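One way to picture location-aware routing is as a filter followed by a ranking step. The sketch below is an assumption about how such matching could work, not the actual scheduler: the `Node` and `Target` types, the `success_rate` field, and the "highest success rate wins" rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Node:
    """Hypothetical view of a registered mobile node."""
    node_id: str
    country: str
    city: str
    isp: str
    success_rate: float   # assumed health metric in [0, 1]


@dataclass(frozen=True)
class Target:
    """Location constraints for a task; None means 'any'."""
    country: Optional[str] = None
    city: Optional[str] = None
    isp: Optional[str] = None


def matches(node: Node, target: Target) -> bool:
    """True if the node satisfies every constraint that is set."""
    pairs = (
        (target.country, node.country),
        (target.city, node.city),
        (target.isp, node.isp),
    )
    return all(
        wanted is None or wanted.lower() == actual.lower()
        for wanted, actual in pairs
    )


def pick_node(nodes: Iterable[Node], target: Target) -> Optional[Node]:
    """Pick the matching node with the best success rate, if any."""
    candidates = [n for n in nodes if matches(n, target)]
    return max(candidates, key=lambda n: n.success_rate, default=None)
```

For example, `pick_node(nodes, Target(country="DE", city="Berlin"))` would only consider nodes in Berlin, while `Target()` with no constraints falls back to the best node overall.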
5. Storage System APIs (Coming Soon)
Scraped results will be stored via Storage APIs.
Planned integrations include cloud storage (AWS/GCP) and decentralized storage (Jackal Web3).
6. CAPTCHA Handling
Heuristic solving for simple CAPTCHAs.
Retry with new proxy & fingerprint if blocked.
Planned integrations with third-party solvers for advanced cases.
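The "retry with a new proxy and fingerprint" step can be sketched as a loop over candidate identities. Everything here is illustrative: `Identity`, `fetch_with_rotation`, and the injected `fetch` and `is_blocked` callables are hypothetical names, and the fingerprint is reduced to a User-Agent string for brevity.

```python
import itertools
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass(frozen=True)
class Identity:
    """A proxy paired with a browser fingerprint (simplified)."""
    proxy: str
    user_agent: str


def fetch_with_rotation(
    url: str,
    identities: Iterable[Identity],
    fetch: Callable[[str, Identity], str],
    is_blocked: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Try up to max_attempts identities; return the first unblocked body."""
    for identity in itertools.islice(identities, max_attempts):
        body = fetch(url, identity)
        if not is_blocked(body):
            return body
        # Blocked (for example by a CAPTCHA page): move to the next identity.
    return None


if __name__ == "__main__":
    pool = [Identity("proxy-a", "UA-1"), Identity("proxy-b", "UA-2")]

    def fake_fetch(url: str, identity: Identity) -> str:
        # Stand-in for a real HTTP request through identity.proxy.
        return "captcha" if identity.proxy == "proxy-a" else "<html>ok</html>"

    print(fetch_with_rotation("https://example.com", pool, fake_fetch,
                              lambda body: "captcha" in body))
```

Passing `fetch` and `is_blocked` as parameters keeps the retry policy separate from the HTTP client and the block-detection heuristic, so either can be swapped without touching the loop.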
Monitoring & Management
Global Monitoring System – Tracks node health, uptime, errors, and resource usage.
Performance Metrics – Latency, throughput, success rate, and proxy effectiveness are logged.
Dashboards – Powered by Prometheus/Grafana for real-time observability.
Security & Authentication
API Key Manager – Protects all user-facing APIs, with keys managed in the dashboard.
Account Manager – Handles user registration, login, and access control.
Keplr Key-Based Authentication – A preferred login method using cryptographic signature validation.
PASETO Token Exchange – Separates dashboard access tokens from scraping API tokens for better privilege control.
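The privilege separation behind the token exchange can be illustrated with purpose-scoped tokens: a token issued for the dashboard is rejected by the scraping API, and vice versa. The sketch below is a simplified stand-in built on HMAC-SHA256 from the standard library; it is not the token format Scout uses, and the function names and the `purpose` claim are assumptions for this example.

```python
import base64
import hashlib
import hmac
import json
from typing import Optional


def issue_token(key: bytes, subject: str, purpose: str) -> str:
    """Create a signed token bound to a single purpose."""
    payload = json.dumps(
        {"sub": subject, "purpose": purpose}, sort_keys=True
    ).encode()
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return (
        base64.urlsafe_b64encode(payload).decode()
        + "."
        + base64.urlsafe_b64encode(tag).decode()
    )


def verify_token(key: bytes, token: str, expected_purpose: str) -> Optional[dict]:
    """Return the claims if the signature and purpose both check out."""
    try:
        payload_b64, tag_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        tag = base64.urlsafe_b64decode(tag_b64)
    except ValueError:
        # Wrong number of parts or invalid base64.
        return None
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None
    claims = json.loads(payload)
    # A valid token for another purpose is still rejected here.
    return claims if claims.get("purpose") == expected_purpose else None
```

With this shape, a leaked scraping token cannot be replayed against dashboard endpoints, because each endpoint verifies the purpose it expects. A production system would also add expiry and key rotation.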
Why This Architecture Works
Decentralized by Design – Anyone with an Android device can contribute, reducing reliance on centralized data centers.
Resilient to Blocks – Mobile proxies face lower block rates, improving scraping success.
AI-Ready – Data is cleaned, structured, and optimized for direct ML/AI model ingestion.
Scalable & Secure – Built on distributed queues, proxy rotation, and robust authentication.