Data Formats: HTML and JSON
Sentinel Scout supports multiple output formats so developers and AI agents can choose the structure best suited to their workflows. The output format is defined using the outputFileExtension parameter when submitting a scraping job.
Note: For full parameter options and supported tags, refer to the API Docs (Advanced) section.
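As a rough illustration of how outputFileExtension fits into a job submission, the sketch below builds (but does not send) a request in Python using only the standard library. The endpoint, header names, and body fields are taken from the curl examples on this page; the helper name build_probe_request and the AVAILABLE_FORMATS set are hypothetical and not part of any official SDK.

```python
import json
import urllib.request

# Endpoint and field names are taken from the curl examples on this page.
PROBE_SYNC_URL = "https://api.scout.sentinel.co/api/v1/probe/sync"

# Formats this page documents as available; Markdown is listed as coming soon.
AVAILABLE_FORMATS = {"EXTENSION_HTML", "EXTENSION_JSON"}


def build_probe_request(target_url, output_format, api_key, country_code="US"):
    """Build (but do not send) a POST request for a synchronous scraping job.

    Hypothetical helper for illustration; not part of a Sentinel Scout SDK.
    """
    if output_format not in AVAILABLE_FORMATS:
        raise ValueError(f"Unsupported output format: {output_format!r}")
    payload = {
        "url": target_url,
        "countryCode": country_code,
        "outputFileExtension": output_format,
    }
    return urllib.request.Request(
        PROBE_SYNC_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


request = build_probe_request(
    "https://en.wikipedia.org/wiki/Machine_learning",
    "EXTENSION_JSON",
    api_key="<API_KEY>",
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the returned object to urllib.request.urlopen would submit the job. Rejecting unknown formats on the client side is a design choice of this sketch, not documented API behavior.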
1. HTML Format
Option:
"outputFileExtension": "EXTENSION_HTML"
Description: Returns the cleaned HTML of the target page, with unwanted tags (defined in tagsToStripOff) removed.
Use cases:
Preserving the layout and structure of the original page.
Further parsing with custom HTML processing tools.
Building archives of web content with visual fidelity intact.
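For the "further parsing" use case, one possible approach is to run the returned HTML through Python's built-in html.parser module. The sketch below is a minimal example, assuming the HTML has already been retrieved as a string; the HeadingExtractor class is hypothetical, and the sample input is modeled on the HTML response example on this page.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collect the <title> text and all <h1> texts from an HTML document."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.headings = []
        self._current = None  # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "h1"):
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Only keep text that appears inside a tag we are capturing.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            text = "".join(self._buffer).strip()
            if tag == "title":
                self.title = text
            else:
                self.headings.append(text)
            self._current = None


# Sample input modeled on the HTML response example on this page.
sample_html = """<!DOCTYPE html>
<html>
<head><title>Machine learning - Wikipedia</title></head>
<body>
<h1>Machine learning</h1>
<p>Machine learning (ML) is a field of artificial intelligence ...</p>
</body>
</html>"""

parser = HeadingExtractor()
parser.feed(sample_html)
print(parser.title)     # Machine learning - Wikipedia
print(parser.headings)  # ['Machine learning']
```

Real pages contain far more markup than this sample, so a production parser would likely need to handle nested tags and additional heading levels.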
Example Request (HTML):
curl -X POST "https://api.scout.sentinel.co/api/v1/probe/sync" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Machine_learning",
    "countryCode": "US",
    "outputFileExtension": "EXTENSION_HTML"
  }'
Example Response (HTML Snippet):
<!DOCTYPE html>
<html>
<head>
<title>Machine learning - Wikipedia</title>
</head>
<body>
<h1>Machine learning</h1>
<p>Machine learning (ML) is a field of artificial intelligence ...</p>
</body>
</html>
2. JSON Format
Option:
"outputFileExtension": "EXTENSION_JSON"
Description: Parses the scraped content into structured JSON when possible, identifying key-value pairs, tables, and text blocks.
Use cases:
Feeding data directly into ML pipelines.
Programmatic analysis (e.g., sentiment analysis, keyword extraction).
Building structured datasets for research.
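To illustrate the pipeline use case, the sketch below flattens a JSON result into one record per content block, which is a common shape for downstream analysis. It assumes the response carries the title and content fields shown in the example response on this page; because parsing happens only "when possible", the actual structure may differ, so the code reads fields defensively. The to_records helper is hypothetical.

```python
import json

# Sample input copied from the JSON response example on this page.
sample_response = """
{
  "title": "Machine learning - Wikipedia",
  "headings": ["Definition", "History", "Applications"],
  "content": [
    {
      "section": "Introduction",
      "text": "Machine learning (ML) is a field of artificial intelligence ..."
    }
  ]
}
"""


def to_records(document):
    """Flatten a parsed result into one dict per content block.

    Missing fields fall back to empty values instead of raising,
    since the response structure is assumed rather than guaranteed.
    """
    title = document.get("title", "")
    return [
        {
            "title": title,
            "section": block.get("section", ""),
            "text": block.get("text", ""),
        }
        for block in document.get("content", [])
    ]


records = to_records(json.loads(sample_response))
for record in records:
    print(record["section"], "->", record["text"])
```

Records in this shape can be written to CSV or JSON Lines, or passed to a tokenizer, without further restructuring.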
Example Request (JSON):
curl -X POST "https://api.scout.sentinel.co/api/v1/probe/sync" \
  -H "Authorization: Bearer <API_KEY>" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://en.wikipedia.org/wiki/Machine_learning",
    "countryCode": "US",
    "outputFileExtension": "EXTENSION_JSON"
  }'
Example Response (JSON Snippet):
{
  "title": "Machine learning - Wikipedia",
  "headings": [
    "Definition",
    "History",
    "Applications"
  ],
  "content": [
    {
      "section": "Introduction",
      "text": "Machine learning (ML) is a field of artificial intelligence ..."
    }
  ]
}
Coming Soon: Markdown Format
Option:
"outputFileExtension": "EXTENSION_MARKDOWN"
Description: Will output scraped data in Markdown, making it lightweight, easy to read, and simple to integrate into content systems.
Note: Always choose the format based on your downstream workflow. For archiving and visual fidelity, use HTML. For structured AI/ML pipelines, use JSON. For lightweight content workflows, Markdown support is on the roadmap.